ReAct (short for Reasoning and Acting) is a prompting paradigm for large language models that interleaves verbal reasoning traces with task-specific actions. Introduced by Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao in a 2022 paper published at the International Conference on Learning Representations (ICLR) 2023, ReAct enables LLMs to generate both free-form thinking steps and concrete actions in an alternating loop. The reasoning traces help the model plan, track progress, and handle exceptions, while the actions let the model interact with external sources such as search engines, knowledge bases, or software environments to gather information that is not stored in the model's parameters.
ReAct was one of the first frameworks to demonstrate that combining chain-of-thought style reasoning with tool use produces agents that are more accurate, more interpretable, and less prone to hallucination than either reasoning-only or action-only approaches. The paper has accumulated over 5,000 citations and is widely regarded as a foundational contribution to the field of AI agents. Its Thought-Action-Observation loop has become the default control flow in many agent frameworks, including LangChain, LangGraph, LlamaIndex, and others.
Before ReAct, two separate lines of research had shown that large language models possess useful but isolated capabilities. On one hand, chain-of-thought prompting (Wei et al., 2022) demonstrated that asking a model to "think step by step" before answering a question substantially improves performance on arithmetic, commonsense, and symbolic reasoning tasks. On the other hand, work on action generation and tool use (such as SayCan, Inner Monologue, and later Toolformer) showed that language models can learn to call external APIs, navigate web pages, or control robotic systems when prompted with appropriate action schemas.
However, these two capabilities had mostly been studied in isolation. Chain-of-thought prompting relies entirely on the model's internal knowledge, which means it can produce plausible-sounding but factually incorrect reasoning chains (a problem the authors call "fact hallucination" and "error propagation"). Meanwhile, action-only approaches let models interact with the world but lack the ability to reason about what actions to take next, why a previous action failed, or how to synthesize information gathered across multiple steps.
Yao et al. argued that humans do not separate reasoning from acting in practice. When solving a problem, a person might think about what information is needed, search for it, read the results, reason about what was found, decide on a next step, and continue until the answer becomes clear. ReAct formalizes this intuition by prompting language models to produce interleaved sequences of thoughts and actions, where the model can "reason to act" (use thinking to decide what to do next) and "act to reason" (use observations from the environment to update its thinking).
The core mechanism of ReAct is a structured loop with three components:
| Component | Role | Effect on environment |
|---|---|---|
| Thought | Free-form natural language reasoning. The model decomposes the problem, interprets observations, formulates hypotheses, or plans the next step. | None. Thoughts are internal to the model and do not change the external environment. |
| Action | A task-specific command that interacts with the external environment. Actions follow a predefined format (e.g., Search[query], Lookup[keyword], Finish[answer]). | Yes. The action is executed, and the environment returns a result. |
| Observation | The result returned by the environment after an action is executed. Observations are appended to the model's context so it can reason about them in the next thought step. | None. Observations are inputs to the model, not generated by it. |
The loop proceeds as follows:

1. The prompt (the question plus any prior steps) is sent to the model, which generates a thought reasoning about the current state of the task.
2. The model generates an action in the predefined format.
3. The orchestration system executes the action against the environment and appends the resulting observation to the context.
4. Steps 1-3 repeat until the model emits a terminating action (e.g., Finish[answer]).

This structure allows the model to dynamically adjust its plan based on what it discovers at each step, rather than committing to a fixed sequence of actions in advance.
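In code, this loop reduces to a short orchestration routine. The following is a minimal sketch, assuming a scripted stand-in for the language model (fake_model) and a one-entry knowledge base in place of a real search API; both are illustrative, not part of the original system.

```python
import re

def execute(action: str) -> str:
    """Toy environment: handles Search[...] against a tiny knowledge base."""
    kb = {"High Plains": "The High Plains rise from around 1,800 to 7,000 ft."}
    match = re.fullmatch(r"Search\[(.+)\]", action)
    if match:
        return kb.get(match.group(1), "No results found.")
    return "Unknown action."

def fake_model(context: str) -> str:
    """Scripted stand-in for the LLM: emits one Thought/Action pair per call."""
    if "Observation" not in context:
        return "Thought: I should search for the High Plains.\nAction: Search[High Plains]"
    return "Thought: The passage gives the range.\nAction: Finish[1,800 to 7,000 ft]"

def react_loop(question: str, max_steps: int = 5) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_model(context)                      # Thought + Action
        context += step + "\n"
        action = step.splitlines()[-1].removeprefix("Action: ")
        done = re.fullmatch(r"Finish\[(.+)\]", action)
        if done:                                        # terminating action ends the episode
            return done.group(1)
        context += f"Observation: {execute(action)}\n"  # feed the result back in
    return "No answer within step budget."
```

With a real model, fake_model would be an API call that stops generating at the "Observation:" token, leaving the orchestrator to fill in the observation.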
The following example, adapted from the original paper, shows ReAct solving a multi-hop question from the HotpotQA dataset using a simple Wikipedia API.
Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the
eastern sector extends into, then find the elevation range
of that area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building
(an orogeny) in Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector. So I need to look
up eastern sector.
Action 2: Lookup[eastern sector]
Observation 2: (Result 1 / 1) The eastern sector extends into the High
Plains and is called the Central Plains orogeny.
Thought 3: The eastern sector of Colorado orogeny extends into the
High Plains. So I need to search High Plains and find its
elevation range.
Action 3: Search[High Plains]
Observation 3: High Plains refers to one of two distinct land regions.
Thought 4: I need to instead search High Plains (United States).
Action 4: Search[High Plains (United States)]
Observation 4: The High Plains are a subregion of the Great Plains.
From east to west, the High Plains rise in elevation
from around 1,800 to 7,000 ft (550 to 2,130 m).
Thought 5: High Plains rise in elevation from around 1,800 to 7,000 ft,
so the answer is 1,800 to 7,000 ft.
Action 5: Finish[1,800 to 7,000 ft]
Several features of this trace are worth noting. The model decomposes the original question into sub-questions (Thought 1), recovers from an uninformative search result (Thought 2), disambiguates a query that returned the wrong entity (Thought 4), and extracts the relevant fact from a longer passage (Thought 5). These are exactly the kinds of flexible, adaptive behaviors that action-only approaches struggle with because they lack an explicit reasoning channel.
The specific actions available to a ReAct agent depend on the task and environment. The original paper defined actions for two categories of tasks.
For question answering (HotpotQA) and fact verification (FEVER), the paper used a simple Wikipedia API with three actions:
| Action | Description |
|---|---|
| Search[entity] | Returns the first five sentences of the Wikipedia page for the given entity. If no exact match is found, returns the top five similar entities from the Wikipedia search engine. |
| Lookup[string] | Returns the next sentence on the current page that contains the given string. Simulates the browser's Ctrl+F (find) functionality. |
| Finish[answer] | Terminates the episode and submits the given answer. |
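The semantics of these actions can be illustrated with a small in-memory environment. The WikiEnv class below is a hypothetical sketch: it serves stored sentence lists instead of calling the real Wikipedia API, but mirrors the Search, Lookup (Ctrl+F-style cursor), and no-match behaviors described above. (Finish is handled by the orchestrator, not the environment, so it does not appear here.)

```python
class WikiEnv:
    def __init__(self, pages: dict[str, list[str]]):
        self.pages = pages            # entity -> list of sentences
        self.current: list[str] = []  # the page Lookup operates on
        self.cursor = 0               # Lookup position, reset on each Search

    def search(self, entity: str) -> str:
        if entity in self.pages:
            self.current = self.pages[entity]
            self.cursor = 0
            return " ".join(self.current[:5])  # first five sentences
        # No exact match: report similar entities instead.
        return f"Could not find {entity}. Similar: {list(self.pages)[:5]}"

    def lookup(self, keyword: str) -> str:
        # Return the next sentence on the current page containing the keyword,
        # like pressing Ctrl+F repeatedly.
        for i in range(self.cursor, len(self.current)):
            if keyword in self.current[i]:
                self.cursor = i + 1
                return self.current[i]
        return f"No more results for {keyword}."
```

Calling lookup twice with the same keyword advances past the first hit, matching the "next sentence" semantics in the table.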
For text-based games (ALFWorld) and web navigation (WebShop), the actions were environment-specific commands such as "go to cabinet 1", "take knife 2", "click [Buy Now]", and similar instructions that directly manipulate the environment state.
In both cases, the key design choice is that the action space is simple and well-defined. The model does not need to generate arbitrary code or API calls; it selects from a small set of action templates and fills in the arguments based on its reasoning.
ReAct prompts are constructed using few-shot learning. The prompt contains a small number of human-written trajectories (typically 3 to 6 examples) that demonstrate the desired Thought-Action-Observation format. Each trajectory shows a complete problem-solving episode, including the question, the interleaved thoughts and actions, the observations returned by the environment, and the final answer.
For knowledge-intensive tasks, the authors manually wrote trajectories that include dense reasoning at every step: each thought explains the model's current understanding and what it plans to do next. For decision-making tasks, the authors used a sparser style, inserting reasoning only at points where it was most useful (for instance, when forming a subgoal or when recovering from a mistake). The model was free to decide the rhythm of reasoning versus acting.
The prompts were designed for PaLM-540B, the large language model used in the original experiments. No task-specific fine-tuning was required for the prompting experiments; the model learned the ReAct format purely from the in-context examples.
Two implementation approaches have emerged in practice:
| Approach | Description | When to use |
|---|---|---|
| Few-shot | The prompt includes several complete worked examples demonstrating the Thought-Action-Observation cycle. | When a base model (not instruction-tuned) is used, or when strict format adherence is required. |
| Zero-shot | The prompt provides detailed written instructions describing the format and available actions, without worked examples. | When an instruction-tuned model (e.g., GPT-4, Claude) is used, since these models can follow abstract instructions reliably. |
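A zero-shot prompt can be assembled mechanically from tool descriptions. The sketch below shows one plausible wording; the instruction text and the TOOLS entries are assumptions, since instruction-tuned models tolerate many phrasings of the same format description.

```python
# Hypothetical tool registry: name -> one-line usage description.
TOOLS = {
    "Search": "Search[entity] - look up a Wikipedia-style entry for the entity.",
    "Lookup": "Lookup[string] - find the next sentence containing the string.",
    "Finish": "Finish[answer] - submit the final answer and stop.",
}

def build_zero_shot_prompt(question: str) -> str:
    """Describe the ReAct format in instructions instead of worked examples."""
    tool_lines = "\n".join(f"- {desc}" for desc in TOOLS.values())
    return (
        "Answer the question by interleaving Thought, Action, and Observation steps.\n"
        "After each Action, stop and wait for an Observation before continuing.\n"
        f"Available actions:\n{tool_lines}\n\n"
        f"Question: {question}\nThought:"
    )
```

Ending the prompt with "Thought:" cues the model to begin with reasoning rather than an immediate action.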
The original paper evaluated ReAct on four benchmarks spanning two task categories. All prompting experiments used PaLM-540B.
HotpotQA is a multi-hop question answering dataset that requires synthesizing information from multiple Wikipedia articles. The paper evaluated on a random subset of 500 examples from the validation set using 6-shot prompting.
| Method | Exact Match (EM) |
|---|---|
| Standard prompting | 28.7 |
| Chain-of-thought (CoT) | 29.4 |
| CoT with self-consistency (CoT-SC) | 33.4 |
| Act only | 25.7 |
| ReAct | 27.4 |
| ReAct to CoT-SC (switch on failure) | 35.1 |
| CoT-SC to ReAct (switch on failure) | 34.2 |
On HotpotQA, ReAct underperformed pure CoT in raw accuracy. However, the authors showed that the two methods fail on different types of questions. CoT achieves higher scores but suffers from a much higher false positive rate due to hallucination. ReAct makes fewer hallucination errors because it grounds its reasoning in retrieved Wikipedia content, but it sometimes fails due to search errors or retrieval of irrelevant articles. The best overall performance came from combining the two methods: trying ReAct first and falling back to CoT-SC when ReAct fails (or vice versa), which achieved 35.1 EM.
FEVER is a fact verification dataset where the model must classify a claim as "SUPPORTS", "REFUTES", or "NOT ENOUGH INFO" based on Wikipedia evidence. The paper used 3-shot prompting on 500 random validation examples.
| Method | Accuracy |
|---|---|
| Standard prompting | 57.1 |
| CoT | 56.3 |
| Act only | 58.9 |
| ReAct | 60.9 |
| ReAct to CoT-SC | 62.0 |
| CoT-SC to ReAct | 64.6 |
On FEVER, ReAct outperformed both CoT and act-only baselines. The combined approach again achieved the best result at 64.6 accuracy.
ALFWorld is a text-based game environment requiring household tasks (e.g., "put a clean apple on the counter"). The paper used 1 or 2 in-context examples and ran 6 trials, reporting the best trial.
| Method | Success Rate |
|---|---|
| BUTLER (imitation learning) | 37% |
| Act only | 45% |
| ReAct | 71% |
ReAct outperformed the act-only baseline by 26 percentage points and the BUTLER imitation learning baseline by 34 points. The relative performance gain of ReAct over Act ranged from 33% to 90% across the six trials, averaging 62%. The large gap demonstrates the value of interleaved reasoning: the model uses thoughts to decompose goals into subgoals, track which subgoals have been completed, and determine where to look for objects.
WebShop is a web navigation benchmark where the agent must find and purchase a product matching a text description by navigating a simulated e-commerce website. The paper reported success rate (finding the exact product) and average reward score.
| Method | Success Rate | Score |
|---|---|---|
| Imitation learning (IL) | 29.1% | 59.9 |
| IL + reinforcement learning | 28.7% | 62.4 |
| Act only | 30.1% | 66.6 |
| ReAct | 40.0% | 66.6 |
| Human expert | 59.6% | 82.1 |
ReAct achieved a 40% success rate, a 10 percentage point improvement over the act-only baseline and the prior best methods (IL, IL+RL), despite using only one or two in-context examples compared to the approximately 100,000 training instances used by the learning-based methods.
The authors performed a detailed error analysis by randomly sampling 50 trajectories each (correct and incorrect) from both ReAct and CoT on HotpotQA (200 examples total), then manually categorizing the success and failure modes.
The key findings were:

- Hallucination was the dominant failure mode for CoT, producing plausible but false reasoning chains; it also inflated CoT's success count with false positives, i.e., correct answers reached via fabricated facts.
- ReAct's grounding in retrieved evidence made its reasoning far more factual and trustworthy, with a much lower hallucination rate across both successes and failures.
- ReAct traded some reasoning flexibility for this groundedness: the structural constraint of interleaving thoughts with actions sometimes led to repetitive loops in which the model re-generated earlier thoughts and actions without making progress.
- A substantial share of ReAct's remaining errors were search errors, where retrieval returned irrelevant or uninformative content and derailed the subsequent reasoning.
Beyond few-shot prompting, the paper also explored using ReAct-format trajectories to fine-tune smaller language models. The approach worked as follows:

1. Use the prompted PaLM-540B model to generate ReAct trajectories on HotpotQA questions, keeping the roughly 3,000 trajectories that ended in a correct answer.
2. Fine-tune smaller models (PaLM-8B and PaLM-62B) on these trajectories, training them to produce the full sequence of thoughts, actions, and observations given the input question.
3. Evaluate the fine-tuned models without any in-context examples.
The results showed that fine-tuning consistently and significantly improved EM scores over prompting alone. A fine-tuned PaLM-62B outperformed the prompted PaLM-540B on HotpotQA, demonstrating that the ReAct format transfers effectively to smaller models through distillation. Standard and CoT fine-tuning degraded after relatively few steps, while ReAct and Act methods generally benefited from more training steps and more training data.
This finding was significant because it suggested a practical pipeline: use a large prompted model to generate high-quality trajectories, then fine-tune a smaller, cheaper model on those trajectories for deployment.
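The filtering step at the heart of this pipeline, keeping only trajectories whose final answer matches the gold label, can be sketched as follows. The run_agent stub and the prompt/completion record format are illustrative assumptions, not the paper's implementation.

```python
import re

def run_agent(question: str) -> str:
    # Stand-in for a full ReAct episode generated by a prompted large model.
    return ("Thought 1: Search for the answer.\n"
            "Action 1: Search[High Plains]\n"
            "Observation 1: The High Plains rise from 1,800 to 7,000 ft.\n"
            "Thought 2: The range is stated directly.\n"
            "Action 2: Finish[1,800 to 7,000 ft]")

def bootstrap(dataset: list[tuple[str, str]]) -> list[dict]:
    """Keep only episodes whose Finish[...] answer matches the gold label."""
    examples = []
    for question, gold in dataset:
        trajectory = run_agent(question)
        m = re.search(r"Finish\[(.*)\]", trajectory)
        if m and m.group(1) == gold:          # discard incorrect episodes
            examples.append({"prompt": f"Question: {question}\n",
                             "completion": trajectory})
    return examples
```

The surviving records are then suitable as supervised fine-tuning pairs for a smaller model.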
ReAct sits at the intersection of several prompting paradigms. The following table summarizes how it relates to other approaches.
| Method | Reasoning | External actions | Grounding | Key limitation |
|---|---|---|---|---|
| Standard prompting | No | No | Internal knowledge only | Cannot reason through multi-step problems |
| Chain-of-thought (CoT) | Yes | No | Internal knowledge only | Prone to hallucination; cannot access new information |
| CoT with self-consistency (CoT-SC) | Yes | No | Internal knowledge only | Higher cost from multiple samples; still no external grounding |
| Act only | No | Yes | External via tool calls | Cannot reason about what action to take next or synthesize observations |
| ReAct | Yes | Yes | Both internal and external | Search/retrieval errors; potentially higher token usage per step |
| Reflexion | Yes | Yes | Both, plus self-evaluation memory | Requires additional self-critique step and episodic memory |
| ReWOO | Yes (plan only) | Yes | Both, but planning is done upfront | Cannot adapt mid-execution if conditions change unexpectedly |
The central insight from the paper is that reasoning and acting are complementary, not redundant. Reasoning without acting leads to hallucination because the model has no way to verify its claims. Acting without reasoning leads to inefficient or aimless behavior because the model cannot plan, recover from errors, or synthesize information across steps.
ReAct was one of the earliest and most influential demonstrations that prompting alone can turn a language model into a functional agent. Its Thought-Action-Observation loop became the conceptual blueprint for a generation of agent frameworks and research directions.
Several popular libraries adopted ReAct as a core pattern:
- LangChain provides a create_react_agent function, which takes a language model, a list of tools, and an optional prompt template, and returns an agent that follows the ReAct loop. LangChain's early agent architecture was built almost entirely around the ReAct pattern.
- LangGraph ships a prebuilt create_react_agent utility.
- Hugging Face's agents API in the transformers library used the Thought-Action-Observation format as its default agent architecture.

ReAct directly inspired or informed several follow-up works:

- Reflexion, which adds verbal self-critique and episodic memory to the loop so that an agent can improve across episodes.
- ReWOO, which decouples planning from execution by generating the full plan upfront to reduce token usage.
- LATS (Language Agent Tree Search), which combines the ReAct loop with Monte Carlo Tree Search to explore multiple candidate trajectories.
The ReAct paradigm influenced the design of tool use and agentic capabilities in commercial language model APIs. OpenAI's function calling, Anthropic's tool use, and Google's Gemini function calling all incorporate elements of the Thought-Action-Observation pattern, though they typically implement it through structured API protocols rather than raw text prompting. Many production agent systems use a ReAct-inspired loop internally, even when the implementation details differ from the original paper's prompting approach.
Despite its wide influence, ReAct has several known limitations:

- Token cost: the growing context is re-sent on every step, and the interleaved thoughts add tokens on top of actions, so long episodes become expensive.
- Sensitivity to retrieval quality: an uninformative or wrong search result can derail all subsequent reasoning, the "search error" failure mode identified in the paper's error analysis.
- Repetitive loops: the model can get stuck re-generating earlier thoughts and actions without making progress.
- Format brittleness: the loop depends on the model emitting actions in the expected template, and a single malformed action can break the orchestration.
- Greedy, single-path execution: the agent commits to one trajectory and cannot natively backtrack or explore alternatives, a limitation later addressed by tree-search extensions such as LATS.
Practitioners building ReAct agents typically follow one of two patterns: constructing the prompt and orchestration loop by hand, or delegating both to an agent framework.
A minimal ReAct prompt includes:

- An instruction describing the task and the alternating Thought-Action-Observation format.
- A list of the available actions, each with its argument template and a short description of when to use it.
- Worked example trajectories (few-shot) or detailed format instructions (zero-shot).
- The question or task to solve, followed by a cue such as "Thought:" to start the loop.
The orchestration system then parses the model's output to detect action commands (using string matching, regular expressions, or structured output parsing), executes the action, appends the observation to the context, and calls the model again to generate the next thought.
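The parsing step can be as simple as one regular expression over the model's turn. The pattern below is a sketch matching the paper's Name[argument] templates; it takes the last action in the turn and, as written, does not handle arguments that themselves contain a closing bracket.

```python
from __future__ import annotations
import re

# Matches lines like "Action: Search[x]" or "Action 2: Finish[y]".
# The non-greedy argument group keeps each match on its own action.
ACTION_RE = re.compile(r"Action(?: \d+)?:\s*(\w+)\[(.+?)\]")

def parse_action(model_output: str) -> tuple[str, str] | None:
    """Extract (tool_name, argument) from the last Action in the turn, if any."""
    matches = ACTION_RE.findall(model_output)
    return matches[-1] if matches else None
```

Returning None lets the orchestrator detect a turn with no parseable action and, for example, re-prompt the model.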
Using a framework like LangChain, a basic ReAct agent can be created with a few lines of code by specifying a model, a list of tool objects (each with a name, description, and callable function), and optionally a custom prompt. The framework handles the orchestration loop, action parsing, tool execution, and observation injection automatically.
Production ReAct agents often include additional safeguards:

- A maximum iteration count, so that an agent stuck in a loop is terminated rather than running indefinitely.
- Validation of the model's output format, with a retry or corrective message when an action cannot be parsed.
- Timeouts and error handling around tool execution, with failures returned to the model as observations so it can recover.
- Truncation or summarization of long observations to keep the context within the model's window.
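Two common safeguards, an iteration cap and simple repeated-action detection, reduce to a few lines of orchestration code. The guarded_step helper below is a hypothetical sketch; production agents often use fuzzier repetition checks than exact string equality.

```python
from __future__ import annotations

def guarded_step(history: list[str], action: str, max_steps: int = 10) -> str | None:
    """Return an error observation if a safeguard trips, else None to proceed."""
    if len(history) >= max_steps:
        # Cap the episode length rather than looping indefinitely.
        return "Error: step budget exhausted; produce a final answer."
    if history and history[-1] == action:
        # Immediate repetition of the previous action suggests a loop.
        return "Error: repeated action detected; try a different approach."
    return None
```

The error strings are returned to the model as observations, giving it a chance to recover within the remaining budget.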
The ReAct loop bears a structural resemblance to the state-action-reward loop in reinforcement learning. In RL, an agent observes a state, takes an action, receives a reward, and updates its policy. In ReAct, the model reads the current context (state), generates a thought and action, receives an observation (analogous to a new state), and continues. However, ReAct differs from RL in several important ways: there is no explicit reward signal during the episode, the model is not trained through trial and error, and the "policy" is defined entirely by the prompt and the model's pretrained weights.
The connection to RL became more explicit in LATS, which introduced Monte Carlo Tree Search (a technique from game-playing RL) into the ReAct framework, and in Reflexion, which introduced a form of episodic learning through self-critique.