ReAct (short for Reasoning and Acting) is a prompting paradigm for large language models that interleaves verbal reasoning traces with task-specific actions. Introduced by Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao in a 2022 paper published at the International Conference on Learning Representations (ICLR) 2023, ReAct enables LLMs to generate both free-form thinking steps and concrete actions in an alternating loop. The reasoning traces help the model plan, track progress, and handle exceptions, while the actions let the model interact with external sources such as search engines, knowledge bases, or software environments to gather information that is not stored in the model's parameters.
ReAct was one of the first frameworks to demonstrate that combining chain-of-thought style reasoning with tool use produces agents that are more accurate, more interpretable, and less prone to hallucination than either reasoning-only or action-only approaches. The paper has accumulated over 5,000 citations and is widely regarded as a foundational contribution to the field of AI agents. Its Thought-Action-Observation loop has become the default control flow in many agent frameworks, including LangChain, LangGraph, LlamaIndex, and others.
Before ReAct, two separate lines of research had shown that large language models possess useful but isolated capabilities. On one hand, chain-of-thought prompting (Wei et al., 2022) demonstrated that asking a model to "think step by step" before answering a question substantially improves performance on arithmetic, commonsense, and symbolic reasoning tasks. On the other hand, work on action generation and tool use (such as SayCan, Inner Monologue, and later Toolformer) showed that language models can learn to call external APIs, navigate web pages, or control robotic systems when prompted with appropriate action schemas.
However, these two capabilities had mostly been studied in isolation. Chain-of-thought prompting relies entirely on the model's internal knowledge, which means it can produce plausible-sounding but factually incorrect reasoning chains (a problem the authors call "fact hallucination" and "error propagation"). Meanwhile, action-only approaches let models interact with the world but lack the ability to reason about what actions to take next, why a previous action failed, or how to synthesize information gathered across multiple steps.
Yao et al. argued that humans do not separate reasoning from acting in practice. When solving a problem, a person might think about what information is needed, search for it, read the results, reason about what was found, decide on a next step, and continue until the answer becomes clear. ReAct formalizes this intuition by prompting language models to produce interleaved sequences of thoughts and actions, where the model can "reason to act" (use thinking to decide what to do next) and "act to reason" (use observations from the environment to update its thinking).
The core mechanism of ReAct is a structured loop with three components:
| Component | Role | Effect on environment |
|---|---|---|
| Thought | Free-form natural language reasoning. The model decomposes the problem, interprets observations, formulates hypotheses, or plans the next step. | None. Thoughts are internal to the model and do not change the external environment. |
| Action | A task-specific command that interacts with the external environment. Actions follow a predefined format (e.g., Search[query], Lookup[keyword], Finish[answer]). | Yes. The action is executed, and the environment returns a result. |
| Observation | The result returned by the environment after an action is executed. Observations are appended to the model's context so it can reason about them in the next thought step. | None. Observations are inputs to the model, not generated by it. |
The loop proceeds as follows:

1. The prompt (the question plus any prior steps) is sent to the model, which generates a thought reasoning about the current state of the task.
2. The model generates an action in the predefined format.
3. The orchestration system executes the action against the environment and appends the resulting observation to the context.
4. Steps 1-3 repeat until the model emits a terminating action (e.g., Finish[answer]).

This structure allows the model to dynamically adjust its plan based on what it discovers at each step, rather than committing to a fixed sequence of actions in advance.
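In code, this loop reduces to a short orchestration routine. The following is a minimal sketch, assuming a scripted stand-in for the language model (fake_model) and a one-entry knowledge base in place of a real search API; both are illustrative, not part of the original system.

```python
import re

def execute(action: str) -> str:
    """Toy environment: handles Search[...] against a tiny knowledge base."""
    kb = {"High Plains": "The High Plains rise from around 1,800 to 7,000 ft."}
    match = re.fullmatch(r"Search\[(.+)\]", action)
    if match:
        return kb.get(match.group(1), "No results found.")
    return "Unknown action."

def fake_model(context: str) -> str:
    """Scripted stand-in for the LLM: emits one Thought/Action pair per call."""
    if "Observation" not in context:
        return "Thought: I should search for the High Plains.\nAction: Search[High Plains]"
    return "Thought: The passage gives the range.\nAction: Finish[1,800 to 7,000 ft]"

def react_loop(question: str, max_steps: int = 5) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_model(context)                      # Thought + Action
        context += step + "\n"
        action = step.splitlines()[-1].removeprefix("Action: ")
        done = re.fullmatch(r"Finish\[(.+)\]", action)
        if done:                                        # terminating action ends the episode
            return done.group(1)
        context += f"Observation: {execute(action)}\n"  # feed the result back in
    return "No answer within step budget."
```

With a real model, fake_model would be an API call that stops generating at the "Observation:" token, leaving the orchestrator to fill in the observation.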
The following example, adapted from the original paper, shows ReAct solving a multi-hop question from the HotpotQA dataset using a simple Wikipedia API.
Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the
eastern sector extends into, then find the elevation range
of that area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building
(an orogeny) in Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector. So I need to look
up eastern sector.
Action 2: Lookup[eastern sector]
Observation 2: (Result 1 / 1) The eastern sector extends into the High
Plains and is called the Central Plains orogeny.
Thought 3: The eastern sector of Colorado orogeny extends into the
High Plains. So I need to search High Plains and find its
elevation range.
Action 3: Search[High Plains]
Observation 3: High Plains refers to one of two distinct land regions.
Thought 4: I need to instead search High Plains (United States).
Action 4: Search[High Plains (United States)]
Observation 4: The High Plains are a subregion of the Great Plains.
From east to west, the High Plains rise in elevation
from around 1,800 to 7,000 ft (550 to 2,130 m).
Thought 5: High Plains rise in elevation from around 1,800 to 7,000 ft,
so the answer is 1,800 to 7,000 ft.
Action 5: Finish[1,800 to 7,000 ft]
Several features of this trace are worth noting. The model decomposes the original question into sub-questions (Thought 1), recovers from an uninformative search result (Thought 2), disambiguates a query that returned the wrong entity (Thought 4), and extracts the relevant fact from a longer passage (Thought 5). These are exactly the kinds of flexible, adaptive behaviors that action-only approaches struggle with because they lack an explicit reasoning channel.
The specific actions available to a ReAct agent depend on the task and environment. The original paper defined actions for two categories of tasks.
For question answering (HotpotQA) and fact verification (FEVER), the paper used a simple Wikipedia API with three actions:
| Action | Description |
|---|---|
| Search[entity] | Returns the first five sentences of the Wikipedia page for the given entity. If no exact match is found, returns the top five similar entities from the Wikipedia search engine. |
| Lookup[string] | Returns the next sentence on the current page that contains the given string. Simulates the browser's Ctrl+F (find) functionality. |
| Finish[answer] | Terminates the episode and submits the given answer. |
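The semantics of these actions can be illustrated with a small in-memory environment. The WikiEnv class below is a hypothetical sketch: it serves stored sentence lists instead of calling the real Wikipedia API, but mirrors the Search, Lookup (Ctrl+F-style cursor), and no-match behaviors described above. (Finish is handled by the orchestrator, not the environment, so it does not appear here.)

```python
class WikiEnv:
    def __init__(self, pages: dict[str, list[str]]):
        self.pages = pages            # entity -> list of sentences
        self.current: list[str] = []  # the page Lookup operates on
        self.cursor = 0               # Lookup position, reset on each Search

    def search(self, entity: str) -> str:
        if entity in self.pages:
            self.current = self.pages[entity]
            self.cursor = 0
            return " ".join(self.current[:5])  # first five sentences
        # No exact match: report similar entities instead.
        return f"Could not find {entity}. Similar: {list(self.pages)[:5]}"

    def lookup(self, keyword: str) -> str:
        # Return the next sentence on the current page containing the keyword,
        # like pressing Ctrl+F repeatedly.
        for i in range(self.cursor, len(self.current)):
            if keyword in self.current[i]:
                self.cursor = i + 1
                return self.current[i]
        return f"No more results for {keyword}."
```

Calling lookup twice with the same keyword advances past the first hit, matching the "next sentence" semantics in the table.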
For text-based games (ALFWorld) and web navigation (WebShop), the actions were environment-specific commands such as "go to cabinet 1", "take knife 2", "click [Buy Now]", and similar instructions that directly manipulate the environment state.
In both cases, the key design choice is that the action space is simple and well-defined. The model does not need to generate arbitrary code or API calls; it selects from a small set of action templates and fills in the arguments based on its reasoning.
ReAct prompts are constructed using few-shot learning. The prompt contains a small number of human-written trajectories (typically 3 to 6 examples) that demonstrate the desired Thought-Action-Observation format. Each trajectory shows a complete problem-solving episode, including the question, the interleaved thoughts and actions, the observations returned by the environment, and the final answer.
For knowledge-intensive tasks, the authors manually wrote trajectories that include dense reasoning at every step: each thought explains the model's current understanding and what it plans to do next. For decision-making tasks, the authors used a sparser style, inserting reasoning only at points where it was most useful (for instance, when forming a subgoal or when recovering from a mistake). The model was free to decide the rhythm of reasoning versus acting.
The prompts were designed for PaLM-540B, the large language model used in the original experiments. No task-specific fine-tuning was required for the prompting experiments; the model learned the ReAct format purely from the in-context examples.
Two implementation approaches have emerged in practice:
| Approach | Description | When to use |
|---|---|---|
| Few-shot | The prompt includes several complete worked examples demonstrating the Thought-Action-Observation cycle. | When a base model (not instruction-tuned) is used, or when strict format adherence is required. |
| Zero-shot | The prompt provides detailed written instructions describing the format and available actions, without worked examples. | When an instruction-tuned model (e.g., GPT-4, Claude) is used, since these models can follow abstract instructions reliably. |
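A zero-shot prompt can be assembled mechanically from tool descriptions. The sketch below shows one plausible wording; the instruction text and the TOOLS entries are assumptions, since instruction-tuned models tolerate many phrasings of the same format description.

```python
# Hypothetical tool registry: name -> one-line usage description.
TOOLS = {
    "Search": "Search[entity] - look up a Wikipedia-style entry for the entity.",
    "Lookup": "Lookup[string] - find the next sentence containing the string.",
    "Finish": "Finish[answer] - submit the final answer and stop.",
}

def build_zero_shot_prompt(question: str) -> str:
    """Describe the ReAct format in instructions instead of worked examples."""
    tool_lines = "\n".join(f"- {desc}" for desc in TOOLS.values())
    return (
        "Answer the question by interleaving Thought, Action, and Observation steps.\n"
        "After each Action, stop and wait for an Observation before continuing.\n"
        f"Available actions:\n{tool_lines}\n\n"
        f"Question: {question}\nThought:"
    )
```

Ending the prompt with "Thought:" cues the model to begin with reasoning rather than an immediate action.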
The original paper evaluated ReAct on four benchmarks spanning two task categories. All prompting experiments used PaLM-540B.
HotpotQA is a multi-hop question answering dataset that requires synthesizing information from multiple Wikipedia articles. The paper evaluated on a random subset of 500 examples from the validation set using 6-shot prompting.
| Method | Exact Match (EM) |
|---|---|
| Standard prompting | 28.7 |
| Chain-of-thought (CoT) | 29.4 |
| CoT with self-consistency (CoT-SC) | 33.4 |
| Act only | 25.7 |
| ReAct | 27.4 |
| ReAct to CoT-SC (switch on failure) | 35.1 |
| CoT-SC to ReAct (switch on failure) | 34.2 |
On HotpotQA, ReAct underperformed pure CoT in raw accuracy. However, the authors showed that the two methods fail on different types of questions. CoT achieves higher scores but suffers from a much higher false positive rate due to hallucination. ReAct makes fewer hallucination errors because it grounds its reasoning in retrieved Wikipedia content, but it sometimes fails due to search errors or retrieval of irrelevant articles. The best overall performance came from combining the two methods: trying ReAct first and falling back to CoT-SC when ReAct fails (or vice versa), which achieved 35.1 EM.
FEVER is a fact verification dataset where the model must classify a claim as "SUPPORTS", "REFUTES", or "NOT ENOUGH INFO" based on Wikipedia evidence. The paper used 3-shot prompting on 500 random validation examples.
| Method | Accuracy |
|---|---|
| Standard prompting | 57.1 |
| CoT | 56.3 |
| Act only | 58.9 |
| ReAct | 60.9 |
| ReAct to CoT-SC | 62.0 |
| CoT-SC to ReAct | 64.6 |
On FEVER, ReAct outperformed both CoT and act-only baselines. The combined approach again achieved the best result at 64.6 accuracy.
ALFWorld is a text-based game environment requiring household tasks (e.g., "put a clean apple on the counter"). The paper used 1 or 2 in-context examples and ran 6 trials, reporting the best trial.
| Method | Success Rate |
|---|---|
| BUTLER (imitation learning) | 37% |
| Act only | 45% |
| ReAct | 71% |
ReAct outperformed the act-only baseline by 26 percentage points and the BUTLER imitation learning baseline by 34 points. The relative performance gain of ReAct over Act ranged from 33% to 90% across the six trials, averaging 62%. The large gap demonstrates the value of interleaved reasoning: the model uses thoughts to decompose goals into subgoals, track which subgoals have been completed, and determine where to look for objects.
WebShop is a web navigation benchmark where the agent must find and purchase a product matching a text description by navigating a simulated e-commerce website. The paper reported success rate (finding the exact product) and average reward score.
| Method | Success Rate | Score |
|---|---|---|
| Imitation learning (IL) | 29.1% | 59.9 |
| IL + reinforcement learning | 28.7% | 62.4 |
| Act only | 30.1% | 66.6 |
| ReAct | 40.0% | 66.6 |
| Human expert | 59.6% | 82.1 |
ReAct achieved a 40% success rate, a 10 percentage point improvement over the act-only baseline and the prior best methods (IL, IL+RL), despite using only one or two in-context examples compared to the approximately 100,000 training instances used by the learning-based methods.
The authors performed a detailed error analysis by randomly sampling 50 trajectories each (correct and incorrect) from both ReAct and CoT on HotpotQA (200 examples total), then manually categorizing the success and failure modes.
The key findings were:

- Hallucination was the dominant failure mode for CoT, producing plausible but false reasoning chains; it also inflated CoT's success count with false positives, i.e., correct answers reached via fabricated facts.
- ReAct's grounding in retrieved evidence made its reasoning far more factual and trustworthy, with a much lower hallucination rate across both successes and failures.
- ReAct traded some reasoning flexibility for this groundedness: the structural constraint of interleaving thoughts with actions sometimes led to repetitive loops in which the model re-generated earlier thoughts and actions without making progress.
- A substantial share of ReAct's remaining errors were search errors, where retrieval returned irrelevant or uninformative content and derailed the subsequent reasoning.
Beyond few-shot prompting, the paper also explored using ReAct-format trajectories to fine-tune smaller language models. The approach worked as follows:

1. Use the prompted PaLM-540B model to generate ReAct trajectories on HotpotQA questions, keeping the roughly 3,000 trajectories that ended in a correct answer.
2. Fine-tune smaller models (PaLM-8B and PaLM-62B) on these trajectories, training them to produce the full sequence of thoughts, actions, and observations given the input question.
3. Evaluate the fine-tuned models without any in-context examples.
The results showed that fine-tuning consistently and significantly improved EM scores over prompting alone. A fine-tuned PaLM-62B outperformed the prompted PaLM-540B on HotpotQA, demonstrating that the ReAct format transfers effectively to smaller models through distillation. Standard and CoT fine-tuning degraded after relatively few steps, while ReAct and Act methods generally benefited from more training steps and more training data.
This finding was significant because it suggested a practical pipeline: use a large prompted model to generate high-quality trajectories, then fine-tune a smaller, cheaper model on those trajectories for deployment.
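The filtering step at the heart of this pipeline, keeping only trajectories whose final answer matches the gold label, can be sketched as follows. The run_agent stub and the prompt/completion record format are illustrative assumptions, not the paper's implementation.

```python
import re

def run_agent(question: str) -> str:
    # Stand-in for a full ReAct episode generated by a prompted large model.
    return ("Thought 1: Search for the answer.\n"
            "Action 1: Search[High Plains]\n"
            "Observation 1: The High Plains rise from 1,800 to 7,000 ft.\n"
            "Thought 2: The range is stated directly.\n"
            "Action 2: Finish[1,800 to 7,000 ft]")

def bootstrap(dataset: list[tuple[str, str]]) -> list[dict]:
    """Keep only episodes whose Finish[...] answer matches the gold label."""
    examples = []
    for question, gold in dataset:
        trajectory = run_agent(question)
        m = re.search(r"Finish\[(.*)\]", trajectory)
        if m and m.group(1) == gold:          # discard incorrect episodes
            examples.append({"prompt": f"Question: {question}\n",
                             "completion": trajectory})
    return examples
```

The surviving records are then suitable as supervised fine-tuning pairs for a smaller model.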
ReAct sits at the intersection of several prompting paradigms. The following table summarizes how it relates to other approaches.
| Method | Reasoning | External actions | Grounding | Key limitation |
|---|---|---|---|---|
| Standard prompting | No | No | Internal knowledge only | Cannot reason through multi-step problems |
| Chain-of-thought (CoT) | Yes | No | Internal knowledge only | Prone to hallucination; cannot access new information |
| CoT with self-consistency (CoT-SC) | Yes | No | Internal knowledge only | Higher cost from multiple samples; still no external grounding |
| Act only | No | Yes | External via tool calls | Cannot reason about what action to take next or synthesize observations |
| ReAct | Yes | Yes | Both internal and external | Search/retrieval errors; potentially higher token usage per step |
| Reflexion | Yes | Yes | Both, plus self-evaluation memory | Requires additional self-critique step and episodic memory |
| ReWOO | Yes (plan only) | Yes | Both, but planning is done upfront | Cannot adapt mid-execution if conditions change unexpectedly |
The central insight from the paper is that reasoning and acting are complementary, not redundant. Reasoning without acting leads to hallucination because the model has no way to verify its claims. Acting without reasoning leads to inefficient or aimless behavior because the model cannot plan, recover from errors, or synthesize information across steps.
ReAct was one of the earliest and most influential demonstrations that prompting alone can turn a language model into a functional agent. Its Thought-Action-Observation loop became the conceptual blueprint for a generation of agent frameworks and research directions.
Several popular libraries adopted ReAct as a core pattern:
- LangChain provides a create_react_agent function, which takes a language model, a list of tools, and an optional prompt template, and returns an agent that follows the ReAct loop. LangChain's early agent architecture was built almost entirely around the ReAct pattern.
- LangGraph ships a prebuilt create_react_agent utility.
- Hugging Face's agents API in the transformers library used the Thought-Action-Observation format as its default agent architecture.

ReAct directly inspired or informed several follow-up works:

- Reflexion, which adds verbal self-critique and episodic memory to the loop so that an agent can improve across episodes.
- ReWOO, which decouples planning from execution by generating the full plan upfront to reduce token usage.
- LATS (Language Agent Tree Search), which combines the ReAct loop with Monte Carlo Tree Search to explore multiple candidate trajectories.
The ReAct paradigm influenced the design of tool use and agentic capabilities in commercial language model APIs. OpenAI's function calling, Anthropic's tool use, and Google's Gemini function calling all incorporate elements of the Thought-Action-Observation pattern, though they typically implement it through structured API protocols rather than raw text prompting. Many production agent systems use a ReAct-inspired loop internally, even when the implementation details differ from the original paper's prompting approach.
Despite its wide influence, ReAct has several known limitations:

- Token cost: the growing context is re-sent on every step, and the interleaved thoughts add tokens on top of actions, so long episodes become expensive.
- Sensitivity to retrieval quality: an uninformative or wrong search result can derail all subsequent reasoning, the "search error" failure mode identified in the paper's error analysis.
- Repetitive loops: the model can get stuck re-generating earlier thoughts and actions without making progress.
- Format brittleness: the loop depends on the model emitting actions in the expected template, and a single malformed action can break the orchestration.
- Greedy, single-path execution: the agent commits to one trajectory and cannot natively backtrack or explore alternatives, a limitation later addressed by tree-search extensions such as LATS.
Practitioners building ReAct agents typically follow one of two patterns: constructing the prompt and orchestration loop by hand, or delegating both to an agent framework.
A minimal ReAct prompt includes:

- An instruction describing the task and the alternating Thought-Action-Observation format.
- A list of the available actions, each with its argument template and a short description of when to use it.
- Worked example trajectories (few-shot) or detailed format instructions (zero-shot).
- The question or task to solve, followed by a cue such as "Thought:" to start the loop.
The orchestration system then parses the model's output to detect action commands (using string matching, regular expressions, or structured output parsing), executes the action, appends the observation to the context, and calls the model again to generate the next thought.
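The parsing step can be as simple as one regular expression over the model's turn. The pattern below is a sketch matching the paper's Name[argument] templates; it takes the last action in the turn and, as written, does not handle arguments that themselves contain a closing bracket.

```python
from __future__ import annotations
import re

# Matches lines like "Action: Search[x]" or "Action 2: Finish[y]".
# The non-greedy argument group keeps each match on its own action.
ACTION_RE = re.compile(r"Action(?: \d+)?:\s*(\w+)\[(.+?)\]")

def parse_action(model_output: str) -> tuple[str, str] | None:
    """Extract (tool_name, argument) from the last Action in the turn, if any."""
    matches = ACTION_RE.findall(model_output)
    return matches[-1] if matches else None
```

Returning None lets the orchestrator detect a turn with no parseable action and, for example, re-prompt the model.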
Using a framework like LangChain, a basic ReAct agent can be created with a few lines of code by specifying a model, a list of tool objects (each with a name, description, and callable function), and optionally a custom prompt. The framework handles the orchestration loop, action parsing, tool execution, and observation injection automatically.
Production ReAct agents often include additional safeguards:

- A maximum iteration count, so that an agent stuck in a loop is terminated rather than running indefinitely.
- Validation of the model's output format, with a retry or corrective message when an action cannot be parsed.
- Timeouts and error handling around tool execution, with failures returned to the model as observations so it can recover.
- Truncation or summarization of long observations to keep the context within the model's window.
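Two common safeguards, an iteration cap and simple repeated-action detection, reduce to a few lines of orchestration code. The guarded_step helper below is a hypothetical sketch; production agents often use fuzzier repetition checks than exact string equality.

```python
from __future__ import annotations

def guarded_step(history: list[str], action: str, max_steps: int = 10) -> str | None:
    """Return an error observation if a safeguard trips, else None to proceed."""
    if len(history) >= max_steps:
        # Cap the episode length rather than looping indefinitely.
        return "Error: step budget exhausted; produce a final answer."
    if history and history[-1] == action:
        # Immediate repetition of the previous action suggests a loop.
        return "Error: repeated action detected; try a different approach."
    return None
```

The error strings are returned to the model as observations, giving it a chance to recover within the remaining budget.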
The ReAct loop bears a structural resemblance to the state-action-reward loop in reinforcement learning. In RL, an agent observes a state, takes an action, receives a reward, and updates its policy. In ReAct, the model reads the current context (state), generates a thought and action, receives an observation (analogous to a new state), and continues. However, ReAct differs from RL in several important ways: there is no explicit reward signal during the episode, the model is not trained through trial and error, and the "policy" is defined entirely by the prompt and the model's pretrained weights.
The connection to RL became more explicit in LATS, which introduced Monte Carlo Tree Search (a technique from game-playing RL) into the ReAct framework, and in Reflexion, which introduced a form of episodic learning through self-critique.