Agent planning refers to the process by which an AI agent determines a sequence of actions to accomplish a goal. In the context of large language models (LLMs) and modern agentic systems, planning encompasses task decomposition, reasoning about action sequences, evaluating possible strategies, and adapting plans in response to feedback from the environment. Planning is one of the core capabilities that distinguishes an autonomous agent from a simple question-answering system, enabling it to tackle complex, multi-step tasks that require coordination, tool use, and long-horizon decision-making.
Planning has been a central topic in artificial intelligence since the field's earliest days. Classical AI planning systems such as STRIPS (Stanford Research Institute Problem Solver, 1971) and later formalisms like the Planning Domain Definition Language (PDDL) provided rigorous frameworks for defining states, actions, preconditions, and effects. These systems could guarantee plan correctness and even optimality through systematic search algorithms, but they required problems to be fully specified in formal representations and could not handle natural language input or ambiguous real-world tasks.
The emergence of LLMs has opened a new chapter in planning research. Models like GPT-4, Claude, and open-source alternatives can interpret natural language goals, decompose complex tasks, reason about action sequences, and generate plans without requiring formal domain specifications. However, LLM-based planning introduces its own challenges: generated plans may contain hallucinated steps, violate constraints, or fail to account for long-term consequences. This tension between the flexibility of LLM-based planning and the rigor of classical approaches has become a defining theme in contemporary agent research.
Classical planning operates within a well-defined formal framework. A planning problem is typically specified by an initial state, a goal state, and a set of actions with preconditions and effects. The planner searches through the space of possible action sequences to find one that transforms the initial state into the goal state.
STRIPS, introduced by Richard Fikes and Nils Nilsson in 1971, was one of the first automated planning systems. It represented the world as a set of first-order logic predicates and defined operators with preconditions, add lists, and delete lists. The Planning Domain Definition Language (PDDL), introduced by Drew McDermott in 1998 for the International Planning Competition, generalized the STRIPS formalism and became the standard language for expressing planning problems. PDDL supports typed objects, conditional effects, numeric fluents, temporal constraints, and other expressive features that allow complex domains to be modeled.
Classical planners, given a PDDL problem specification, use search algorithms (forward state-space search, backward search, partial-order planning, or heuristic search methods like FF and Fast Downward) to find valid or optimal action sequences. Their strength lies in completeness and correctness guarantees: if a valid plan exists, a sound planner will find it.
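The core of a STRIPS-style planner can be sketched in a few lines of Python. This is a toy illustration, not the original system: the door-and-key domain and action names are invented, and plain breadth-first search stands in for the heuristic search a production planner would use.

```python
from collections import deque

# Minimal STRIPS-style actions: (preconditions, add list, delete list).
# States are frozensets of ground facts (strings). Domain is invented.
ACTIONS = {
    "pick_up_key": ({"at_door", "key_on_floor"}, {"holding_key"}, {"key_on_floor"}),
    "unlock_door": ({"at_door", "holding_key"}, {"door_unlocked"}, set()),
    "open_door":   ({"door_unlocked"}, {"door_open"}, set()),
}

def forward_search(init, goal):
    """Breadth-first forward state-space search; returns a shortest plan."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:          # all goal facts hold
            return plan
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:       # preconditions satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None                    # no plan exists

plan = forward_search({"at_door", "key_on_floor"}, {"door_open"})
```

Because the search is systematic, it inherits the completeness property described above: if any action sequence reaches the goal, breadth-first search will find a shortest one.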
Hierarchical task network (HTN) planning extends classical planning by introducing task hierarchies. Instead of searching over primitive actions alone, HTN planners decompose abstract (compound) tasks into subtasks using predefined decomposition methods. This recursive decomposition continues until all tasks have been reduced to primitive actions that can be executed directly.
HTN planning has several advantages for complex domains. The hierarchical structure encodes domain knowledge about how tasks should be accomplished, reducing the search space compared to flat planning. Task hierarchies are reusable and composable, making them effective in robotics, logistics, game AI, and simulation environments. However, HTN planning requires manually designed task hierarchies and decomposition methods, which limits its applicability to new domains.
Classical planning approaches have well-known limitations in the context of modern AI applications:
| Limitation | Description |
|---|---|
| Formal specification requirement | Problems must be encoded in PDDL or similar formalisms; natural language goals cannot be processed directly |
| Closed-world assumption | Planners assume complete knowledge of the initial state and all possible actions |
| Scalability | State-space search can become computationally intractable for large or complex domains |
| Brittleness | Plans break when the environment deviates from the model; replanning can be expensive |
| No commonsense reasoning | Classical planners have no built-in knowledge about how the world generally works |
These limitations motivated research into combining classical planners with language models, which can handle natural language, reason with commonsense knowledge, and operate in partially observable or loosely specified environments.
LLM-based planning uses large language models as the core reasoning engine for generating, evaluating, and refining plans. Rather than searching through formally defined state spaces, these approaches leverage the vast knowledge and reasoning capabilities encoded in pretrained language models.
A 2024 survey by Huang et al. ("Understanding the Planning of LLM Agents") identified five major categories of work in LLM-based agent planning: task decomposition, plan selection, external planner integration, reflection, and memory. These categories capture the primary strategies researchers have developed to improve the planning abilities of language model agents.
Task decomposition is the process of breaking a complex goal into smaller, more manageable sub-tasks. This is arguably the most widely studied aspect of LLM-based planning, as it directly addresses the difficulty LLMs face with long-horizon, multi-step problems.
In decomposition-first methods, the agent generates a complete plan by decomposing the task into all required sub-tasks before any execution begins.
Plan-and-Solve Prompting (Wang et al., 2023), published at ACL 2023, addresses limitations of zero-shot chain-of-thought prompting. Rather than simply asking the model to "think step by step," Plan-and-Solve prompting instructs the model to first devise a plan that divides the task into subtasks, then carry out each subtask according to that plan. This two-phase approach reduces missing-step errors that plague standard chain-of-thought reasoning. Experiments on GPT-3 showed that Plan-and-Solve consistently outperformed zero-shot chain-of-thought across arithmetic, commonsense, and symbolic reasoning datasets.
HuggingGPT (Shen et al., 2023), published at NeurIPS 2023, demonstrates task decomposition in a multi-model orchestration setting. Given a user request, HuggingGPT uses ChatGPT to decompose the request into sub-tasks, select appropriate specialist models from Hugging Face for each sub-task, execute those models, and synthesize the results. For example, a request to "describe an image in detail" might be decomposed into image captioning, object detection, image classification, and visual question answering sub-tasks, each handled by a dedicated model. HuggingGPT explicitly manages dependencies between sub-tasks, ensuring that prerequisite tasks complete before dependent tasks begin.
Least-to-Most Prompting (Zhou et al., 2023), published at ICLR 2023, takes a bottom-up decomposition approach. The method first breaks a complex problem into a series of simpler subproblems, then solves them sequentially, with each solution building on the answers to previously solved subproblems. This approach is particularly effective for generalization: it enables models to solve problems that are harder than the examples provided in the prompt, overcoming a key limitation of standard chain-of-thought prompting.
In interleaved approaches, decomposition and execution alternate. The agent decomposes only the next sub-task, executes it, observes the result, and then decides on the next sub-task based on the updated state.
ReAct (Yao et al., 2023), published at ICLR 2023, synergizes reasoning and acting in language models. A ReAct agent operates in a loop of Thought, Action, and Observation. The agent reasons about what to do next (Thought), takes an action such as calling a tool or querying a database (Action), observes the result (Observation), and then uses that information for the next reasoning step. By interleaving reasoning with environment interaction, ReAct agents can adapt their plans based on real-time feedback, making them more robust than static planners. ReAct outperformed several baselines on language reasoning and decision-making benchmarks including HotpotQA and ALFWorld.
ADaPT (Prasad et al., 2024), published at NAACL 2024, introduces as-needed decomposition and planning. Instead of decomposing all sub-tasks upfront, ADaPT only decomposes a sub-task when the LLM fails to execute it directly. If a sub-task proves too complex, the agent recursively decomposes it into finer-grained steps. This adaptive approach adjusts decomposition granularity to both task complexity and model capability, achieving success rate improvements of up to 28.3% on ALFWorld, 27% on WebShop, and 33% on TextCraft compared to non-adaptive baselines.
DEPS (Describe, Explain, Plan, and Select) (Wang et al., 2023), published at NeurIPS 2023, addresses planning in open-ended environments like Minecraft. DEPS improves error correction during plan execution by having the agent describe the current execution state, explain failures, revise the plan, and select among parallel sub-goals using a trained goal selector that estimates completion difficulty. DEPS was the first zero-shot multi-task agent to accomplish over 70 Minecraft tasks, nearly doubling performance over prior methods.
Rather than generating a single plan, plan selection methods prompt the LLM to produce multiple candidate plans and then use search or evaluation algorithms to choose the best one.
Tree of Thoughts (ToT) (Yao et al., 2023), published at NeurIPS 2023, generalizes chain-of-thought prompting by enabling exploration of multiple reasoning paths. Where chain-of-thought generates a single linear sequence of thoughts, ToT structures the reasoning process as a tree. At each step, the model generates several candidate "thoughts" (intermediate reasoning steps), evaluates them using heuristics (which can themselves be LLM-generated), and uses search algorithms such as breadth-first search or depth-first search to explore the most promising branches. ToT also supports backtracking: if a branch proves unproductive, the search can return to an earlier state and try a different path.
ToT significantly improved performance on tasks requiring non-trivial planning, including the Game of 24 (mathematical reasoning), creative writing, and mini crosswords. The framework requires answering four design questions: how to decompose the reasoning into thought steps, how to generate candidate thoughts, how to evaluate states, and which search algorithm to use.
Graph of Thoughts (GoT) (Besta et al., 2024), published at AAAI 2024, extends the tree structure to an arbitrary graph. In GoT, individual LLM-generated thoughts are vertices in a graph, and edges represent dependencies between thoughts. This allows operations that trees cannot express, such as combining multiple thoughts into a single synthesized thought, or creating feedback loops where later thoughts refine earlier ones. GoT improved sorting quality by 62% over Tree of Thoughts while reducing costs by more than 31%.
A third category of approaches combines LLMs with external planning systems, leveraging the strengths of both. The LLM handles natural language understanding, commonsense reasoning, and problem formulation, while the external planner provides rigorous search and correctness guarantees.
LLM+P (Liu et al., 2023) was the first framework to integrate classical planners with LLMs. LLM+P takes a natural language description of a planning problem, uses the LLM to convert it into a PDDL specification, passes the PDDL problem to a classical planner, and then translates the planner's output back into natural language. In benchmark evaluations, LLM+P found optimal solutions for most problems, while standalone LLMs failed to produce even feasible plans for the majority of test cases.
SayCan (Ahn et al., 2022) from Google Research grounds LLM planning in robotic affordances. The system multiplies two probability scores for each candidate action: a "Say" score (how useful the action is toward the goal, estimated by the LLM) and a "Can" score (how feasible the action is given the robot's current state, estimated by learned affordance functions). Using the PaLM language model with affordance grounding, SayCan chose the correct sequence of skills 84% of the time and executed them successfully 74% of the time. SayCanPay (Hazra et al., 2023) extended this approach by adding a "Pay" component that considers long-term reward, addressing SayCan's greedy, short-sighted action selection.
Inner Monologue (Huang et al., 2022), published at CoRL 2022, introduces closed-loop language-based planning for robots. Rather than generating a plan once, Inner Monologue continuously feeds environment feedback (success detection, object recognition, scene descriptions, and human corrections) back into the LLM's context, creating an ongoing internal dialogue that allows the agent to replan in response to changing conditions. This approach significantly improved instruction completion rates on tabletop manipulation and mobile manipulation tasks.
Several prompting and reasoning strategies have been developed that directly enhance an agent's planning capabilities.
Chain-of-thought (CoT) prompting (Wei et al., 2022) encourages the LLM to generate intermediate reasoning steps before arriving at a final answer. While not a planning method per se, CoT is foundational to most LLM-based planning approaches, as it enables the model to reason through multi-step problems rather than attempting to jump directly to a solution. CoT can be elicited through few-shot examples that demonstrate step-by-step reasoning or through zero-shot prompts like "Let's think step by step."
The ReAct framework established the Thought-Action-Observation loop as a standard pattern for agentic planning. Several variants have built on this foundation:
| Method | Key innovation | Publication |
|---|---|---|
| ReAct | Interleaves reasoning traces with tool actions | ICLR 2023 |
| ReWOO | Decouples planning from execution for efficiency | Xu et al., 2023 |
| Reflexion | Adds verbal self-reflection and episodic memory | NeurIPS 2023 |
| ADaPT | Recursively decomposes only when execution fails | NAACL 2024 |
| DEPS | Adds description and explanation of failures for replanning | NeurIPS 2023 |
ReWOO (Reasoning WithOut Observation) (Xu et al., 2023) separates the planning phase entirely from the execution phase. A Planner module generates a complete blueprint of interdependent steps, a Worker module retrieves evidence from external tools for each step, and a Solver module synthesizes all plans and evidence into a final answer. By decoupling reasoning from observation, ReWOO achieved 5x token efficiency and 4% accuracy improvement on HotpotQA compared to interleaved approaches like ReAct.
A separate line of work focuses on training models with enhanced internal reasoning capabilities. OpenAI's o1 model (released September 2024) and its successor o3 (released early 2025) use a technique called "simulated reasoning," where the model generates an extended private chain of thought before producing an answer. These models spend more time "thinking" about problems, which improves performance on complex reasoning and planning tasks. On the AIME 2024 mathematics benchmark, o3 achieved 91.6% accuracy compared to o1's 74.3%, demonstrating the benefit of deeper reasoning for planning-intensive problems. The o3 model also integrates tool use, web search, and code execution into its reasoning process, blurring the line between internal reasoning and agentic planning.
Generating an initial plan is only part of the challenge. Agents must also verify that their plans are correct and refine them when they encounter errors or unexpected situations.
Reflexion (Shinn et al., 2023), published at NeurIPS 2023, introduces verbal reinforcement learning for language agents. Rather than updating model weights, Reflexion agents generate linguistic self-reflections after failed attempts and store these reflections in an episodic memory buffer. On the next attempt, the agent uses its stored reflections as additional context to avoid repeating previous mistakes. Reflexion achieved 91% pass@1 accuracy on the HumanEval coding benchmark (surpassing GPT-4's 80%) and improved performance on ALFWorld by 22% absolute over strong baselines within 12 iterative trials.
Closed-loop planning incorporates environmental feedback after each action to verify progress and trigger replanning when necessary. This contrasts with open-loop planning, where the full action sequence is generated and executed without intermediate feedback.
AdaPlanner (Sun et al., 2024) dynamically adjusts plan granularity based on uncertainty. When the agent is confident, it executes open-loop plans (longer action sequences without pausing for feedback). When uncertainty is high, it switches to closed-loop, ReAct-style micro-steps with observation after each action. This adaptive approach balances efficiency and robustness.
CART (Chen et al., 2025) is a traceable zero-shot planning framework that guides LLM-based planning agents through adaptive replanning in environments with incomplete information. When conditions trigger a replanning event, CART uses the historical planning trajectory to help the agent quickly resume planning from a reasonable node rather than starting over.
Plan verification ensures that a generated plan is valid before execution. Some approaches use the LLM itself to critique and verify plans, while others employ formal verification methods.
In the LLM+P framework, the classical planner inherently verifies plan validity because it only returns action sequences that satisfy all preconditions and achieve the goal state. Hybrid approaches that combine LLM plan generation with formal verification can catch errors such as missing preconditions, violated constraints, or impossible action sequences that LLMs might produce through hallucination.
Planning in embodied environments (robotics, virtual worlds) presents unique challenges because the agent must interact with a physical or simulated environment where actions have real consequences and states change continuously.
Voyager (Wang et al., 2023) is an LLM-powered embodied lifelong learning agent for Minecraft. Voyager combines three components: an automatic curriculum that maximizes exploration by proposing increasingly complex tasks, a skill library that stores successfully executed code-based skills indexed by their descriptions for retrieval in similar future situations, and an iterative prompting mechanism that generates executable code for embodied control. Voyager obtained 3.3x more unique items, traveled 2.3x longer distances, and unlocked key technology tree milestones up to 15.3x faster than prior state-of-the-art methods. The skill library enables compositional and lifelong learning: skills developed early can be retrieved and combined to solve novel tasks in new environments.
For real-world robotics, planning must account for physical constraints, sensor noise, and safety requirements. Systems like SayCan, SayCanPay, and Inner Monologue ground LLM-generated plans in the physical capabilities and current state of the robot, addressing the gap between what a language model might propose in the abstract and what is actually executable. This grounding is essential because LLMs can suggest actions that are physically impossible, unsafe, or infeasible given the robot's capabilities and surroundings.
Several modern frameworks have implemented agent planning architectures that developers can use to build planning-capable agents.
LangGraph, developed by the LangChain team, uses a graph-based architecture to define agent workflows. Instead of linear chains, developers define state machines with nodes, edges, and conditional routing. LangGraph supports the plan-and-execute pattern, where a planner LLM generates a multi-step plan and separate executor agents carry out each step. The graph structure naturally supports cycles (for iterative refinement), conditional branching (for adaptive planning), and state persistence (for long-running tasks). LangGraph is used in production at companies including LinkedIn, Uber, and Klarna.
The plan-and-execute architecture in LangGraph separates two components:
| Component | Role | Typical model |
|---|---|---|
| Planner | Generates a multi-step plan from the user's goal | Larger, more capable LLM |
| Executor | Carries out individual steps using tools | Smaller, domain-specific LLM or action agent |
This separation offers cost savings (the expensive planning LLM is called less frequently), better overall task completion rates (by forcing explicit upfront planning), and faster execution (sub-tasks can proceed without consulting the planner after each action).
AutoGen, originally developed by Microsoft Research, enables multi-agent conversations where agents with different roles can collaborate to solve tasks. In October 2025, Microsoft merged AutoGen with Semantic Kernel into a unified Microsoft Agent Framework. AutoGen supports planning through agent specialization: one agent can serve as a planner that decomposes tasks and coordinates work, while other agents serve as executors with access to specific tools or knowledge. Agents can engage in structured conversations to debate plans, critique solutions, and iteratively refine their approach.
CrewAI is an open-source framework for orchestrating role-based, collaborative AI agents. When planning is enabled, CrewAI generates a step-by-step workflow before agents begin their tasks, and this shared plan is injected into each agent's context so all participants understand the overall structure. CrewAI supports planning through role specialization, task delegation, and structured handoffs between agents. The framework emphasizes collaborative planning patterns where agents can review each other's work and provide feedback.
Anthropic's Model Context Protocol (MCP), donated to the Linux Foundation's Agentic AI Foundation in 2025, provides a standardized interface for connecting AI agents to external tools and data sources. While MCP is not a planning framework itself, it provides the infrastructure that planning agents need to interact with external systems during plan execution. With over 10,000 active public MCP servers and adoption by ChatGPT, Gemini, Microsoft Copilot, and other products, MCP has become a foundational layer for agentic planning systems.
Multi-agent planning involves multiple AI agents collaborating (or competing) to accomplish shared or individual goals. Multi-agent architectures can improve planning quality through specialization, parallel processing, and debate.
In role-based multi-agent systems, different agents are assigned distinct roles (planner, researcher, critic, executor) and collaborate through structured communication protocols. For example, a planning agent might generate an initial plan, a critic agent might identify potential issues, and the planner might revise the plan based on the critique. This mirrors how human teams often divide planning responsibilities.
Some multi-agent approaches use debate to improve plan quality. Multiple agents independently generate plans, then engage in structured discussion to evaluate alternatives and reach consensus. This approach leverages the diversity of different LLM outputs to explore a wider space of possible plans and identify weaknesses through adversarial critique.
In hierarchical multi-agent systems, a high-level planner agent decomposes tasks and delegates sub-tasks to specialized executor agents. This mirrors the structure of HTN planning but implements it through multi-agent communication rather than formal decomposition methods. HuggingGPT's architecture exemplifies this pattern: ChatGPT acts as the central planner, decomposing user requests and delegating sub-tasks to specialist models on Hugging Face.
Evaluating planning capabilities requires benchmarks that test multi-step reasoning, constraint satisfaction, and real-world applicability.
| Benchmark | Domain | What it tests |
|---|---|---|
| ALFWorld | Household tasks (text-based) | Multi-step interactive planning in simulated environments |
| WebShop | Online shopping | Planning sequences of web interactions to find and purchase products |
| TravelPlanner | Travel itinerary planning | Constraint satisfaction across transportation, accommodation, and activities using 4 million data records |
| HotpotQA | Multi-hop question answering | Planning information retrieval across multiple documents |
| SWE-bench | Software engineering | Planning code changes to resolve real GitHub issues |
| PlanBench | Classical planning domains | Reasoning about state changes and action effects |
| Game of 24 | Mathematical reasoning | Planning arithmetic operations to reach a target number |
| Minecraft (DEPS/Voyager) | Open-world survival | Long-horizon planning in procedurally generated environments |
TravelPlanner (Xie et al., 2024), published as an ICML 2024 spotlight paper, provides 1,225 meticulously curated travel planning intents and reference plans. Agents must collect information through diverse tools and make decisions while satisfying multiple constraints (budget, time, preferences). TravelPlanner revealed that even the most capable LLMs struggle with real-world planning that involves many interacting constraints.
Despite rapid progress, LLM-based planning faces several fundamental challenges.
LLMs can generate plans that include non-existent actions, invalid state transitions, or fabricated tool outputs. In embodied planning settings, hallucinated plan steps may reference objects that do not exist in the environment or propose actions that violate physical constraints. Plan verification mechanisms (formal checking, self-critique, or human oversight) are essential to catch these errors.
Current LLMs struggle with tasks that require planning over many steps. As the planning horizon increases, the probability of errors compounds, and the model may lose track of earlier context or constraints. The ICLR 2025 paper "LLMs Can Plan Only If We Tell Them" found that LLMs can often execute individual planning functions in isolation but struggle to autonomously coordinate them over extended sequences.
Real-world planning problems involve hard constraints (deadlines, budgets, physical laws) and soft constraints (preferences, priorities). LLMs frequently overlook constraints, especially in problems with many interacting requirements. TravelPlanner benchmark results demonstrated that even frontier models produce plans that violate basic constraints in the majority of test cases.
Planning methods that explore multiple reasoning paths (Tree of Thoughts, Graph of Thoughts) or iterate through refinement cycles (Reflexion) require many LLM calls, increasing latency and cost. The trade-off between plan quality and computational budget is an active area of research, with approaches like ReWOO specifically designed to reduce token consumption.
Plans that work in one domain may not transfer to another. While LLMs have broad knowledge, their planning strategies may be brittle when applied to novel domains that differ from their training distribution. Research on lifelong learning agents like Voyager, which build reusable skill libraries, represents one approach to improving planning generalization.