Agent planning refers to the process by which an AI agent determines a sequence of actions to accomplish a goal. In the context of large language models (LLMs) and modern agentic systems, planning encompasses task decomposition, reasoning about action sequences, evaluating possible strategies, and adapting plans in response to feedback from the environment. Planning is one of the core capabilities that distinguishes an autonomous agent from a simple question-answering system, enabling it to tackle complex, multi-step tasks that require coordination, tool use, and long-horizon decision-making.
Planning has been a central topic in artificial intelligence since the field's earliest days. Classical AI planning systems such as STRIPS (Stanford Research Institute Problem Solver, 1971) and later formalisms like the Planning Domain Definition Language (PDDL) provided rigorous frameworks for defining states, actions, preconditions, and effects. These systems could guarantee plan correctness and even optimality through systematic search algorithms, but they required problems to be fully specified in formal representations and could not handle natural language input or ambiguous real-world tasks.
The emergence of LLMs has opened a new chapter in planning research. Models like GPT-4, Claude, and open-source alternatives can interpret natural language goals, decompose complex tasks, reason about action sequences, and generate plans without requiring formal domain specifications. However, LLM-based planning introduces its own challenges: generated plans may contain hallucinated steps, violate constraints, or fail to account for long-term consequences. This tension between the flexibility of LLM-based planning and the rigor of classical approaches has become a defining theme in contemporary agent research.
Classical planning operates within a well-defined formal framework. A planning problem is typically specified by an initial state, a goal state, and a set of actions with preconditions and effects. The planner searches through the space of possible action sequences to find one that transforms the initial state into the goal state.
STRIPS, introduced by Richard Fikes and Nils Nilsson in 1971, was one of the first automated planning systems. It represented the world as a set of first-order logic predicates and defined operators with preconditions, add lists, and delete lists. The Planning Domain Definition Language (PDDL), introduced by Drew McDermott in 1998 for the International Planning Competition, generalized the STRIPS formalism and became the standard language for expressing planning problems. PDDL supports typed objects, conditional effects, numeric fluents, temporal constraints, and other expressive features that allow complex domains to be modeled.
Classical planners, given a PDDL problem specification, use search algorithms (forward state-space search, backward search, partial-order planning, or heuristic search methods like FF and Fast Downward) to find valid or optimal action sequences. Their strength lies in completeness and correctness guarantees: if a valid plan exists, a sound planner will find it.
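The core of a STRIPS-style planner can be sketched in a few lines of Python. This is a toy illustration, not the original system: the door-and-key domain and action names are invented, and plain breadth-first search stands in for the heuristic search a production planner would use.

```python
from collections import deque

# Minimal STRIPS-style actions: (preconditions, add list, delete list).
# States are frozensets of ground facts (strings). Domain is invented.
ACTIONS = {
    "pick_up_key": ({"at_door", "key_on_floor"}, {"holding_key"}, {"key_on_floor"}),
    "unlock_door": ({"at_door", "holding_key"}, {"door_unlocked"}, set()),
    "open_door":   ({"door_unlocked"}, {"door_open"}, set()),
}

def forward_search(init, goal):
    """Breadth-first forward state-space search; returns a shortest plan."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:          # all goal facts hold
            return plan
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:       # preconditions satisfied
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None                    # no plan exists

plan = forward_search({"at_door", "key_on_floor"}, {"door_open"})
```

Because the search is systematic, it inherits the completeness property described above: if any action sequence reaches the goal, breadth-first search will find a shortest one.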
Hierarchical task network (HTN) planning extends classical planning by introducing task hierarchies. Instead of searching over primitive actions alone, HTN planners decompose abstract (compound) tasks into subtasks using predefined decomposition methods. This recursive decomposition continues until all tasks have been reduced to primitive actions that can be executed directly.
HTN planning has several advantages for complex domains. The hierarchical structure encodes domain knowledge about how tasks should be accomplished, reducing the search space compared to flat planning. Task hierarchies are reusable and composable, making them effective in robotics, logistics, game AI, and simulation environments. However, HTN planning requires manually designed task hierarchies and decomposition methods, which limits its applicability to new domains.
Classical planning approaches have well-known limitations in the context of modern AI applications:
| Limitation | Description |
|---|---|
| Formal specification requirement | Problems must be encoded in PDDL or similar formalisms; natural language goals cannot be processed directly |
| Closed-world assumption | Planners assume complete knowledge of the initial state and all possible actions |
| Scalability | State-space search can become computationally intractable for large or complex domains |
| Brittleness | Plans break when the environment deviates from the model; replanning can be expensive |
| No commonsense reasoning | Classical planners have no built-in knowledge about how the world generally works |
These limitations motivated research into combining classical planners with language models, which can handle natural language, reason with commonsense knowledge, and operate in partially observable or loosely specified environments.
LLM-based planning uses large language models as the core reasoning engine for generating, evaluating, and refining plans. Rather than searching through formally defined state spaces, these approaches leverage the vast knowledge and reasoning capabilities encoded in pretrained language models.
A 2024 survey by Huang et al. ("Understanding the Planning of LLM Agents") identified five major categories of work in LLM-based agent planning: task decomposition, plan selection, external planner integration, reflection, and memory. These categories capture the primary strategies researchers have developed to improve the planning abilities of language model agents.
Task decomposition is the process of breaking a complex goal into smaller, more manageable sub-tasks. This is arguably the most widely studied aspect of LLM-based planning, as it directly addresses the difficulty LLMs face with long-horizon, multi-step problems.
In decomposition-first methods, the agent generates a complete plan by decomposing the task into all required sub-tasks before any execution begins.
Plan-and-Solve Prompting (Wang et al., 2023), published at ACL 2023, addresses limitations of zero-shot chain-of-thought prompting. Rather than simply asking the model to "think step by step," Plan-and-Solve prompting instructs the model to first devise a plan that divides the task into subtasks, then carry out each subtask according to that plan. This two-phase approach reduces missing-step errors that plague standard chain-of-thought reasoning. Experiments on GPT-3 showed that Plan-and-Solve consistently outperformed zero-shot chain-of-thought across arithmetic, commonsense, and symbolic reasoning datasets.
HuggingGPT (Shen et al., 2023), published at NeurIPS 2023, demonstrates task decomposition in a multi-model orchestration setting. Given a user request, HuggingGPT uses ChatGPT to decompose the request into sub-tasks, select appropriate specialist models from Hugging Face for each sub-task, execute those models, and synthesize the results. For example, a request to "describe an image in detail" might be decomposed into image captioning, object detection, image classification, and visual question answering sub-tasks, each handled by a dedicated model. HuggingGPT explicitly manages dependencies between sub-tasks, ensuring that prerequisite tasks complete before dependent tasks begin.
Least-to-Most Prompting (Zhou et al., 2023), published at ICLR 2023, takes a bottom-up decomposition approach. The method first breaks a complex problem into a series of simpler subproblems, then solves them sequentially, with each solution building on the answers to previously solved subproblems. This approach is particularly effective for generalization: it enables models to solve problems that are harder than the examples provided in the prompt, overcoming a key limitation of standard chain-of-thought prompting.
In interleaved approaches, decomposition and execution alternate. The agent decomposes only the next sub-task, executes it, observes the result, and then decides on the next sub-task based on the updated state.
ReAct (Yao et al., 2023), published at ICLR 2023, synergizes reasoning and acting in language models. A ReAct agent operates in a loop of Thought, Action, and Observation. The agent reasons about what to do next (Thought), takes an action such as calling a tool or querying a database (Action), observes the result (Observation), and then uses that information for the next reasoning step. By interleaving reasoning with environment interaction, ReAct agents can adapt their plans based on real-time feedback, making them more robust than static planners. ReAct outperformed several baselines on language reasoning and decision-making benchmarks including HotpotQA and ALFWorld.
ADaPT (Prasad et al., 2024), published at NAACL 2024, introduces as-needed decomposition and planning. Instead of decomposing all sub-tasks upfront, ADaPT only decomposes a sub-task when the LLM fails to execute it directly. If a sub-task proves too complex, the agent recursively decomposes it into finer-grained steps. This adaptive approach adjusts decomposition granularity to both task complexity and model capability, achieving success rate improvements of up to 28.3% on ALFWorld, 27% on WebShop, and 33% on TextCraft compared to non-adaptive baselines.
DEPS (Describe, Explain, Plan, and Select) (Wang et al., 2023), published at NeurIPS 2023, addresses planning in open-ended environments like Minecraft. DEPS improves error correction during plan execution by having the agent describe the current execution state, explain failures, revise the plan, and select among parallel sub-goals using a trained goal selector that estimates completion difficulty. DEPS was the first zero-shot multi-task agent to accomplish over 70 Minecraft tasks, nearly doubling performance over prior methods.
Rather than generating a single plan, plan selection methods prompt the LLM to produce multiple candidate plans and then use search or evaluation algorithms to choose the best one.
Tree of Thoughts (ToT) (Yao et al., 2023), published at NeurIPS 2023, generalizes chain-of-thought prompting by enabling exploration of multiple reasoning paths. Where chain-of-thought generates a single linear sequence of thoughts, ToT structures the reasoning process as a tree. At each step, the model generates several candidate "thoughts" (intermediate reasoning steps), evaluates them using heuristics (which can themselves be LLM-generated), and uses search algorithms such as breadth-first search or depth-first search to explore the most promising branches. ToT also supports backtracking: if a branch proves unproductive, the search can return to an earlier state and try a different path.
ToT significantly improved performance on tasks requiring non-trivial planning, including the Game of 24 (mathematical reasoning), creative writing, and mini crosswords. The framework requires answering four design questions: how to decompose the reasoning into thought steps, how to generate candidate thoughts, how to evaluate states, and which search algorithm to use.
Graph of Thoughts (GoT) (Besta et al., 2024), published at AAAI 2024, extends the tree structure to an arbitrary graph. In GoT, individual LLM-generated thoughts are vertices in a graph, and edges represent dependencies between thoughts. This allows operations that trees cannot express, such as combining multiple thoughts into a single synthesized thought, or creating feedback loops where later thoughts refine earlier ones. GoT improved sorting quality by 62% over Tree of Thoughts while reducing costs by more than 31%.
A third category of approaches combines LLMs with external planning systems, leveraging the strengths of both. The LLM handles natural language understanding, commonsense reasoning, and problem formulation, while the external planner provides rigorous search and correctness guarantees.
LLM+P (Liu et al., 2023) was the first framework to integrate classical planners with LLMs. LLM+P takes a natural language description of a planning problem, uses the LLM to convert it into a PDDL specification, passes the PDDL problem to a classical planner, and then translates the planner's output back into natural language. In benchmark evaluations, LLM+P found optimal solutions for most problems, while standalone LLMs failed to produce even feasible plans for the majority of test cases.
SayCan (Ahn et al., 2022) from Google Research grounds LLM planning in robotic affordances. The system multiplies two probability scores for each candidate action: a "Say" score (how useful the action is toward the goal, estimated by the LLM) and a "Can" score (how feasible the action is given the robot's current state, estimated by learned affordance functions). Using the PaLM language model with affordance grounding, SayCan chose the correct sequence of skills 84% of the time and executed them successfully 74% of the time. SayCanPay (Hazra et al., 2023) extended this approach by adding a "Pay" component that considers long-term reward, addressing SayCan's greedy, short-sighted action selection.
Inner Monologue (Huang et al., 2022), published at CoRL 2022, introduces closed-loop language-based planning for robots. Rather than generating a plan once, Inner Monologue continuously feeds environment feedback (success detection, object recognition, scene descriptions, and human corrections) back into the LLM's context, creating an ongoing internal dialogue that allows the agent to replan in response to changing conditions. This approach significantly improved instruction completion rates on tabletop manipulation and mobile manipulation tasks.
Several prompting and reasoning strategies have been developed that directly enhance an agent's planning capabilities.
Chain-of-thought (CoT) prompting (Wei et al., 2022) encourages the LLM to generate intermediate reasoning steps before arriving at a final answer. While not a planning method per se, CoT is foundational to most LLM-based planning approaches, as it enables the model to reason through multi-step problems rather than attempting to jump directly to a solution. CoT can be elicited through few-shot examples that demonstrate step-by-step reasoning or through zero-shot prompts like "Let's think step by step."
The ReAct framework established the Thought-Action-Observation loop as a standard pattern for agentic planning. Several variants have built on this foundation:
| Method | Key innovation | Publication |
|---|---|---|
| ReAct | Interleaves reasoning traces with tool actions | ICLR 2023 |
| ReWOO | Decouples planning from execution for efficiency | Xu et al., 2023 |
| Reflexion | Adds verbal self-reflection and episodic memory | NeurIPS 2023 |
| ADaPT | Recursively decomposes only when execution fails | NAACL 2024 |
| DEPS | Adds description and explanation of failures for replanning | NeurIPS 2023 |
ReWOO (Reasoning WithOut Observation) (Xu et al., 2023) separates the planning phase entirely from the execution phase. A Planner module generates a complete blueprint of interdependent steps, a Worker module retrieves evidence from external tools for each step, and a Solver module synthesizes all plans and evidence into a final answer. By decoupling reasoning from observation, ReWOO achieved 5x token efficiency and 4% accuracy improvement on HotpotQA compared to interleaved approaches like ReAct.
A separate line of work focuses on training models with enhanced internal reasoning capabilities. OpenAI's o1 model (released September 2024) and its successor o3 (released early 2025) use a technique called "simulated reasoning," where the model generates an extended private chain of thought before producing an answer. These models spend more time "thinking" about problems, which improves performance on complex reasoning and planning tasks. On the AIME 2024 mathematics benchmark, o3 achieved 91.6% accuracy compared to o1's 74.3%, demonstrating the benefit of deeper reasoning for planning-intensive problems. The o3 model also integrates tool use, web search, and code execution into its reasoning process, blurring the line between internal reasoning and agentic planning.
Generating an initial plan is only part of the challenge. Agents must also verify that their plans are correct and refine them when they encounter errors or unexpected situations.
Reflexion (Shinn et al., 2023), published at NeurIPS 2023, introduces verbal reinforcement learning for language agents. Rather than updating model weights, Reflexion agents generate linguistic self-reflections after failed attempts and store these reflections in an episodic memory buffer. On the next attempt, the agent uses its stored reflections as additional context to avoid repeating previous mistakes. Reflexion achieved 91% pass@1 accuracy on the HumanEval coding benchmark (surpassing GPT-4's 80%) and improved performance on ALFWorld by 22% absolute over strong baselines within 12 iterative trials.
Closed-loop planning incorporates environmental feedback after each action to verify progress and trigger replanning when necessary. This contrasts with open-loop planning, where the full action sequence is generated and executed without intermediate feedback.
AdaPlanner (Sun et al., 2024) dynamically adjusts plan granularity based on uncertainty. When the agent is confident, it executes open-loop plans (longer action sequences without pausing for feedback). When uncertainty is high, it switches to closed-loop, ReAct-style micro-steps with observation after each action. This adaptive approach balances efficiency and robustness.
CART (Chen et al., 2025) is a traceable zero-shot planning framework that guides LLM-based planning agents through adaptive replanning in environments with incomplete information. When conditions trigger a replanning event, CART uses the historical planning trajectory to help the agent quickly resume planning from a reasonable node rather than starting over.
Plan verification ensures that a generated plan is valid before execution. Some approaches use the LLM itself to critique and verify plans, while others employ formal verification methods.
In the LLM+P framework, the classical planner inherently verifies plan validity because it only returns action sequences that satisfy all preconditions and achieve the goal state. Hybrid approaches that combine LLM plan generation with formal verification can catch errors such as missing preconditions, violated constraints, or impossible action sequences that LLMs might produce through hallucination.
Planning in embodied environments (robotics, virtual worlds) presents unique challenges because the agent must interact with a physical or simulated environment where actions have real consequences and states change continuously.
Voyager (Wang et al., 2023) is an LLM-powered embodied lifelong learning agent for Minecraft. Voyager combines three components: an automatic curriculum that maximizes exploration by proposing increasingly complex tasks, a skill library that stores successfully executed code-based skills indexed by their descriptions for retrieval in similar future situations, and an iterative prompting mechanism that generates executable code for embodied control. Voyager obtained 3.3x more unique items, traveled 2.3x longer distances, and unlocked key technology tree milestones up to 15.3x faster than prior state-of-the-art methods. The skill library enables compositional and lifelong learning: skills developed early can be retrieved and combined to solve novel tasks in new environments.
For real-world robotics, planning must account for physical constraints, sensor noise, and safety requirements. Systems like SayCan, SayCanPay, and Inner Monologue ground LLM-generated plans in the physical capabilities and current state of the robot, addressing the gap between what a language model might propose in the abstract and what is actually executable. This grounding is essential because LLMs can suggest actions that are physically impossible, unsafe, or infeasible given the robot's capabilities and surroundings.
Several modern frameworks have implemented agent planning architectures that developers can use to build planning-capable agents.
LangGraph, developed by the LangChain team, uses a graph-based architecture to define agent workflows. Instead of linear chains, developers define state machines with nodes, edges, and conditional routing. LangGraph supports the plan-and-execute pattern, where a planner LLM generates a multi-step plan and separate executor agents carry out each step. The graph structure naturally supports cycles (for iterative refinement), conditional branching (for adaptive planning), and state persistence (for long-running tasks). LangGraph is used in production at companies including LinkedIn, Uber, and Klarna.
The plan-and-execute architecture in LangGraph separates two components:
| Component | Role | Typical model |
|---|---|---|
| Planner | Generates a multi-step plan from the user's goal | Larger, more capable LLM |
| Executor | Carries out individual steps using tools | Smaller, domain-specific LLM or action agent |
This separation offers cost savings (the expensive planning LLM is called less frequently), better overall task completion rates (by forcing explicit upfront planning), and faster execution (sub-tasks can proceed without consulting the planner after each action).
AutoGen, originally developed by Microsoft Research, enables multi-agent conversations where agents with different roles can collaborate to solve tasks. In October 2025, Microsoft merged AutoGen with Semantic Kernel into a unified Microsoft Agent Framework. AutoGen supports planning through agent specialization: one agent can serve as a planner that decomposes tasks and coordinates work, while other agents serve as executors with access to specific tools or knowledge. Agents can engage in structured conversations to debate plans, critique solutions, and iteratively refine their approach.
CrewAI is an open-source framework for orchestrating role-based, collaborative AI agents. When planning is enabled, CrewAI generates a step-by-step workflow before agents begin their tasks, and this shared plan is injected into each agent's context so all participants understand the overall structure. CrewAI supports planning through role specialization, task delegation, and structured handoffs between agents. The framework emphasizes collaborative planning patterns where agents can review each other's work and provide feedback.
Anthropic's Model Context Protocol (MCP), donated to the Linux Foundation's Agentic AI Foundation in 2025, provides a standardized interface for connecting AI agents to external tools and data sources. While MCP is not a planning framework itself, it provides the infrastructure that planning agents need to interact with external systems during plan execution. With over 10,000 active public MCP servers and adoption by ChatGPT, Gemini, Microsoft Copilot, and other products, MCP has become a foundational layer for agentic planning systems.
Multi-agent planning involves multiple AI agents collaborating (or competing) to accomplish shared or individual goals. Multi-agent architectures can improve planning quality through specialization, parallel processing, and debate.
In role-based multi-agent systems, different agents are assigned distinct roles (planner, researcher, critic, executor) and collaborate through structured communication protocols. For example, a planning agent might generate an initial plan, a critic agent might identify potential issues, and the planner might revise the plan based on the critique. This mirrors how human teams often divide planning responsibilities.
Some multi-agent approaches use debate to improve plan quality. Multiple agents independently generate plans, then engage in structured discussion to evaluate alternatives and reach consensus. This approach leverages the diversity of different LLM outputs to explore a wider space of possible plans and identify weaknesses through adversarial critique.
In hierarchical multi-agent systems, a high-level planner agent decomposes tasks and delegates sub-tasks to specialized executor agents. This mirrors the structure of HTN planning but implements it through multi-agent communication rather than formal decomposition methods. HuggingGPT's architecture exemplifies this pattern: ChatGPT acts as the central planner, decomposing user requests and delegating sub-tasks to specialist models on Hugging Face.
Evaluating planning capabilities requires benchmarks that test multi-step reasoning, constraint satisfaction, and real-world applicability.
| Benchmark | Domain | What it tests |
|---|---|---|
| ALFWorld | Household tasks (text-based) | Multi-step interactive planning in simulated environments |
| WebShop | Online shopping | Planning sequences of web interactions to find and purchase products |
| TravelPlanner | Travel itinerary planning | Constraint satisfaction across transportation, accommodation, and activities using 4 million data records |
| HotpotQA | Multi-hop question answering | Planning information retrieval across multiple documents |
| SWE-bench | Software engineering | Planning code changes to resolve real GitHub issues |
| PlanBench | Classical planning domains | Reasoning about state changes and action effects |
| Game of 24 | Mathematical reasoning | Planning arithmetic operations to reach a target number |
| Minecraft (DEPS/Voyager) | Open-world survival | Long-horizon planning in procedurally generated environments |
TravelPlanner (Xie et al., 2024), published as an ICML 2024 spotlight paper, provides 1,225 meticulously curated travel planning intents and reference plans. Agents must collect information through diverse tools and make decisions while satisfying multiple constraints (budget, time, preferences). TravelPlanner revealed that even the most capable LLMs struggle with real-world planning that involves many interacting constraints.
Despite rapid progress, LLM-based planning faces several fundamental challenges.
LLMs can generate plans that include non-existent actions, invalid state transitions, or fabricated tool outputs. In embodied planning settings, hallucinated plan steps may reference objects that do not exist in the environment or propose actions that violate physical constraints. Plan verification mechanisms (formal checking, self-critique, or human oversight) are essential to catch these errors.
Current LLMs struggle with tasks that require planning over many steps. As the planning horizon increases, the probability of errors compounds, and the model may lose track of earlier context or constraints. The ICLR 2025 paper "LLMs Can Plan Only If We Tell Them" found that LLMs can often execute individual planning functions in isolation but struggle to autonomously coordinate them over extended sequences.
Real-world planning problems involve hard constraints (deadlines, budgets, physical laws) and soft constraints (preferences, priorities). LLMs frequently overlook constraints, especially in problems with many interacting requirements. TravelPlanner benchmark results demonstrated that even frontier models produce plans that violate basic constraints in the majority of test cases.
Planning methods that explore multiple reasoning paths (Tree of Thoughts, Graph of Thoughts) or iterate through refinement cycles (Reflexion) require many LLM calls, increasing latency and cost. The trade-off between plan quality and computational budget is an active area of research, with approaches like ReWOO specifically designed to reduce token consumption.
Plans that work in one domain may not transfer to another. While LLMs have broad knowledge, their planning strategies may be brittle when applied to novel domains that differ from their training distribution. Research on lifelong learning agents like Voyager, which build reusable skill libraries, represents one approach to improving planning generalization.