Tree of Thoughts
Last reviewed
May 2, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 ยท 3,631 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 2, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 ยท 3,631 words
Add missing citations, update stale details, or suggest a clearer explanation.
Tree of Thoughts (ToT) is a prompting and inference-time search framework for large language models that lets the model explore multiple intermediate reasoning steps as nodes in a tree, evaluate each candidate, and backtrack when a path looks unpromising. The framework was introduced in May 2023 by Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan in the paper Tree of Thoughts: Deliberate Problem Solving with Large Language Models (arXiv:2305.10601), with affiliations to Princeton University and Google DeepMind. The paper was accepted as an oral presentation at NeurIPS 2023. ToT generalises chain-of-thought prompting by replacing the single linear reasoning trajectory with a structured search over partial solutions, where the language model itself acts both as a thought generator and as a value function.
A second paper with a similar title, Large Language Model Guided Tree-of-Thought by Jieyi Long (arXiv:2305.08291), appeared two days earlier on 15 May 2023 and proposes a related but distinct multi-module framework. The two works are independent and are sometimes confused in secondary sources. This article focuses on the Yao et al. version, which is the more widely cited and the one usually meant by "ToT" in the prompting literature.
| Field | Value |
|---|---|
| Paper title | Tree of Thoughts: Deliberate Problem Solving with Large Language Models |
| Authors | Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan |
| Affiliations | Princeton University, Google DeepMind |
| arXiv ID | 2305.10601 |
| First posted | 17 May 2023 (v1); revised 3 December 2023 (v2) |
| Conference | NeurIPS 2023 (oral) |
| Code repository | github.com/princeton-nlp/tree-of-thought-llm |
| License | MIT |
| Related disambiguation paper | Long, Large Language Model Guided Tree-of-Thought, arXiv:2305.08291, 15 May 2023 |
By late 2022, chain-of-thought prompting introduced by Wei et al. (Google Brain, 2022) had established that asking a large language model to produce intermediate reasoning steps before a final answer improves performance on arithmetic, commonsense, and symbolic tasks. Self-consistency, proposed by Wang et al. in 2022, extended chain-of-thought by sampling multiple reasoning paths and taking a majority vote on the final answer. Both techniques generate text strictly left to right and never revise an earlier step once it has been written.
This left-to-right structure has known limits. A single bad token early in the chain can derail the rest of the trajectory. Sampling many independent chains, as in self-consistency, helps when the correct answer is the most common mode of the distribution, but it does not let the model look ahead, compare partial solutions, or recover from a poor start. Many problems studied in classical AI, such as planning, theorem proving, and combinatorial puzzles, instead use explicit search with backtracking. Yao et al. argued that LLM reasoning could benefit from a similar treatment: maintain a frontier of partial solutions, expand them, score them, and prune.
The paper draws an analogy to Daniel Kahneman's distinction between System 1 (fast, intuitive) and System 2 (slow, deliberate) cognition. Standard autoregressive decoding and chain-of-thought are framed as System 1 style, while ToT is positioned as a deliberate, System 2 style search procedure built on top of a System 1 generator.
A ToT instance for a given task is defined by four components:
4 + 9 = 13 (left: 6 8 13). For Creative Writing it is a paragraph plan. For Mini Crosswords it is a single word filled into one across or down clue.k distinct continuations in one call.sure / maybe / impossible (used for Game of 24 and Crosswords), and a vote prompt that ranks several candidates against each other (used for Creative Writing). Numerical values are obtained by sampling the value prompt several times and averaging.b states per depth, and uses DFS with a pruning threshold for Mini Crosswords, where the model abandons a branch as soon as the evaluator declares any remaining clue impossible.The high-level loop is straightforward:
input: problem x, generator G, evaluator V, search S
initial state s0 = x
frontier = {s0}
for depth t = 1 .. T:
candidates = {G(s) for s in frontier}
scores = V(candidates)
frontier = S.select(candidates, scores, b)
return best terminal state in frontier
Because the four components are decoupled, ToT works on top of any base LLM without finetuning. All experiments in the original paper use GPT-4 accessed through the OpenAI API, with temperature 0.7 for sampling.
| Hyperparameter | Symbol | Role | Typical setting in paper |
|---|---|---|---|
| Tree depth | T | Number of thought steps before a terminal state | 3 (Game of 24), 3 (Creative Writing plan-then-write), variable up to 100 expansions (Crosswords) |
| Children per node | k | Candidates generated from each state | 5 (Game of 24), 5 plans then 5 passages (Creative Writing) |
| Beam width | b | States kept per depth in BFS | 1 or 5 (Game of 24), 1 (Creative Writing) |
| Value samples | Number of evaluator calls per candidate, averaged | 3 (Game of 24) | |
| Search type | BFS or DFS | BFS for Game of 24 and Creative Writing, DFS for Crosswords | |
| Evaluator style | Value prompt or vote prompt | Value for Game of 24 and Crosswords, vote for Creative Writing |
The combination of k and b controls the cost and breadth of the search. Larger b means more candidates survive each level and more LLM calls per task, with diminishing returns once the correct path is reliably inside the top-b set.
The paper evaluates ToT on three tasks chosen because they require explicit planning or backtracking and because vanilla GPT-4 performs poorly on them.
Game of 24 is a classic arithmetic puzzle: given four numbers, combine them with +, -, *, / (each number used once) to make 24. Yao et al. evaluate on 100 puzzles drawn from 4nums.com, ranked by difficulty. Each ToT step proposes one arithmetic operation, leaving fewer numbers and updating the running expression. The evaluator labels remaining-number sets as sure (a route to 24 looks reachable), likely, or impossible.
| Method | Game of 24 success rate |
|---|---|
| Standard input-output (IO) prompt | 7.3% |
| Chain-of-thought (CoT) | 4.0% |
| CoT with self-consistency, k=100 | 9.0% |
| IO best-of-100 (oracle pick) | 33% |
| CoT best-of-100 (oracle pick) | 49% |
| ToT (b=1) | 45% |
| ToT (b=5) | 74% |
The headline figure that gets quoted most often, that ToT reaches 74% on Game of 24 versus 4% for vanilla CoT, comes from this table. The CoT-with-self-consistency baseline at 9% is the most direct apples-to-apples comparison since it uses similar total compute to a small ToT search.
The Creative Writing task asks the model to produce a four-paragraph passage where each paragraph must end with a given sentence. The four ending sentences are sampled randomly so they do not naturally fit together, which forces the model to plan a story arc. ToT first generates 5 candidate plans with k=5, runs a vote step in which the LLM compares plans, then conditions on the winning plan to write the passage and votes again across 5 generated passages.
GPT-4 evaluates outputs on a 1-to-10 coherency scale, and human judges perform pairwise preference comparisons on 100 passages. ToT outputs receive a higher average GPT-4 coherency score than IO and CoT baselines, and human raters prefer ToT to CoT in 41 of 100 pairs versus 21 for CoT (the rest are ties). The exact numerical scale is reported in Figure 5 of the paper.
Mini Crosswords are 5x5 puzzles drawn from GooBix, with 5 across and 5 down clues. The task is harder than Game of 24 because the search depth is around 10 and intermediate words must be mutually consistent. ToT uses DFS and treats each thought as filling one clue. The evaluator labels each remaining clue, given the partial board, as sure / maybe / impossible, and the search prunes branches as soon as any remaining clue is judged impossible.
| Metric | IO | CoT | ToT (DFS) |
|---|---|---|---|
| Letter-level success | 38.7% | 40.6% | 78% |
| Word-level success | 14% | 15.6% | 60% |
| Game-level success (whole puzzle) | 0% | 1% | 20% (4 of 20 puzzles) |
The gap between word-level and game-level success rates illustrates a recurring observation about ToT. The search greatly improves local correctness, but solving a whole 5x5 puzzle still demands a sequence of correct decisions that the evaluator does not always detect.
ToT inherits the modularity of prompt engineering: the search is wrapped around a frozen LLM, so there is no training, no gradient access, and no need for special tokens. The same recipe transfers to any base model and any task as long as a thought decomposition and an evaluator can be specified in natural language.
The explicit search lets the model do something autoregressive decoding cannot: discard a partial solution and try a different one. On Game of 24 the dominant failure mode of CoT is committing to a first operation that makes 24 unreachable. ToT with even a small beam dodges this failure because losing branches are pruned before the model finishes the chain. On Crosswords, DFS lets the model abandon a clue assignment that conflicts with later clues, which a left-to-right decoder cannot do without a full restart.
Because the value function is itself an LLM call, the framework is general. Any task where humans can describe what "making progress" looks like in natural language can in principle be wrapped in ToT.
The biggest cost is API spend and latency. A single Game of 24 puzzle with b=5 and three steps requires roughly 100 GPT-4 calls counting candidate generation and value scoring, several orders of magnitude more than a single CoT chain. The paper reports total token costs to make this trade-off explicit. For tasks where vanilla CoT is already strong, the marginal accuracy gain rarely justifies the extra cost.
The second cost is task-specific design. The thought decomposition, the prompt format for the generator, the evaluator schema, and the search algorithm all have to be chosen by hand for each new task. There is no general-purpose ToT runner that takes an arbitrary problem and decides how to split it. Several follow-ups, including buffer-of-thoughts approaches, partly attack this limitation.
A third issue is that the evaluator is not always reliable. The value prompt asks the same model that generated the candidate to score it, and the model can be confidently wrong on both ends. On Mini Crosswords the paper notes that the search occasionally rules out branches that contained the correct word, simply because the evaluator marked an unrelated clue as impossible.
Independent replications have produced mixed results. Some implementations report numbers close to the paper's, while others report smaller gaps or much higher cost in practice. Sensitivity to prompt wording and to the specific GPT-4 snapshot date is a known concern with all GPT-4-only results from 2023.
| Property | IO prompt | Chain-of-thought | Self-consistency | Tree of Thoughts |
|---|---|---|---|---|
| Intermediate steps | None | Linear chain | Many parallel chains | Tree |
| Looks ahead | No | No | No | Yes (via evaluator) |
| Backtracks | No | No | No | Yes |
| Cost (LLM calls per task) | 1 | 1 | k (e.g. 40) | up to k * b * T |
| Aggregation | None | None | Majority vote on final answer | Search over partial states |
| Requires task design | Minimal | Few-shot exemplars | Few-shot exemplars | Decomposition + evaluator + search |
| Reported Game of 24 success | 7.3% | 4.0% | 9.0% | 74% (b=5) |
The ToT framing has spawned a sizable cluster of follow-up papers that swap the tree for other structures or change the search algorithm.
These variants share the assumption that inference-time search over LLM-generated thoughts is a useful abstraction, even when the specific data structure or search policy differs.
A paper titled Large Language Model Guided Tree-of-Thought by Jieyi Long (arXiv:2305.08291) appeared on 15 May 2023, two days before the Yao et al. preprint. Long's framework introduces a separate prompter agent, a checker module, a memory module, and a ToT controller that coordinates them across multi-round conversations, demonstrated on Sudoku. It uses the same name, was developed independently, and is not a subset or extension of the Yao et al. work. The two papers cite different motivations and use different evaluation tasks. When the prompting community refers to "ToT", they almost always mean the Yao et al. version, which is the one accepted at NeurIPS 2023 and the one whose code repository is hosted under princeton-nlp.
The official implementation lives at github.com/princeton-nlp/tree-of-thought-llm and contains the prompts, model outputs, and runner scripts for all three benchmark tasks. The repository is MIT licensed and depends on the OpenAI Python client.
Third-party libraries have added ToT-style helpers:
ToTChain module that wraps a generator, a checker, and a controller around any chat model, modelled on the Long paper's vocabulary rather than the Yao paper's.In practice, production systems usually use a stripped-down ToT: a small b, one to three steps, and a hand-tuned value prompt rather than the full BFS or DFS described in the paper. The decision is dominated by API cost.
ToT is one of the early entries in the broader trend of trading more inference compute for better answers, sometimes called test-time or inference-time compute. The same trend includes self-consistency, best-of-N sampling, process reward models, OpenAI's o1 and o3 reasoning models released in 2024 and 2025, and DeepSeek-R1 in 2025. Reasoning models internalise some of what ToT does externally: they generate long internal traces, evaluate partial work, and sometimes backtrack inside a single decoded sequence rather than via an external controller.
The ToT paper does not claim a fundamental ceiling on what scaled-up search can achieve. Subsequent work on process reward models and on learned verifiers (Cobbe et al., Lightman et al., Uesato et al.) has shown that the value-function side of ToT can be replaced with a trained scorer, often improving accuracy and reducing cost. The combination of LLM-generated candidates with learned verifiers is a direct descendant of the ToT formulation.
The vocabulary of ToT is borrowed almost verbatim from classical AI: state, successor, value, BFS, DFS, beam, pruning. The novelty is not the search but the use of an LLM as both the successor function and the heuristic. In classical planning and theorem proving, the successor function is given by the problem definition and the heuristic is a hand-designed scalar. Replacing both with prompted LLM calls makes the framework apply to fuzzy natural-language tasks like creative writing, where no formal successor function exists.
Monte Carlo tree search, used inside AlphaGo and AlphaZero, is the obvious next step from ToT. The RAP paper (Hao et al., 2023) makes that step explicit. More recent reasoning systems combine tree search with learned policies and value networks, bringing the framework even closer to the AlphaZero recipe.
The Yao et al. paper has accumulated several thousand citations since its release, with a citation count above 700 by mid-2024 according to Semantic Scholar and continuing to grow. It was selected for an oral presentation at NeurIPS 2023, one of the conference's higher recognition tiers. ToT is cited in surveys of LLM reasoning including A Survey on Large Language Model based Autonomous Agents (Wang et al., 2023), Reasoning with Language Model Prompting: A Survey (Qiao et al., 2023), and Thinking Machines: A Survey of LLM based Reasoning Strategies (2025).
Critiques in subsequent literature focus on three points: the cost-to-benefit ratio for tasks already handled by self-consistency, the fragility of LLM evaluators on out-of-distribution states, and the manual labour of designing thought decompositions. None of these critiques argue against the core idea of inference-time search; they argue against treating ToT as a drop-in replacement for chain-of-thought on every task.