Tree of Thoughts

Updated Jun 21, 2026

Tree of Thoughts (ToT) is a prompting and inference-time search framework for large language models that lets the model explore multiple intermediate reasoning steps as nodes in a tree, evaluate each candidate, and backtrack when a path looks unpromising.^[1] In its headline result, ToT raised GPT-4's success rate on the Game of 24 arithmetic puzzle from 4% with chain-of-thought prompting to 74%.^[1] The framework was introduced in May 2023 by Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan in the paper Tree of Thoughts: Deliberate Problem Solving with Large Language Models (arXiv:2305.10601), with affiliations to Princeton University and Google DeepMind.^[1] The paper was accepted as an oral presentation at NeurIPS 2023.^[2] ToT generalises chain-of-thought prompting by replacing the single linear reasoning trajectory with a structured search over partial solutions, where the language model itself acts both as a thought generator and as a value function.^[1]

The authors frame the motivation directly: "Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role."^[1]

A second paper with a similar title, Large Language Model Guided Tree-of-Thought by Jieyi Long (arXiv:2305.08291), appeared two days earlier on 15 May 2023 and proposes a related but distinct multi-module framework.^[4] The two works are independent and are sometimes confused in secondary sources. This article focuses on the Yao et al. version, which is the more widely cited and the one usually meant by "ToT" in the prompting literature.

Facts

Field	Value
Paper title	Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Authors	Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan
Affiliations	Princeton University, Google DeepMind
arXiv ID	2305.10601
First posted	17 May 2023 (v1); revised 3 December 2023 (v2)
Conference	NeurIPS 2023 (oral)
Code repository	github.com/princeton-nlp/tree-of-thought-llm
License	MIT
Headline result	Game of 24: 74% (ToT, b=5) vs 4% (GPT-4 chain-of-thought)
Related disambiguation paper	Long, Large Language Model Guided Tree-of-Thought, arXiv:2305.08291, 15 May 2023

What problem does Tree of Thoughts solve?

By late 2022, chain-of-thought prompting introduced by Wei et al. (Google Brain, 2022) had established that asking a large language model to produce intermediate reasoning steps before a final answer improves performance on arithmetic, commonsense, and symbolic tasks.^[5] Self-consistency, proposed by Wang et al. in 2022, extended chain-of-thought by sampling multiple reasoning paths and taking a majority vote on the final answer.^[6] Both techniques generate text strictly left to right and never revise an earlier step once it has been written.

This left-to-right structure has known limits. A single bad token early in the chain can derail the rest of the trajectory. Sampling many independent chains, as in self-consistency, helps when the correct answer is the most common mode of the distribution, but it does not let the model look ahead, compare partial solutions, or recover from a poor start.^[6] Many problems studied in classical AI, such as planning, theorem proving, and combinatorial puzzles, instead use explicit search with backtracking. Yao et al. argued that LLM reasoning could benefit from a similar treatment: maintain a frontier of partial solutions, expand them, score them, and prune.^[1]
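The aggregation step of self-consistency can be sketched in a few lines. This is a minimal illustration, not code from either paper: the function name `majority_vote` and the sample answers are hypothetical. It shows why the method cannot recover from a poor start: only the final answers are compared, never the partial reasoning.

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most common final answer among independently sampled chains."""
    counts = Counter(final_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical final answers extracted from five sampled reasoning chains.
samples = ["18", "24", "24", "21", "24"]
print(majority_vote(samples))
```

If the correct answer is not the most frequent one, no amount of voting helps, which is the gap that explicit search over partial solutions is meant to close.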

The paper draws an analogy to Daniel Kahneman's distinction between System 1 (fast, intuitive) and System 2 (slow, deliberate) cognition. Standard autoregressive decoding and chain-of-thought are framed as System 1 style, while ToT is positioned as a deliberate, System 2 style search procedure built on top of a System 1 generator.^[1]

How does Tree of Thoughts work?

A ToT instance for a given task is defined by four components:^[1]

Thought decomposition. The problem is broken into a sequence of intermediate steps. A "thought" is a coherent partial state in the language of the task. For Game of 24 it is one arithmetic line such as 4 + 9 = 13 (left: 6 8 13). For Creative Writing it is a paragraph plan. For Mini Crosswords it is a single word filled into one across or down clue.
Thought generator. Given the current state, the LLM proposes candidate next thoughts. Two variants are described: sampling k independent and identically distributed (i.i.d.) thoughts from a CoT prompt, and proposing k distinct continuations in a single call with a propose prompt.
State evaluator. The LLM scores each candidate state. The paper uses two patterns: a value prompt that returns a category such as sure / maybe / impossible (used for Game of 24 and Crosswords), and a vote prompt that ranks several candidates against each other (used for Creative Writing). Numerical values are obtained by sampling the value prompt several times and averaging.
Search algorithm. A standard tree search wraps the generator and evaluator. The paper uses BFS for Game of 24 and Creative Writing, retaining the top b states per depth, and uses DFS with a pruning threshold for Mini Crosswords, where the model abandons a branch as soon as the evaluator declares any remaining clue impossible.
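As a rough sketch of how a categorical value prompt can feed a search, the snippet below maps sampled labels to numbers, averages them, and keeps the top b states. The numeric weights in `LABEL_SCORES` and the helper names are assumptions made for illustration; the article does not state the exact mapping used in the paper.

```python
# Hypothetical numeric weights for the three labels (illustrative only).
LABEL_SCORES = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}

def aggregate_value(labels):
    """Average the numeric scores of several sampled evaluator labels."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(LABEL_SCORES[label] for label in labels) / len(labels)

def keep_top_b(scored_states, b):
    """Return the b highest-scoring states; ties keep their input order."""
    ranked = sorted(scored_states, key=lambda pair: pair[1], reverse=True)
    return [state for state, _ in ranked[:b]]

# Three sampled labels for one candidate state.
print(aggregate_value(["sure", "maybe", "impossible"]))
print(keep_top_b([("a", 0.2), ("b", 0.9), ("c", 0.5)], b=2))
```

Sampling the evaluator several times smooths out individual misjudgements, at the price of one extra LLM call per sample.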

The high-level loop is straightforward:

input: problem x, generator G, evaluator V, search S
initial state s0 = x
frontier = {s0}
for depth t = 1 .. T:
    candidates = union of G(s) over s in frontier    # each G(s) returns k thoughts
    scores = V(candidates)
    frontier = S.select(candidates, scores, b)
return best terminal state in frontier

Because the four components are decoupled, ToT works on top of any base LLM without finetuning. All experiments in the original paper use GPT-4 accessed through the OpenAI API, with temperature 0.7 for sampling.^[1]
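The loop above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the reference implementation: `tot_bfs`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions are deterministic stand-ins for what would be LLM calls in a real system.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]

def tot_bfs(
    initial: State,
    generate: Callable[[State], Sequence[State]],
    evaluate: Callable[[State], float],
    steps: int,
    beam_width: int,
) -> State:
    """Breadth-first ToT: expand the frontier, score candidates, keep the top b."""
    frontier: List[State] = [initial]
    for _ in range(steps):
        candidates = [child for state in frontier for child in generate(state)]
        if not candidates:
            break
        # sorted() is stable, so equally scored candidates keep generation order.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return max(frontier, key=evaluate)

# Toy stand-ins for the LLM (assumptions, not the paper's prompts):
# a "thought" appends one number, and the evaluator prefers sums close to 9.
def toy_generate(state: State) -> List[State]:
    return [state + (d,) for d in (1, 3, 5)]

def toy_evaluate(state: State) -> float:
    return -abs(9 - sum(state))

best = tot_bfs((), toy_generate, toy_evaluate, steps=3, beam_width=2)
print(best, sum(best))
```

Swapping the toy functions for prompted model calls leaves `tot_bfs` unchanged, which is the decoupling the four-component design aims for.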

Hyperparameters

Hyperparameter	Symbol	Role	Typical setting in paper
Tree depth	`T`	Number of thought steps before a terminal state	3 (Game of 24), 2 (Creative Writing: plan, then passage), variable up to 100 expansions (Crosswords)
Children per node	`k`	Candidates generated from each state	Variable, from one propose prompt per state (Game of 24); 5 plans then 5 passages (Creative Writing)
Beam width	`b`	States kept per depth in BFS	1 or 5 (Game of 24), 1 (Creative Writing)
Value samples		Number of evaluator calls per candidate, averaged	3 (Game of 24)
Search type		BFS or DFS	BFS for Game of 24 and Creative Writing, DFS for Crosswords
Evaluator style		Value prompt or vote prompt	Value for Game of 24 and Crosswords, vote for Creative Writing

The combination of k and b controls the cost and breadth of the search. Larger b means more candidates survive each level and more LLM calls per task, with diminishing returns once the correct path is reliably inside the top-b set.
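A back-of-the-envelope estimate makes the cost growth concrete. The function below is a rough upper bound under simplifying assumptions that are not taken from the paper: the first level expands one state, later levels expand exactly b states, every state yields k candidates, and every candidate is scored `value_samples` times. The name `estimate_bfs_calls` and the `batched_proposals` flag are hypothetical.

```python
def estimate_bfs_calls(depth, k, b, value_samples=1, batched_proposals=False):
    """Rough upper bound on LLM calls for a BFS-style ToT run.

    batched_proposals=True models a propose prompt that returns all k
    candidates for a state in one call; False models k separate samples.
    Real runs can be cheaper (duplicate states, early termination).
    """
    total = 0
    frontier = 1  # only the initial state is expanded at the first level
    for _ in range(depth):
        generation = frontier if batched_proposals else frontier * k
        evaluation = frontier * k * value_samples
        total += generation + evaluation
        frontier = b
    return total

narrow = estimate_bfs_calls(depth=3, k=5, b=1, value_samples=3, batched_proposals=True)
wide = estimate_bfs_calls(depth=3, k=5, b=5, value_samples=3, batched_proposals=True)
print(narrow, wide)
```

Under these assumptions, widening the beam from 1 to 5 more than triples the call count, and most of the spend goes to evaluation rather than generation.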

What results did the original paper report?

The paper evaluates ToT on three tasks chosen because they require explicit planning or backtracking and because vanilla GPT-4 performs poorly on them.^[1]

Game of 24

Game of 24 is a classic arithmetic puzzle: given four numbers, combine them with +, -, *, / (each number used exactly once) to make 24. Yao et al. evaluate on 100 puzzles drawn from 4nums.com, ranked by difficulty.^[1] Each ToT step proposes one arithmetic operation, leaving fewer numbers and updating the running expression. The evaluator labels remaining-number sets as sure (a route to 24 looks reachable), maybe, or impossible.
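To make the state space concrete, here is a small symbolic sketch of one thought step and of an exhaustive reachability check. It is not the LLM evaluator from the paper; it is the ground truth that such an evaluator tries to approximate. The function names `successors` and `can_reach_24` are hypothetical, and exact fractions are used so that division does not introduce rounding errors.

```python
from fractions import Fraction
from itertools import combinations

def successors(numbers):
    """Yield every list reachable by combining two numbers with + - * /."""
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for idx, n in enumerate(numbers) if idx not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            yield rest + [r]

def can_reach_24(numbers):
    """Exhaustively check whether the remaining numbers can still make 24."""
    numbers = [Fraction(n) for n in numbers]
    if len(numbers) == 1:
        return numbers[0] == 24
    return any(can_reach_24(nxt) for nxt in successors(numbers))

print(can_reach_24([4, 9, 10, 13]))  # (10 - 4) * (13 - 9) = 24
print(can_reach_24([1, 1, 1, 1]))
```

Each call to `successors` corresponds to one ToT thought: two numbers are consumed and one result is left, so a four-number puzzle is solved in three steps.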

Method	Game of 24 success rate
Standard input-output (IO) prompt	7.3%
Chain-of-thought (CoT)	4.0%
CoT with self-consistency, k=100	9.0%
IO best-of-100 (oracle pick)	33%
CoT best-of-100 (oracle pick)	49%
ToT (b=1)	45%
ToT (b=5)	74%

The headline figure that gets quoted most often, that ToT reaches 74% on Game of 24 versus 4% for vanilla CoT, comes from this table.^[1] In the words of the paper, "in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%."^[1] The CoT-with-self-consistency baseline at 9% is the most direct apples-to-apples comparison since it uses similar total compute to a small ToT search.

Creative Writing

The Creative Writing task asks the model to produce a four-paragraph passage where each paragraph must end with a given sentence. The four ending sentences are sampled randomly so they do not naturally fit together, which forces the model to plan a story arc. ToT first generates 5 candidate plans with k=5, runs a vote step in which the LLM compares plans, then conditions on the winning plan to write the passage and votes again across 5 generated passages.^[1]
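The vote step can be sketched as a simple tally. This is an illustrative sketch only: `select_by_vote` and `toy_voter` are hypothetical names, the default of five votes is an assumption, and the toy voter is a deterministic stand-in for an LLM vote prompt that would return the index of its preferred candidate.

```python
from collections import Counter

def select_by_vote(candidates, voter, n_votes=5):
    """Ask the voter n_votes times for the index of the best candidate
    and return the candidate that collects the most votes."""
    tally = Counter(voter(candidates) for _ in range(n_votes))
    winner_index, _ = tally.most_common(1)[0]
    return candidates[winner_index]

# Hypothetical stand-in for an LLM vote prompt: prefers the longest plan.
def toy_voter(candidates):
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

plans = ["short plan", "a much more detailed plan", "medium plan"]
print(select_by_vote(plans, toy_voter))
```

The same helper can be applied twice, once over plans and once over passages, mirroring the two voting rounds described above.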

GPT-4 evaluates outputs on a 1-to-10 coherency scale, and human judges perform pairwise preference comparisons on 100 passages. ToT outputs receive a higher average GPT-4 coherency score than the IO and CoT baselines, and human raters prefer ToT over CoT in 41 of 100 pairs and CoT over ToT in 21 (the remaining 38 are ties).^[1] The average scores are reported in Figure 5 of the paper.^[1]

Mini Crosswords

Mini Crosswords are 5x5 puzzles drawn from GooBix, with 5 across and 5 down clues. The task is harder than Game of 24 because the search depth is around 10 and intermediate words must be mutually consistent. ToT uses DFS and treats each thought as filling one clue. The evaluator labels each remaining clue, given the partial board, as sure / maybe / impossible, and the search prunes branches as soon as any remaining clue is judged impossible.^[1]
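The DFS-with-pruning pattern can be sketched on a tiny 2x2 grid. Everything here is an assumption for illustration: the word lists, the helper names, and the prefix test that stands in for the LLM evaluator's "impossible" verdict. The step budget mirrors the idea of capping the number of expansions.

```python
ROW_WORDS = ["it", "at", "no", "on"]  # hypothetical candidate fills for rows
COL_WORDS = {"an", "to"}              # hypothetical valid column words

def columns(rows):
    return ["".join(col) for col in zip(*rows)]

def expand(rows):
    return [] if len(rows) == 2 else [rows + (w,) for w in ROW_WORDS]

def is_goal(rows):
    return len(rows) == 2 and all(c in COL_WORDS for c in columns(rows))

def is_hopeless(rows):
    """Stand-in evaluator: a column is impossible if no word has its prefix."""
    return any(not any(w.startswith(c) for w in COL_WORDS) for c in columns(rows))

def tot_dfs(state, budget):
    """Depth-first search that prunes hopeless states and backtracks.

    budget is a one-element list used as a shared, mutable step counter.
    """
    if is_goal(state):
        return state
    if budget[0] <= 0 or is_hopeless(state):
        return None
    budget[0] -= 1
    for child in expand(state):
        found = tot_dfs(child, budget)
        if found is not None:
            return found
    return None  # every child failed, so control returns to the parent

print(tot_dfs((), [100]))
```

The first row candidate is pruned immediately because no column word starts with "i", so the search moves on without exploring anything beneath it.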

Metric	IO	CoT	ToT (DFS)
Letter-level success	38.7%	40.6%	78%
Word-level success	14%	15.6%	60%
Game-level success (whole puzzle)	0%	1%	20% (4 of 20 puzzles)

The gap between word-level and game-level success rates illustrates a recurring observation about ToT. The search greatly improves local correctness, but solving a whole 5x5 puzzle still demands a sequence of correct decisions that the evaluator does not always detect.
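The three granularities can be illustrated with a small scoring helper. This is a sketch of one plausible way to compute them for a single puzzle, not the paper's evaluation script; the name `crossword_metrics` is hypothetical, and a 2x2 board is used in the example to keep it readable.

```python
def crossword_metrics(predicted, gold):
    """Compare two square boards given as lists of row strings.

    Returns (letter_accuracy, word_accuracy, solved), where the words
    are all rows followed by all columns.
    """
    size = len(gold)
    correct_letters = sum(
        p == g
        for pred_row, gold_row in zip(predicted, gold)
        for p, g in zip(pred_row, gold_row)
    )

    def words(board):
        return list(board) + ["".join(col) for col in zip(*board)]

    correct_words = sum(p == g for p, g in zip(words(predicted), words(gold)))
    solved = list(predicted) == list(gold)
    return correct_letters / (size * size), correct_words / (2 * size), solved

# One wrong letter spoils two words and the whole puzzle.
print(crossword_metrics(["at", "nx"], ["at", "no"]))
```

A single wrong letter breaks both the row and the column that cross it, which is why game-level success trails so far behind letter-level and word-level success.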

What are the strengths of Tree of Thoughts?

ToT inherits the modularity of prompt engineering: the search is wrapped around a frozen LLM, so there is no training, no gradient access, and no need for special tokens.^[1] The same recipe transfers to any base model and any task as long as a thought decomposition and an evaluator can be specified in natural language.

The explicit search lets the model do something autoregressive decoding cannot: discard a partial solution and try a different one. On Game of 24 the dominant failure mode of CoT is committing to a first operation that makes 24 unreachable. ToT with even a small beam dodges this failure because losing branches are pruned before the model finishes the chain.^[1] On Crosswords, DFS lets the model abandon a clue assignment that conflicts with later clues, which a left-to-right decoder cannot do without a full restart.^[1]

Because the value function is itself an LLM call, the framework is general. Any task where humans can describe what "making progress" looks like in natural language can in principle be wrapped in ToT.

What are the weaknesses and limitations?

The biggest cost is API spend and latency. A single Game of 24 puzzle with b=5 and three steps requires roughly 100 GPT-4 calls counting candidate generation and value scoring, about two orders of magnitude more than a single CoT chain.^[1] The paper reports total token costs to make this trade-off explicit.^[1] For tasks where vanilla CoT is already strong, the marginal accuracy gain rarely justifies the extra cost.

The second cost is task-specific design. The thought decomposition, the prompt format for the generator, the evaluator schema, and the search algorithm all have to be chosen by hand for each new task. There is no general-purpose ToT runner that takes an arbitrary problem and decides how to split it. Several follow-ups, including buffer-of-thoughts approaches, partly attack this limitation.^[9]

A third issue is that the evaluator is not always reliable. The value prompt asks the same model that generated the candidate to score it, and the model can be confidently wrong on both ends. On Mini Crosswords the paper notes that the search occasionally rules out branches that contained the correct word, simply because the evaluator marked an unrelated clue as impossible.^[1]

Independent replications have produced mixed results. Some implementations report numbers close to the paper's, while others report smaller gaps or much higher cost in practice. Sensitivity to prompt wording and to the specific GPT-4 snapshot date is a known concern with all GPT-4-only results from 2023.

How does Tree of Thoughts differ from chain-of-thought and self-consistency?

Property	IO prompt	Chain-of-thought	Self-consistency	Tree of Thoughts
Intermediate steps	None	Linear chain	Many parallel chains	Tree
Looks ahead	No	No	No	Yes (via evaluator)
Backtracks	No	No	No	Yes
Cost (LLM calls per task)	1	1	k (e.g. 40)	roughly b * k * T generations, plus evaluator calls
Aggregation	None	None	Majority vote on final answer	Search over partial states
Requires task design	Minimal	Few-shot exemplars	Few-shot exemplars	Decomposition + evaluator + search
Reported Game of 24 success	7.3%	4.0%	9.0%	74% (b=5)

Variants and follow-up work

The ToT framing has spawned a sizable cluster of follow-up papers that swap the tree for other structures or change the search algorithm.

Graph of Thoughts (GoT) by Maciej Besta et al. (arXiv:2308.09687, August 2023) generalises the tree to a directed graph where thoughts can be merged, refined, or aggregated. The paper reports a 62% quality improvement on a sorting task over ToT and a more than 31% reduction in cost.^[7]
Algorithm of Thoughts (AoT) by Bilal Sel et al. (arXiv:2308.10379, August 2023) folds the search inside a single in-context example, embedding algorithmic exemplars in the prompt rather than running an external controller.^[8]
Buffer of Thoughts (BoT) by Ling Yang et al. (arXiv:2406.04271, June 2024, NeurIPS 2024 Spotlight) maintains a meta-buffer of reusable thought templates distilled from past problems, allowing reasoning to amortise across tasks instead of restarting per query.^[9]
Reasoning via Planning (RAP) by Shibo Hao et al. (arXiv:2305.14992, May 2023, EMNLP 2023) replaces ToT's heuristic search with Monte Carlo tree search and treats the LLM as both an agent and a world model.^[10]
Forest of Thoughts and various ToT-X variants have been proposed in 2024 and 2025 for code generation, mathematical theorem proving, and embodied planning, typically by ensembling multiple parallel ToT runs.
ReAct, also by Shunyu Yao and collaborators, is a separate but conceptually adjacent line of work that interleaves reasoning with tool calls and external actions rather than running a pure search inside the model.^[11]

These variants share the assumption that inference-time search over LLM-generated thoughts is a useful abstraction, even when the specific data structure or search policy differs.

Disambiguation: the Long paper

A paper titled Large Language Model Guided Tree-of-Thought by Jieyi Long (arXiv:2305.08291) appeared on 15 May 2023, two days before the Yao et al. preprint.^[4] Long's framework introduces a separate prompter agent, a checker module, a memory module, and a ToT controller that coordinates them across multi-round conversations, demonstrated on Sudoku.^[4] It uses the same name, was developed independently, and is not a subset or extension of the Yao et al. work. The two papers cite different motivations and use different evaluation tasks. When the prompting community refers to "ToT", they almost always mean the Yao et al. version, which is the one accepted at NeurIPS 2023 and the one whose code repository is hosted under princeton-nlp.^[2]^[3]

How is Tree of Thoughts implemented in practice?

The official implementation lives at github.com/princeton-nlp/tree-of-thought-llm and contains the prompts, model outputs, and runner scripts for all three benchmark tasks.^[3] The repository is MIT licensed and depends on the OpenAI Python client.^[3]

Third-party libraries have added ToT-style helpers:

LangChain ships an experimental ToTChain module that wraps a generator, a checker, and a controller around any chat model, modelled on the Long paper's vocabulary rather than the Yao paper's.
LlamaIndex provides a tree-of-thoughts query engine pattern in its examples directory, mainly for retrieval-augmented question answering with structured intermediate steps.
Several open-source agent frameworks expose ToT as a planning option alongside chain-of-thought and ReAct, including AutoGen and DSPy ports.

In practice, production systems usually use a stripped-down ToT: a small b, one to three steps, and a hand-tuned value prompt rather than the full BFS or DFS described in the paper. The decision is dominated by API cost.

Influence on inference-time compute

ToT is one of the early entries in the broader trend of trading more inference compute for better answers, sometimes called test-time or inference-time compute. The same trend includes self-consistency, best-of-N sampling, process reward models, OpenAI's o1 and o3 reasoning models released in 2024 and 2025, and DeepSeek-R1 in 2025. Reasoning models internalise some of what ToT does externally: they generate long internal traces, evaluate partial work, and sometimes backtrack inside a single decoded sequence rather than via an external controller.

The ToT paper does not claim a fundamental ceiling on what scaled-up search can achieve. Work on learned verifiers and process reward models (Cobbe et al., Lightman et al., Uesato et al.), some of which predates ToT, has shown that a trained scorer can take the place of a prompted value function, often improving accuracy and reducing cost.^[15]^[16] Combining LLM-generated candidates with learned verifiers fits naturally into the ToT formulation.

Relation to classical search

The vocabulary of ToT is borrowed almost verbatim from classical AI: state, successor, value, BFS, DFS, beam, pruning. The novelty is not the search but the use of an LLM as both the successor function and the heuristic.^[1] In classical planning and theorem proving, the successor function is given by the problem definition and the heuristic is a hand-designed scalar. Replacing both with prompted LLM calls makes the framework apply to fuzzy natural-language tasks like creative writing, where no formal successor function exists.

Monte Carlo tree search, used inside AlphaGo and AlphaZero, is the obvious next step from ToT. The RAP paper (Hao et al., 2023) makes that step explicit.^[10] More recent reasoning systems combine tree search with learned policies and value networks, bringing the framework even closer to the AlphaZero recipe.
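For readers unfamiliar with how Monte Carlo tree search chooses which branch to explore, the widely used UCT selection rule is sketched below. This is the generic textbook rule, not necessarily the exact variant used in RAP or AlphaZero; the function names and the default exploration constant are assumptions.

```python
import math

def uct_score(mean_value, child_visits, parent_visits, c=math.sqrt(2)):
    """UCT score: average value (exploitation) plus a visit-count bonus (exploration)."""
    if child_visits == 0:
        return math.inf  # unvisited children are always tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)

def select_child(children, parent_visits):
    """children is a list of (name, mean_value, visits); return the best name."""
    return max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits))[0]

# A rarely visited branch can outrank a well-explored one with a higher mean.
print(select_child([("a", 0.6, 90), ("b", 0.5, 10)], parent_visits=100))
```

Unlike the fixed top-b pruning of BFS-style ToT, this rule keeps revisiting under-explored branches, so a branch that was scored poorly once is not discarded for good.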

Reception and citations

The Yao et al. paper passed 700 citations by mid-2024 according to Semantic Scholar and has since accumulated several thousand. It was selected for an oral presentation at NeurIPS 2023, one of the conference's higher recognition tiers.^[2] ToT is cited in surveys of LLM reasoning including A Survey on Large Language Model based Autonomous Agents (Wang et al., 2023),^[13] Reasoning with Language Model Prompting: A Survey (Qiao et al., 2023),^[12] and Thinking Machines: A Survey of LLM based Reasoning Strategies (2025).^[14]

Critiques in subsequent literature focus on three points: the cost-to-benefit ratio for tasks already handled by self-consistency, the fragility of LLM evaluators on out-of-distribution states, and the manual labour of designing thought decompositions. None of these critiques argue against the core idea of inference-time search; they argue against treating ToT as a drop-in replacement for chain-of-thought on every task.

References

1. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). *Tree of Thoughts: Deliberate Problem Solving with Large Language Models*. arXiv:2305.10601. First posted 17 May 2023, revised 3 December 2023. https://arxiv.org/abs/2305.10601
2. Yao, S., et al. (2023). *Tree of Thoughts: Deliberate Problem Solving with Large Language Models*. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), oral presentation. https://papers.nips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html
3. Princeton NLP Group. *tree-of-thought-llm*. GitHub repository, MIT license. https://github.com/princeton-nlp/tree-of-thought-llm
4. Long, J. (2023). *Large Language Model Guided Tree-of-Thought*. arXiv:2305.08291. Posted 15 May 2023. https://arxiv.org/abs/2305.08291
5. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*. arXiv:2201.11903. NeurIPS 2022.
6. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). *Self-Consistency Improves Chain of Thought Reasoning in Language Models*. arXiv:2203.11171. ICLR 2023.
7. Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., & Hoefler, T. (2023). *Graph of Thoughts: Solving Elaborate Problems with Large Language Models*. arXiv:2308.09687. Posted 18 August 2023.
8. Sel, B., Al-Tawaha, A., Khattar, V., Wang, L., Jia, R., & Jin, M. (2023). *Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models*. arXiv:2308.10379.
9. Yang, L., Yu, Z., Zhang, T., Cao, S., Xu, M., Zhang, W., Gonzalez, J. E., & Cui, B. (2024). *Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models*. arXiv:2406.04271. NeurIPS 2024 Spotlight.
10. Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D. Z., & Hu, Z. (2023). *Reasoning with Language Model is Planning with World Model*. arXiv:2305.14992. EMNLP 2023.
11. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). *ReAct: Synergizing Reasoning and Acting in Language Models*. arXiv:2210.03629. ICLR 2023.
12. Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., & Chen, H. (2023). *Reasoning with Language Model Prompting: A Survey*. arXiv:2212.09597. ACL 2023.
13. Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., & Wen, J. (2023). *A Survey on Large Language Model based Autonomous Agents*. arXiv:2308.11432.
14. *Thinking Machines: A Survey of LLM based Reasoning Strategies* (2025). arXiv:2503.10814.
15. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). *Training Verifiers to Solve Math Word Problems*. arXiv:2110.14168.
16. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). *Let's Verify Step by Step*. arXiv:2305.20050.
