Graph of Thoughts
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,030 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,030 words
Add missing citations, update stale details, or suggest a clearer explanation.
Graph of Thoughts (GoT) is a prompting and reasoning framework that models the intermediate steps of a large language model as an arbitrary directed graph rather than a linear chain or a tree. Each vertex is an "LLM thought" (a self-contained intermediate solution or piece of information) and each directed edge is a transformation that derives one thought from one or more others, so that aggregation, refinement, and feedback loops become first-class primitives.[1] The framework was introduced in the paper "Graph of Thoughts: Solving Elaborate Problems with Large Language Models" by Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michał Podstawski, Hubert Niewiadomski, Piotr Nyczyk and Torsten Hoefler, posted on arXiv on 18 August 2023 (arXiv:2308.09687) and published in the Proceedings of the AAAI Conference on Artificial Intelligence at AAAI-24 (Vol. 38, No. 16, pages 17682 to 17690).[1][2] The work was carried out primarily at the Scalable Parallel Computing Laboratory (SPCL) at ETH Zürich, with collaborators from Cledar (Warsaw) and Warsaw University of Technology, and an official Python reference implementation is maintained at the spcl/graph-of-thoughts GitHub repository.[3][4]
| Field | Value |
|---|---|
| Name | Graph of Thoughts (GoT) |
| Type | Prompting / reasoning framework for LLMs |
| First posted | 2023-08-18 (arXiv:2308.09687) |
| Conference | AAAI-24, pages 17682-17690 (Vol. 38, No. 16) |
| DOI | 10.1609/aaai.v38i16.29720 |
| Lead authors | Maciej Besta, Nils Blach (equal contribution) |
| Senior author | Torsten Hoefler |
| Affiliations | ETH Zürich (SPCL), Cledar, Warsaw University of Technology |
| Reference implementation | spcl/graph-of-thoughts on GitHub, Python package graph_of_thoughts |
| Models in original evaluation | GPT-3.5 (primary), GPT-4 (limited), Llama 2 |
| Reported headline result | ~62% lower median sorting error vs ToT; >31% cost reduction (128-element sorting) |
| Source | [1][2][4][7] |
By the time GoT was proposed, structured prompting had moved through several stages. The standard input-output (IO) prompt provided a single input and asked for a single output. Chain-of-Thought (CoT) prompting and Self-consistency (CoT-SC) extended this by asking the model to produce an explicit sequence of intermediate steps, turning the reasoning trace into a linear chain or a set of independently sampled chains.[5] Tree of Thoughts (ToT) generalised the chain into a tree, allowing the model to branch into multiple candidate continuations, score them, and backtrack, which improved performance on search-style tasks at the cost of additional model calls.[1][5]
Besta and colleagues observed that human problem solving is rarely captured by either a chain or a tree. People routinely combine partial results from independent lines of thought (aggregation), revisit and patch earlier intermediate conclusions (refinement) and form feedback loops where a later thought informs an earlier one.[1] These patterns naturally form a directed graph, possibly with cycles, in which an intermediate state may have several parents. Existing chain and tree topologies cannot represent such "merging" or "looping" structures, because each thought in a chain or a tree has exactly one parent.[1] GoT was designed as the smallest framework that closes this gap: it lets a thought have any number of incoming edges, so partial solutions from independent subproblems can be merged, and it lets edges form self-loops or cycles, so a thought can be iteratively improved.[1]
The authors also tied the design to a cost argument. In a more comprehensive follow-up, "Topologies of Reasoning: Demystifying Chains, Trees, and Graphs of Thoughts" (arXiv:2401.14295), the same group situates GoT inside a taxonomy of reasoning topologies and analyses how the shape of the reasoning graph affects latency, volume of information, and token cost across chain, tree, and graph schemes.[6]
The paper defines a reasoning process as a tuple (G, T, E, R).[1] G = (V, E) is a directed graph whose vertices V are LLM thoughts, with V possibly heterogeneous (for example, in a writing task some vertices are plans and others are paragraphs).[1] Edges (u, v) in E indicate that the content of v has been constructed using thought u as part of its prompt. T is the set of thought transformations that the framework can apply to the current graph. E is an evaluator function that assigns a numeric score to a thought, possibly using the whole graph as context. R is a ranking function that selects one or more thoughts from the current graph based on those scores.[1]
A transformation t in T rewrites the graph: V' = (V union V_plus) minus V_minus and E' = (E union E_plus) minus E_minus, where V_plus and E_plus are newly added vertices and edges, and V_minus, E_minus are removed ones.[7] Three graph-enabled transformations are highlighted:[1][7]
To analyse cost, the paper introduces the volume of a thought, defined as the number of preceding LLM thoughts from which the target thought is reachable by a directed path.[7] For a fixed total budget of Θ(N) thoughts, the paper compares topologies as follows: CoT has latency N and volume N; CoT-SC partitions the budget into k independent chains, so latency is N/k and volume is N/k; ToT achieves logarithmic latency log_k N but only logarithmic volume log_k N; GoT, by using aggregation, can achieve logarithmic latency log_k N while still letting the final thought depend on all N preceding thoughts, giving volume N.[7] This is presented as the key theoretical advantage of GoT: aggregation lets a low-latency reasoning graph still incorporate a large amount of upstream information.
In addition to the abstract transformations, the paper separates two graphs.[1][7] The Graph of Operations (GoO) is a static directed graph specified by the user. Its vertices are operations to execute (Generate, Aggregate, Score, KeepBestN and so on), and its edges describe data flow between them. The Graph Reasoning State (GRS) is the dynamic graph built up at runtime: as the controller traverses the GoO, it materialises actual thought vertices and the dependencies among them, plus their scores and validity flags.[1][7]
The reference implementation in spcl/graph-of-thoughts exposes a concrete set of operations that instantiate the abstract transformations.[4] The operations module documents the following:[8]
These operations communicate with the model through two helper classes, the Prompter, which formats the current GRS into a prompt for the LLM, and the Parser, which extracts structured information from the LLM reply.[4][8] A separate Controller orchestrates execution: it walks the GoO, invokes operations in dependency order, updates the GRS, dispatches calls to the configured language model, and writes the resulting GRS to a JSON file for later inspection.[4]
The spcl/graph-of-thoughts package is implemented in Python and can be installed from PyPI with pip install graph_of_thoughts, or from source in editable mode.[4] Its main modules mirror the abstractions of the paper:[4]
language_models: thin wrappers around chat-completion endpoints (for example, the OpenAI API for GPT-3.5 and GPT-4) plus configuration of model, temperature, and rate limits.prompter: per-task subclasses that turn parts of the GRS into natural-language prompts.parser: per-task subclasses that decode model replies into structured thoughts and validity information.operations: the operation classes listed above, plus the GraphOfOperations class used to build a GoO programmatically.controller: ties everything together and runs the GoO end to end.The repository ships example implementations for sorting, set intersection, keyword counting, and document merging, each containing prompter, parser, scoring logic, and a configurable Graph of Operations.[4] The repository's citation block lists the AAAI 2024 publication with DOI 10.1609/aaai.v38i16.29720 as the canonical reference.[4]
A typical end-to-end workflow with the reference framework runs in five steps: the developer subclasses Prompter and Parser for the task, defines a scoring function, builds a GraphOfOperations by appending operations such as Generate, Aggregate, Score, KeepBestN, Improve, and GroundTruth, instantiates a language model wrapper, and finally instantiates a Controller and calls its run() method, which serialises both the resulting GRS and a JSON log of all LLM calls so that costs and intermediate thoughts can be audited after the fact.[4][8] An independent walk-through of the framework, applied to a project-portfolio optimisation problem, shows the same control loop: Controller drives Prompter, LLM produces thoughts, Parser decodes them into the GRS, scoring and validation are applied, and the loop continues until the GoO is exhausted.[14]
The paper evaluates GoT against four baselines, IO prompting, CoT, CoT-SC, and ToT, on four tasks that were designed to exercise aggregation and refinement.[1][7] Most experiments use GPT-3.5 as the language model due to budget considerations, with limited additional experiments on GPT-4 and on Llama 2; in the paper's Llama 2 runs, results are described as "usually worse than GPT-3.5".[7] Each task is evaluated across 100 input samples per baseline, with temperature set to 1.0 and a 4k-token context (16k for document merging).[7]
The sorting task asks the model to sort a list of numbers of length 32, 64, or 128. The scoring function checks that the output is monotonically ordered and that the multiset of numbers is preserved, in order to penalise hallucinated or dropped values.[1][7] The GoO for sorting decomposes the list into chunks, sorts each chunk in parallel via Generate operations, then merges adjacent sorted chunks pairwise using Aggregate operations until a single sorted list remains, with Score and KeepBestN steps in between.[1] On the 128-element setting, GoT reduces the median number of errors by roughly 62% compared with ToT while simultaneously cutting cost by more than 31%, the headline figures reported in the abstract.[1][2]
For set intersection, the input is two sets of 32, 64, or 128 elements, and the model has to output their intersection. GoT splits the second set into subsets, asks the LLM to intersect each subset with the first set, and then aggregates the partial intersections; the result is then validated against the original sets.[7] GoT consistently exceeds the baselines, especially as set size grows, where IO and CoT degrade quickly because the answer must fit in a single chain.[7]
In keyword counting, the input is a passage and the model has to output the number of occurrences of each country mentioned. The GoO splits the passage, asks the LLM to count keywords per chunk, and then aggregates the per-chunk dictionaries into a single tally.[7] In the paper's experiments, GoT solved 25 of the test instances correctly while CoT solved 1 and ToT solved 0 in the same configuration, illustrating how aggregation of per-chunk counts beats both a single linear pass and a tree search.[7]
The document merging task takes several non-disclosure agreements (NDAs) that partially overlap in content and asks the model to produce a single NDA that minimises duplication while maximising information retained.[1][7] Quality is evaluated using the harmonic mean of two LLM-as-judge scores: one for redundancy and one for information retention. GoT achieves a harmonic-mean score around 8.5 out of 10 on 50 evaluated samples, outperforming the chain-based and tree-based baselines, especially as the number of input documents grows.[7]
Across all four tasks the paper reports the same overall pattern: GoT improves answer quality relative to IO, CoT, CoT-SC, and ToT, and on the merge-heavy tasks it does so while reducing total LLM cost compared with ToT, because aggregation cuts the number of redundant exploratory branches that ToT would have to score.[1][7]
The benchmarking methodology fixes several hyperparameters across baselines to make the comparison fair. All schemes use the same underlying chat-completion endpoint, the same temperature (1.0), the same 4k-token context window (16k tokens for document merging because of input length), the same number of input samples (100 per task and baseline, 50 for document merging), and a problem-specific scoring function that is held constant when switching between IO, CoT, CoT-SC, ToT and GoT.[1][7] Cost is measured in dollars based on OpenAI's billed token usage, so reductions in cost translate directly to reductions in the number of model calls and total prompt-plus-completion tokens.[1][4]
The framework is most easily understood in contrast with its predecessors. The table summarises the key differences.
| Property | IO prompting | Chain-of-Thought | Self-consistency (CoT-SC) | Tree of Thoughts | Graph of Thoughts |
|---|---|---|---|---|---|
| Topology of reasoning | single node | chain | k independent chains | tree | arbitrary directed graph, possibly with cycles |
| Aggregation of multiple parents | no | no | majority vote on final answer only | no | yes, via Aggregate operation |
| Refinement / feedback loops | no | no | no | no | yes, via Improve / ValidateAndImprove |
| Backtracking | no | no | no | yes | yes |
| Latency at Θ(N) thoughts | 1 | N | N / k | log_k N | log_k N |
| Volume of final thought | 1 | N | N / k | log_k N | N |
| Reference | [5] | [5] | [5] | [5] | [1][7] |
The salient point is the last two rows. ToT brought logarithmic latency by introducing branching, but each leaf in a tree can only "see" the thoughts along its root-to-leaf path, which has logarithmic depth. GoT recovers the linear volume of CoT by adding aggregation edges into the same tree, so the final thought can depend on every other thought generated during reasoning without having to be sequentially derived from them.[1][7]
A practitioner-oriented summary published in 2024 frames GoT as one element of a family of "graph-based prompting" methods and stresses that GoT subsumes both CoT and ToT as special cases of a directed graph (a path graph and a rooted tree, respectively), so any CoT or ToT pipeline can in principle be expressed as a Graph of Operations.[5]
The case studies in the original paper map onto three broad categories of LLM workloads that benefit from a graph topology. The first is divide-and-conquer over long inputs, exemplified by sorting and keyword counting, where the problem is split into independent chunks, each chunk is solved by the model, and the partial answers are then aggregated. The second is many-to-one synthesis, exemplified by document merging, where several documents must be combined into a single output while controlling redundancy and information loss. The third is exploratory search with refinement, where multiple candidate solutions are scored, refined, and then either combined or pruned, as illustrated by set intersection.[1][7]
Outside the original paper, practitioner guides identify additional families of tasks where GoT-style pipelines are appealing. Complex code-generation flows can benefit from generating multiple candidate functions, scoring them, and aggregating their common skeleton; multi-source question answering can benefit from merging retrieved passages into a synthesised answer; project portfolio optimisation can use Generate-Validate-Score-KeepBestN cycles to enumerate feasible portfolios under constraints and rank them by total value.[14] In all of these cases, the recurring justification is the same: when the natural problem structure has merging or feedback edges, encoding the prompt flow as a graph is a closer match to the problem than a chain or a tree, and the controller can exploit that structure to call the model only as often as the task actually requires.[1][14]
GoT has been incorporated into several surveys and practitioner guides as a canonical example of structured reasoning. The Topologies of Reasoning paper by the same group treats it as a representative graph topology, alongside chains and trees, and uses it to derive a comparative analysis of prompting structures, theoretical underpinnings, and connections to knowledge bases.[6] Independent expositions on agentic-patterns sites and AI knowledge bases describe GoT's core operations (branching, aggregation, refinement, looping) and note its overhead, with one source estimating that graph-of-thoughts pipelines can be five to twenty times more expensive than a single linear CoT call for simple problems, which is why GoT is recommended only for tasks that genuinely require interdependent reasoning.[9][10]
In April 2025 the same SPCL-led team published "Affordable AI Assistants with Knowledge Graph of Thoughts" (arXiv:2504.02670), which extends GoT to maintain a persistent structured knowledge graph during reasoning, sharing the same Graph of Operations machinery with the original framework. The paper is hosted on the SPCL publications page alongside the original GoT manuscript.[11]
Maciej Besta, the lead author, is a researcher at SPCL focusing on sparse graph computations and LLMs, and has received the IEEE TCSC Award for Excellence in Scalable Computing Early Career (2023) and the IEEE TCHPC Early Career Researchers Award (2024).[12] Co-first author Nils Blach, who completed his graduate work at SPCL under Besta and Hoefler, lists GoT as his primary publication on his personal page, and notes that the first two authors of the AAAI paper contributed equally.[13]
The spcl/graph-of-thoughts repository has accumulated thousands of GitHub stars and is widely cited as a reference open-source implementation; the citation block in its README is the canonical pointer used by downstream projects.[4] GoT vocabulary (aggregation, refinement, GoO, GRS) has also been adopted by derivative frameworks. The "Hierarchical Graph of Thoughts" (HGoT) paper, posted on arXiv as 2402.09390, reuses the graph topology idea and applies it to retrieval-augmented in-context learning for factuality evaluation; it explicitly cites the GoT framework as the basis for its hierarchical extension.[15] More broadly, agentic-pattern catalogues and practitioner newsletters describe GoT alongside ReAct, Self-Refine, and Plan-and-Execute as one of the named patterns for orchestrating multi-step LLM reasoning, often packaged via libraries such as LangGraph that materialise a graph of stateful nodes at runtime.[9][10]
Although the GoT framework is more expressive than chains and trees, this expressiveness comes at a cost.
GoT sits in the dense neighbourhood of structured-prompting techniques. Chain-of-Thought established that explicit intermediate steps help, Self-consistency showed that sampling many chains and voting raises accuracy, and Tree of Thoughts showed that branching plus search beats single chains on search-style problems.[5] Adjacent methods include ReAct (interleaving reasoning and tool calls), Self-Refine (iterative self-feedback, conceptually similar to GoT's Improve operation but applied to a single chain), Step-Back Prompting (abstracting before answering), Self-Discover prompting and Auto-CoT (constructing demonstrations automatically). On the structural side, GoT shares vocabulary with Graph Neural Networks, knowledge graphs and the computational graph view of model execution, but it operates entirely at the prompt level: there is no learned weight update and the graph is materialised through repeated calls to a pretrained LLM, drawing on the model's in-context learning capability and counting against test-time compute budgets.
The most direct intellectual sibling is the Topologies of Reasoning paper, which arranges chain, tree, and graph schemes along a taxonomy of "reasoning topologies" and asks which structure is best for which task; GoT plays the role of the most general topology in that taxonomy.[6] A second sibling is the hierarchical HGoT extension, which composes nested GoT graphs to handle factuality evaluation in retrieval-augmented in-context learning, reporting a roughly 7% improvement over competing methods on the FEVER fact-verification benchmark while matching strong baselines on Open-SQuAD and HotPotQA.[15] Together with the original GoT paper and the Knowledge Graph of Thoughts follow-up, these works form a coherent line of "graph-shaped reasoning" research originating from SPCL at ETH Zürich.[6][11][15]