Monte Carlo Tree Search
Last reviewed
Apr 28, 2026
Sources
26 citations
Review status
Source-backed
Revision
v2 ยท 5,893 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 28, 2026
Sources
26 citations
Review status
Source-backed
Revision
v2 ยท 5,893 words
Add missing citations, update stale details, or suggest a clearer explanation.
Monte Carlo Tree Search (MCTS) is a heuristic search algorithm for sequential decision-making problems, most famously used in game-playing artificial intelligence. It builds a search tree incrementally by performing many simulated playthroughs (called rollouts or playouts) of a problem, using random or lightly informed play, and aggregating the results to estimate the value of each candidate action at the root of the tree. MCTS combines the precision of game-tree search with the generality of Monte Carlo simulation, requiring no domain-specific evaluation function in its purest form. The method came to mainstream prominence with DeepMind's AlphaGo, which defeated world Go champion Lee Sedol in March 2016 and demonstrated that MCTS combined with deep neural networks could conquer a domain previously thought to be a decade or more away from human-level computer play.
The algorithm has four phases that repeat thousands or millions of times within a single decision: selection of a path through the existing tree using a tree policy, expansion of the tree by adding a new leaf node, simulation of a rollout from that leaf to a terminal state, and backpropagation of the rollout outcome back up the tree to update visit counts and value estimates. The most influential variant uses the UCB1 bandit formula, originally introduced by Auer, Cesa-Bianchi, and Fischer for the multi-armed bandit problem, applied at every node in the tree. This particular instantiation, called UCT for Upper Confidence Bounds applied to Trees, was introduced by Levente Kocsis and Csaba Szepesvari in 2006 and is the practical foundation of nearly all modern MCTS systems.
MCTS is anytime, meaning it can be stopped at any moment and will return the best action discovered so far, with the quality of the recommendation improving smoothly with additional computation. It grows its tree asymmetrically, devoting more visits to promising subtrees while ignoring obviously bad ones, which is essential for problems with extremely large branching factors such as Go (around 250 legal moves per turn). The algorithm is general purpose: the only domain-specific code required is a description of the legal action set, a transition function, and a terminal reward signal. This generality has made MCTS the backbone of reinforcement learning systems that learn complex environments without hand-engineered evaluation, including AlphaZero for chess, shogi, and Go, MuZero for Atari and board games with a learned dynamics model, AlphaTensor for matrix-multiplication algorithm discovery, AlphaDev for sorting algorithm synthesis, and AlphaProof for formal mathematical theorem proving.
The roots of MCTS reach back to early Monte Carlo evaluation in computer Go and to the broader field of stochastic dynamic programming. The first explicit use of Monte Carlo simulation for Go move evaluation is usually credited to Bernd Brugmann's 1993 program Gobble, which sampled random playouts to estimate the expected value of moves. Gobble was a curiosity at the time because traditional alpha-beta search dominated computer chess, but Go's enormous branching factor and the difficulty of writing a position evaluator left Brugmann's idea looking attractive in retrospect. Through the late 1990s and early 2000s several groups including Bouzy and Helmstetter explored Monte Carlo evaluation more systematically, gradually combining sampling with shallow trees.
The modern formulation of MCTS as a tree search guided by simulation outcomes was introduced by Remi Coulom in 2006. His paper "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search," presented at the Computers and Games conference, described the four-phase algorithm now standard in the field. Coulom's program Crazy Stone, which implemented this approach, won the 9x9 Go event at the 12th ICGA Computer Olympiad in 2006, and Coulom is generally credited as the first to use the term "Monte Carlo tree search" in print.
In the same year Levente Kocsis and Csaba Szepesvari published "Bandit Based Monte-Carlo Planning" at the European Conference on Machine Learning, introducing UCT. Their paper supplied the theoretical justification missing from Coulom's empirical method: by treating the choice of move at every internal node as a multi-armed bandit problem and applying the UCB1 formula of Auer and colleagues, they proved that the value of the root converges to the minimax value as the number of simulations grows, and they derived finite-sample regret bounds. The combination of Coulom's algorithm template with Kocsis and Szepesvari's selection rule gave the field a method that was both practically strong and theoretically grounded.
The years 2006 to 2010 saw a Monte Carlo revolution in computer Go. Sylvain Gelly and David Silver implemented UCT in MoGo, then introduced Rapid Action Value Estimation (RAVE) and heuristic UCT, allowing MoGo to become the first program to reach master level on 9x9 Go and to defeat a professional 5-dan player Guo Juan in 2007. Other strong programs of the era included Fuego (developed at the University of Alberta), Erica, Pachi, and Zen, all built on MCTS. By 2008 it was widely accepted that MCTS had displaced alpha-beta search as the dominant paradigm for Go.
The canonical reference work for the field is Cameron Browne and colleagues' 2012 survey "A Survey of Monte Carlo Tree Search Methods," published in the IEEE Transactions on Computational Intelligence and AI in Games. The 43-page survey catalogued more than two hundred MCTS variants, enhancements, and applications across games, scheduling, planning, and procedural content generation. It remains the standard introduction to the field.
The combination of MCTS with deep neural networks by DeepMind starting in 2014 transformed the algorithm from a Go-specialized technique into a general framework for learned planning. AlphaGo, published by David Silver and colleagues in Nature in January 2016, used a value network and a policy network to guide MCTS, beating European Go champion Fan Hui 5-0 in October 2015 and Lee Sedol 4-1 in March 2016. AlphaGo Zero (2017) removed the need for human game data, learning entirely from self-play with a single network. AlphaZero (2017, full Nature paper 2018) generalized the same recipe to chess and shogi. MuZero (2020) removed the need for a known model of the environment by learning a dynamics function alongside the policy and value networks. The same template has since been adapted for matrix multiplication discovery (AlphaTensor, 2022), assembly-level sorting algorithm synthesis (AlphaDev, 2023), and formal mathematical proof search (AlphaProof, 2024). The Tree of Thoughts framework introduced by Shunyu Yao and colleagues at NeurIPS 2023, and the Reasoning via Planning method of Shibo Hao and colleagues at EMNLP 2023, both apply MCTS-style search to the generation steps of a large language model.
MCTS proceeds by repeating an outer loop, each iteration of which executes four phases on a partially built search tree. The tree is rooted at the current state, and each node stores statistics summarizing the simulations that have passed through it: typically a visit count N(s) and an estimate of the action value Q(s, a) for each action available at the node. After a budget of iterations or a wall-clock time limit, the algorithm reports the action with the highest visit count or highest mean value as the move to play.
| Phase | Description | Typical implementation |
|---|---|---|
| 1. Selection | Starting from the root, recursively choose the child with the highest tree-policy score until a node is reached that has at least one unexpanded action. | UCB1 / UCT in classical MCTS, PUCT with neural network priors in AlphaZero-style systems. |
| 2. Expansion | If the selected node has untried actions and is not terminal, add one or more new child nodes corresponding to those actions. | Add a single new child per iteration (most common), or expand all untried actions at once for vectorized neural network inference. |
| 3. Simulation | From the newly added node, play out a complete trajectory to a terminal state using a default policy. The trajectory is not stored in the tree. | Uniform random rollouts in pure MCTS, hand-crafted heavy playouts in early Go programs, neural network value estimate in AlphaZero (no explicit rollout). |
| 4. Backpropagation | Walk back up from the new node to the root, incrementing the visit count and updating the value estimate of every node along the path with the rollout's outcome. | Add the result to a running sum and increment the count, taking care to invert sign across alternating turns in two-player games. |
After many iterations, the visit counts at the children of the root form an approximate posterior over how good each move is. Most implementations choose the most visited child as the recommended action, both because it tends to be more robust than the highest-mean child and because the visit count itself acts as a useful target for learned policy networks in self-play settings.
The selection step decides how to descend the existing tree. Each node faces an exploration-exploitation trade-off identical to the multi-armed bandit problem: should the next visit go to the action with the best estimated value, exploiting current knowledge, or to a less-tried action that might prove better, exploring? The UCB1 formula due to Auer, Cesa-Bianchi, and Fischer (2002) addresses this trade-off with a closed-form bonus term:
UCB1(s, a) = Q(s, a) + c * sqrt( ln N(s) / N(s, a) )
Here Q(s, a) is the average reward observed from action a in state s, N(s) is the number of times state s has been visited, N(s, a) is the number of times action a has been chosen from s, and c is an exploration constant typically set to sqrt(2) for normalized rewards in the unit interval. The square-root term grows logarithmically with the parent's visit count and shrinks with the action's visit count, so under-explored actions accumulate an exploration bonus that eventually forces them to be tried. UCT applies this rule at every internal node of the tree.
Kocsis and Szepesvari proved that UCT is consistent: as the number of simulations goes to infinity, the value estimated at the root converges to the true minimax value of the game, provided rewards are bounded. They also derived bounds on the regret incurred along the way. These guarantees explained why UCT worked in practice and made it the default selection rule for MCTS implementations.
When selection reaches a node with at least one action that has never been tried, expansion adds a new node to the tree. The most common strategy is to add a single new node per iteration, choosing the untried action either uniformly at random or via a heuristic, which keeps memory growth linear in the simulation budget.
From the new node, the algorithm plays out a trajectory to a terminal state using the default policy. In pure MCTS the default policy is uniform random play. Stronger Go programs used hand-crafted heavy playouts that incorporated patterns and tactical refinements. In AlphaGo and its successors, the simulation phase is replaced by a single neural network value estimate, which removes rollout variance at the cost of requiring a trained value network.
Once the simulation produces an outcome, the algorithm walks back up the path from the new node to the root, updating the statistics at every node along the way. Each visited node has its visit count incremented by one and its accumulated value updated by the simulation's outcome. In two-player zero-sum games the sign of the value is flipped at every level. AlphaZero and MuZero use a min-max backup that takes the negative of the child value at every level for two-player games and a sum or mean for single-agent settings.
MCTS has several properties that distinguish it from older game-tree search methods such as alpha-beta pruning.
Anytime. The algorithm produces a best-so-far recommendation at every iteration. There is no need to specify a search depth in advance, and the quality of the answer improves smoothly with more time, which makes MCTS suitable for real-time decisions.
Asymmetric tree growth. Unlike depth-limited iterative deepening, MCTS does not visit all nodes uniformly. The UCB1 selection rule concentrates visits on subtrees that look promising, producing a tree that is deep in critical lines and shallow in obvious blunders. This adaptivity is what allows MCTS to be useful in games like Go where the branching factor exceeds 200 and complete search to even modest depth is impossible.
Aheuristic in pure form. The basic algorithm needs only a generative model of the environment and a terminal reward signal. There is no need for a hand-designed evaluation function or move ordering heuristic. This generality made MCTS a natural fit for general game playing competitions, where the rules of the game are revealed to the agent only at runtime.
Convergent. UCT is provably consistent for finite-horizon discounted Markov decision processes and for two-player zero-sum games with bounded rewards. The estimated value at the root converges to the true minimax value as the simulation budget grows.
Parallelizable. Independent simulations from the same root can be run in parallel, with care taken to avoid race conditions when updating shared statistics. Three common parallelization schemes are leaf parallelization, root parallelization, and tree parallelization with virtual losses, each with different trade-offs between speedup and search quality.
Limitations. MCTS struggles in environments with sparse or deceptive rewards because random rollouts can take exponentially many samples to reach a high-reward terminal state. It also struggles in tactical positions where a single forced sequence determines the outcome and missing it leads to disaster: alpha-beta search with a strong evaluator still outperforms MCTS in chess unless MCTS is augmented with a learned value network. Continuous action spaces require special treatment because the basic algorithm assumes a discrete branching factor at every node.
Research on MCTS has produced a large family of refinements. The Browne 2012 survey catalogues more than two hundred. The most influential variants are summarized below.
| Year | Variant | Key idea |
|---|---|---|
| 2006 | UCT (Kocsis and Szepesvari) | UCB1 formula applied at every node; the dominant selection rule for MCTS. |
| 2007 | RAVE / AMAF (Gelly and Silver) | Rapid Action Value Estimation: combines the value of a move across all simulations in which it appeared, not just those in which it was chosen first. |
| 2008 | Heuristic UCT / Progressive bias | Incorporates domain knowledge as an additive bonus on the UCB1 score that decays as visit counts grow. |
| 2012 | Information Set MCTS (Cowling et al.) | Extension to imperfect information games such as poker and bridge through determinization or shared statistics across information sets. |
| 2012 | UCB1-Tuned, MCTS-Solver | Refined exploration formulas and exact backups for proven game-theoretic values within the search tree. |
| 2016 | AlphaGo PUCT | Two networks: a policy network providing priors for action selection and a value network replacing rollouts. |
| 2017 | AlphaZero | Single network outputting both policy and value, learned entirely from self-play with no human data. PUCT formula is the standard selection rule. |
| 2020 | MuZero (Schrittwieser et al.) | Learns a dynamics model alongside policy and value networks, removing the need for an explicit environment simulator. |
| 2021 | Sampled MuZero (Hubert et al.) | Samples a small set of candidate actions from the policy at each node, enabling planning in continuous and combinatorial action spaces. |
| 2022 | Gumbel MuZero / Gumbel AlphaZero (Danihelka et al.) | Replaces UCB heuristics with sequential halving over Gumbel-perturbed policy logits, guaranteeing policy improvement even with very small simulation budgets. |
| 2022 | AlphaTensor (Fawzi et al.) | MuZero-style MCTS for tensor decomposition, discovering faster matrix multiplication algorithms than human-designed methods. |
| 2023 | AlphaDev (Mankowitz et al.) | MCTS for assembly-level sorting algorithm discovery, producing routines integrated into the LLVM standard library. |
| 2023 | RAP / Tree of Thoughts (Hao et al., Yao et al.) | Apply MCTS or analogous tree search to the reasoning steps of a large language model. |
| 2024 | AlphaProof (DeepMind) | MCTS-based proof search in the Lean theorem prover, achieving silver-medal performance at the International Mathematical Olympiad. |
In games like Go where the value of a move is often largely independent of the order in which earlier moves were played, the all-moves-as-first heuristic (AMAF) updates the statistics of every move that appeared anywhere in a simulation, not just the move actually chosen at the relevant node. RAVE, introduced in MoGo by Gelly and Silver, blends an AMAF estimate with the standard MCTS estimate using a weighting that favors AMAF when visit counts are low and MCTS as visits accumulate. RAVE was a major contributor to MoGo's success and remains common in MCTS implementations for games with high move commutativity.
The PUCT formula, used in AlphaZero, MuZero, AlphaTensor, and AlphaDev, modifies the exploration bonus to incorporate a prior probability supplied by a neural network policy:
PUCT(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt( N(s) ) / ( 1 + N(s, a) )
Here P(s, a) is the prior probability of action a from state s as predicted by the policy network. The exploration bonus is scaled by this prior, so actions that the policy considers promising receive more exploration even when their visit counts are low. The constant c_puct controls the strength of exploration and is typically set somewhere between 1 and 4 depending on the application.
The selection rule of AlphaZero is heuristic and can fail to guarantee policy improvement when the simulation budget is small. Gumbel MCTS, introduced by Ivo Danihelka and colleagues at ICLR 2022, replaces the heuristic with a principled procedure: sample candidate actions without replacement using the Gumbel-Top-k trick on policy logits, then apply sequential halving to allocate simulations across the candidates. The resulting policy is provably an improvement over the prior even with as few as two simulations per move. Gumbel MuZero achieves the same final performance as MuZero on Go, chess, and Atari with significantly fewer simulations.
A recent line of work applies MCTS to the reasoning chain of a large language model. Tree of Thoughts, by Shunyu Yao and colleagues at Princeton, introduced the framework of treating each intermediate reasoning step as a node in a tree and using BFS, DFS, or beam search to explore the tree. Reasoning via Planning (RAP), by Shibo Hao and colleagues, casts the LLM itself as both a world model (predicting the next state given an action) and a reasoning agent, with MCTS choosing among candidate reasoning steps. These methods substantially improve LLM performance on planning tasks like the 24 Game and on multi-step math word problems compared to vanilla chain-of-thought prompting.
| Application | Year | Domain | Role of MCTS |
|---|---|---|---|
| Crazy Stone | 2006 | Computer Go | First program to use MCTS, won 9x9 Go event at 12th Computer Olympiad. |
| MoGo | 2007 | Computer Go | First program to defeat a professional Go player on 9x9 board, used UCT and RAVE. |
| Fuego | 2009 | Computer Go | Open-source MCTS framework, first program to defeat a 9-dan professional in even 9x9 game. |
| AlphaGo | 2016 | Computer Go | Combined MCTS with deep neural networks; defeated Lee Sedol, ranked #1 in the world. |
| AlphaGo Zero / AlphaZero | 2017 | Go, chess, shogi | Single neural network, MCTS as the policy improvement operator in self-play training. |
| MuZero | 2020 | Atari, Go, chess, shogi | MCTS over a learned model rather than an explicit simulator; matched AlphaZero on board games and Atari. |
| AlphaTensor | 2022 | Matrix multiplication algorithm discovery | MCTS-driven tensor decomposition; first improvement to Strassen's 4x4 algorithm in 50 years. |
| AlphaDev | 2023 | Sorting algorithm synthesis | MCTS over CPU instruction sequences; new sort routines integrated into LLVM libc++. |
| AlphaProof | 2024 | Formal theorem proving | MCTS in Lean tactic space; silver-medal performance at the 2024 International Mathematical Olympiad. |
| Tree of Thoughts / RAP | 2023 | LLM reasoning | Tree search over reasoning steps generated by a language model. |
| General Game Playing | 2007 onward | Board, card, and arcade games | Default approach in GGP competitions because MCTS requires no domain knowledge beyond the rules. |
| Information Set MCTS | 2010s | Poker, bridge, Magic, Hearthstone | MCTS variants for stochastic and imperfect information games. |
| Autonomous driving behavior planning | 2018 onward | Robotics | MCTS used to evaluate maneuver sequences such as merges, lane changes, and intersection negotiation. |
| Procedural content generation | 2010s onward | Game design | MCTS searches over game level layouts or quest structures evaluated by playtesting simulations. |
AlphaGo, described by David Silver and colleagues in their January 2016 Nature paper "Mastering the game of Go with deep neural networks and tree search," is the system that brought MCTS to mainstream attention. AlphaGo combined three components: a policy network trained on professional human games and refined with self-play reinforcement learning, a value network trained to predict game outcomes from positions, and an MCTS search that used the policy network to guide expansion and a mixture of the value network and a fast rollout policy to evaluate leaves. In a five-game match in March 2016, AlphaGo defeated 18-time world Go champion Lee Sedol four games to one. The match was watched by an estimated 200 million people worldwide and is widely considered one of the defining moments of modern AI.
AlphaGo's success had two consequences. It demonstrated that deep neural networks could solve perceptual problems in board games that had resisted symbolic methods for decades, and it revealed that MCTS was the right vehicle for combining learned evaluations with deliberate search. Both lessons have been carried forward into every subsequent DeepMind game-playing system.
AlphaGo Zero, published in October 2017, removed the dependence on human game data. A single neural network outputting both a policy and a value was trained from scratch by self-play, with MCTS acting as a policy improvement operator: the visit counts produced by the search were stronger than the raw network policy, and the network was trained to mimic them. After three days of training, AlphaGo Zero defeated the published version of AlphaGo 100 to 0. The same recipe, dubbed AlphaZero, was applied in December 2017 to chess and shogi, defeating the strongest engines in each game (Stockfish in chess, Elmo in shogi) within a day of training.
The key insight of AlphaZero is that MCTS does not just play the game well, it generates training data for the network. The visit counts at the root form a sharper distribution than the network's policy output, and training the network to match those visit counts is a form of policy iteration. Iterating this loop drives the network and the search to improve together, with no need for human knowledge beyond the rules.
MuZero, by Julian Schrittwieser and colleagues at DeepMind, was published in Nature in December 2020. Whereas AlphaZero requires a perfect model of the environment to expand the search tree, MuZero learns its own model. Three networks are trained jointly: a representation function that encodes the observation into a hidden state, a dynamics function that maps hidden state and action to a next hidden state and predicted reward, and a prediction function that outputs a policy and value from the hidden state. MCTS unrolls the dynamics function in latent space, never touching the real environment after the initial encoding.
MuZero matched AlphaZero on Go, chess, and shogi without being told the rules and achieved state-of-the-art on the Atari Learning Environment, demonstrating that planning with a learned model could exceed the performance of model-free methods that had dominated Atari for years. The MuZero formulation has since been extended to continuous control (Sampled MuZero), small-budget planning (Gumbel MuZero, EfficientZero), and the algorithm-discovery domains of AlphaTensor, AlphaDev, and AlphaProof.
AlphaTensor, published in Nature in October 2022, applies a MuZero-style MCTS to the problem of finding low-rank tensor decompositions, which correspond to fast matrix multiplication algorithms. Each move adds a rank-one tensor to the decomposition, and the goal is to reduce the residual to zero in as few moves as possible. AlphaTensor discovered an algorithm for multiplying 4x4 matrices over the field of two elements using 47 multiplications, beating the 49-multiplication algorithm derived from Strassen's method, the first improvement in fifty years.
AlphaDev, published in Nature in June 2023, extends the same idea to the synthesis of small assembly-level sorting routines. Each move appends a CPU instruction; the reward is correctness on test cases combined with execution speed. The search discovered new sort-three and sort-four implementations that have been merged into the LLVM standard library, where they speed up sorts of short sequences by up to 70 percent.
AlphaProof, announced by DeepMind in July 2024, applies an AlphaZero-style MCTS to formal proof search in the Lean theorem prover. The system achieved silver-medal performance at the 2024 International Mathematical Olympiad, solving four of six problems within the time limits given to human contestants and producing fully machine-verified proofs.
MCTS has been applied to behavior planning in autonomous vehicles, where the search tree represents sequences of high-level maneuvers (merge, change lane, yield, accelerate) over a horizon of several seconds. The CARLA simulator and several published research vehicles use MCTS for tactical decision making, often combined with learned policy priors and risk-sensitive backups that account for uncertainty in surrounding-vehicle behavior. In manipulation and mobile robotics, MCTS underlies several task-and-motion planning systems where high-level symbolic actions need to be checked against low-level geometric feasibility.
General game playing tournaments, held annually since 2005 at AAAI, supply each agent with the rules of a previously unseen game and ask it to play strongly against opponents. Because there is no time to design a domain-specific evaluator, MCTS has been the dominant approach since 2007. The CADIA-Player developed by Yngvi Bjornsson and Hilmar Finnsson at Reykjavik University was the first MCTS-based GGP champion in 2007.
MCTS occupies a particular niche in the landscape of decision-time planning algorithms. Alpha-beta pruning with iterative deepening remains strongest for chess thanks to its compatibility with strong static evaluators and tactical extensions. Pure UCT MCTS dominates general game playing because it requires no domain knowledge. MCTS with neural network guidance (PUCT) is the current state of the art for Go and most other complex games when paired with sufficient compute. Best-first heuristic search such as A* and IDA* remains the natural choice for single-agent shortest path problems with admissible heuristics. The boundaries continue to shift as learned components become more powerful: hybrid systems such as Stockfish NNUE in chess and Leela Chess Zero have brought neural evaluations into both alpha-beta and MCTS engines.
Fuego (C++) is a long-standing open-source MCTS framework for Go and general game playing developed at the University of Alberta. Pachi (C) is a strong open-source Go program using MCTS with RAVE and pattern playouts. KataGo (C++ and Python) is a self-play MCTS Go engine in the AlphaGo Zero tradition, widely used by amateur and professional Go players. Leela Zero and Leela Chess Zero are open-source AlphaZero replications for Go and chess that run distributed self-play training. OpenSpiel is DeepMind's framework for reinforcement learning research in games, including MCTS, AlphaZero, and other algorithms. LightZero is a Python benchmark for MCTS algorithms including AlphaZero, MuZero, EfficientZero, Sampled MuZero, and Gumbel MuZero. mctx is DeepMind's reference implementation in JAX of UCT, AlphaZero, and Gumbel MuZero. Most modern research happens in Python with PyTorch or JAX backends because of the tight integration with neural network training; C++ implementations remain dominant in production game engines where wall-clock per-move budgets matter.
Despite its successes, MCTS has open problems. Sparse-reward environments such as long-horizon robotics tasks present extreme difficulty because random rollouts almost never reach a positive reward. Tactical positions in chess can require depths that pure UCT explores too uniformly, which is why even AlphaZero-style chess engines benefit from selective deepening heuristics borrowed from alpha-beta engines. Continuous action spaces require sampling-based variants such as progressive widening, KR-UCT, or Sampled MuZero, none of which are as theoretically clean as discrete UCT.
The simulation budget remains a concern. AlphaGo at the time of the Lee Sedol match used roughly 1,920 CPUs and 280 GPUs for several seconds per move; even AlphaGo Zero on a single TPU performed about 1,600 simulations per move. Reducing this budget while preserving strength is an active area of research, with EfficientZero and Gumbel MuZero making notable progress. For LLM-based MCTS, where each simulation requires running a multi-billion-parameter model, the cost is even more pressing and motivates work on cheaper rollouts, learned value heads, and batched search.
Theoretically, the regret guarantees of UCT are stated for problems with bounded rewards and finite horizons, conditions that are sometimes violated in practice. There has been progress on distributional MCTS, which maintains a posterior over the value distribution at each node, and on Bayesian MCTS variants that use Thompson sampling for selection, but the practical impact of these approaches has been more modest than that of the AlphaZero-style learned components. Integrating MCTS with model-based reinforcement learning, off-policy data, and large-scale pretraining is the current frontier.