Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a heuristic search algorithm for sequential decision-making problems, most famously used in game-playing artificial intelligence. It builds a search tree incrementally by performing many simulated playthroughs (called rollouts or playouts) of a problem, using random or lightly informed play, and aggregating the results to estimate the value of each candidate action at the root of the tree. MCTS combines the precision of game-tree search with the generality of Monte Carlo simulation, requiring no domain-specific evaluation function in its purest form. The method came to mainstream prominence with DeepMind's AlphaGo, which defeated world Go champion Lee Sedol in March 2016 and demonstrated that MCTS combined with deep neural networks could conquer a domain previously thought to be a decade or more away from human-level computer play.

The algorithm has four phases that repeat thousands or millions of times within a single decision: selection of a path through the existing tree using a tree policy, expansion of the tree by adding a new leaf node, simulation of a rollout from that leaf to a terminal state, and backpropagation of the rollout outcome back up the tree to update visit counts and value estimates. The most influential variant uses the UCB1 bandit formula, originally introduced by Auer, Cesa-Bianchi, and Fischer for the multi-armed bandit problem, applied at every node in the tree. This particular instantiation, called UCT for Upper Confidence Bounds applied to Trees, was introduced by Levente Kocsis and Csaba Szepesvari in 2006 and is the practical foundation of nearly all modern MCTS systems.

MCTS is anytime, meaning it can be stopped at any moment and will return the best action discovered so far, with the quality of the recommendation improving smoothly with additional computation. It grows its tree asymmetrically, devoting more visits to promising subtrees while ignoring obviously bad ones, which is essential for problems with extremely large branching factors such as Go (around 250 legal moves per turn). The algorithm is general purpose: the only domain-specific code required is a description of the legal action set, a transition function, and a terminal reward signal. This generality has made MCTS the backbone of reinforcement learning systems that learn complex environments without hand-engineered evaluation, including AlphaZero for chess, shogi, and Go, MuZero for Atari and board games with a learned dynamics model, AlphaTensor for matrix-multiplication algorithm discovery, AlphaDev for sorting algorithm synthesis, and AlphaProof for formal mathematical theorem proving.

History

The roots of MCTS reach back to early Monte Carlo evaluation in computer Go and to the broader field of stochastic dynamic programming. The first explicit use of Monte Carlo simulation for Go move evaluation is usually credited to Bernd Brugmann's 1993 program Gobble, which sampled random playouts to estimate the expected value of moves. Gobble was a curiosity at the time because traditional alpha-beta search dominated computer chess, but Go's enormous branching factor and the difficulty of writing a position evaluator left Brugmann's idea looking attractive in retrospect. Through the late 1990s and early 2000s several groups including Bouzy and Helmstetter explored Monte Carlo evaluation more systematically, gradually combining sampling with shallow trees.

The modern formulation of MCTS as a tree search guided by simulation outcomes was introduced by Remi Coulom in 2006. His paper "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search," presented at the Computers and Games conference, described the four-phase algorithm now standard in the field. Coulom's program Crazy Stone, which implemented this approach, won the 9x9 Go event at the 12th ICGA Computer Olympiad in 2006, and Coulom is generally credited as the first to use the term "Monte Carlo tree search" in print.

In the same year Levente Kocsis and Csaba Szepesvari published "Bandit Based Monte-Carlo Planning" at the European Conference on Machine Learning, introducing UCT. Their paper supplied the theoretical justification missing from Coulom's empirical method: by treating the choice of move at every internal node as a multi-armed bandit problem and applying the UCB1 formula of Auer and colleagues, they proved that the value of the root converges to the minimax value as the number of simulations grows, and they derived finite-sample regret bounds. The combination of Coulom's algorithm template with Kocsis and Szepesvari's selection rule gave the field a method that was both practically strong and theoretically grounded.

The years 2006 to 2010 saw a Monte Carlo revolution in computer Go. Sylvain Gelly and David Silver implemented UCT in MoGo, then introduced Rapid Action Value Estimation (RAVE) and heuristic UCT, allowing MoGo to become the first program to reach master level on 9x9 Go and to defeat a professional 5-dan player Guo Juan in 2007. Other strong programs of the era included Fuego (developed at the University of Alberta), Erica, Pachi, and Zen, all built on MCTS. By 2008 it was widely accepted that MCTS had displaced alpha-beta search as the dominant paradigm for Go.

The canonical reference work for the field is Cameron Browne and colleagues' 2012 survey "A Survey of Monte Carlo Tree Search Methods," published in the IEEE Transactions on Computational Intelligence and AI in Games. The 43-page survey catalogued more than two hundred MCTS variants, enhancements, and applications across games, scheduling, planning, and procedural content generation. It remains the standard introduction to the field.

The combination of MCTS with deep neural networks by DeepMind starting in 2014 transformed the algorithm from a Go-specialized technique into a general framework for learned planning. AlphaGo, published by David Silver and colleagues in Nature in January 2016, used a value network and a policy network to guide MCTS, beating European Go champion Fan Hui 5-0 in October 2015 and Lee Sedol 4-1 in March 2016. AlphaGo Zero (2017) removed the need for human game data, learning entirely from self-play with a single network. AlphaZero (2017, full Nature paper 2018) generalized the same recipe to chess and shogi. MuZero (2020) removed the need for a known model of the environment by learning a dynamics function alongside the policy and value networks. The same template has since been adapted for matrix multiplication discovery (AlphaTensor, 2022), assembly-level sorting algorithm synthesis (AlphaDev, 2023), and formal mathematical proof search (AlphaProof, 2024). The Tree of Thoughts framework introduced by Shunyu Yao and colleagues at NeurIPS 2023, and the Reasoning via Planning method of Shibo Hao and colleagues at EMNLP 2023, both apply MCTS-style search to the generation steps of a large language model.

Algorithm: the four phases

MCTS proceeds by repeating an outer loop, each iteration of which executes four phases on a partially built search tree. The tree is rooted at the current state, and each node stores statistics summarizing the simulations that have passed through it: typically a visit count N(s) and an estimate of the action value Q(s, a) for each action available at the node. After a budget of iterations or a wall-clock time limit, the algorithm reports the action with the highest visit count or highest mean value as the move to play.

Phase	Description	Typical implementation
1. Selection	Starting from the root, recursively choose the child with the highest tree-policy score until a node is reached that has at least one unexpanded action.	UCB1 / UCT in classical MCTS, PUCT with neural network priors in AlphaZero-style systems.
2. Expansion	If the selected node has untried actions and is not terminal, add one or more new child nodes corresponding to those actions.	Add a single new child per iteration (most common), or expand all untried actions at once for vectorized neural network inference.
3. Simulation	From the newly added node, play out a complete trajectory to a terminal state using a default policy. The trajectory is not stored in the tree.	Uniform random rollouts in pure MCTS, hand-crafted heavy playouts in early Go programs, neural network value estimate in AlphaZero (no explicit rollout).
4. Backpropagation	Walk back up from the new node to the root, incrementing the visit count and updating the value estimate of every node along the path with the rollout's outcome.	Add the result to a running sum and increment the count, taking care to invert sign across alternating turns in two-player games.

After many iterations, the visit counts at the children of the root form an approximate posterior over how good each move is. Most implementations choose the most visited child as the recommended action, both because it tends to be more robust than the highest-mean child and because the visit count itself acts as a useful target for learned policy networks in self-play settings.

Selection and the UCB1 formula

The selection step decides how to descend the existing tree. Each node faces an exploration-exploitation trade-off identical to the multi-armed bandit problem: should the next visit go to the action with the best estimated value, exploiting current knowledge, or to a less-tried action that might prove better, exploring? The UCB1 formula due to Auer, Cesa-Bianchi, and Fischer (2002) addresses this trade-off with a closed-form bonus term:

UCB1(s, a) = Q(s, a) + c * sqrt( ln N(s) / N(s, a) )

Here Q(s, a) is the average reward observed from action a in state s, N(s) is the number of times state s has been visited, N(s, a) is the number of times action a has been chosen from s, and c is an exploration constant typically set to sqrt(2) for normalized rewards in the unit interval. The square-root term grows logarithmically with the parent's visit count and shrinks with the action's visit count, so under-explored actions accumulate an exploration bonus that eventually forces them to be tried. UCT applies this rule at every internal node of the tree.

Kocsis and Szepesvari proved that UCT is consistent: as the number of simulations goes to infinity, the value estimated at the root converges to the true minimax value of the game, provided rewards are bounded. They also derived bounds on the regret incurred along the way. These guarantees explained why UCT worked in practice and made it the default selection rule for MCTS implementations.

Expansion, simulation, and backpropagation

When selection reaches a node with at least one action that has never been tried, expansion adds a new node to the tree. The most common strategy is to add a single new node per iteration, choosing the untried action either uniformly at random or via a heuristic, which keeps memory growth linear in the simulation budget.

From the new node, the algorithm plays out a trajectory to a terminal state using the default policy. In pure MCTS the default policy is uniform random play. Stronger Go programs used hand-crafted heavy playouts that incorporated patterns and tactical refinements. In AlphaGo and its successors, the simulation phase is replaced by a single neural network value estimate, which removes rollout variance at the cost of requiring a trained value network.

Once the simulation produces an outcome, the algorithm walks back up the path from the new node to the root, updating the statistics at every node along the way. Each visited node has its visit count incremented by one and its accumulated value updated by the simulation's outcome. In two-player zero-sum games the sign of the value is flipped at every level. AlphaZero and MuZero use a min-max backup that takes the negative of the child value at every level for two-player games and a sum or mean for single-agent settings.

Properties

MCTS has several properties that distinguish it from older game-tree search methods such as alpha-beta pruning.

Anytime. The algorithm produces a best-so-far recommendation at every iteration. There is no need to specify a search depth in advance, and the quality of the answer improves smoothly with more time, which makes MCTS suitable for real-time decisions.

Asymmetric tree growth. Unlike depth-limited iterative deepening, MCTS does not visit all nodes uniformly. The UCB1 selection rule concentrates visits on subtrees that look promising, producing a tree that is deep in critical lines and shallow in obvious blunders. This adaptivity is what allows MCTS to be useful in games like Go where the branching factor exceeds 200 and complete search to even modest depth is impossible.

Aheuristic in pure form. The basic algorithm needs only a generative model of the environment and a terminal reward signal. There is no need for a hand-designed evaluation function or move ordering heuristic. This generality made MCTS a natural fit for general game playing competitions, where the rules of the game are revealed to the agent only at runtime.

Convergent. UCT is provably consistent for finite-horizon discounted Markov decision processes and for two-player zero-sum games with bounded rewards. The estimated value at the root converges to the true minimax value as the simulation budget grows.

Parallelizable. Independent simulations from the same root can be run in parallel, with care taken to avoid race conditions when updating shared statistics. Three common parallelization schemes are leaf parallelization, root parallelization, and tree parallelization with virtual losses, each with different trade-offs between speedup and search quality.

Limitations. MCTS struggles in environments with sparse or deceptive rewards because random rollouts can take exponentially many samples to reach a high-reward terminal state. It also struggles in tactical positions where a single forced sequence determines the outcome and missing it leads to disaster: alpha-beta search with a strong evaluator still outperforms MCTS in chess unless MCTS is augmented with a learned value network. Continuous action spaces require special treatment because the basic algorithm assumes a discrete branching factor at every node.

Variants and enhancements

Research on MCTS has produced a large family of refinements. The Browne 2012 survey catalogues more than two hundred. The most influential variants are summarized below.

Year	Variant	Key idea
2006	UCT (Kocsis and Szepesvari)	UCB1 formula applied at every node; the dominant selection rule for MCTS.
2007	RAVE / AMAF (Gelly and Silver)	Rapid Action Value Estimation: combines the value of a move across all simulations in which it appeared, not just those in which it was chosen first.
2008	Heuristic UCT / Progressive bias	Incorporates domain knowledge as an additive bonus on the UCB1 score that decays as visit counts grow.
2012	Information Set MCTS (Cowling et al.)	Extension to imperfect information games such as poker and bridge through determinization or shared statistics across information sets.
2012	UCB1-Tuned, MCTS-Solver	Refined exploration formulas and exact backups for proven game-theoretic values within the search tree.
2016	AlphaGo PUCT	Two networks: a policy network providing priors for action selection and a value network replacing rollouts.
2017	AlphaZero	Single network outputting both policy and value, learned entirely from self-play with no human data. PUCT formula is the standard selection rule.
2020	MuZero (Schrittwieser et al.)	Learns a dynamics model alongside policy and value networks, removing the need for an explicit environment simulator.
2021	Sampled MuZero (Hubert et al.)	Samples a small set of candidate actions from the policy at each node, enabling planning in continuous and combinatorial action spaces.
2022	Gumbel MuZero / Gumbel AlphaZero (Danihelka et al.)	Replaces UCB heuristics with sequential halving over Gumbel-perturbed policy logits, guaranteeing policy improvement even with very small simulation budgets.
2022	AlphaTensor (Fawzi et al.)	MuZero-style MCTS for tensor decomposition, discovering faster matrix multiplication algorithms than human-designed methods.
2023	AlphaDev (Mankowitz et al.)	MCTS for assembly-level sorting algorithm discovery, producing routines integrated into the LLVM standard library.
2023	RAP / Tree of Thoughts (Hao et al., Yao et al.)	Apply MCTS or analogous tree search to the reasoning steps of a large language model.
2024	AlphaProof (DeepMind)	MCTS-based proof search in the Lean theorem prover, achieving silver-medal performance at the International Mathematical Olympiad.

RAVE and the all-moves-as-first heuristic

In games like Go where the value of a move is often largely independent of the order in which earlier moves were played, the all-moves-as-first heuristic (AMAF) updates the statistics of every move that appeared anywhere in a simulation, not just the move actually chosen at the relevant node. RAVE, introduced in MoGo by Gelly and Silver, blends an AMAF estimate with the standard MCTS estimate using a weighting that favors AMAF when visit counts are low and MCTS as visits accumulate. RAVE was a major contributor to MoGo's success and remains common in MCTS implementations for games with high move commutativity.

PUCT and neural network priors

The PUCT formula, used in AlphaZero, MuZero, AlphaTensor, and AlphaDev, modifies the exploration bonus to incorporate a prior probability supplied by a neural network policy:

PUCT(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt( N(s) ) / ( 1 + N(s, a) )

Here P(s, a) is the prior probability of action a from state s as predicted by the policy network. The exploration bonus is scaled by this prior, so actions that the policy considers promising receive more exploration even when their visit counts are low. The constant c_puct controls the strength of exploration and is typically set somewhere between 1 and 4 depending on the application.

Gumbel MCTS

The selection rule of AlphaZero is heuristic and can fail to guarantee policy improvement when the simulation budget is small. Gumbel MCTS, introduced by Ivo Danihelka and colleagues at ICLR 2022, replaces the heuristic with a principled procedure: sample candidate actions without replacement using the Gumbel-Top-k trick on policy logits, then apply sequential halving to allocate simulations across the candidates. The resulting policy is provably an improvement over the prior even with as few as two simulations per move. Gumbel MuZero achieves the same final performance as MuZero on Go, chess, and Atari with significantly fewer simulations.

MCTS for large language models

A recent line of work applies MCTS to the reasoning chain of a large language model. Tree of Thoughts, by Shunyu Yao and colleagues at Princeton, introduced the framework of treating each intermediate reasoning step as a node in a tree and using BFS, DFS, or beam search to explore the tree. Reasoning via Planning (RAP), by Shibo Hao and colleagues, casts the LLM itself as both a world model (predicting the next state given an action) and a reasoning agent, with MCTS choosing among candidate reasoning steps. These methods substantially improve LLM performance on planning tasks like the 24 Game and on multi-step math word problems compared to vanilla chain-of-thought prompting.

Notable applications

Application	Year	Domain	Role of MCTS
Crazy Stone	2006	Computer Go	First program to use MCTS, won 9x9 Go event at 12th Computer Olympiad.
MoGo	2007	Computer Go	First program to defeat a professional Go player on 9x9 board, used UCT and RAVE.
Fuego	2009	Computer Go	Open-source MCTS framework, first program to defeat a 9-dan professional in even 9x9 game.
AlphaGo	2016	Computer Go	Combined MCTS with deep neural networks; defeated Lee Sedol, ranked #1 in the world.
AlphaGo Zero / AlphaZero	2017	Go, chess, shogi	Single neural network, MCTS as the policy improvement operator in self-play training.
MuZero	2020	Atari, Go, chess, shogi	MCTS over a learned model rather than an explicit simulator; matched AlphaZero on board games and Atari.
AlphaTensor	2022	Matrix multiplication algorithm discovery	MCTS-driven tensor decomposition; first improvement to Strassen's 4x4 algorithm in 50 years.
AlphaDev	2023	Sorting algorithm synthesis	MCTS over CPU instruction sequences; new sort routines integrated into LLVM libc++.
AlphaProof	2024	Formal theorem proving	MCTS in Lean tactic space; silver-medal performance at the 2024 International Mathematical Olympiad.
Tree of Thoughts / RAP	2023	LLM reasoning	Tree search over reasoning steps generated by a language model.
General Game Playing	2007 onward	Board, card, and arcade games	Default approach in GGP competitions because MCTS requires no domain knowledge beyond the rules.
Information Set MCTS	2010s	Poker, bridge, Magic, Hearthstone	MCTS variants for stochastic and imperfect information games.
Autonomous driving behavior planning	2018 onward	Robotics	MCTS used to evaluate maneuver sequences such as merges, lane changes, and intersection negotiation.
Procedural content generation	2010s onward	Game design	MCTS searches over game level layouts or quest structures evaluated by playtesting simulations.

AlphaGo and the Lee Sedol match

AlphaGo, described by David Silver and colleagues in their January 2016 Nature paper "Mastering the game of Go with deep neural networks and tree search," is the system that brought MCTS to mainstream attention. AlphaGo combined three components: a policy network trained on professional human games and refined with self-play reinforcement learning, a value network trained to predict game outcomes from positions, and an MCTS search that used the policy network to guide expansion and a mixture of the value network and a fast rollout policy to evaluate leaves. In a five-game match in March 2016, AlphaGo defeated 18-time world Go champion Lee Sedol four games to one. The match was watched by an estimated 200 million people worldwide and is widely considered one of the defining moments of modern AI.

AlphaGo's success had two consequences. It demonstrated that deep neural networks could solve perceptual problems in board games that had resisted symbolic methods for decades, and it revealed that MCTS was the right vehicle for combining learned evaluations with deliberate search. Both lessons have been carried forward into every subsequent DeepMind game-playing system.

AlphaGo Zero, AlphaZero, and self-play

AlphaGo Zero, published in October 2017, removed the dependence on human game data. A single neural network outputting both a policy and a value was trained from scratch by self-play, with MCTS acting as a policy improvement operator: the visit counts produced by the search were stronger than the raw network policy, and the network was trained to mimic them. After three days of training, AlphaGo Zero defeated the published version of AlphaGo 100 to 0. The same recipe, dubbed AlphaZero, was applied in December 2017 to chess and shogi, defeating the strongest engines in each game (Stockfish in chess, Elmo in shogi) within a day of training.

The key insight of AlphaZero is that MCTS does not just play the game well, it generates training data for the network. The visit counts at the root form a sharper distribution than the network's policy output, and training the network to match those visit counts is a form of policy iteration. Iterating this loop drives the network and the search to improve together, with no need for human knowledge beyond the rules.

MuZero and learned models

MuZero, by Julian Schrittwieser and colleagues at DeepMind, was published in Nature in December 2020. Whereas AlphaZero requires a perfect model of the environment to expand the search tree, MuZero learns its own model. Three networks are trained jointly: a representation function that encodes the observation into a hidden state, a dynamics function that maps hidden state and action to a next hidden state and predicted reward, and a prediction function that outputs a policy and value from the hidden state. MCTS unrolls the dynamics function in latent space, never touching the real environment after the initial encoding.

MuZero matched AlphaZero on Go, chess, and shogi without being told the rules and achieved state-of-the-art on the Atari Learning Environment, demonstrating that planning with a learned model could exceed the performance of model-free methods that had dominated Atari for years. The MuZero formulation has since been extended to continuous control (Sampled MuZero), small-budget planning (Gumbel MuZero, EfficientZero), and the algorithm-discovery domains of AlphaTensor, AlphaDev, and AlphaProof.

AlphaTensor, AlphaDev, AlphaProof

AlphaTensor, published in Nature in October 2022, applies a MuZero-style MCTS to the problem of finding low-rank tensor decompositions, which correspond to fast matrix multiplication algorithms. Each move adds a rank-one tensor to the decomposition, and the goal is to reduce the residual to zero in as few moves as possible. AlphaTensor discovered an algorithm for multiplying 4x4 matrices over the field of two elements using 47 multiplications, beating the 49-multiplication algorithm derived from Strassen's method, the first improvement in fifty years.

AlphaDev, published in Nature in June 2023, extends the same idea to the synthesis of small assembly-level sorting routines. Each move appends a CPU instruction; the reward is correctness on test cases combined with execution speed. The search discovered new sort-three and sort-four implementations that have been merged into the LLVM standard library, where they speed up sorts of short sequences by up to 70 percent.

AlphaProof, announced by DeepMind in July 2024, applies an AlphaZero-style MCTS to formal proof search in the Lean theorem prover. The system achieved silver-medal performance at the 2024 International Mathematical Olympiad, solving four of six problems within the time limits given to human contestants and producing fully machine-verified proofs.

Robotics, planning, and general game playing

MCTS has been applied to behavior planning in autonomous vehicles, where the search tree represents sequences of high-level maneuvers (merge, change lane, yield, accelerate) over a horizon of several seconds. The CARLA simulator and several published research vehicles use MCTS for tactical decision making, often combined with learned policy priors and risk-sensitive backups that account for uncertainty in surrounding-vehicle behavior. In manipulation and mobile robotics, MCTS underlies several task-and-motion planning systems where high-level symbolic actions need to be checked against low-level geometric feasibility.

General game playing tournaments, held annually since 2005 at AAAI, supply each agent with the rules of a previously unseen game and ask it to play strongly against opponents. Because there is no time to design a domain-specific evaluator, MCTS has been the dominant approach since 2007. The CADIA-Player developed by Yngvi Bjornsson and Hilmar Finnsson at Reykjavik University was the first MCTS-based GGP champion in 2007.

Comparison with other search methods

MCTS occupies a particular niche in the landscape of decision-time planning algorithms. Alpha-beta pruning with iterative deepening remains strongest for chess thanks to its compatibility with strong static evaluators and tactical extensions. Pure UCT MCTS dominates general game playing because it requires no domain knowledge. MCTS with neural network guidance (PUCT) is the current state of the art for Go and most other complex games when paired with sufficient compute. Best-first heuristic search such as A* and IDA* remains the natural choice for single-agent shortest path problems with admissible heuristics. The boundaries continue to shift as learned components become more powerful: hybrid systems such as Stockfish NNUE in chess and Leela Chess Zero have brought neural evaluations into both alpha-beta and MCTS engines.

Implementations and software

Fuego (C++) is a long-standing open-source MCTS framework for Go and general game playing developed at the University of Alberta. Pachi (C) is a strong open-source Go program using MCTS with RAVE and pattern playouts. KataGo (C++ and Python) is a self-play MCTS Go engine in the AlphaGo Zero tradition, widely used by amateur and professional Go players. Leela Zero and Leela Chess Zero are open-source AlphaZero replications for Go and chess that run distributed self-play training. OpenSpiel is DeepMind's framework for reinforcement learning research in games, including MCTS, AlphaZero, and other algorithms. LightZero is a Python benchmark for MCTS algorithms including AlphaZero, MuZero, EfficientZero, Sampled MuZero, and Gumbel MuZero. mctx is DeepMind's reference implementation in JAX of UCT, AlphaZero, and Gumbel MuZero. Most modern research happens in Python with PyTorch or JAX backends because of the tight integration with neural network training; C++ implementations remain dominant in production game engines where wall-clock per-move budgets matter.

Limitations and ongoing research

Despite its successes, MCTS has open problems. Sparse-reward environments such as long-horizon robotics tasks present extreme difficulty because random rollouts almost never reach a positive reward. Tactical positions in chess can require depths that pure UCT explores too uniformly, which is why even AlphaZero-style chess engines benefit from selective deepening heuristics borrowed from alpha-beta engines. Continuous action spaces require sampling-based variants such as progressive widening, KR-UCT, or Sampled MuZero, none of which are as theoretically clean as discrete UCT.

The simulation budget remains a concern. AlphaGo at the time of the Lee Sedol match used roughly 1,920 CPUs and 280 GPUs for several seconds per move; even AlphaGo Zero on a single TPU performed about 1,600 simulations per move. Reducing this budget while preserving strength is an active area of research, with EfficientZero and Gumbel MuZero making notable progress. For LLM-based MCTS, where each simulation requires running a multi-billion-parameter model, the cost is even more pressing and motivates work on cheaper rollouts, learned value heads, and batched search.

Theoretically, the regret guarantees of UCT are stated for problems with bounded rewards and finite horizons, conditions that are sometimes violated in practice. There has been progress on distributional MCTS, which maintains a posterior over the value distribution at each node, and on Bayesian MCTS variants that use Thompson sampling for selection, but the practical impact of these approaches has been more modest than that of the AlphaZero-style learned components. Integrating MCTS with model-based reinforcement learning, off-policy data, and large-scale pretraining is the current frontier.

References

Coulom, R. (2006). "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search." In H. J. van den Herik, P. Ciancarini, and H. H. L. M. Donkers (Eds.), Computers and Games, 5th International Conference, CG 2006, Lecture Notes in Computer Science, vol. 4630, pp. 72 to 83, Springer.
Kocsis, L., and Szepesvari, C. (2006). "Bandit Based Monte-Carlo Planning." In J. Furnkranz, T. Scheffer, and M. Spiliopoulou (Eds.), Machine Learning: ECML 2006, Lecture Notes in Computer Science, vol. 4212, pp. 282 to 293, Springer.
Browne, C., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1 to 43.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 47(2 to 3), 235 to 256.
Brugmann, B. (1993). "Monte Carlo Go." Technical report, Physics Department, Syracuse University.
Gelly, S., and Silver, D. (2007). "Combining Online and Offline Knowledge in UCT." Proceedings of the 24th International Conference on Machine Learning, pp. 273 to 280.
Gelly, S., and Silver, D. (2011). "Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go." Artificial Intelligence, 175(11), 1856 to 1875.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587), 484 to 489.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. (2017). "Mastering the game of Go without human knowledge." Nature, 550(7676), 354 to 359.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." Science, 362(6419), 1140 to 1144.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. (2020). "Mastering Atari, Go, chess and shogi by planning with a learned model." Nature, 588(7839), 604 to 609.
Hubert, T., Schrittwieser, J., Antonoglou, I., Barekatain, M., Schmitt, S., and Silver, D. (2021). "Learning and Planning in Complex Action Spaces." Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 4476 to 4486.
Danihelka, I., Guez, A., Schrittwieser, J., and Silver, D. (2022). "Policy improvement by planning with Gumbel." Proceedings of the International Conference on Learning Representations (ICLR).
Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., Ruiz, F. J. R., Schrittwieser, J., Swirszcz, G., Silver, D., Hassabis, D., and Kohli, P. (2022). "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature, 610(7930), 47 to 53.
Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J. B., Ahern, A., Koppe, T., Millikin, K., Gaffney, S., Elster, S., Broshear, J., Gamble, C., Milan, K., Tung, R., Hwang, M., Cemgil, T., Barekatain, M., Li, Y., Mandhane, A., Hubert, T., Schrittwieser, J., Hassabis, D., Kohli, P., Riedmiller, M., Vinyals, O., and Silver, D. (2023). "Faster sorting algorithms discovered using deep reinforcement learning." Nature, 618(7964), 257 to 263.
DeepMind (2024). "AI achieves silver-medal standard solving International Mathematical Olympiad problems." Google DeepMind blog post and technical report on AlphaProof and AlphaGeometry 2.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." Advances in Neural Information Processing Systems 36 (NeurIPS).
Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. (2023). "Reasoning with Language Model is Planning with World Model." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8154 to 8173.
Cowling, P. I., Powley, E. J., and Whitehouse, D. (2012). "Information Set Monte Carlo Tree Search." IEEE Transactions on Computational Intelligence and AI in Games, 4(2), 120 to 143.
Enzenberger, M., Muller, M., Arneson, B., and Segal, R. (2010). "Fuego: An Open-Source Framework for Board Games and Go Engine Based on Monte Carlo Tree Search." IEEE Transactions on Computational Intelligence and AI in Games, 2(4), 259 to 270.
Bjornsson, Y., and Finnsson, H. (2009). "CadiaPlayer: A Simulation-Based General Game Player." IEEE Transactions on Computational Intelligence and AI in Games, 1(1), 4 to 15.
Chaslot, G., Winands, M. H. M., and van den Herik, H. J. (2008). "Parallel Monte-Carlo Tree Search." Computers and Games, Lecture Notes in Computer Science, vol. 5131, pp. 60 to 71, Springer.
Cazenave, T. (2015). "Generalized Rapid Action Value Estimation." Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 754 to 760.
Winands, M. H. M., Bjornsson, Y., and Saito, J. T. (2008). "Monte-Carlo Tree Search Solver." Computers and Games, Lecture Notes in Computer Science, vol. 5131, pp. 25 to 36, Springer.
Tesauro, G., and Galperin, G. R. (1996). "On-line Policy Improvement using Monte-Carlo Search." Advances in Neural Information Processing Systems 9, 1068 to 1074.
Wu, D. J. (2019). "KataGo: Accelerating Self-Play Learning in Go." arXiv:1902.10565.

History

Algorithm: the four phases

Selection and the UCB1 formula

Expansion, simulation, and backpropagation

Properties

Variants and enhancements

RAVE and the all-moves-as-first heuristic

PUCT and neural network priors

Gumbel MCTS

MCTS for large language models

Notable applications

AlphaGo and the Lee Sedol match

AlphaGo Zero, AlphaZero, and self-play

MuZero and learned models

AlphaTensor, AlphaDev, AlphaProof

Robotics, planning, and general game playing

Comparison with other search methods

Implementations and software

Limitations and ongoing research

See also

References

Improve this article

Related Articles

Machine learning terms/Reinforcement Learning

AlphaGo

OpenAI Five

AlphaStar

AlphaZero

Temporal-difference learning

History

Algorithm: the four phases

Selection and the UCB1 formula

Expansion, simulation, and backpropagation

Properties

Variants and enhancements

RAVE and the all-moves-as-first heuristic

PUCT and neural network priors

Gumbel MCTS

MCTS for large language models

Notable applications

AlphaGo and the Lee Sedol match

AlphaGo Zero, AlphaZero, and self-play

MuZero and learned models

AlphaTensor, AlphaDev, AlphaProof

Robotics, planning, and general game playing

Comparison with other search methods

Implementations and software

Limitations and ongoing research

See also

References

Related Articles

Machine learning terms/Reinforcement Learning

AlphaGo

OpenAI Five

AlphaStar

AlphaZero

Temporal-difference learning