MuZero is a model-based reinforcement learning algorithm developed by DeepMind that achieves superhuman performance in board games and Atari video games without being given the rules of the environment in advance. Introduced in a preprint in November 2019 and formally published in Nature in December 2020, MuZero is the direct successor to AlphaZero and the broader AlphaGo lineage. Its central innovation is that it learns its own internal model of environment dynamics, then plans inside that learned model using Monte Carlo tree search, rather than relying on a hand-coded simulator. By doing so, MuZero matched the strength of AlphaZero in Go, chess, and shogi while simultaneously setting state of the art results across the 57 game Atari benchmark, a domain to which the AlphaZero family could not be applied because the agent is not given the rules or dynamics of the games.[1]
MuZero has since become a foundational algorithm at DeepMind. It underpins later research systems including AlphaTensor, which discovered new matrix multiplication algorithms in 2022, and AlphaDev, which discovered faster sorting routines that were merged into the standard C++ library in 2023.[2][3] It has also been deployed in production at Google to optimize video compression for YouTube.[4] A growing family of variants extends MuZero to continuous action spaces, stochastic environments, sample efficient learning, and offline data, including Sampled MuZero, Stochastic MuZero, MuZero Reanalyse, EfficientZero, and Gumbel MuZero.[5][6][7][8][9]
MuZero sits at the end of a multi year research program at DeepMind that began with AlphaGo in 2016. AlphaGo combined supervised learning from expert human games, reinforcement learning from self play, and Monte Carlo tree search to defeat the top professional Go player Lee Sedol. AlphaGo Zero, introduced in 2017, removed the dependency on human games and learned entirely from self play, using a single neural network that output both a move policy and a value estimate. AlphaZero generalized AlphaGo Zero to chess and shogi using the same algorithm and architecture, again learning from scratch through self play guided by Monte Carlo tree search.[1]
Despite their power, all of these systems shared a critical assumption. Each of them required a perfect simulator of the game. AlphaZero needed to know exactly which moves were legal, what the next position would be after any action, and when the game ended. This requirement was acceptable in board games where rules are known and easy to encode, but it ruled out broad classes of problems where dynamics are unknown, partially observed, or visually complex. Atari video games, robotics, and most real world planning tasks fall into this category. The MuZero project was framed as the answer to a single question: can a planning agent reach AlphaZero level performance without ever being told the rules of its world?
Reinforcement learning algorithms are commonly divided into model free and model based methods. Model free methods such as Q learning, DQN, and policy gradient algorithms learn a value function or policy directly from experience, without an explicit representation of how the environment behaves. Model based methods first learn a transition model, often called a world model, that predicts the next observation and reward given a current state and an action, and then use that model for planning. Model based methods have historically promised better sample efficiency and stronger generalization, but earlier attempts struggled because errors in the learned model compounded over multi step rollouts and degraded the quality of any plan built on top of them.
MuZero takes a distinctive position inside this design space. Rather than learning a model that reconstructs full pixel observations or full game states, it learns a model whose only obligation is to predict the quantities that planning actually needs: the action policy, the value of the position, and the immediate reward. This is sometimes called a value equivalent model. By relaxing the modeling target, MuZero is able to learn useful internal representations that may bear no visual or symbolic resemblance to the true environment yet remain accurate enough to drive a powerful tree search.
MuZero combines three learned functions with a search procedure. At inference time, given a sequence of observations, the algorithm encodes them into a hidden state and then runs a Monte Carlo tree search inside the hidden state space to choose the next action. At training time, all three networks are optimized jointly so that the predictions made along simulated trajectories match the targets collected from real play.
The MuZero model has three components, each implemented as a deep neural network and trained end to end.[1]
The representation function $h$ maps the recent history of raw observations to an initial hidden state. In a board game it consumes a stack of recent positions, and in Atari it consumes a stack of recent frames. Formally, $s^0 = h(o_1, o_2, \ldots, o_t)$, where $s^0$ is the initial hidden state used to seed planning. The representation function is purely an encoder. It does not need to be invertible and does not have to reconstruct the input.
The dynamics function $g$ models how the hidden state evolves under actions. Given a hidden state $s^{k}$ and an action $a^{k+1}$, it produces a successor hidden state and a predicted immediate reward: $(s^{k+1}, r^{k+1}) = g(s^{k}, a^{k+1})$. This is the learned analogue of an environment simulator. It runs entirely inside the abstract latent space and never touches raw pixels or board configurations after the first encoding.
The prediction function $f$ maps a hidden state to a policy and a value: $(p^{k}, v^{k}) = f(s^{k})$. The policy is a probability distribution over actions and the value estimates the expected return from that hidden state. Together, $g$ and $f$ play the role that a hand coded simulator and a value network would play in AlphaZero.
The table below summarizes the role of each component.
| Component | Symbol | Inputs | Outputs | Role |
|---|---|---|---|---|
| Representation function | $h$ | Sequence of past observations | Initial hidden state $s^0$ | Encodes raw observations into a planning friendly latent state |
| Dynamics function | $g$ | Hidden state $s^k$ and action $a^{k+1}$ | Next hidden state $s^{k+1}$ and predicted reward $r^{k+1}$ | Learned model of environment transitions and rewards |
| Prediction function | $f$ | Hidden state $s^k$ | Policy $p^k$ and value $v^k$ | Provides priors and value estimates that guide MCTS |
A key conceptual point is that the hidden state has no semantics imposed by the designer. It is whatever vector representation the network finds useful for predicting reward, value, and policy along simulated trajectories. The model is therefore not constrained to match the true mechanics of the world; it only has to be value equivalent to the true world along the trajectories it explores.
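The division of labour between the three functions can be made concrete with a short sketch. The Python interface below is purely illustrative; the class and method names are assumptions rather than DeepMind's published code, and the real networks are deep residual or convolutional models, but the signatures mirror the roles of $h$, $g$, and $f$ described above.

```python
from typing import Any, List, Tuple

# Illustrative type aliases; in practice these are tensors produced by
# deep residual or convolutional networks.
Observation = Any   # a raw frame or a plane encoding of a board position
HiddenState = Any   # abstract latent state with no designer imposed semantics
Action = int


class MuZeroModel:
    """Sketch of the three learned functions; class and method names are assumed."""

    def representation(self, observations: List[Observation]) -> HiddenState:
        """h: encode the recent observation history into the initial hidden state s^0."""
        ...

    def dynamics(self, state: HiddenState, action: Action) -> Tuple[HiddenState, float]:
        """g: map a hidden state and an action to the next hidden state and a predicted reward."""
        ...

    def prediction(self, state: HiddenState) -> Tuple[List[float], float]:
        """f: map a hidden state to policy logits and a value estimate."""
        ...
```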
MuZero plans using a variant of the Monte Carlo tree search procedure introduced in AlphaGo Zero and AlphaZero. The root of the search tree is the initial hidden state produced by the representation function. Each edge corresponds to an action. Each simulation of the search proceeds in three phases.
During selection, the search descends the tree by repeatedly choosing the action that maximizes an upper confidence bound that combines the policy prior, the visit count, and the running value estimate. This is the same PUCT style rule used by AlphaZero, adapted to operate over hidden states.
During expansion and evaluation, when the search reaches a leaf, it applies the dynamics function to obtain a new hidden state and a predicted reward, then applies the prediction function to obtain a policy prior and a value for that new state.
During backup, the value of the new state is propagated back along the visited path, accumulating discounted predicted rewards along the way, and the visit counts and mean value estimates of every traversed edge are updated. After all simulations are complete, the agent samples or selects an action based on the visit counts at the root.
This search runs entirely inside the learned latent space. The agent never invokes the real environment during planning. In the published implementation, MuZero used 800 simulations per move in board games and 50 simulations per step in Atari.[1]
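The following sketch illustrates a single simulation of this latent space search, reusing the hypothetical `MuZeroModel` interface from the model sketch above. It is a simplified illustration rather than the published implementation: the value normalization, the second exploration constant, and the Dirichlet noise added at the root during self play are omitted.

```python
import math


class Node:
    """One node per hidden state; each child stores the statistics of the edge into it."""
    def __init__(self, prior):
        self.prior = prior          # policy prior P(s, a) for the edge into this node
        self.visit_count = 0        # N(s, a)
        self.value_sum = 0.0        # sum of backed up returns through this edge
        self.reward = 0.0           # predicted reward on the edge into this node
        self.hidden_state = None    # latent state reached through this edge
        self.children = {}          # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def puct_score(parent, child, c_puct=1.25):
    # Simplified PUCT rule: mean value plus a prior weighted exploration bonus.
    # The published rule also normalizes Q values and uses a second constant.
    exploration = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + exploration


def run_simulation(root, model, discount=0.997):
    """One search simulation carried out entirely in the learned latent space.

    Assumes the root has already been expanded with the representation and
    prediction functions. The discount of 0.997 is the value used for Atari
    in the paper; board games use a discount of 1.
    """
    node, path = root, [root]

    # Selection: descend by maximizing the PUCT score until an unexpanded node is reached.
    while node.children:
        parent = node
        action, node = max(node.children.items(), key=lambda kv: puct_score(parent, kv[1]))
        path.append(node)

    # Expansion and evaluation: apply the learned dynamics and prediction functions.
    hidden_state, reward = model.dynamics(parent.hidden_state, action)
    node.hidden_state, node.reward = hidden_state, reward
    policy_logits, value = model.prediction(hidden_state)
    node.children = {a: Node(prior=p) for a, p in enumerate(softmax(policy_logits))}

    # Backup: propagate the discounted return toward the root, adding predicted rewards.
    ret = value
    for n in reversed(path):
        ret = n.reward + discount * ret
        n.value_sum += ret
        n.visit_count += 1
```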
MuZero is trained by playing many games against itself, storing each trajectory in a replay buffer. To form a training example, a starting state is sampled from the buffer and the network is unrolled for $K$ hypothetical steps (five in the published experiments) using the actual actions that were taken. At each unrolled step, the model produces predictions of policy, value, and reward, and these are compared against three targets.
The policy target is the visit count distribution that MCTS produced at the corresponding real step. This is the same self play improvement signal that AlphaZero uses: the search is treated as a stronger policy than the network alone, and the network is trained to match it.
The value target is an n step bootstrapped return computed from the actual rewards observed in the trajectory plus a discounted MCTS value at a later state, $z_t = u_{t+1} + \gamma u_{t+2} + \cdots + \gamma^{n-1} u_{t+n} + \gamma^{n} \nu_{t+n}$, where $u$ denotes observed rewards and $\nu_{t+n}$ is the search value at step $t+n$. In two player zero sum games with no intermediate rewards this collapses to the game outcome.
The reward target is the actual reward observed at the corresponding real step.
The loss is the sum of cross entropy or squared error terms for each prediction at each unrolled step, plus an L2 weight regularization term. All three networks are differentiated through the unrolled hidden state trajectory and updated jointly.
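A schematic version of this unrolled loss is sketched below, assuming a hypothetical `trajectory` object that exposes the stored observations, actions, and targets. The particular loss functions, the gradient scaling, and the prioritized replay of the published implementation are abstracted away.

```python
def muzero_loss(model, trajectory, start_index, num_unroll_steps,
                policy_loss, scalar_loss):
    """Schematic unrolled MuZero loss for one sampled position.

    All helper names are assumptions for illustration: `trajectory` is taken
    to expose the stored observation history, the actions actually played,
    and the precomputed targets (MCTS visit distribution, n step return,
    observed reward) at each real time step.
    """
    total = 0.0
    # Encode the observation history at the sampled position into s^0.
    hidden_state = model.representation(trajectory.observations(start_index))

    for k in range(num_unroll_steps + 1):
        # Predictions at the k-th unrolled hidden state.
        policy_logits, value = model.prediction(hidden_state)
        target_policy = trajectory.policy_target(start_index + k)
        target_value = trajectory.value_target(start_index + k)
        total += policy_loss(policy_logits, target_policy)  # match the search visit distribution
        total += scalar_loss(value, target_value)            # match the n step bootstrapped return

        if k < num_unroll_steps:
            # Unroll the learned dynamics with the action that was actually taken,
            # and train the predicted reward against the reward that was observed.
            action = trajectory.action(start_index + k)
            hidden_state, predicted_reward = model.dynamics(hidden_state, action)
            total += scalar_loss(predicted_reward, trajectory.reward(start_index + k))

    # The published loss also adds an L2 weight regularization term, omitted here.
    return total
```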
The central theoretical claim of MuZero is that an internal model only needs to be accurate where it matters for planning. Earlier model based methods often tried to learn pixel perfect or state accurate models, which is wasteful and unstable when most of the input contains information that is irrelevant to good decisions. By training the model end to end against the same quantities that drive search and policy improvement, MuZero focuses learning capacity on the parts of the world that change action choice. Empirically, this allowed MuZero to match a system, AlphaZero, that had access to perfect rules.[1]
The original MuZero paper reported results on four benchmark sets: Go, chess, shogi, and the 57 game Atari suite of the Arcade Learning Environment.[1]
In Go, MuZero matched the strength of AlphaZero over the course of self play training and slightly surpassed it after roughly one million training steps. The paper reported that increasing search time at evaluation from 0.1 seconds to 50 seconds per move improved playing strength by more than 1000 Elo, comparable to the gap between a strong amateur and an elite professional, indicating that planning depth in the learned model translates into real strength.
In chess and shogi, MuZero again matched AlphaZero's superhuman performance after about one million training steps, despite never being told the legal moves of either game. The dynamics function had to discover, from rewards alone, which transitions were possible.
In Atari, MuZero set new state of the art results on the standard 57 game suite, surpassing the previous best model free method R2D2 on both mean and median human normalized score. This was a particularly significant result because the AlphaZero family had never been applied to Atari before: under the standard benchmark the agent is not given a model of the game dynamics it could plan with, so the only way to plan is to learn a model.
The table below summarizes how MuZero relates to its immediate predecessors.
| Property | AlphaGo (2016) | AlphaZero (2017) | MuZero (2019) |
|---|---|---|---|
| Domains | Go | Go, chess, shogi | Go, chess, shogi, 57 Atari games |
| Human data required | Yes, supervised pretraining | No, learns from self play | No, learns from self play |
| Rules of the environment | Required | Required | Not required, learned |
| Simulator | Hand coded perfect simulator | Hand coded perfect simulator | Learned dynamics network |
| Search algorithm | Monte Carlo tree search | Monte Carlo tree search | Monte Carlo tree search in latent space |
| Visual inputs | Board features | Board features | Raw pixels for Atari, board features for others |
| Atari support | No | No | Yes |
Since its publication, MuZero has spawned an active line of research that extends it to settings the original algorithm could not handle and improves its sample efficiency.
MuZero Reanalyse, described in the original Nature paper and detailed in a 2021 follow up by Schrittwieser, Hubert, and colleagues, addresses sample efficiency. The idea is to revisit older trajectories in the replay buffer and rerun MCTS on them using the latest version of the network, producing fresh and improved policy and value targets without any new environment interaction. The same paper introduced MuZero Unplugged, a unified algorithm that subsumes both online and offline reinforcement learning by tuning the ratio of fresh experience to reanalysed experience. MuZero Unplugged set new state of the art results on the RL Unplugged offline benchmark and on Atari at 200 million frames.[5]
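A reanalyse worker can be sketched as a simple loop over stored positions; the helper names below (`sample_trajectory`, `set_policy_target`, and so on) are assumptions for illustration rather than the published code.

```python
def reanalyse_trajectory(replay_buffer, model, run_mcts, num_simulations=800):
    """Sketch of a MuZero Reanalyse worker; all helper names are assumptions.

    An old trajectory is sampled from the replay buffer and the search is
    rerun at each stored position with the latest network parameters,
    overwriting the stored policy and value targets without collecting any
    new environment experience.
    """
    trajectory = replay_buffer.sample_trajectory()
    for t in range(len(trajectory)):
        hidden_state = model.representation(trajectory.observations(t))
        root = run_mcts(hidden_state, model, num_simulations)
        # Fresh, stronger targets produced by the current network.
        trajectory.set_policy_target(t, root.visit_count_distribution())
        trajectory.set_value_target(t, root.value())
    replay_buffer.update(trajectory)
```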
Sampled MuZero, introduced by Hubert and colleagues at ICML 2021, extends MuZero to environments with very large or continuous action spaces. The original MuZero enumerates every action at each tree node, which is impossible when actions are continuous vectors or when the action set is combinatorially large. Sampled MuZero instead draws a small set of candidate actions from the policy at each node and runs MCTS over only those samples. The same idea is applied during training, where the policy update uses sampled rather than enumerated actions. Sampled MuZero recovered MuZero level performance on Go and Atari and learned high dimensional continuous control tasks on the DeepMind Control Suite from both state and pixel inputs.[6]
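The core idea can be sketched as a modified node expansion step, reusing the `Node` structure from the search sketch above. The Gaussian policy head and the helper names are assumptions for illustration; the correction of the search statistics for the sampling distribution used by the full method is omitted.

```python
import numpy as np


def expand_with_sampled_actions(node, hidden_state, model, num_samples=20, rng=None):
    """Sketch of node expansion in the spirit of Sampled MuZero.

    Instead of enumerating every action, a small set of candidate actions is
    drawn from the current policy and only those become children of the node.
    `model.prediction` is assumed here to return the mean and log standard
    deviation of a Gaussian policy over a continuous action vector, plus a value.
    """
    rng = rng or np.random.default_rng()
    (mean, log_std), value = model.prediction(hidden_state)
    std = np.exp(log_std)

    node.hidden_state = hidden_state
    node.children = {}
    for i in range(num_samples):
        sampled_action = rng.normal(mean, std)   # one candidate continuous action
        # Children are keyed by sample index, each pairing the sampled action
        # vector with a child node; every sample receives a uniform prior
        # over the sampled set in this simplified version.
        node.children[i] = (sampled_action, Node(prior=1.0 / num_samples))
    return value
```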
Stochastic MuZero, introduced by Antonoglou and colleagues at ICLR 2022, extends MuZero to environments with intrinsic randomness. The original MuZero implicitly assumes deterministic dynamics. In games like backgammon or 2048, parts of the state transition are governed by dice rolls or random tile spawns, which a deterministic model cannot represent. Stochastic MuZero factors each transition into a deterministic step from the current state to an afterstate, then a stochastic step from the afterstate to the next state, sampled using learned chance codes. The new model preserves value equivalence and supports a stochastic variant of tree search. Stochastic MuZero matched or exceeded the state of the art on 2048 and backgammon while preserving MuZero's performance on Go.[7]
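A minimal sketch of the factored transition looks as follows; the method names are illustrative rather than the published interface.

```python
import numpy as np


def stochastic_transition(model, hidden_state, action, rng=None):
    """Sketch of the two step transition used by Stochastic MuZero."""
    rng = rng or np.random.default_rng()
    # Deterministic step: hidden state and action map to an afterstate, the
    # situation after the agent has acted but before chance is resolved.
    afterstate = model.afterstate_dynamics(hidden_state, action)
    # Stochastic step: sample a discrete chance code (for example a dice roll
    # in backgammon or a tile spawn in 2048) from a learned distribution.
    chance_probs = model.chance_distribution(afterstate)
    chance_code = rng.choice(len(chance_probs), p=chance_probs)
    # The dynamics function maps afterstate and chance outcome to the next
    # hidden state and a predicted reward, as in the original MuZero.
    next_state, reward = model.dynamics(afterstate, chance_code)
    return next_state, reward
```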
EfficientZero, introduced by Ye and colleagues at NeurIPS 2021, is a sample efficient variant of MuZero designed for the Atari 100k benchmark, where agents are allowed only 100,000 environment steps, roughly two hours of game time. EfficientZero adds three key changes to MuZero: a self supervised consistency loss that ties together hidden states predicted by the dynamics model and hidden states encoded from real observations, end to end prediction of value prefixes that smooth reward credit assignment over time, and corrected off policy value targets that account for stale data in the replay buffer. EfficientZero achieved 194.3 percent mean and 109.0 percent median human normalized score on Atari 100k, the first time superhuman performance was reached on Atari with so little data, and approached the performance of DQN trained on 200 million frames using roughly five hundred times fewer samples.[8] EfficientZero V2 later extended the approach to continuous control, using a Gaussian policy and ideas inspired by Sampled MuZero.
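The consistency loss can be sketched as a SimSiam style term that pulls the dynamics prediction toward the encoded next observation; `project`, `predict_head`, and `stop_gradient` are placeholders for the projection, prediction, and stop gradient operations of an actual implementation.

```python
import numpy as np


def stop_gradient(x):
    # Placeholder: in a real autodiff framework this would block gradients
    # from flowing into the encoder branch, as in SimSiam.
    return x


def consistency_loss(model, hidden_state, action, next_observations):
    """Sketch of a self supervised consistency term in the spirit of EfficientZero.

    The latent state predicted by the dynamics function is pulled toward the
    latent state obtained by encoding the real next observation; all helper
    names are illustrative.
    """
    predicted_next, _reward = model.dynamics(hidden_state, action)
    target_next = stop_gradient(model.representation(next_observations))

    p = model.predict_head(model.project(predicted_next))
    z = model.project(target_next)
    # Negative cosine similarity between the two branches.
    return -float(np.dot(p, z) / (np.linalg.norm(p) * np.linalg.norm(z) + 1e-8))
```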
Gumbel MuZero, presented at ICLR 2022, replaces the standard PUCT selection rule with a planning procedure based on Gumbel top k sampling. It is designed to make the policy improvement step more reliable when the number of MCTS simulations is small or the action space is very large. Gumbel MuZero was shown to match the original MuZero's performance with far fewer simulations per move and to scale better to large action spaces.[9]
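The sampling step at the heart of the method can be sketched with the standard Gumbel top k trick; the sequential halving search that then evaluates the sampled actions is not shown.

```python
import numpy as np


def gumbel_top_k_actions(policy_logits, k, rng=None):
    """Sketch of the Gumbel top k trick used to pick root actions.

    Adding independent Gumbel(0, 1) noise to the policy logits and keeping
    the k largest perturbed values samples k distinct actions without
    replacement from the softmax policy.
    """
    rng = rng or np.random.default_rng()
    gumbel_noise = rng.gumbel(size=len(policy_logits))
    perturbed = np.asarray(policy_logits) + gumbel_noise
    return np.argsort(perturbed)[::-1][:k]
```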
Muesli is a related DeepMind algorithm that achieves MuZero level Atari performance without explicit MCTS at training time, instead using a regularized policy update inspired by maximum a posteriori policy optimization. DreamerV3 by Hafner and colleagues represents a different family of world model based reinforcement learning: unlike MuZero, it learns a reconstructive world model and trains an actor critic policy inside imagined rollouts. DreamerV3 reaches strong performance across many tasks with a single set of hyperparameters, and the two approaches are often discussed together as the leading model based reinforcement learning paradigms, though they make different design choices about whether to reconstruct observations and whether to plan with explicit search.
The table below summarizes the main MuZero variants and their target settings.
| Variant | Year | Authors | Key contribution | Target setting |
|---|---|---|---|---|
| MuZero | 2019, Nature 2020 | Schrittwieser et al. | Original learned model and MCTS planning | Discrete actions, deterministic environments, Go, chess, shogi, Atari |
| MuZero Reanalyse and Unplugged | 2021 | Schrittwieser, Hubert et al. | Reanalysing old data with the latest model, unified online and offline RL | Sample efficient and offline RL |
| Sampled MuZero | 2021, ICML | Hubert et al. | MCTS over sampled actions | Continuous and combinatorial action spaces |
| Stochastic MuZero | 2022, ICLR | Antonoglou et al. | Afterstate factorization and chance codes | Stochastic environments such as backgammon and 2048 |
| EfficientZero | 2021, NeurIPS | Ye et al. | Self supervised consistency loss and value prefixes | Sample efficient visual RL, Atari 100k |
| EfficientZero V2 | 2024 | Wang et al. | Gaussian policy parametrization | Continuous control with limited data |
| Gumbel MuZero | 2022, ICLR | Danihelka et al. | Gumbel top k planning rule | Low simulation budgets and large action spaces |
DeepMind has positioned MuZero as a stepping stone toward general purpose planning agents that can act in the real world, not just in games. Several published applications illustrate this direction.
In February 2022, DeepMind announced that a MuZero based agent had been deployed inside Google's video compression pipeline. The agent, sometimes referred to as MuZero Rate Controller, replaces the rate control component of the open source VP9 codec. Rate control decides how many bits to allocate to each frame, balancing visual quality against file size. The decision is naturally formulated as a sequential planning problem with combinatorial action spaces, which makes it a strong fit for MuZero. DeepMind reported an average bitrate reduction of approximately four percent across a large set of YouTube videos at matched quality. The system was then rolled out across a portion of YouTube's traffic, making MuZero one of the first DeepMind reinforcement learning systems to ship to large scale production.[4]
AlphaTensor, published by Fawzi and colleagues in Nature in October 2022, applies the MuZero algorithm to the problem of discovering matrix multiplication algorithms. The task is cast as a single player game called TensorGame, in which the agent decomposes a target tensor that represents a matrix product into a sum of rank one outer products. Each move adds one outer product, and the goal is to reach an exact decomposition using as few moves as possible. The action space is enormous, exceeding ten to the twelfth possible moves in many configurations. AlphaTensor adapts MuZero with sample based planning and synthetic data augmentation to handle this scale. It rediscovered Strassen's classical algorithm and improved on the best known algorithm for multiplying four by four matrices in modular arithmetic for the first time in fifty years, along with thousands of new algorithms across other matrix sizes and fields.[2] The result is widely seen as a landmark in the use of reinforcement learning for algorithm discovery, in the same lineage as scientific applications such as AlphaFold.
AlphaDev, published by Mankowitz and colleagues in Nature in June 2023, again applies the MuZero algorithm, this time to discover faster sorting routines at the assembly instruction level. The task is framed as a single player game in which the agent constructs a program one instruction at a time, with rewards that combine correctness and runtime. AlphaDev discovered new sorting algorithms for fixed input sizes that outperform the long standing implementations in the LLVM standard C++ library by up to 70 percent for very small lists, with smaller but still meaningful improvements at larger sizes. The discovered routines were merged into the LLVM libc++ standard sort, where they are used trillions of times per day across software that depends on the C++ standard library.[3] As with AlphaTensor, AlphaDev demonstrates that the MuZero recipe of learning a model and planning with MCTS in an abstract latent space generalizes well beyond traditional games.
Researchers have explored MuZero variants for chip placement, scheduling, traffic light control, and energy management. The combination of large combinatorial action spaces, sequential structure, and noisy or expensive environment access makes such tasks a natural fit for the planning style that MuZero embodies. The general design pattern is to define a goal and a reward signal, then let MuZero discover both an internal model of the system and a policy that exploits it.
MuZero is widely regarded as one of the most important results in reinforcement learning of the late 2010s and early 2020s for several reasons. It eliminated a long standing constraint on the AlphaGo and AlphaZero approach by removing the dependency on a hand coded simulator. It demonstrated empirically that a value equivalent learned model can support planning at the same level as a perfect simulator, validating a long line of theoretical work on model based RL. It unified board games and pixel based video games inside a single algorithmic framework. And it provided a practical recipe for using planning as a policy improvement operator at scale, which has been picked up by subsequent work on algorithm discovery, scientific discovery, and real world systems optimization.
The algorithm has also influenced thinking about the relationship between learning and search in artificial intelligence. AlphaGo Zero and AlphaZero made the case that search dramatically amplifies the strength of a learned policy. MuZero strengthened that case by showing that even the world model used during search can itself be learned, leaving very little of the AlphaGo recipe that requires human design. The boundary between learning and planning, on this view, becomes a continuum rather than a binary distinction.
At the same time, MuZero has limitations that have motivated much of the follow up work cited above. The original algorithm assumes discrete actions and deterministic dynamics, which Sampled MuZero and Stochastic MuZero address. It is sample hungry on Atari, which EfficientZero and MuZero Unplugged address. The published version is computationally expensive, requiring substantial accelerator resources for training, and DeepMind released only pseudocode rather than a reference implementation, which has slowed reproduction in the open research community. Open source efforts such as the muzero general project by Werner Duvaud and the EfficientZero codebase by Ye and colleagues have partially filled this gap.