AlphaZero is a computer program developed by DeepMind that learned to play chess, shogi (Japanese chess), and Go at a superhuman level, starting from zero human knowledge. Rather than relying on handcrafted evaluation functions or databases of expert games, AlphaZero taught itself each game entirely through self-play reinforcement learning, using only the rules of the game as input. The system achieved world-champion-level performance in all three games, within hours of training in chess and shogi and within days in Go, defeating the strongest existing programs: Stockfish in chess, Elmo in shogi, and AlphaGo Zero in Go.
AlphaZero was first described in a preprint paper released on December 5, 2017, and the full peer-reviewed version was published in the journal Science on December 7, 2018, under the title "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." The paper was authored by David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis.
Before AlphaZero, the strongest game-playing programs relied on fundamentally different approaches depending on the game. Chess engines like Stockfish used alpha-beta search with handcrafted evaluation functions tuned by human experts over decades. These evaluation functions assigned numerical weights to features such as material balance, king safety, pawn structure, and piece mobility. In shogi, programs similarly relied on expert-designed heuristics. In Go, the situation was more complex because the branching factor (approximately 250 legal moves per position, compared to about 35 in chess) made traditional search methods impractical, which led DeepMind to develop AlphaGo.
AlphaGo, which famously defeated world champion Lee Sedol in March 2016, used a combination of supervised learning from human expert games and reinforcement learning through self-play. While this was a landmark achievement, the reliance on human game data meant the system was partly constrained by the patterns and strategies humans had discovered. AlphaGo Zero, published in October 2017 in Nature, removed this dependency entirely for Go, learning solely through self-play and surpassing all previous versions of AlphaGo within 40 days of training.
AlphaZero took this idea one step further. The core question behind AlphaZero was whether a single, general-purpose algorithm could master multiple different games without any game-specific modifications or human knowledge beyond the basic rules. The answer turned out to be yes.
The progression from AlphaGo to AlphaZero represents a steady movement toward generality and away from reliance on human expertise.
The original AlphaGo came in several versions. AlphaGo Fan defeated European Go champion Fan Hui 5-0 in October 2015, becoming the first program to beat a professional Go player on a full 19x19 board without handicap. AlphaGo Lee defeated 9-dan professional Lee Sedol 4-1 in March 2016 in a match watched by over 200 million people worldwide. AlphaGo Master went 60-0 against top professionals in online games from December 2016 to January 2017, and later defeated world number one Ke Jie 3-0 at the Future of Go Summit in May 2017.
All versions of AlphaGo were trained initially on a dataset of human expert games (about 160,000 games from online Go servers) using supervised learning. This human data was used to train a policy network that predicted expert moves. The system then improved beyond human level through reinforcement learning via self-play.
AlphaGo Zero eliminated the supervised learning phase entirely. It started from completely random play and used only self-play reinforcement learning. Other differences from the original AlphaGo included combining the policy and value functions into a single neural network (rather than separate networks), using a simpler board representation (raw board positions instead of hand-engineered features), and eliminating Monte Carlo rollouts, relying on the value network alone to evaluate positions. AlphaGo Zero used a ResNet architecture with either 20 or 40 residual blocks.
After just three days of training, AlphaGo Zero defeated the version of AlphaGo that beat Lee Sedol by 100 games to 0. After 40 days of training, it surpassed all previous versions including AlphaGo Master. However, AlphaGo Zero was designed exclusively for Go and exploited Go-specific properties, such as the rotational and reflectional symmetry of the board, to augment its training data eightfold.
AlphaZero generalized the approach of AlphaGo Zero to work across multiple games. To achieve this generality, several Go-specific optimizations were removed. AlphaZero did not use data augmentation based on board symmetries (since chess and shogi boards are not rotationally symmetric). It used the same hyperparameters for all three games, with only minor variations in the neural network architecture to accommodate the different board sizes and move spaces. It also estimated the expected game outcome rather than just the probability of winning, allowing it to account for draws, which occur in chess and shogi but not in Go under standard rules.
AlphaZero combines a deep neural network with Monte Carlo tree search (MCTS). The neural network evaluates board positions and suggests promising moves, while MCTS uses these evaluations to search ahead and select the best action.
The neural network takes a board position as input and produces two outputs: a policy (a probability distribution over legal moves indicating which moves are most promising) and a value (a scalar estimate between -1 and +1 predicting the expected game outcome from the current position).
The architecture is based on a deep residual network (ResNet). The network body consists of one convolutional layer followed by 19 residual blocks. Each residual block contains two convolutional layers with batch normalization and rectified linear unit (ReLU) activations, connected by a skip connection. Each convolution applies 256 filters of kernel size 3x3 with stride 1.
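As a rough sanity check on the tower's size, the convolution weights can be counted directly. This is a back-of-the-envelope sketch: it counts only the 3x3 kernel weights of the chess configuration, ignoring biases, batch-normalization parameters, and the two heads.

```python
# Rough convolution-weight count for AlphaZero's network body (chess).
# Only conv kernels are counted; biases and batch-norm are ignored.
in_planes = 119           # chess input planes (see input representation below)
filters = 256             # filters per convolution
kernel = 3 * 3            # 3x3 kernels, stride 1

stem = in_planes * filters * kernel           # initial convolutional layer
per_block = 2 * (filters * filters * kernel)  # two convolutions per residual block
tower = stem + 19 * per_block                 # 19 residual blocks

print(f"per residual block: {per_block:,} weights")  # 1,179,648
print(f"whole tower:        {tower:,} weights")      # 22,687,488
```

The bulk of the body's roughly 22.7 million convolution weights thus sits in the residual blocks, with the input stem contributing only a small fraction.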
The network splits into two heads after the shared residual tower:
- A policy head, which applies a further convolution to the tower's output and produces a probability distribution over the game's move encoding.
- A value head, which applies a convolution followed by fully connected layers and a tanh activation, producing a scalar evaluation between -1 and +1.
The input representation encodes the board state from the perspective of the current player. For chess, the input consists of 119 planes of 8x8, encoding the positions of all pieces for the last eight board states (to capture history and repetition), plus additional planes for castling rights, move counters, and the color of the current player.
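The 119-plane figure follows from a simple tally. The grouping below is one plausible reading of the paper's description; the exact split of the seven constant planes is an assumption for illustration.

```python
# Tally of the 119 chess input planes: 8 time steps of board history
# plus constant-valued planes for game state.
history_steps = 8
piece_planes = 6 * 2         # 6 piece types for each of the two players
repetition_planes = 2        # repetition counters per historical position
per_step = piece_planes + repetition_planes   # 14 planes per time step

constant_planes = 4 + 1 + 1 + 1  # castling rights (4), side to move,
                                 # total move count, no-progress count

total = history_steps * per_step + constant_planes
print(total)  # 119
```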
AlphaZero uses a variant of MCTS guided by the neural network. During each search, the algorithm builds a tree of possible future positions by repeatedly performing four steps:
1. Selection: starting from the root, descend the tree by choosing at each node the move that maximizes a PUCT score, which balances the current value estimate Q against an exploration bonus proportional to the network's prior probability P and inversely related to the move's visit count N.
2. Expansion: when the search reaches a position not yet in the tree, add it as a new leaf node.
3. Evaluation: evaluate the new leaf with the neural network, obtaining a value estimate for the position and prior probabilities for its legal moves; no random rollouts are performed.
4. Backup: propagate the value estimate back along the traversed path, incrementing visit counts and updating the mean value of each edge.
After the simulations complete, AlphaZero selects its move based on the visit counts at the root.
During training games, AlphaZero performed 800 MCTS simulations per move. Despite searching far fewer positions than traditional engines, the neural network guidance allowed AlphaZero to focus its search on the most relevant lines of play.
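The selection step can be sketched as follows. This is a minimal, illustrative implementation of the PUCT rule used to pick a move inside the search tree; the `Node` structure and the constant `c_puct` are simplifications, not AlphaZero's exact code.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                     # P(s, a) from the policy head
    visit_count: int = 0             # N(s, a)
    value_sum: float = 0.0           # sum of backed-up values
    children: dict = field(default_factory=dict)  # move -> Node

    def q(self) -> float:
        """Mean action value Q(s, a); 0 for unvisited nodes."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.25):
    """Pick the child maximizing Q + U, where U is the exploration bonus."""
    total_visits = sum(c.visit_count for c in node.children.values())
    def puct(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: puct(kv[1]))

# A low-value move with few visits and a reasonable prior can still be
# preferred over a well-explored one, which is how the search explores:
root = Node(prior=1.0)
root.children = {
    "a": Node(prior=0.6, visit_count=10, value_sum=5.0),  # Q = 0.5
    "b": Node(prior=0.4, visit_count=1, value_sum=0.9),   # Q = 0.9
}
move, _ = select_child(root)
print(move)  # "b"
```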
| Metric | AlphaZero (Chess) | Stockfish | AlphaZero (Shogi) | Elmo | AlphaZero (Go) |
|---|---|---|---|---|---|
| Positions evaluated per second | 80,000 | 70,000,000 | 40,000 | 35,000,000 | 16,000 |
| MCTS simulations per move | 800 | N/A | 800 | N/A | 800 |
| Search type | MCTS + neural network | Alpha-beta + handcrafted eval | MCTS + neural network | Alpha-beta + handcrafted eval | MCTS + neural network |
The difference in raw search speed is striking. Stockfish evaluated roughly 875 times more positions per second than AlphaZero in chess, and Elmo evaluated roughly 875 times more positions per second than AlphaZero in shogi. Yet AlphaZero's neural network allowed it to evaluate positions far more accurately, making each evaluation count for much more than the shallow evaluations performed by traditional engines.
AlphaZero was trained entirely through self-play reinforcement learning. The process began with a neural network initialized with random weights, meaning the initial policy was effectively random play. The system then improved iteratively through the following cycle:
1. The current network plays games against itself, using MCTS to select each move.
2. Every position from these games becomes a training example: the MCTS visit-count distribution is the target for the policy head, and the eventual game outcome is the target for the value head.
3. The network weights are updated by gradient descent to bring the policy output closer to the search probabilities and the value output closer to the actual outcomes.
4. The updated network generates new self-play games, and the cycle repeats.
Training proceeded for 700,000 steps (mini-batches of size 4,096 each). The learning rate started at 0.2 and was reduced to 0.02, then 0.002, and finally 0.0002 at predetermined steps during training.
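A stepwise schedule of this form is straightforward to express as a lookup. The rates below come from the text, but the drop boundaries are illustrative placeholders, since the exact step counts are not restated here.

```python
# Stepwise learning-rate schedule of the kind AlphaZero used.
# NOTE: the boundary steps are placeholders for illustration; only the
# rates (0.2 -> 0.02 -> 0.002 -> 0.0002) are taken from the text.
def learning_rate(step: int,
                  boundaries=(100_000, 300_000, 500_000),
                  rates=(0.2, 0.02, 0.002, 0.0002)) -> float:
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]          # final rate after the last boundary

print(learning_rate(0), learning_rate(650_000))  # 0.2 0.0002
```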
AlphaZero was trained using Google's Tensor Processing Units (TPUs). Self-play games were generated on 5,000 first-generation TPUs, while the neural network was trained on 64 second-generation TPUs. These processes ran in parallel: the self-play actors continuously generated games using the latest network checkpoint, while the training process continuously updated the network using the most recent game data.
The amount of training time varied by game complexity:
| Game | Total training time | Games generated during training | Time to surpass existing champion |
|---|---|---|---|
| Chess | ~9 hours | 44 million | ~4 hours (surpassed Stockfish at ~300,000 steps) |
| Shogi | ~12 hours | 24 million | ~2 hours (surpassed Elmo at ~110,000 steps) |
| Go | ~13 days | 21 million | ~8 hours (surpassed AlphaGo Lee at ~165,000 steps) |
The difference in training time reflects the different complexities of each game. Go, with its much larger board (19x19 versus 8x8 or 9x9) and higher branching factor, required significantly more training. Notably, AlphaZero surpassed the strongest existing program in chess and shogi within just a few hours, despite starting from zero knowledge.
For the actual matches against opponents, AlphaZero ran on a single machine with 4 first-generation TPUs and 44 CPU cores. This is the same hardware configuration used by AlphaGo Zero. A first-generation TPU is roughly comparable in inference speed to a commodity GPU such as an NVIDIA Titan V, though the architectures are not directly comparable.
AlphaZero was evaluated against the strongest available program in each game. The 2018 Science paper reported results from 1,000-game matches played under tournament-like time controls (3 hours per side plus a 15-second increment per move).
AlphaZero played 1,000 games against Stockfish, which was the strongest traditional chess engine at the time and the 2016 TCEC (Top Chess Engine Championship) Season 9 superfinal winner. In the updated evaluation published in the 2018 Science paper, Stockfish ran with 44 CPU cores, a 32 GB hash table, and access to Syzygy 6-piece endgame tablebases. These conditions were significantly improved over the initial 2017 preprint, which had been criticized for pairing 64 search threads with a hash table of only 1 GB.
The results were decisive:
| Match | Games | AlphaZero wins | Draws | AlphaZero losses | AlphaZero score |
|---|---|---|---|---|---|
| AlphaZero vs. Stockfish (1,000 games, 3h+15s) | 1,000 | 155 | 839 | 6 | 574.5/1,000 |
AlphaZero won 155 games, lost only 6, and drew 839. The overwhelming majority of games were draws (83.9%), which is typical at the highest levels of chess. But AlphaZero's win-to-loss ratio of roughly 26:1 left no doubt about which program was stronger.
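The match score converts straightforwardly into an approximate Elo difference via the standard logistic model. This is a back-of-the-envelope estimate, not a figure reported in the paper.

```python
import math

# AlphaZero vs. Stockfish, 2018 Science paper, 1,000 games.
wins, draws, losses = 155, 839, 6
games = wins + draws + losses
score = wins + 0.5 * draws          # 574.5 points out of 1,000

# Standard Elo expectation model, inverted to recover the rating gap.
expected = score / games            # 0.5745 expected score per game
elo_gap = -400 * math.log10(1 / expected - 1)
print(f"{score}/{games} -> roughly {elo_gap:.0f} Elo ahead")  # ~52 Elo
```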
The 2018 paper also tested time-odds matches, in which AlphaZero was given progressively less thinking time than Stockfish. AlphaZero continued to outscore Stockfish even when given only one-tenth the thinking time. Stockfish only began to gain an edge when the time odds reached approximately 30:1.
AlphaZero searched roughly 80,000 positions per second in chess, compared to Stockfish's 70 million. Despite evaluating nearly a thousand times fewer positions, AlphaZero's evaluations were far more informed, allowing it to focus its search on the most relevant continuations.
In shogi, AlphaZero faced Elmo, the 2017 World Computer Shogi Championship (WCSC27) winner. Elmo ran under conditions matching those used at the WCSC27 championship, combined with the YaneuraOu search engine. The match used the same time controls as the chess match (3 hours per side plus a 15-second increment).
AlphaZero won 91.2% of games against Elmo. It was particularly dominant when playing sente (first move), achieving a 98.2% win rate. AlphaZero searched approximately 40,000 positions per second in shogi, compared to Elmo's 35 million.
Shogi is in some ways more complex than Western chess because captured pieces can be returned to the board (a rule known as "drops"), which increases the branching factor and makes the game harder for traditional search-based programs. AlphaZero's neural network approach handled this additional complexity without any game-specific modifications.
In Go, AlphaZero played against a 3-day-trained version of AlphaGo Zero. This was a strong opponent, as even AlphaGo Zero trained for 3 days had already surpassed all previous versions of AlphaGo. AlphaZero won 60 of the 100 games played and lost 40, a 60% win rate.
This result is notable because AlphaGo Zero exploited Go-specific symmetries to augment its training data eightfold (through rotations and reflections of the board), while AlphaZero did not use any such augmentation. Despite forgoing that advantage, AlphaZero matched and exceeded the performance of the Go-specific system using a fully general approach.
AlphaZero's playing style attracted intense interest from the chess community because it was so different from conventional computer chess. Traditional engines like Stockfish play in a way that is often described as materialistic: they prioritize maintaining a material advantage and calculate deeply to verify tactical sequences. AlphaZero's approach was strikingly different.
AlphaZero frequently sacrificed material (pawns, pieces, or even a full exchange) in return for long-term positional compensation such as improved piece activity, control of key squares, or a sustained initiative against the opponent's king. This style of play, sometimes called speculative or intuitive, is more commonly associated with attacking human grandmasters than with computer programs.
Chess Grandmaster Matthew Sadler, who analyzed over 2,000 of AlphaZero's games for the book Game Changer (co-authored with Natasha Regan, published January 2019), described AlphaZero's play as remarkable for "the way its pieces swarm around the opponent's king with purpose and power." Sadler compared the experience to "discovering the secret notebooks of some great player from the past."
Former World Chess Champion Garry Kasparov wrote a foreword for Game Changer and commented: "It plays with a very dynamic style, much like my own!" Kasparov, known for his aggressive and dynamic approach during his playing career, expressed enthusiasm about AlphaZero's willingness to sacrifice material for the initiative.
Danish Grandmaster Peter Heine Nielsen, who serves as a second for World Champion Magnus Carlsen, compared AlphaZero's play to that of "a superior alien species." Norwegian Grandmaster Jon Ludvig Hammer described it as "insane attacking chess" combined with deep positional understanding. Yoshiharu Habu, a 9-dan professional shogi player and one of the greatest shogi players in history, said that AlphaZero showed "new possibilities for the game."
Not everyone was equally impressed. Grandmaster Hikaru Nakamura pointed out that AlphaZero ran on Google TPU hardware while Stockfish ran on conventional CPUs, and questioned whether the comparison was fair. Tord Romstad, one of Stockfish's developers, also noted that the conditions in the original 2017 preprint were suboptimal for Stockfish. These concerns were partially addressed in the 2018 Science paper, which gave Stockfish improved hardware settings and access to endgame tablebases.
AlphaZero showed a preference for certain openings that had fallen out of favor in top-level human play. It frequently employed the English Opening and various flank openings as White, and showed that certain positions previously considered equal or slightly better for one side actually contained hidden resources. Its willingness to accept isolated, doubled, or backward pawns in return for piece activity ran counter to decades of computer chess orthodoxy, where engines strongly penalized such structural weaknesses.
AlphaZero's games and approach have had a measurable effect on how humans think about and play chess.
Magnus Carlsen, the World Chess Champion at the time of AlphaZero's publication, cited AlphaZero as a source of inspiration for his play in 2019 and beyond. The willingness to sacrifice material for dynamic compensation, a hallmark of AlphaZero's style, became more common in top-level human games after 2018. Players became more open to positions where traditional engines gave small material disadvantages but where the compensation in activity and initiative was real.
AlphaZero's success inspired the development of open-source neural network chess engines, most notably Leela Chess Zero (Lc0). The Leela Chess Zero project, announced on January 9, 2018 (just weeks after AlphaZero's preprint), attempted to reproduce AlphaZero's approach using distributed computing. Volunteers contributed computing power to generate self-play training games, and over time Lc0 became one of the strongest chess engines in the world, competing directly with Stockfish in major computer chess championships.
Stockfish itself eventually adopted neural network evaluation with the introduction of NNUE (Efficiently Updatable Neural Network) in 2020, moving away from its traditional handcrafted evaluation function. Modern versions of Stockfish combine NNUE evaluation with alpha-beta search, representing a hybrid approach influenced in part by the success of neural network methods demonstrated by AlphaZero.
In a 2020 study published in collaboration with former World Chess Champion Vladimir Kramnik, DeepMind researchers used AlphaZero to evaluate alternative chess rule sets. The paper, "Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess," examined nine chess variants including No-Castling chess, Torpedo chess (where pawns can advance two squares from any rank), Self-Capture chess, and Stalemate-equals-win. By training separate AlphaZero instances on each variant, the researchers could simulate the equivalent of decades of human play within a day and assess which rule changes produced more dynamic, decisive games.
The initial preprint released in December 2017 attracted both excitement and criticism. Several issues were raised by the computer chess community, and DeepMind addressed many of these in the final 2018 Science paper.
| Aspect | 2017 Preprint | 2018 Science Paper |
|---|---|---|
| Time control | 1 minute per move | 3 hours + 15 seconds/move |
| Stockfish version | Stockfish 8 | Stockfish 8 and development Stockfish (Jan 2018) |
| Stockfish hash table | 1 GB | 32 GB |
| Stockfish threads | 64 | 44 (matching TCEC conditions) |
| Endgame tablebases | Not used | 6-piece Syzygy tablebases |
| Number of chess games | 100 | 1,000 |
| Openings | Fixed starting position | Standard starting position, plus additional matches from 2016 TCEC opening positions |
| PUCT variant | Same as AlphaGo Zero | Updated dynamic variant |
The 2018 paper provided Stockfish with substantially better conditions, including access to endgame tablebases (which are critical for precise endgame play) and a much larger hash table. Despite these improvements, AlphaZero's dominance remained clear.
The following table compares the four major iterations of DeepMind's game-playing AI systems.
| Feature | AlphaGo | AlphaGo Zero | AlphaZero | MuZero |
|---|---|---|---|---|
| Year | 2015-2017 | 2017 | 2017-2018 | 2019-2020 |
| Publication venue | Nature (2016) | Nature (2017) | Science (2018) | Nature (2020) |
| Games played | Go only | Go only | Chess, shogi, Go | Chess, shogi, Go, Atari (57 games) |
| Human data required | Yes (160,000 expert games) | No | No | No |
| Game rules required | Yes | Yes | Yes | No (learns a model of the environment) |
| Network architecture | Separate policy and value networks | Single dual-headed ResNet (20 or 40 blocks) | Single dual-headed ResNet (19 blocks, 256 filters) | Representation, dynamics, and prediction networks |
| Training method | Supervised learning + RL self-play | RL self-play only | RL self-play only | RL self-play only |
| Board symmetry exploitation | Yes | Yes (8x augmentation) | No | No |
| Search algorithm | MCTS with rollouts | MCTS with neural network evaluation | MCTS with neural network evaluation | MCTS with learned model |
| Key advancement | First to beat professional Go player | Removed need for human data in Go | Generalized to multiple games | Removed need for known game rules |
MuZero, published by DeepMind as a preprint in November 2019 and in Nature in December 2020, extended AlphaZero's approach by removing the requirement for a known game model. While AlphaZero needed the exact rules of each game to simulate future positions during MCTS, MuZero learned its own internal model of the environment. This model did not attempt to reconstruct the full game state; instead, it learned to predict only the quantities relevant to planning: the reward, the policy, and the value.
MuZero achieved this through three neural networks working together:
- A representation network, which encodes the current observation (for example, the board position) into an internal hidden state.
- A dynamics network, which takes a hidden state and a candidate action and predicts the resulting hidden state and the immediate reward.
- A prediction network, which maps a hidden state to a policy and a value, playing the same role as AlphaZero's two-headed network.
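The division of labor among the three networks can be sketched as a planning interface. The functions below are trivial stand-ins, not learned networks; the names h, g, and f follow the MuZero paper's notation, and everything else is illustrative.

```python
# Toy stand-ins for MuZero's three networks, showing the data flow
# during planning. Real versions are learned neural networks.
def h(observation):                  # representation: observation -> hidden state
    return tuple(observation)

def g(state, action):                # dynamics: (state, action) -> (next state, reward)
    return state + (action,), 0.0    # toy transition with zero reward

def f(state):                        # prediction: state -> (policy, value)
    policy = {0: 0.5, 1: 0.5}        # uniform toy policy over two actions
    return policy, 0.0

# Planning unrolls the learned model without ever touching the real rules:
state = h([1, 2, 3])
for action in (0, 1, 1):
    state, reward = g(state, action)
policy, value = f(state)
print(state)  # (1, 2, 3, 0, 1, 1)
```

The key point the sketch captures is that search operates entirely in the hidden-state space produced by h and g; the real environment is consulted only for the initial observation.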
MuZero matched AlphaZero's performance in chess and shogi, surpassed it in Go, and also achieved state-of-the-art results on a suite of 57 Atari games, surpassing the previous best method (R2D2, Recurrent Replay Distributed DQN) in both mean and median performance across the game suite. The ability to operate without known game rules opened the door to applying the same planning-based approach to environments where the dynamics are not known in advance, which is the case for most real-world problems.
The input to AlphaZero's neural network varies by game:
- Chess: 119 planes of 8x8, as described above (piece positions over the last eight time steps, repetition counts, castling rights, side to move, and move counters).
- Shogi: 362 planes of 9x9, encoding board pieces and pieces in hand over the recent history, plus repetition and move-count features.
- Go: 17 planes of 19x19, encoding the current player's and opponent's stones for the last eight positions, plus one plane indicating the color to move.
Moves are encoded as output planes of the neural network's policy head:
- Chess: 73 planes of 8x8 (4,672 outputs): 56 planes for "queen-style" moves of one to seven squares in each of eight directions, 8 planes for knight moves, and 9 planes for underpromotions.
- Shogi: 139 planes of 9x9 (11,259 outputs), covering piece movements, promotions, and drops of captured pieces.
- Go: one 19x19 plane for stone placements plus a single additional output for passing (362 outputs).
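The chess policy size follows from counting the move-type planes; a quick arithmetic check (the grouping mirrors the plane counts above):

```python
# Size of the chess policy output: 73 planes over an 8x8 board.
queen_moves = 8 * 7       # 8 directions x up to 7 squares = 56 planes
knight_moves = 8          # one plane per knight-move direction
underpromotions = 3 * 3   # 3 pieces (N, B, R) x 3 capture/push directions

planes = queen_moves + knight_moves + underpromotions   # 73
logits = 8 * 8 * planes                                 # one logit per square/plane
print(planes, logits)  # 73 4672
```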
The neural network is trained to minimize a combined loss:
l = (z - v)^2 - pi^T * log(p) + c * ||theta||^2
where z is the actual game outcome (+1 for win, -1 for loss, 0 for draw), v is the predicted value, pi is the MCTS search probability vector, p is the predicted policy, and c * ||theta||^2 is an L2 regularization term that prevents overfitting. The first term trains the value head, the second term trains the policy head, and the regularization term keeps the weights small.
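A minimal pure-Python version of this loss makes the three terms concrete. The regularization constant `c` below is illustrative; the paper's exact value is not restated in this text.

```python
import math

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Combined AlphaZero loss: value MSE + policy cross-entropy + L2.

    z: actual game outcome (+1 win, -1 loss, 0 draw); v: predicted value;
    pi: MCTS visit-count distribution; p: predicted move probabilities;
    theta: flattened network weights; c: L2 constant (illustrative here).
    """
    value_loss = (z - v) ** 2                                  # trains the value head
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p)) # trains the policy head
    l2 = c * sum(w * w for w in theta)                         # keeps weights small
    return value_loss + policy_loss + l2

# Example: a won game, slightly overconfident policy, zero weights.
loss = alphazero_loss(z=1.0, v=0.8,
                      pi=[0.7, 0.2, 0.1], p=[0.6, 0.3, 0.1],
                      theta=[0.0] * 10)
print(f"{loss:.4f}")  # about 0.87
```

Note that the cross-entropy term is minimized when the predicted policy p matches the search distribution pi exactly, which is what drives the network toward the (stronger) move choices found by MCTS.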
AlphaZero, despite its achievements, has several notable limitations.
Training AlphaZero required 5,000 first-generation TPUs for self-play generation and 64 second-generation TPUs for neural network training. This level of hardware is not available to most researchers or organizations. The Leela Chess Zero project demonstrated that similar results could eventually be achieved with distributed consumer hardware, but the training process took months rather than hours.
AlphaZero was never released as open-source software, and the trained models were not made publicly available. The exact training data and hyperparameters, while described in the paper, could not be independently verified. This led to some skepticism in the computer chess community and motivated the creation of open-source alternatives like Leela Chess Zero.
AlphaZero (and AlphaGo Zero before it) requires a perfect simulator of the game environment to perform MCTS. This limits its direct applicability to perfect information, deterministic games where the complete state is known to both players. Games with hidden information (such as poker), stochastic elements (such as backgammon), or continuous action spaces cannot be directly addressed by AlphaZero's algorithm. MuZero partially addressed this limitation by learning the environment model rather than requiring it.
The fairness of the Stockfish comparison was debated. Even in the improved 2018 evaluation, AlphaZero ran on TPU hardware specifically designed for neural network computation, while Stockfish ran on general-purpose CPUs. Some argued that a fair comparison would require both programs to run on equivalent hardware budgets. Additionally, Stockfish has continued to improve significantly since 2018; modern versions of Stockfish (with NNUE evaluation) are estimated to be hundreds of Elo points stronger than the version tested against AlphaZero.
AlphaZero demonstrated that a single, relatively simple algorithm (self-play reinforcement learning with MCTS and a deep neural network) could achieve superhuman performance across multiple board games without any human knowledge. This was a significant result for the field of artificial intelligence because it showed that domain-specific expertise and hand-engineered features, which had been the foundation of game-playing AI for decades, could be entirely replaced by learned representations.
The approach also revealed something about the nature of these games themselves. The fact that AlphaZero could discover, in a matter of hours, strategies that humans had spent centuries developing (and in some cases, strategies that humans had never discovered) raised interesting questions about how much of existing game theory was optimal and how much was simply the result of historical accident and convention.
Chess Grandmaster Matthew Sadler and Women's International Master Natasha Regan documented AlphaZero's chess strategies in their book Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI, published in January 2019 by New in Chess. The book won the English Chess Foundation 2019 Book of the Year award and the FIDE Averbakh-Boleslavsky Award for 2019. It included a foreword by Garry Kasparov and an introduction by Demis Hassabis.
The broader impact of AlphaZero extends beyond games. The general principle of combining learned evaluation functions with tree search has been applied to problems in protein structure prediction, mathematics, and code generation. The demonstration that tabula rasa learning (starting from scratch) could match or exceed decades of accumulated human knowledge inspired new research directions across machine learning and AI.