AlphaZero is a computer program developed by DeepMind that learned to play chess, shogi (Japanese chess), and Go at a superhuman level, starting from zero human knowledge. Rather than relying on handcrafted evaluation functions or databases of expert games, AlphaZero taught itself each game entirely through self-play reinforcement learning, using only the rules of the game as input. The system achieved world-champion-level performance in all three games, within hours of training in chess and shogi and within days in Go, defeating the strongest existing programs: Stockfish in chess, Elmo in shogi, and AlphaGo Zero in Go.
AlphaZero was first described in a preprint paper released on December 5, 2017, and the full peer-reviewed version was published in the journal Science on December 7, 2018, under the title "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." The paper was authored by David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis.
Before AlphaZero, the strongest game-playing programs relied on fundamentally different approaches depending on the game. Chess engines like Stockfish used alpha-beta search with handcrafted evaluation functions tuned by human experts over decades. These evaluation functions assigned numerical weights to features such as material balance, king safety, pawn structure, and piece mobility. In shogi, programs similarly relied on expert-designed heuristics. In Go, the situation was more complex because the branching factor (approximately 250 legal moves per position, compared to about 35 in chess) made traditional search methods impractical, which led DeepMind to develop AlphaGo.
AlphaGo, which famously defeated world champion Lee Sedol in March 2016, used a combination of supervised learning from human expert games and reinforcement learning through self-play. While this was a landmark achievement, the reliance on human game data meant the system was partly constrained by the patterns and strategies humans had discovered. AlphaGo Zero, published in October 2017 in Nature, removed this dependency entirely for Go, learning solely through self-play and surpassing all previous versions of AlphaGo within 40 days of training.
AlphaZero took this idea one step further. The core question behind AlphaZero was whether a single, general-purpose algorithm could master multiple different games without any game-specific modifications or human knowledge beyond the basic rules. The answer turned out to be yes.
The progression from AlphaGo to AlphaZero represents a steady movement toward generality and away from reliance on human expertise.
The original AlphaGo came in several versions. AlphaGo Fan defeated European Go champion Fan Hui 5-0 in October 2015, becoming the first program to beat a professional Go player on a full 19x19 board without handicap. AlphaGo Lee defeated 9-dan professional Lee Sedol 4-1 in March 2016 in a match watched by over 200 million people worldwide. AlphaGo Master went 60-0 against top professionals in online games from December 2016 to January 2017, and later defeated world number one Ke Jie 3-0 at the Future of Go Summit in May 2017.
All versions of AlphaGo were trained initially on a dataset of human expert games (about 160,000 games from online Go servers) using supervised learning. This human data was used to train a policy network that predicted expert moves. The system then improved beyond human level through reinforcement learning via self-play.
AlphaGo Zero eliminated the supervised learning phase entirely. It started from completely random play and used only self-play reinforcement learning. Other differences from the original AlphaGo included combining the policy and value functions into a single neural network (rather than separate networks), using a simpler board representation (raw board positions instead of hand-engineered features), and eliminating Monte Carlo rollouts, relying on the value network alone to evaluate positions. AlphaGo Zero used a ResNet architecture with either 20 or 40 residual blocks.
After just three days of training, AlphaGo Zero defeated the version of AlphaGo that beat Lee Sedol by 100 games to 0. After 40 days of training, it surpassed all previous versions including AlphaGo Master. However, AlphaGo Zero was designed exclusively for Go and exploited Go-specific properties, such as the rotational and reflectional symmetry of the board, to augment its training data eightfold.
AlphaZero generalized the approach of AlphaGo Zero to work across multiple games. To achieve this generality, several Go-specific optimizations were removed. AlphaZero did not use data augmentation based on board symmetries (since chess and shogi boards are not rotationally symmetric). It used the same hyperparameters for all three games, with only minor variations in the neural network architecture to accommodate the different board sizes and move spaces. It also estimated the expected game outcome rather than just the probability of winning, allowing it to account for draws, which occur in chess and shogi but not in Go under standard rules.
AlphaZero combines a deep neural network with Monte Carlo tree search (MCTS). The neural network evaluates board positions and suggests promising moves, while MCTS uses these evaluations to search ahead and select the best action.
The neural network takes a board position as input and produces two outputs: a policy (a probability distribution over legal moves indicating which moves are most promising) and a value (a scalar estimate between -1 and +1 predicting the expected game outcome from the current position).
The architecture is based on a deep residual network (ResNet). The network body consists of one convolutional layer followed by 19 residual blocks. Each residual block contains two convolutional layers with batch normalization and rectified linear unit (ReLU) activations, connected by a skip connection. Each convolution applies 256 filters of kernel size 3x3 with stride 1.
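As a rough sanity check on the tower's size, the convolution weights can be counted directly. This is a back-of-the-envelope sketch: it counts only the 3x3 kernel weights of the chess configuration, ignoring biases, batch-normalization parameters, and the two heads.

```python
# Rough convolution-weight count for AlphaZero's network body (chess).
# Only conv kernels are counted; biases and batch-norm are ignored.
in_planes = 119           # chess input planes (see input representation below)
filters = 256             # filters per convolution
kernel = 3 * 3            # 3x3 kernels, stride 1

stem = in_planes * filters * kernel           # initial convolutional layer
per_block = 2 * (filters * filters * kernel)  # two convolutions per residual block
tower = stem + 19 * per_block                 # 19 residual blocks

print(f"per residual block: {per_block:,} weights")  # 1,179,648
print(f"whole tower:        {tower:,} weights")      # 22,687,488
```

The bulk of the body's roughly 22.7 million convolution weights thus sits in the residual blocks, with the input stem contributing only a small fraction.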
The network splits into two heads after the shared residual tower:
- A policy head, which applies a further convolution to the tower's output and produces a probability distribution over the game's move encoding.
- A value head, which applies a convolution followed by fully connected layers and a tanh activation, producing a scalar evaluation between -1 and +1.
The input representation encodes the board state from the perspective of the current player. For chess, the input consists of 119 planes of 8x8, encoding the positions of all pieces for the last eight board states (to capture history and repetition), plus additional planes for castling rights, move counters, and the color of the current player.
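The 119-plane figure follows from a simple tally. The grouping below is one plausible reading of the paper's description; the exact split of the seven constant planes is an assumption for illustration.

```python
# Tally of the 119 chess input planes: 8 time steps of board history
# plus constant-valued planes for game state.
history_steps = 8
piece_planes = 6 * 2         # 6 piece types for each of the two players
repetition_planes = 2        # repetition counters per historical position
per_step = piece_planes + repetition_planes   # 14 planes per time step

constant_planes = 4 + 1 + 1 + 1  # castling rights (4), side to move,
                                 # total move count, no-progress count

total = history_steps * per_step + constant_planes
print(total)  # 119
```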
AlphaZero uses a variant of MCTS guided by the neural network. During each search, the algorithm builds a tree of possible future positions by repeatedly performing four steps:
1. Selection: starting from the root, descend the tree by choosing at each node the move that maximizes a PUCT score, which balances the current value estimate Q against an exploration bonus proportional to the network's prior probability P and inversely related to the move's visit count N.
2. Expansion: when the search reaches a position not yet in the tree, add it as a new leaf node.
3. Evaluation: evaluate the new leaf with the neural network, obtaining a value estimate for the position and prior probabilities for its legal moves; no random rollouts are performed.
4. Backup: propagate the value estimate back along the traversed path, incrementing visit counts and updating the mean value of each edge.
After the simulations complete, AlphaZero selects its move based on the visit counts at the root.
During training games, AlphaZero performed 800 MCTS simulations per move. Despite searching far fewer positions than traditional engines, the neural network guidance allowed AlphaZero to focus its search on the most relevant lines of play.
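The selection step can be sketched as follows. This is a minimal, illustrative implementation of the PUCT rule used to pick a move inside the search tree; the `Node` structure and the constant `c_puct` are simplifications, not AlphaZero's exact code.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                     # P(s, a) from the policy head
    visit_count: int = 0             # N(s, a)
    value_sum: float = 0.0           # sum of backed-up values
    children: dict = field(default_factory=dict)  # move -> Node

    def q(self) -> float:
        """Mean action value Q(s, a); 0 for unvisited nodes."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.25):
    """Pick the child maximizing Q + U, where U is the exploration bonus."""
    total_visits = sum(c.visit_count for c in node.children.values())
    def puct(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: puct(kv[1]))

# A low-value move with few visits and a reasonable prior can still be
# preferred over a well-explored one, which is how the search explores:
root = Node(prior=1.0)
root.children = {
    "a": Node(prior=0.6, visit_count=10, value_sum=5.0),  # Q = 0.5
    "b": Node(prior=0.4, visit_count=1, value_sum=0.9),   # Q = 0.9
}
move, _ = select_child(root)
print(move)  # "b"
```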
| Metric | AlphaZero (Chess) | Stockfish | AlphaZero (Shogi) | Elmo | AlphaZero (Go) |
|---|---|---|---|---|---|
| Positions evaluated per second | 80,000 | 70,000,000 | 40,000 | 35,000,000 | 16,000 |
| MCTS simulations per move | 800 | N/A | 800 | N/A | 800 |
| Search type | MCTS + neural network | Alpha-beta + handcrafted eval | MCTS + neural network | Alpha-beta + handcrafted eval | MCTS + neural network |
The difference in raw search speed is striking. Stockfish evaluated roughly 875 times more positions per second than AlphaZero in chess, and Elmo evaluated roughly 875 times more positions per second than AlphaZero in shogi. Yet AlphaZero's neural network allowed it to evaluate positions far more accurately, making each evaluation count for much more than the shallow evaluations performed by traditional engines.
AlphaZero was trained entirely through self-play reinforcement learning. The process began with a neural network initialized with random weights, meaning the initial policy was effectively random play. The system then improved iteratively through the following cycle:
1. The current network plays games against itself, using MCTS to select each move.
2. Every position from these games becomes a training example: the MCTS visit-count distribution is the target for the policy head, and the eventual game outcome is the target for the value head.
3. The network weights are updated by gradient descent to bring the policy output closer to the search probabilities and the value output closer to the actual outcomes.
4. The updated network generates new self-play games, and the cycle repeats.
Training proceeded for 700,000 steps (mini-batches of size 4,096 each). The learning rate started at 0.2 and was reduced to 0.02, then 0.002, and finally 0.0002 at predetermined steps during training.
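A stepwise schedule of this form is straightforward to express as a lookup. The rates below come from the text, but the drop boundaries are illustrative placeholders, since the exact step counts are not restated here.

```python
# Stepwise learning-rate schedule of the kind AlphaZero used.
# NOTE: the boundary steps are placeholders for illustration; only the
# rates (0.2 -> 0.02 -> 0.002 -> 0.0002) are taken from the text.
def learning_rate(step: int,
                  boundaries=(100_000, 300_000, 500_000),
                  rates=(0.2, 0.02, 0.002, 0.0002)) -> float:
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]          # final rate after the last boundary

print(learning_rate(0), learning_rate(650_000))  # 0.2 0.0002
```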
AlphaZero was trained using Google's Tensor Processing Units (TPUs). Self-play games were generated on 5,000 first-generation TPUs, while the neural network was trained on 64 second-generation TPUs. These processes ran in parallel: the self-play actors continuously generated games using the latest network checkpoint, while the training process continuously updated the network using the most recent game data.
The amount of training time varied by game complexity:
| Game | Total training time | Games generated during training | Time to surpass existing champion |
|---|---|---|---|
| Chess | ~9 hours | 44 million | ~4 hours (surpassed Stockfish at ~300,000 steps) |
| Shogi | ~12 hours | 24 million | ~2 hours (surpassed Elmo at ~110,000 steps) |
| Go | ~13 days | 21 million | ~8 hours (surpassed AlphaGo Lee at ~165,000 steps) |
The difference in training time reflects the different complexities of each game. Go, with its much larger board (19x19 versus 8x8 or 9x9) and higher branching factor, required significantly more training. Notably, AlphaZero surpassed the strongest existing program in chess and shogi within just a few hours, despite starting from zero knowledge.
For the actual matches against opponents, AlphaZero ran on a single machine with 4 first-generation TPUs and 44 CPU cores. This is the same hardware configuration used by AlphaGo Zero. A first-generation TPU is roughly comparable in inference speed to a commodity GPU such as an NVIDIA Titan V, though the architectures are not directly comparable.
AlphaZero was evaluated against the strongest available program in each game. The 2018 Science paper reported results from 1,000-game matches played under tournament-like time controls (3 hours per side plus a 15-second increment per move).
AlphaZero played 1,000 games against Stockfish, which was the strongest traditional chess engine at the time and the 2016 TCEC (Top Chess Engine Championship) Season 9 superfinal winner. In the updated evaluation published in the 2018 Science paper, Stockfish ran with 44 CPU cores, a 32 GB hash table, and access to Syzygy 6-piece endgame tablebases. These conditions were significantly improved over the initial 2017 preprint, which had been criticized for pairing 64 search threads with a hash table of only 1 GB.
The results were decisive:
| Match | Games | AlphaZero wins | Draws | AlphaZero losses | AlphaZero score |
|---|---|---|---|---|---|
| AlphaZero vs. Stockfish (1,000 games, 3h+15s) | 1,000 | 155 | 839 | 6 | 574.5/1,000 |
AlphaZero won 155 games, lost only 6, and drew 839. The overwhelming majority of games were draws (83.9%), which is typical at the highest levels of chess. But AlphaZero's win-to-loss ratio of roughly 26:1 left no doubt about which program was stronger.
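The match score converts straightforwardly into an approximate Elo difference via the standard logistic model. This is a back-of-the-envelope estimate, not a figure reported in the paper.

```python
import math

# AlphaZero vs. Stockfish, 2018 Science paper, 1,000 games.
wins, draws, losses = 155, 839, 6
games = wins + draws + losses
score = wins + 0.5 * draws          # 574.5 points out of 1,000

# Standard Elo expectation model, inverted to recover the rating gap.
expected = score / games            # 0.5745 expected score per game
elo_gap = -400 * math.log10(1 / expected - 1)
print(f"{score}/{games} -> roughly {elo_gap:.0f} Elo ahead")  # ~52 Elo
```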
The 2018 paper also tested time-odds matches, in which AlphaZero was given progressively less thinking time than Stockfish. AlphaZero continued to outscore Stockfish even when given only one-tenth the thinking time. Stockfish only began to gain an edge when the time odds reached approximately 30:1.
AlphaZero searched roughly 80,000 positions per second in chess, compared to Stockfish's 70 million. Despite evaluating nearly a thousand times fewer positions, AlphaZero's evaluations were far more informed, allowing it to focus its search on the most relevant continuations.
In shogi, AlphaZero faced Elmo, the 2017 World Computer Shogi Championship (WCSC27) winner. Elmo ran under conditions matching those used at the WCSC27 championship, combined with the YaneuraOu search engine. The match used the same time controls as the chess match (3 hours per side plus a 15-second increment).
AlphaZero won 91.2% of games against Elmo. It was particularly dominant when playing sente (first move), achieving a 98.2% win rate. AlphaZero searched approximately 40,000 positions per second in shogi, compared to Elmo's 35 million.
Shogi is in some ways more complex than Western chess because captured pieces can be returned to the board (a rule known as "drops"), which increases the branching factor and makes the game harder for traditional search-based programs. AlphaZero's neural network approach handled this additional complexity without any game-specific modifications.
In Go, AlphaZero played against a 3-day-trained version of AlphaGo Zero. This was a strong opponent, as even AlphaGo Zero trained for 3 days had already surpassed all previous versions of AlphaGo. AlphaZero won 60 of the 100 games played and lost 40, a 60% win rate.
This result is notable because AlphaGo Zero exploited Go-specific symmetries to augment its training data eightfold (through rotations and reflections of the board), while AlphaZero did not use any such augmentation. Despite forgoing that advantage, AlphaZero matched and exceeded the performance of the Go-specific system using a fully general approach.
AlphaZero's playing style attracted intense interest from the chess community because it was so different from conventional computer chess. Traditional engines like Stockfish play in a way that is often described as materialistic: they prioritize maintaining a material advantage and calculate deeply to verify tactical sequences. AlphaZero's approach was strikingly different.
AlphaZero frequently sacrificed material (pawns, pieces, or even a full exchange) in return for long-term positional compensation such as improved piece activity, control of key squares, or a sustained initiative against the opponent's king. This style of play, sometimes called speculative or intuitive, is more commonly associated with attacking human grandmasters than with computer programs.
Chess Grandmaster Matthew Sadler, who analyzed over 2,000 of AlphaZero's games for the book Game Changer (co-authored with Natasha Regan, published January 2019), described AlphaZero's play as remarkable for "the way its pieces swarm around the opponent's king with purpose and power." Sadler compared the experience to "discovering the secret notebooks of some great player from the past."
Former World Chess Champion Garry Kasparov wrote a foreword for Game Changer and commented: "It plays with a very dynamic style, much like my own!" Kasparov, known for his aggressive and dynamic approach during his playing career, expressed enthusiasm about AlphaZero's willingness to sacrifice material for the initiative.
Danish Grandmaster Peter Heine Nielsen, who serves as a second for World Champion Magnus Carlsen, compared AlphaZero's play to that of "a superior alien species." Norwegian Grandmaster Jon Ludvig Hammer described it as "insane attacking chess" combined with deep positional understanding. Yoshiharu Habu, a 9-dan professional shogi player and one of the greatest shogi players in history, said that AlphaZero showed "new possibilities for the game."
Not everyone was equally impressed. Grandmaster Hikaru Nakamura pointed out that AlphaZero ran on Google TPU hardware while Stockfish ran on conventional CPUs, and questioned whether the comparison was fair. Tord Romstad, one of Stockfish's developers, also noted that the conditions in the original 2017 preprint were suboptimal for Stockfish. These concerns were partially addressed in the 2018 Science paper, which gave Stockfish improved hardware settings and access to endgame tablebases.
AlphaZero showed a preference for certain openings that had fallen out of favor in top-level human play. It frequently employed the English Opening and various flank openings as White, and showed that certain positions previously considered equal or slightly better for one side actually contained hidden resources. Its willingness to accept isolated, doubled, or backward pawns in return for piece activity ran counter to decades of computer chess orthodoxy, where engines strongly penalized such structural weaknesses.
AlphaZero's games and approach have had a measurable effect on how humans think about and play chess.
Magnus Carlsen, the World Chess Champion at the time of AlphaZero's publication, cited AlphaZero as a source of inspiration for his play in 2019 and beyond. The willingness to sacrifice material for dynamic compensation, a hallmark of AlphaZero's style, became more common in top-level human games after 2018. Players became more open to positions where traditional engines gave small material disadvantages but where the compensation in activity and initiative was real.
AlphaZero's success inspired the development of open-source neural network chess engines, most notably Leela Chess Zero (Lc0). The Leela Chess Zero project, announced on January 9, 2018 (just weeks after AlphaZero's preprint), attempted to reproduce AlphaZero's approach using distributed computing. Volunteers contributed computing power to generate self-play training games, and over time Lc0 became one of the strongest chess engines in the world, competing directly with Stockfish in major computer chess championships.
Stockfish itself eventually adopted neural network evaluation with the introduction of NNUE (Efficiently Updatable Neural Network) in 2020, moving away from its traditional handcrafted evaluation function. Modern versions of Stockfish combine NNUE evaluation with alpha-beta search, representing a hybrid approach influenced in part by the success of neural network methods demonstrated by AlphaZero.
In a 2020 study published in collaboration with former World Chess Champion Vladimir Kramnik, DeepMind researchers used AlphaZero to evaluate alternative chess rule sets. The paper, "Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess," examined nine chess variants including No-Castling chess, Torpedo chess (where pawns can advance two squares from any rank), Self-Capture chess, and Stalemate-equals-win. By training separate AlphaZero instances on each variant, the researchers could simulate the equivalent of decades of human play within a day and assess which rule changes produced more dynamic, decisive games.
The initial preprint released in December 2017 attracted both excitement and criticism. Several issues were raised by the computer chess community, and DeepMind addressed many of these in the final 2018 Science paper.
| Aspect | 2017 Preprint | 2018 Science Paper |
|---|---|---|
| Time control | 1 minute per move | 3 hours + 15 seconds/move |
| Stockfish version | Stockfish 8 | Stockfish 8 and development Stockfish (Jan 2018) |
| Stockfish hash table | 1 GB | 32 GB |
| Stockfish threads | 64 | 44 (matching TCEC conditions) |
| Endgame tablebases | Not used | 6-piece Syzygy tablebases |
| Number of chess games | 100 | 1,000 |
| Openings | Fixed starting position | Standard starting position, plus additional matches from 2016 TCEC opening positions |
| PUCT variant | Same as AlphaGo Zero | Updated dynamic variant |
The 2018 paper provided Stockfish with substantially better conditions, including access to endgame tablebases (which are critical for precise endgame play) and a much larger hash table. Despite these improvements, AlphaZero's dominance remained clear.
The following table compares the four major iterations of DeepMind's game-playing AI systems.
| Feature | AlphaGo | AlphaGo Zero | AlphaZero | MuZero |
|---|---|---|---|---|
| Year | 2015-2017 | 2017 | 2017-2018 | 2019-2020 |
| Publication venue | Nature (2016) | Nature (2017) | Science (2018) | Nature (2020) |
| Games played | Go only | Go only | Chess, shogi, Go | Chess, shogi, Go, Atari (57 games) |
| Human data required | Yes (160,000 expert games) | No | No | No |
| Game rules required | Yes | Yes | Yes | No (learns a model of the environment) |
| Network architecture | Separate policy and value networks | Single dual-headed ResNet (20 or 40 blocks) | Single dual-headed ResNet (19 blocks, 256 filters) | Representation, dynamics, and prediction networks |
| Training method | Supervised learning + RL self-play | RL self-play only | RL self-play only | RL self-play only |
| Board symmetry exploitation | Yes | Yes (8x augmentation) | No | No |
| Search algorithm | MCTS with rollouts | MCTS with neural network evaluation | MCTS with neural network evaluation | MCTS with learned model |
| Key advancement | First to beat professional Go player | Removed need for human data in Go | Generalized to multiple games | Removed need for known game rules |
MuZero, published by DeepMind as a preprint in November 2019 and in Nature in December 2020, extended AlphaZero's approach by removing the requirement for a known game model. While AlphaZero needed the exact rules of each game to simulate future positions during MCTS, MuZero learned its own internal model of the environment. This model did not attempt to reconstruct the full game state; instead, it learned to predict only the quantities relevant to planning: the reward, the policy, and the value.
MuZero achieved this through three neural networks working together:
- A representation network, which encodes the current observation (for example, the board position) into an internal hidden state.
- A dynamics network, which takes a hidden state and a candidate action and predicts the resulting hidden state and the immediate reward.
- A prediction network, which maps a hidden state to a policy and a value, playing the same role as AlphaZero's two-headed network.
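The division of labor among the three networks can be sketched as a planning interface. The functions below are trivial stand-ins, not learned networks; the names h, g, and f follow the MuZero paper's notation, and everything else is illustrative.

```python
# Toy stand-ins for MuZero's three networks, showing the data flow
# during planning. Real versions are learned neural networks.
def h(observation):                  # representation: observation -> hidden state
    return tuple(observation)

def g(state, action):                # dynamics: (state, action) -> (next state, reward)
    return state + (action,), 0.0    # toy transition with zero reward

def f(state):                        # prediction: state -> (policy, value)
    policy = {0: 0.5, 1: 0.5}        # uniform toy policy over two actions
    return policy, 0.0

# Planning unrolls the learned model without ever touching the real rules:
state = h([1, 2, 3])
for action in (0, 1, 1):
    state, reward = g(state, action)
policy, value = f(state)
print(state)  # (1, 2, 3, 0, 1, 1)
```

The key point the sketch captures is that search operates entirely in the hidden-state space produced by h and g; the real environment is consulted only for the initial observation.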
MuZero matched AlphaZero's performance in chess and shogi, surpassed it in Go, and also achieved state-of-the-art results on a suite of 57 Atari games, surpassing the previous best method (R2D2, Recurrent Replay Distributed DQN) in both mean and median performance across the game suite. The ability to operate without known game rules opened the door to applying the same planning-based approach to environments where the dynamics are not known in advance, which is the case for most real-world problems.
The input to AlphaZero's neural network varies by game:
- Chess: 119 planes of 8x8, as described above (piece positions over the last eight time steps, repetition counts, castling rights, side to move, and move counters).
- Shogi: 362 planes of 9x9, encoding board pieces and pieces in hand over the recent history, plus repetition and move-count features.
- Go: 17 planes of 19x19, encoding the current player's and opponent's stones for the last eight positions, plus one plane indicating the color to move.
Moves are encoded as output planes of the neural network's policy head:
- Chess: 73 planes of 8x8 (4,672 outputs): 56 planes for "queen-style" moves of one to seven squares in each of eight directions, 8 planes for knight moves, and 9 planes for underpromotions.
- Shogi: 139 planes of 9x9 (11,259 outputs), covering piece movements, promotions, and drops of captured pieces.
- Go: one 19x19 plane for stone placements plus a single additional output for passing (362 outputs).
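The chess policy size follows from counting the move-type planes; a quick arithmetic check (the grouping mirrors the plane counts above):

```python
# Size of the chess policy output: 73 planes over an 8x8 board.
queen_moves = 8 * 7       # 8 directions x up to 7 squares = 56 planes
knight_moves = 8          # one plane per knight-move direction
underpromotions = 3 * 3   # 3 pieces (N, B, R) x 3 capture/push directions

planes = queen_moves + knight_moves + underpromotions   # 73
logits = 8 * 8 * planes                                 # one logit per square/plane
print(planes, logits)  # 73 4672
```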
The neural network is trained to minimize a combined loss:
l = (z - v)^2 - pi^T * log(p) + c * ||theta||^2
where z is the actual game outcome (+1 for win, -1 for loss, 0 for draw), v is the predicted value, pi is the MCTS search probability vector, p is the predicted policy, and c * ||theta||^2 is an L2 regularization term that prevents overfitting. The first term trains the value head, the second term trains the policy head, and the regularization term keeps the weights small.
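A minimal pure-Python version of this loss makes the three terms concrete. The regularization constant `c` below is illustrative; the paper's exact value is not restated in this text.

```python
import math

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Combined AlphaZero loss: value MSE + policy cross-entropy + L2.

    z: actual game outcome (+1 win, -1 loss, 0 draw); v: predicted value;
    pi: MCTS visit-count distribution; p: predicted move probabilities;
    theta: flattened network weights; c: L2 constant (illustrative here).
    """
    value_loss = (z - v) ** 2                                  # trains the value head
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p)) # trains the policy head
    l2 = c * sum(w * w for w in theta)                         # keeps weights small
    return value_loss + policy_loss + l2

# Example: a won game, slightly overconfident policy, zero weights.
loss = alphazero_loss(z=1.0, v=0.8,
                      pi=[0.7, 0.2, 0.1], p=[0.6, 0.3, 0.1],
                      theta=[0.0] * 10)
print(f"{loss:.4f}")  # about 0.87
```

Note that the cross-entropy term is minimized when the predicted policy p matches the search distribution pi exactly, which is what drives the network toward the (stronger) move choices found by MCTS.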
AlphaZero, despite its achievements, has several notable limitations.
Training AlphaZero required 5,000 first-generation TPUs for self-play generation and 64 second-generation TPUs for neural network training. This level of hardware is not available to most researchers or organizations. The Leela Chess Zero project demonstrated that similar results could eventually be achieved with distributed consumer hardware, but the training process took months rather than hours.
AlphaZero was never released as open-source software, and the trained models were not made publicly available. The exact training data and hyperparameters, while described in the paper, could not be independently verified. This led to some skepticism in the computer chess community and motivated the creation of open-source alternatives like Leela Chess Zero.
AlphaZero (and AlphaGo Zero before it) requires a perfect simulator of the game environment to perform MCTS. This limits its direct applicability to perfect information, deterministic games where the complete state is known to both players. Games with hidden information (such as poker), stochastic elements (such as backgammon), or continuous action spaces cannot be directly addressed by AlphaZero's algorithm. MuZero partially addressed this limitation by learning the environment model rather than requiring it.
The fairness of the Stockfish comparison was debated. Even in the improved 2018 evaluation, AlphaZero ran on TPU hardware specifically designed for neural network computation, while Stockfish ran on general-purpose CPUs. Some argued that a fair comparison would require both programs to run on equivalent hardware budgets. Additionally, Stockfish has continued to improve significantly since 2018; modern versions of Stockfish (with NNUE evaluation) are estimated to be hundreds of Elo points stronger than the version tested against AlphaZero.
AlphaZero demonstrated that a single, relatively simple algorithm (self-play reinforcement learning with MCTS and a deep neural network) could achieve superhuman performance across multiple board games without any human knowledge. This was a significant result for the field of artificial intelligence because it showed that domain-specific expertise and hand-engineered features, which had been the foundation of game-playing AI for decades, could be entirely replaced by learned representations.
The approach also revealed something about the nature of these games themselves. The fact that AlphaZero could discover, in a matter of hours, strategies that humans had spent centuries developing (and in some cases, strategies that humans had never discovered) raised interesting questions about how much of existing game theory was optimal and how much was simply the result of historical accident and convention.
Chess Grandmaster Matthew Sadler and Women's International Master Natasha Regan documented AlphaZero's chess strategies in their book Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI, published in January 2019 by New in Chess. The book won the English Chess Foundation 2019 Book of the Year award and the FIDE Averbakh-Boleslavsky Award for 2019. It included a foreword by Garry Kasparov and an introduction by Demis Hassabis.
The broader impact of AlphaZero extends beyond games. The general principle of combining learned evaluation functions with tree search has been applied to problems in protein structure prediction, mathematics, and code generation. The demonstration that tabula rasa learning (starting from scratch) could match or exceed decades of accumulated human knowledge inspired new research directions across machine learning and AI.