AlphaGo Zero
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,461 words
AlphaGo Zero is a computer program developed by DeepMind that learned to play the board game Go at a superhuman level entirely through self-play, without any human game data. It was introduced in the paper "Mastering the game of Go without human knowledge," published in the journal Nature in October 2017. Unlike the earlier versions of AlphaGo, which were trained in part on records of games played by human experts, AlphaGo Zero began from random play and learned solely from the rules of Go and the outcomes of games it played against itself. DeepMind reported that the resulting program defeated the version that had beaten world champion Lee Sedol by 100 games to 0.[1][2][3]
The first version of AlphaGo defeated the professional player Fan Hui in 2015 and then beat Lee Sedol, one of the strongest players in the world, by four games to one in March 2016. That system combined supervised learning from roughly 30 million positions taken from human games with reinforcement learning from self-play, and it relied on two separate neural networks together with handcrafted features and fast random "rollout" simulations. A later, stronger iteration known as AlphaGo Master won 60 straight online games against top professionals in early 2017 and defeated the world's number-one ranked player, Ke Jie, in a three-game match in May 2017.[2][3][4]
AlphaGo Zero was DeepMind's attempt to remove human knowledge from the loop almost entirely. The motivating question, as the team framed it, was whether a program could match or exceed the strongest versions of AlphaGo using reinforcement learning alone, starting from a blank slate. David Silver led the research at DeepMind, and Demis Hassabis was among the authors.[1][2][5]
AlphaGo Zero was trained tabula rasa, a Latin phrase meaning "blank slate." At the start of training it knew nothing about Go beyond the rules of the game, and it received no examples of human play. Learning proceeded purely through self-play reinforcement learning: the program played games against itself, used the results to update a single neural network, and then played stronger games with the improved network. Over the course of millions of self-play games, the system progressively discovered the principles of strong Go play on its own.[1][2][6]
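As an illustration of how finished self-play games can become training data, the sketch below attaches the final result to every position of one game, scored from the point of view of the player who was about to move. It is a minimal example under stated assumptions, not DeepMind's code: the names `Example` and `label_game`, the string placeholders for board states, and the two-move probability lists are all hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Example(NamedTuple):
    state: str                 # placeholder for an encoded board position (assumption)
    search_probs: List[float]  # move probabilities produced by the search at this position
    outcome: int               # +1 if the player to move went on to win, otherwise -1


def label_game(
    states: Sequence[str],
    search_probs: Sequence[Sequence[float]],
    players: Sequence[str],
    winner: str,
) -> List[Example]:
    """Attach the final result to every position of one finished self-play game."""
    examples = []
    for state, probs, player in zip(states, search_probs, players):
        # The outcome is expressed from the perspective of the player to move.
        outcome = 1 if player == winner else -1
        examples.append(Example(state, list(probs), outcome))
    return examples


# Hypothetical three-move game won by Black.
game = label_game(
    states=["s0", "s1", "s2"],
    search_probs=[[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]],
    players=["black", "white", "black"],
    winner="black",
)
print([ex.outcome for ex in game])
```

Positions where the eventual winner was to move receive +1 and the others receive -1, so the same game supplies both positive and negative evaluation targets.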
The procedure works by using the network's predictions to guide a tree search, which in turn produces better moves than the raw network alone. Those search results then serve as training targets, so the network is trained to imitate its own improved search behavior and to predict the eventual winner of each game. Because the search consistently outperforms the network it is built on, each cycle yields a stronger player, which generates higher-quality games for the next cycle. DeepMind characterized the early phase of training as the program rediscovering well-known Go knowledge such as standard corner sequences, before moving on to novel strategies that departed from established human play.[1][2][6]
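The two training targets described above are commonly summarized as a single loss: a squared error between the predicted value and the game outcome, plus a cross-entropy between the search probabilities and the network's move probabilities, plus a weight-decay term. The following is a minimal per-position sketch of that form. The function name `training_loss`, the flat `weights` argument, and the default `l2_coeff` are illustrative assumptions rather than details taken from this article.

```python
import math
from typing import Sequence


def training_loss(
    search_probs: Sequence[float],
    predicted_probs: Sequence[float],
    outcome: float,
    predicted_value: float,
    weights: Sequence[float] = (),
    l2_coeff: float = 1e-4,  # illustrative default (assumption)
) -> float:
    """Loss for one position: value error + policy cross-entropy + L2 penalty."""
    eps = 1e-12  # guards against log(0)
    value_error = (outcome - predicted_value) ** 2
    cross_entropy = -sum(
        pi * math.log(p + eps) for pi, p in zip(search_probs, predicted_probs)
    )
    l2_penalty = l2_coeff * sum(w * w for w in weights)
    return value_error + cross_entropy + l2_penalty


# A prediction that agrees with the search and the result scores lower
# than one that disagrees with both.
good = training_loss([1.0, 0.0], [0.99, 0.01], outcome=1, predicted_value=0.9)
bad = training_loss([1.0, 0.0], [0.2, 0.8], outcome=1, predicted_value=-0.5)
print(good < bad)
```

Minimizing this quantity pulls the move-probability output toward the search result and the evaluation output toward the actual winner, which is the imitation step the paragraph describes.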
According to DeepMind, AlphaGo Zero surpassed the strength of the Lee Sedol version after about three days of training, during which it played roughly 4.9 million games against itself. Reporting on the work noted that the system reached the level of AlphaGo Master after about 21 days and continued improving through a full training run of 40 days.[1][2][3][6]
AlphaGo Zero used a single deep neural network rather than the two networks used by earlier versions. In the original AlphaGo, a "policy network" selected candidate moves and a separate "value network" estimated the probability of winning from a position. AlphaGo Zero combined these into one network with two outputs, a move-probability output and a position-evaluation output, sharing the same body. The network was a residual (ResNet-style) convolutional architecture, evaluated in versions with 20 and 40 residual blocks, and it took as input only the raw board position and the history of stones rather than handcrafted features.[1][2][6]
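The idea of one shared body feeding two outputs can be shown with a very small network. The sketch below is a deliberate simplification and not the published architecture: it uses small dense layers and a single residual connection in place of a deep convolutional residual tower, and the class name `TwoHeadedNet`, the layer sizes, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def dense(x, w, b):
    """Affine layer: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def init(rows, cols, rng):
    """Small random weight matrix and a zero bias vector."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return w, [0.0] * rows


class TwoHeadedNet:
    """Toy shared-body network with a policy head and a value head."""

    def __init__(self, n_inputs, n_hidden, n_moves, seed=0):
        rng = random.Random(seed)
        self.w_in, self.b_in = init(n_hidden, n_inputs, rng)
        self.w_res, self.b_res = init(n_hidden, n_hidden, rng)
        self.w_pol, self.b_pol = init(n_moves, n_hidden, rng)
        self.w_val, self.b_val = init(1, n_hidden, rng)

    def forward(self, board):
        h = relu(dense(board, self.w_in, self.b_in))
        # Residual connection: add the block's input back before the nonlinearity.
        r = dense(h, self.w_res, self.b_res)
        h = relu([a + b for a, b in zip(h, r)])
        # Both heads read the same shared features.
        policy = softmax(dense(h, self.w_pol, self.b_pol))      # move probabilities
        value = math.tanh(dense(h, self.w_val, self.b_val)[0])  # evaluation in [-1, 1]
        return policy, value


# Hypothetical 3x3 board encoded as 9 numbers, with 10 possible moves (9 points plus pass).
net = TwoHeadedNet(n_inputs=9, n_hidden=16, n_moves=10)
policy, value = net.forward([0.0, 1.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0])
print(len(policy), round(sum(policy), 6), -1.0 <= value <= 1.0)
```

Because the two heads share one body, a single forward pass yields both the move probabilities and the position evaluation that the search needs.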
That network was paired with Monte Carlo tree search. During play and self-play, the search used the network to evaluate positions and to bias its exploration toward promising moves, then returned an improved set of move probabilities. A notable simplification compared with earlier versions was that AlphaGo Zero did not use rollouts, the fast, randomized playouts that previous Go programs (including the first AlphaGo) had relied on to estimate the value of a position. Instead it depended entirely on the network's own evaluations. Reviewers and DeepMind alike described the resulting design as simpler yet more powerful than its predecessors.[1][2][6]
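A common way to describe this kind of network-guided search is a selection rule that adds an exploration bonus, scaled by the network's prior for a move, to the move's average evaluation so far. The search's visit counts are then normalized into the improved move probabilities. The sketch below shows both steps for a single node under stated assumptions: the names `Edge`, `select_move`, and `search_policy`, the constant `c_puct=1.5`, and the example statistics are hypothetical, and a full search would also expand leaves with the network and back up their evaluations.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float            # move probability from the network's policy output
    visits: int = 0         # number of times the search has tried this move
    value_sum: float = 0.0  # sum of network evaluations backed up through this move

    @property
    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def select_move(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the move maximizing mean value plus a prior-weighted exploration bonus."""
    total_visits = sum(e.visits for e in edges.values())

    def score(e: Edge) -> float:
        bonus = c_puct * e.prior * math.sqrt(total_visits) / (1 + e.visits)
        return e.mean_value + bonus

    return max(edges, key=lambda move: score(edges[move]))


def search_policy(edges: Dict[str, Edge], temperature: float = 1.0) -> Dict[str, float]:
    """Turn visit counts into move probabilities (the search's improved policy)."""
    weights = {m: e.visits ** (1.0 / temperature) for m, e in edges.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Hypothetical statistics at one node after 12 simulations.
edges = {
    "A": Edge(prior=0.6, visits=10, value_sum=2.0),
    "B": Edge(prior=0.3, visits=2, value_sum=1.2),
    "C": Edge(prior=0.1, visits=0),
}
print(select_move(edges))
print(search_policy(edges))
```

In this example the rarely visited move "B" is chosen next because its average evaluation is higher and its bonus has not yet decayed. No random playout appears anywhere: every evaluation entering `value_sum` is assumed to come from the network.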
The program was also far more efficient in hardware terms. AlphaGo Zero ran on a single machine with four tensor processing units (TPUs), whereas the Lee Sedol version of AlphaGo had been distributed across many machines and used 48 TPUs.[1][2]
DeepMind reported a series of head-to-head results and Elo-style strength estimates showing that AlphaGo Zero overtook every earlier version. The most widely cited figure is the 100-0 sweep against AlphaGo Lee, the configuration that had defeated Lee Sedol in 2016. In a 100-game match against AlphaGo Master, AlphaGo Zero won 89 games to 11.[1][2][3]
| Version | Training data | Hardware | Approx. Elo (DeepMind) | Notable result |
|---|---|---|---|---|
| AlphaGo Lee (2016) | Human games + self-play | 48 TPUs, distributed | ~3,739 | Beat Lee Sedol 4-1 |
| AlphaGo Master (2017) | Human games + self-play | Single machine | ~4,858 | Beat Ke Jie; 60-0 online vs pros |
| AlphaGo Zero (40 days) | Self-play only (tabula rasa) | Single machine, 4 TPUs | ~5,185 | 100-0 vs Lee; 89-11 vs Master |
The Elo figures above are the internal ratings reported by DeepMind for comparison across versions and are not directly comparable to human tournament ratings. The headline points are consistent across DeepMind's own account and independent press coverage: a program trained without human data became the strongest Go player DeepMind had produced, and it did so with a fraction of the training time and computing power used for the version that beat Lee Sedol.[1][2][3]
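To give a rough sense of what rating gaps of this size imply, the sketch below applies the conventional base-10 logistic Elo formula to the approximate ratings in the table. This is an assumption made for illustration: DeepMind's internal rating procedure may differ in detail, and the function name `expected_score` is hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Approximate ratings taken from the table above.
zero, master, lee = 5185, 4858, 3739

print(round(expected_score(zero, master), 3))  # AlphaGo Zero vs AlphaGo Master
print(round(expected_score(zero, lee), 5))     # AlphaGo Zero vs AlphaGo Lee
```

Under that assumption the expected score against AlphaGo Master comes out at about 0.87 and the expected score against AlphaGo Lee is above 0.999, which is broadly in line with the reported 89-11 and 100-0 match results.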
AlphaGo Zero was widely covered as evidence that a system could reach superhuman performance in a complex domain through reinforcement learning alone, without bootstrapping from human expertise. Because human game records were removed from the pipeline, the program was not anchored to human conventions and at times produced moves and openings that diverged from centuries of accumulated Go theory. DeepMind's leadership presented this as a step toward more general learning systems, suggesting that similar self-play approaches might apply to problems where high-quality human data is scarce or unavailable. Commentators were careful to note the limits of the claim, since Go is a deterministic, perfect-information game with clearly defined rules, which makes self-play training tractable in a way that many real-world problems are not.[1][3][5]
The result also simplified the recipe established by the original AlphaGo. By folding two networks into one, dropping handcrafted features, and removing rollouts, AlphaGo Zero showed that a cleaner architecture could outperform a more complex predecessor, a finding that influenced how later self-play systems were designed.[2][6]
AlphaGo Zero is closely related to, but distinct from, AlphaZero. AlphaGo Zero was specialized for Go. In December 2017, DeepMind released a preprint describing AlphaZero, a single algorithm based on the same self-play and search principles that was applied without game-specific modification to chess, shogi, and Go. AlphaZero was reported to reach a superhuman level in each of those games within about 24 hours of training and to defeat strong existing programs, including the chess engine Stockfish, the shogi engine Elmo, and the three-day version of AlphaGo Zero itself.[7][8]
The lineage continued with MuZero, which DeepMind described in a 2019 preprint and a 2020 paper in Nature. MuZero extended the approach to settings where the rules of the environment are not given in advance by learning a model of the environment's dynamics as part of training. Together, AlphaGo, AlphaGo Zero, AlphaZero, and MuZero are often presented as successive steps in DeepMind's program of game-playing reinforcement learning, each removing another piece of built-in knowledge.[7][8]