AlphaGo Zero

Updated Jun 27, 2026

AlphaGo Zero is a Go-playing computer program developed by DeepMind that reached a superhuman level entirely through self-play reinforcement learning, with no human game data. It was introduced in the paper "Mastering the game of Go without human knowledge," published in the journal Nature (volume 550, pages 354-359) on 19 October 2017.^[1]^[2] Unlike earlier versions of AlphaGo, which were trained in part on records of games played by human experts, AlphaGo Zero began from random play and learned solely from the rules of Go and the outcomes of games it played against itself, using a single neural network (with combined policy and value outputs) guided by Monte Carlo tree search. DeepMind reported that after about three days of training it defeated AlphaGo Lee, the version that had beaten world champion Lee Sedol, by 100 games to 0.^[1]^[2]^[3] The same self-play approach was generalized in December 2017 into AlphaZero, which mastered chess, shogi, and Go with a single algorithm.^[7]

The Nature paper summarized the result in its abstract: "Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo."^[2]

What is AlphaGo Zero?

AlphaGo Zero is the version of DeepMind's AlphaGo program that learned to play Go without any human knowledge beyond the rules of the game. Where the original AlphaGo bootstrapped from a database of human expert games, AlphaGo Zero started from a blank slate (tabula rasa) and improved purely by playing against itself. It used one deep neural network with two outputs, a move-probability (policy) output and a position-evaluation (value) output, paired with Monte Carlo tree search to choose moves. The headline outcome was a 100-0 record against AlphaGo Lee, the configuration that defeated Lee Sedol 4-1 in March 2016.^[1]^[2]^[3]

The research was led by David Silver, with Demis Hassabis among the 17 listed authors, and was carried out at DeepMind.^[1]^[2]

Background

The first version of AlphaGo defeated the professional player Fan Hui in 2015 and then beat Lee Sedol, one of the strongest players in the world, by four games to one in March 2016. That system combined supervised learning from roughly 30 million positions taken from human games with reinforcement learning from self-play, and it relied on two separate neural networks together with handcrafted features and fast random "rollout" simulations. A later, stronger iteration known as AlphaGo Master won 60 straight online games against top professionals in early 2017 and defeated the world's number-one ranked player, Ke Jie, in a three-game match in May 2017.^[2]^[3]^[4]

AlphaGo Zero was DeepMind's attempt to remove human knowledge from the loop almost entirely. The motivating question, as the team framed it, was whether a program could reach or exceed the strongest versions of AlphaGo using reinforcement learning alone, starting from a blank slate.^[1]^[2]^[5]

How did AlphaGo Zero learn without human data?

AlphaGo Zero was trained tabula rasa, a Latin phrase meaning "blank slate." At the start of training it knew nothing about Go beyond the rules of the game, and it received no examples of human play. As DeepMind described it, "AlphaGo Zero skips this step [initial training on human games] and learns to play simply by playing games against itself, starting from completely random play."^[1] Learning proceeded purely through self-play reinforcement learning: the program played games against itself, used the results to update a single neural network, and then played stronger games with the improved network. Over millions of self-play games, the system progressively discovered the principles of strong Go play on its own.^[1]^[2]^[6]

The procedure works by using the network's predictions to guide a tree search, which in turn produces better moves than the raw network alone. Those search results then serve as training targets, so the network is trained to imitate its own improved search behavior and to predict the eventual winner of each game. Because the search consistently outperforms the network it is built on, each cycle yields a stronger player, which generates higher-quality games for the next cycle. DeepMind characterized the early phase of training as the program rediscovering well-known Go knowledge such as standard corner sequences, before moving on to novel strategies that departed from established human play.^[1]^[2]^[6]
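The training signal described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepMind's code. It assumes the search returns per-move visit counts and turns them into a target distribution using a temperature parameter. It then combines a squared-error value term with a cross-entropy policy term, which is the general form of the loss given in the Nature paper (weight regularization is omitted here). The function names and example numbers are hypothetical.

```python
import math

def search_policy(visit_counts, temperature=1.0):
    """Turn search visit counts into a probability distribution over moves.

    Assumes pi(a) is proportional to N(a) ** (1 / temperature).
    """
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def training_loss(pred_policy, pred_value, target_policy, outcome, eps=1e-12):
    """Squared value error plus policy cross-entropy for one position.

    outcome is +1 if the player to move went on to win, -1 otherwise.
    Weight regularization is left out of this sketch.
    """
    value_loss = (outcome - pred_value) ** 2
    policy_loss = -sum(
        t * math.log(p + eps) for t, p in zip(target_policy, pred_policy)
    )
    return value_loss + policy_loss

# Hypothetical position with three legal moves; the search prefers the first.
pi = search_policy([80, 15, 5])

# A network that is uniform and undecided is far from the search targets...
loss_far = training_loss([1 / 3, 1 / 3, 1 / 3], 0.0, pi, outcome=1)
# ...while one that already matches the search and predicts the win is close.
loss_near = training_loss([0.8, 0.15, 0.05], 0.9, pi, outcome=1)
```

In this toy case `loss_near` is smaller than `loss_far`, which is the direction training pushes the network: toward the search's move distribution and toward the eventual game result.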

According to DeepMind, AlphaGo Zero surpassed the strength of the Lee Sedol version after about three days (roughly 72 hours) of training, during which it played about 4.9 million games against itself.^[1]^[3] Reporting on the work noted that the system reached the level of AlphaGo Master after about 21 days and had exceeded all previous versions by the end of a 40-day training run.^[1]^[2]^[3]^[6]

Training milestone	Approximate time	Outcome
Surpasses AlphaGo Lee	~3 days (~72 hours)	100-0 vs AlphaGo Lee; ~4.9M self-play games
Reaches AlphaGo Master level	~21 days	Matches the version that beat Ke Jie
Exceeds all prior versions	~40 days	Strongest version DeepMind had produced

What is the architecture of AlphaGo Zero?

AlphaGo Zero used a single deep neural network rather than the two networks used by earlier versions. As DeepMind put it, "It uses one neural network rather than two. Earlier versions of AlphaGo used a 'policy network' to select the next move to play and a 'value network' to predict the winner of the game from each position."^[1] AlphaGo Zero combined these into one network with two outputs, a move-probability output and a position-evaluation output, sharing the same body. The network was a residual (ResNet-style) convolutional architecture, evaluated in versions with 20 and 40 residual blocks, and it took as input only the raw board position and the history of stones rather than handcrafted features.^[1]^[2]^[6]
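The "one body, two outputs" idea can be illustrated with a deliberately tiny sketch. The code below is not the AlphaGo Zero network. It swaps the residual convolutional body for a single fully connected hidden layer so that it runs with the Python standard library alone. The class name, layer sizes, and board encoding are all assumptions made for illustration. The only point it shows is that a policy head (softmax over moves) and a value head (a scalar squashed into [-1, 1]) read the same shared representation.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TwoHeadedNet:
    """Toy stand-in: a shared hidden layer feeding a policy and a value head."""

    def __init__(self, n_inputs, n_hidden, n_moves, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.body_w, self.body_b = init(n_hidden, n_inputs), [0.0] * n_hidden
        self.policy_w, self.policy_b = init(n_moves, n_hidden), [0.0] * n_moves
        self.value_w, self.value_b = init(1, n_hidden), [0.0]

    def forward(self, board_features):
        # Shared body: both heads read the same hidden representation.
        hidden = [max(0.0, h)
                  for h in dense(board_features, self.body_w, self.body_b)]
        policy = softmax(dense(hidden, self.policy_w, self.policy_b))
        value = math.tanh(dense(hidden, self.value_w, self.value_b)[0])
        return policy, value

# Hypothetical 3x3 board: +1 own stone, -1 opponent stone, 0 empty.
net = TwoHeadedNet(n_inputs=9, n_hidden=16, n_moves=10)  # 9 points + pass
policy, value = net.forward([0, 1, 0, -1, 0, 0, 0, 0, 1])
```

Here `policy` is a probability distribution over the ten hypothetical moves and `value` is a single number between -1 and 1. With untrained random weights neither output is meaningful. The sketch only shows the shape of the interface that the tree search consumes.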

That network was paired with Monte Carlo tree search. During play and self-play, the search used the network to evaluate positions and to bias its exploration toward promising moves, then returned an improved set of move probabilities. A notable simplification compared with earlier versions was that AlphaGo Zero did not use rollouts, the fast, randomized playouts that previous Go programs (including the first AlphaGo) had relied on to estimate the value of a position. Instead it depended entirely on the network's own evaluations. Reviewers and DeepMind alike described the resulting design as simpler yet more powerful than its predecessors.^[1]^[2]^[6]
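How the network's move probabilities bias the search can be sketched with the PUCT-style selection rule described in the Nature paper: at each node the search picks the move with the highest sum of its mean value so far and an exploration bonus that grows with the network prior and shrinks with the visit count. The snippet below is a simplified illustration, not DeepMind's implementation. The function name, the flat-list node representation, and the value of `c_puct` are assumptions.

```python
import math

def select_move(priors, visit_counts, total_values, c_puct=1.5):
    """Pick the child move that maximizes Q + U.

    priors:        network move probabilities P(a)
    visit_counts:  number of times each move has been tried, N(a)
    total_values:  sum of the evaluations backed up through each move
    c_puct:        exploration constant (1.5 is an arbitrary choice here)
    """
    parent_visits = sum(visit_counts)
    best_move, best_score = None, -math.inf
    for move, (p, n, w) in enumerate(zip(priors, visit_counts, total_values)):
        q = w / n if n > 0 else 0.0  # mean value of this move so far
        u = c_puct * p * math.sqrt(parent_visits) / (1 + n)  # exploration bonus
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move

# Only the top-prior move has been tried and it has looked poor so far,
# so the bonus steers the search to the next most promising move.
explore = select_move([0.6, 0.3, 0.1], [10, 0, 0], [-2.0, 0.0, 0.0])

# With equal visits everywhere, the move with the best mean value
# and the highest prior is chosen.
exploit = select_move([0.6, 0.3, 0.1], [10, 10, 10], [5.0, 1.0, 1.0])
```

In these two hypothetical cases the first call selects move 1 and the second selects move 0. When no child has been visited, every bonus in this sketch is zero and the first move wins the tie. A fuller implementation would handle that case explicitly.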

The program was also far more efficient in hardware terms. AlphaGo Zero ran on a single machine with four tensor processing units (TPUs), whereas the Lee Sedol version of AlphaGo had been distributed across many machines and used 48 TPUs.^[1]^[2]

How strong was AlphaGo Zero?

DeepMind reported a series of head-to-head results and Elo-style strength estimates showing that AlphaGo Zero overtook every earlier version. The most widely cited figure is the 100-0 sweep against AlphaGo Lee, the configuration that had defeated Lee Sedol in 2016, which AlphaGo Zero achieved after about three days of training. In a separate 100-game match against AlphaGo Master, the fully trained 40-day version won 89 games to 11.^[1]^[2]^[3]

Version	Training data	Hardware	Approx. Elo (DeepMind)	Notable result
AlphaGo Lee (2016)	Human games + self-play	48 TPUs, distributed	~3,739	Beat Lee Sedol 4-1
AlphaGo Master (2017)	Human games + self-play	Single machine	~4,858	Beat Ke Jie; 60-0 online vs pros
AlphaGo Zero (40 days)	Self-play only (tabula rasa)	Single machine, 4 TPUs	~5,185	89-11 vs Master (three-day version: 100-0 vs Lee)

The Elo figures above are the internal ratings reported by DeepMind for comparison across versions and are not directly comparable to human tournament ratings. The headline points are consistent across DeepMind's own account and independent press coverage: a program trained without human data became the strongest Go player DeepMind had produced, and it did so in a fraction of the time and with a fraction of the computing power used by the version that beat Lee Sedol.^[1]^[2]^[3]
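As a rough sanity check on the ratings in the table, a rating gap can be converted into an expected score. The sketch below assumes the standard logistic Elo formula with a 400-point scale. The source does not spell out DeepMind's exact rating procedure, so treat the output as illustrative only.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Approximate ratings taken from the table above.
ZERO, MASTER, LEE = 5185, 4858, 3739

zero_vs_master = expected_score(ZERO, MASTER)
zero_vs_lee = expected_score(ZERO, LEE)

print(f"AlphaGo Zero vs AlphaGo Master: {zero_vs_master:.3f}")
print(f"AlphaGo Zero vs AlphaGo Lee:    {zero_vs_lee:.4f}")
```

Under that assumption, the 327-point gap to AlphaGo Master corresponds to an expected score of roughly 0.87, close to the reported 89-11 result. The gap to AlphaGo Lee is large enough that the expected score exceeds 0.999.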

Why does AlphaGo Zero matter?

AlphaGo Zero was widely covered as evidence that a system could reach superhuman performance in a complex domain through reinforcement learning alone, without bootstrapping from human expertise. Because human game records were removed from the pipeline, the program was not anchored to human conventions and at times produced moves and openings that diverged from centuries of accumulated Go theory. DeepMind's leadership presented this as a step toward more general learning systems, suggesting that similar self-play approaches might apply to problems where high-quality human data is scarce or unavailable. Commentators were careful to note the limits of the claim, since Go is a deterministic, perfect-information game with clearly defined rules, which makes self-play training tractable in a way that many real-world problems are not.^[1]^[3]^[5]

The result also simplified the recipe established by the original AlphaGo. By folding two networks into one, dropping handcrafted features, and removing rollouts, AlphaGo Zero showed that a cleaner architecture could outperform a more complex predecessor, a finding that influenced how later self-play systems were designed.^[2]^[6]

How is AlphaGo Zero different from AlphaGo and AlphaZero?

AlphaGo Zero is closely related to, but distinct from, both the original AlphaGo and AlphaZero. The original AlphaGo learned in part from human expert games and used two networks plus rollouts; AlphaGo Zero removed human data, folded the two networks into one, and dropped rollouts, while remaining specialized for Go.^[1]^[2]

In December 2017, DeepMind released a preprint describing AlphaZero, a single algorithm based on the same self-play and search principles that was applied without game-specific modification to chess, shogi, and Go. AlphaZero was reported to reach a superhuman level in each of those games within about 24 hours of training and to defeat strong existing programs, including the chess engine Stockfish, the shogi engine Elmo, and the three-day version of AlphaGo Zero itself.^[7]^[8]

Program	Year	Human data?	Networks	Domain
AlphaGo (Lee/Fan)	2015-2016	Yes	Two (policy + value), with rollouts	Go
AlphaGo Zero	2017	No (tabula rasa)	One (combined policy + value), no rollouts	Go
AlphaZero	2017	No (tabula rasa)	One (combined policy + value)	Chess, shogi, Go

The lineage continued with MuZero, first described by DeepMind in a 2019 preprint and published in Nature in 2020, which extended the approach to settings where the rules of the environment are not given in advance by learning a model of the environment's dynamics as part of training. Together, AlphaGo, AlphaGo Zero, AlphaZero, and MuZero are often presented as successive steps in DeepMind's program of game-playing reinforcement learning, each removing another piece of built-in knowledge.^[7]^[8]

ELI5

Imagine a robot that wants to get really good at a board game but is never allowed to watch a single human play. It only knows the rules. So it plays the game against itself, over and over, millions of times. At first it just makes random moves, but every time it finishes a game it learns a little about which moves tend to win. After about three days of practicing against itself, it became so strong that it beat the best game-playing robot from before, which had studied tons of human games, by a perfect score of 100 to 0. That is AlphaGo Zero: it taught itself, from scratch, with no human help.

References

1. DeepMind, "AlphaGo Zero: Starting from scratch." https://deepmind.google/blog/alphago-zero-starting-from-scratch/
2. Silver, D., et al. "Mastering the game of Go without human knowledge." *Nature* 550, 354-359 (2017). https://www.nature.com/articles/nature24270
3. "AlphaGo Zero Shows Machines Can Become Superhuman Without Any Help." *MIT Technology Review*, October 18, 2017. https://www.technologyreview.com/2017/10/18/148511/alphago-zero-shows-machines-can-become-superhuman-without-any-help/
4. "AlphaGo Zero," Wikipedia. https://en.wikipedia.org/wiki/AlphaGo_Zero
5. "Google AlphaGo Zero masters the game in three days." Queensland Brain Institute, University of Queensland, October 2017. https://qbi.uq.edu.au/blog/2017/10/google-alphago-zero-masters-game-three-days
6. "Mastering the game of Go without human knowledge" (preprint copy of the Nature paper), UCL Discovery. https://discovery.ucl.ac.uk/10045895/
7. Silver, D., et al. "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm." arXiv:1712.01815 (December 5, 2017). https://arxiv.org/abs/1712.01815
8. DeepMind, "AlphaZero: Shedding new light on chess, shogi, and Go." https://deepmind.google/blog/alphazero-shedding-new-light-on-chess-shogi-and-go/
