AlphaStar is an artificial intelligence system developed by DeepMind that achieved Grandmaster level in the real-time strategy game StarCraft II, ranking above 99.8% of all active human players on the official Battle.net servers. Unveiled publicly in January 2019, AlphaStar demonstrated that AI could master a complex, imperfect-information, real-time domain requiring long-term planning, rapid decision-making, and strategic reasoning. The research was published on October 30, 2019 in the journal Nature under the title "Grandmaster level in StarCraft II using multi-agent reinforcement learning," authored by Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, and over 40 collaborators at DeepMind.
AlphaStar is considered one of the most significant milestones in game-playing AI, alongside AlphaGo, AlphaZero, and OpenAI Five. It was the first AI system to reach the top league of a widely played esport without any game restrictions, operating under the same conditions as human players.
StarCraft II, developed by Blizzard Entertainment, has long been regarded as one of the most demanding environments for artificial intelligence research. Several properties of the game make it substantially harder than board games like chess or Go, which prior AI systems had already conquered.
Unlike chess or Go, where both players can see the entire board, StarCraft II features a "fog of war" that hides regions of the map the player has not scouted. Players must actively move a camera to observe different parts of the battlefield, and enemy actions remain hidden unless units are positioned to detect them. This partial observability forces the agent to reason under uncertainty, make inferences about opponent strategies from incomplete data, and decide when to invest resources in scouting.
StarCraft II is not turn-based. Both players issue commands simultaneously and continuously. Actions must be executed in real time, with decisions made on the scale of milliseconds. The agent cannot pause to deliberate; it must integrate information, plan, and act under constant time pressure.
At any given moment, a StarCraft II player may choose from approximately 10^26 possible actions. This dwarfs the decision space of board games: chess has a branching factor of only about 35 legal moves per turn (and an estimated game-tree complexity of roughly 10^120 possible games, the Shannon number). In StarCraft II, each action involves selecting one or more units, choosing an ability or command, and specifying a target location or target unit. The combinatorial explosion of these choices creates one of the largest action spaces in any game studied by AI researchers.
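To see how the per-step count explodes, multiply the choices available at each stage of a composite action. The counts below are illustrative assumptions for a mid-game position, not figures from the paper; with larger armies and finer target grids the product grows toward the 10^26 figure:

```python
# Rough combinatorial estimate of a structured action space.
# All counts here are illustrative assumptions, not figures from the paper.
n_units = 30          # controllable units currently on the map
n_abilities = 50      # distinct abilities/commands available
map_width = 256       # target grid resolution (x)
map_height = 256      # target grid resolution (y)

unit_subsets = 2 ** n_units               # any subset of units can be selected
targets = map_width * map_height          # any grid cell can be a target
actions_per_step = unit_subsets * n_abilities * targets

print(f"~10^{len(str(actions_per_step)) - 1} possible composite actions")
```

Even these modest assumed counts already give on the order of 10^15 composite actions per step.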
A typical StarCraft II game lasts between 10 and 30 minutes and can involve thousands of individual decision steps. Players must balance short-term tactical micro-management (controlling individual units in battle) with long-term macro-strategy (building an economy, choosing a technology path, timing attacks). The reward signal is extremely sparse: the agent only learns whether it won or lost at the very end of the game, making credit assignment across thousands of steps exceptionally difficult.
The game requires simultaneous mastery of several distinct skills: economic management (gathering resources and building infrastructure), unit production (choosing which units to build and when), technology research (selecting upgrades and unlocking new capabilities), scouting (gathering intelligence about the opponent), tactical combat (positioning and controlling units in battles), and strategic planning (choosing when to attack, defend, or expand). Excelling at any single dimension is not enough; the agent must coordinate all of these at once.
As early as 2011, DeepMind co-founder Demis Hassabis identified StarCraft as "the next step up" for AI after board games. Following AlphaGo's historic victory over Lee Sedol in March 2016, Hassabis publicly discussed the possibility of building an AI for StarCraft, citing it as a strategic game with incomplete information where much of the "board" is invisible.
In November 2016, DeepMind and Blizzard Entertainment announced a formal collaboration at BlizzCon, alongside plans to release an open development environment for AI research in StarCraft II. This led to the release of the StarCraft II Learning Environment (SC2LE) and PySC2 in August 2017, providing researchers worldwide with tools to develop AI agents for the game.
Development of AlphaStar proceeded through several phases during 2018. The team first built a supervised learning pipeline to train agents from human replay data, then developed the multi-agent reinforcement learning system known as the AlphaStar League. By December 2018, the system was strong enough to defeat professional players.
AlphaStar's neural network architecture is a sophisticated combination of several components, totaling approximately 139 million parameters (with 55 million required during inference). The architecture was designed to handle the unique challenges of StarCraft II: processing diverse input types, maintaining memory over long game sequences, and producing structured, combinatorial actions.
The architecture uses three specialized encoders to process different aspects of the game state:
| Encoder | Input Type | Method |
|---|---|---|
| Scalar Encoder | Global game information (resources, supply, game time, player statistics) | Linear layers with ReLU activations |
| Entity Encoder | Information about individual game units (type, health, position, ownership) | Transformer with self-attention over entities |
| Spatial Encoder | 2D map features (terrain, unit positions, visibility) | 2D convolutions followed by ResBlocks |
The entity encoder is particularly notable. It applies a transformer to process information about all visible units on the map, treating each unit as a token in a sequence. This allows the network to learn relationships between units, such as which enemy units threaten which friendly units, or which buildings are part of a coordinated production strategy.
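The mechanism can be illustrated with a minimal single-head self-attention pass over a set of unit embeddings. This is a didactic numpy sketch with random matrices standing in for learned projections, not AlphaStar's actual transformer stack:

```python
import numpy as np

def self_attention(entities, d_k=None):
    """Single-head self-attention over a variable-length set of unit tokens.

    entities: (n_units, d) array of per-unit feature embeddings.
    Each unit's output is a weighted mix of all units' values, letting the
    model learn relationships between units (e.g. threats, coordination).
    """
    n, d = entities.shape
    d_k = d_k or d
    rng = np.random.default_rng(0)
    # Random projections stand in for learned query/key/value weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q, K, V = entities @ Wq, entities @ Wk, entities @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over units
    return weights @ V                              # (n, d_k) mixed features

# Seven visible units, each described by an 8-dim feature vector.
units = np.random.default_rng(1).standard_normal((7, 8))
out = self_attention(units)
print(out.shape)  # (7, 8)
```

Because the attention operates over a set, the same weights handle any number of visible units, which matters in a game where units are constantly created and destroyed.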
A scatter connection combines spatial and non-spatial features, allowing information from individual entities to be projected onto the spatial map representation and vice versa.
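The scatter direction of this connection can be sketched as follows; the grid size and the choice to sum co-located units are illustrative assumptions, not details from the paper:

```python
import numpy as np

def scatter_entities(entity_embs, positions, grid_hw):
    """Scatter per-unit embeddings onto a 2D map so spatial convolutions
    can see them. Sketch of the 'scatter connection' idea; units sharing
    a cell are summed here, an assumed (not paper-specified) choice.
    """
    h, w = grid_hw
    d = entity_embs.shape[1]
    grid = np.zeros((h, w, d))
    for emb, (y, x) in zip(entity_embs, positions):
        grid[y, x] += emb  # project the unit's features onto its map cell
    return grid

embs = np.ones((3, 4))                 # three units, 4-dim embeddings
pos = [(0, 0), (0, 0), (5, 7)]         # two units share a cell
grid = scatter_entities(embs, pos, (8, 8))
print(grid[0, 0])  # [2. 2. 2. 2.] (summed embeddings of the two units)
```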
The encoded observations are combined and fed into a deep LSTM (Long Short-Term Memory) network, which serves as the central memory and decision-making component. The LSTM maintains a hidden state across time steps, enabling the agent to remember past observations, track the progress of its strategy, and reason about events that occurred earlier in the game.
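The per-step update follows the standard LSTM cell equations, sketched here in numpy with toy dimensions and a single layer (AlphaStar's recurrent core is much deeper):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. The recurrent state (h, c) is what lets the
    agent carry information across thousands of game steps. A minimal
    sketch of the standard cell, not AlphaStar's exact LSTM stack.
    """
    z = W @ np.concatenate([x, h]) + b   # all four gate pre-activations
    i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell memory
    h_new = sigmoid(o) * np.tanh(c_new)               # new hidden state
    return h_new, c_new

d_in, d_h = 8, 16                        # assumed toy sizes
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for _ in range(10):                      # unroll over 10 observations
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W, b)
print(h.shape)  # (16,)
```

The forget gate lets the cell retain information for many steps, which is why recurrence suits a game whose episodes span thousands of decisions.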
AlphaStar produces actions through an auto-regressive policy head. Rather than predicting all action components simultaneously, the network generates each part of an action sequentially: first the action type, then the delay until the next action, whether the action is queued, which units to select, and finally the target unit or map location. Each component is conditioned on all previously generated ones.
The pointer network component is critical for handling the variable number of units in the game. Since the number of controllable units changes constantly as units are produced and destroyed, a fixed-size output layer cannot represent unit selection. The pointer network instead attends over the set of available entities, producing a probability distribution over them.
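A minimal sketch of this attention-based selection, with a plain dot-product score standing in for the learned projections of the real model:

```python
import numpy as np

def pointer_distribution(query, entity_embs, mask=None):
    """Attend from a decoder query over a variable-size set of entities,
    returning the probability of selecting each one. A didactic sketch of
    the pointer-network idea, not the trained model's parameterization.
    """
    scores = entity_embs @ query / np.sqrt(len(query))  # one score per entity
    if mask is not None:                                # e.g. dead units
        scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                              # softmax over entities

rng = np.random.default_rng(0)
query = rng.standard_normal(8)
units = rng.standard_normal((5, 8))               # five selectable units
mask = np.array([True, True, False, True, True])  # unit 2 is unselectable
p = pointer_distribution(query, units, mask)
print(p.round(3))  # probabilities over 5 units; the masked unit gets 0
```

Because the output distribution has one entry per currently visible entity, the same head works whether the agent controls five units or two hundred.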
For training, AlphaStar uses a centralized value function that has access to additional information not available to the policy (such as opponent information). This helps stabilize training by providing better estimates of state values, while the policy itself only uses information available to a human player.
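The asymmetry can be illustrated with linear maps standing in for the policy and value networks (toy dimensions, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs = 16

# Illustrative linear "networks" standing in for the policy and value MLPs.
policy_w = rng.standard_normal((d_obs, 4))     # policy sees own obs only
value_w = rng.standard_normal((2 * d_obs, 1))  # critic also sees opponent

own_obs = rng.standard_normal(d_obs)
opp_obs = rng.standard_normal(d_obs)           # hidden from the policy

action_logits = own_obs @ policy_w                    # usable at play time
value = np.concatenate([own_obs, opp_obs]) @ value_w  # training-time only

print(action_logits.shape, value.shape)  # (4,) (1,)
```

The critic is discarded at evaluation time, so the extra information never leaks into play: only the policy, restricted to human-available observations, acts in the game.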
AlphaStar's training followed a two-phase approach: supervised learning from human replays, followed by multi-agent reinforcement learning in the AlphaStar League.
The initial training phase used approximately 971,000 anonymized human game replays provided by Blizzard, drawn from players with MMR (matchmaking rating) above 3,500 (roughly the top 22% of the player population). The agent learned to predict human actions given the current game state, effectively imitating the strategies and tactics used by skilled human players.
This supervised learning phase served two critical purposes. First, it provided a strong behavioral foundation, teaching the agent basic strategies, build orders, unit compositions, and tactical patterns. Second, it addressed what David Silver (a senior researcher at DeepMind) called "the exploration problem": discovering viable strategies from scratch through random exploration would be like finding a needle in a haystack, given the vast action space.
After supervised training, the agent (called AlphaStar Supervised) achieved an MMR of approximately 3,699, placing it above 84% of human players. It could also defeat Blizzard's built-in Elite AI in 95% of matches.
To preserve strategic diversity, the supervised learning phase also trained the agent conditioned on a latent variable z, sampled from the distribution of human strategies. This meant the agent could produce different opening builds and strategic approaches depending on the value of z, rather than collapsing to a single dominant strategy.
The second phase used a novel multi-agent training framework called the AlphaStar League. Instead of simple self-play (where an agent trains against copies of itself), the League maintains a diverse population of agents that train against each other under varying objectives.
The League contained three types of agents:
| Agent Type | Objective | Opponent Selection |
|---|---|---|
| Main Agents | Maximize win rate against all opponents in the league | Prioritized Fictitious Self-Play (PFSP): opponents selected with probability proportional to the main agent's loss rate against them |
| Main Exploiters | Find and exploit weaknesses specifically in the current main agents | Trained against the latest main agents |
| League Exploiters | Find systemic weaknesses across the entire league | PFSP across all agents in the league |
The exploiter agents served a crucial role: they acted as adversarial stress-testers, discovering degenerate strategies or blind spots in the main agents. When a main exploiter found a strategy that consistently beat a main agent, the main agent would then be trained to defend against that exploit. Both types of exploiters periodically reset their weights to encourage exploration of new attack strategies.
This league structure prevented the "forgetting" problem common in simple self-play, where an agent learns to counter its current opponent but loses the ability to handle earlier strategies. The League preserved strategic diversity while still driving improvement.
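The PFSP opponent-sampling rule can be sketched directly from its definition. The "hard" weighting f(x) = (1 - x)^p described in the paper is used here; the win-rate numbers are made up:

```python
import numpy as np

def pfsp_weights(win_prob, p=2.0):
    """Prioritized Fictitious Self-Play weighting: prefer opponents the
    agent struggles against. win_prob[i] is the agent's estimated win rate
    vs opponent i; f(x) = (1 - x)^p is the 'hard' weighting variant,
    with p a hyperparameter controlling how sharply to prioritize.
    """
    w = (1.0 - np.asarray(win_prob)) ** p
    return w / w.sum()  # normalized sampling probabilities

# Agent wins 90% vs A, 50% vs B, 10% vs C: C is sampled most often.
probs = pfsp_weights([0.9, 0.5, 0.1])
print(probs.round(3))  # [0.009 0.234 0.757]
```

Focusing play time on the hardest opponents is what keeps the main agents from forgetting how to beat strategies they already dominate while still improving where they are weak.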
The AlphaStar League ran for 14 days, with each agent trained on 16 of Google's third-generation TPUs (TPU v3). Through accelerated simulation, each agent experienced up to 200 years of real-time StarCraft gameplay. The training infrastructure used a highly scalable distributed setup that ran thousands of StarCraft II instances in parallel.
AlphaStar's RL algorithm combined several techniques: an actor-critic policy gradient with V-trace off-policy corrections, TD(λ) for value-function bootstrapping, and UPGO (upgoing policy update), which biases policy updates toward trajectories that performed better than the value function expected.
The reward signal was binary and sparse: +1 for a win, -1 for a loss, received only at the end of the game. No intermediate rewards (such as resources gathered or units killed) were used for the final version of the agent.
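One of the combined techniques, V-trace (introduced with the IMPALA architecture), corrects for the lag between the actors that generate games and the learner that updates the policy. A compact numpy sketch of the published recursion, not DeepMind's actual implementation:

```python
import numpy as np

def vtrace(rewards, values, rhos, gamma=1.0, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets (IMPALA-style off-policy correction).

    rewards: per-step rewards, length T.
    values: critic values, length T + 1 (includes the bootstrap value).
    rhos: importance ratios pi(a|s) / mu(a|s) between learner and actor.
    Ratios are clipped at rho_bar / c_bar to bound the variance.
    """
    T = len(rewards)
    clipped_rho = np.minimum(rhos, rho_bar)
    clipped_c = np.minimum(rhos, c_bar)
    vs = np.array(values, dtype=float)
    for t in reversed(range(T)):  # backward recursion over the trajectory
        delta = clipped_rho[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        vs[t] = values[t] + delta + gamma * clipped_c[t] * (vs[t + 1] - values[t + 1])
    return vs[:T]

# On-policy data (all ratios 1), zero values, and a sparse terminal reward.
targets = vtrace(rewards=[0, 0, 1], values=[0, 0, 0, 0], rhos=np.ones(3))
print(targets)  # [1. 1. 1.]
```

With on-policy data and gamma = 1, the targets collapse to the plain return, so a sparse win/loss reward propagates the final game outcome to every step of the trajectory.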
On January 24, 2019, DeepMind publicly unveiled AlphaStar in a livestreamed event, showcasing pre-recorded matches against two professional StarCraft II players from Team Liquid.
The first series pitted AlphaStar (playing Protoss) against Dario "TLO" Wunsch, a top professional Zerg player. TLO is primarily a Zerg specialist but played Protoss for this match to enable a mirror matchup. AlphaStar won all five games (5-0), deploying distinct strategies in each game.
The second series, played on December 19, 2018 under professional match conditions, featured AlphaStar (playing Protoss) against Grzegorz "MaNa" Komincz, one of the world's top Protoss players, ranked among the top 10 Protoss specialists globally. AlphaStar again won all five games (5-0). Both series were played on the competitive ladder map CatalystLE, using StarCraft II version 4.6.2.
AlphaStar averaged approximately 280 actions per minute (APM) during these matches, with an average reaction delay of 350 milliseconds between observation and action. Both figures are within the range of professional human players.
Following the broadcast of the pre-recorded matches, DeepMind arranged a live exhibition match between MaNa and a newer version of AlphaStar that had been trained with camera interface restrictions (limiting it to view the game through a movable camera, just as human players do). This version had only been trained for seven days with the camera restriction. MaNa won the live game, dealing AlphaStar its first loss against a professional player. MaNa exploited the camera-restricted agent's weaknesses, demonstrating that the camera constraint meaningfully affected performance.
The January 2019 matches drew significant criticism from the StarCraft community and AI researchers on several grounds. The agent in the recorded series observed the entire map at once rather than through a human-style camera; its APM, modest on average, spiked to superhuman bursts (reportedly well over 1,000 actions per minute) during key battles, executed with machine precision no human can match; TLO played Protoss rather than his main race, Zerg; and the demonstration covered only a single map and the Protoss mirror matchup.
These criticisms motivated DeepMind to develop a significantly more constrained version of AlphaStar for the Battle.net ladder evaluation.
In response to criticism about the fairness of the initial demonstration, DeepMind retrained AlphaStar with substantially tighter constraints, designed in collaboration with professional player TLO.
The updated AlphaStar operated under the following restrictions:
| Constraint | Details |
|---|---|
| Camera interface | The agent viewed the game through a movable camera, receiving only the visual information available to a human player at any given moment |
| Action rate cap | Maximum of 22 non-duplicate actions per 5-second window |
| Action counting | One agent action (select units + choose ability + pick target) could count as up to 3 in-game APM; camera movements also counted against the action budget |
| Observation format | Processed structured game state data (unit lists, map features) rather than raw pixels, but limited to camera-visible information |
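The rolling action-rate cap is straightforward to express in code. This sketch enforces a window cap like the 22-actions-per-5-seconds limit; the duplicate-handling details are assumptions, not the published specification:

```python
from collections import deque

class ActionRateLimiter:
    """Rolling-window action cap, sketching the Battle.net-evaluation
    limit of N non-duplicate actions per window. The exact duplicate
    rules are an assumed interpretation, not the published spec.
    """
    def __init__(self, max_actions=22, window=5.0):
        self.max_actions = max_actions
        self.window = window
        self.history = deque()  # (timestamp, action) pairs inside the window

    def allow(self, t, action):
        # Evict actions that have aged out of the rolling window.
        while self.history and t - self.history[0][0] >= self.window:
            self.history.popleft()
        # Repeats of the most recent action are treated as duplicates
        # and do not consume budget (an assumption for this sketch).
        if self.history and self.history[-1][1] == action:
            return True
        if len(self.history) >= self.max_actions:
            return False
        self.history.append((t, action))
        return True

limiter = ActionRateLimiter(max_actions=3, window=5.0)
results = [limiter.allow(t, f"a{t}") for t in range(5)]
print(results)  # [True, True, True, False, False]
```

The first three distinct actions pass; the next two arrive inside the same 5-second window and are blocked until earlier actions age out.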
Starting in July 2019, AlphaStar was deployed anonymously on the European Battle.net 1v1 competitive ladder. Players who opted in to a special research program could be matched against the AI without knowing its identity. AlphaStar played with all three StarCraft II races (Protoss, Terran, and Zerg), each controlled by a separately trained agent.
Blizzard announced the deployment through an official blog post, and the StarCraft community was aware that an AI was competing on the ladder, though they did not know which specific accounts belonged to AlphaStar.
By late October 2019, AlphaStar had achieved Grandmaster rank for all three races on the European server, placing it within the top 0.15% of approximately 90,000 active players. The specific MMR ratings achieved by the final version (AlphaStar Final) were:
| Race | MMR Rating | Approximate Percentile |
|---|---|---|
| Protoss | 6,275 | Top 0.15% |
| Terran | 6,048 | Top 0.15% |
| Zerg | 5,835 | Top 0.15% |
This made AlphaStar the first AI agent to reach the top league of a major esport under conditions comparable to human play. The research team noted that professional players who reviewed AlphaStar's gameplay confirmed that it felt "fair" and "real," without a superhuman quality to its mechanics.
The paper documented three stages of AlphaStar's development, illustrating the contribution of each training phase:
| Version | Training Stage | Approximate Percentile |
|---|---|---|
| AlphaStar Supervised | Supervised learning only | Top 16% |
| AlphaStar Mid | Midpoint of RL training | Top 0.5% |
| AlphaStar Final | Full league training with camera constraints | Top 0.15% |
The key technical facts about AlphaStar are summarized below:

| Property | Value |
|---|---|
| Total parameters | ~139 million |
| Inference parameters | ~55 million |
| Training hardware | Google TPU v3 (16 TPUs per agent) |
| League training duration | 14 days |
| Supervised learning dataset | ~971,000 human replays (MMR > 3,500) |
| Gameplay experience per agent | Up to 200 years of real-time play |
| Reward signal | Binary win/loss (sparse) |
| Action space | ~10^26 possible actions per time step |
| Average APM (Battle.net version) | Capped at 22 actions per 5 seconds |
| Races mastered | All three (Protoss, Terran, Zerg) |
| Peak MMR achieved | 6,275 (Protoss, European server) |
AlphaStar and OpenAI Five were developed around the same period and represent the two most prominent achievements in AI for complex multiplayer video games. While both systems demonstrated superhuman performance, they tackled different games with different approaches.
| Feature | AlphaStar (StarCraft II) | OpenAI Five (Dota 2) |
|---|---|---|
| Developer | DeepMind | OpenAI |
| Game | StarCraft II | Dota 2 |
| Game type | 1v1 real-time strategy | 5v5 multiplayer online battle arena |
| Information | Imperfect (fog of war) | Imperfect (fog of war) |
| Architecture | Transformer + Deep LSTM + Pointer Network | Single-layer 4,096-unit LSTM per hero |
| Total parameters | ~139 million | ~159 million |
| Training method | Supervised learning + multi-agent RL (league) | Pure self-play (no human data) |
| RL algorithm | V-trace + TD(λ) + UPGO | Proximal Policy Optimization (PPO) |
| Training hardware | 16 TPUs per agent (Google TPU v3) | 256 GPUs + 128,000 CPU cores |
| Training duration | 14 days (league phase) | ~10 months (~770 PFlop/s-days of compute) |
| Gameplay experience | Up to 200 years per agent | ~180 years per day (collective) |
| Key achievement | Grandmaster in all 3 races (top 0.15%) | Defeated Dota 2 world champions OG (2-0) |
| Date of key result | October 2019 (Nature publication) | April 13, 2019 (OG match) |
| Game restrictions | None (full game, all races, all maps) | Restricted to 17 heroes |
| Publication | Nature (October 30, 2019) | arXiv preprint (December 2019) |
Both systems used distinct strategies to handle their respective games. AlphaStar's league-based multi-agent approach was designed to maintain strategic diversity and prevent cyclic weaknesses (rock-paper-scissors dynamics), while OpenAI Five relied on massive-scale self-play with no human data at all. AlphaStar's use of supervised pretraining from human replays gave it a strong initial behavioral prior, whereas OpenAI Five learned entirely from scratch.
One notable difference is that AlphaStar played the full, unrestricted version of StarCraft II during its Battle.net evaluation (all races, all maps in the competitive pool), while OpenAI Five's victory over OG used a restricted hero pool of 17 out of over 100 available heroes.
AlphaStar's achievement had significant repercussions across several areas of AI research and the gaming community.
The AlphaStar project introduced or validated several techniques that have influenced subsequent research:
In August 2023, DeepMind released "AlphaStar Unplugged," a large-scale offline reinforcement learning benchmark built on the AlphaStar codebase. The benchmark includes a dataset of 2.8 million game episodes (representing over 30 years of gameplay), standardized evaluation protocols, and baseline implementations of offline RL algorithms including behavior cloning, offline actor-critic, and offline MuZero. Offline RL agents trained on this benchmark achieved a 90% win rate against the original AlphaStar Supervised agent, demonstrating the potential of learning from pre-collected data without online interaction.
DeepMind released the AlphaStar codebase on GitHub, enabling the research community to study, reproduce, and build upon the system. This release has supported numerous follow-up projects, including mini-AlphaStar implementations that reduced computational requirements while preserving the core multi-agent RL components.
AlphaStar demonstrated that AI could handle an environment combining imperfect information, real-time constraints, enormous action spaces, and long-term planning. These properties are far more representative of real-world decision-making challenges than the perfect-information, turn-based games previously conquered by AI systems like Deep Blue or AlphaGo. The techniques developed for AlphaStar have potential applications in robotics, autonomous systems, resource management, and any domain requiring sequential decision-making under uncertainty.
The Nature paper has accumulated thousands of citations and remains one of the most referenced works in deep reinforcement learning. It established StarCraft II as a benchmark domain for multi-agent RL and demonstrated that the combination of supervised pretraining, multi-agent league training, and carefully designed neural architectures could produce agents capable of competing with the best human players in one of the most complex games ever created.