# AlphaStar

> Source: https://aiwiki.ai/wiki/alphastar
> Updated: 2026-06-22
> Categories: AI in Gaming, Artificial Intelligence, Google DeepMind, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**AlphaStar** is an artificial intelligence system built by [Google DeepMind](/wiki/google_deepmind) that in 2019 became the first AI to reach Grandmaster level in the real-time strategy game [StarCraft II](/wiki/starcraft_ii), ranking above 99.8% of active human players on the official Battle.net servers (the top 0.2%) for all three in-game races.[1] It learned to play through a combination of imitation learning from human replays and large-scale multi-agent [reinforcement learning](/wiki/reinforcement_learning), carried out in a population-based self-play setup that DeepMind called the AlphaStar League. The work was published on October 30, 2019 in the journal *Nature* under the title "Grandmaster level in StarCraft II using multi-agent reinforcement learning," authored by Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, and more than 40 collaborators at DeepMind.[1]

AlphaStar first reached the public in a January 24, 2019 livestream, where pre-recorded matches showed it beating two professional players from Team Liquid 5-0 each.[3] It is considered one of the most significant milestones in game-playing AI, alongside [AlphaGo](/wiki/alphago), [AlphaZero](/wiki/alphazero), and [OpenAI Five](/wiki/openai_five). It was the first AI system to reach the top league of a widely played esport without any game restrictions, operating under conditions comparable to human players.[1]

## Why was StarCraft II a grand challenge for AI?

StarCraft II, developed by Blizzard Entertainment, has long been regarded as one of the most demanding environments for artificial intelligence research.[4] Several properties of the game make it substantially harder than board games like chess or Go, which prior AI systems had already conquered.

### Imperfect Information

Unlike chess or Go, where both players can see the entire board, StarCraft II features a "fog of war" that hides regions of the map the player has not scouted.[4] Players must actively move a camera to observe different parts of the battlefield, and enemy actions remain hidden unless units are positioned to detect them. This partial observability forces the agent to reason under uncertainty, make inferences about opponent strategies from incomplete data, and decide when to invest resources in scouting.

### Real-Time Decision-Making

StarCraft II is not turn-based. Both players issue commands simultaneously and continuously, and every decision must be made and executed in real time, often within fractions of a second. The agent cannot pause to deliberate; it must integrate information, plan, and act under constant time pressure.

### Enormous Action Space

At any given moment, a StarCraft II player may have access to approximately 10^26 possible actions.[1] This dwarfs the per-move choices in board games: chess has a game-tree complexity of roughly 10^120, but its branching factor per move is only around 35. In StarCraft II, each action involves selecting one or more units, choosing an ability or command, and specifying a target location or target unit. The combinatorial explosion of these choices creates one of the largest action spaces in any game studied by AI researchers.
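
The multiplicative structure described above can be illustrated with a rough back-of-envelope count. This is only a sketch: every constant below is a hypothetical placeholder chosen for illustration, not a figure from the paper, so the result is not expected to reproduce the 10^26 estimate.

```python
from math import comb, log10

# Hypothetical, illustrative numbers only; none are taken from the paper.
NUM_ACTION_TYPES = 300        # assumed count of distinct abilities/commands
NUM_OWN_UNITS = 50            # assumed number of units the player controls
MAX_SELECTED = 12             # assumed cap on units selected per action
MAP_TARGETS = 256 * 256       # assumed grid of target locations


def count_unit_selections(num_units: int, max_selected: int) -> int:
    """Number of non-empty unit subsets of size at most max_selected."""
    return sum(comb(num_units, k) for k in range(1, max_selected + 1))


def action_space_size() -> int:
    """Multiply the independent choices that make up one action."""
    selections = count_unit_selections(NUM_OWN_UNITS, MAX_SELECTED)
    return NUM_ACTION_TYPES * selections * MAP_TARGETS


size = action_space_size()
print(f"about 10^{log10(size):.1f} actions under these assumptions")
```

Even with modest placeholder values, the unit-selection term dominates, which is why the count grows so quickly as armies get larger.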

### Long-Term Strategic Planning

A typical StarCraft II game lasts between 10 and 30 minutes and can involve thousands of individual decision steps.[1] Players must balance short-term tactical micro-management (controlling individual units in battle) with long-term macro-strategy (building an economy, choosing a technology path, timing attacks). The reward signal is extremely sparse: the agent only learns whether it won or lost at the very end of the game, making credit assignment across thousands of steps exceptionally difficult.

### Multi-Scale Complexity

The game requires simultaneous mastery of several distinct skills: economic management (gathering resources and building infrastructure), unit production (choosing which units to build and when), technology research (selecting upgrades and unlocking new capabilities), scouting (gathering intelligence about the opponent), tactical combat (positioning and controlling units in battles), and strategic planning (choosing when to attack, defend, or expand). Excelling at any single dimension is not enough; the agent must coordinate all of these at once.

## Development History

### Origins and Blizzard Collaboration

As early as 2011, DeepMind co-founder [Demis Hassabis](/wiki/demis_hassabis) identified StarCraft as "the next step up" for AI after board games. Following [AlphaGo](/wiki/alphago)'s historic victory over Lee Sedol in March 2016, Hassabis publicly discussed the possibility of building an AI for StarCraft, citing it as a strategic game with incomplete information where much of the "board" is invisible.

In November 2016, DeepMind and Blizzard Entertainment announced a formal collaboration at BlizzCon, alongside plans to release an open development environment for AI research in StarCraft II. This led to the release of the StarCraft II Learning Environment (SC2LE) and PySC2 in August 2017, providing researchers worldwide with tools to develop [AI agents](/wiki/ai_agents) for the game.[4]

### Path to AlphaStar

Development of AlphaStar proceeded through several phases during 2018. The team first built a [supervised learning](/wiki/supervised_learning) pipeline to train agents from human replay data, then developed the multi-agent [reinforcement learning](/wiki/reinforcement_learning) system known as the AlphaStar League.[1] By December 2018, the system was strong enough to defeat professional players.[3]

## How does AlphaStar's neural network work?

AlphaStar's neural network combines several components and totals approximately 139 million parameters (of which 55 million are required during inference).[1] The architecture was designed to handle the unique challenges of StarCraft II: processing diverse input types, maintaining memory over long game sequences, and producing structured, combinatorial actions.

### Input Encoders

The architecture uses three specialized encoders to process different aspects of the game state:

| Encoder | Input Type | Method |
|---|---|---|
| **Scalar Encoder** | Global game information (resources, supply, game time, player statistics) | Linear layers with ReLU activations |
| **Entity Encoder** | Information about individual game units (type, health, position, ownership) | [Transformer](/wiki/transformer) with self-attention over entities |
| **Spatial Encoder** | 2D map features (terrain, unit positions, visibility) | 2D convolutions followed by ResBlocks |

The entity encoder is particularly notable. It applies a [transformer](/wiki/transformer) to process information about all visible units on the map, treating each unit as a token in a sequence.[1] This allows the network to learn relationships between units, such as which enemy units threaten which friendly units, or which buildings are part of a coordinated production strategy.
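
As a minimal sketch of the idea of treating units as tokens, the following pure-Python function applies single-head scaled dot-product self-attention to a variable-length list of unit feature vectors. It is a simplification under stated assumptions: a real transformer layer uses learned query, key, and value projections, multiple heads, and stacked layers, all omitted here, and the feature vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(entities: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention without learned weights.

    Each entity (unit) is one token. Learned query/key/value projections
    are omitted for brevity, so queries, keys, and values are the inputs.
    """
    dim = len(entities[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in entities:
        weights = softmax([dot(query, key) / scale for key in entities])
        mixed = [
            sum(w * value[i] for w, value in zip(weights, entities))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 3-feature unit vectors; the number of units can vary per step.
units = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 0.1, 0.4], [0.2, 0.9, 0.1]]
encoded = self_attention(units)
```

Each output vector is a weighted mix of all unit vectors, so the representation of one unit can depend on every other visible unit regardless of how many there are.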

A scatter connection combines spatial and non-spatial features, allowing information from individual entities to be projected onto the spatial map representation and vice versa.
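
One direction of such a connection, from entities to the map, might look like the sketch below. The function name, the grid layout, and the choice to sum embeddings that land on the same cell are assumptions made for illustration; the source does not specify these details.

```python
from typing import List, Tuple

Vector = List[float]
Grid = List[List[Vector]]


def scatter_entities(
    embeddings: List[Vector],
    positions: List[Tuple[int, int]],
    height: int,
    width: int,
) -> Grid:
    """Place each entity embedding at its (row, col) cell on a zeroed map.

    Embeddings landing on the same cell are summed (an assumed choice).
    Entities outside the map bounds are ignored.
    """
    channels = len(embeddings[0])
    grid: Grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for vector, (row, col) in zip(embeddings, positions):
        if 0 <= row < height and 0 <= col < width:
            cell = grid[row][col]
            for c in range(channels):
                cell[c] += vector[c]
    return grid


# Three hypothetical units with 2-channel embeddings; two share cell (0, 1).
grid = scatter_entities(
    [[1.0, 2.0], [0.5, 0.5], [1.0, 0.0]],
    [(0, 1), (2, 2), (0, 1)],
    height=3,
    width=3,
)
```

The resulting grid can then be stacked with the other map feature layers before the convolutional spatial encoder processes them.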

### Core: Deep LSTM

The encoded observations are combined and fed into a deep [LSTM](/wiki/long_short-term_memory_lstm) (Long Short-Term Memory) network, which serves as the central memory and decision-making component.[1] The LSTM maintains a hidden state across time steps, enabling the agent to remember past observations, track the progress of its strategy, and reason about events that occurred earlier in the game.

### Auto-Regressive Policy Head

AlphaStar produces actions through an auto-regressive policy head. Rather than predicting all action components simultaneously, the network generates each part of an action sequentially, with each subsequent component conditioned on all previous ones:[1]

1. **Action type** (e.g., move, attack, build, train unit)
2. **Delay** (how long to wait before the next action)
3. **Queue** (whether to queue the action)
4. **Selected units** (which units to control, using a pointer network)
5. **Target unit** (if applicable, selected via a pointer network)
6. **Target location** (a point on the map)
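
The sequential conditioning over these six components can be sketched as follows. In this toy version each "head" is a plain function returning a hand-written distribution; in the actual network the heads are neural modules conditioned on the LSTM output and on embeddings of earlier choices. All option names and probabilities are hypothetical.

```python
import random
from typing import Callable, Dict

# Order of action components, matching the list above.
COMPONENTS = ["action_type", "delay", "queue", "selected_units",
              "target_unit", "target_location"]

# A head maps the choices made so far to a distribution over its own options.
Head = Callable[[Dict[str, str]], Dict[str, float]]


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one option according to its probability."""
    options = list(dist)
    return rng.choices(options, weights=[dist[o] for o in options], k=1)[0]


def sample_action(heads: Dict[str, Head], rng: random.Random) -> Dict[str, str]:
    """Sample components one at a time, each conditioned on earlier ones."""
    chosen: Dict[str, str] = {}
    for name in COMPONENTS:
        dist = heads[name](chosen)   # conditioning on previous choices
        chosen[name] = sample(dist, rng)
    return chosen


def target_unit_head(chosen: Dict[str, str]) -> Dict[str, float]:
    # Hypothetical rule: only attacks need a target unit.
    if chosen["action_type"] == "attack":
        return {"enemy_1": 0.7, "enemy_2": 0.3}
    return {"none": 1.0}


def target_location_head(chosen: Dict[str, str]) -> Dict[str, float]:
    # Hypothetical rule: only moves need a map location.
    if chosen["action_type"] == "move":
        return {"(10, 20)": 0.5, "(30, 5)": 0.5}
    return {"none": 1.0}


toy_heads: Dict[str, Head] = {
    "action_type": lambda chosen: {"move": 0.5, "attack": 0.5},
    "delay": lambda chosen: {"1": 0.6, "4": 0.4},
    "queue": lambda chosen: {"no": 0.9, "yes": 0.1},
    "selected_units": lambda chosen: {"marine_1": 0.5, "marine_1+marine_2": 0.5},
    "target_unit": target_unit_head,
    "target_location": target_location_head,
}

action = sample_action(toy_heads, random.Random(0))
```

Because later heads see earlier choices, the policy never has to assign probability to inconsistent combinations such as a move command with an enemy target unit.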

The pointer network component is critical for handling the variable number of units in the game. Since the number of controllable units changes constantly as units are produced and destroyed, a fixed-size output layer cannot represent unit selection. The pointer network instead attends over the set of available entities, producing a probability distribution over them.
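
A minimal sketch of this attention-based selection is shown below, assuming a simple dot-product score between a query vector and one key vector per entity. The query, the keys, and the scoring function are illustrative assumptions; the source does not describe the exact scoring used.

```python
import math
from typing import List

Vector = List[float]


def pointer_distribution(query: Vector, entity_keys: List[Vector]) -> List[float]:
    """Softmax over dot-product scores between a query and each entity key.

    The output length equals the number of entities, so it adapts as units
    are produced or destroyed.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in entity_keys]
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


query = [0.5, -1.0]                          # hypothetical decoder state
three_units = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
five_units = three_units + [[0.2, 0.2], [-1.0, 1.0]]

probs_3 = pointer_distribution(query, three_units)
probs_5 = pointer_distribution(query, five_units)
```

The same function handles three or five units without any change to its parameters, which is the property a fixed-size output layer lacks.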

### Centralized Value Baseline

For training, AlphaStar uses a centralized value function that has access to additional information not available to the policy (such as opponent information). This helps stabilize training by providing better estimates of state values, while the policy itself only uses information available to a human player.[1]

## How was AlphaStar trained?

AlphaStar's training followed a two-phase approach: supervised learning from human replays, followed by multi-agent reinforcement learning in the AlphaStar League.[1] DeepMind noted that "the League training is fully automated, and starts only with agents trained by supervised learning."[2]

### Phase 1: Supervised Learning from Human Replays

The initial training phase used approximately 971,000 anonymized human game replays provided by Blizzard, drawn from players with MMR (matchmaking rating) above 3,500 (roughly the top 22% of the player population).[1] The agent learned to predict human actions given the current game state, effectively imitating the strategies and tactics used by skilled human players. According to DeepMind, this imitation learning produced "an initial policy which played the game better than 84% of active players."[2]

This supervised learning phase served two purposes. First, it provided a strong behavioral foundation, teaching the agent basic strategies, build orders, unit compositions, and tactical patterns. Second, it eased the exploration problem: given the vast action space, discovering viable strategies from scratch through random exploration would be like finding a needle in a haystack.
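
The imitation objective, predicting the human action in each recorded state, is commonly expressed as an average negative log-likelihood. The sketch below assumes a toy tabular policy and hypothetical state and action names; the real system predicts structured actions with a neural network, and the source does not give its exact loss.

```python
import math
from typing import Dict, List, Tuple

Policy = Dict[str, Dict[str, float]]   # state -> {action: probability}


def imitation_loss(policy: Policy, replay: List[Tuple[str, str]]) -> float:
    """Average negative log-likelihood of the human action in each state."""
    total = 0.0
    for state, human_action in replay:
        prob = policy[state].get(human_action, 0.0)
        total += -math.log(max(prob, 1e-12))   # clamp to avoid log(0)
    return total / len(replay)


# Hypothetical replay: (state, action the human took).
replay = [("early_game", "build_worker"),
          ("early_game", "build_worker"),
          ("under_attack", "defend")]

close_to_humans: Policy = {
    "early_game": {"build_worker": 0.9, "attack": 0.1},
    "under_attack": {"defend": 0.8, "build_worker": 0.2},
}
far_from_humans: Policy = {
    "early_game": {"build_worker": 0.2, "attack": 0.8},
    "under_attack": {"defend": 0.3, "build_worker": 0.7},
}

loss_close = imitation_loss(close_to_humans, replay)
loss_far = imitation_loss(far_from_humans, replay)
```

Minimizing this loss pushes probability mass toward the actions skilled players actually took, which is what gives the agent sensible build orders before any reinforcement learning begins.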

After supervised training, the agent (called AlphaStar Supervised) achieved an MMR of approximately 3,699, placing it above 84% of human players.[1] It could also defeat Blizzard's built-in Elite AI in 95% of matches.

To preserve strategic diversity, the supervised learning phase also trained the agent conditioned on a latent variable *z*, sampled from the distribution of human strategies.[1] This meant the agent could produce different opening builds and strategic approaches depending on the value of *z*, rather than collapsing to a single dominant strategy.

### Phase 2: Multi-Agent Reinforcement Learning (AlphaStar League)

The second phase used a novel multi-agent training framework called the AlphaStar League. Instead of simple self-play (where an agent trains against copies of itself), the League maintains a diverse population of agents that train against each other under varying objectives.[1] As DeepMind described it, the League contains "main agents whose goal is to win versus everyone, and also exploiter agents that focus on helping the main agent grow stronger by exposing its flaws."[2]

The League contained three types of agents:

| Agent Type | Objective | Opponent Selection |
|---|---|---|
| **Main Agents** | Maximize win rate against all opponents in the league | Prioritized Fictitious Self-Play (PFSP): opponents selected with probability proportional to the main agent's loss rate against them |
| **Main Exploiters** | Find and exploit weaknesses specifically in the current main agents | Trained against the latest main agents |
| **League Exploiters** | Find systemic weaknesses across the entire league | PFSP across all agents in the league |

The exploiter agents served a crucial role: they acted as adversarial stress-testers, discovering degenerate strategies or blind spots in the main agents.[1] When a main exploiter found a strategy that consistently beat a main agent, the main agent would then be trained to defend against that exploit. Both types of exploiters periodically reset their weights to encourage exploration of new attack strategies.
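
The loss-rate weighting used for opponent selection can be sketched as follows. The `power` exponent, the uniform fallback, and the agent names are assumptions added for illustration; with `power=1.0` the weights are exactly proportional to the learner's loss rate, as described in the table above.

```python
import random
from typing import Dict


def pfsp_probabilities(win_rates: Dict[str, float], power: float = 1.0) -> Dict[str, float]:
    """Weight each opponent by the learner's loss rate against it.

    win_rates maps opponent name -> learner's estimated win probability.
    power=1.0 gives weights proportional to loss rate; larger values
    (an assumed knob) focus harder on the toughest opponents.
    """
    weights = {name: (1.0 - w) ** power for name, w in win_rates.items()}
    total = sum(weights.values())
    if total == 0.0:   # learner beats everyone: fall back to uniform
        return {name: 1.0 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}


def pick_opponent(win_rates: Dict[str, float], rng: random.Random,
                  power: float = 1.0) -> str:
    """Sample one opponent according to the PFSP probabilities."""
    probs = pfsp_probabilities(win_rates, power)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical league snapshot: learner's win rate against each opponent.
league = {"past_main_v1": 0.9, "main_exploiter_v3": 0.4, "league_exploiter_v2": 0.7}
probs = pfsp_probabilities(league)
opponent = pick_opponent(league, random.Random(0))
```

In this snapshot the opponent the learner loses to most often receives the largest share of matches, so training time concentrates on unresolved weaknesses instead of opponents that are already beaten reliably.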

This league structure prevented the "forgetting" problem common in simple self-play, where an agent learns to counter its current opponent but loses the ability to handle earlier strategies. The League preserved strategic diversity while still driving improvement.

The AlphaStar League was run for 14 days, using 16 third-generation [TPUs](/wiki/tensor_processing_unit_tpu) for each agent.[1] During training, each agent experienced up to 200 years of real-time StarCraft gameplay through accelerated simulation.[2] The entire training infrastructure used Google's v3 TPUs with a highly scalable distributed setup supporting thousands of parallel StarCraft II instances.

### Reinforcement Learning Algorithm

AlphaStar's RL algorithm combined several techniques:[1]

- **V-trace**: An off-policy correction method for the policy gradient, addressing the fact that training data was collected by earlier versions of the policy.
- **TD(lambda)**: Temporal difference learning for updating the value function.
- **UPGO (Upgoing Policy Gradient Operator)**: A novel self-imitation algorithm that biases the policy toward trajectories where the actual outcome exceeded the expected value. When an action led to a better-than-expected result, the agent learned from it; when it led to a worse-than-expected result, it bootstrapped from the value estimate instead.
- **KL divergence penalty**: A regularization term that prevented the RL policy from drifting too far from the supervised learning policy, helping to maintain human-like play patterns.
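
Two of these components, the TD(lambda) value targets and the UPGO return, can be sketched for a single trajectory as shown below. The sketch assumes no discounting by default, a terminal value of zero, and a one-step estimate of the action value (next reward plus the following state value); V-trace corrections and the KL penalty are not shown. The reward and value numbers are hypothetical.

```python
from typing import List


def td_lambda_returns(rewards: List[float], values: List[float],
                      lam: float, gamma: float = 1.0) -> List[float]:
    """Lambda-returns used as value-function targets.

    rewards[t] is the reward after step t. values[t] estimates V(s_t), and
    values has one extra entry for the state after the last step.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * bootstrap
        next_return = returns[t]
    return returns


def upgo_returns(rewards: List[float], values: List[float]) -> List[float]:
    """Upgoing returns: follow the trajectory only while it beats the value.

    If the next step did at least as well as expected, keep accumulating the
    real outcome; otherwise bootstrap from the value estimate instead.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    returns[horizon - 1] = rewards[horizon - 1] + values[horizon]
    for t in reversed(range(horizon - 1)):
        q_next = rewards[t + 1] + values[t + 2]   # one-step action-value estimate
        if q_next >= values[t + 1]:
            returns[t] = rewards[t] + returns[t + 1]
        else:
            returns[t] = rewards[t] + values[t + 1]
    return returns


# Hypothetical 4-step game that ends in a win (+1 at the final step).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.1, 0.6, 0.0]   # last entry is the terminal state

upgo = upgo_returns(rewards, values)
monte_carlo = td_lambda_returns(rewards, values, lam=1.0)
one_step = td_lambda_returns(rewards, values, lam=0.0)
```

In this toy trajectory the value estimate drops between the second and third states, so the UPGO return for the first step bootstraps from the value estimate there, while later steps keep the actual winning outcome.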

The reward signal was binary and sparse: +1 for a win, -1 for a loss, received only at the end of the game.[1] No intermediate rewards (such as resources gathered or units killed) were used for the final version of the agent.

## What happened in the January 2019 demonstration?

On January 24, 2019, DeepMind publicly unveiled AlphaStar in a livestreamed event, showcasing pre-recorded matches against two professional StarCraft II players from Team Liquid.[3]

### AlphaStar vs. TLO

The first series pitted AlphaStar (playing Protoss) against Dario "TLO" Wunsch, a professional player who is primarily a Zerg specialist but played Protoss for these games so that the series would be a mirror matchup. AlphaStar won all five games (5-0), deploying distinct strategies in each game.[3] After the series, TLO said the agent "feels very fair, like it is playing a 'real' game of StarCraft," adding that "AlphaStar has excellent and precise control, it doesn't feel superhuman."[2]

### AlphaStar vs. MaNa

The second series, played on December 19, 2018 under professional match conditions, featured AlphaStar (playing Protoss) against Grzegorz "MaNa" Komincz, who was ranked among the top 10 Protoss players in the world. AlphaStar again won all five games (5-0).[3] Both series were played on the competitive ladder map CatalystLE, using StarCraft II version 4.6.2.

AlphaStar averaged approximately 280 actions per minute (APM) during these matches, with an average reaction delay of 350 milliseconds between observation and action.[3] Both figures are within the range of professional human players.

### The Live Exhibition Match

Following the broadcast of the pre-recorded matches, DeepMind arranged a live exhibition match between MaNa and a newer version of AlphaStar that had been trained with camera interface restrictions (limiting it to view the game through a movable camera, just as human players do). This version had only been trained for seven days with the camera restriction. MaNa won the live game, dealing AlphaStar its first loss against a professional player.[3] MaNa exploited the camera-restricted agent's weaknesses, demonstrating that the camera constraint meaningfully affected performance.

### Criticism of the Initial Demonstration

The January 2019 matches drew significant criticism from the StarCraft community and AI researchers on several grounds:

- **Global camera access**: In the pre-recorded matches, AlphaStar observed the game through a raw interface that provided information about all visible units simultaneously, rather than viewing a limited screen area through a camera as human players do. This gave the agent a meaningful informational advantage.
- **Burst APM**: While AlphaStar's average APM was comparable to professional play, observers noted that its instantaneous action rate could spike dramatically during critical moments. Reports indicated burst APM reaching 900 or even 1,500 in short windows, far exceeding human capabilities.
- **API-level precision**: AlphaStar interacted with the game engine through a programmatic API rather than through a visual display with mouse and keyboard. This allowed pixel-perfect targeting and instantaneous unit selection that human players cannot replicate.

These criticisms motivated DeepMind to develop a significantly more constrained version of AlphaStar for the Battle.net ladder evaluation.

## How did AlphaStar reach Grandmaster on Battle.net?

In response to criticism about the fairness of the initial demonstration, DeepMind retrained AlphaStar with substantially tighter constraints, designed in collaboration with professional player TLO.[2]

### Human-Like Constraints

The updated AlphaStar operated under the following restrictions:

| Constraint | Details |
|---|---|
| **Camera interface** | The agent viewed the game through a movable camera, receiving only the visual information available to a human player at any given moment |
| **Action rate cap** | Maximum of 22 non-duplicate actions per 5-second window |
| **Action counting** | One agent action (select units + choose ability + pick target) could count as up to 3 in-game APM; camera movements also counted against the action budget |
| **Observation format** | Processed structured game state data (unit lists, map features) rather than raw pixels, but limited to camera-visible information |
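
The action rate cap can be pictured as a sliding-window limiter, sketched below. Whether the window slides or is fixed, and how duplicate actions and camera movements are counted, are assumptions here; the class name and interface are hypothetical and duplicates are not modeled.

```python
from collections import deque
from typing import Deque


class ActionRateLimiter:
    """Allow at most max_actions within any sliding window of window_seconds."""

    def __init__(self, max_actions: int = 22, window_seconds: float = 5.0) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def try_act(self, now: float) -> bool:
        """Record an action at time `now` if the budget allows; return success."""
        # Drop recorded actions that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


# An agent attempting an action every 0.1 s for the first 5 seconds.
limiter = ActionRateLimiter()
allowed = sum(limiter.try_act(i * 0.1) for i in range(50))
```

A cap of 22 actions per 5 seconds corresponds to at most 264 agent actions per minute; since one agent action can count as up to 3 in-game APM, that is an upper bound of 792 measured APM.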

### Ladder Deployment

Starting in July 2019, AlphaStar was deployed anonymously on the European Battle.net 1v1 competitive ladder.[5] Players who opted in to a special research program could be matched against the AI without knowing its identity. AlphaStar played with all three StarCraft II races (Protoss, Terran, and Zerg), each controlled by a separately trained agent.[1]

Blizzard announced the deployment through an official blog post, and the StarCraft community was aware that an AI was competing on the ladder, though they did not know which specific accounts belonged to AlphaStar.[5]

### Grandmaster Results

By late October 2019, AlphaStar had achieved Grandmaster rank for all three races on the European server, placing it above 99.8% of active players (the top 0.2%) among roughly 90,000 ranked players.[1] The final agent's average rating was within the top 0.15%.[1] The specific MMR ratings achieved by the final version (AlphaStar Final) were:

| Race | MMR Rating | Approximate Percentile |
|---|---|---|
| [Protoss](/wiki/protoss) | 6,275 | Top 0.15% |
| [Terran](/wiki/terran) | 6,048 | Top 0.15% |
| [Zerg](/wiki/zerg) | 5,835 | Top 0.15% |

This made AlphaStar the first AI agent to reach the top league of a major esport under conditions comparable to human play.[1] The research team noted that professional players who reviewed AlphaStar's gameplay confirmed that it felt "fair" and "real," without a superhuman quality to its mechanics.[2]

### Performance Progression

The paper documented three stages of AlphaStar's development, illustrating the contribution of each training phase:[1]

| Version | Training Stage | Approximate Percentile |
|---|---|---|
| AlphaStar Supervised | Supervised learning only | Top 16% |
| AlphaStar Mid | Midpoint of RL training | Top 0.5% |
| AlphaStar Final | Full league training with camera constraints | Top 0.15% |

## Technical Details Summary

| Property | Value |
|---|---|
| Total parameters | ~139 million |
| Inference parameters | ~55 million |
| Training hardware | Google TPU v3 (16 TPUs per agent) |
| League training duration | 14 days |
| Supervised learning dataset | ~971,000 human replays (MMR > 3,500) |
| Gameplay experience per agent | Up to 200 years of real-time play |
| Reward signal | Binary win/loss (sparse) |
| Action space | ~10^26 possible actions per time step |
| Action rate cap (Battle.net version) | 22 non-duplicate actions per 5-second window |
| Races mastered | All three (Protoss, Terran, Zerg) |
| Peak MMR achieved | 6,275 (Protoss, European server) |
| Final ranking | Above 99.8% of active human players (top 0.2%) |

## How does AlphaStar compare with OpenAI Five?

AlphaStar and [OpenAI Five](/wiki/openai_five) were developed around the same period and are two of the most prominent achievements in AI for complex multiplayer video games. Both systems competed successfully against top human players, but they tackled different games with different approaches.

| Feature | AlphaStar (StarCraft II) | OpenAI Five (Dota 2) |
|---|---|---|
| Developer | [Google DeepMind](/wiki/google_deepmind) | [OpenAI](/wiki/openai) |
| Game | StarCraft II | Dota 2 |
| Game type | 1v1 real-time strategy | 5v5 multiplayer online battle arena |
| Information | Imperfect (fog of war) | Imperfect (fog of war) |
| Architecture | Transformer + Deep LSTM + Pointer Network | Single-layer 4,096-unit LSTM per hero |
| Total parameters | ~139 million | ~159 million |
| Training method | Supervised learning + multi-agent RL (league) | Pure self-play (no human data) |
| RL algorithm | V-trace + TD(lambda) + UPGO | [Proximal Policy Optimization (PPO)](/wiki/reinforcement_learning) |
| Training hardware | 16 TPUs per agent (Google TPU v3) | 256 GPUs + 128,000 CPU cores |
| Training compute | 14 days (league phase) | ~10 months (~770 PFlops/s-days) |
| Gameplay experience | Up to 200 years per agent | ~180 years per day (collective) |
| Key achievement | Grandmaster in all 3 races (top 0.2%) | Defeated Dota 2 world champions OG (2-0) |
| Date of key result | October 2019 (Nature publication) | April 13, 2019 (OG match) |
| Game restrictions | None (full game, all races, all maps) | Restricted to 17 heroes |
| Publication | *Nature* (October 30, 2019) | arXiv preprint (December 2019) |

The two systems took different approaches to training. AlphaStar's league-based multi-agent approach was designed to maintain strategic diversity and prevent cyclic weaknesses (rock-paper-scissors dynamics), while OpenAI Five relied on massive-scale self-play with no human data at all.[8] AlphaStar's use of supervised pretraining from human replays gave it a strong initial behavioral prior, whereas OpenAI Five learned entirely from scratch.

One notable difference is that AlphaStar played the full, unrestricted version of StarCraft II during its Battle.net evaluation (all races, all maps in the competitive pool), while OpenAI Five's victory over OG used a restricted hero pool of 17 out of over 100 available heroes.[7]

## Legacy and Impact

AlphaStar influenced both subsequent AI research and the StarCraft community.

### Contributions to AI Research

The AlphaStar project introduced or validated several techniques that have influenced subsequent research:

- **Multi-agent league training**: The concept of maintaining a diverse population of agents with different training objectives (main agents, exploiters) has been adopted and extended in numerous multi-agent reinforcement learning studies. The league structure demonstrated a practical solution to the non-transitivity problem in game-theoretic training.
- **Transformer-based entity processing**: Using self-attention mechanisms to process variable-length sets of game entities (units) has become a standard approach in game AI and other domains requiring reasoning over sets of objects.
- **UPGO algorithm**: The upgoing policy gradient operator provided a new technique for self-imitation learning in sparse-reward environments.
- **Scalable distributed training**: The infrastructure for running thousands of parallel game instances with population-based training demonstrated a template for large-scale RL systems.

### AlphaStar Unplugged (2023)

In August 2023, DeepMind released "AlphaStar Unplugged," a large-scale offline [reinforcement learning](/wiki/reinforcement_learning) benchmark built on the AlphaStar codebase.[6] The benchmark includes a dataset of 2.8 million game episodes (representing over 30 years of gameplay), standardized evaluation protocols, and baseline implementations of offline RL algorithms including behavior cloning, offline actor-critic, and offline [MuZero](/wiki/muzero).[6] Offline RL agents trained on this benchmark achieved a 90% win rate against the original AlphaStar Supervised agent, demonstrating the potential of learning from pre-collected data without online interaction.[6]

### Open-Source Release

DeepMind released the AlphaStar codebase on GitHub, enabling the research community to study, reproduce, and build upon the system. This release has supported numerous follow-up projects, including mini-AlphaStar implementations that reduced computational requirements while preserving the core multi-agent RL components.

### Broader Significance

AlphaStar demonstrated that AI could handle an environment combining imperfect information, real-time constraints, enormous action spaces, and long-term planning. These properties are far more representative of real-world decision-making challenges than the perfect-information, turn-based games previously conquered by AI systems like [Deep Blue](/wiki/deep_blue) or AlphaGo. The techniques developed for AlphaStar have potential applications in robotics, autonomous systems, resource management, and any domain requiring sequential decision-making under uncertainty.

The *Nature* paper has accumulated thousands of citations and remains one of the most referenced works in [deep reinforcement learning](/wiki/reinforcement_learning).[1] It established StarCraft II as a benchmark domain for multi-agent RL and demonstrated that the combination of supervised pretraining, multi-agent league training, and carefully designed neural architectures could produce agents capable of competing with the best human players in one of the most complex games ever created.

## See Also

- [AlphaGo](/wiki/alphago)
- [AlphaZero](/wiki/alphazero)
- [Google DeepMind](/wiki/google_deepmind)
- [Reinforcement Learning](/wiki/reinforcement_learning)
- [OpenAI Five](/wiki/openai_five)
- [Transformer](/wiki/transformer)
- [Multi-Agent Reinforcement Learning](/wiki/reinforcement_learning)

## References

1. Vinyals, O., Babuschkin, I., Czarnecki, W.M. et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning." *Nature* 575, 350-354 (2019). https://doi.org/10.1038/s41586-019-1724-z
2. DeepMind. "AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning." DeepMind Blog, October 30, 2019. https://deepmind.google/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning/
3. DeepMind. "AlphaStar: Mastering the real-time strategy game StarCraft II." DeepMind Blog, January 24, 2019. https://deepmind.google/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii/
4. Vinyals, O., Ewalds, T., Bartunov, S. et al. "StarCraft II: A New Challenge for Reinforcement Learning." arXiv:1708.04782 (2017). https://arxiv.org/abs/1708.04782
5. Blizzard Entertainment. "DeepMind Research on Ladder." Blizzard News, July 2019. https://news.blizzard.com/en-us/starcraft2/22933138/deepmind-research-on-ladder
6. Mathieu, M., Ozair, S., Srinivasan, S. et al. "AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning." arXiv:2308.03526 (2023). https://arxiv.org/abs/2308.03526
7. OpenAI. "OpenAI Five defeats Dota 2 world champions." OpenAI Blog, April 15, 2019. https://openai.com/index/openai-five-defeats-dota-2-world-champions/
8. Berner, C., Brockman, G., Chan, B. et al. "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv:1912.06680 (2019). https://arxiv.org/abs/1912.06680

