OpenAI Five was a reinforcement learning system developed by OpenAI to play the competitive multiplayer video game Dota 2 at a professional level. The project spanned from 2017 to 2019 and culminated in a landmark achievement: on April 13, 2019, OpenAI Five became the first AI system to defeat reigning world champions in a major esports title, beating OG 2-0 in a best-of-three series in San Francisco. The system trained using Proximal Policy Optimization (PPO) at massive scale, accumulating roughly 45,000 years of in-game experience over ten months of training on 256 GPUs and 128,000 CPU cores.
Dota 2 is a multiplayer online battle arena (MOBA) game developed by Valve Corporation. In a standard match, two teams of five human players compete to destroy a large structure, called the Ancient, located in the opposing team's base. Each player controls a "hero" selected from a pool of over 115 characters, each with unique abilities. Matches typically last 30 to 60 minutes and require a combination of mechanical skill, strategic planning, team coordination, and real-time adaptation.
OpenAI selected Dota 2 as a research challenge because the game presented several properties that made it far more difficult for AI systems than previous game-playing benchmarks like chess or Go. Greg Brockman, OpenAI's co-founder and then-CTO, described the game as a stepping stone toward building AI systems that could handle the complexity and unpredictability of real-world problems.
Dota 2 posed a number of challenges that distinguished it from earlier AI game-playing milestones:
Partial observability. Unlike chess or Go, where both players can see the entire board, Dota 2 uses a "fog of war" mechanic. Players can only see areas of the map near their own units or wards, meaning much of the game state is hidden at any given time. The AI had to make decisions under significant uncertainty about enemy positions, intentions, and item builds.
Long time horizons. A typical Dota 2 match involves around 20,000 timesteps (at the frame rate used by OpenAI Five). Actions taken in the early game, such as resource allocation and lane assignments, can have consequences that only become apparent 20 or 30 minutes later. This made credit assignment extremely difficult.
Enormous action space. At each timestep, a hero can choose from roughly 8,000 to 80,000 valid actions depending on the situation. The theoretical action space, factoring in all possible combinations of action type, target, and positioning, reaches approximately 1.8 million discrete possibilities. For comparison, Go has a branching factor of roughly 250 moves per turn, and chess about 35.
High-dimensional observation space. Each hero receives approximately 16,000 numerical inputs per timestep describing the game state, including unit positions, health values, ability cooldowns, item inventories, and more. Rather than processing raw screen pixels, OpenAI Five consumed structured data through Valve's bot API.
Five-player coordination. Dota 2 is a team game requiring tight coordination among five players. Successful play demands role specialization, shared map control, coordinated team fights, and collective resource management. Each of the five AI agents needed to learn cooperative behavior without explicit communication channels.
Complex game mechanics. Dota 2's rules are implemented in hundreds of thousands of lines of code, with intricate interactions between hero abilities, items, terrain, and neutral creeps. The game receives frequent patches that alter balance and mechanics.
Before tackling the full 5v5 game, OpenAI built a bot that played 1v1 mid-lane matches using the hero Shadow Fiend. Development of the underlying algorithms began in November 2016. The bot learned entirely through self-play, starting with no prior knowledge of the game and gradually discovering effective strategies by playing against copies of itself. According to Greg Brockman, the bot required approximately two weeks of training to reach a competitive level.
The progression was rapid. By March 2017, the system achieved its first classical reinforcement learning results in a simplified Dota environment. By early June 2017, it could beat a tester at 1,500 matchmaking rating (MMR). By June 30, it won the majority of games against a 3,000 MMR tester. By July 8, it secured its first win against a 7,500 MMR semi-professional tester.
On August 11, 2017, OpenAI staged a surprise demonstration at The International 2017 (TI7), Dota 2's premier annual championship tournament, held in Seattle. The bot was matched against Danylo "Dendi" Ishutin, a Ukrainian professional player and former world champion widely regarded as one of the most recognizable figures in competitive Dota 2.
The match was played under standard 1v1 Shadow Fiend rules: first to two kills or first to destroy the enemy tower, with neutral creeps disabled and restrictions on certain items. The bot won the first game in under ten minutes, establishing a commanding lead in last hits (34 to Dendi's 14, with 15 denies to Dendi's 2). Dendi conceded the second game shortly after it began. During the match, Dendi repeatedly remarked, "This guy is scary."
In the days surrounding TI7, the bot also defeated several other top players in private matches, including Arteezy (rated approximately 10,000 MMR, one of the highest-rated players in the world) with a 10-0 record, SumaiL (a top 1v1 specialist) 6-0, Pajkatt (a professional player rated 8,500 MMR) 2-1, and Blitz (a former professional rated 6,200 MMR) 3-0.
The 1v1 bot operated under significant constraints. It played only one hero (Shadow Fiend) in a simplified 1v1 mid-lane format that eliminated most of the strategic complexity of the full game. There were no allied or enemy teammates, no jungle, limited items, and no need for map awareness or team coordination. Critics noted that 1v1 mid was largely a test of mechanical execution and lane control rather than deep strategic reasoning.
OpenAI Five consisted of five replicas of a single neural network, one controlling each hero on the team. The replicas shared the same architecture and weights but received different observations indicating which of the five heroes each controlled. The core of the network was a single-layer Long Short-Term Memory (LSTM) network with 4,096 hidden units. The full model contained approximately 159 million parameters, with the LSTM accounting for roughly 84% of the total parameter count.
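These figures can be sanity-checked with standard LSTM parameter arithmetic. The sketch below assumes a pre-LSTM embedding the same width as the hidden state, which is not stated here; it simply shows that the reported 84% share is plausible for a 4,096-unit LSTM inside a roughly 159-million-parameter model.

```python
hidden = 4096  # LSTM hidden units (reported)
inp = 4096     # assumed width of the input embedding fed to the LSTM

# A standard LSTM has 4 gates, each with an (input + hidden) x hidden
# weight matrix plus a hidden-sized bias vector.
lstm_params = 4 * ((inp + hidden) * hidden + hidden)

total = 159e6  # reported total parameter count
print(f"{lstm_params / 1e6:.0f}M LSTM parameters, "
      f"{lstm_params / total:.0%} of the total")
# -> 134M LSTM parameters, 84% of the total
```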
The observation space was flattened into a single vector of approximately 16,000 values per hero, representing all game-state information available to a human player (unit positions, health, mana, cooldowns, items, and so on). All floating-point observations were normalized using z-scores (subtracting the mean and dividing by the standard deviation) and clipped to the range (-5, 5) for training stability.
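As a rough illustration, a minimal normalization step consistent with this description might look like the following; the running statistics and the epsilon term are assumptions for the sketch, not published details.

```python
import numpy as np

def normalize_obs(obs, running_mean, running_std, clip=5.0):
    """Z-score a flat observation vector, then clip for stability."""
    z = (obs - running_mean) / (running_std + 1e-8)  # epsilon avoids /0
    return np.clip(z, -clip, clip)

# Stand-in for one hero's ~16,000-value observation vector.
obs = np.random.randn(16000) * 3.0 + 1.0
normed = normalize_obs(obs, running_mean=1.0, running_std=3.0)
```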
The action space used a factored structure. At each timestep, a hero selected a primary action (from up to 30 possibilities, averaging 8.1 available per timestep), plus parameters for delay (4 options), unit selection (189 options), and spatial offset (81 options). The combined theoretical action space reached approximately 1,837,080 possible action tuples, though invalid actions were filtered based on cooldowns, valid targets, and situational constraints.
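A minimal sketch of sampling from such a factored, masked action head is shown below. The head names, uniform logits, and specific masks are illustrative assumptions; only the head sizes come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_sample(logits, valid_mask):
    """Sample from a softmax over logits with invalid entries masked out
    (e.g. abilities on cooldown, targets out of range)."""
    masked = np.where(valid_mask, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# One (logits, validity mask) pair per head; sizes match the text.
heads = {
    "action": (np.zeros(30), np.zeros(30, dtype=bool)),
    "delay":  (np.zeros(4), np.ones(4, dtype=bool)),
    "unit":   (np.zeros(189), np.zeros(189, dtype=bool)),
    "offset": (np.zeros(81), np.ones(81, dtype=bool)),
}
heads["action"][1][:8] = True  # ~8 primary actions valid this timestep
heads["unit"][1][:12] = True   # e.g. 12 targetable units currently visible

choice = {name: masked_sample(l, m) for name, (l, m) in heads.items()}
# Combined head sizes: 30 * 4 * 189 * 81 = 1,837,080 possible tuples.
```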
Critically, the five hero networks did not communicate directly with each other. There was no shared memory, messaging system, or centralized coordinator. All coordination emerged purely from training through self-play. Each agent learned to anticipate what its teammates would do based on the observable game state.
OpenAI Five was trained using Proximal Policy Optimization (PPO), a policy gradient reinforcement learning algorithm developed at OpenAI. PPO was chosen for its stability and scalability when applied to large-scale distributed training.
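For reference, PPO's clipped surrogate objective can be written in a few lines. This is a generic NumPy sketch of the standard loss, not OpenAI Five's actual training code, and eps = 0.2 is the common default rather than a published hyperparameter of the project.

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """PPO clipped surrogate objective (returned as a loss to minimize).

    Clipping the probability ratio to [1 - eps, 1 + eps] prevents any
    single update from moving the policy too far from the one that
    generated the data -- the property that makes PPO stable enough
    for large-scale asynchronous training.
    """
    ratio = np.exp(new_logp - old_logp)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```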
The training ran on a custom distributed platform called "Rapid," hosted on Google Cloud Platform. The infrastructure was organized into several components:
| Component | Role |
|---|---|
| Rollout Workers (CPUs) | Simulated Dota 2 games and collected experience data |
| Forward Pass GPUs | Computed actions for rollout workers during gameplay |
| Optimizer GPUs | Sampled experience from the buffer, computed gradients, and updated model parameters |
| Controller | Distributed updated parameters to all components |
| Experience Buffer | Stored gameplay data for the optimizers to sample from |
Rollout workers and optimizers operated asynchronously. The system targeted a sample reuse ratio close to 1, meaning optimizers consumed experience data at roughly the same rate that rollout workers produced it. Stale data was treated as harmful; game data was sent every 30 seconds, and model parameters were updated approximately once per minute.
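A toy sketch of the buffer sitting between rollout workers and optimizers is given below; the bounded queue, batch selection, and sizes are illustrative assumptions that capture only the two properties described above: old data falls away, and sample reuse is tracked toward a target near 1.

```python
from collections import deque

class ExperienceBuffer:
    """Minimal stand-in for Rapid's experience buffer (illustrative)."""

    def __init__(self, maxlen=4096):
        self.data = deque(maxlen=maxlen)  # oldest samples drop off: bounds staleness
        self.produced = 0
        self.consumed = 0

    def push(self, sample):
        """Called by rollout workers as games are simulated."""
        self.data.append(sample)
        self.produced += 1

    def sample_batch(self, batch_size):
        """Called by optimizer GPUs; here, simply the freshest samples."""
        batch = list(self.data)[-batch_size:]
        self.consumed += len(batch)
        return batch

    @property
    def sample_reuse(self):
        """Average optimizer uses per collected sample (target was ~1)."""
        return self.consumed / max(self.produced, 1)
```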
At peak throughput, OpenAI Five played the equivalent of 180 years of Dota 2 per day during its initial training phase. Over the full ten-month training period (June 30, 2018 to April 22, 2019), the system accumulated approximately 45,000 years of in-game experience. The total compute used was estimated at 770 ± 50 petaflop/s-days.
Training a reinforcement learning agent on a game as complex as Dota 2 with only a win/loss signal at the end of a 45-minute match would be extremely slow. OpenAI addressed this by designing a shaped reward function that provided intermediate feedback throughout the game.
The reward function included components tied to game metrics that human players use to evaluate performance:
| Reward Component | Description |
|---|---|
| Kills | Reward for killing enemy heroes |
| Deaths | Penalty for being killed |
| Assists | Reward for contributing to teammate kills |
| Last Hits | Reward for landing the killing blow on enemy creeps (gold income) |
| Net Worth | Reward tied to total gold and item value |
| Tower Damage | Reward for damaging or destroying enemy towers |
Two important mechanisms modified the raw rewards:
Zero-sum adjustment. Each hero's reward was adjusted by subtracting the average reward of the enemy team, preventing agents from discovering positive-sum exploits that would not translate to competitive play.
Exponential time weighting. Rewards were scaled based on game time to prevent agents from overvaluing late-game actions, where power levels naturally increase and rewards grow larger in absolute terms.
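Together, the two adjustments might be applied as in the sketch below. The decay constant is an assumption for illustration; the text states only that rewards were scaled down with game time.

```python
import numpy as np

def adjusted_rewards(own, enemy, game_time_min, decay_per_min=0.98):
    """Apply the zero-sum and time-weighting adjustments to raw rewards.

    `own` and `enemy` are length-5 arrays of per-hero shaped rewards for
    one timestep. decay_per_min is illustrative, not a published value.
    """
    zero_sum = np.asarray(own) - np.mean(enemy)  # subtract enemy team mean
    scale = decay_per_min ** game_time_min       # shrink late-game rewards
    return zero_sum * scale
```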
OpenAI ran experiments comparing the shaped reward to a pure win/loss signal. The win/loss-only version trained an order of magnitude slower and plateaued at a lower skill level.
One of the more notable design choices in OpenAI Five was a hyperparameter called "team spirit," denoted by the Greek letter tau. This parameter controlled the balance between individual and collective reward for each agent, using a simple formula:
effective_reward[i] = tau * mean(all_hero_rewards) + (1 - tau) * hero_reward[i]
At tau = 0, each hero cared only about its own individual reward (kills, last hits, net worth). At tau = 1, each hero optimized solely for the team's average reward, promoting fully cooperative behavior.
During training, tau was annealed from 0.2 at the start to 0.97 near the end. Early in training, lower team spirit allowed agents to learn basic individual skills like farming and fighting. As training progressed, higher team spirit pushed the agents toward coordinated team play, sacrificing individual advantage for collective benefit.
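A short sketch of the team spirit blend and its annealing follows; the endpoints 0.2 and 0.97 come from the text, but the linear schedule shape is an assumption.

```python
import numpy as np

def team_spirit_rewards(hero_rewards, tau):
    """Blend each hero's reward with the team mean, weighted by tau."""
    team_mean = np.mean(hero_rewards)
    return tau * team_mean + (1 - tau) * np.asarray(hero_rewards)

def annealed_tau(progress, start=0.2, end=0.97):
    """tau as a function of training progress in [0, 1]; linear assumed."""
    return start + (end - start) * float(np.clip(progress, 0.0, 1.0))

rewards = [3.0, -1.0, 0.5, 0.0, 2.5]  # hypothetical per-hero rewards
early = team_spirit_rewards(rewards, annealed_tau(0.0))  # mostly individual
late = team_spirit_rewards(rewards, annealed_tau(1.0))   # mostly shared
```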
OpenAI Five trained entirely through self-play, with no human demonstration data or imitation learning. The system played 80% of its games against the latest version of itself and 20% against past checkpoints sampled from its training history. This mixture helped prevent "strategy collapse," where the agent might develop a narrow set of tactics that work well against its current self but fail against diverse opponents.
Past opponents were selected using a dynamic quality scoring system that prioritized informative matchups over random historical snapshots.
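One way to realize this scheme is sketched below. The 80/20 split comes from the text; the quality-update rule (down-weighting past checkpoints the current agent reliably beats) is an assumption consistent with the stated goal of prioritizing informative matchups.

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_opponent(latest, past_pool, past_quality, p_latest=0.8):
    """80%: latest self; 20%: a past checkpoint, weighted by quality."""
    if not past_pool or rng.random() < p_latest:
        return latest
    q = np.asarray(past_quality, dtype=float)
    idx = rng.choice(len(past_pool), p=q / q.sum())
    return past_pool[idx]

def update_quality(past_quality, idx, current_won, lr=0.01):
    """Down-weight a checkpoint the current agent beats (assumed rule)."""
    if current_won:
        past_quality[idx] *= (1.0 - lr)
```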
Over the 296-day (approximately 10-month) training run, OpenAI needed to modify the model multiple times to accommodate game patches, changes to the hero pool, and architectural improvements. The team developed a technique called "surgery" to handle these transitions.
When model changes maintained the same input-output structure, the new model was initialized to replicate the old model's behavior as closely as possible. When this was not feasible (for example, when the observation space changed due to a game patch), the team gradually increased the proportion of games played with the new version, allowing the model to adapt incrementally rather than starting from scratch.
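The first case, behavior-preserving initialization, can be illustrated with a dense layer whose input is widened by zero-initialized columns, so the expanded model initially computes exactly what the old one did. This is a generic sketch of the idea, not OpenAI's actual surgery code.

```python
import numpy as np

def grow_input_layer(W, b, n_new_inputs):
    """Widen a dense layer's input dimension without changing its output.

    New columns are zero-initialized, so new observation features have
    no effect until gradient updates give them weight.
    """
    out_dim = W.shape[0]
    W_new = np.concatenate([W, np.zeros((out_dim, n_new_inputs))], axis=1)
    return W_new, b  # bias unchanged

W, b = np.random.randn(8, 16), np.zeros(8)
W2, b2 = grow_input_layer(W, b, n_new_inputs=4)

x_old = np.random.randn(16)
x_new = np.concatenate([x_old, np.random.randn(4)])  # patch adds 4 features
assert np.allclose(W @ x_old + b, W2 @ x_new + b2)   # identical behavior
```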
| Date | Event | Result |
|---|---|---|
| November 2016 | Algorithm development begins | N/A |
| March 2017 | First RL results in simplified Dota environment | N/A |
| August 11, 2017 | 1v1 bot vs. Dendi at TI7 | Bot wins 2-0 |
| June 25, 2018 | OpenAI Five announced; beats amateur teams | OpenAI Five wins |
| August 5, 2018 | OpenAI Five Benchmark vs. casters/ex-pros (~4,200 MMR) | OpenAI Five wins 2-1 |
| August 22-23, 2018 | TI8 Showmatches vs. paiN Gaming and a Chinese all-star team | OpenAI Five loses both matches |
| April 13, 2019 | OpenAI Five Finals vs. OG (TI8 world champions) | OpenAI Five wins 2-0 |
| April 18-21, 2019 | OpenAI Five Arena (public online event) | 99.4% win rate (7,215 wins, 42 losses) |
| April 22, 2019 | Training officially ends; project retired | N/A |
OpenAI Five operated under a set of game restrictions that simplified the full Dota 2 experience. In its final form these included a hero pool limited to 17 of the game's more than 115 heroes, no summoned or illusion units, and a small set of banned items such as Divine Rapier. The restrictions were gradually relaxed over the project's lifetime, but some remained in place through the final matches against OG.
Earlier versions of OpenAI Five also lacked wards (vision-granting items) and Roshan (a powerful neutral boss that grants significant team advantages when killed). Both were enabled before the later matches, adding strategic depth.
The restricted hero pool was one of the most commonly cited limitations. Hero selection ("drafting") is a fundamental part of competitive Dota 2. Teams spend significant effort constructing hero compositions that synergize well and counter the opponent's picks. With only 17 heroes available, this dimension of the game was severely limited. OpenAI reportedly attempted to expand the hero pool to 25 before the OG match but found that the system was not learning quickly enough to reach professional level with the larger pool.
On April 13, 2019, OpenAI hosted the "OpenAI Five Finals" in San Francisco. The headline match pitted OpenAI Five against OG, the winners of The International 2018 (TI8) and the reigning Dota 2 world champions at the time. The event was broadcast on Twitch with commentary from well-known Dota 2 personalities, including William "Blitz" Lee, Austin "Capitalist" Walsh, Owen "ODPixel" Davies, Kevin "Purge" Godec, and Jorien "Sheever" van der Heijden.
OG's roster for the event included their full TI8-winning lineup:
| Player | Real Name | Nationality | Position |
|---|---|---|---|
| ana | Anathan Pham | Australia | Carry (Position 1) |
| Topson | Topias Taavitsainen | Finland | Mid (Position 2) |
| 7ckngMad (Ceb) | Sébastien Debs | France | Offlane (Position 3) |
| JerAx | Jesse Vainikka | Finland | Support (Position 4) |
| N0tail | Johan Sundstein | Denmark | Support (Position 5) |
OpenAI Five won both games decisively.
Game 1 lasted 38 minutes and 18 seconds. OpenAI Five drafted Sniper, Earthshaker, Viper, Riki, and Shadow Fiend. OG played Gyrocopter, Witch Doctor, Death Prophet, Tidehunter, and Crystal Maiden. After the draft phase, OpenAI Five's internal model estimated a 95% win probability. The AI established map control in the mid game and systematically dismantled OG's defenses.
Game 2 lasted just 20 minutes and 51 seconds. OpenAI Five drafted Crystal Maiden, Gyrocopter, Sven, Witch Doctor, and Viper. OG picked Sniper, Earthshaker, Death Prophet, Slark, and Lion. The second game was a dominant performance by the AI, with OG calling "GG" (conceding defeat) before the 21-minute mark.
This result marked the first time an AI system had defeated the reigning world champions in a major esports title in a public, live-streamed competition.
Following the OG match, OpenAI opened the system to the public through the "OpenAI Five Arena," an online event running from April 18 to April 21, 2019. During the event, anyone could form a five-player team and challenge OpenAI Five under the same rules used in the OG match.
The results were overwhelming. Over the four-day period, public teams played 7,257 games against the system; OpenAI Five won 7,215 and lost 42, a 99.4% win rate.
Around the same period, DeepMind developed AlphaStar, an AI system for Blizzard Entertainment's real-time strategy game StarCraft II. The two projects represent the most prominent examples of applying deep reinforcement learning to complex competitive video games. While they shared broad similarities, they differed substantially in their technical approaches and the challenges they faced.
| Attribute | OpenAI Five | AlphaStar |
|---|---|---|
| Game | Dota 2 (MOBA) | StarCraft II (RTS) |
| Developer | OpenAI | DeepMind |
| Year of peak result | 2019 | 2019 |
| Players per team | 5 AI agents (cooperative) | 1 AI agent |
| Core architecture | Single-layer 4,096-unit LSTM | Deep LSTM with Transformer encoder and pointer network |
| Model parameters | ~159 million | ~139 million (55 million at inference) |
| Training method | Pure self-play with PPO | Supervised learning on human replays, then RL with league-based self-play |
| Human data used | None | 971,000 human replays |
| Compute hardware | 256 NVIDIA P100 GPUs + 128,000 CPU cores | 16-32 TPUv3s per agent; 384 TPUv3s for league |
| Training duration | ~10 months (continuous) | ~44 days (league training phase) |
| Game experience | ~45,000 years equivalent | Not directly comparable (league-based) |
| Input representation | Structured game-state vectors (bot API) | Structured game-state data (no raw pixels) |
| Game restrictions | 17-hero pool, no items like Divine Rapier, no summons | Full game; camera and APM constraints added later |
| Peak achievement | Defeated TI8 champions OG 2-0 | Reached Grandmaster (top 0.2%) in all three races |
| Publication | arXiv, December 2019 | Nature, October 2019 |
One notable difference was the use of human data. AlphaStar bootstrapped its training with supervised learning on nearly one million human replays before transitioning to reinforcement learning, while OpenAI Five learned entirely from scratch through self-play. AlphaStar also used a "league" training approach where multiple agents specialized against different strategies, whereas OpenAI Five used a single population trained against its current and past selves.
Another key distinction involved game restrictions. AlphaStar eventually played the full StarCraft II game with all three races and no gameplay restrictions (though with constrained action rates to approximate human physical limitations). OpenAI Five never played with the full hero roster or all game mechanics enabled.
OpenAI Five demonstrated that relatively simple reinforcement learning algorithms, when scaled to sufficient compute, could solve problems of remarkable complexity. The system used no search trees, no explicit planning modules, and no hand-coded strategies beyond the reward function. Its performance came entirely from the combination of a large LSTM network, self-play, and massive computational scale.
The project provided evidence for a hypothesis that would become increasingly central to AI research in subsequent years: that scale, in terms of both model size and training compute, could substitute for algorithmic complexity in many domains.
OpenAI reused the same reinforcement learning algorithms and training code from OpenAI Five for Dactyl, a project that trained a robotic hand (a Shadow Dexterous Hand) to manipulate physical objects, first reorienting a block and later solving a Rubik's Cube. Dactyl ran on the same "Rapid" distributed training platform, using 6,144 CPU cores and 8 GPUs, and collected approximately 100 years of simulated experience in 50 hours. The successful transfer demonstrated that the infrastructure and algorithms developed for game-playing could generalize to physical robotics tasks.
One of the more surprising findings was the degree of team coordination that emerged without explicit communication. The five agents learned to execute complex team fights, set up ambushes, coordinate ability usage, and make collective decisions about when to push objectives or retreat. This coordination arose purely from the team spirit reward mechanism and shared training through self-play. No agent could send messages to its teammates; each simply learned to predict what the others would do.
OpenAI constrained the agents' reaction time to between 167 and 267 milliseconds (5 to 8 frames at the game's tick rate), placing them in a range comparable to human professional players. The effective action rate was approximately 7.5 actions per second. This was done deliberately to ensure that the AI's advantages came from strategic and tactical decision-making rather than superhuman reflexes.
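The quoted figures follow directly from Dota 2's 30 frames-per-second game logic; the act-every-4th-frame cadence below is inferred from the 7.5 actions-per-second figure rather than stated in this section.

```python
FPS = 30               # Dota 2 game logic runs at 30 frames per second
frames_per_action = 4  # inferred: 30 / 4 = 7.5 actions per second

actions_per_second = FPS / frames_per_action            # 7.5
reaction_window_ms = [f / FPS * 1000 for f in (5, 8)]   # [166.7, 266.7]
print(actions_per_second, reaction_window_ms)
```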
OpenAI was transparent about the system's limitations. The restricted hero pool meant that OpenAI Five never experienced the full strategic complexity of Dota 2's drafting phase. The system also showed weaknesses in very late-game scenarios during its TI8 losses, where long-term strategic planning and item choices mattered most. Professional players who competed against OpenAI Five noted that the bots sometimes made unusual item decisions and struggled with certain late-game strategies that required careful resource management.
Additionally, the enormous computational cost raised questions about sample efficiency. The 45,000 years of simulated gameplay and hundreds of thousands of dollars in cloud computing costs were far beyond what any human player would need to reach a similar skill level.
After the OpenAI Five Arena concluded on April 21, 2019, OpenAI officially retired the project. Training was halted on April 22, 2019, and the system was not updated further. The team published its findings in a paper titled "Dota 2 with Large Scale Deep Reinforcement Learning" on arXiv in December 2019.
The project left a lasting mark on the field of reinforcement learning and AI research more broadly. It demonstrated that cooperative multi-agent reinforcement learning could produce coordinated behavior in complex, partially observable environments. It validated the effectiveness of PPO as a general-purpose RL algorithm at scale. And it contributed to a growing body of evidence that compute scaling could unlock capabilities that had previously seemed to require fundamental algorithmic breakthroughs.
OpenAI Five also influenced the public perception of AI capabilities. The live-streamed matches at TI7, TI8, and the OpenAI Five Finals attracted millions of viewers and introduced a broad audience to the state of modern AI research. For many in the gaming community, the matches against Dendi and OG served as tangible demonstrations of how far machine learning had progressed.