OpenAI Five was a reinforcement learning system developed by OpenAI to play the competitive multiplayer video game Dota 2 at a professional level. The project spanned from 2017 to 2019 and culminated in a landmark achievement: on April 13, 2019, OpenAI Five became the first AI system to defeat reigning world champions in a major esports title, beating OG 2-0 in a best-of-three series in San Francisco. The system trained using Proximal Policy Optimization (PPO) at massive scale, accumulating roughly 45,000 years of in-game experience over ten months of training on 256 GPUs and 128,000 CPU cores.
Dota 2 is a multiplayer online battle arena (MOBA) game developed by Valve Corporation. In a standard match, two teams of five human players compete to destroy a large structure, called the Ancient, located in the opposing team's base. Each player controls a "hero" selected from a pool of over 115 characters, each with unique abilities. Matches typically last 30 to 60 minutes and require a combination of mechanical skill, strategic planning, team coordination, and real-time adaptation.
OpenAI selected Dota 2 as a research challenge because the game presented several properties that made it far more difficult for AI systems than previous game-playing benchmarks like chess or Go. Greg Brockman, OpenAI's co-founder and then-CTO, described the game as a stepping stone toward building AI systems that could handle the complexity and unpredictability of real-world problems.
Dota 2 posed a number of challenges that distinguished it from earlier AI game-playing milestones:
Partial observability. Unlike chess or Go, where both players can see the entire board, Dota 2 uses a "fog of war" mechanic. Players can only see areas of the map near their own units or wards, meaning much of the game state is hidden at any given time. The AI had to make decisions under significant uncertainty about enemy positions, intentions, and item builds.
Long time horizons. A typical Dota 2 match involves around 20,000 timesteps (at the frame rate used by OpenAI Five). Actions taken in the early game, such as resource allocation and lane assignments, can have consequences that only become apparent 20 or 30 minutes later. This made credit assignment extremely difficult.
Enormous action space. At each timestep, a hero can choose from roughly 8,000 to 80,000 valid actions depending on the situation. The theoretical action space, factoring in all possible combinations of action type, target, and positioning, reaches approximately 1.8 million discrete possibilities. For comparison, Go has a branching factor of roughly 250 moves per turn, and chess about 35.
High-dimensional observation space. Each hero receives approximately 16,000 numerical inputs per timestep describing the game state, including unit positions, health values, ability cooldowns, item inventories, and more. Rather than processing raw screen pixels, OpenAI Five consumed structured data through Valve's bot API.
Five-player coordination. Dota 2 is a team game requiring tight coordination among five players. Successful play demands role specialization, shared map control, coordinated team fights, and collective resource management. Each of the five AI agents needed to learn cooperative behavior without explicit communication channels.
Complex game mechanics. Dota 2's rules are implemented in hundreds of thousands of lines of code, with intricate interactions between hero abilities, items, terrain, and neutral creeps. The game receives frequent patches that alter balance and mechanics.
Before tackling the full 5v5 game, OpenAI built a bot that played 1v1 mid-lane matches using the hero Shadow Fiend. Development of the underlying algorithms began in November 2016. The bot learned entirely through self-play, starting with no prior knowledge of the game and gradually discovering effective strategies by playing against copies of itself. According to Greg Brockman, the bot required approximately two weeks of training to reach a competitive level.
The progression was rapid. By March 2017, the system achieved its first classical reinforcement learning results in a simplified Dota environment. By early June 2017, it could beat a tester at 1,500 matchmaking rating (MMR). By June 30, it won the majority of games against a 3,000 MMR tester. By July 8, it secured its first win against a 7,500 MMR semi-professional tester.
On August 11, 2017, OpenAI staged a surprise demonstration at The International 2017 (TI7), Dota 2's premier annual championship tournament, held in Seattle. The bot was matched against Danylo "Dendi" Ishutin, a Ukrainian professional player and former world champion widely regarded as one of the most recognizable figures in competitive Dota 2.
The match was played under standard 1v1 Shadow Fiend rules: first to two kills or first to destroy the enemy tower, with neutral creeps disabled and restrictions on certain items. The bot won the first game in under ten minutes, establishing a commanding lead in last hits (34 to Dendi's 14, with 15 denies to Dendi's 2). Dendi conceded the second game shortly after it began. During the match, Dendi repeatedly remarked, "This guy is scary."
In the days surrounding TI7, the bot also defeated several other top players in private matches, including Arteezy (rated approximately 10,000 MMR, one of the highest-rated players in the world) with a 10-0 record, SumaiL (a top 1v1 specialist) 6-0, Pajkatt (a professional player rated 8,500 MMR) 2-1, and Blitz (a former professional rated 6,200 MMR) 3-0.
The 1v1 bot operated under significant constraints. It played only one hero (Shadow Fiend) in a simplified 1v1 mid-lane format that eliminated most of the strategic complexity of the full game. There were no allied or enemy teammates, no jungle, limited items, and no need for map awareness or team coordination. Critics noted that 1v1 mid was largely a test of mechanical execution and lane control rather than deep strategic reasoning.
OpenAI Five consisted of five replicas of a single neural network, one controlling each hero on the team. The replicas shared the same architecture and weights but received different observations indicating which of the five heroes each controlled. The core of the network was a single-layer Long Short-Term Memory (LSTM) network with 4,096 hidden units. The full model contained approximately 159 million parameters, with the LSTM accounting for roughly 84% of the total parameter count.
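These figures can be sanity-checked with standard LSTM parameter arithmetic. The sketch below assumes a pre-LSTM embedding the same width as the hidden state, which is not stated here; it simply shows that the reported 84% share is plausible for a 4,096-unit LSTM inside a roughly 159-million-parameter model.

```python
hidden = 4096  # LSTM hidden units (reported)
inp = 4096     # assumed width of the input embedding fed to the LSTM

# A standard LSTM has 4 gates, each with an (input + hidden) x hidden
# weight matrix plus a hidden-sized bias vector.
lstm_params = 4 * ((inp + hidden) * hidden + hidden)

total = 159e6  # reported total parameter count
print(f"{lstm_params / 1e6:.0f}M LSTM parameters, "
      f"{lstm_params / total:.0%} of the total")
# -> 134M LSTM parameters, 84% of the total
```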
The observation space was flattened into a single vector of approximately 16,000 values per hero, representing all game-state information available to a human player (unit positions, health, mana, cooldowns, items, and so on). All floating-point observations were normalized using z-scores (subtracting the mean and dividing by the standard deviation) and clipped to the range (-5, 5) for training stability.
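As a rough illustration, a minimal normalization step consistent with this description might look like the following; the running statistics and the epsilon term are assumptions for the sketch, not published details.

```python
import numpy as np

def normalize_obs(obs, running_mean, running_std, clip=5.0):
    """Z-score a flat observation vector, then clip for stability."""
    z = (obs - running_mean) / (running_std + 1e-8)  # epsilon avoids /0
    return np.clip(z, -clip, clip)

# Stand-in for one hero's ~16,000-value observation vector.
obs = np.random.randn(16000) * 3.0 + 1.0
normed = normalize_obs(obs, running_mean=1.0, running_std=3.0)
```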
The action space used a factored structure. At each timestep, a hero selected a primary action (from up to 30 possibilities, averaging 8.1 available per timestep), plus parameters for delay (4 options), unit selection (189 options), and spatial offset (81 options). The combined theoretical action space reached approximately 1,837,080 possible action tuples, though invalid actions were filtered based on cooldowns, valid targets, and situational constraints.
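A minimal sketch of sampling from such a factored, masked action head is shown below. The head names, uniform logits, and specific masks are illustrative assumptions; only the head sizes come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_sample(logits, valid_mask):
    """Sample from a softmax over logits with invalid entries masked out
    (e.g. abilities on cooldown, targets out of range)."""
    masked = np.where(valid_mask, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# One (logits, validity mask) pair per head; sizes match the text.
heads = {
    "action": (np.zeros(30), np.zeros(30, dtype=bool)),
    "delay":  (np.zeros(4), np.ones(4, dtype=bool)),
    "unit":   (np.zeros(189), np.zeros(189, dtype=bool)),
    "offset": (np.zeros(81), np.ones(81, dtype=bool)),
}
heads["action"][1][:8] = True  # ~8 primary actions valid this timestep
heads["unit"][1][:12] = True   # e.g. 12 targetable units currently visible

choice = {name: masked_sample(l, m) for name, (l, m) in heads.items()}
# Combined head sizes: 30 * 4 * 189 * 81 = 1,837,080 possible tuples.
```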
Critically, the five hero networks did not communicate directly with each other. There was no shared memory, messaging system, or centralized coordinator. All coordination emerged purely from training through self-play. Each agent learned to anticipate what its teammates would do based on the observable game state.
OpenAI Five was trained using Proximal Policy Optimization (PPO), a policy gradient reinforcement learning algorithm developed at OpenAI. PPO was chosen for its stability and scalability when applied to large-scale distributed training.
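For reference, PPO's clipped surrogate objective can be written in a few lines. This is a generic NumPy sketch of the standard loss, not OpenAI Five's actual training code, and eps = 0.2 is the common default rather than a published hyperparameter of the project.

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """PPO clipped surrogate objective (returned as a loss to minimize).

    Clipping the probability ratio to [1 - eps, 1 + eps] prevents any
    single update from moving the policy too far from the one that
    generated the data -- the property that makes PPO stable enough
    for large-scale asynchronous training.
    """
    ratio = np.exp(new_logp - old_logp)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```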
The training ran on a custom distributed platform called "Rapid," hosted on Google Cloud Platform. The infrastructure was organized into several components:
| Component | Role |
|---|---|
| Rollout Workers (CPUs) | Simulated Dota 2 games and collected experience data |
| Forward Pass GPUs | Computed actions for rollout workers during gameplay |
| Optimizer GPUs | Sampled experience from the buffer, computed gradients, and updated model parameters |
| Controller | Distributed updated parameters to all components |
| Experience Buffer | Stored gameplay data for the optimizers to sample from |
Rollout workers and optimizers operated asynchronously. The system targeted a sample reuse ratio close to 1, meaning optimizers consumed experience data at roughly the same rate that rollout workers produced it. Stale data was treated as harmful; game data was sent every 30 seconds, and model parameters were updated approximately once per minute.
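A toy sketch of the buffer sitting between rollout workers and optimizers is given below; the bounded queue, batch selection, and sizes are illustrative assumptions that capture only the two properties described above: old data falls away, and sample reuse is tracked toward a target near 1.

```python
from collections import deque

class ExperienceBuffer:
    """Minimal stand-in for Rapid's experience buffer (illustrative)."""

    def __init__(self, maxlen=4096):
        self.data = deque(maxlen=maxlen)  # oldest samples drop off: bounds staleness
        self.produced = 0
        self.consumed = 0

    def push(self, sample):
        """Called by rollout workers as games are simulated."""
        self.data.append(sample)
        self.produced += 1

    def sample_batch(self, batch_size):
        """Called by optimizer GPUs; here, simply the freshest samples."""
        batch = list(self.data)[-batch_size:]
        self.consumed += len(batch)
        return batch

    @property
    def sample_reuse(self):
        """Average optimizer uses per collected sample (target was ~1)."""
        return self.consumed / max(self.produced, 1)
```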
At peak throughput, OpenAI Five played the equivalent of 180 years of Dota 2 per day during its initial training phase. Over the full ten-month training period (June 30, 2018 to April 22, 2019), the system accumulated approximately 45,000 years of in-game experience. The total compute used was estimated at 770 ± 50 petaflop/s-days.
Training a reinforcement learning agent on a game as complex as Dota 2 with only a win/loss signal at the end of a 45-minute match would be extremely slow. OpenAI addressed this by designing a shaped reward function that provided intermediate feedback throughout the game.
The reward function included components tied to game metrics that human players use to evaluate performance:
| Reward Component | Description |
|---|---|
| Kills | Reward for killing enemy heroes |
| Deaths | Penalty for being killed |
| Assists | Reward for contributing to teammate kills |
| Last Hits | Reward for landing the killing blow on enemy creeps (gold income) |
| Net Worth | Reward tied to total gold and item value |
| Tower Damage | Reward for damaging or destroying enemy towers |
Two important mechanisms modified the raw rewards:
Zero-sum adjustment. Each hero's reward was adjusted by subtracting the average reward of the enemy team, preventing agents from discovering positive-sum exploits that would not translate to competitive play.
Exponential time weighting. Rewards were scaled based on game time to prevent agents from overvaluing late-game actions, where power levels naturally increase and rewards grow larger in absolute terms.
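Together, the two adjustments might be applied as in the sketch below. The decay constant is an assumption for illustration; the text states only that rewards were scaled down with game time.

```python
import numpy as np

def adjusted_rewards(own, enemy, game_time_min, decay_per_min=0.98):
    """Apply the zero-sum and time-weighting adjustments to raw rewards.

    `own` and `enemy` are length-5 arrays of per-hero shaped rewards for
    one timestep. decay_per_min is illustrative, not a published value.
    """
    zero_sum = np.asarray(own) - np.mean(enemy)  # subtract enemy team mean
    scale = decay_per_min ** game_time_min       # shrink late-game rewards
    return zero_sum * scale
```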
OpenAI ran experiments comparing the shaped reward to a pure win/loss signal. The win/loss-only version trained an order of magnitude slower and plateaued at a lower skill level.
One of the more notable design choices in OpenAI Five was a hyperparameter called "team spirit," denoted by the Greek letter tau. This parameter controlled the balance between individual and collective reward for each agent, using a simple formula:
effective_reward[i] = tau * mean(all_hero_rewards) + (1 - tau) * hero_reward[i]
At tau = 0, each hero cared only about its own individual reward (kills, last hits, net worth). At tau = 1, each hero optimized solely for the team's average reward, promoting fully cooperative behavior.
During training, tau was annealed from 0.2 at the start to 0.97 near the end. Early in training, lower team spirit allowed agents to learn basic individual skills like farming and fighting. As training progressed, higher team spirit pushed the agents toward coordinated team play, sacrificing individual advantage for collective benefit.
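A short sketch of the team spirit blend and its annealing follows; the endpoints 0.2 and 0.97 come from the text, but the linear schedule shape is an assumption.

```python
import numpy as np

def team_spirit_rewards(hero_rewards, tau):
    """Blend each hero's reward with the team mean, weighted by tau."""
    team_mean = np.mean(hero_rewards)
    return tau * team_mean + (1 - tau) * np.asarray(hero_rewards)

def annealed_tau(progress, start=0.2, end=0.97):
    """tau as a function of training progress in [0, 1]; linear assumed."""
    return start + (end - start) * float(np.clip(progress, 0.0, 1.0))

rewards = [3.0, -1.0, 0.5, 0.0, 2.5]  # hypothetical per-hero rewards
early = team_spirit_rewards(rewards, annealed_tau(0.0))  # mostly individual
late = team_spirit_rewards(rewards, annealed_tau(1.0))   # mostly shared
```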
OpenAI Five trained entirely through self-play, with no human demonstration data or imitation learning. The system played 80% of its games against the latest version of itself and 20% against past checkpoints sampled from its training history. This mixture helped prevent "strategy collapse," where the agent might develop a narrow set of tactics that work well against its current self but fail against diverse opponents.
Past opponents were selected using a dynamic quality scoring system that prioritized informative matchups over random historical snapshots.
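One way to realize this scheme is sketched below. The 80/20 split comes from the text; the quality-update rule (down-weighting past checkpoints the current agent reliably beats) is an assumption consistent with the stated goal of prioritizing informative matchups.

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_opponent(latest, past_pool, past_quality, p_latest=0.8):
    """80%: latest self; 20%: a past checkpoint, weighted by quality."""
    if not past_pool or rng.random() < p_latest:
        return latest
    q = np.asarray(past_quality, dtype=float)
    idx = rng.choice(len(past_pool), p=q / q.sum())
    return past_pool[idx]

def update_quality(past_quality, idx, current_won, lr=0.01):
    """Down-weight a checkpoint the current agent beats (assumed rule)."""
    if current_won:
        past_quality[idx] *= (1.0 - lr)
```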
Over the 296-day (approximately 10-month) training run, OpenAI needed to modify the model multiple times to accommodate game patches, changes to the hero pool, and architectural improvements. The team developed a technique called "surgery" to handle these transitions.
When model changes maintained the same input-output structure, the new model was initialized to replicate the old model's behavior as closely as possible. When this was not feasible (for example, when the observation space changed due to a game patch), the team gradually increased the proportion of games played with the new version, allowing the model to adapt incrementally rather than starting from scratch.
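The first case, behavior-preserving initialization, can be illustrated with a dense layer whose input is widened by zero-initialized columns, so the expanded model initially computes exactly what the old one did. This is a generic sketch of the idea, not OpenAI's actual surgery code.

```python
import numpy as np

def grow_input_layer(W, b, n_new_inputs):
    """Widen a dense layer's input dimension without changing its output.

    New columns are zero-initialized, so new observation features have
    no effect until gradient updates give them weight.
    """
    out_dim = W.shape[0]
    W_new = np.concatenate([W, np.zeros((out_dim, n_new_inputs))], axis=1)
    return W_new, b  # bias unchanged

W, b = np.random.randn(8, 16), np.zeros(8)
W2, b2 = grow_input_layer(W, b, n_new_inputs=4)

x_old = np.random.randn(16)
x_new = np.concatenate([x_old, np.random.randn(4)])  # patch adds 4 features
assert np.allclose(W @ x_old + b, W2 @ x_new + b2)   # identical behavior
```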
| Date | Event | Result |
|---|---|---|
| November 2016 | Algorithm development begins | N/A |
| March 2017 | First RL results in simplified Dota environment | N/A |
| August 11, 2017 | 1v1 bot vs. Dendi at TI7 | Bot wins 2-0 |
| June 25, 2018 | OpenAI Five announced; beats amateur teams | OpenAI Five wins |
| August 5, 2018 | OpenAI Five Benchmark vs. casters/ex-pros (~4,200 MMR) | OpenAI Five wins 2-1 |
| August 22-23, 2018 | TI8 Showmatches vs. paiN Gaming and a Chinese all-star team | OpenAI Five loses both matches |
| April 13, 2019 | OpenAI Five Finals vs. OG (TI8 world champions) | OpenAI Five wins 2-0 |
| April 18-21, 2019 | OpenAI Five Arena (public online event) | 99.4% win rate (7,215 wins, 42 losses) |
| April 22, 2019 | Training officially ends; project retired | N/A |
OpenAI Five operated under a set of game restrictions that simplified the full Dota 2 experience. In its final form these included a hero pool limited to 17 of the game's more than 115 heroes, no summoned or illusion units, and a small set of banned items such as Divine Rapier. The restrictions were gradually relaxed over the project's lifetime, but some remained in place through the final matches against OG.
Earlier versions of OpenAI Five also lacked wards (vision-granting items) and Roshan (a powerful neutral boss that grants significant team advantages when killed). Both were enabled before the later matches, adding strategic depth.
The restricted hero pool was one of the most commonly cited limitations. Hero selection ("drafting") is a fundamental part of competitive Dota 2. Teams spend significant effort constructing hero compositions that synergize well and counter the opponent's picks. With only 17 heroes available, this dimension of the game was severely limited. OpenAI reportedly attempted to expand the hero pool to 25 before the OG match but found that the system was not learning quickly enough to reach professional level with the larger pool.
On April 13, 2019, OpenAI hosted the "OpenAI Five Finals" in San Francisco. The headline match pitted OpenAI Five against OG, the winners of The International 2018 (TI8) and the reigning Dota 2 world champions at the time. The event was broadcast on Twitch with commentary from well-known Dota 2 personalities, including William "Blitz" Lee, Austin "Capitalist" Walsh, Owen "ODPixel" Davies, Kevin "Purge" Godec, and Jorien "Sheever" van der Heijden.
OG's roster for the event included their full TI8-winning lineup:
| Player | Real Name | Nationality | Position |
|---|---|---|---|
| ana | Anathan Pham | Australia | Carry (Position 1) |
| Topson | Topias Taavitsainen | Finland | Mid (Position 2) |
| 7ckngMad (Ceb) | Sébastien Debs | France | Offlane (Position 3) |
| JerAx | Jesse Vainikka | Finland | Support (Position 4) |
| N0tail | Johan Sundstein | Denmark | Support (Position 5) |
OpenAI Five won both games decisively.
Game 1 lasted 38 minutes and 18 seconds. OpenAI Five drafted Sniper, Earthshaker, Viper, Riki, and Shadow Fiend. OG played Gyrocopter, Witch Doctor, Death Prophet, Tidehunter, and Crystal Maiden. After the draft phase, OpenAI Five's internal model estimated a 95% win probability. The AI established map control in the mid game and systematically dismantled OG's defenses.
Game 2 lasted just 20 minutes and 51 seconds. OpenAI Five drafted Crystal Maiden, Gyrocopter, Sven, Witch Doctor, and Viper. OG picked Sniper, Earthshaker, Death Prophet, Slark, and Lion. The second game was a dominant performance by the AI, with OG calling "GG" (conceding defeat) before the 21-minute mark.
This result marked the first time an AI system had defeated the reigning world champions in a major esports title in a public, live-streamed competition.
Following the OG match, OpenAI opened the system to the public through the "OpenAI Five Arena," an online event running from April 18 to April 21, 2019. During the event, anyone could form a five-player team and challenge OpenAI Five under the same rules used in the OG match.
The results were overwhelming. Over the four-day period, public teams played 7,257 games against the system; OpenAI Five won 7,215 and lost 42, a 99.4% win rate.
Around the same period, DeepMind developed AlphaStar, an AI system for Blizzard Entertainment's real-time strategy game StarCraft II. The two projects represent the most prominent examples of applying deep reinforcement learning to complex competitive video games. While they shared broad similarities, they differed substantially in their technical approaches and the challenges they faced.
| Attribute | OpenAI Five | AlphaStar |
|---|---|---|
| Game | Dota 2 (MOBA) | StarCraft II (RTS) |
| Developer | OpenAI | DeepMind |
| Year of peak result | 2019 | 2019 |
| Players per team | 5 AI agents (cooperative) | 1 AI agent |
| Core architecture | Single-layer 4,096-unit LSTM | Deep LSTM with Transformer encoder and pointer network |
| Model parameters | ~159 million | ~139 million (55 million at inference) |
| Training method | Pure self-play with PPO | Supervised learning on human replays, then RL with league-based self-play |
| Human data used | None | 971,000 human replays |
| Compute hardware | 256 NVIDIA P100 GPUs + 128,000 CPU cores | 16-32 TPUv3s per agent; 384 TPUv3s for league |
| Training duration | ~10 months (continuous) | ~44 days (league training phase) |
| Game experience | ~45,000 years equivalent | Not directly comparable (league-based) |
| Input representation | Structured game-state vectors (bot API) | Structured game-state data (no raw pixels) |
| Game restrictions | 17-hero pool, no items like Divine Rapier, no summons | Full game; camera and APM constraints added later |
| Peak achievement | Defeated TI8 champions OG 2-0 | Reached Grandmaster (top 0.2%) in all three races |
| Publication | arXiv, December 2019 | Nature, October 2019 |
One notable difference was the use of human data. AlphaStar bootstrapped its training with supervised learning on nearly one million human replays before transitioning to reinforcement learning, while OpenAI Five learned entirely from scratch through self-play. AlphaStar also used a "league" training approach where multiple agents specialized against different strategies, whereas OpenAI Five used a single population trained against its current and past selves.
Another key distinction involved game restrictions. AlphaStar eventually played the full StarCraft II game with all three races and no gameplay restrictions (though with constrained action rates to approximate human physical limitations). OpenAI Five never played with the full hero roster or all game mechanics enabled.
OpenAI Five demonstrated that relatively simple reinforcement learning algorithms, when scaled to sufficient compute, could solve problems of remarkable complexity. The system used no search trees, no explicit planning modules, and no hand-coded strategies beyond the reward function. Its performance came entirely from the combination of a large LSTM network, self-play, and massive computational scale.
The project provided evidence for a hypothesis that would become increasingly central to AI research in subsequent years: that scale, in terms of both model size and training compute, could substitute for algorithmic complexity in many domains.
OpenAI reused the same reinforcement learning algorithms and training code from OpenAI Five for Dactyl, a project that trained a robotic hand (a Shadow Dexterous Hand) to manipulate physical objects, first reorienting a block and later solving a Rubik's Cube. Dactyl ran on the same "Rapid" distributed training platform, using 6,144 CPU cores and 8 GPUs, and collected approximately 100 years of simulated experience in 50 hours. The successful transfer demonstrated that the infrastructure and algorithms developed for game-playing could generalize to physical robotics tasks.
One of the more surprising findings was the degree of team coordination that emerged without explicit communication. The five agents learned to execute complex team fights, set up ambushes, coordinate ability usage, and make collective decisions about when to push objectives or retreat. This coordination arose purely from the team spirit reward mechanism and shared training through self-play. No agent could send messages to its teammates; each simply learned to predict what the others would do.
OpenAI constrained the agents' reaction time to between 167 and 267 milliseconds (5 to 8 frames at the game's tick rate), placing them in a range comparable to human professional players. The effective action rate was approximately 7.5 actions per second. This was done deliberately to ensure that the AI's advantages came from strategic and tactical decision-making rather than superhuman reflexes.
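The quoted figures follow directly from Dota 2's 30 frames-per-second game logic; the act-every-4th-frame cadence below is inferred from the 7.5 actions-per-second figure rather than stated in this section.

```python
FPS = 30               # Dota 2 game logic runs at 30 frames per second
frames_per_action = 4  # inferred: 30 / 4 = 7.5 actions per second

actions_per_second = FPS / frames_per_action            # 7.5
reaction_window_ms = [f / FPS * 1000 for f in (5, 8)]   # [166.7, 266.7]
print(actions_per_second, reaction_window_ms)
```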
OpenAI was transparent about the system's limitations. The restricted hero pool meant that OpenAI Five never experienced the full strategic complexity of Dota 2's drafting phase. The system also showed weaknesses in very late-game scenarios during its TI8 losses, where long-term strategic planning and item choices mattered most. Professional players who competed against OpenAI Five noted that the bots sometimes made unusual item decisions and struggled with certain late-game strategies that required careful resource management.
Additionally, the enormous computational cost raised questions about sample efficiency. The 45,000 years of simulated gameplay and hundreds of thousands of dollars in cloud computing costs were far beyond what any human player would need to reach a similar skill level.
After the OpenAI Five Arena concluded on April 21, 2019, OpenAI officially retired the project. Training was halted on April 22, 2019, and the system was not updated further. The team published its findings in a paper titled "Dota 2 with Large Scale Deep Reinforcement Learning" on arXiv in December 2019.
The project left a lasting mark on the field of reinforcement learning and AI research more broadly. It demonstrated that cooperative multi-agent reinforcement learning could produce coordinated behavior in complex, partially observable environments. It validated the effectiveness of PPO as a general-purpose RL algorithm at scale. And it contributed to a growing body of evidence that compute scaling could unlock capabilities that had previously seemed to require fundamental algorithmic breakthroughs.
OpenAI Five also influenced the public perception of AI capabilities. The live-streamed matches at TI7, TI8, and the OpenAI Five Finals attracted millions of viewers and introduced a broad audience to the state of modern AI research. For many in the gaming community, the matches against Dendi and OG served as tangible demonstrations of how far machine learning had progressed.