Gym Retro
Last reviewed
May 10, 2026
Sources
11 citations
Review status
Source-backed
Revision
v2 · 2,500 words
Gym Retro is a platform for reinforcement learning (RL) research on classic video games developed by OpenAI and released in 2018. It turns retro console games into Gym environments by wrapping emulators that support the Libretro API, which makes it relatively easy to add new systems, and ships with a companion integration tool so researchers can mark up new ROMs with reward signals, save states, and termination conditions. [1] [2] [3] The full release shipped with roughly 1,000 games from systems made by Atari, Sega, Nintendo, and NEC. [1]
The preliminary release earlier in 2018 included 62 Atari 2600 games from the Arcade Learning Environment plus 30 Sega Genesis games from the SEGA Mega Drive and Genesis Classics Steam bundle, and that smaller corpus shipped with the Retro Contest in April 2018. The May 2018 full release expanded the lineup to over 1,000 titles spanning Sega Genesis, Sega Master System, NES, SNES, and Game Boy, with preliminary support for Sega Game Gear, Game Boy Color, Game Boy Advance, and the NEC TurboGrafx-16. [1] [4]
Moving past Atari was the point. Atari 2600 games run on 1977 hardware with simple sprites and a tiny action space, which made them a natural early benchmark for deep RL after DQN in 2013 but an increasingly stale one. Sega Genesis titles run on a 16-bit Motorola 68000 with a larger frame buffer, more buttons, and richer game logic, so agents must read richer pixels and act over longer horizons. Retro built on lessons from Universe, OpenAI's late-2016 platform that ran browser and Flash games inside VNC sessions and proved unreliable because of real-time stepping and screen-scraped state. Retro fixed that by emulating consoles in-process and reading state straight out of emulator RAM. [1] [3]
| Attribute | Value |
|---|---|
| Developer | OpenAI |
| Initial preliminary release | April 5, 2018 (with the Retro Contest) |
| Full release | May 2018 |
| Latest OpenAI version | 0.8.0 (May 1, 2020) |
| License | MIT |
| Source | github.com/openai/retro |
| Maintenance status | Maintenance only since 2020 |
| Active fork | stable-retro by the Farama Foundation |
| Game count at full release | ~1,000 across 8 systems |
| Languages | C (~69%), C++ (~27%), Python bindings |
| Companion paper | Gotta Learn Fast (Nichol et al., 2018) [5] |
By 2017 the standard RL benchmark for pixel-based agents was the Arcade Learning Environment (ALE), exposing a few dozen Atari 2600 ROMs through a Gym-compatible interface. ALE had carried the field through DQN, A3C, Rainbow, and PPO, but the limits were obvious: an 18-action joystick space, hand-tuned rewards, many titles already mastered, and agents that overfit to the specific ROM they were trained on. [1] [5]
OpenAI's earlier attempt at scale, Universe (late 2016), exposed thousands of browser and Flash titles through VNC but ran in wall-clock time and depended on fragile screen scraping. Retro took a different bet: run emulators inside the Python process and read game state directly out of emulator RAM, which unlocked deterministic resets, save states, and faster-than-real-time training. The Retro Learning Environment, an earlier academic project for SNES and Genesis RL, used a similar trick; OpenAI credits it as inspiration but argues Gym Retro is more flexible because it abstracts over Libretro cores rather than baking in specific emulators. [1] [3] [4]
Gym Retro is a thin Python wrapper around emulator binaries plus a per-game data layer.
Libretro is an emulator API that compiles each emulator into a single shared library called a core. A Libretro frontend (RetroArch is the best known) loads the core, sends it inputs, and pulls back video, audio, and memory. Retro acts as the frontend: each supported system ships its Libretro core inside the package, and adding a new console mostly means dropping in a new core. The Libretro API includes retro_get_memory_data and retro_get_memory_size, which Retro uses to read game variables for reward shaping. [3] [6]
Each integrated game has four files plus at least one save state.
| File | Purpose |
|---|---|
| data.json | Maps named variables (lives, score, x position, ring count) to RAM addresses and types. |
| scenario.json | Defines the reward function and the done condition using variables from data.json. |
| metadata.json | Stores the default starting save state and other game-level settings. |
| script.lua | Optional Lua hooks for rewards or termination conditions that need logic beyond simple expressions. |
| *.state | Binary save states marking levels, checkpoints, or specific scenarios. |
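To make the data layer concrete, here is a minimal sketch of how a data.json entry could be decoded from a RAM snapshot and turned into a reward and a done flag by a scenario.json. The JSON layouts shown (an `info` map with `address` and `type` fields, a `reward` map with per-variable weights, a `done` map with a `zero` operator), the addresses, and every helper name are illustrative assumptions, not Retro's actual implementation.

```python
import json

# Hypothetical data.json: variable name -> RAM address and type.
# The type string is assumed to be <endianness><kind><byte count>,
# e.g. ">u4" = big-endian unsigned 32-bit, "|u1" = one unsigned byte.
DATA_JSON = json.loads("""
{"info": {"lives": {"address": 4, "type": "|u1"},
          "score": {"address": 0, "type": ">u4"}}}
""")

# Hypothetical scenario.json: one point of reward per point of score
# gained; the episode ends when the lives counter reaches zero.
SCENARIO_JSON = json.loads("""
{"reward": {"variables": {"score": {"reward": 1.0}}},
 "done": {"variables": {"lives": {"op": "zero"}}}}
""")


def read_variable(ram: bytes, address: int, type_code: str) -> int:
    """Decode one integer variable from a RAM snapshot."""
    endian, kind, size = type_code[0], type_code[1], int(type_code[2:])
    byteorder = "little" if endian == "<" else "big"
    raw = ram[address:address + size]
    return int.from_bytes(raw, byteorder, signed=(kind == "i"))


def read_info(ram: bytes) -> dict:
    """Decode every variable listed in the data.json sketch."""
    return {name: read_variable(ram, spec["address"], spec["type"])
            for name, spec in DATA_JSON["info"].items()}


def step_reward(prev: dict, curr: dict) -> float:
    """Weighted sum of per-variable changes between two steps."""
    return sum(spec["reward"] * (curr[name] - prev[name])
               for name, spec in SCENARIO_JSON["reward"]["variables"].items())


def is_done(curr: dict) -> bool:
    """True when any variable with the 'zero' operator equals zero."""
    return any(curr[name] == 0
               for name, spec in SCENARIO_JSON["done"]["variables"].items()
               if spec["op"] == "zero")


# Two fake 5-byte RAM snapshots: bytes 0-3 hold the score, byte 4 the lives.
before = read_info(bytes([0, 0, 0, 100, 3]))
after = read_info(bytes([0, 0, 1, 44, 0]))
print(step_reward(before, after), is_done(after))  # 200.0 True
```

The point of the split is that data.json only names memory locations, while scenario.json decides what those names mean for training, so the same RAM map can back several reward definitions.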
ROM hashes are stored in a rom.sha file, and most hashes match No-Intro SHA-1 sums. ROMs themselves are not shipped: users have to provide them, although a few non-commercial homebrew titles such as Airstriker-Genesis come bundled for testing. [2] [3]
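The hash check can be illustrated with a short sketch using Python's standard hashlib. It assumes, for illustration only, that a rom.sha file holds one hexadecimal SHA-1 digest per line; the function names are hypothetical and not part of Retro.

```python
import hashlib


def rom_sha1(rom_bytes: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of a ROM image."""
    return hashlib.sha1(rom_bytes).hexdigest()


def matches_known_hash(rom_bytes: bytes, sha_file_text: str) -> bool:
    """Check a ROM against the digests listed in a rom.sha-style text.

    Assumes one hex digest per line; blank lines are ignored.
    """
    known = {line.strip().lower()
             for line in sha_file_text.splitlines() if line.strip()}
    return rom_sha1(rom_bytes) in known


# Stand-in bytes instead of a real ROM, so the sketch runs anywhere.
fake_rom = b"abc"
sha_text = rom_sha1(fake_rom) + "\n"
print(matches_known_hash(fake_rom, sha_text))      # True
print(matches_known_hash(b"patched", sha_text))    # False
```

Matching on content hashes rather than file names is what lets two researchers confirm they are training on byte-identical ROMs.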
The integration UI is a Qt desktop app that lets researchers step through a game frame by frame, watch RAM, and bookmark addresses as named variables. Lives, score, and progress counters get found by playing while sweeping memory for values that change in the expected way; the UI then exports them straight into a data.json for that ROM. [3]
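The memory sweep described above can be sketched as a filter over successive RAM snapshots: keep only the addresses whose value changed by the expected amount each time the in-game event happened. This is a simplified illustration that assumes single-byte unsigned variables; the function name and snapshot data are hypothetical, and the real integration tool is interactive.

```python
from typing import Iterable, List, Optional


def candidates_after_change(before: bytes, after: bytes, delta: int,
                            candidates: Optional[Iterable[int]] = None) -> List[int]:
    """Return addresses whose byte value changed by `delta` between snapshots.

    If `candidates` is given, only those addresses are re-checked, which
    narrows the search after each new observation.
    """
    if candidates is None:
        candidates = range(len(before))
    return [addr for addr in candidates if after[addr] - before[addr] == delta]


# Three fake snapshots taken around two lost lives.
snap0 = bytes([7, 0, 3, 9, 9, 5])
snap1 = bytes([8, 0, 2, 9, 9, 4])   # after losing the first life
snap2 = bytes([9, 0, 1, 9, 9, 4])   # after losing the second life

first_pass = candidates_after_change(snap0, snap1, -1)
second_pass = candidates_after_change(snap1, snap2, -1, first_pass)
print(first_pass)   # [2, 5]
print(second_pass)  # [2]
```

Each repetition of the event removes coincidental matches, so a handful of observations is usually enough to isolate a counter in this toy setting.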
The Python API mirrors Gym. After installation a typical session looks like:
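import retro

# Airstriker-Genesis is the bundled homebrew ROM, so no ROM import is needed.
env = retro.make(game='Airstriker-Genesis')
obs = env.reset()
while True:
    # Sample a random button combination and advance the emulator one step.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()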
Observations are RGB frames, actions are by default multi-binary vectors with one entry per button that the emulated gamepad exposes (a Genesis game, for example, takes a 12-element vector), and info contains the data.json variables. A game can ship several save states; a specific one is selected by passing the state= argument to retro.make. [2]
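Because an action is a vector with one slot per gamepad button, it is often convenient to build it from button names. The sketch below shows one way to do that; the helper name and the eight-button layout are purely illustrative assumptions and do not reflect the order or number of buttons Retro uses for any particular console.

```python
from typing import List, Sequence


def buttons_to_action(pressed: Sequence[str], layout: Sequence[str]) -> List[int]:
    """Build a 0/1 action vector with one slot per button in `layout`."""
    unknown = set(pressed) - set(layout)
    if unknown:
        raise ValueError(f"unknown buttons: {sorted(unknown)}")
    return [1 if name in pressed else 0 for name in layout]


# Hypothetical button layout, chosen only for this example.
LAYOUT = ["B", "A", "START", "UP", "DOWN", "LEFT", "RIGHT", "C"]

action = buttons_to_action(["RIGHT", "B"], LAYOUT)
print(action)  # [1, 0, 0, 0, 0, 0, 1, 0]
```

A vector built this way could be passed to env.step in place of a random sample, provided the layout matches the one the environment actually expects.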
| System | Emulator core | Notes |
|---|---|---|
| Atari 2600 | Stella | Same hardware as ALE; Retro re-uses ALE-style ROMs. |
| NEC TurboGrafx-16 / PC Engine | Mednafen / Beetle PCE Fast | Preliminary support at full release. |
| Nintendo Game Boy / Game Boy Color | gambatte | Game Boy Color marked preliminary. |
| Nintendo Game Boy Advance | mGBA | Preliminary support. |
| Nintendo NES | FCEUmm | Standard 8-bit Nintendo. |
| Nintendo SNES | Snes9x | 16-bit Nintendo. |
| Sega Game Gear | Genesis Plus GX | Preliminary. |
| Sega Genesis / Mega Drive | Genesis Plus GX | The flagship console for the Retro Contest. |
| Sega Master System | Genesis Plus GX | Same core covers Master System, Game Gear, and Genesis. |
The full release covered eight emulated systems with about 1,000 ROMs integrated overall, and the Genesis catalog is the most thoroughly annotated because it was the focus of the contest. [2] [3]
Gym Retro shipped pre-built wheels for Windows 7/8/10, macOS 10.13 (High Sierra) and 10.14 (Mojave), and manylinux1 Linux. It supports Python 3.6, 3.7, and 3.8, and OpenAI recommended a CPU with SSSE3 or better. The simplest install is pip install gym-retro, which still works against the May 2020 0.8.0 wheels for compatible Python versions. Building from source requires CMake, a C++ compiler, and the cloned repository at github.com/openai/retro. [2] [7]
OpenAI released a technical report titled "Gotta Learn Fast: A New Benchmark for Generalization in RL" by Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman, submitted to arXiv on April 10, 2018 and revised on April 23, 2018. The paper proposes a transfer-learning benchmark on the Sonic the Hedgehog Genesis trilogy: agents train on a large pool of training levels, then face a held-out set of test levels with a budget of one million timesteps per test level (about 18 hours of in-game time, since each timestep spans four frames at 60 Hz). The report runs three baselines (Rainbow DQN, PPO, and a hand-coded random search baseline called JERK, short for 'Just Enough Retained Knowledge') and shows that joint PPO, trained on the training levels and then fine-tuned on each test level, roughly doubles the performance of PPO trained from scratch on the test levels. The paper frames the benchmark as a complement to ALE rather than a replacement: ALE measures within-task performance, Sonic measures generalization. [5]
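The 18-hour figure can be checked with a line of arithmetic. The sketch below assumes each agent timestep spans four emulator frames at 60 Hz; the frame-skip value is treated here as an assumption, and the function name is hypothetical.

```python
def timesteps_to_hours(timesteps: int, frames_per_step: int = 4,
                       fps: float = 60.0) -> float:
    """Convert an agent timestep budget into hours of in-game time."""
    seconds = timesteps * frames_per_step / fps
    return seconds / 3600.0


hours = timesteps_to_hours(1_000_000)
print(round(hours, 1))  # 18.5
```

With a frame skip of 1 the same budget would cover only about 4.6 hours, which is why the per-timestep frame count matters when comparing budgets across benchmarks.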
Between April 5 and June 5, 2018, OpenAI ran a public competition called the Retro Contest for the best agent on previously unseen levels of the Sonic the Hedgehog Genesis games. Participants received the training levels from the three Genesis titles, were free to use any data or compute at training time, and submitted Docker containers; at test time each container had a one million timestep budget per held-out level, roughly 18 hours of in-game play. [4] [8]
Submissions ran on OpenAI's evaluation infrastructure. Participants trained or scripted an agent on the public Sonic levels, wrapped it in a Docker image, and ran it against five public test levels (low-quality levels generated with a Sonic level editor) for the live leaderboard; final standings were decided on a separate set of secret evaluation levels that competitors never saw. Leaderboard scores were averaged over levels, with each level capped at a normalized maximum of 10,000. OpenAI reported 923 registered teams, 229 of which submitted at least one solution; the evaluation cluster ran 4,448 evaluations over two months, roughly twenty per submitting team. Most entries started from the tuned PPO and Rainbow DQN baselines shipped with the contest. [8]
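The scoring rule and the per-team evaluation count can be reproduced in a few lines. This is a sketch of the averaging and capping described above, with made-up per-level scores; how raw game progress was normalized into the 0 to 10,000 range is not modeled, and the function name is hypothetical.

```python
from statistics import mean

MAX_LEVEL_SCORE = 10_000  # normalized per-level cap


def leaderboard_score(level_scores):
    """Average per-level scores after capping each at the maximum."""
    return mean(min(score, MAX_LEVEL_SCORE) for score in level_scores)


# Made-up scores for five test levels; the second exceeds the cap.
print(leaderboard_score([4000, 12000, 5000, 3000, 6000]))  # 5600

# Evaluations per submitting team, from the reported totals.
print(round(4448 / 229, 1))  # 19.4
```

Capping each level before averaging keeps one unusually easy level from dominating the overall score.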
Top scores came from tuning existing model-free RL algorithms, not new architectures:
| Rank | Team | Approach | Notes |
|---|---|---|---|
| 1 | Dharmaraja | Joint PPO with modifications | Six-member team (Qing Da, Jing-Cheng Shi, Anxiang Zeng, Guangda Huzhang, Run-Ze Li, Yang Yu). |
| 2 | Mistake | Modified Rainbow DQN | Added a CNN layer, tuned n for n-step Q-learning, and lowered the DQN target update interval; edged out Aborg narrowly. |
| 3 | Aborg | Joint PPO with extra training data | Solo entry from Alexandre Borghi; mixed in Sonic levels from Master System and Game Boy Advance ports with a different network. |
The top final score was 4,692 against a theoretical maximum of 10,000, which OpenAI took as a sign the benchmark was hard but not saturated and that Sonic-style transfer remained an open problem. The 'low quality' label on the public leaderboard levels refers to their being editor-generated rather than crafted by Sega. [8]
The Retro Contest got coverage in TechCrunch, The Register, and Wired, mostly framed as an OpenAI publicity move around Sonic. In academic circles Retro became a common base for transfer-learning and meta-RL work, often cited alongside ProcGen and the Obstacle Tower Challenge once those benchmarks appeared in 2019. The 16-bit games tested long-horizon credit assignment more cleanly than Atari, and the Sonic levels were the canonical transfer-learning benchmark thanks to Gotta Learn Fast. The platform shows up in work on world models, exploration bonuses, curriculum learning, and self-supervised learning with RL, and it remains a teaching staple in university RL courses. [1] [4] [5] [9]
OpenAI moved Gym Retro into maintenance mode soon after release; the GitHub README has carried "Status: Maintenance (expect bug fixes and minor updates)" since 2018, and the last upstream release on PyPI was 0.8.0 on May 1, 2020. After that, the public RL gym ecosystem migrated away from OpenAI: Gym was forked into Gymnasium under the Farama Foundation, and Gym Retro followed the same path under the name stable-retro. [2] [10]
Stable-retro is led by Mathieu Poliquin and the Farama Foundation, and accepts pull requests for new games, emulator cores, and bug fixes that upstream no longer takes. The fork adds Sega Saturn, Sega CD, Sega 32X, Sega Dreamcast, Nintendo 64, Nintendo DS, and arcade machines while keeping the core Gym Retro API; Python support has been broadened to 3.7 through 3.12, and the Windows route runs through WSL2. The documentation lists more than 1,000 integrated games. For new RL projects on retro consoles, stable-retro is now the recommended starting point. [10] [11]
| Platform | Year | Scope | Reset model | Notes |
|---|---|---|---|---|
| Arcade Learning Environment | 2013 | ~60 Atari 2600 games | Deterministic | The original RL pixel benchmark; small action space and short horizons. |
| Universe | 2016 | Browser, Flash, commercial titles | Real-time, screen-scraped | Discontinued; reliability problems with VNC and timing. |
| Retro Learning Environment | 2016 | SNES, Genesis | Deterministic | Academic precursor to Gym Retro. |
| Gym Retro | 2018 | ~1,000 games across 8 retro consoles | Deterministic save states | Maintained by OpenAI until 2020. |
| stable-retro | 2022 onward | Same as Retro plus Saturn, N64, DS, arcade | Deterministic save states | Active maintained fork by the Farama Foundation. |
| ProcGen | 2019 | 16 procedurally generated game-like environments | Deterministic | Designed specifically for testing generalization, with no licensed ROMs. |
| MiniGrid | 2018 onward | Gridworlds | Deterministic | Lightweight benchmark for instruction following and planning. |
Gym Retro and ProcGen ended up filling complementary roles: ProcGen tests generalization across procedurally generated variations of a single game family, while Retro tests it across human-designed levels that share style and mechanics. Researchers often cite both. [4] [5]
ROMs are not bundled, so reproducible benchmarks depend on every contributor sourcing identical ROM hashes. Reward functions are extracted from RAM, which means each new game needs manual integration to find the right addresses. The 0.8.0 wheels target Python 3.6 to 3.8 on older operating systems; on modern macOS or Linux it is often easier to install stable-retro than to fight legacy build tooling. The Sonic benchmark has not been updated since 2018, and although Dharmaraja's 4,692 remains a useful reference, it predates diffusion policies, world models, and large-scale pre-training. [2] [10]