In reinforcement learning (RL), an environment is the external system that an agent interacts with. The environment receives the agent's actions, transitions to a new state, and returns an observation along with a numerical reward signal. Everything outside the agent's decision-making boundary is considered part of the environment. This formulation, popularized by Richard Sutton and Andrew Barto in Reinforcement Learning: An Introduction, provides the foundation for modeling sequential decision problems as Markov decision processes (MDPs).[1]
While the term "environment" also appears in broader machine learning contexts (referring to data sources, hardware, and experimental setups), this article focuses on the RL-specific meaning, where the environment defines the task an agent must learn to solve.
An RL environment is typically formalized as part of a Markov decision process. An MDP is described by a tuple (S, A, P, R, γ), where:[2]
| Symbol | Component | Description |
|---|---|---|
| S | State space | The set of all possible states the environment can occupy |
| A | Action space | The set of all valid actions the agent can take |
| P | Transition function | P(s' \| s, a): the probability of moving to state s' given state s and action a |
| R | Reward function | R(s, a, s'): the numerical signal returned after a transition |
| γ | Discount factor | A value in [0, 1] that determines how much future rewards are weighted |
At each discrete time step t, the agent observes a state s_t, selects an action a_t according to its policy π, and the environment responds with a new state s_{t+1} and a reward r_{t+1}. The agent's objective is to learn a policy that maximizes the expected cumulative (discounted) reward over time.[1]
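Written out, the quantity being maximized is the return, the discounted sum of future rewards; the notation below follows Sutton and Barto's standard formulation:[1]

```latex
% Return (cumulative discounted reward) from time step t
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Objective: a policy that maximizes the expected return
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \right]
```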
The state space S represents every possible configuration the environment can be in. States can be discrete (a finite set of board positions in chess) or continuous (joint angles and velocities of a robotic arm). In many practical applications, the agent does not see the full state directly but instead receives an observation that may be a subset or noisy version of the true state.[3]
The action space A defines what the agent is allowed to do at each step. Action spaces fall into two broad categories:[4]
| Type | Description | Examples |
|---|---|---|
| Discrete | A finite set of distinct actions | Moving left, right, up, or down in a grid world; selecting one of several Atari game buttons |
| Continuous | Actions are real-valued vectors | Torque applied to each joint of a robotic arm; throttle and steering angle for an autonomous vehicle |
Some environments have mixed action spaces containing both discrete and continuous components.
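In code, these categories map directly onto Gymnasium's space classes (covered later in this article); a minimal sketch using the Discrete and Box types:

```python
from gymnasium.spaces import Box, Discrete

# A discrete action space with four actions (e.g., up/down/left/right)
grid_actions = Discrete(4)

# A continuous action space: a 3-D torque vector, each component in [-1, 1]
arm_torques = Box(low=-1.0, high=1.0, shape=(3,))

print(grid_actions.sample())     # a random valid action, e.g., 2
print(arm_torques.sample())      # e.g., [ 0.41 -0.77  0.03]
print(grid_actions.contains(3))  # True: 3 is a valid action index
```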
The transition function P(s' | s, a) describes how the environment evolves in response to an action. In a deterministic environment, a given state-action pair always leads to the same next state. In a stochastic environment, the next state is sampled from a probability distribution, introducing uncertainty.[2] The agent typically does not have access to the transition function and must learn about the environment through trial and error, although model-based RL methods attempt to learn an approximate model of these dynamics.
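As a toy illustration (the corridor task and slip probability are invented for this sketch, not taken from the cited sources), the same state-action pair can yield different next states in a stochastic environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def slippery_step(state, action, slip_prob=0.2):
    """Hypothetical stochastic transition for a 1-D corridor: the intended
    move (action is +1 or -1) succeeds with probability 1 - slip_prob;
    otherwise the agent slips one cell in the opposite direction."""
    if rng.random() < slip_prob:
        return state - action  # slipped
    return state + action      # intended move

# Repeating the same (state, action) pair yields different next states:
print([slippery_step(5, +1) for _ in range(8)])  # mostly 6, occasionally 4
```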
The reward function provides the feedback that drives learning. Rewards can be:

- Dense or sparse: dense rewards arrive at every time step, while sparse rewards are given only at key events, such as winning a game.
- Positive or negative: positive rewards reinforce behavior, while negative rewards (penalties) discourage it.
- Immediate or delayed: when the consequences of an action show up in the reward only many steps later, the agent faces the credit assignment problem.
Designing a good reward function is one of the most challenging aspects of applied RL. Poorly designed rewards can lead to reward hacking, where the agent finds unintended shortcuts that maximize reward without achieving the desired behavior.[5]
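To make the sparse/dense distinction concrete, here is a sketch for a hypothetical 1-D navigation task (both functions are illustrative, not from the cited sources); dense shaping gives the agent feedback at every step, but a misspecified shaping term is a common source of reward hacking:

```python
def sparse_reward(state, goal):
    """Hypothetical sparse reward: feedback only when the goal is reached."""
    return 1.0 if state == goal else 0.0

def dense_reward(state, goal):
    """Hypothetical dense (shaped) reward: negative distance to the goal,
    so every step toward the goal is rewarded immediately."""
    return -abs(goal - state)
```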
RL environments can be classified along several dimensions. The following table summarizes the major distinctions.
| Dimension | Type A | Type B | Key Difference |
|---|---|---|---|
| Observability | Fully observable (MDP) | Partially observable (POMDP) | Whether the agent sees the complete state or only a partial observation |
| Determinism | Deterministic | Stochastic | Whether the same action in the same state always produces the same outcome |
| Number of agents | Single-agent | Multi-agent | Whether one or multiple decision-makers interact with the environment |
| Task horizon | Episodic | Continuing | Whether the interaction has a clear terminal state or runs indefinitely |
In a fully observable environment, the agent can see the entire state at every time step. This setting corresponds to a standard MDP. Chess (ignoring clock considerations) is fully observable because both players can see all pieces on the board.
In a partially observable environment, the agent receives only an incomplete or noisy observation of the true state. This setting is modeled as a Partially Observable Markov Decision Process (POMDP). Poker is a classic example: each player can see their own hand but not the opponents' cards. Autonomous driving is another example, because sensors provide limited range and can be occluded. In POMDPs, agents often rely on memory (such as recurrent neural networks or belief states) to infer the hidden state from the history of observations.[6]
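A lightweight alternative to recurrent networks is to stack a short history of recent observations and treat the window as the state; the buffer below is an illustrative sketch (class and parameter names are invented):

```python
import numpy as np

class HistoryBuffer:
    """Keep the last k observations so a memoryless policy can infer
    hidden quantities (e.g., velocity) from a window of positions."""

    def __init__(self, k, obs_dim):
        self.buffer = np.zeros((k, obs_dim))

    def push(self, obs):
        self.buffer = np.roll(self.buffer, shift=-1, axis=0)  # drop oldest
        self.buffer[-1] = obs                                 # append newest
        return self.buffer.flatten()  # feed the stacked window to the policy
```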
A deterministic environment produces the same next state every time a particular action is taken from a particular state. Board games like Go (without randomized elements) and simple grid worlds are deterministic.
A stochastic environment introduces randomness into its transitions. Backgammon, where dice rolls affect available moves, is a stochastic environment. Most real-world environments are stochastic to some degree because of sensor noise, unpredictable external factors, and other sources of uncertainty.[2]
A single-agent environment contains only one decision-maker interacting with the environment. Classic control tasks (balancing a cart-pole, navigating a maze) are single-agent problems.
A multi-agent environment has two or more agents that simultaneously or sequentially interact with the shared environment. Multi-agent settings introduce additional complexity because each agent's optimal strategy depends on the behavior of the other agents. This can involve cooperation, competition, or a mix of both. The StarCraft Multi-Agent Challenge (SMAC) is a widely used benchmark for cooperative multi-agent RL research.[7]
Episodic environments have a natural endpoint. Each episode starts from an initial state and ends when a terminal condition is met (winning a game, falling off a platform, or reaching a time limit). The agent's performance is typically measured by the total undiscounted reward per episode. Examples include playing a round of an Atari game or navigating a maze.
Continuing environments have no terminal state; the agent-environment interaction runs indefinitely. Stock trading systems, industrial process control, and server resource management are continuing tasks. In these settings, the discount factor γ becomes essential because an undiscounted sum of rewards over an infinite horizon can grow without bound.[1]
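The reason is a geometric-series argument: if every per-step reward is bounded in magnitude by R_max and γ < 1, the discounted return is guaranteed to be finite:

```latex
% Geometric-series bound on the discounted return
\left| \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \right|
  \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
```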
OpenAI Gym, released in 2016, established the first widely adopted API standard for RL environments. It defined a simple interface: reset() to initialize an episode and step(action) to advance the environment by one time step. After OpenAI stopped actively maintaining Gym in late 2020, the Farama Foundation took over and released Gymnasium as the official successor.[8]
The core Gymnasium API specifies that step() returns five values:
| Return Value | Type | Description |
|---|---|---|
| observation | Depends on environment | The agent's observation of the current state |
| reward | float | Numerical feedback for the action taken |
| terminated | bool | True if the episode ended due to reaching a terminal state |
| truncated | bool | True if the episode ended due to a time limit or boundary violation |
| info | dict | Auxiliary diagnostic information |
The separation of terminated and truncated (introduced in Gymnasium v0.26) was an important change from the original Gym API, which used a single done boolean. This distinction matters for algorithms that need to differentiate between true episode termination and artificial cutoffs.
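The loop below exercises the current Gymnasium API on the built-in CartPole-v1 task; the random policy is a stand-in for a learned one:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

episode_return = 0.0
while True:
    action = env.action_space.sample()  # placeholder for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        # terminated: the pole fell over (a true terminal state)
        # truncated: the 500-step time limit was hit (artificial cutoff)
        break

env.close()
print(episode_return)
```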
Gymnasium provides a modular wrapper system that allows users to modify an environment's behavior without changing its source code. Common wrapper types include:[8]

- ObservationWrapper: transforms observations before they reach the agent (e.g., resizing image frames).
- ActionWrapper: transforms the agent's actions before they reach the environment (e.g., rescaling continuous actions).
- RewardWrapper: transforms the reward signal (e.g., clipping rewards to a fixed range).
- General-purpose wrappers such as TimeLimit, which truncates overly long episodes, and RecordVideo, which saves rollouts to disk.
Wrappers can be stacked, so a single environment can have multiple transformations applied in sequence. This composability makes it easy to preprocess observations, add time limits, or record video without duplicating code.
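A short example of stacking Gymnasium wrappers (assuming a recent Gymnasium release; wrapper availability varies slightly across versions):

```python
import gymnasium as gym
from gymnasium.wrappers import (
    NormalizeObservation,
    RecordEpisodeStatistics,
    TimeLimit,
)

env = gym.make("CartPole-v1")
env = TimeLimit(env, max_episode_steps=200)  # truncate episodes at 200 steps
env = NormalizeObservation(env)              # running mean/std normalization
env = RecordEpisodeStatistics(env)           # episode return/length in `info`
```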
Creating a custom Gymnasium environment involves subclassing gymnasium.Env and implementing two core methods: reset() and step(). The developer must also define the observation_space and action_space attributes using Gymnasium's space classes (such as Discrete, Box, MultiBinary, or Dict). Once registered with the Gymnasium registry, custom environments can be instantiated with gymnasium.make() just like built-in ones.[8]
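A minimal sketch of a custom environment; the corridor task, class name, and registration id are invented for illustration:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CorridorEnv(gym.Env):
    """Hypothetical task: walk right along a 10-cell corridor to reach cell 9."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0.0, high=9.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = left, 1 = right
        self._pos = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._pos = 0
        return np.array([self._pos], dtype=np.float32), {}

    def step(self, action):
        self._pos = min(9, max(0, self._pos + (1 if action == 1 else -1)))
        terminated = self._pos == 9
        reward = 1.0 if terminated else 0.0
        return np.array([self._pos], dtype=np.float32), reward, terminated, False, {}

# After registration the environment behaves like a built-in one:
# gym.register(id="Corridor-v0", entry_point=CorridorEnv)
# env = gym.make("Corridor-v0")
```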
For multi-agent environments, the Farama Foundation maintains PettingZoo, which extends the Gymnasium philosophy to settings with multiple agents. PettingZoo uses an Agent Environment Cycle (AEC) API in which agents take turns acting, and also supports a parallel API for simultaneous-action games. PettingZoo includes environment families covering Atari multi-player games, classic board and card games, and cooperative tasks.[9]
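The canonical AEC interaction pattern looks like this (the tictactoe_v3 module name carries a version suffix that may differ across PettingZoo releases):

```python
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()
env.reset(seed=42)

# AEC API: agents act one at a time via agent_iter()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # PettingZoo expects None for finished agents
    else:
        mask = observation["action_mask"]             # legal moves only
        action = env.action_space(agent).sample(mask)
    env.step(action)

env.close()
```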
Simulation environments are critical for RL research because they provide safe, repeatable, and scalable settings for training and evaluation. The following table lists some of the most influential simulation suites.
| Environment Suite | Domain | Description |
|---|---|---|
| Classic Control | Physics simulations | Simple tasks like CartPole, MountainCar, and Acrobot; included in Gymnasium |
| Atari (ALE) | Video games | 57 Atari 2600 games; the benchmark that demonstrated the potential of deep RL via DQN |
| MuJoCo | Continuous control | Physics-engine-based locomotion and manipulation tasks (Ant, HalfCheetah, Humanoid); open-sourced by DeepMind in 2022 |
| StarCraft II (SMAC) | Strategy games | Cooperative multi-agent micromanagement scenarios; a standard benchmark for MARL |
| Gymnasium-Robotics | Robotic manipulation | Fetch and Shadow Dexterous Hand tasks built on MuJoCo |
| Procgen | Procedurally generated games | 16 game-like environments designed to test generalization across randomized levels |
| DM Control Suite | Continuous control | DeepMind's suite of continuous control tasks built on MuJoCo |
The Arcade Learning Environment (ALE), which wraps Stella (an Atari 2600 emulator), became famous when Mnih et al. (2015) used it to show that a single deep neural network could learn to play Atari games from raw pixels. MuJoCo, originally a commercial physics engine, was acquired by DeepMind in 2021 and open-sourced in 2022, making high-fidelity continuous control research freely accessible.[10]
Training RL agents in simulation offers advantages over real-world training: simulations are faster, cheaper, and safer. However, policies learned in simulation often fail when deployed on physical hardware because of the sim-to-real gap, the discrepancy between the simulator's physics and the real world.
Several techniques address this gap:

- Domain randomization: randomizing simulator parameters (masses, friction coefficients, textures, lighting) during training so the policy is robust to the variation it will encounter on real hardware (see the sketch after this list).
- System identification: calibrating the simulator's parameters against measurements of the physical system.
- Domain adaptation: aligning simulated and real observations or learned representations so the policy cannot tell them apart.
- Real-world fine-tuning: continuing training on the physical system after pre-training in simulation.
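A sketch of the domain randomization idea; parameter names and ranges are illustrative, not drawn from any specific simulator:

```python
import numpy as np

rng = np.random.default_rng()

def sample_physics_params():
    """Hypothetical domain randomization: draw simulator parameters from
    ranges wide enough to cover the real system's unknown true values."""
    return {
        "mass_kg": rng.uniform(0.8, 1.2),
        "friction": rng.uniform(0.5, 1.5),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

# At the start of each training episode, rebuild the simulator with freshly
# sampled parameters so the policy cannot overfit to a single setting.
params = sample_physics_params()
```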
Sim-to-real transfer remains an active area of research, with recent work exploring online human correction (TRANSIC), simulation-guided fine-tuning, and learned latent-space models to bridge the gap.
Imagine you are teaching a puppy to do tricks. You (the environment) set up the room, show the puppy what is around it (the state), let the puppy try something (the action), and then give it a treat or say "no" (the reward). The puppy does not control the room or the treats; it can only decide what trick to try next. Over many attempts, the puppy figures out which tricks earn the most treats. In reinforcement learning, the "environment" is everything outside the computer program that is learning: it is the room, the rules, and the person handing out treats.