In reinforcement learning (RL), an environment is the external system that an agent interacts with. The environment receives the agent's actions, transitions to a new state, and returns an observation along with a numerical reward signal. Everything outside the agent's decision-making boundary is considered part of the environment. This formulation, popularized by Richard Sutton and Andrew Barto in Reinforcement Learning: An Introduction, provides the foundation for modeling sequential decision problems as Markov decision processes (MDPs).[1]
While the term "environment" also appears in broader machine learning contexts (referring to data sources, hardware, and experimental setups), this article focuses on the RL-specific meaning, where the environment defines the task an agent must learn to solve.
An RL environment is typically formalized as part of a Markov decision process. An MDP is described by a tuple (S, A, P, R, γ), where:[2]
| Symbol | Component | Description |
|---|---|---|
| S | State space | The set of all possible states the environment can occupy |
| A | Action space | The set of all valid actions the agent can take |
| P | Transition function | P(s' \| s, a): the probability of moving to state s' given state s and action a |
| R | Reward function | R(s, a, s'): the numerical signal returned after a transition |
| γ | Discount factor | A value in [0, 1] that determines how much future rewards are weighted |
At each discrete time step t, the agent observes a state s_t, selects an action a_t according to its policy π, and the environment responds with a new state s_{t+1} and a reward r_{t+1}. The agent's objective is to learn a policy that maximizes the expected cumulative (discounted) reward over time.[1]
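Written out, the quantity being maximized is the return, the discounted sum of future rewards; the notation below follows Sutton and Barto's standard formulation:[1]

```latex
% Return (cumulative discounted reward) from time step t
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Objective: a policy that maximizes the expected return
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \right]
```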
The state space S represents every possible configuration the environment can be in. States can be discrete (a finite set of board positions in chess) or continuous (joint angles and velocities of a robotic arm). In many practical applications, the agent does not see the full state directly but instead receives an observation that may be a subset or noisy version of the true state.[3]
The action space A defines what the agent is allowed to do at each step. Action spaces fall into two broad categories:[4]
| Type | Description | Examples |
|---|---|---|
| Discrete | A finite set of distinct actions | Moving left, right, up, or down in a grid world; selecting one of several Atari game buttons |
| Continuous | Actions are real-valued vectors | Torque applied to each joint of a robotic arm; throttle and steering angle for an autonomous vehicle |
Some environments have mixed action spaces containing both discrete and continuous components.
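In code, these categories map directly onto Gymnasium's space classes (covered later in this article); a minimal sketch using the Discrete and Box types:

```python
from gymnasium.spaces import Box, Discrete

# A discrete action space with four actions (e.g., up/down/left/right)
grid_actions = Discrete(4)

# A continuous action space: a 3-D torque vector, each component in [-1, 1]
arm_torques = Box(low=-1.0, high=1.0, shape=(3,))

print(grid_actions.sample())     # a random valid action, e.g., 2
print(arm_torques.sample())      # e.g., [ 0.41 -0.77  0.03]
print(grid_actions.contains(3))  # True: 3 is a valid action index
```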
The transition function P(s' | s, a) describes how the environment evolves in response to an action. In a deterministic environment, a given state-action pair always leads to the same next state. In a stochastic environment, the next state is sampled from a probability distribution, introducing uncertainty.[2] The agent typically does not have access to the transition function and must learn about the environment through trial and error, although model-based RL methods attempt to learn an approximate model of these dynamics.
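As a toy illustration (the corridor task and slip probability are invented for this sketch, not taken from the cited sources), the same state-action pair can yield different next states in a stochastic environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def slippery_step(state, action, slip_prob=0.2):
    """Hypothetical stochastic transition for a 1-D corridor: the intended
    move (action is +1 or -1) succeeds with probability 1 - slip_prob;
    otherwise the agent slips one cell in the opposite direction."""
    if rng.random() < slip_prob:
        return state - action  # slipped
    return state + action      # intended move

# Repeating the same (state, action) pair yields different next states:
print([slippery_step(5, +1) for _ in range(8)])  # mostly 6, occasionally 4
```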
The reward function provides the feedback that drives learning. Rewards can be:

- Dense or sparse: dense rewards arrive at every time step, while sparse rewards are given only at key events, such as winning a game.
- Positive or negative: positive rewards reinforce behavior, while negative rewards (penalties) discourage it.
- Immediate or delayed: when the consequences of an action show up in the reward only many steps later, the agent faces the credit assignment problem.
Designing a good reward function is one of the most challenging aspects of applied RL. Poorly designed rewards can lead to reward hacking, where the agent finds unintended shortcuts that maximize reward without achieving the desired behavior.[5]
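To make the sparse/dense distinction concrete, here is a sketch for a hypothetical 1-D navigation task (both functions are illustrative, not from the cited sources); dense shaping gives the agent feedback at every step, but a misspecified shaping term is a common source of reward hacking:

```python
def sparse_reward(state, goal):
    """Hypothetical sparse reward: feedback only when the goal is reached."""
    return 1.0 if state == goal else 0.0

def dense_reward(state, goal):
    """Hypothetical dense (shaped) reward: negative distance to the goal,
    so every step toward the goal is rewarded immediately."""
    return -abs(goal - state)
```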
RL environments can be classified along several dimensions. The following table summarizes the major distinctions.
| Dimension | Type A | Type B | Key Difference |
|---|---|---|---|
| Observability | Fully observable (MDP) | Partially observable (POMDP) | Whether the agent sees the complete state or only a partial observation |
| Determinism | Deterministic | Stochastic | Whether the same action in the same state always produces the same outcome |
| Number of agents | Single-agent | Multi-agent | Whether one or multiple decision-makers interact with the environment |
| Task horizon | Episodic | Continuing | Whether the interaction has a clear terminal state or runs indefinitely |
In a fully observable environment, the agent can see the entire state at every time step. This setting corresponds to a standard MDP. Chess (ignoring clock considerations) is fully observable because both players can see all pieces on the board.
In a partially observable environment, the agent receives only an incomplete or noisy observation of the true state. This setting is modeled as a Partially Observable Markov Decision Process (POMDP). Poker is a classic example: each player can see their own hand but not the opponents' cards. Autonomous driving is another example, because sensors provide limited range and can be occluded. In POMDPs, agents often rely on memory (such as recurrent neural networks or belief states) to infer the hidden state from the history of observations.[6]
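A lightweight alternative to recurrent networks is to stack a short history of recent observations and treat the window as the state; the buffer below is an illustrative sketch (class and parameter names are invented):

```python
import numpy as np

class HistoryBuffer:
    """Keep the last k observations so a memoryless policy can infer
    hidden quantities (e.g., velocity) from a window of positions."""

    def __init__(self, k, obs_dim):
        self.buffer = np.zeros((k, obs_dim))

    def push(self, obs):
        self.buffer = np.roll(self.buffer, shift=-1, axis=0)  # drop oldest
        self.buffer[-1] = obs                                 # append newest
        return self.buffer.flatten()  # feed the stacked window to the policy
```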
A deterministic environment produces the same next state every time a particular action is taken from a particular state. Board games like Go (without randomized elements) and simple grid worlds are deterministic.
A stochastic environment introduces randomness into its transitions. Backgammon, where dice rolls affect available moves, is a stochastic environment. Most real-world environments are stochastic to some degree because of sensor noise, unpredictable external factors, and other sources of uncertainty.[2]
A single-agent environment contains only one decision-maker interacting with the environment. Classic control tasks (balancing a cart-pole, navigating a maze) are single-agent problems.
A multi-agent environment has two or more agents that simultaneously or sequentially interact with the shared environment. Multi-agent settings introduce additional complexity because each agent's optimal strategy depends on the behavior of the other agents. This can involve cooperation, competition, or a mix of both. The StarCraft Multi-Agent Challenge (SMAC) is a widely used benchmark for cooperative multi-agent RL research.[7]
Episodic environments have a natural endpoint. Each episode starts from an initial state and ends when a terminal condition is met (winning a game, falling off a platform, or reaching a time limit). The agent's performance is typically measured by the total undiscounted reward per episode. Examples include playing a round of an Atari game or navigating a maze.
Continuing environments have no terminal state; the agent-environment interaction runs indefinitely. Stock trading systems, industrial process control, and server resource management are continuing tasks. In these settings, the discount factor γ becomes essential because an undiscounted sum of rewards over an infinite horizon can grow without bound.[1]
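The reason is a geometric-series argument: if every per-step reward is bounded in magnitude by R_max and γ < 1, the discounted return is guaranteed to be finite:

```latex
% Geometric-series bound on the discounted return
\left| \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \right|
  \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
```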
OpenAI Gym, released in 2016, established the first widely adopted API standard for RL environments. It defined a simple interface: reset() to initialize an episode and step(action) to advance the environment by one time step. After OpenAI stopped actively maintaining Gym in late 2020, the Farama Foundation took over and released Gymnasium as the official successor.[8]
The core Gymnasium API specifies that step() returns five values:
| Return Value | Type | Description |
|---|---|---|
| observation | Depends on environment | The agent's observation of the current state |
| reward | float | Numerical feedback for the action taken |
| terminated | bool | True if the episode ended due to reaching a terminal state |
| truncated | bool | True if the episode ended due to a time limit or boundary violation |
| info | dict | Auxiliary diagnostic information |
The separation of terminated and truncated (introduced in Gymnasium v0.26) was an important change from the original Gym API, which used a single done boolean. This distinction matters for algorithms that need to differentiate between true episode termination and artificial cutoffs.
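The loop below exercises the current Gymnasium API on the built-in CartPole-v1 task; the random policy is a stand-in for a learned one:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

episode_return = 0.0
while True:
    action = env.action_space.sample()  # placeholder for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        # terminated: the pole fell over (a true terminal state)
        # truncated: the 500-step time limit was hit (artificial cutoff)
        break

env.close()
print(episode_return)
```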
Gymnasium provides a modular wrapper system that allows users to modify an environment's behavior without changing its source code. Common wrapper types include:[8]

- ObservationWrapper: transforms observations before they reach the agent (e.g., resizing image frames).
- ActionWrapper: transforms the agent's actions before they reach the environment (e.g., rescaling continuous actions).
- RewardWrapper: transforms the reward signal (e.g., clipping rewards to a fixed range).
- General-purpose wrappers such as TimeLimit, which truncates overly long episodes, and RecordVideo, which saves rollouts to disk.
Wrappers can be stacked, so a single environment can have multiple transformations applied in sequence. This composability makes it easy to preprocess observations, add time limits, or record video without duplicating code.
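A short example of stacking Gymnasium wrappers (assuming a recent Gymnasium release; wrapper availability varies slightly across versions):

```python
import gymnasium as gym
from gymnasium.wrappers import (
    NormalizeObservation,
    RecordEpisodeStatistics,
    TimeLimit,
)

env = gym.make("CartPole-v1")
env = TimeLimit(env, max_episode_steps=200)  # truncate episodes at 200 steps
env = NormalizeObservation(env)              # running mean/std normalization
env = RecordEpisodeStatistics(env)           # episode return/length in `info`
```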
Creating a custom Gymnasium environment involves subclassing gymnasium.Env and implementing two core methods: reset() and step(). The developer must also define the observation_space and action_space attributes using Gymnasium's space classes (such as Discrete, Box, MultiBinary, or Dict). Once registered with the Gymnasium registry, custom environments can be instantiated with gymnasium.make() just like built-in ones.[8]
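A minimal sketch of a custom environment; the corridor task, class name, and registration id are invented for illustration:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CorridorEnv(gym.Env):
    """Hypothetical task: walk right along a 10-cell corridor to reach cell 9."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0.0, high=9.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = left, 1 = right
        self._pos = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._pos = 0
        return np.array([self._pos], dtype=np.float32), {}

    def step(self, action):
        self._pos = min(9, max(0, self._pos + (1 if action == 1 else -1)))
        terminated = self._pos == 9
        reward = 1.0 if terminated else 0.0
        return np.array([self._pos], dtype=np.float32), reward, terminated, False, {}

# After registration the environment behaves like a built-in one:
# gym.register(id="Corridor-v0", entry_point=CorridorEnv)
# env = gym.make("Corridor-v0")
```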
For multi-agent environments, the Farama Foundation maintains PettingZoo, which extends the Gymnasium philosophy to settings with multiple agents. PettingZoo uses an Agent Environment Cycle (AEC) API in which agents take turns acting, and also supports a parallel API for simultaneous-action games. PettingZoo includes environment families covering Atari multi-player games, classic board and card games, and cooperative tasks.[9]
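The canonical AEC interaction pattern looks like this (the tictactoe_v3 module name carries a version suffix that may differ across PettingZoo releases):

```python
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()
env.reset(seed=42)

# AEC API: agents act one at a time via agent_iter()
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # PettingZoo expects None for finished agents
    else:
        mask = observation["action_mask"]             # legal moves only
        action = env.action_space(agent).sample(mask)
    env.step(action)

env.close()
```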
Simulation environments are critical for RL research because they provide safe, repeatable, and scalable settings for training and evaluation. The following table lists some of the most influential simulation suites.
| Environment Suite | Domain | Description |
|---|---|---|
| Classic Control | Physics simulations | Simple tasks like CartPole, MountainCar, and Acrobot; included in Gymnasium |
| Atari (ALE) | Video games | 57 Atari 2600 games; the benchmark that demonstrated the potential of deep RL via DQN |
| MuJoCo | Continuous control | Physics-engine-based locomotion and manipulation tasks (Ant, HalfCheetah, Humanoid); open-sourced by DeepMind in 2022 |
| StarCraft II (SMAC) | Strategy games | Cooperative multi-agent micromanagement scenarios; a standard benchmark for MARL |
| Gymnasium-Robotics | Robotic manipulation | Fetch and Shadow Dexterous Hand tasks built on MuJoCo |
| Procgen | Procedurally generated games | 16 game-like environments designed to test generalization across randomized levels |
| DM Control Suite | Continuous control | DeepMind's suite of continuous control tasks built on MuJoCo |
The Arcade Learning Environment (ALE), which wraps Stella (an Atari 2600 emulator), became famous when Mnih et al. (2015) used it to show that a single deep neural network could learn to play Atari games from raw pixels. MuJoCo, originally a commercial physics engine, was acquired by DeepMind in 2021 and open-sourced in 2022, making high-fidelity continuous control research freely accessible.[10]
Training RL agents in simulation offers advantages over real-world training: simulations are faster, cheaper, and safer. However, policies learned in simulation often fail when deployed on physical hardware because of the sim-to-real gap, the discrepancy between the simulator's physics and the real world.
Several techniques address this gap:

- Domain randomization: randomizing simulator parameters (masses, friction coefficients, textures, lighting) during training so the policy is robust to the variation it will encounter on real hardware (see the sketch after this list).
- System identification: calibrating the simulator's parameters against measurements of the physical system.
- Domain adaptation: aligning simulated and real observations or learned representations so the policy cannot tell them apart.
- Real-world fine-tuning: continuing training on the physical system after pre-training in simulation.
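A sketch of the domain randomization idea; parameter names and ranges are illustrative, not drawn from any specific simulator:

```python
import numpy as np

rng = np.random.default_rng()

def sample_physics_params():
    """Hypothetical domain randomization: draw simulator parameters from
    ranges wide enough to cover the real system's unknown true values."""
    return {
        "mass_kg": rng.uniform(0.8, 1.2),
        "friction": rng.uniform(0.5, 1.5),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

# At the start of each training episode, rebuild the simulator with freshly
# sampled parameters so the policy cannot overfit to a single setting.
params = sample_physics_params()
```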
Sim-to-real transfer remains an active area of research, with recent work exploring online human correction (TRANSIC), simulation-guided fine-tuning, and learned latent-space models to bridge the gap.
Imagine you are teaching a puppy to do tricks. You (the environment) set up the room, show the puppy what is around it (the state), let the puppy try something (the action), and then give it a treat or say "no" (the reward). The puppy does not control the room or the treats; it can only decide what trick to try next. Over many attempts, the puppy figures out which tricks earn the most treats. In reinforcement learning, the "environment" is everything outside the computer program that is learning: it is the room, the rules, and the person handing out treats.