In reinforcement learning (RL), an action is a decision or move made by an agent that affects the state of the environment. At each time step, the agent observes the current state and selects an action from a set of available options called the action space. The environment then transitions to a new state and returns a reward signal, which the agent uses to learn better behavior over time.[1] Actions are the sole mechanism through which an agent influences its environment, making them a foundational element of the reinforcement learning framework.
Formally, an action is one component of the Markov decision process (MDP) tuple (S, A, T, R, γ), where S is the set of states, A is the set of actions, T is the state transition function, R is the reward function, and γ is the discount factor.[2] At each time step t, the agent in state s_t selects an action a_t from the action space A, receives reward r_t, and transitions to a new state s_{t+1} according to the transition probability T(s_{t+1} | s_t, a_t).
Imagine you are playing a video game. Every time you press a button on the controller, your character does something: it might jump, run left, or pick up an item. Each button press is an "action." The game then changes because of what you did. If you made a good move, you get points (that is the reward). Over time, you learn which buttons to press in different situations to get the highest score. In reinforcement learning, the computer is the player, and it figures out which "buttons" to press by trying different actions and seeing what happens.
The action space defines the complete set of actions available to an agent. The structure of the action space has a direct impact on which algorithms can be applied and how the agent learns. Action spaces are broadly classified into three categories: discrete, continuous, and hybrid.[3]
A discrete action space contains a finite number of distinct actions. The agent selects one action from a fixed set at each time step. Board games, grid worlds, and classic Atari games are common examples of environments with discrete action spaces.
In chess, for instance, the action space at any given board position consists of all legal moves available to the current player. The agent picks exactly one move from this finite list. Algorithms such as Q-learning and Deep Q-Networks (DQN) are well suited to discrete action spaces because they can estimate a value for every possible action in a given state.[4]
In the Gymnasium (formerly OpenAI Gym) toolkit, discrete action spaces are represented by the Discrete(n) space, where n is the number of possible actions.[5]
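The following minimal Python sketch illustrates a discrete action space and the basic agent-environment loop, assuming the Gymnasium package and the CartPole-v1 environment (whose action space is Discrete(2)):

```python
import gymnasium as gym

# CartPole-v1 has a Discrete(2) action space: 0 = push cart left, 1 = push cart right.
env = gym.make("CartPole-v1")
print(env.action_space)  # Discrete(2)

obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()  # pick one of the n discrete actions
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

Here the random sampling stands in for a learned policy; an actual agent would replace `env.action_space.sample()` with its own action-selection rule.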
A continuous action space allows actions to take any real-valued number within a specified range. Instead of choosing from a list, the agent outputs one or more continuous parameters. Robotic control tasks are the classic example: a robotic arm might need to output a real-valued command for each joint, such as a target angle between 0 and 360 degrees or a torque between zero and some maximum value.
Self-driving cars also operate in continuous action spaces, where the steering angle, throttle, and braking force are all continuous values. Because there are infinitely many possible actions, value-based methods like DQN cannot enumerate them. Instead, policy gradient methods and actor-critic algorithms are used. Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO) are all designed to handle continuous action outputs.[6]
In Gymnasium, continuous action spaces are represented by the Box space, which defines lower and upper bounds for each dimension of the action vector.[5]
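A minimal sketch of a continuous action space, assuming Gymnasium and the Pendulum-v1 environment (whose single action is a torque bounded between -2 and +2); the custom two-joint space below is illustrative:

```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box

# Pendulum-v1 has a Box(-2.0, 2.0, (1,)) action space: a single continuous torque.
env = gym.make("Pendulum-v1")
print(env.action_space.low, env.action_space.high)   # [-2.] [2.]

# A custom Box space for a two-joint arm: one torque per joint, each in [-1, 1].
arm_actions = Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
action = arm_actions.sample()        # a random real-valued action vector
print(arm_actions.contains(action))  # True: the sample lies within the bounds
```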
Some environments require both discrete choices and continuous parameters. For example, in robot soccer, an agent might choose a discrete action like "kick" and then specify continuous parameters such as kick power and direction. This type of action space is called a hybrid or parameterized action space.
Hybrid action spaces present a challenge because most standard RL algorithms handle either discrete or continuous actions, but not both simultaneously. Approaches to this problem include hierarchical architectures where a higher-level network selects the discrete action and lower-level networks determine the continuous parameters.[7] The HyAR (Hybrid Action Representation) method is one approach that learns a unified latent representation for both discrete and continuous components.[8]
Some environments feature multiple independent discrete choices that must be made simultaneously. A game controller, for example, requires the agent to decide on several buttons at once. Gymnasium provides MultiDiscrete for representing the Cartesian product of multiple discrete spaces and MultiBinary for actions represented as binary vectors (such as pressing or not pressing each of several buttons).[5]
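These structured spaces can be sketched directly; the sizes below are illustrative, and the Dict layout for a parameterized action is one possible representation rather than a standard:

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete, MultiBinary, Dict, Discrete, Box

# Three independent discrete choices, e.g. a 5-way D-pad plus two 3-way switches.
controller = MultiDiscrete([5, 3, 3])
print(controller.sample())   # e.g. [2 0 1]

# Four buttons that can each be pressed (1) or not pressed (0) on the same time step.
buttons = MultiBinary(4)
print(buttons.sample())      # e.g. [1 0 0 1]

# One way to describe a parameterized "kick" action: a discrete choice plus
# continuous parameters (illustrative, not a standard hybrid-action interface).
kick = Dict({"choice": Discrete(2),   # 0 = move, 1 = kick
             "params": Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)})
print(kick.sample())
```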
| Action space type | Description | Example | Common algorithms |
|---|---|---|---|
| Discrete | Finite set of distinct actions | Board games, Atari games | DQN, Q-learning, SARSA |
| Continuous | Real-valued actions within a range | Robotic control, self-driving | DDPG, TD3, SAC, PPO |
| Hybrid (parameterized) | Discrete choice plus continuous parameters | Robot soccer, RTS games | HyAR, P-DQN, hierarchical actor-critic |
| Multi-discrete | Multiple independent discrete choices | Game controllers, multi-joint robots | PPO with multi-head output, A2C |
| Multi-binary | Binary vector of on/off decisions | Button-press combinations | PPO, A2C |
A policy is the function that determines which action an agent takes in a given state. Formally, a policy π maps states to actions (or to probability distributions over actions). The policy is the core of a reinforcement learning agent because it alone is sufficient to determine the agent's behavior.[1]
Policies can be deterministic, producing a single action for each state (a = π(s)), or stochastic, producing a probability distribution over actions (π(a|s)). Stochastic policies are useful because they naturally support exploration, allowing the agent to try different actions rather than always repeating the same one.
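The distinction can be made concrete with a small sketch; the states, actions, and probabilities below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["left", "right", "jump"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"on_ledge": "jump", "open_ground": "right"}
a = deterministic_policy["on_ledge"]   # always "jump"

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {"on_ledge": [0.1, 0.2, 0.7], "open_ground": [0.3, 0.6, 0.1]}
a = rng.choice(actions, p=stochastic_policy["on_ledge"])   # usually "jump", sometimes not
```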
| Property | Deterministic policy | Stochastic policy |
|---|---|---|
| Output | Single action a = π(s) | Probability distribution π(a|s) |
| Exploration | Requires external noise (e.g., Ornstein-Uhlenbeck) | Built-in through sampling |
| Common algorithms | DDPG, TD3 | PPO, SAC, REINFORCE |
| Typical use case | Continuous control with low noise | Environments requiring exploration |
The action-value function, commonly written as Q(s, a), estimates the expected cumulative reward an agent will receive by taking action a in state s and then following a particular policy thereafter. This function is central to many RL algorithms and provides the basis for action selection in value-based methods.[9]
The optimal action-value function Q* satisfies the Bellman optimality equation:
Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]
where r is the immediate reward, γ is the discount factor, s' is the next state, and the max is taken over all possible next actions a'. This recursive relationship allows algorithms like Q-learning to iteratively update their estimates of Q-values until they converge to the optimal action-value function Q*.[9]
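A minimal sketch of the resulting tabular Q-learning update; the table size, learning rate, and sample transition are illustrative:

```python
import numpy as np

n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99   # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    """One sampled Bellman backup toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target

q_learning_update(s=0, a=2, r=1.0, s_next=3)
```

Repeating this update over many sampled transitions drives the table toward Q* under the usual convergence conditions.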
The advantage function A(s, a) = Q(s, a) - V(s) measures how much better a particular action is compared to the average action in that state, where V(s) is the state-value function. The advantage function is used in algorithms like A2C (Advantage Actor-Critic) and PPO to reduce variance in policy gradient updates.[10]
A fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff. The agent must balance exploiting actions it already knows yield high rewards against exploring unfamiliar actions that might yield even higher rewards. Several strategies address this tradeoff.
The epsilon-greedy strategy selects a random action with probability ε and the action with the highest estimated value with probability 1 - ε. This is the simplest exploration method and is widely used with DQN and tabular Q-learning.[1] The value of ε is typically annealed (gradually reduced) over training so that the agent explores more at the beginning and exploits more as it learns.
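A minimal sketch of epsilon-greedy selection with linear annealing; the schedule values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit the best action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: highest estimated value

def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end over the first decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```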
Softmax exploration assigns each action a probability proportional to the exponential of its estimated value, scaled by a temperature parameter τ. At high temperatures, all actions are nearly equally likely (more exploration). At low temperatures, the highest-valued action dominates (more exploitation). Unlike epsilon-greedy, softmax exploration accounts for the relative differences in action values rather than treating all non-greedy actions equally.[1]
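A minimal sketch of Boltzmann (softmax) action selection with a temperature parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                         # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 2.0, 0.5]
print(boltzmann_action(q, temperature=5.0))   # high temperature: nearly uniform choice
print(boltzmann_action(q, temperature=0.1))   # low temperature: almost always action 1
```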
Upper confidence bound (UCB) methods select actions based on an optimistic estimate of their value, adding a bonus term that reflects how uncertain the agent is about each action. Actions that have been tried fewer times receive a larger bonus, encouraging the agent to try them. UCB1, introduced by Auer, Cesa-Bianchi, and Fischer (2002), is the most widely cited variant and provides theoretical regret bounds for the multi-armed bandit problem.[11]
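A minimal sketch of UCB1 action selection for a multi-armed bandit, following the bonus term of Auer et al. (2002); the reward estimates and counts are illustrative:

```python
import numpy as np

def ucb1_action(mean_rewards, counts, t, c=2.0):
    """Pick the arm maximizing mean + sqrt(c * ln(t) / n); untried arms are chosen first."""
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():
        return int(np.argmin(counts))            # try every arm at least once
    bonus = np.sqrt(c * np.log(t) / counts)      # larger bonus for less-tried arms
    return int(np.argmax(np.asarray(mean_rewards) + bonus))

print(ucb1_action(mean_rewards=[0.4, 0.6, 0.5], counts=[10, 50, 2], t=62))
```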
Algorithms like SAC add an entropy bonus to the reward, encouraging the policy to remain as random as possible while still achieving high returns. This prevents premature convergence to a suboptimal deterministic policy and leads to more robust behavior.[6]
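A minimal sketch of an entropy-regularized objective for a categorical policy; the entropy coefficient and the return estimate are illustrative, and SAC itself works with continuous Gaussian policies and a learned temperature:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete policy distribution."""
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

policy_probs = [0.7, 0.2, 0.1]   # pi(a|s) over three actions
expected_return = 1.5            # estimated return from this state (illustrative)
alpha = 0.2                      # entropy coefficient (temperature)

# Maximum-entropy objective: expected return plus a bonus for keeping the policy random.
soft_objective = expected_return + alpha * entropy(policy_probs)
```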
| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Epsilon-greedy | Random action with probability ε | Simple to implement | Treats all non-greedy actions equally |
| Softmax (Boltzmann) | Action probabilities proportional to Q-values | Accounts for relative action values | Temperature tuning required |
| UCB | Optimistic value estimate with uncertainty bonus | Theoretical guarantees, principled | Harder to apply in deep RL |
| Entropy regularization | Bonus reward for policy randomness | Prevents premature convergence | Adds a hyperparameter (entropy coefficient) |
| Curiosity-driven | Intrinsic reward for novel states | Effective in sparse-reward settings | Can be distracted by noise |
In many real-world and game environments, not all actions are valid in every state. A chess agent cannot move a piece onto a square occupied by one of its own pieces, and a robot cannot move through a wall. Action masking is a technique that prevents the agent from selecting invalid actions by forcing their probabilities to zero, typically by assigning negative infinity to the logits of invalid actions before the policy computes its probability distribution.[12]
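A minimal sketch of invalid action masking applied to a vector of logits; the mask values are illustrative:

```python
import numpy as np

def masked_action_probs(logits, valid_mask):
    """Set logits of invalid actions to -inf so their probability becomes exactly zero."""
    logits = np.where(valid_mask, logits, -np.inf)
    logits = logits - logits.max()     # numerical stability; assumes at least one valid action
    exp = np.exp(logits)
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0])
valid = np.array([True, False, True, False])   # actions 1 and 3 are illegal in this state
print(masked_action_probs(logits, valid))      # masked actions receive zero probability
```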
Action masking was used prominently in AlphaStar (DeepMind's StarCraft II agent) and OpenAI Five (OpenAI's Dota 2 agent). Research has shown that invalid action masking leads to faster training, lower variance, and better final performance compared to letting the agent learn to avoid invalid actions through negative rewards alone.[12]
Standard RL treats each action as a single-step primitive (e.g., move one cell, apply one torque value). The options framework, introduced by Sutton, Precup, and Singh (1999), extends this by defining "options" as temporally extended actions. An option is a sub-policy that, once initiated, runs for multiple time steps until a termination condition is met.[13]
Examples of options include high-level behaviors such as "navigate to the door," "pick up the object," or "turn left at the intersection." Each option encapsulates a sequence of primitive actions. This hierarchy allows agents to plan and learn at multiple time scales, reducing the effective horizon of the decision problem.
The options framework is formalized within Semi-Markov Decision Processes (SMDPs) and forms the basis of hierarchical reinforcement learning. The Option-Critic architecture (Bacon, Harb, and Precup, 2017) extended this work by allowing options to be learned end-to-end using policy gradient methods rather than being hand-designed.[14]
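A minimal sketch of an option as an (initiation set, sub-policy, termination condition) triple; the `env_step` callable and the structure of states are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation_set: Callable[[object], bool]   # states where the option may be started
    policy: Callable[[object], int]            # sub-policy over primitive actions
    termination: Callable[[object], bool]      # condition that ends the option

def run_option(env_step, state, option, max_steps=100):
    """Execute the option's sub-policy until its termination condition fires."""
    total_reward, steps = 0.0, 0
    while not option.termination(state) and steps < max_steps:
        action = option.policy(state)          # primitive action chosen by the sub-policy
        state, reward = env_step(state, action)
        total_reward += reward
        steps += 1
    return state, total_reward
```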
In multi-agent reinforcement learning (MARL), multiple agents act simultaneously in a shared environment. The joint action space is the Cartesian product of all individual agents' action spaces. If each of n agents has k possible actions, the joint action space has k^n elements, which grows exponentially. This combinatorial explosion is one of the primary challenges in MARL.[15]
Several approaches address this challenge, including training independent learners that each treat the other agents as part of the environment, factorizing the joint action-value function across agents (as in VDN and QMIX), and centralized training with decentralized execution, in which a centralized critic observes the joint action during training while each agent acts from its own policy at execution time.
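A minimal sketch of the joint action space as a Cartesian product, showing the exponential growth; the action names and agent count are illustrative:

```python
from itertools import product

single_agent_actions = ["up", "down", "left", "right", "stay"]   # k = 5 actions per agent
n_agents = 4

joint_actions = list(product(single_agent_actions, repeat=n_agents))
print(len(joint_actions))   # 5 ** 4 = 625 joint actions for only four agents
```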
Actions take different forms depending on the application domain.
| Application | Action space type | Example actions |
|---|---|---|
| Board games (chess, Go) | Discrete | Place a stone, move a piece |
| Atari games | Discrete | Move left, fire, no-op |
| Robotic manipulation | Continuous | Joint torques, gripper force |
| Autonomous driving | Continuous or hybrid | Steering angle, throttle, braking |
| Portfolio management | Continuous | Asset allocation percentages |
| Dialogue systems | Discrete | Select a response template, ask a clarification |
| Network routing | Discrete | Forward packet to a neighbor node |
| Recommender systems | Discrete | Select an item to recommend |
| Drug dosing | Continuous | Dosage amount for a treatment |
| Energy grid management | Hybrid | Turn generator on/off (discrete), set output level (continuous) |
AlphaGo, developed by DeepMind, demonstrated that RL agents could defeat world champions at Go by learning to select moves (actions) through a combination of Monte Carlo tree search and deep neural network evaluation.[16] In Atari game environments, DQN agents learn to map raw pixel observations directly to discrete joystick actions.[4]
Robotic control tasks require continuous actions such as joint torques, velocities, and gripper forces. Simulated environments like MuJoCo and Isaac Gym allow agents to learn these control policies through millions of trial-and-error interactions before transferring the learned policy to a physical robot (sim-to-real transfer).
Self-driving cars must continuously select acceleration, braking, and steering actions in dynamic traffic environments. RL-based approaches are explored both in industry, at companies such as Waymo, and in academic research to train driving policies that handle complex scenarios like merging, lane changes, and intersection navigation.
The design of the action space can significantly affect learning performance. Action space shaping refers to the practice of modifying the action space to make learning easier without changing the underlying task.[17]
Common techniques include removing unnecessary or redundant actions, discretizing continuous actions into a small set of fixed choices, and converting multi-discrete spaces into a single flat discrete space, as in the sketch below.[17]
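One of these techniques, discretizing a continuous action space, can be sketched with a Gymnasium ActionWrapper; the bin values and the choice of Pendulum-v1 are illustrative:

```python
import numpy as np
import gymnasium as gym

class DiscretizeTorque(gym.ActionWrapper):
    """Replace Pendulum-v1's continuous torque in [-2, 2] with a small set of fixed torques."""
    def __init__(self, env, bins=(-2.0, -1.0, 0.0, 1.0, 2.0)):
        super().__init__(env)
        self.bins = bins
        self.action_space = gym.spaces.Discrete(len(bins))

    def action(self, act):
        # Map the discrete choice back to the continuous action the wrapped env expects.
        return np.array([self.bins[act]], dtype=np.float32)

env = DiscretizeTorque(gym.make("Pendulum-v1"))
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```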
Different RL algorithms are designed for different action space types. The table below summarizes which algorithms support which action spaces.
| Algorithm | Discrete | Continuous | Hybrid | On/off-policy |
|---|---|---|---|---|
| Q-learning | Yes | No | No | Off-policy |
| DQN | Yes | No | No | Off-policy |
| REINFORCE | Yes | Yes | No | On-policy |
| A2C / A3C | Yes | Yes | No | On-policy |
| PPO | Yes | Yes | No | On-policy |
| DDPG | No | Yes | No | Off-policy |
| TD3 | No | Yes | No | Off-policy |
| SAC | Yes (discrete variant) | Yes | No | Off-policy |
| P-DQN | No | No | Yes | Off-policy |
| HyAR | No | No | Yes | Off-policy |