In reinforcement learning (RL), an action is a decision or move made by an agent that affects the state of the environment. At each time step, the agent observes the current state and selects an action from a set of available options called the action space. The environment then transitions to a new state and returns a reward signal, which the agent uses to learn better behavior over time.[1] Actions are the sole mechanism through which an agent influences its environment, making them a foundational element of the reinforcement learning framework.
Formally, an action is one component of the Markov decision process (MDP) tuple (S, A, T, R, γ), where S is the set of states, A is the set of actions, T is the state transition function, R is the reward function, and γ is the discount factor.[2] At each time step t, the agent in state s_t selects an action a_t from the action space A, receives reward r_t, and transitions to a new state s_{t+1} according to the transition probability T(s_{t+1} | s_t, a_t).
Imagine you are playing a video game. Every time you press a button on the controller, your character does something: it might jump, run left, or pick up an item. Each button press is an "action." The game then changes because of what you did. If you made a good move, you get points (that is the reward). Over time, you learn which buttons to press in different situations to get the highest score. In reinforcement learning, the computer is the player, and it figures out which "buttons" to press by trying different actions and seeing what happens.
The action space defines the complete set of actions available to an agent. The structure of the action space has a direct impact on which algorithms can be applied and how the agent learns. Action spaces are broadly classified into three categories: discrete, continuous, and hybrid.[3]
A discrete action space contains a finite number of distinct actions. The agent selects one action from a fixed set at each time step. Board games, grid worlds, and classic Atari games are common examples of environments with discrete action spaces.
In chess, for instance, the action space at any given board position consists of all legal moves available to the current player. The agent picks exactly one move from this finite list. Algorithms such as Q-learning and Deep Q-Networks (DQN) are well suited to discrete action spaces because they can estimate a value for every possible action in a given state.[4]
In the Gymnasium (formerly OpenAI Gym) toolkit, discrete action spaces are represented by the Discrete(n) space, where n is the number of possible actions.[5]
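The following minimal Python sketch illustrates a discrete action space and the basic agent-environment loop, assuming the Gymnasium package and the CartPole-v1 environment (whose action space is Discrete(2)):

```python
import gymnasium as gym

# CartPole-v1 has a Discrete(2) action space: 0 = push cart left, 1 = push cart right.
env = gym.make("CartPole-v1")
print(env.action_space)  # Discrete(2)

obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()  # pick one of the n discrete actions
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

Here the random sampling stands in for a learned policy; an actual agent would replace `env.action_space.sample()` with its own action-selection rule.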
A continuous action space allows actions to take any real-valued number within a specified range. Instead of choosing from a list, the agent outputs one or more continuous parameters. Robotic control tasks are the classic example: a robotic arm might need to output a real-valued command for each joint, such as a target angle between 0 and 360 degrees or a torque between zero and some maximum value.
Self-driving cars also operate in continuous action spaces, where the steering angle, throttle, and braking force are all continuous values. Because there are infinitely many possible actions, value-based methods like DQN cannot enumerate them. Instead, policy gradient methods and actor-critic algorithms are used. Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO) are all designed to handle continuous action outputs.[6]
In Gymnasium, continuous action spaces are represented by the Box space, which defines lower and upper bounds for each dimension of the action vector.[5]
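A minimal sketch of a continuous action space, assuming Gymnasium and the Pendulum-v1 environment (whose single action is a torque bounded between -2 and +2); the custom two-joint space below is illustrative:

```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box

# Pendulum-v1 has a Box(-2.0, 2.0, (1,)) action space: a single continuous torque.
env = gym.make("Pendulum-v1")
print(env.action_space.low, env.action_space.high)   # [-2.] [2.]

# A custom Box space for a two-joint arm: one torque per joint, each in [-1, 1].
arm_actions = Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
action = arm_actions.sample()        # a random real-valued action vector
print(arm_actions.contains(action))  # True: the sample lies within the bounds
```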
Some environments require both discrete choices and continuous parameters. For example, in robot soccer, an agent might choose a discrete action like "kick" and then specify continuous parameters such as kick power and direction. This type of action space is called a hybrid or parameterized action space.
Hybrid action spaces present a challenge because most standard RL algorithms handle either discrete or continuous actions, but not both simultaneously. Approaches to this problem include hierarchical architectures where a higher-level network selects the discrete action and lower-level networks determine the continuous parameters.[7] The HyAR (Hybrid Action Representation) method is one approach that learns a unified latent representation for both discrete and continuous components.[8]
Some environments feature multiple independent discrete choices that must be made simultaneously. A game controller, for example, requires the agent to decide on several buttons at once. Gymnasium provides MultiDiscrete for representing the Cartesian product of multiple discrete spaces and MultiBinary for actions represented as binary vectors (such as pressing or not pressing each of several buttons).[5]
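These structured spaces can be sketched directly; the sizes below are illustrative, and the Dict layout for a parameterized action is one possible representation rather than a standard:

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete, MultiBinary, Dict, Discrete, Box

# Three independent discrete choices, e.g. a 5-way D-pad plus two 3-way switches.
controller = MultiDiscrete([5, 3, 3])
print(controller.sample())   # e.g. [2 0 1]

# Four buttons that can each be pressed (1) or not pressed (0) on the same time step.
buttons = MultiBinary(4)
print(buttons.sample())      # e.g. [1 0 0 1]

# One way to describe a parameterized "kick" action: a discrete choice plus
# continuous parameters (illustrative, not a standard hybrid-action interface).
kick = Dict({"choice": Discrete(2),   # 0 = move, 1 = kick
             "params": Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)})
print(kick.sample())
```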
| Action space type | Description | Example | Common algorithms |
|---|---|---|---|
| Discrete | Finite set of distinct actions | Board games, Atari games | DQN, Q-learning, SARSA |
| Continuous | Real-valued actions within a range | Robotic control, self-driving | DDPG, TD3, SAC, PPO |
| Hybrid (parameterized) | Discrete choice plus continuous parameters | Robot soccer, RTS games | HyAR, P-DQN, hierarchical actor-critic |
| Multi-discrete | Multiple independent discrete choices | Game controllers, multi-joint robots | PPO with multi-head output, A2C |
| Multi-binary | Binary vector of on/off decisions | Button-press combinations | PPO, A2C |
A policy is the function that determines which action an agent takes in a given state. Formally, a policy π maps states to actions (or to probability distributions over actions). The policy is the core of a reinforcement learning agent because it alone is sufficient to determine the agent's behavior.[1]
Policies can be deterministic, producing a single action for each state (a = π(s)), or stochastic, producing a probability distribution over actions (π(a|s)). Stochastic policies are useful because they naturally support exploration, allowing the agent to try different actions rather than always repeating the same one.
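The distinction can be made concrete with a small sketch; the states, actions, and probabilities below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["left", "right", "jump"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"on_ledge": "jump", "open_ground": "right"}
a = deterministic_policy["on_ledge"]   # always "jump"

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {"on_ledge": [0.1, 0.2, 0.7], "open_ground": [0.3, 0.6, 0.1]}
a = rng.choice(actions, p=stochastic_policy["on_ledge"])   # usually "jump", sometimes not
```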
| Property | Deterministic policy | Stochastic policy |
|---|---|---|
| Output | Single action a = π(s) | Probability distribution π(a|s) |
| Exploration | Requires external noise (e.g., Ornstein-Uhlenbeck) | Built-in through sampling |
| Common algorithms | DDPG, TD3 | PPO, SAC, REINFORCE |
| Typical use case | Continuous control with low noise | Environments requiring exploration |
The action-value function, commonly written as Q(s, a), estimates the expected cumulative reward an agent will receive by taking action a in state s and then following a particular policy thereafter. This function is central to many RL algorithms and provides the basis for action selection in value-based methods.[9]
The optimal action-value function Q* satisfies the Bellman optimality equation:
Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]
where r is the immediate reward, γ is the discount factor, s' is the next state, and the max is taken over all possible next actions a'. This recursive relationship allows algorithms like Q-learning to iteratively update their estimates of Q-values until they converge to the optimal action-value function Q*.[9]
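A minimal sketch of the resulting tabular Q-learning update; the table size, learning rate, and sample transition are illustrative:

```python
import numpy as np

n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99   # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    """One sampled Bellman backup toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target

q_learning_update(s=0, a=2, r=1.0, s_next=3)
```

Repeating this update over many sampled transitions drives the table toward Q* under the usual convergence conditions.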
The advantage function A(s, a) = Q(s, a) - V(s) measures how much better a particular action is compared to the average action in that state, where V(s) is the state-value function. The advantage function is used in algorithms like A2C (Advantage Actor-Critic) and PPO to reduce variance in policy gradient updates.[10]
A fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff. The agent must balance exploiting actions it already knows yield high rewards against exploring unfamiliar actions that might yield even higher rewards. Several strategies address this tradeoff.
The epsilon-greedy strategy selects a random action with probability ε and the action with the highest estimated value with probability 1 - ε. This is the simplest exploration method and is widely used with DQN and tabular Q-learning.[1] The value of ε is typically annealed (gradually reduced) over training so that the agent explores more at the beginning and exploits more as it learns.
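A minimal sketch of epsilon-greedy selection with linear annealing; the schedule values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit the best action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: highest estimated value

def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end over the first decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```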
Softmax exploration assigns each action a probability proportional to the exponential of its estimated value, scaled by a temperature parameter τ. At high temperatures, all actions are nearly equally likely (more exploration). At low temperatures, the highest-valued action dominates (more exploitation). Unlike epsilon-greedy, softmax exploration accounts for the relative differences in action values rather than treating all non-greedy actions equally.[1]
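A minimal sketch of Boltzmann (softmax) action selection with a temperature parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                         # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 2.0, 0.5]
print(boltzmann_action(q, temperature=5.0))   # high temperature: nearly uniform choice
print(boltzmann_action(q, temperature=0.1))   # low temperature: almost always action 1
```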
Upper confidence bound (UCB) methods select actions based on an optimistic estimate of their value, adding a bonus term that reflects how uncertain the agent is about each action. Actions that have been tried fewer times receive a larger bonus, encouraging the agent to try them. UCB1, introduced by Auer, Cesa-Bianchi, and Fischer (2002), is the most widely cited variant and provides theoretical regret bounds for the multi-armed bandit problem.[11]
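A minimal sketch of UCB1 action selection for a multi-armed bandit, following the bonus term of Auer et al. (2002); the reward estimates and counts are illustrative:

```python
import numpy as np

def ucb1_action(mean_rewards, counts, t, c=2.0):
    """Pick the arm maximizing mean + sqrt(c * ln(t) / n); untried arms are chosen first."""
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():
        return int(np.argmin(counts))            # try every arm at least once
    bonus = np.sqrt(c * np.log(t) / counts)      # larger bonus for less-tried arms
    return int(np.argmax(np.asarray(mean_rewards) + bonus))

print(ucb1_action(mean_rewards=[0.4, 0.6, 0.5], counts=[10, 50, 2], t=62))
```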
Algorithms like SAC add an entropy bonus to the reward, encouraging the policy to remain as random as possible while still achieving high returns. This prevents premature convergence to a suboptimal deterministic policy and leads to more robust behavior.[6]
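A minimal sketch of an entropy-regularized objective for a categorical policy; the entropy coefficient and the return estimate are illustrative, and SAC itself works with continuous Gaussian policies and a learned temperature:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete policy distribution."""
    probs = np.asarray(probs)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

policy_probs = [0.7, 0.2, 0.1]   # pi(a|s) over three actions
expected_return = 1.5            # estimated return from this state (illustrative)
alpha = 0.2                      # entropy coefficient (temperature)

# Maximum-entropy objective: expected return plus a bonus for keeping the policy random.
soft_objective = expected_return + alpha * entropy(policy_probs)
```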
| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Epsilon-greedy | Random action with probability ε | Simple to implement | Treats all non-greedy actions equally |
| Softmax (Boltzmann) | Action probabilities proportional to Q-values | Accounts for relative action values | Temperature tuning required |
| UCB | Optimistic value estimate with uncertainty bonus | Theoretical guarantees, principled | Harder to apply in deep RL |
| Entropy regularization | Bonus reward for policy randomness | Prevents premature convergence | Adds a hyperparameter (entropy coefficient) |
| Curiosity-driven | Intrinsic reward for novel states | Effective in sparse-reward settings | Can be distracted by noise |
In many real-world and game environments, not all actions are valid in every state. A chess agent cannot move a piece onto a square occupied by one of its own pieces, and a robot cannot move through a wall. Action masking is a technique that prevents the agent from selecting invalid actions by forcing their probabilities to zero, typically by assigning negative infinity to the logits of invalid actions before the policy computes its probability distribution.[12]
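A minimal sketch of invalid action masking applied to a vector of logits; the mask values are illustrative:

```python
import numpy as np

def masked_action_probs(logits, valid_mask):
    """Set logits of invalid actions to -inf so their probability becomes exactly zero."""
    logits = np.where(valid_mask, logits, -np.inf)
    logits = logits - logits.max()     # numerical stability; assumes at least one valid action
    exp = np.exp(logits)
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0])
valid = np.array([True, False, True, False])   # actions 1 and 3 are illegal in this state
print(masked_action_probs(logits, valid))      # masked actions receive zero probability
```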
Action masking was used prominently in AlphaStar (DeepMind's StarCraft II agent) and OpenAI Five (OpenAI's Dota 2 agent). Research has shown that invalid action masking leads to faster training, lower variance, and better final performance compared to letting the agent learn to avoid invalid actions through negative rewards alone.[12]
Standard RL treats each action as a single-step primitive (e.g., move one cell, apply one torque value). The options framework, introduced by Sutton, Precup, and Singh (1999), extends this by defining "options" as temporally extended actions. An option is a sub-policy that, once initiated, runs for multiple time steps until a termination condition is met.[13]
Examples of options include high-level behaviors such as "navigate to the door," "pick up the object," or "turn left at the intersection." Each option encapsulates a sequence of primitive actions. This hierarchy allows agents to plan and learn at multiple time scales, reducing the effective horizon of the decision problem.
The options framework is formalized within Semi-Markov Decision Processes (SMDPs) and forms the basis of hierarchical reinforcement learning. The Option-Critic architecture (Bacon, Harb, and Precup, 2017) extended this work by allowing options to be learned end-to-end using policy gradient methods rather than being hand-designed.[14]
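A minimal sketch of an option as an (initiation set, sub-policy, termination condition) triple; the `env_step` callable and the structure of states are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation_set: Callable[[object], bool]   # states where the option may be started
    policy: Callable[[object], int]            # sub-policy over primitive actions
    termination: Callable[[object], bool]      # condition that ends the option

def run_option(env_step, state, option, max_steps=100):
    """Execute the option's sub-policy until its termination condition fires."""
    total_reward, steps = 0.0, 0
    while not option.termination(state) and steps < max_steps:
        action = option.policy(state)          # primitive action chosen by the sub-policy
        state, reward = env_step(state, action)
        total_reward += reward
        steps += 1
    return state, total_reward
```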
In multi-agent reinforcement learning (MARL), multiple agents act simultaneously in a shared environment. The joint action space is the Cartesian product of all individual agents' action spaces. If each of n agents has k possible actions, the joint action space has k^n elements, which grows exponentially. This combinatorial explosion is one of the primary challenges in MARL.[15]
Several approaches address this challenge, including training independent learners that each treat the other agents as part of the environment, factorizing the joint action-value function across agents (as in VDN and QMIX), and centralized training with decentralized execution, in which a centralized critic observes the joint action during training while each agent acts from its own policy at execution time.
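A minimal sketch of the joint action space as a Cartesian product, showing the exponential growth; the action names and agent count are illustrative:

```python
from itertools import product

single_agent_actions = ["up", "down", "left", "right", "stay"]   # k = 5 actions per agent
n_agents = 4

joint_actions = list(product(single_agent_actions, repeat=n_agents))
print(len(joint_actions))   # 5 ** 4 = 625 joint actions for only four agents
```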
Actions take different forms depending on the application domain.
| Application | Action space type | Example actions |
|---|---|---|
| Board games (chess, Go) | Discrete | Place a stone, move a piece |
| Atari games | Discrete | Move left, fire, no-op |
| Robotic manipulation | Continuous | Joint torques, gripper force |
| Autonomous driving | Continuous or hybrid | Steering angle, throttle, braking |
| Portfolio management | Continuous | Asset allocation percentages |
| Dialogue systems | Discrete | Select a response template, ask a clarification |
| Network routing | Discrete | Forward packet to a neighbor node |
| Recommender systems | Discrete | Select an item to recommend |
| Drug dosing | Continuous | Dosage amount for a treatment |
| Energy grid management | Hybrid | Turn generator on/off (discrete), set output level (continuous) |
AlphaGo, developed by DeepMind, demonstrated that RL agents could defeat world champions at Go by learning to select moves (actions) through a combination of Monte Carlo tree search and deep neural network evaluation.[16] In Atari game environments, DQN agents learn to map raw pixel observations directly to discrete joystick actions.[4]
Robotic control tasks require continuous actions such as joint torques, velocities, and gripper forces. Simulated environments like MuJoCo and Isaac Gym allow agents to learn these control policies through millions of trial-and-error interactions before transferring the learned policy to a physical robot (sim-to-real transfer).
Self-driving cars must continuously select acceleration, braking, and steering actions in dynamic traffic environments. RL-based approaches are explored both in industry, at companies such as Waymo, and in academic research to train driving policies that handle complex scenarios like merging, lane changes, and intersection navigation.
The design of the action space can significantly affect learning performance. Action space shaping refers to the practice of modifying the action space to make learning easier without changing the underlying task.[17]
Common techniques include removing unnecessary or redundant actions, discretizing continuous actions into a small set of fixed choices, and converting multi-discrete spaces into a single flat discrete space, as in the sketch below.[17]
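One of these techniques, discretizing a continuous action space, can be sketched with a Gymnasium ActionWrapper; the bin values and the choice of Pendulum-v1 are illustrative:

```python
import numpy as np
import gymnasium as gym

class DiscretizeTorque(gym.ActionWrapper):
    """Replace Pendulum-v1's continuous torque in [-2, 2] with a small set of fixed torques."""
    def __init__(self, env, bins=(-2.0, -1.0, 0.0, 1.0, 2.0)):
        super().__init__(env)
        self.bins = bins
        self.action_space = gym.spaces.Discrete(len(bins))

    def action(self, act):
        # Map the discrete choice back to the continuous action the wrapped env expects.
        return np.array([self.bins[act]], dtype=np.float32)

env = DiscretizeTorque(gym.make("Pendulum-v1"))
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```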
Different RL algorithms are designed for different action space types. The table below summarizes which algorithms support which action spaces.
| Algorithm | Discrete | Continuous | Hybrid | On/off-policy |
|---|---|---|---|---|
| Q-learning | Yes | No | No | Off-policy |
| DQN | Yes | No | No | Off-policy |
| REINFORCE | Yes | Yes | No | On-policy |
| A2C / A3C | Yes | Yes | No | On-policy |
| PPO | Yes | Yes | No | On-policy |
| DDPG | No | Yes | No | Off-policy |
| TD3 | No | Yes | No | Off-policy |
| SAC | Yes (discrete variant) | Yes | No | Off-policy |
| P-DQN | No | No | Yes | Off-policy |
| HyAR | No | No | Yes | Off-policy |