# Action (Reinforcement Learning)

> Source: https://aiwiki.ai/wiki/action
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [reinforcement learning](/wiki/reinforcement_learning) (RL), an **action** is a decision or move made by an [agent](/wiki/agent) that affects the state of the [environment](/wiki/environment). At each discrete time step, the agent observes the current state and selects one action from the set of available options, called the **action space**; the environment then transitions to a new state and returns a [reward](/wiki/reward) signal that the agent uses to learn better behavior over time.[1] Actions are the sole mechanism through which an agent influences its environment, making them a foundational element of the reinforcement learning framework. A [policy](/wiki/policy) is the mapping from states to actions; as Sutton and Barto put it, the policy "alone is sufficient to determine behavior."[1]

Formally, an action is one component of the [Markov decision process](/wiki/markov_decision_process_mdp) (MDP) tuple $$(S, A, T, R, \gamma)$$, where S is the set of states, A is the set of actions, T is the state transition function, R is the reward function, and $$\gamma$$ is the discount factor.[2] At each time step t, the agent in state $$s_t$$ selects an action $$a_t$$ from the action space A, receives reward $$r_t$$, and transitions to a new state $$s_{t+1}$$ according to the transition probability $$T(s_{t+1} \mid s_t, a_t)$$.

## Explain like I'm 5 (ELI5)

Imagine you are playing a video game. Every time you press a button on the controller, your character does something: it might jump, run left, or pick up an item. Each button press is an "action." The game then changes because of what you did. If you made a good move, you get points (that is the reward). Over time, you learn which buttons to press in different situations to get the highest score. In reinforcement learning, the computer is the player, and it figures out which "buttons" to press by trying different actions and seeing what happens.

## How does an action fit in the agent-environment loop?

Reinforcement learning is built around the agent-environment interface, in which the agent and environment interact at discrete time steps. At each step t, the agent receives the environment's current state (or observation) and a reward signal, then chooses an action $$a_t$$ according to its current [policy](/wiki/policy). That action causes the environment to transition to a new state at step t+1 and to emit a new reward, and the cycle repeats.[1] The action is therefore the single output the agent controls: states and rewards are produced by the environment, while the action is the agent's only lever on what happens next.
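
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: the `Corridor` environment, its `reset`/`step` methods, and `random_policy` are hypothetical names invented for this example.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; position is clipped.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # Ignores the state and picks left (0) or right (1) uniformly at random.
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state, rng)             # the agent's only output
        state, reward, done = env.step(action)  # the environment's response
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

Calling `run_episode(Corridor(), random_policy, random.Random(0))` runs one episode; swapping in a different policy function changes the behavior without touching the environment.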

Because the action is the agent's only point of influence, the entire learning problem reduces to choosing actions that maximize the expected sum of future (discounted) rewards. Sutton and Barto frame reinforcement learning as "learning what to do, how to map situations to actions, so as to maximize a numerical reward signal," which makes action selection the central problem an RL algorithm must solve.[1]
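
The "sum of future (discounted) rewards" can be made concrete with a short helper. This is a sketch under the usual convention that a reward received k steps in the future is weighted by $$\gamma^k$$; the function name `discounted_return` is chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a list of rewards."""
    g = 0.0
    # Work backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma` close to 1 the agent values distant rewards almost as much as immediate ones; with `gamma` close to 0 it is nearly myopic.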

## What is an action space?

The action space defines the complete set of actions available to an agent. The structure of the action space has a direct impact on which algorithms can be applied and how the agent learns. Action spaces are broadly classified into three categories: discrete, continuous, and hybrid.[3]

### What is a discrete action space?

A discrete action space contains a finite number of distinct actions. The agent selects one action from a fixed set at each time step. Board games, grid worlds, and classic Atari games are common examples of environments with discrete action spaces.

In chess, for instance, the action space at any given board position consists of all legal moves available to the current player. The agent picks exactly one move from this finite list. Algorithms such as [Q-learning](/wiki/q-learning) and [Deep Q-Networks](/wiki/deep_q-network_dqn) (DQN) are well suited to discrete action spaces because they can estimate a value for every possible action in a given state.[4]

In the Gymnasium (formerly OpenAI Gym) toolkit, discrete action spaces are represented by the `Discrete(n)` space, where n is the number of possible actions.[5]

### What is a continuous action space?

A continuous action space allows actions to take any real value within a specified range. Instead of choosing from a list, the agent outputs one or more continuous parameters. Robotic control tasks are the classic example: a robotic arm controller might output real-valued commands such as target joint angles (for example, between 0 and 360 degrees) or joint torques bounded in magnitude by some maximum.

Self-driving cars also operate in continuous action spaces, where the steering angle, throttle, and braking force are all continuous values. Because there are infinitely many possible actions, value-based methods like DQN cannot enumerate them. Instead, policy gradient methods and actor-critic algorithms are used. [Deep Deterministic Policy Gradient](/wiki/ddpg) (DDPG), [Twin Delayed DDPG](/wiki/td3) (TD3), [Soft Actor-Critic](/wiki/soft_actor_critic) (SAC), and [Proximal Policy Optimization](/wiki/reinforcement_learning) (PPO) are all designed to handle continuous action outputs.[6]

In Gymnasium, continuous action spaces are represented by the `Box` space, which defines lower and upper bounds for each dimension of the action vector.[5]

### What is a hybrid (parameterized) action space?

Some environments require both discrete choices and continuous parameters. For example, in robot soccer, an agent might choose a discrete action like "kick" and then specify continuous parameters such as kick power and direction. This type of action space is called a hybrid or parameterized action space.

Hybrid action spaces present a challenge because most standard RL algorithms handle either discrete or continuous actions, but not both simultaneously. Approaches to this problem include hierarchical architectures where a higher-level network selects the discrete action and lower-level networks determine the continuous parameters.[7] The HyAR (Hybrid Action Representation) method is one approach that learns a unified latent representation for both discrete and continuous components.[8]
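
One way to picture a parameterized action is as a discrete label paired with a tuple of continuous parameters. The sketch below is illustrative only: the `HybridAction` class, the `PARAM_BOUNDS` table, and the "dash" and "turn" action types with their ranges are assumptions invented for this example, not taken from any robot soccer simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class HybridAction:
    kind: str      # discrete component, e.g. "kick"
    params: tuple  # continuous parameters belonging to that discrete choice


# Hypothetical (low, high) bounds for each parameter of each discrete action.
PARAM_BOUNDS = {
    "kick": ((0.0, 100.0), (-180.0, 180.0)),  # power, direction in degrees
    "dash": ((0.0, 100.0),),                  # power only
    "turn": ((-180.0, 180.0),),               # direction only
}


def sample_hybrid_action(rng):
    # First pick the discrete action, then sample only the parameters it needs.
    kind = rng.choice(sorted(PARAM_BOUNDS))
    params = tuple(rng.uniform(low, high) for low, high in PARAM_BOUNDS[kind])
    return HybridAction(kind, params)
```

Note that the number of continuous parameters depends on the discrete choice, which is part of what makes these spaces awkward for algorithms that expect a fixed-size action vector.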

### Multi-discrete and multi-binary action spaces

Some environments feature multiple independent discrete choices that must be made simultaneously. A game controller, for example, requires the agent to decide on several buttons at once. Gymnasium provides `MultiDiscrete` for representing the Cartesian product of multiple discrete spaces and `MultiBinary` for actions represented as binary vectors (such as pressing or not pressing each of several buttons).[5]

### Comparison of action space types

| Action space type | Description | Example | Common algorithms |
|---|---|---|---|
| Discrete | Finite set of distinct actions | Board games, Atari games | [DQN](/wiki/deep_q-network_dqn), [Q-learning](/wiki/q-learning), SARSA |
| Continuous | Real-valued actions within a range | Robotic control, self-driving | [DDPG](/wiki/ddpg), [TD3](/wiki/td3), [SAC](/wiki/soft_actor_critic), [PPO](/wiki/reinforcement_learning) |
| Hybrid (parameterized) | Discrete choice plus continuous parameters | Robot soccer, RTS games | HyAR, P-DQN, hierarchical actor-critic |
| Multi-discrete | Multiple independent discrete choices | Game controllers, multi-joint robots | PPO with multi-head output, A2C |
| Multi-binary | Binary vector of on/off decisions | Button-press combinations | PPO, A2C |

## How does a policy choose an action?

A [policy](/wiki/policy) is the function that determines which action an agent takes in a given state. Sutton and Barto define it as "a mapping from perceived states of the environment to actions to be taken when in those states."[1] Formally, a policy $$\pi$$ maps states to actions (or to probability distributions over actions). The policy is the core of a reinforcement learning agent because, in their words, it "alone is sufficient to determine behavior."[1]

Policies can be **deterministic**, producing a single action for each state ($$a = \pi(s)$$), or **stochastic**, producing a probability distribution over actions ($$\pi(a \mid s)$$). Stochastic policies are useful because they naturally support exploration, allowing the agent to try different actions rather than always repeating the same one.

### Deterministic vs. stochastic policies

| Property | Deterministic policy | Stochastic policy |
|---|---|---|
| Output | Single action $$a = \pi(s)$$ | Probability distribution $$\pi(a \mid s)$$ |
| Exploration | Requires external noise (e.g., Ornstein-Uhlenbeck) | Built-in through sampling |
| Common algorithms | [DDPG](/wiki/ddpg), [TD3](/wiki/td3) | [PPO](/wiki/reinforcement_learning), [SAC](/wiki/soft_actor_critic), REINFORCE |
| Typical use case | Continuous control with low noise | Environments requiring exploration |
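
The difference between the two policy types can be shown with a toy example. This is a minimal sketch: the two-action set, the state threshold, and the probabilities are arbitrary values chosen for illustration.

```python
import random

ACTIONS = ["left", "right"]


def deterministic_policy(state):
    # a = pi(s): the same state always yields the same action.
    return "right" if state < 3 else "left"


def stochastic_policy(state, rng):
    # pi(a | s): build a distribution over actions for this state, then sample.
    p_right = 0.9 if state < 3 else 0.2
    probs = [1.0 - p_right, p_right]
    return rng.choices(ACTIONS, weights=probs, k=1)[0]
```

Repeated calls to `deterministic_policy(0)` always return `"right"`, while repeated calls to `stochastic_policy(0, rng)` return `"right"` most of the time and `"left"` occasionally, which is the built-in exploration the table refers to.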

## Action-value function (Q-function)

The action-value function, commonly written as $$Q(s, a)$$, estimates the expected cumulative reward an agent will receive by taking action a in state s and then following a particular policy thereafter. This function is central to many RL algorithms and provides the basis for action selection in value-based methods.[9]

The optimal action-value function $$Q^*$$, which corresponds to acting optimally after the first action, satisfies the Bellman optimality equation (see [Bellman equation](/wiki/bellman_equation)):

$$
Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a')]
$$

where r is the immediate reward, $$\gamma$$ is the discount factor, $$s'$$ is the next state, and the max is taken over all possible next actions $$a'$$. This recursive relationship allows algorithms like Q-learning to iteratively update their estimates of Q-values until they converge to the optimal action-value function $$Q^*$$.[9]
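
The iterative update used by tabular Q-learning can be sketched as follows. This is an illustrative implementation under stated assumptions: the table is a `defaultdict` keyed by `(state, action)` pairs, and the step size `alpha` and its default value are choices made for this example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, done=False):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')."""
    # Terminal transitions have no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
```

Each call nudges one table entry toward a sampled version of the right-hand side of the equation above; repeated over many transitions, the entries approach $$Q^*$$ under the standard convergence conditions.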

The **advantage function** $$A(s, a) = Q(s, a) - V(s)$$ measures how much better a particular action is than the policy's average behavior in that state, where $$V(s)$$ is the state-value function. The advantage function is used in algorithms like [A2C](/wiki/a2c) (Advantage Actor-Critic) and PPO to reduce variance in policy gradient updates.[10]
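
For a single state with a small discrete action set, the advantage can be computed directly. The sketch below assumes the Q-values and the policy's action probabilities are already known, and uses the identity that $$V(s)$$ is the policy-weighted average of $$Q(s, a)$$; the function names are illustrative.

```python
def state_value(q_values, policy_probs):
    # V(s) = sum over actions of pi(a | s) * Q(s, a)
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values, policy_probs):
    # A(s, a) = Q(s, a) - V(s) for every action in this state
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]
```

A positive advantage marks an action that is better than what the policy does on average in that state, and a negative one marks an action that is worse.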

## How do agents balance exploration and exploitation when choosing actions?

A fundamental challenge in reinforcement learning is the [exploration-exploitation tradeoff](/wiki/exploration_exploitation). The agent must balance exploiting actions it already knows yield high rewards against exploring unfamiliar actions that might yield even higher rewards. Several strategies address this tradeoff.

### Epsilon-greedy

The [epsilon-greedy](/wiki/epsilon_greedy_policy) strategy selects a random action with probability $$\epsilon$$ and the action with the highest estimated value with probability $$1 - \epsilon$$. This is the simplest exploration method and is widely used with DQN and tabular Q-learning.[1] The value of $$\epsilon$$ is typically annealed (gradually reduced) over training so that the agent explores more at the beginning and exploits more as it learns.
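
A minimal sketch of epsilon-greedy selection with a linear annealing schedule is shown below. The schedule shape and its default values (`start`, `end`, `decay_steps`) are assumptions chosen for illustration; other schedules, such as exponential decay, are equally common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Anneal epsilon linearly from start to end, then hold it at end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

During training, a call such as `epsilon_greedy(q_values, linear_epsilon(step), rng)` explores heavily at first and becomes mostly greedy once `step` passes `decay_steps`.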

### Softmax (Boltzmann) exploration

Softmax exploration selects each action with probability proportional to $$\exp(Q(s, a) / \tau)$$, where $$\tau$$ is a temperature parameter. At high temperatures, all actions are nearly equally likely (more exploration). At low temperatures, the highest-valued action dominates (more exploitation). Unlike epsilon-greedy, softmax exploration accounts for the relative differences in action values rather than treating all non-greedy actions equally.[1]
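
The effect of the temperature can be seen by computing Boltzmann probabilities, that is, $$\exp(Q / \tau)$$ normalized over the actions. The sketch below is illustrative; subtracting the maximum value before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math


def softmax_probs(q_values, temperature):
    """Boltzmann action probabilities for one state."""
    m = max(q_values)  # shift so the largest exponent is 0
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, with values `[1.0, 2.0, 3.0]` a temperature of 1000 gives nearly uniform probabilities, while a temperature of 0.01 puts almost all probability on the last action.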

### Upper confidence bound (UCB)

UCB methods select actions based on an optimistic estimate of their value, adding a bonus term that reflects how uncertain the agent is about each action. Actions that have been tried fewer times receive a larger bonus, encouraging the agent to try them. UCB1, introduced by Auer, Cesa-Bianchi, and Fischer (2002), is the most widely cited variant and provides theoretical regret bounds for the [multi-armed bandit](/wiki/multi_armed_bandit) problem.[11]
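
A sketch of UCB1 action selection for a bandit problem follows. It assumes per-action play counts and mean reward estimates are tracked elsewhere, and uses the bonus term $$\sqrt{2 \ln t / n_a}$$ from the UCB1 analysis; the function and argument names are illustrative.

```python
import math


def ucb1_select(counts, values, t):
    """Pick an action index given play counts, mean rewards, and total plays t."""
    # Play every action once before the confidence bound is defined.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        values[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Because the bonus shrinks as an action's count grows, rarely tried actions are revisited from time to time even when their current mean estimate is lower.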

### Entropy-based exploration

Algorithms like SAC add an entropy bonus to the reward, encouraging the policy to remain as random as possible while still achieving high returns. This prevents premature convergence to a suboptimal deterministic policy and leads to more robust behavior.[6]
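
For a discrete policy, the entropy term is simple to compute. The sketch below is a simplified illustration of the idea, not SAC itself: it shows the Shannon entropy of an action distribution and a reward augmented by that entropy, with `alpha` as a hypothetical entropy coefficient.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_reward(reward, probs, alpha=0.2):
    # alpha trades off task reward against policy randomness (assumed value).
    return reward + alpha * entropy(probs)
```

A uniform distribution over n actions has the maximum entropy $$\ln n$$, while a deterministic distribution has entropy 0, so the bonus rewards policies that keep several actions in play.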

### Comparison of exploration strategies

| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Epsilon-greedy | Random action with probability $$\epsilon$$ | Simple to implement | Treats all non-greedy actions equally |
| Softmax (Boltzmann) | Action probabilities proportional to $$\exp(Q / \tau)$$ | Accounts for relative action values | Temperature tuning required |
| UCB | Optimistic value estimate with uncertainty bonus | Theoretical guarantees, principled | Harder to apply in deep RL |
| Entropy regularization | Bonus reward for policy randomness | Prevents premature convergence | Adds a hyperparameter (entropy coefficient) |
| Curiosity-driven | Intrinsic reward for novel states | Effective in sparse-reward settings | Can be distracted by noise |

## What is action masking and why does it matter?

In many real-world and game environments, not all actions are valid in every state. A chess agent cannot move a piece onto a square occupied by one of its own pieces, and a robot cannot move through a wall. Action masking is a technique that prevents the agent from selecting invalid actions by replacing the logits of invalid actions with a very large negative value (effectively negative infinity) before the softmax, so that those actions receive zero probability in the policy's distribution.[12]

Action masking matters most when the action space is very large. DeepMind's [AlphaStar](/wiki/alphastar) faced an average of roughly $$10^{26}$$ legal actions at every time step in StarCraft II, because hundreds of units and buildings must be controlled at once in real time.[16] When a policy's output covers a space that large and only part of it is valid in a given state, an unmasked agent can spend much of its exploration on invalid choices, so masking them out is important for efficient learning. Action masking was used prominently in AlphaStar and in [OpenAI Five](/wiki/openai_five) (OpenAI's Dota 2 agent). Huang and Ontanon (2020) found that invalid action masking "is empirically significant to the performance of policy gradient algorithms," leading to faster training, lower variance, and better final performance than letting the agent learn to avoid invalid actions through negative rewards alone.[12]
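
Masking can be sketched as a softmax in which the logits of invalid actions are set to negative infinity, so those actions end up with probability exactly zero. This is a minimal pure-Python illustration; deep learning implementations apply the same idea to tensors of logits, and the function name `masked_softmax` is chosen for this example.

```python
import math


def masked_softmax(logits, valid_mask):
    """Softmax over logits, with invalid actions forced to probability 0."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid_mask)]
    m = max(masked)  # finite, because at least one action is valid
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

The remaining probability mass is renormalized over the valid actions only, so sampling from the result can never return an invalid action.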

## Temporal abstraction and options

Standard RL treats each action as a single-step primitive (e.g., move one cell, apply one torque value). The **options framework**, introduced by Sutton, Precup, and Singh (1999), extends this by defining "options" as temporally extended actions. An option is a sub-policy that, once initiated, runs for multiple time steps until a termination condition is met.[13]

Examples of options include high-level behaviors such as "navigate to the door," "pick up the object," or "turn left at the intersection." Each option encapsulates a sequence of primitive actions. This hierarchy allows agents to plan and learn at multiple time scales, reducing the effective horizon of the decision problem.
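
An option can be sketched as a small record holding an initiation condition, an internal policy, and a termination condition. The example below is a simplification for illustration: termination is deterministic rather than probabilistic, and the `Option` class, the `corridor_step` dynamics, and the `go_to_door` option (with the door at cell 7) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    can_start: Callable[[int], bool]    # initiation condition
    policy: Callable[[int], int]        # primitive action for each state
    should_stop: Callable[[int], bool]  # termination condition


def corridor_step(state, action, length=10):
    # Hypothetical 1-D dynamics: action 1 moves right, anything else moves left.
    move = 1 if action == 1 else -1
    return min(max(state + move, 0), length - 1)


def run_option(option, state, max_steps=100):
    """Execute primitive actions until the option terminates."""
    if not option.can_start(state):
        raise ValueError("option cannot be initiated in this state")
    steps = 0
    while not option.should_stop(state) and steps < max_steps:
        state = corridor_step(state, option.policy(state))
        steps += 1
    return state, steps


go_to_door = Option(
    can_start=lambda s: s < 7,
    policy=lambda s: 1,
    should_stop=lambda s: s == 7,
)
```

From the point of view of a higher-level policy, `run_option(go_to_door, 2)` is a single decision, even though it expands into several primitive steps.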

The options framework is formalized within [Semi-Markov Decision Processes](/wiki/smdp) (SMDPs) and forms the basis of hierarchical reinforcement learning. The **Option-Critic architecture** (Bacon, Harb, and Precup, 2017) extended this work by allowing options to be learned end-to-end using policy gradient methods rather than being hand-designed.[14]

## Actions in multi-agent systems

In [multi-agent reinforcement learning](/wiki/reinforcement_learning) (MARL), multiple agents act simultaneously in a shared environment. The **joint action space** is the Cartesian product of all individual agents' action spaces. If each of n agents has k possible actions, the joint action space has $$k^n$$ elements, which grows exponentially. This combinatorial explosion is one of the primary challenges in MARL.[15]

Several approaches address this challenge:

- **Independent learners**: Each agent treats other agents as part of the environment and learns its own policy independently (e.g., Independent Q-Learning).
- **Centralized training, decentralized execution (CTDE)**: Agents share information during training but act independently at test time. QMIX and MAPPO are examples of this paradigm.
- **Factorized value functions**: The joint Q-function is decomposed into individual agent contributions, reducing the dimensionality of the problem. Factorized Q-Learning (FQL) scales to hundreds of agents.[15]
- **Communication protocols**: Agents learn to send messages to coordinate their actions without requiring a central controller.

## What kinds of actions appear in real-world applications?

Actions take different forms depending on the application domain.

| Application | Action space type | Example actions |
|---|---|---|
| Board games (chess, Go) | Discrete | Place a stone, move a piece |
| [Atari games](/wiki/atari) | Discrete | Move left, fire, no-op |
| [Robotic manipulation](/wiki/robot_learning) | Continuous | Joint torques, gripper force |
| [Autonomous driving](/wiki/autonomous_driving) | Continuous or hybrid | Steering angle, throttle, braking |
| Portfolio management | Continuous | Asset allocation percentages |
| Dialogue systems | Discrete | Select a response template, ask a clarification |
| Network routing | Discrete | Forward packet to a neighbor node |
| [Recommender systems](/wiki/recommender_system) | Discrete | Select an item to recommend |
| Drug dosing | Continuous | Dosage amount for a treatment |
| Energy grid management | Hybrid | Turn generator on/off (discrete), set output level (continuous) |

### Game playing

[AlphaGo](/wiki/alphazero), developed by [DeepMind](/wiki/deepmind), demonstrated that RL agents could defeat world champions at Go by learning to select moves (actions) through a combination of Monte Carlo tree search and deep [neural network](/wiki/neural_network) evaluation.[17] In Atari game environments, DQN agents learn to map raw pixel observations directly to discrete joystick actions.[4]

### Robotics

Robotic control tasks require continuous actions such as joint torques, velocities, and gripper forces. Simulated environments like MuJoCo and Isaac Gym allow agents to learn these control policies through millions of trial-and-error interactions before transferring the learned policy to a physical robot (sim-to-real transfer).

### Autonomous vehicles

Self-driving cars must continuously select acceleration, braking, and steering actions in dynamic traffic environments. RL-based approaches are used by companies such as Waymo and in academic research to train driving policies that handle complex scenarios like merging, lane changes, and intersection navigation.

## Action space design and shaping

The design of the action space can significantly affect learning performance. **Action space shaping** refers to the practice of modifying the action space to make learning easier without changing the underlying task.[18]

Common techniques include:

- **Action discretization**: Converting a continuous action space into a discrete one by dividing each dimension into bins. This allows the use of discrete algorithms but may sacrifice precision.
- **Action normalization**: Scaling all action dimensions to a common range (e.g., [-1, 1]) to help gradient-based optimization.
- **Action repeat (frame skipping)**: Repeating the same action for multiple consecutive frames, effectively reducing the decision frequency. This technique was used in the original DQN Atari experiments.
- **Curriculum over actions**: Starting with a simplified action space and gradually expanding it as the agent improves.
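
Three of the techniques above (discretization, normalization, and action repeat) can be sketched with small helper functions. These are illustrative only: the function names, the assumed `step_fn(state, action)` signature returning `(state, reward, done)`, and the default repeat count are choices made for this example.

```python
def discretize(low, high, n_bins):
    """Evenly spaced representative values covering [low, high] (n_bins >= 2)."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def rescale_action(a, low, high):
    """Map a normalized action in [-1, 1] to the environment's range [low, high]."""
    a = max(-1.0, min(1.0, a))  # clip out-of-range policy outputs
    return low + (a + 1.0) * 0.5 * (high - low)


def repeat_action(step_fn, state, action, k=4):
    """Apply the same action up to k times, summing the rewards."""
    total, done = 0.0, False
    for _ in range(k):
        state, reward, done = step_fn(state, action)
        total += reward
        if done:
            break
    return state, total, done
```

For instance, `discretize(-1.0, 1.0, 5)` turns a one-dimensional continuous action into five choices that a discrete algorithm can index, at the cost of not being able to select values in between.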

## How does "action" in RL differ from "action" in agentic AI?

The term "action" also appears in the broader context of [agentic AI](/wiki/ai_agent), where a large-language-model agent takes "actions" such as calling a tool, querying an API, executing code, or browsing the web. These two senses share the same conceptual root, an output that changes the world, but they are not identical. In classical reinforcement learning, an action is a formally defined element of a fixed action space A within an MDP, selected by a learned policy and credited with a scalar reward. In tool-using LLM agents, an "action" is typically a structured tool call produced by the model's reasoning, often without an explicit reward signal or formal action space. The RL notion analyzed throughout this article, with its discrete/continuous action spaces, policies, and value functions, is the precise technical meaning; the agentic sense is a looser, application-level usage. The two converge when agentic systems are trained with reinforcement learning, in which case the available tools effectively become the action space.

## Algorithms by action space support

Different RL algorithms are designed for different action space types. The table below summarizes which algorithms support which action spaces.

| Algorithm | Discrete | Continuous | Hybrid | On/off-policy |
|---|---|---|---|---|
| [Q-learning](/wiki/q-learning) | Yes | No | No | Off-policy |
| [DQN](/wiki/deep_q-network_dqn) | Yes | No | No | Off-policy |
| REINFORCE | Yes | Yes | No | On-policy |
| [A2C](/wiki/a2c) / A3C | Yes | Yes | No | On-policy |
| [PPO](/wiki/reinforcement_learning) | Yes | Yes | No | On-policy |
| [DDPG](/wiki/ddpg) | No | Yes | No | Off-policy |
| [TD3](/wiki/td3) | No | Yes | No | Off-policy |
| [SAC](/wiki/soft_actor_critic) | Variants only | Yes | Variants only | Off-policy |
| P-DQN | No | No | Yes | Off-policy |
| HyAR | No | No | Yes | Off-policy |

## See also

- [Reinforcement learning](/wiki/reinforcement_learning)
- [Policy](/wiki/policy)
- [Reward](/wiki/reward)
- [Environment](/wiki/environment)
- [Agent](/wiki/agent)
- [Markov decision process](/wiki/markov_decision_process_mdp)
- [Q-learning](/wiki/q-learning)
- [Deep Q-Network](/wiki/deep_q-network_dqn)
- [Bellman equation](/wiki/bellman_equation)
- [Epsilon-greedy policy](/wiki/epsilon_greedy_policy)
- [Experience replay](/wiki/experience_replay)

## References

[1] Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html

[2] Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.

[3] Tang, H., Liu, S., & Chen, Z. (2022). An Overview of the Action Space for Deep Reinforcement Learning. *Proceedings of the 4th International Conference on Computing and Data Science*. https://dl.acm.org/doi/fullHtml/10.1145/3508546.3508598

[4] Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.

[5] Farama Foundation. (2024). Gymnasium Documentation: Spaces. https://gymnasium.farama.org/api/spaces/

[6] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. *Proceedings of the 35th International Conference on Machine Learning (ICML)*.

[7] Wei, E., & Wicke, L. (2018). Hierarchical Approaches for Reinforcement Learning in Parameterized Action Space. *arXiv preprint arXiv:1810.09656*.

[8] Li, B., Tang, H., Zheng, Y., et al. (2021). HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation. *arXiv preprint arXiv:2109.05490*.

[9] Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292.

[10] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. *arXiv preprint arXiv:1506.02438*.

[11] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. *Machine Learning*, 47(2-3), 235-256.

[12] Huang, S., & Ontanon, S. (2020). A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. *arXiv preprint arXiv:2006.14171*. https://arxiv.org/abs/2006.14171

[13] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. *Artificial Intelligence*, 112(1-2), 181-211.

[14] Bacon, P.-L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. *Proceedings of the AAAI Conference on Artificial Intelligence*.

[15] Busoniu, L., Babuska, R., & De Schutter, B. (2008). A Comprehensive Survey of Multiagent Reinforcement Learning. *IEEE Transactions on Systems, Man, and Cybernetics, Part C*, 38(2), 156-172.

[16] Vinyals, O., Babuschkin, I., Czarnecki, W. M., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. *Nature*, 575(7782), 350-354. https://deepmind.google/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning/

[17] Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. *Nature*, 529(7587), 484-489.

[18] Kanervisto, A., Scheller, C., & Hautamaki, V. (2020). Action Space Shaping in Deep Reinforcement Learning. *arXiv preprint arXiv:2004.00980*.