In reinforcement learning (RL), a state is a complete description of the environment at a particular point in time. It captures all the information an agent needs to make a decision about which action to take next. States are a foundational concept in the Markov decision process (MDP) framework, which provides the mathematical backbone for most RL algorithms.
The concept of state connects several areas of artificial intelligence and control theory. In classical control, states describe the configuration of a dynamical system. In machine learning, states generalize to include any relevant information that a model uses during learning or inference. Within RL specifically, the state determines the agent's situation relative to its environment and directly influences both the reward it receives and the transitions it can make.
Imagine you are playing a board game. The "state" is everything you can see on the board right now: where all the pieces are, whose turn it is, and any cards that have been played. If someone took a photo of the board, that photo would be the state. You look at the photo and decide your next move. After you move, the board changes, and now there is a new state (a new photo). A robot learning to play the game does the same thing: it looks at the current state, picks a move, and then checks what the new state looks like.
In the MDP framework, a state belongs to a set S called the state space. An MDP is defined as a tuple (S, A, P, R, gamma) where:
| Symbol | Name | Description |
|---|---|---|
| S | State space | The set of all possible states the environment can be in |
| A | Action space | The set of all possible actions the agent can take |
| P(s'|s, a) | Transition function | The probability of moving to state s' given current state s and action a |
| R(s, a, s') | Reward function | The immediate reward received after transitioning from s to s' via action a |
| gamma | Discount factor | A value in [0, 1) that controls how much the agent values future rewards relative to immediate ones |
At each discrete time step t, the environment is in some state s_t in S. The agent observes s_t, selects an action a_t in A, and the environment transitions to a new state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). The agent then receives a reward r_{t+1} = R(s_t, a_t, s_{t+1}).
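To make this loop concrete, the following Python sketch samples trajectories from a toy two-state MDP. The states, actions, transition probabilities, and reward function here are invented purely for illustration.

```python
import random

# Toy two-state MDP (invented for illustration).
# transitions[(s, a)] lists (next_state, probability) pairs, i.e. P(s'|s, a).
transitions = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s0", 0.7), ("s1", 0.3)],
}

def reward(s, a, s_next):
    """R(s, a, s'): reward 1 for arriving in s1, else 0."""
    return 1.0 if s_next == "s1" else 0.0

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (s', r)."""
    next_states, probs = zip(*transitions[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, reward(s, a, s_next)

s = "s0"                                  # s_t
for t in range(5):
    a = random.choice(["stay", "go"])     # a_t, from a uniformly random policy
    s_next, r = step(s, a)                # s_{t+1} and r_{t+1}
    print(f"t={t}: {s} --{a}--> {s_next}, r={r}")
    s = s_next
```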
A state signal is said to satisfy the Markov property if the probability of the next state and reward depends only on the current state and action, not on the entire history of prior states and actions. Formally:
P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)
This means the current state contains all the information necessary to predict future states and rewards. As Sutton and Barto note in their textbook Reinforcement Learning: An Introduction, the state signal should ideally summarize everything about the complete history that is relevant to the future. In practice, the Markov property is rarely satisfied exactly, but it is often a reasonable approximation. Algorithms designed for MDPs can still perform well even when the Markov property holds only approximately.
The state space S is the set of all possible states the environment can occupy. State spaces vary widely in their structure and size depending on the problem.
| Property | Discrete state space | Continuous state space |
|---|---|---|
| Definition | A finite or countably infinite set of states | A subset of real-valued space (e.g., R^d) |
| Example | Grid positions in a maze; board positions in chess | Joint angles and velocities of a robotic arm |
| Size | Finite or countably infinite (states can be enumerated) | Uncountably infinite |
| Typical methods | Tabular methods (Q-learning, dynamic programming) | Function approximation (neural networks, tile coding) |
| Storage | Lookup tables | Parameterized models |
Discrete state spaces are common in board games, grid worlds, and combinatorial problems. For example, tic-tac-toe has only 5,478 reachable board configurations. Chess, by contrast, has an estimated 10^47 legal positions, making exhaustive tabulation impractical despite the space being technically discrete.
Continuous state spaces arise in robotics, physics simulations, and control tasks. The classic CartPole environment, widely used in RL research, has a four-dimensional continuous state vector consisting of the cart position, cart velocity, pole angle, and pole angular velocity. Each dimension takes values from a continuous range, producing an uncountably infinite state space.
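If the gymnasium package is installed, the CartPole state can be inspected directly; the snippet below is a minimal sketch of that (exact values will vary with the seed).

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)    # Box(4,): bounds of the 4-dimensional continuous state
state, info = env.reset(seed=0)
print(state)  # [cart position, cart velocity, pole angle, pole angular velocity]
```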
A finite state space contains a finite number of states. Many textbook RL problems use finite state spaces because they allow exact solutions through tabular methods. Infinite state spaces can be either countably infinite (discrete but unbounded) or uncountably infinite (continuous); both require approximation techniques for practical computation.
In many real-world problems, the agent does not have direct access to the true underlying state. Instead, it receives an observation, which may be a partial, noisy, or otherwise incomplete representation of the state.
| Concept | State | Observation |
|---|---|---|
| Definition | The complete description of the environment | What the agent actually perceives |
| Information | Contains all relevant information | May contain only partial information |
| Framework | Assumed known in an MDP | Central to a POMDP |
| Markov property | Satisfies the Markov property by definition | May not satisfy the Markov property |
| Example (robotics) | Full joint positions, velocities, and external forces | Camera image from a single viewpoint |
| Example (poker) | All players' cards and the deck order | Only the agent's own cards and community cards |
When the agent can observe the full state, the problem is called fully observable and is modeled as an MDP. When the agent receives only partial information, the problem is a partially observable Markov decision process (POMDP).
A POMDP extends the MDP tuple with an observation space Omega and an observation function O(o|s', a), which gives the probability of receiving observation o after taking action a and arriving in state s'. Formally, a POMDP is defined as the tuple (S, A, Omega, P, O, R, gamma).
Because the agent cannot observe the true state directly, it must maintain a belief state, which is a probability distribution over all possible states. The belief state b(s) represents the agent's estimate of how likely it is that the environment is in state s given all past observations and actions. After taking action a and receiving observation o, the agent updates its belief using Bayes' rule:
b'(s') = eta * O(o|s', a) * sum_s P(s'|s, a) * b(s)
where eta is a normalizing constant. The belief state itself satisfies the Markov property, which means a POMDP can be reformulated as a continuous-state MDP over the space of belief states (a "belief MDP"). However, solving belief MDPs exactly is computationally intractable for most problems because the belief space is continuous and high-dimensional even when the original state space is small and discrete.
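The belief update itself is only a few lines of numpy. In this sketch, `T_a` and `O_a_o` are illustrative names for the transition matrix of action a and the observation-likelihood vector for the received observation o.

```python
import numpy as np

def belief_update(b, T_a, O_a_o):
    """One Bayes-filter step over the belief state.

    b      : current belief, shape (|S|,)
    T_a    : T_a[s, s'] = P(s'|s, a) for the action a just taken
    O_a_o  : O_a_o[s'] = O(o|s', a) for the observation o just received
    """
    b_pred = b @ T_a              # predict: sum_s P(s'|s, a) * b(s)
    b_new = O_a_o * b_pred        # correct: weight by the observation likelihood
    return b_new / b_new.sum()    # normalize (the constant eta)
```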
Common approaches for handling partial observability include recurrent neural networks (such as LSTM and GRU architectures) that maintain an internal memory of past observations, frame stacking (concatenating several recent observations to approximate state), and attention-based methods.
How states are represented has a large effect on learning speed, generalization, and the computational cost of RL algorithms. A good state representation captures the features relevant to decision-making while discarding irrelevant details.
In early RL research, domain experts manually selected and engineered features to represent states. For example, in a robot navigation task, a human might define the state as a vector of distances to nearby obstacles, the robot's heading, and its speed. Hand-crafted features can be highly effective when domain knowledge is available, but they are labor-intensive to design and may fail to capture subtle patterns.
For small, discrete state spaces, each state can be stored as a separate entry in a table. Tabular Q-learning maintains a table of Q-values with one entry per state-action pair. Tabular methods provide convergence guarantees but do not scale to large or continuous state spaces.
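As a sketch of the tabular approach, the core Q-learning update can be written with a dictionary keyed by (state, action) pairs; the step size and discount below are illustrative choices.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99        # illustrative step size and discount factor
Q = defaultdict(float)          # one table entry per (state, action) pair

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```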
Function approximation methods represent the value function (or policy) as a parameterized function of the state, enabling generalization across similar states. Common approaches include linear combinations of hand-designed features, tile coding, and neural networks, as in the linear sketch below.
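For example, a linear state-value approximation v(s) ≈ w · phi(s) can be trained with semi-gradient TD(0). The feature function `phi` here is assumed to be supplied by the user (hand-crafted features, tile coding, and so on).

```python
import numpy as np

def td0_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) update for a linear value function v(s) = w . phi(s)."""
    td_error = r + gamma * w @ phi(s_next) - w @ phi(s)
    return w + alpha * td_error * phi(s)   # the gradient of v(s) w.r.t. w is phi(s)
```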
With the rise of deep learning, neural networks have become the dominant method for learning state representations from high-dimensional raw inputs such as images, audio, or text.
Convolutional neural networks (CNNs) are used to extract spatial features from image-based observations. The Deep Q-Network (DQN) introduced by Mnih et al. (2015) demonstrated that a CNN could learn to play Atari 2600 games at human-level performance directly from raw pixel inputs. The network received 84 x 84 grayscale images (stacked four frames deep to capture motion information) and output Q-values for each possible action. The convolutional layers learned to detect game objects, track motion, and extract other task-relevant features without any hand-crafted state engineering.
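A minimal PyTorch sketch of this architecture is shown below. The layer sizes follow the network described in the paper, but treat this as an approximate reconstruction rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """CNN mapping 4 stacked 84x84 grayscale frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x / 255.0)   # scale raw pixel values into [0, 1]
```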
Recurrent neural networks (RNNs) are used in partially observable settings where a single observation does not provide enough information to determine the state. LSTM and GRU networks maintain a hidden state that accumulates information over time, effectively learning to construct an approximate state representation from a history of observations. Deep Recurrent Q-Networks (DRQN) replace the fully connected layers in DQN with recurrent layers to handle partial observability.
Autoencoders and variational autoencoders (VAEs) learn compressed latent representations of high-dimensional observations in an unsupervised manner. The encoder maps observations to a low-dimensional latent space, and the decoder reconstructs the original observation from the latent code. The latent representation can then serve as the state input to an RL algorithm.
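A bare-bones autoencoder sketch in PyTorch, where the latent code `z` would be handed to the RL algorithm as the state; the dimensions here are illustrative.

```python
import torch.nn as nn

class StateAutoencoder(nn.Module):
    """Compress a flat observation to a low-dimensional latent state z."""
    def __init__(self, obs_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, x):
        z = self.encoder(x)           # latent code used as the RL state
        return self.decoder(z), z     # reconstruction trained with, e.g., MSE loss
```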
World models learn a predictive model of the environment's dynamics in a compressed latent state space. The agent can then "imagine" future trajectories by rolling out the learned model in latent space, enabling planning and more sample-efficient learning.
Ha and Schmidhuber (2018) proposed a world model architecture that combines a variational autoencoder (for spatial compression) with a recurrent network (for temporal prediction). The controller operates on the learned latent state rather than raw observations, allowing it to be very compact. Remarkably, agents could even be trained entirely inside their own "dreams" generated by the world model, and the learned policies transferred to the real environment.
Hafner et al. (2020) introduced Dreamer, which learns behaviors by "latent imagination." Dreamer uses a Recurrent State-Space Model (RSSM) that divides the latent state into deterministic and stochastic components. The deterministic part is propagated forward using a GRU, while the stochastic part captures uncertainty. Dreamer propagates analytic gradients of learned state values back through imagined trajectories, enabling efficient long-horizon planning. Subsequent versions of Dreamer (DreamerV2, DreamerV3) extended this approach across hundreds of diverse tasks.
States are the foundation of value functions, which estimate how good it is for an agent to be in a given state (or to take a given action in a given state).
The state value function V^pi(s) gives the expected cumulative discounted reward when starting in state s and following policy pi thereafter:
V^pi(s) = E_pi [ sum_{k=0}^{infinity} gamma^k * r_{t+k+1} | s_t = s ]
The Bellman equation for V^pi expresses the value of a state recursively in terms of the values of successor states:
V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
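Applying this equation repeatedly as an update rule yields iterative policy evaluation. The sketch below assumes a tabular problem with pi, P, and R supplied in the dictionary formats noted in the docstring.

```python
def policy_evaluation(states, actions, pi, P, R, gamma=0.99, tol=1e-8):
    """Sweep the Bellman equation until the value function stops changing.

    pi[s][a]      : pi(a|s), the policy's action probabilities
    P[(s, a)]     : list of (s_next, prob) pairs, P(s'|s, a)
    R[(s, a, s2)] : immediate reward for the transition
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                       for s2, p in P[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:           # converged to V^pi (up to tolerance)
            return V
```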
The action value function Q^pi(s, a) gives the expected cumulative discounted reward when starting in state s, taking action a, and then following policy pi:
Q^pi(s, a) = E_pi [ sum_{k=0}^{infinity} gamma^k * r_{t+k+1} | s_t = s, a_t = a ]
The relationship between V and Q is:
V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
Many RL algorithms (such as Q-learning, SARSA, and actor-critic methods) work by estimating and improving these value functions. The optimal value functions V* and Q* correspond to the policy that maximizes expected return from every state.
While the term "state" is most precisely defined in the RL context, it appears in other areas of machine learning as well.
| ML paradigm | What "state" refers to | How it changes |
|---|---|---|
| Supervised learning | Model parameters (weights, biases) and training data | Updated through gradient-based optimization during training |
| Unsupervised learning | Latent variables, cluster assignments, learned representations | Inferred during training without labeled guidance |
| Reinforcement learning | The environment's configuration at a point in time | Changes dynamically as the agent interacts with the environment |
| Recurrent models (LSTM, GRU) | The hidden state vector carried across time steps | Updated at each time step based on the current input and previous hidden state |
In supervised learning, the notion of "state" typically refers to the snapshot of model parameters during training. The state evolves as the optimizer updates weights to minimize the loss function. In unsupervised learning, state can refer to the current cluster assignments or latent variable estimates, which are updated iteratively as the model discovers structure in the data.
Not all state representations are equally useful for learning. Research has identified several properties that contribute to effective state representations.
| Property | Description | Why it matters |
|---|---|---|
| Markov | The state captures enough information so that the future is conditionally independent of the past given the present | Enables the use of standard MDP algorithms and simplifies the learning problem |
| Compact | The state uses a low-dimensional representation | Reduces computational cost and speeds up learning |
| Sufficient | The state retains all task-relevant information | Ensures the agent can learn an optimal policy |
| Smooth | Similar states map to nearby points in the representation space | Supports generalization across similar situations |
| Disentangled | Different factors of variation are captured by separate components of the state vector | Makes the representation easier to interpret and can improve transfer learning |
| Stationary | The distribution of states does not change over time | Simplifies learning; non-stationary distributions require adaptive methods |
Sparse state representations use mostly zero-valued features, with only a small number of active features for any given state. Tile coding, for example, naturally produces sparse representations because only a few tiles are active at any time. Sparse representations can improve computational efficiency because operations on sparse vectors skip zero entries. However, overly sparse representations may lose information, leading to aliasing where distinct states are mapped to the same representation.
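A deliberately simplified one-dimensional tile coding illustrates the sparsity: of the n_tilings * n_tiles features, exactly n_tilings are active for any state. Grid sizes are illustrative; practical implementations typically rely on established tile-coding software.

```python
import numpy as np

def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Sparse binary features for a scalar state x in [lo, hi]."""
    phi = np.zeros(n_tilings * n_tiles)
    for t in range(n_tilings):
        # Shift each tiling by a different fraction of one tile width
        offset = t * (hi - lo) / (n_tilings * n_tiles)
        idx = int((x - lo + offset) / (hi - lo) * n_tiles)
        idx = min(idx, n_tiles - 1)
        phi[t * n_tiles + idx] = 1.0   # exactly one active tile per tiling
    return phi
```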
State abstraction refers to techniques that simplify the state space by grouping or mapping states into a smaller set of abstract states. Abstraction reduces the effective size of the problem while (ideally) preserving enough information for near-optimal decision-making.
As the number of state dimensions grows, the volume of the state space increases exponentially. This phenomenon, known as the curse of dimensionality, means that the number of samples required to adequately cover the state space grows exponentially with its dimensionality. For model-based RL, the sample complexity is often proportional to the size of the state-action space.
Factored MDPs address this challenge by decomposing the state into independent or weakly interacting components. Instead of treating the state as a single monolithic vector, a factored MDP represents it as a collection of state variables, each with its own (smaller) domain. The transition dynamics are specified as a factored function, where each state variable depends on only a few other variables. This decomposition can reduce the sample complexity from exponential in the full state dimension to polynomial in the dimension of the largest factor.
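The sketch below shows the idea on a toy factored state: each variable's next value depends only on a small parent set, never on the full state vector. The variables and dynamics are invented for illustration.

```python
import random

def step_factored(state, action):
    """Factored transition: each state variable depends on a few parents only."""
    s = dict(state)
    # "light" depends only on the action taken
    s["light"] = 1 if action == "switch" else state["light"]
    # "temp" depends only on "light" and its own previous value
    drift = 1 if state["light"] else -1
    s["temp"] = state["temp"] + drift + random.choice([-1, 0, 1])
    return s

print(step_factored({"light": 0, "temp": 20}, "switch"))
```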
The following table illustrates what constitutes a state in various RL domains.
| Domain | State components | State space type | Observability |
|---|---|---|---|
| Chess | Positions of all pieces on the board, whose turn it is, castling rights, en passant availability | Discrete, finite (approximately 10^47 positions) | Fully observable |
| CartPole | Cart position, cart velocity, pole angle, pole angular velocity | Continuous, 4-dimensional | Fully observable |
| Atari games (DQN) | Raw pixel values of the game screen (stacked frames) | High-dimensional continuous (84 x 84 x 4) | Partially observable (single frame lacks velocity info) |
| Poker (Texas Hold'em) | Player's cards, community cards, pot size, betting history, opponent behavior | Mixed discrete and continuous | Partially observable (opponent cards hidden) |
| Robotic manipulation | Joint angles, joint velocities, gripper state, object positions and orientations | Continuous, high-dimensional | Often partially observable (occluded objects) |
| Autonomous driving | Vehicle position, velocity, heading; positions and velocities of other vehicles; lane markings; traffic signals | Continuous, very high-dimensional | Partially observable (sensor limitations, occlusions) |
| Grid world | Agent's (x, y) position on a grid | Discrete, finite | Fully observable |
Introduced by Mnih et al. (2015) in the DQN paper, frame stacking concatenates the last k observations (typically k = 4) as a single input. This technique converts a partially observable problem (where a single frame does not reveal velocity or direction of motion) into an approximately fully observable one. Frame stacking is computationally simple and has become standard practice in vision-based RL. However, fixed-size frame stacks may either include too much irrelevant information or miss events that occurred further in the past.
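A frame-stack helper can be written in a few lines; this sketch uses a deque so that appending a new observation automatically discards the oldest one.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the k most recent observations as an approximate state."""
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)
        self.k = k

    def reset(self, obs):
        for _ in range(self.k):        # fill the stack with copies of the first frame
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs):
        self.frames.append(obs)        # oldest frame is dropped automatically
        return np.stack(self.frames)   # shape (k, H, W) for image observations
```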
Data augmentation techniques, such as Reinforcement Learning with Augmented Data (RAD), apply random transformations to state observations (e.g., crops, color jitter, rotations) to improve generalization and data efficiency. These augmentations are applied consistently across the frame stack to preserve temporal information.
Normalizing state features to have zero mean and unit variance (or to lie within a fixed range) can significantly improve learning stability and speed, particularly when state dimensions have different scales. For example, in a robotics task, joint angles might range from -pi to pi while joint velocities might range from -10 to 10. Without normalization, the learning algorithm may be dominated by the dimensions with larger magnitudes.
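One common recipe is to track running statistics online and normalize each incoming state; the sketch below uses an incremental mean and variance update.

```python
import numpy as np

class RunningNormalizer:
    """Normalize state features to roughly zero mean and unit variance online."""
    def __init__(self, dim: int):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1e-8             # avoids division by zero before any updates

    def update(self, x):
        # Incremental (Welford-style) update of the running mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```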
The concept of state in sequential decision-making traces back to the work of Richard Bellman in the 1950s, who developed dynamic programming and the principle of optimality. Bellman's formulation explicitly uses states as the basis for recursive value computation. The MDP framework was formalized by Bellman (1957) and later refined by Howard (1960) and Puterman (1994).
The application of states to RL was shaped by several milestones: