In reinforcement learning (RL), a state is a complete description of the environment at a particular point in time. It captures all the information an agent needs to make a decision about which action to take next. States are a foundational concept in the Markov decision process (MDP) framework, which provides the mathematical backbone for most RL algorithms.
The concept of state connects several areas of artificial intelligence and control theory. In classical control, states describe the configuration of a dynamical system. In machine learning, states generalize to include any relevant information that a model uses during learning or inference. Within RL specifically, the state determines the agent's situation relative to its environment and directly influences both the reward it receives and the transitions it can make.
Imagine you are playing a board game. The "state" is everything you can see on the board right now: where all the pieces are, whose turn it is, and any cards that have been played. If someone took a photo of the board, that photo would be the state. You look at the photo and decide your next move. After you move, the board changes, and now there is a new state (a new photo). A robot learning to play the game does the same thing: it looks at the current state, picks a move, and then checks what the new state looks like.
In the MDP framework, a state belongs to a set S called the state space. An MDP is defined as a tuple (S, A, P, R, gamma) where:
| Symbol | Name | Description |
|---|---|---|
| S | State space | The set of all possible states the environment can be in |
| A | Action space | The set of all possible actions the agent can take |
| P(s'|s, a) | Transition function | The probability of moving to state s' given current state s and action a |
| R(s, a, s') | Reward function | The immediate reward received after transitioning from s to s' via action a |
| gamma | Discount factor | A value in [0, 1) that controls how much the agent values future rewards relative to immediate ones |
At each discrete time step t, the environment is in some state s_t in S. The agent observes s_t, selects an action a_t in A, and the environment transitions to a new state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). The agent then receives a reward r_{t+1} = R(s_t, a_t, s_{t+1}).
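To make this loop concrete, the following Python sketch samples trajectories from a toy two-state MDP. The states, actions, transition probabilities, and reward function here are invented purely for illustration.

```python
import random

# Toy two-state MDP (invented for illustration).
# transitions[(s, a)] lists (next_state, probability) pairs, i.e. P(s'|s, a).
transitions = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s0", 0.7), ("s1", 0.3)],
}

def reward(s, a, s_next):
    """R(s, a, s'): reward 1 for arriving in s1, else 0."""
    return 1.0 if s_next == "s1" else 0.0

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (s', r)."""
    next_states, probs = zip(*transitions[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, reward(s, a, s_next)

s = "s0"                                  # s_t
for t in range(5):
    a = random.choice(["stay", "go"])     # a_t, from a uniformly random policy
    s_next, r = step(s, a)                # s_{t+1} and r_{t+1}
    print(f"t={t}: {s} --{a}--> {s_next}, r={r}")
    s = s_next
```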
A state signal is said to satisfy the Markov property if the probability of the next state and reward depends only on the current state and action, not on the entire history of prior states and actions. Formally:
P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)
This means the current state contains all the information necessary to predict future states and rewards. As Sutton and Barto note in their textbook Reinforcement Learning: An Introduction, the state signal should ideally summarize everything about the complete history that is relevant to the future. In practice, the Markov property is rarely satisfied exactly, but it is often a reasonable approximation. Algorithms designed for MDPs can still perform well even when the Markov property holds only approximately.
The state space S is the set of all possible states the environment can occupy. State spaces vary widely in their structure and size depending on the problem.
| Property | Discrete state space | Continuous state space |
|---|---|---|
| Definition | A finite or countably infinite set of states | A subset of real-valued space (e.g., R^d) |
| Example | Grid positions in a maze; board positions in chess | Joint angles and velocities of a robotic arm |
| Size | Finite or countably infinite (states can be enumerated) | Uncountably infinite |
| Typical methods | Tabular methods (Q-learning, dynamic programming) | Function approximation (neural networks, tile coding) |
| Storage | Lookup tables | Parameterized models |
Discrete state spaces are common in board games, grid worlds, and combinatorial problems. For example, tic-tac-toe has only 5,478 reachable board configurations. Chess, by contrast, has an estimated 10^47 legal positions, making exhaustive tabulation impractical despite the space being technically discrete.
Continuous state spaces arise in robotics, physics simulations, and control tasks. The classic CartPole environment, widely used in RL research, has a four-dimensional continuous state vector consisting of the cart position, cart velocity, pole angle, and pole angular velocity. Each dimension takes values from a continuous range, producing an uncountably infinite state space.
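If the gymnasium package is installed, the CartPole state can be inspected directly; the snippet below is a minimal sketch of that (exact values will vary with the seed).

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)    # Box(4,): bounds of the 4-dimensional continuous state
state, info = env.reset(seed=0)
print(state)  # [cart position, cart velocity, pole angle, pole angular velocity]
```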
A finite state space contains a finite number of states. Many textbook RL problems use finite state spaces because they allow exact solutions through tabular methods. Infinite state spaces can be either countably infinite (discrete but unbounded) or uncountably infinite (continuous); both require approximation techniques for practical computation.
In many real-world problems, the agent does not have direct access to the true underlying state. Instead, it receives an observation, which may be a partial, noisy, or otherwise incomplete representation of the state.
| Concept | State | Observation |
|---|---|---|
| Definition | The complete description of the environment | What the agent actually perceives |
| Information | Contains all relevant information | May contain only partial information |
| Framework | Assumed known in an MDP | Central to a POMDP |
| Markov property | Satisfies the Markov property by definition | May not satisfy the Markov property |
| Example (robotics) | Full joint positions, velocities, and external forces | Camera image from a single viewpoint |
| Example (poker) | All players' cards and the deck order | Only the agent's own cards and community cards |
When the agent can observe the full state, the problem is called fully observable and is modeled as an MDP. When the agent receives only partial information, the problem is a partially observable Markov decision process (POMDP).
A POMDP extends the MDP tuple with an observation space Omega and an observation function O(o|s', a), which gives the probability of receiving observation o after taking action a and arriving in state s'. Formally, a POMDP is defined as the tuple (S, A, Omega, P, O, R, gamma).
Because the agent cannot observe the true state directly, it must maintain a belief state, which is a probability distribution over all possible states. The belief state b(s) represents the agent's estimate of how likely it is that the environment is in state s given all past observations and actions. After taking action a and receiving observation o, the agent updates its belief using Bayes' rule:
b'(s') = eta * O(o|s', a) * sum_s P(s'|s, a) * b(s)
where eta is a normalizing constant. The belief state itself satisfies the Markov property, which means a POMDP can be reformulated as a continuous-state MDP over the space of belief states (a "belief MDP"). However, solving belief MDPs exactly is computationally intractable for most problems because the belief space is continuous and high-dimensional even when the original state space is small and discrete.
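The belief update itself is only a few lines of numpy. In this sketch, `T_a` and `O_a_o` are illustrative names for the transition matrix of action a and the observation-likelihood vector for the received observation o.

```python
import numpy as np

def belief_update(b, T_a, O_a_o):
    """One Bayes-filter step over the belief state.

    b      : current belief, shape (|S|,)
    T_a    : T_a[s, s'] = P(s'|s, a) for the action a just taken
    O_a_o  : O_a_o[s'] = O(o|s', a) for the observation o just received
    """
    b_pred = b @ T_a              # predict: sum_s P(s'|s, a) * b(s)
    b_new = O_a_o * b_pred        # correct: weight by the observation likelihood
    return b_new / b_new.sum()    # normalize (the constant eta)
```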
Common approaches for handling partial observability include recurrent neural networks (such as LSTM and GRU architectures) that maintain an internal memory of past observations, frame stacking (concatenating several recent observations to approximate state), and attention-based methods.
How states are represented has a large effect on learning speed, generalization, and the computational cost of RL algorithms. A good state representation captures the features relevant to decision-making while discarding irrelevant details.
In early RL research, domain experts manually selected and engineered features to represent states. For example, in a robot navigation task, a human might define the state as a vector of distances to nearby obstacles, the robot's heading, and its speed. Hand-crafted features can be highly effective when domain knowledge is available, but they are labor-intensive to design and may fail to capture subtle patterns.
For small, discrete state spaces, each state can be stored as a separate entry in a table. Tabular Q-learning maintains a table of Q-values with one entry per state-action pair. Tabular methods provide convergence guarantees but do not scale to large or continuous state spaces.
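As a sketch of the tabular approach, the core Q-learning update can be written with a dictionary keyed by (state, action) pairs; the step size and discount below are illustrative choices.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99        # illustrative step size and discount factor
Q = defaultdict(float)          # one table entry per (state, action) pair

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```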
Function approximation methods represent the value function (or policy) as a parameterized function of the state, enabling generalization across similar states. Common approaches include linear combinations of hand-designed features, tile coding, and neural networks, as in the linear sketch below.
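For example, a linear state-value approximation v(s) ≈ w · phi(s) can be trained with semi-gradient TD(0). The feature function `phi` here is assumed to be supplied by the user (hand-crafted features, tile coding, and so on).

```python
import numpy as np

def td0_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) update for a linear value function v(s) = w . phi(s)."""
    td_error = r + gamma * w @ phi(s_next) - w @ phi(s)
    return w + alpha * td_error * phi(s)   # the gradient of v(s) w.r.t. w is phi(s)
```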
With the rise of deep learning, neural networks have become the dominant method for learning state representations from high-dimensional raw inputs such as images, audio, or text.
Convolutional neural networks (CNNs) are used to extract spatial features from image-based observations. The Deep Q-Network (DQN) introduced by Mnih et al. (2015) demonstrated that a CNN could learn to play Atari 2600 games at human-level performance directly from raw pixel inputs. The network received 84 x 84 grayscale images (stacked four frames deep to capture motion information) and output Q-values for each possible action. The convolutional layers learned to detect game objects, track motion, and extract other task-relevant features without any hand-crafted state engineering.
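A minimal PyTorch sketch of this architecture is shown below. The layer sizes follow the network described in the paper, but treat this as an approximate reconstruction rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """CNN mapping 4 stacked 84x84 grayscale frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x / 255.0)   # scale raw pixel values into [0, 1]
```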
Recurrent neural networks (RNNs) are used in partially observable settings where a single observation does not provide enough information to determine the state. LSTM and GRU networks maintain a hidden state that accumulates information over time, effectively learning to construct an approximate state representation from a history of observations. Deep Recurrent Q-Networks (DRQN) replace the fully connected layers in DQN with recurrent layers to handle partial observability.
Autoencoders and variational autoencoders (VAEs) learn compressed latent representations of high-dimensional observations in an unsupervised manner. The encoder maps observations to a low-dimensional latent space, and the decoder reconstructs the original observation from the latent code. The latent representation can then serve as the state input to an RL algorithm.
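A bare-bones autoencoder sketch in PyTorch, where the latent code `z` would be handed to the RL algorithm as the state; the dimensions here are illustrative.

```python
import torch.nn as nn

class StateAutoencoder(nn.Module):
    """Compress a flat observation to a low-dimensional latent state z."""
    def __init__(self, obs_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, x):
        z = self.encoder(x)           # latent code used as the RL state
        return self.decoder(z), z     # reconstruction trained with, e.g., MSE loss
```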
World models learn a predictive model of the environment's dynamics in a compressed latent state space. The agent can then "imagine" future trajectories by rolling out the learned model in latent space, enabling planning and more sample-efficient learning.
Ha and Schmidhuber (2018) proposed a world model architecture that combines a variational autoencoder (for spatial compression) with a recurrent network (for temporal prediction). The controller operates on the learned latent state rather than raw observations, allowing it to be very compact. Remarkably, agents could even be trained entirely inside their own "dreams" generated by the world model, and the learned policies transferred to the real environment.
Hafner et al. (2020) introduced Dreamer, which learns behaviors by "latent imagination." Dreamer uses a Recurrent State-Space Model (RSSM) that divides the latent state into deterministic and stochastic components. The deterministic part is propagated forward using a GRU, while the stochastic part captures uncertainty. Dreamer propagates analytic gradients of learned state values back through imagined trajectories, enabling efficient long-horizon planning. Subsequent versions of Dreamer (DreamerV2, DreamerV3) extended this approach across hundreds of diverse tasks.
States are the foundation of value functions, which estimate how good it is for an agent to be in a given state (or to take a given action in a given state).
The state value function V^pi(s) gives the expected cumulative discounted reward when starting in state s and following policy pi thereafter:
V^pi(s) = E_pi [ sum_{k=0}^{infinity} gamma^k * r_{t+k+1} | s_t = s ]
The Bellman equation for V^pi expresses the value of a state recursively in terms of the values of successor states:
V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
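Applying this equation repeatedly as an update rule yields iterative policy evaluation. The sketch below assumes a tabular problem with pi, P, and R supplied in the dictionary formats noted in the docstring.

```python
def policy_evaluation(states, actions, pi, P, R, gamma=0.99, tol=1e-8):
    """Sweep the Bellman equation until the value function stops changing.

    pi[s][a]      : pi(a|s), the policy's action probabilities
    P[(s, a)]     : list of (s_next, prob) pairs, P(s'|s, a)
    R[(s, a, s2)] : immediate reward for the transition
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                       for s2, p in P[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:           # converged to V^pi (up to tolerance)
            return V
```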
The action value function Q^pi(s, a) gives the expected cumulative discounted reward when starting in state s, taking action a, and then following policy pi:
Q^pi(s, a) = E_pi [ sum_{k=0}^{infinity} gamma^k * r_{t+k+1} | s_t = s, a_t = a ]
The relationship between V and Q is:
V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
Many RL algorithms (such as Q-learning, SARSA, and actor-critic methods) work by estimating and improving these value functions. The optimal value functions V* and Q* correspond to the policy that maximizes expected return from every state.
While the term "state" is most precisely defined in the RL context, it appears in other areas of machine learning as well.
| ML paradigm | What "state" refers to | How it changes |
|---|---|---|
| Supervised learning | Model parameters (weights, biases) and training data | Updated through gradient-based optimization during training |
| Unsupervised learning | Latent variables, cluster assignments, learned representations | Inferred during training without labeled guidance |
| Reinforcement learning | The environment's configuration at a point in time | Changes dynamically as the agent interacts with the environment |
| Recurrent models (LSTM, GRU) | The hidden state vector carried across time steps | Updated at each time step based on the current input and previous hidden state |
In supervised learning, the notion of "state" typically refers to the snapshot of model parameters during training. The state evolves as the optimizer updates weights to minimize the loss function. In unsupervised learning, state can refer to the current cluster assignments or latent variable estimates, which are updated iteratively as the model discovers structure in the data.
Not all state representations are equally useful for learning. Research has identified several properties that contribute to effective state representations.
| Property | Description | Why it matters |
|---|---|---|
| Markov | The state captures enough information so that the future is conditionally independent of the past given the present | Enables the use of standard MDP algorithms and simplifies the learning problem |
| Compact | The state uses a low-dimensional representation | Reduces computational cost and speeds up learning |
| Sufficient | The state retains all task-relevant information | Ensures the agent can learn an optimal policy |
| Smooth | Similar states map to nearby points in the representation space | Supports generalization across similar situations |
| Disentangled | Different factors of variation are captured by separate components of the state vector | Makes the representation easier to interpret and can improve transfer learning |
| Stationary | The distribution of states does not change over time | Simplifies learning; non-stationary distributions require adaptive methods |
Sparse state representations use mostly zero-valued features, with only a small number of active features for any given state. Tile coding, for example, naturally produces sparse representations because only a few tiles are active at any time. Sparse representations can improve computational efficiency because operations on sparse vectors skip zero entries. However, overly sparse representations may lose information, leading to aliasing where distinct states are mapped to the same representation.
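A deliberately simplified one-dimensional tile coding illustrates the sparsity: of the n_tilings * n_tiles features, exactly n_tilings are active for any state. Grid sizes are illustrative; practical implementations typically rely on established tile-coding software.

```python
import numpy as np

def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Sparse binary features for a scalar state x in [lo, hi]."""
    phi = np.zeros(n_tilings * n_tiles)
    for t in range(n_tilings):
        # Shift each tiling by a different fraction of one tile width
        offset = t * (hi - lo) / (n_tilings * n_tiles)
        idx = int((x - lo + offset) / (hi - lo) * n_tiles)
        idx = min(idx, n_tiles - 1)
        phi[t * n_tiles + idx] = 1.0   # exactly one active tile per tiling
    return phi
```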
State abstraction refers to techniques that simplify the state space by grouping or mapping states into a smaller set of abstract states. Abstraction reduces the effective size of the problem while (ideally) preserving enough information for near-optimal decision-making.
As the number of state dimensions grows, the volume of the state space increases exponentially. This phenomenon, known as the curse of dimensionality, means that the number of samples required to adequately cover the state space grows exponentially with its dimensionality. For model-based RL, the sample complexity is often proportional to the size of the state-action space.
Factored MDPs address this challenge by decomposing the state into independent or weakly interacting components. Instead of treating the state as a single monolithic vector, a factored MDP represents it as a collection of state variables, each with its own (smaller) domain. The transition dynamics are specified as a factored function, where each state variable depends on only a few other variables. This decomposition can reduce the sample complexity from exponential in the full state dimension to polynomial in the dimension of the largest factor.
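The sketch below shows the idea on a toy factored state: each variable's next value depends only on a small parent set, never on the full state vector. The variables and dynamics are invented for illustration.

```python
import random

def step_factored(state, action):
    """Factored transition: each state variable depends on a few parents only."""
    s = dict(state)
    # "light" depends only on the action taken
    s["light"] = 1 if action == "switch" else state["light"]
    # "temp" depends only on "light" and its own previous value
    drift = 1 if state["light"] else -1
    s["temp"] = state["temp"] + drift + random.choice([-1, 0, 1])
    return s

print(step_factored({"light": 0, "temp": 20}, "switch"))
```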
The following table illustrates what constitutes a state in various RL domains.
| Domain | State components | State space type | Observability |
|---|---|---|---|
| Chess | Positions of all pieces on the board, whose turn it is, castling rights, en passant availability | Discrete, finite (approximately 10^47 positions) | Fully observable |
| CartPole | Cart position, cart velocity, pole angle, pole angular velocity | Continuous, 4-dimensional | Fully observable |
| Atari games (DQN) | Raw pixel values of the game screen (stacked frames) | High-dimensional continuous (84 x 84 x 4) | Partially observable (single frame lacks velocity info) |
| Poker (Texas Hold'em) | Player's cards, community cards, pot size, betting history, opponent behavior | Mixed discrete and continuous | Partially observable (opponent cards hidden) |
| Robotic manipulation | Joint angles, joint velocities, gripper state, object positions and orientations | Continuous, high-dimensional | Often partially observable (occluded objects) |
| Autonomous driving | Vehicle position, velocity, heading; positions and velocities of other vehicles; lane markings; traffic signals | Continuous, very high-dimensional | Partially observable (sensor limitations, occlusions) |
| Grid world | Agent's (x, y) position on a grid | Discrete, finite | Fully observable |
Introduced by Mnih et al. (2015) in the DQN paper, frame stacking concatenates the last k observations (typically k = 4) as a single input. This technique converts a partially observable problem (where a single frame does not reveal velocity or direction of motion) into an approximately fully observable one. Frame stacking is computationally simple and has become standard practice in vision-based RL. However, fixed-size frame stacks may either include too much irrelevant information or miss events that occurred further in the past.
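A frame-stack helper can be written in a few lines; this sketch uses a deque so that appending a new observation automatically discards the oldest one.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the k most recent observations as an approximate state."""
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)
        self.k = k

    def reset(self, obs):
        for _ in range(self.k):        # fill the stack with copies of the first frame
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs):
        self.frames.append(obs)        # oldest frame is dropped automatically
        return np.stack(self.frames)   # shape (k, H, W) for image observations
```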
Data augmentation techniques, such as Reinforcement Learning with Augmented Data (RAD), apply random transformations to state observations (e.g., crops, color jitter, rotations) to improve generalization and data efficiency. These augmentations are applied consistently across the frame stack to preserve temporal information.
Normalizing state features to have zero mean and unit variance (or to lie within a fixed range) can significantly improve learning stability and speed, particularly when state dimensions have different scales. For example, in a robotics task, joint angles might range from -pi to pi while joint velocities might range from -10 to 10. Without normalization, the learning algorithm may be dominated by the dimensions with larger magnitudes.
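One common recipe is to track running statistics online and normalize each incoming state; the sketch below uses an incremental mean and variance update.

```python
import numpy as np

class RunningNormalizer:
    """Normalize state features to roughly zero mean and unit variance online."""
    def __init__(self, dim: int):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1e-8             # avoids division by zero before any updates

    def update(self, x):
        # Incremental (Welford-style) update of the running mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```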
The concept of state in sequential decision-making traces back to the work of Richard Bellman in the 1950s, who developed dynamic programming and the principle of optimality. Bellman's formulation explicitly uses states as the basis for recursive value computation. The MDP framework was formalized by Bellman (1957) and later refined by Howard (1960) and Puterman (1994).
The application of states to RL was shaped by several milestones: