# State (Reinforcement Learning)

> Source: https://aiwiki.ai/wiki/state
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [reinforcement learning](/wiki/reinforcement_learning) (RL), a **state** is a complete description of the environment at a particular point in time, containing all the information an [agent](/wiki/agent) needs to choose its next [action](/wiki/action). A state that retains all relevant information is said to satisfy the **Markov property**, meaning the future is independent of the past given the present: once the current state is known, the full history of how the agent got there is unnecessary for predicting what happens next. States are the foundational object of the [Markov decision process](/wiki/markov_decision_process_mdp) (MDP) framework, which provides the mathematical backbone for most RL algorithms.[1]

The ideal state signal, as Richard Sutton and Andrew Barto put it in *Reinforcement Learning: An Introduction*, is one that "summarizes past sensations compactly, yet in such a way that all relevant information is retained."[1] In classical control, states describe the configuration of a dynamical system. In [machine learning](/wiki/machine_learning), states generalize to include any relevant information a model uses during learning or inference. Within RL specifically, the state determines the agent's situation relative to its environment and directly influences both the [reward](/wiki/reward) it receives and the transitions it can make.

## Explain like I'm 5 (ELI5)

Imagine you are playing a board game. The "state" is everything you can see on the board right now: where all the pieces are, whose turn it is, and any cards that have been played. If someone took a photo of the board, that photo would be the state. You look at the photo and decide your next move. After you move, the board changes, and now there is a new state (a new photo). A robot learning to play the game does the same thing: it looks at the current state, picks a move, and then checks what the new state looks like.

## Formal definition

In the MDP framework, a state belongs to a set **S** called the **state space**. An MDP is defined as a tuple $$(S, A, P, R, \gamma)$$ where:[3]

| Symbol | Name | Description |
|--------|------|-------------|
| $$S$$ | State space | The set of all possible states the environment can be in |
| $$A$$ | [Action](/wiki/action) space | The set of all possible actions the agent can take |
| $$P(s' \mid s, a)$$ | Transition function | The probability of moving to state s' given current state s and action a |
| $$R(s, a, s')$$ | [Reward](/wiki/reward) function | The immediate reward received after transitioning from s to s' via action a |
| $$\gamma$$ | Discount factor | A value in $$[0, 1)$$ that controls how much the agent values future rewards relative to immediate ones |

At each discrete time step t, the environment is in some state $$s_t \in S$$. The agent observes $$s_t$$, selects an action $$a_t \in A$$, and the environment transitions to a new state $$s_{t+1}$$ according to the transition probability $$P(s_{t+1} \mid s_t, a_t)$$. The agent then receives a reward $$r_{t+1} = R(s_t, a_t, s_{t+1})$$.
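
The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state "cold"/"warm" MDP, its probabilities and rewards, and the helper names `step` and `rollout` are all hypothetical and chosen only to make the loop concrete.

```python
import random

# Hypothetical two-state MDP (illustrative numbers only).
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    "cold": {"wait": [(1.0, "cold", 0.0)],
             "heat": [(0.8, "warm", 1.0), (0.2, "cold", 0.0)]},
    "warm": {"wait": [(0.7, "warm", 1.0), (0.3, "cold", 0.0)],
             "heat": [(1.0, "warm", 0.5)]},
}

def step(state, action, rng):
    """Sample (s_{t+1}, r_{t+1}) from the transition model P(. | s_t, a_t)."""
    outcomes = P[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def rollout(start, policy, steps, seed=0):
    """Run the observe -> act -> transition -> reward loop for a fixed number of steps."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(steps):
        action = policy(state)                      # agent observes s_t and picks a_t
        next_state, reward = step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state                          # environment is now in s_{t+1}
    return trajectory

trajectory = rollout("cold", lambda s: "heat", steps=5)
```

Each tuple in `trajectory` records one transition $$(s_t, a_t, r_{t+1}, s_{t+1})$$, and the next state of one step is the current state of the following step.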

## What is the Markov property?

A state signal is said to satisfy the **Markov property** if the probability of the next state and reward depends only on the current state and action, not on the entire history of prior states and actions. Formally:

$$
P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)
$$

Intuitively, the Markov property means the future is conditionally independent of the past given the present: the current state is a sufficient statistic for predicting future states and rewards.[14] In the words of Sutton and Barto, "a state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property."[14] They illustrate this with a checkers position: the current configuration of all the pieces "summarizes everything important about the complete sequence of positions that led to it," so much of the path information is discarded yet all that matters for the future of the game is retained.[14]

In practice, the Markov property is rarely satisfied exactly, but it is often a reasonable approximation, and algorithms designed for MDPs can still perform well when it holds only approximately. The assumption also has an important theoretical payoff: under the Markov assumption there exists an optimal stationary policy that is no worse than any history-dependent or non-stationary policy, which is why RL methods can restrict their search to policies that depend only on the current state.[1]

## State space

The state space S is the set of all possible states the environment can occupy. State spaces vary widely in their structure and size depending on the problem.

### What is the difference between discrete and continuous state spaces?

| Property | Discrete state space | Continuous state space |
|----------|---------------------|------------------------|
| Definition | A finite or countably infinite set of states | A subset of real-valued space (e.g., $$\mathbb{R}^d$$) |
| Example | Grid positions in a maze; board positions in chess | Joint angles and velocities of a robotic arm |
| Size | Finite or countably infinite (states can be enumerated) | Uncountably infinite |
| Typical methods | Tabular methods ([Q-learning](/wiki/q-learning), dynamic programming) | [Function approximation](/wiki/function_approximation) ([neural networks](/wiki/neural_network), tile coding) |
| Storage | Lookup tables | Parameterized models |

**Discrete state spaces** are common in board games, grid worlds, and combinatorial problems. For example, in tic-tac-toe the state space contains fewer than 6,000 distinct board configurations. Chess, by contrast, has an estimated $$10^{47}$$ legal positions, making exhaustive tabulation impractical despite the space being technically discrete.
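
The tic-tac-toe figure can be checked by brute force. The sketch below enumerates every board reachable from the empty board, under the assumptions that X always moves first and that play stops as soon as one side has three in a row or the board is full. The string encoding of the board and the function names are illustrative choices.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def reachable_states():
    """Enumerate all boards reachable by legal play (X first, stop at terminal boards)."""
    start = " " * 9
    seen = {start}
    frontier = [start]
    while frontier:
        board = frontier.pop()
        if winner(board) is not None or " " not in board:
            continue  # terminal board: no further moves
        player = "X" if board.count("X") == board.count("O") else "O"
        for i, cell in enumerate(board):
            if cell == " ":
                nxt = board[:i] + player + board[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen

num_states = len(reachable_states())
```

Under these assumptions the enumeration yields 5,478 positions (counting the empty board), small enough for a lookup table with one entry per state.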

**Continuous state spaces** arise in robotics, physics simulations, and control tasks. The classic CartPole environment, widely used in RL research, has a four-dimensional continuous state vector consisting of the cart position, cart velocity, pole angle, and pole angular velocity. In the standard Gymnasium implementation the cart position is bounded to [-4.8, 4.8] and the pole angle to about [-0.418, 0.418] radians (roughly plus or minus 24 degrees), while the two velocities are unbounded; an episode terminates if the pole angle leaves the tighter range of plus or minus 12 degrees.[15] Because each dimension takes values from a continuous range, the state space is uncountably infinite.

### Finite vs. infinite state spaces

A finite state space has a bounded number of elements. Many textbook RL problems use finite state spaces because they allow exact solutions through tabular methods. Infinite state spaces can be either countably infinite (discrete but unbounded) or uncountably infinite (continuous). Infinite state spaces require approximation techniques for practical computation.

## What is the difference between a state and an observation?

In many real-world problems, the agent does not have direct access to the true underlying state. Instead, it receives an **observation**, which may be a partial, noisy, or otherwise incomplete representation of the state.

| Concept | State | Observation |
|---------|-------|-------------|
| Definition | The complete description of the environment | What the agent actually perceives |
| Information | Contains all relevant information | May contain only partial information |
| Framework | Assumed known in an MDP | Central to a [POMDP](/wiki/partially_observable_markov_decision_process) |
| Markov property | Satisfies the Markov property by definition | May not satisfy the Markov property |
| Example (robotics) | Full joint positions, velocities, and external forces | Camera image from a single viewpoint |
| Example (poker) | All players' cards and the deck order | Only the agent's own cards and community cards |

When the agent can observe the full state, the problem is called **fully observable** and is modeled as an MDP. When the agent receives only partial information, the problem is a **partially observable Markov decision process** (POMDP).[8]

### Partially observable environments and belief states

A POMDP extends the MDP tuple with an observation space $$\Omega$$ and an observation function $$O(o \mid s', a)$$ that gives the probability of receiving observation o after taking action a and arriving in state s'. Formally, a POMDP is defined as the tuple $$(S, A, P, R, \Omega, O, \gamma)$$.

Because the agent cannot observe the true state directly, it must maintain a **belief state**, which is a probability distribution over all possible states. The belief state $$b(s)$$ represents the agent's estimate of how likely it is that the environment is in state s given all past observations and actions. After taking action a and receiving observation o, the agent updates its belief using Bayes' rule:

$$
b'(s') = \eta\, O(o \mid s', a) \sum_s P(s' \mid s, a)\, b(s)
$$

where $$\eta$$ is a normalizing constant. The belief state itself satisfies the Markov property, which means a POMDP can be reformulated as a continuous-state MDP over the space of belief states (a "belief MDP"). However, solving belief MDPs exactly is computationally intractable for most problems because the belief space is continuous and high-dimensional even when the original state space is small and discrete.[8]
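
As a rough sketch, the belief update can be implemented directly for a small discrete POMDP. The dictionary layout (`P[s][a][s']` for transitions, `O[s'][a][o]` for observations) and the two-state "listening" example with 85 percent accurate observations are assumptions made for illustration.

```python
def belief_update(belief, action, observation, P, O):
    """Bayes update: b'(s') = eta * O(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction step: probability of landing in s_next before seeing o.
        predicted = sum(P[s][action].get(s_next, 0.0) * belief[s] for s in states)
        # Correction step: weight by the likelihood of the observation.
        unnormalized[s_next] = O[s_next][action].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())  # equals 1 / eta
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical example: listening never changes the hidden state,
# and the observation matches the true state 85% of the time.
P = {"left": {"listen": {"left": 1.0}}, "right": {"listen": {"right": 1.0}}}
O = {"left": {"listen": {"hear_left": 0.85, "hear_right": 0.15}},
     "right": {"listen": {"hear_left": 0.15, "hear_right": 0.85}}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", P, O)
b2 = belief_update(b1, "listen", "hear_left", P, O)
```

Starting from a uniform belief, one "hear_left" observation moves the belief in "left" to 0.85, and a second consistent observation sharpens it further, which shows how the belief summarizes the whole observation history in a single distribution.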

Common approaches for handling partial observability include recurrent neural networks (such as [LSTM](/wiki/long_short-term_memory_lstm) and GRU architectures) that maintain an internal memory of past observations, frame stacking (concatenating several recent observations to approximate state), and attention-based methods.

## How are states represented?

How states are represented has a large effect on learning speed, generalization, and the computational cost of RL algorithms. A good state representation captures the features relevant to decision-making while discarding irrelevant details.[12]

### Hand-crafted features

In early RL research, domain experts manually selected and engineered features to represent states. For example, in a robot navigation task, a human might define the state as a vector of distances to nearby obstacles, the robot's heading, and its speed. Hand-crafted features can be highly effective when domain knowledge is available, but they are labor-intensive to design and may fail to capture subtle patterns.

### Tabular representations

For small, discrete state spaces, each state can be stored as a separate entry in a table. [Tabular Q-learning](/wiki/tabular_q-learning) maintains a table of Q-values with one entry per state-action pair. Tabular methods provide convergence guarantees but do not scale to large or continuous state spaces.[1]
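
A minimal sketch of the tabular idea follows, using the standard one-step Q-learning update rule. The table is a dictionary keyed by (state, action) pairs; the grid-cell states, the action names, and the step-size and discount values are hypothetical, and terminal-state handling is omitted for brevity.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to the table entry for (state, action)."""
    best_next = max(Q[(next_state, a)] for a in actions)  # greedy value of s'
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

# One table entry per state-action pair; unseen entries default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]

updated = q_learning_update(Q, state=(0, 0), action="right", reward=1.0,
                            next_state=(0, 1), actions=actions,
                            alpha=0.5, gamma=0.9)
```

Because every state-action pair needs its own entry, memory grows with the product of the number of states and the number of actions, which is why this approach is limited to small state spaces.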

### Function approximation

Function approximation methods represent the value function (or policy) as a parameterized function of the state, enabling generalization across similar states. Common approaches include:

- **Linear function approximation:** The state is represented as a feature vector, and the value function is a linear combination of those features. This is computationally efficient but limited in expressiveness.
- **Tile coding:** The continuous state space is partitioned into overlapping sets of tiles. Each tile is a binary feature indicating whether the state falls within it. Multiple overlapping tilings provide finer resolution and better generalization. Tile coding has been widely used in classic RL applications due to its balance of representational power and computational cost.[1]
- **Coarse coding:** Similar to tile coding, but uses overlapping receptive fields (often circular or Gaussian-shaped) rather than axis-aligned tiles. Multiple features can be active simultaneously, which enables generalization across neighboring states.
- **Radial basis functions (RBFs):** Continuous-valued features based on the distance between the current state and a set of prototype states. Unlike tile coding, RBFs produce graded activations rather than binary ones.
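
To make the tile coding idea concrete, here is a small one-dimensional sketch. The uniform offsets between tilings, the extra tile per tiling to cover the shift, and the function names are simplifying assumptions; practical implementations often use hashing and asymmetric offsets.

```python
def active_tiles(x, low, high, tiles_per_tiling, num_tilings):
    """Return the index of the single active tile in each tiling for a scalar x."""
    tile_width = (high - low) / tiles_per_tiling
    active = []
    for t in range(num_tilings):
        # Each tiling is shifted by a different fraction of one tile width.
        offset = t * tile_width / num_tilings
        idx = int((x - low + offset) / tile_width)
        idx = min(idx, tiles_per_tiling)  # each tiling has one extra tile for the shift
        active.append(t * (tiles_per_tiling + 1) + idx)
    return active

def feature_vector(x, low, high, tiles_per_tiling, num_tilings):
    """Binary feature vector with exactly one active feature per tiling."""
    features = [0] * (num_tilings * (tiles_per_tiling + 1))
    for i in active_tiles(x, low, high, tiles_per_tiling, num_tilings):
        features[i] = 1
    return features

phi = feature_vector(0.2, low=0.0, high=1.0, tiles_per_tiling=4, num_tilings=2)
```

A linear value estimate is then the sum of the weights at the active indices. Nearby inputs share some but not all active tiles, which is what produces generalization between neighboring states.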

### Deep state representations

With the rise of [deep learning](/wiki/deep_learning), neural networks have become the dominant method for learning state representations from high-dimensional raw inputs such as images, audio, or text.

**Convolutional neural networks (CNNs)** are used to extract spatial features from image-based observations. The [Deep Q-Network](/wiki/deep_q-network_dqn) (DQN) introduced by Mnih et al. (2015) demonstrated that a CNN could learn to play Atari 2600 games at human-level performance directly from raw pixel inputs.[4] Using the same network architecture and hyperparameters across 49 games, DQN outperformed the best prior reinforcement learning methods on 43 of them and reached more than 75 percent of a professional human games tester's score on more than half, performing at a level "comparable to that of a professional human games tester."[4] The network received 84 x 84 grayscale images (stacked four frames deep to capture motion information) and output Q-values for each possible action. The convolutional layers learned to detect game objects, track motion, and extract other task-relevant features without any hand-crafted state engineering.[5]

**Recurrent neural networks (RNNs)** are used in partially observable settings where a single observation does not provide enough information to determine the state. [LSTM](/wiki/long_short-term_memory_lstm) and GRU networks maintain a hidden state that accumulates information over time, effectively learning to construct an approximate state representation from a history of observations. Deep Recurrent Q-Networks (DRQN) replace the first fully connected layer in DQN with a recurrent LSTM layer to handle partial observability.

**Autoencoders and variational autoencoders (VAEs)** learn compressed latent representations of high-dimensional observations in an unsupervised manner. The encoder maps observations to a low-dimensional latent space, and the decoder reconstructs the original observation from the latent code. The latent representation can then serve as the state input to an RL algorithm.

### World models and latent states

World models learn a predictive model of the environment's dynamics in a compressed latent state space. The agent can then "imagine" future trajectories by rolling out the learned model in latent space, enabling planning and more sample-efficient learning.

Ha and Schmidhuber (2018) proposed a world model architecture that combines a [variational autoencoder](/wiki/variational_autoencoder) (compressing each 64 x 64 RGB frame into a 32-dimensional latent vector) with a mixture-density recurrent network that predicts the next latent code, plus a tiny linear controller operating on the learned latent state rather than raw observations.[6] Remarkably, agents could even be trained entirely inside their own "dreams" generated by the recurrent world model, and the learned policies transferred back to the real environment.[6]

Hafner et al. (2020) introduced Dreamer, which learns behaviors by "latent imagination." Dreamer uses a Recurrent State-Space Model (RSSM) that divides the latent state into deterministic and stochastic components. The deterministic part is propagated forward using a GRU, while the stochastic part captures uncertainty. Dreamer propagates analytic gradients of learned state values back through imagined trajectories, enabling efficient long-horizon planning.[7] Subsequent versions (DreamerV2, DreamerV3) extended this approach to a much wider range of tasks, with DreamerV3 evaluated on more than 150 tasks using a single set of hyperparameters.

## What is a state-value function?

States are the foundation of value functions, which estimate how good it is for an agent to be in a given state (or to take a given action in a given state).

### State value function V(s)

The **state value function** $$V^\pi(s)$$ gives the expected cumulative discounted reward when starting in state s and following [policy](/wiki/policy) $$\pi$$ thereafter:

$$
V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right]
$$

The [Bellman equation](/wiki/bellman_equation) for $$V^\pi$$ expresses the value of a state recursively in terms of the values of successor states:[2]

$$
V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^\pi(s') \right]
$$
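
Treating the Bellman equation as an update rule gives iterative policy evaluation. The sketch below applies it to a hypothetical two-state MDP; the data layout (`P[s][a]` as a list of `(probability, next_state, reward)` tuples), the stopping tolerance, and the example numbers are assumptions for illustration.

```python
def evaluate_policy(P, policy, gamma=0.9, tol=1e-10):
    """Repeatedly apply the Bellman equation for V^pi until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

# Hypothetical deterministic MDP: staying in A pays 1 per step, everything else pays 0.
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "move": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 0.0)], "move": [(1.0, "A", 0.0)]},
}
always_stay = {s: {"stay": 1.0, "move": 0.0} for s in P}

V = evaluate_policy(P, always_stay, gamma=0.9)
```

For this policy the value of A is the geometric series $$1 + 0.9 + 0.9^2 + \ldots = 10$$ and the value of B is 0, which the iteration approaches to within the tolerance.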

### Action value function Q(s, a)

The **action value function** $$Q^\pi(s, a)$$ gives the expected cumulative discounted reward when starting in state s, taking action a, and then following policy $$\pi$$:

$$
Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right]
$$

The relationship between $$V$$ and $$Q$$ is:

$$
V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)
$$
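
As a short worked step connecting the two definitions, the action value can itself be written as a one-step lookahead over successor states, using the same transition and reward notation as the formal definition:

$$
Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^\pi(s') \right]
$$

Substituting this expression into $$V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$$ recovers the Bellman equation for $$V^\pi$$ term by term.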

Many RL algorithms (such as [Q-learning](/wiki/q-learning), SARSA, and actor-critic methods) work by estimating and improving these value functions.[9] The optimal value functions $$V^*$$ and $$Q^*$$ correspond to the policy that maximizes expected return from every state.

## State in different types of machine learning

While the term "state" is most precisely defined in the RL context, it appears in other areas of machine learning as well.

| ML paradigm | What "state" refers to | How it changes |
|-------------|------------------------|----------------|
| [Supervised learning](/wiki/supervised_learning) | Model parameters (weights, biases) and training data | Updated through gradient-based optimization during training |
| [Unsupervised learning](/wiki/unsupervised_learning) | Latent variables, cluster assignments, learned representations | Inferred during training without labeled guidance |
| [Reinforcement learning](/wiki/reinforcement_learning) | The environment's configuration at a point in time | Changes dynamically as the agent interacts with the environment |
| Recurrent models ([LSTM](/wiki/long_short-term_memory_lstm), GRU) | The hidden state vector carried across time steps | Updated at each time step based on the current input and previous hidden state |

In [supervised learning](/wiki/supervised_learning), the notion of "state" typically refers to the snapshot of model parameters during training. The state evolves as the optimizer updates weights to minimize the [loss function](/wiki/loss_function). In [unsupervised learning](/wiki/unsupervised_learning), state can refer to the current cluster assignments or latent variable estimates, which are updated iteratively as the model discovers structure in the data.

## What makes a good state representation?

Not all state representations are equally useful for learning. Research has identified several properties that contribute to effective state representations.[12]

| Property | Description | Why it matters |
|----------|-------------|----------------|
| Markov | The state captures enough information so that the future is conditionally independent of the past given the present | Enables the use of standard MDP algorithms and simplifies the learning problem |
| Compact | The state uses a low-dimensional representation | Reduces computational cost and speeds up learning |
| Sufficient | The state retains all task-relevant information | Ensures the agent can learn an optimal policy |
| Smooth | Similar states map to nearby points in the representation space | Supports generalization across similar situations |
| Disentangled | Different factors of variation are captured by separate components of the state vector | Makes the representation easier to interpret and can improve transfer learning |
| Stationary | The distribution of states does not change over time | Simplifies learning; non-stationary distributions require adaptive methods |

### Sparsity in state representations

Sparse state representations use mostly zero-valued features, with only a small number of active features for any given state. Tile coding, for example, naturally produces sparse representations because only a few tiles are active at any time. Sparse representations can improve computational efficiency because operations on sparse vectors skip zero entries. However, overly sparse representations may lose information, leading to aliasing where distinct states are mapped to the same representation.

## State abstraction

State abstraction refers to techniques that simplify the state space by grouping or mapping states into a smaller set of abstract states. Abstraction reduces the effective size of the problem while (ideally) preserving enough information for near-optimal decision-making.[11]

### Types of state abstraction

- **State aggregation:** States that are considered equivalent under some criterion are merged into a single abstract state. For example, states with the same optimal action and similar transition dynamics might be grouped together. Bisimulation metrics formalize this idea by defining a distance between states based on their behavioral similarity.[11]
- **Feature selection:** Irrelevant state features are removed, reducing the dimensionality of the state space. Feature selection can be done manually (using domain knowledge) or automatically (using techniques like mutual information or learned attention masks).
- **Hierarchical abstraction:** The state space is organized into multiple levels of abstraction, where higher levels represent coarser, more abstract descriptions. Options and semi-MDPs provide a formal framework for hierarchical RL, where temporally extended actions operate over abstract state spaces.
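
State aggregation, the simplest of these, can be sketched as a mapping from ground states to abstract states. The example below merges cells of a grid world into coarse blocks; the block size and the function name `aggregate` are hypothetical, and whether such a grouping preserves near-optimal behavior depends on the task.

```python
def aggregate(state, block=2):
    """Map a ground state (x, y) to the coarse block that contains it."""
    x, y = state
    return (x // block, y // block)

# A 4 x 4 grid has 16 ground states; with 2 x 2 blocks it has 4 abstract states.
ground_states = [(x, y) for x in range(4) for y in range(4)]
abstract_states = {aggregate(s) for s in ground_states}
```

A tabular learner can then index its value table by `aggregate(state)` instead of `state`, sharing one estimate among all ground states in the same block.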

### The curse of dimensionality

As the number of state dimensions grows, the volume of the state space increases exponentially. This phenomenon, known as the [curse of dimensionality](/wiki/curse_of_dimensionality), means that the number of samples required to adequately cover the state space grows exponentially with its dimensionality.[2] For model-based RL, the sample complexity is often proportional to the size of the state-action space.
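
The exponential growth is easy to see with a small calculation. Assuming each state dimension is discretized into the same number of bins (10 here, an arbitrary choice), the number of grid cells is the bin count raised to the number of dimensions:

```python
def grid_cells(bins_per_dim, dims):
    """Number of cells when every dimension is split into the same number of bins."""
    return bins_per_dim ** dims

# Cell counts for a few dimensionalities at 10 bins per dimension.
counts = {d: grid_cells(10, d) for d in (1, 2, 4, 8)}
```

With 10 bins per dimension, a 4-dimensional state already needs 10,000 cells and an 8-dimensional state needs 100 million, so visiting every cell even once quickly becomes impractical.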

Factored MDPs address this challenge by decomposing the state into independent or weakly interacting components. Instead of treating the state as a single monolithic vector, a factored MDP represents it as a collection of state variables, each with its own (smaller) domain. The transition dynamics are specified as a factored function, where each state variable depends on only a few other variables. This decomposition can reduce the sample complexity from exponential in the total number of state variables to polynomial in the number of parameters of the factored model, which grows exponentially only in the number of variables that each factor depends on.

## Practical examples of states

The following table illustrates what constitutes a state in various RL domains.

| Domain | State components | State space type | Observability |
|--------|-----------------|------------------|---------------|
| Chess | Positions of all pieces on the board, whose turn it is, castling rights, en passant availability | Discrete, finite (approximately $$10^{47}$$ positions) | Fully observable |
| CartPole | Cart position, cart velocity, pole angle, pole angular velocity | Continuous, 4-dimensional | Fully observable |
| Atari games (DQN) | Raw pixel values of the game screen (stacked frames) | High-dimensional continuous (84 x 84 x 4) | Partially observable (single frame lacks velocity info) |
| Poker (Texas Hold'em) | Player's cards, community cards, pot size, betting history, opponent behavior | Mixed discrete and continuous | Partially observable (opponent cards hidden) |
| Robotic manipulation | Joint angles, joint velocities, gripper state, object positions and orientations | Continuous, high-dimensional | Often partially observable (occluded objects) |
| Autonomous driving | Vehicle position, velocity, heading; positions and velocities of other vehicles; lane markings; traffic signals | Continuous, very high-dimensional | Partially observable (sensor limitations, occlusions) |
| Grid world | Agent's (x, y) position on a grid | Discrete, finite | Fully observable |

## Techniques for handling complex state spaces

### Frame stacking

Popularized by the DQN work of Mnih et al. (2013, 2015), frame stacking concatenates the last k observations (typically k = 4) into a single input. This technique converts a partially observable problem (where a single frame does not reveal velocity or direction of motion) into an approximately fully observable one. Frame stacking is computationally simple and has become standard practice in vision-based RL.[4][5] However, a fixed-size frame stack may either include too much irrelevant information or miss events that occurred further in the past.
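
A frame stack is essentially a fixed-length queue of recent observations. The sketch below uses `collections.deque`; the class name `FrameStack` and the choice to fill the stack with copies of the first frame at the start of an episode are assumptions, since padding conventions differ between implementations.

```python
from collections import deque

class FrameStack:
    """Keep the k most recent frames and expose them together as one state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # the oldest frame drops out automatically

    def reset(self, first_frame):
        """Start an episode by filling the stack with copies of the first frame."""
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        """Add the newest frame and return the updated stacked state."""
        self.frames.append(frame)
        return self.state()

    def state(self):
        return tuple(self.frames)  # ordered oldest to newest

stack = FrameStack(k=3)
s0 = stack.reset("frame0")
s1 = stack.push("frame1")
```

In a vision-based agent the frames would be image arrays and the stacked state would be concatenated along a channel axis before being passed to the network.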

### Data augmentation

Data augmentation techniques, such as Reinforcement Learning with Augmented Data (RAD), apply random transformations to state observations (e.g., crops, color jitter, rotations) to improve generalization and data efficiency. These augmentations are applied consistently across the frame stack to preserve temporal information.[13]

### State normalization

Normalizing state features to have zero mean and unit variance (or to lie within a fixed range) can significantly improve learning stability and speed, particularly when state dimensions have different scales. For example, in a robotics task, joint angles might range from $$-\pi$$ to $$\pi$$ while joint velocities might range from -10 to 10. Without normalization, the learning algorithm may be dominated by the dimensions with larger magnitudes.
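
Since state statistics are usually not known in advance, one common approach is to estimate the mean and variance online. The sketch below uses Welford's running-variance algorithm in pure Python; the class name `RunningNormalizer`, the small `eps` constant, and the use of population variance are illustrative assumptions.

```python
import math

class RunningNormalizer:
    """Track a per-dimension running mean and variance and standardize states."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations from the mean
        self.eps = eps          # avoids division by zero for constant features

    def update(self, x):
        """Fold one observed state into the running statistics (Welford's method)."""
        self.count += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (xi - self.mean[i])

    def normalize(self, x):
        """Return x shifted and scaled to roughly zero mean and unit variance."""
        out = []
        for i, xi in enumerate(x):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((xi - self.mean[i]) / math.sqrt(var + self.eps))
        return out

norm = RunningNormalizer(dim=2)
for state in ([0.0, 0.0], [2.0, 20.0]):
    norm.update(state)
scaled = norm.normalize([2.0, 20.0])
```

After the two updates both dimensions of `scaled` are close to 1, even though the raw features differ by a factor of ten, so neither dimension dominates the input to the learner.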

## Historical context

The concept of state in sequential decision-making traces back to the work of Richard Bellman in the 1950s, who developed [dynamic programming](/wiki/dynamic_programming) and the principle of optimality.[2] Bellman's formulation explicitly uses states as the basis for recursive value computation. The MDP framework was formalized by Bellman (1957) and later refined by Howard (1960) and Puterman (1994).[10][3]

The application of states to RL was shaped by several milestones:

- **Temporal difference learning (Sutton, 1988):** Introduced methods for learning state value functions from experience without requiring a model of the environment.[1]
- **Q-learning (Watkins, 1989):** Provided an off-policy method for learning action-value functions, enabling agents to learn optimal behavior from exploratory data.[9]
- **DQN (Mnih et al., 2015):** Demonstrated that deep neural networks could learn effective state representations directly from high-dimensional pixel inputs, bridging the gap between RL and [deep learning](/wiki/deep_learning).[4]
- **World Models (Ha and Schmidhuber, 2018):** Showed that agents could learn compact latent state representations and even train entirely in imagined environments.[6]
- **Dreamer (Hafner et al., 2020):** Advanced latent state learning with gradient-based planning through imagined trajectories.[7]

## Related concepts

- [Action](/wiki/action)
- [Agent](/wiki/agent)
- [Bellman equation](/wiki/bellman_equation)
- [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn)
- [Environment](/wiki/environment)
- [Markov decision process (MDP)](/wiki/markov_decision_process_mdp)
- [Policy](/wiki/policy)
- [Q-learning](/wiki/q-learning)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Reward](/wiki/reward)

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
2. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
3. Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.
4. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.
5. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. *arXiv preprint arXiv:1312.5602*.
6. Ha, D., & Schmidhuber, J. (2018). World models. *arXiv preprint arXiv:1803.10122*.
7. Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to control: Learning behaviors by latent imagination. *Proceedings of the International Conference on Learning Representations (ICLR)*.
8. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. *Artificial Intelligence*, 101(1-2), 99-134.
9. Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, Cambridge.
10. Howard, R. A. (1960). *Dynamic Programming and Markov Processes*. MIT Press.
11. Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. *Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM)*.
12. Lesort, T., Díaz-Rodríguez, N., Goudou, J.-F., & Filliat, D. (2018). State representation learning for control: An overview. *Neural Networks*, 108, 379-392.
13. Echchahed, B., & Castro, P. S. (2025). A survey of state representation learning for deep reinforcement learning. *arXiv preprint arXiv:2506.17518*.
14. Sutton, R. S., & Barto, A. G. (1998). The Markov Property (Section 3.5). In *Reinforcement Learning: An Introduction* (1st ed.). MIT Press.
15. Farama Foundation. Cart Pole environment documentation. Gymnasium. Retrieved 2026.