A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in stochastic environments. MDPs provide a formal structure for reasoning about situations where an agent must choose actions over time, with outcomes that are partly random and partly under the agent's control. The framework captures the probabilistic nature of state transitions, the rewards or penalties tied to actions, and the influence of an agent's choices on the evolution of the system.
MDPs are a cornerstone of reinforcement learning, control theory, and operations research. They generalize Markov chains by introducing actions and rewards, allowing a decision-maker to influence state transitions rather than merely observing them. Since their formalization in the 1950s, MDPs have become one of the most widely studied models in artificial intelligence and applied mathematics, serving as the lingua franca across reinforcement learning, dynamic programming, and control theory.
The theoretical foundations of MDPs trace back to the work of Richard Bellman, who introduced the theory of dynamic programming while working at the RAND Corporation beginning in 1949. Bellman's first dynamic programming publication, "On the Theory of Dynamic Programming," appeared in 1952 in the Proceedings of the National Academy of Sciences. In 1957, Bellman published his landmark book Dynamic Programming, which laid out the principle of optimality and the recursive equations (now called Bellman equations) that underpin MDP solutions. That same year, he formally introduced the topic of Markovian decision processes and solution by iteration in policy space in a paper published in the Journal of Mathematics and Mechanics [1].
Bellman's principle of optimality states that an optimal policy has the property that, regardless of the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This recursive insight became the foundation for all dynamic programming algorithms.
Notably, Lloyd Shapley's 1953 paper on stochastic games contained value iteration as a special case, predating Bellman's explicit treatment of MDPs [6]. However, it was Bellman who formalized and popularized the MDP framework as a distinct area of study.
Ronald A. Howard extended this foundation in his 1960 monograph Dynamic Programming and Markov Processes, published by MIT Press [2]. Howard introduced the policy iteration algorithm, originally motivated by the practical problem of optimizing Sears catalogue mailings, and established much of the terminology still used in the field today. His work connected Bellman's dynamic programming approach with the theory of Markov chains, creating a unified framework for decision-making under uncertainty.
Francois d'Epenoux contributed the linear programming formulation for MDPs in 1963, providing a third exact solution method alongside value iteration and policy iteration [9]. Throughout the 1960s and 1970s, researchers in operations research refined the mathematical theory of MDPs, developing convergence proofs, exploring computational methods, and introducing modified policy iteration variants.
The connection between MDPs and reinforcement learning was solidified in the 1980s and 1990s through the work of researchers such as Andrew Barto, Richard Sutton, and Chris Watkins. Watkins' 1989 PhD thesis introduced Q-learning, a model-free method for solving MDPs that does not require knowledge of the transition probabilities [10]. Sutton and Barto's textbook Reinforcement Learning: An Introduction, first published in 1998 and updated in 2018, cemented the MDP framework as the standard formalism for reinforcement learning [4].
A Markov Decision Process is formally defined by a 5-tuple (S, A, P, R, gamma), where each component captures a different aspect of the decision problem.
| Component | Symbol | Description |
|---|---|---|
| State space | S | A finite (or countable) set of states representing all possible configurations of the environment. |
| Action space | A | A finite set of actions available to the decision-maker. In some formulations, the set of available actions depends on the current state, written as A(s). |
| Transition function | P(s' \| s, a) | The state-transition probability function. For each state s and action a, P(s' \| s, a) gives the probability of moving to state s' when action a is taken in state s. The transitions must satisfy: the sum of P(s' \| s, a) over all s' equals 1 for every (s, a) pair. |
| Reward function | R(s, a, s') | The immediate reward received after transitioning from state s to state s' via action a. Alternative formulations use R(s, a) or R(s) depending on whether the reward depends on the action and the next state. |
| Discount factor | gamma | A real number in the range [0, 1] that determines the present value of future rewards. A discount factor of 0 makes the agent short-sighted (caring only about immediate rewards), while a value close to 1 makes it far-sighted. |
Some formulations include additional elements such as an initial state s0 or a set of terminal states. When terminal states exist, the process ends upon reaching one of them. Different fields use varying notation: economics typically uses beta or gamma for the discount factor and emphasizes action/reward/value terminology, while engineering favors alpha for the discount and uses control/cost/cost-to-go terminology.
The defining characteristic of an MDP is the Markov property (also called the memoryless property). This property states that the probability of transitioning to any future state depends only on the current state and action, not on the sequence of states and actions that preceded it. Formally:
P(S(t+1) = s' | S(t) = s, A(t) = a, S(t-1), A(t-1), ..., S(0), A(0)) = P(S(t+1) = s' | S(t) = s, A(t) = a)
In other words, the current state contains all information relevant to predicting the future. The history of how the agent arrived at the current state does not matter. This assumption simplifies the decision problem enormously because the agent only needs to consider its current state when choosing an action, rather than maintaining a complete history of all past states and actions.
The Markov property is a reasonable approximation in many practical settings. When it does not hold exactly, practitioners often augment the state representation to include enough history that the resulting process is approximately Markovian. For instance, in a game where the current screen frame alone is insufficient to determine the full game state (e.g., the velocity of a moving object), stacking several recent frames into a single state representation restores the Markov property.
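As a minimal sketch of this technique, the Python snippet below keeps a rolling buffer of recent frames and uses the stacked array as the state. The stack size, frame shapes, and helper names are all invented for illustration.

```python
# Sketch: restoring an (approximately) Markovian state by stacking recent frames.
from collections import deque

import numpy as np

STACK_SIZE = 4  # illustrative choice: number of recent frames kept in the state
frames = deque(maxlen=STACK_SIZE)

def reset_stack(first_frame):
    """At episode start, fill the stack with copies of the first frame."""
    frames.clear()
    for _ in range(STACK_SIZE):
        frames.append(first_frame)
    return np.stack(list(frames))

def stacked_state(new_frame):
    """Append the newest frame; the oldest is dropped automatically."""
    frames.append(new_frame)
    return np.stack(list(frames))  # the stacked array serves as the MDP state

# Toy 2x2 "frames": velocity is implicit in the difference between
# consecutive frames, which no single frame can convey on its own.
state = reset_stack(np.zeros((2, 2)))
state = stacked_state(np.ones((2, 2)))
print(state.shape)  # (4, 2, 2)
```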
At each discrete time step t, the agent observes its current state s(t), selects an action a(t), and the environment responds by transitioning to a new state s(t+1) according to the transition probability P(s(t+1) | s(t), a(t)) and delivering a reward R(s(t), a(t), s(t+1)). This interaction produces a trajectory:
s(0), a(0), r(1), s(1), a(1), r(2), s(2), a(2), r(3), ...
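The loop below is a minimal Python sketch of this interaction for an invented two-state, two-action MDP; the transition table P, reward table R, and the uniform random policy are illustrative stand-ins, not part of any particular library.

```python
# Sketch of the agent-environment loop that generates a trajectory
# s(0), a(0), r(1), s(1), a(1), r(2), ... for a toy MDP.
import numpy as np

rng = np.random.default_rng(0)

S, A = 2, 2  # two states, two actions
# P[s, a, s'] = probability of moving to s' after taking a in s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# R[s, a, s'] = immediate reward for that transition
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

def random_policy(s):
    return rng.integers(A)  # uniform over actions, stands in for pi(a | s)

s = 0
trajectory = []
for t in range(5):
    a = random_policy(s)
    s_next = rng.choice(S, p=P[s, a])  # sample s' ~ P(. | s, a)
    r = R[s, a, s_next]                # reward for the transition
    trajectory.append((s, a, r))
    s = s_next
print(trajectory)
```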
The transition function P captures the dynamics of the environment. In a deterministic MDP, each state-action pair leads to exactly one next state. In a stochastic MDP, the same state-action pair can lead to different next states with different probabilities. The stochastic case is much more common in practical applications because real-world environments typically involve uncertainty.
The reward function R quantifies the immediate desirability of a transition. Rewards can be positive (indicating desirable outcomes), negative (indicating penalties or costs), or zero. The design of the reward function is critical because it encodes the objective that the agent should optimize.
A policy is a rule that specifies which action the agent should take in each state. It represents the agent's complete strategy for interacting with the environment. There are two main types of policies.
A deterministic policy is a function that maps each state to a single action:
pi: S -> A
For a given state s, the policy pi(s) always returns the same action. Deterministic policies are simpler to represent and implement. For finite MDPs with the standard objective of maximizing expected discounted return, there always exists an optimal policy that is deterministic.
A stochastic policy maps each state to a probability distribution over actions:
pi(a | s) = P(A(t) = a | S(t) = s)
For each state s and action a, pi(a | s) gives the probability that the agent selects action a when in state s. The probabilities must satisfy: the sum of pi(a | s) over all a in A equals 1 for every state s.
Stochastic policies are more general than deterministic policies (a deterministic policy is a special case where the probability is 1 for one action and 0 for all others). While they are not strictly necessary for standard MDPs (an optimal deterministic policy always exists), stochastic policies become important in several settings:
- During learning, where randomized action selection provides the exploration that trial-and-error methods require.
- In partially observable environments, where the best achievable memoryless policy may need to randomize.
- In multi-agent or adversarial settings (Markov games), where a predictable deterministic policy can be exploited by opponents.
A stationary policy does not change over time. The agent uses the same mapping from states to actions at every time step. For infinite-horizon discounted MDPs, the optimal policy is always stationary.
A non-stationary policy can change over time, so the mapping from states to actions depends on the current time step: pi(t)(s) may differ from pi(t')(s) for t not equal to t'. Non-stationary policies are necessary for finite-horizon MDPs, where the optimal action in a given state may depend on how many time steps remain.
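The Python sketch below shows these representations side by side for a hypothetical MDP with three states and two actions; all arrays and probabilities are invented for illustration.

```python
# Sketch of deterministic, stochastic, and non-stationary policy representations.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2

# Deterministic policy: one action per state (pi: S -> A).
pi_det = np.array([0, 1, 0])

# Stochastic policy: a distribution over actions per state (pi(a | s)).
pi_stoch = np.array([[0.7, 0.3],
                     [0.1, 0.9],
                     [0.5, 0.5]])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)  # rows must sum to 1

def act(s, stochastic=True):
    if stochastic:
        return rng.choice(n_actions, p=pi_stoch[s])  # sample a ~ pi(. | s)
    return pi_det[s]                                 # always the same action

# A non-stationary policy additionally indexes by time step:
# pi_t[t, s] gives the action at time t, as needed in finite-horizon MDPs.
horizon = 4
pi_t = rng.integers(n_actions, size=(horizon, n_states))
print(act(0), pi_t[0, 0])
```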
Value functions estimate how good it is for the agent to be in a given state (or to take a given action in a given state) under a particular policy. They are central to most MDP solution methods.
The state-value function V^pi(s) gives the expected cumulative discounted reward starting from state s and following policy pi thereafter:
V^pi(s) = E_pi [ sum from k=0 to infinity of gamma^k * R(t+k+1) | S(t) = s ]
Here, gamma is the discount factor and R(t+k+1) is the reward received at time step t+k+1. The expectation is taken over the stochasticity of both the policy and the environment transitions.
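As a small worked example, the snippet below computes a truncated version of this sum from a list of sampled rewards, both directly and via the backward recursion G(t) = R(t+1) + gamma * G(t+1); the reward values and gamma are illustrative.

```python
# Sketch: computing a discounted return sum_k gamma^k * R(t+k+1)
# from a finite list of sampled rewards.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R(t+1), R(t+2), ... (illustrative)

# Direct form of the sum, truncated at the trajectory length:
G = sum(gamma**k * r for k, r in enumerate(rewards))

# Equivalent backward recursion G(t) = R(t+1) + gamma * G(t+1):
G_rec = 0.0
for r in reversed(rewards):
    G_rec = r + gamma * G_rec

print(G, G_rec)  # both 3.349
```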
The value function satisfies a recursive self-consistency condition: the value of a state can be expressed through the immediate reward and the values of successor states. This relationship, the Bellman expectation equation, is developed in the section on Bellman equations below.
The action-value function (also called the Q-function) Q^pi(s, a) gives the expected cumulative discounted reward starting from state s, taking action a, and then following policy pi:
Q^pi(s, a) = E_pi [ sum from k=0 to infinity of gamma^k * R(t+k+1) | S(t) = s, A(t) = a ]
The Q-function is especially useful because it allows the agent to compare actions without needing a model of the environment's dynamics. If Q*(s, a) is known, the optimal action in state s is simply the one that maximizes Q*(s, a):
pi*(s) = argmax over a of Q*(s, a)
This property makes the Q-function the foundation of algorithms such as Q-learning and its deep learning extension, Deep Q-Networks (DQN).
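A minimal sketch of this extraction step on a tabular Q-function, with invented Q-values:

```python
# Sketch: extracting the greedy policy pi*(s) = argmax_a Q*(s, a)
# from a tabular Q-function.
import numpy as np

Q = np.array([[1.0, 2.5],    # Q(s=0, a)
              [0.3, 0.1],    # Q(s=1, a)
              [4.0, 4.0]])   # Q(s=2, a): ties broken toward the first action

pi_star = Q.argmax(axis=1)   # one action per state, no model of P needed
print(pi_star)               # [1 0 0]
```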
The state-value function and the action-value function are related by:
V^pi(s) = sum over a of pi(a | s) * Q^pi(s, a)
This equation states that the value of a state under a policy equals the expected Q-value, where the expectation is over the actions selected by the policy. For a deterministic policy, this simplifies to V^pi(s) = Q^pi(s, pi(s)).
Conversely:
Q^pi(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
The Q-value of taking action a in state s equals the expected immediate reward plus the discounted value of the next state.
The Bellman equations are recursive relationships that express the value of a state in terms of the values of its successor states. They form the mathematical backbone of nearly all MDP solution algorithms.
For a given policy pi, the Bellman expectation equation for the state-value function is:
V^pi(s) = sum over a of pi(a | s) * sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
This equation decomposes the value of a state into two parts: the expected immediate reward and the discounted expected value of the next state. It expresses the value of the current state as a weighted average over all possible actions and successor states.
Similarly, the Bellman expectation equation for the action-value function is:
Q^pi(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * sum over a' of pi(a' | s') * Q^pi(s', a') ]
These equations form a system of linear equations (one for each state or state-action pair) that can be solved directly for small state spaces.
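For concreteness, the sketch below solves that linear system with NumPy for an invented two-state MDP under a fixed deterministic policy; the arrays follow the P(s' | s, a) and R(s, a, s') layout from the definition section.

```python
# Sketch: exact policy evaluation by solving (I - gamma * P_pi) V = r_pi.
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
pi = np.array([0, 1])  # policy: action 0 in state 0, action 1 in state 1

n = P.shape[0]
P_pi = P[np.arange(n), pi]                       # P_pi[s, s'] = P(s' | s, pi(s))
r_pi = (P_pi * R[np.arange(n), pi]).sum(axis=1)  # expected one-step reward
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(V)
```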
The Bellman optimality equation characterizes the optimal value function. For the state-value function:
V*(s) = max over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V*(s') ]
This equation states that the value of a state under the optimal policy equals the maximum (over all actions) of the expected immediate reward plus the discounted value of the successor state.
For the action-value function:
Q*(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * max over a' of Q*(s', a') ]
Unlike the Bellman expectation equations, the Bellman optimality equations are nonlinear (due to the max operator) and cannot generally be solved in closed form. Instead, iterative methods are used.
MDPs can be classified by their time horizon, which determines the length of the decision-making process.
In a finite-horizon MDP, the agent interacts with the environment for a fixed number of time steps T (the horizon). The objective is to maximize the total expected reward over these T steps:
E [ sum from t=0 to T-1 of R(t+1) ]
Key characteristics of finite-horizon MDPs:
- The optimal policy is generally non-stationary: the best action in a state can depend on how many time steps remain.
- Discounting is optional; gamma = 1 is permissible because the sum of rewards over T steps is always finite.
- The problem can be solved exactly by backward induction, computing values from the final time step back to the first.
In an infinite-horizon MDP, there is no fixed endpoint. The agent continues making decisions indefinitely. To ensure that the total reward is finite and well-defined, a discount factor gamma strictly less than 1 is typically used:
E [ sum from t=0 to infinity of gamma^t * R(t+1) ]
Key characteristics of infinite-horizon MDPs:
- The optimal policy is stationary: a single mapping from states to actions is optimal at every time step.
- A discount factor gamma < 1 (or an average-reward objective) is needed to keep the total reward finite and well-defined.
- The classical solution methods (value iteration, policy iteration, and linear programming) apply directly.
Beyond the finite/infinite horizon distinction, MDPs are also categorized by whether the interaction naturally breaks into episodes.
An episodic task is one where the agent-environment interaction naturally breaks into separate subsequences called episodes. Each episode begins in a start state and ends when the agent reaches a terminal state. After an episode ends, the environment resets and a new episode begins independently of how the previous one ended. Playing a game of chess, navigating a maze, or a robot assembling a specific part are all examples of episodic tasks.
In episodic tasks, the return is typically the undiscounted sum of rewards within a single episode:
G(t) = R(t+1) + R(t+2) + ... + R(T)
where T is the time step at which the terminal state is reached. Because each episode has a finite length, the total reward is guaranteed to be finite even without discounting.
A continuing task has no natural endpoint. The agent-environment interaction goes on without limit, and there are no terminal states. Managing a power grid, running an ongoing trading strategy, or controlling the temperature of a chemical reactor are examples of continuing tasks.
For continuing tasks, the discounted return formulation is essential:
G(t) = R(t+1) + gamma * R(t+2) + gamma^2 * R(t+3) + ...
Without discounting (that is, with gamma = 1), the infinite sum of rewards would generally diverge, making it impossible to compare policies. An alternative objective for continuing tasks is the average reward formulation, where the agent maximizes the long-run average reward per time step rather than a discounted sum.
The distinction between episodic and continuing tasks affects which algorithms can be applied. Monte Carlo methods, which estimate value functions by averaging returns over complete episodes, require episodic tasks because they need to observe full trajectories from start to terminal state. Temporal difference (TD) methods and Q-learning, on the other hand, update estimates after every single step and work for both episodic and continuing tasks [4].
The discount factor gamma plays several important roles in MDP formulations.
Mathematical necessity. In infinite-horizon problems, discounting with gamma < 1 ensures that the infinite sum of rewards converges to a finite value. Without discounting, the total reward could be infinite, making it impossible to compare policies.
Time preference. Discounting encodes a preference for receiving rewards sooner rather than later. A reward received now is worth more than the same reward received in the distant future. This mirrors economic concepts like the time value of money and interest rates.
Modeling uncertainty. The discount factor can also be interpreted as the probability of the process continuing at each step. Under this interpretation, gamma represents the survival probability, and (1 - gamma) represents the probability that the process terminates at each time step.
Effect on behavior. The choice of gamma significantly affects the agent's learned behavior:
| Discount Factor | Agent Behavior |
|---|---|
| gamma = 0 | Completely myopic; the agent only considers immediate rewards and ignores all future consequences. |
| gamma close to 0 | Short-sighted; the agent focuses primarily on near-term rewards. |
| gamma = 0.5 | Moderate; rewards two steps ahead are worth one-quarter of immediate rewards. |
| gamma close to 1 | Far-sighted; the agent values long-term and short-term rewards nearly equally. |
| gamma = 1 | Undiscounted; only valid for finite-horizon or episodic tasks where the total reward is guaranteed to be finite. |
There are three classical exact methods for finding optimal policies in MDPs: value iteration, policy iteration, and linear programming. All three methods require full knowledge of the transition probabilities and reward function (i.e., they are model-based methods).
Value iteration, introduced by Bellman (1957), computes the optimal value function through repeated application of the Bellman optimality operator. The algorithm starts with an arbitrary initial value function V(0) and iteratively updates it:
V(k+1)(s) = max over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V(k)(s') ]
This update is applied to all states simultaneously in each iteration. The algorithm converges to the optimal value function V* as k approaches infinity. Convergence is guaranteed by the contraction mapping theorem: the Bellman optimality operator is a gamma-contraction in the max-norm, meaning each iteration brings V(k) closer to V* by at least a factor of gamma. In practice, iteration stops when the maximum change across all states falls below a small threshold epsilon:
max over s of | V(k+1)(s) - V(k)(s) | < epsilon
Once V* has been computed, the optimal policy is extracted by choosing the action that maximizes the right-hand side of the Bellman optimality equation for each state.
Properties of value iteration:
- Each iteration costs O(|S|^2 * |A|) operations, and no linear systems need to be solved.
- Convergence to V* is asymptotic; the number of iterations required grows as gamma approaches 1 or as the tolerance epsilon shrinks.
- The algorithm is simple to implement, which makes it a common first choice for small to medium MDPs.
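The sketch below implements tabular value iteration as described above, on an invented toy MDP; the tolerance and all arrays are illustrative.

```python
# Sketch of value iteration: repeated Bellman optimality backups.
import numpy as np

def value_iteration(P, R, gamma=0.9, epsilon=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new, Q.argmax(axis=1)     # V* and a greedy policy
        V = V_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```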
Policy iteration, introduced by Howard (1960), alternates between two steps [2]:
Policy evaluation: Given a fixed policy pi, compute its value function V^pi by solving the system of linear equations defined by the Bellman expectation equation. For |S| states, this involves solving a system of |S| linear equations with |S| unknowns.
Policy improvement: Construct a new policy pi' by acting greedily with respect to V^pi:
pi'(s) = argmax over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
These two steps repeat until the policy no longer changes (pi' = pi), at which point the policy is guaranteed to be optimal.
Properties of policy iteration:
- Each iteration is more expensive than a value iteration sweep (O(|S|^3 + |S|^2 * |A|), dominated by the linear solve in policy evaluation).
- The algorithm terminates after finitely many iterations: there are finitely many deterministic policies, and each improvement step yields a strictly better policy until the optimum is reached.
- In practice, convergence typically takes very few iterations (often fewer than 20).
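A minimal sketch of the full loop on the same invented toy arrays, using an exact linear solve for the evaluation step:

```python
# Sketch of policy iteration: exact evaluation, then greedy improvement.
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        idx = np.arange(n_states)
        P_pi = P[idx, pi]
        r_pi = (P_pi * R[idx, pi]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = (P * (R + gamma * V)).sum(axis=2)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V                        # policy is stable, hence optimal
        pi = pi_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
print(policy_iteration(P, R))
```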
The optimal value function of an MDP can also be found by solving a linear program (LP), as first shown by d'Epenoux (1963) [9]. The primal LP formulation is:
Minimize: sum over s of alpha(s) * V(s)
Subject to: V(s) >= sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V(s') ] for all s in S, a in A
Here, alpha(s) > 0 is a weighting over states (often chosen as the initial state distribution). The constraints ensure that V(s) is at least as large as the expected return of every action, and the minimization drives V(s) down to the tightest constraint, which corresponds to the best action.
The dual LP formulation uses occupancy measures (also called state-action frequencies). The dual LP is:
Maximize: sum over s,a of mu(s, a) * R(s, a)
Subject to: sum over a of mu(s', a) = alpha(s') + gamma * sum over s,a of mu(s, a) * P(s' | s, a) for all s' in S, and mu(s, a) >= 0 for all s, a
The variable mu(s, a) represents the (discounted) expected number of times the agent visits state s and takes action a. The optimal policy can be recovered from the occupancy measure. One notable advantage of the LP formulation is that it naturally extends to constrained MDPs, where additional linear constraints on the occupancy measure enforce side constraints on the policy.
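As an illustration, the sketch below assembles the primal LP for an invented toy MDP and solves it with SciPy's general-purpose linprog solver; the uniform choice of alpha is an assumption made for the example.

```python
# Sketch of the primal LP: minimize alpha . V subject to
# V(s) >= r(s, a) + gamma * sum_s' P(s'|s,a) V(s') for all (s, a).
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
n_states, n_actions = P.shape[0], P.shape[1]

r = (P * R).sum(axis=2)                   # r[s, a]: expected immediate reward
alpha = np.full(n_states, 1.0 / n_states)  # assumed uniform state weighting

# Rewrite each constraint as (gamma * P[s, a, :] - e_s) . V <= -r[s, a]
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_states)  # V may be negative
print(res.x)  # optimal value function V*
```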
Prioritized sweeping is an enhancement to value iteration that preferentially applies updates to states that are most likely to have changed significantly, rather than sweeping through all states uniformly. The algorithm maintains a priority queue and focuses computation where it is most needed, which can dramatically reduce the number of updates required to reach convergence in sparse or structured MDPs.
| Method | Introduced By | Time per Iteration | Number of Iterations | Total Complexity | Best Suited For |
|---|---|---|---|---|---|
| Value Iteration | Bellman (1957) | O(\|S\|^2 * \|A\|) | Depends on gamma and epsilon; can be large | Pseudo-polynomial | Small to medium MDPs; simple implementation |
| Policy Iteration | Howard (1960) | O(\|S\|^3 + \|S\|^2 * \|A\|) | Typically very few (often < 20) | Polynomial in practice | Medium MDPs where fewer iterations are preferred |
| Linear Programming | d'Epenoux (1963) | N/A (single solve) | 1 (single LP solve) | Polynomial (LP is polynomial) | Constrained MDPs; theoretical analysis |
| Modified Policy Iteration | Puterman & Shin (1978) | O(\|S\|^2 * \|A\|) | Between VI and PI | Between VI and PI | Large MDPs; practical compromise |
Decision problems based on MDPs belong to the complexity class P, meaning algorithms exist that find optimal policies in time polynomial in the size of the problem representation [8]. However, this result can be misleading in practice. The number of states in an MDP often grows exponentially with the number of state variables (the "curse of dimensionality," a term coined by Bellman himself). For example, an MDP with 20 binary state variables has 2^20 (over one million) states, even though the problem description is compact. This exponential blowup limits exact solution techniques to problems with either small state spaces or special structure that can be exploited.
To handle large or continuous state spaces, practitioners turn to approximate methods including function approximation, deep learning-based approaches (such as Deep Q-Networks), and factored MDP representations that decompose the state space into independent or weakly interacting components [5].
A Partially Observable Markov Decision Process (POMDP) generalizes the MDP framework to situations where the agent cannot directly observe the true state of the environment. Instead, the agent receives observations that provide incomplete or noisy information about the underlying state.
A POMDP is defined by a 7-tuple (S, A, P, R, Omega, O, gamma), where the first four components and the discount factor are identical to those in a standard MDP, and two new components are added:
| Component | Symbol | Description |
|---|---|---|
| State space | S | Finite set of states (as in MDP) |
| Action space | A | Finite set of actions (as in MDP) |
| Transition function | P(s' \| s, a) | State-transition probabilities (as in MDP) |
| Reward function | R(s, a, s') | Immediate reward (as in MDP) |
| Observation space | Omega | A finite set of observations the agent can receive |
| Observation function | O(o \| s', a) | The probability of observing o given that the agent took action a and arrived in state s' |
| Discount factor | gamma | Discount factor (as in MDP) |
Since the agent cannot observe the true state directly, it maintains a belief state: a probability distribution over all possible states. The belief state b(s) represents the agent's subjective probability that the true state is s. After taking action a and receiving observation o, the belief is updated using Bayes' rule:
b'(s') = [ O(o | s', a) * sum over s of P(s' | s, a) * b(s) ] / P(o | a, b)
Here, P(o | a, b) is a normalizing constant that ensures the updated belief sums to 1.
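A minimal NumPy sketch of this update for an invented two-state POMDP, where the observation table is assumed (for simplicity) to be independent of the action:

```python
# Sketch of the Bayes belief update for a toy two-state POMDP.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
O = np.array([[0.8, 0.2],                 # O[s', o]: assumed independent of a
              [0.3, 0.7]])

def belief_update(b, a, o):
    """Return b'(s') from belief b after taking action a and observing o."""
    predicted = b @ P[:, a, :]                # sum_s P(s' | s, a) * b(s)
    unnormalized = O[:, o] * predicted        # times O(o | s', a)
    return unnormalized / unnormalized.sum()  # divide by P(o | a, b)

b = np.array([0.5, 0.5])                      # uniform initial belief
print(belief_update(b, a=0, o=1))
```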
The key insight is that a POMDP can be converted to a fully observable MDP over the space of belief states (called a belief MDP). However, the belief space is continuous (even when the underlying state space is finite), which makes POMDPs substantially harder to solve than standard MDPs.
Solving POMDPs exactly is PSPACE-complete, and most practical algorithms rely on approximate methods [7][11]. Several important algorithmic advances have made POMDPs more tractable:
- Point-based value iteration methods, which approximate the value function over a finite set of sampled belief points rather than the full belief space.
- Online planning methods such as Monte Carlo tree search in belief space (e.g., POMCP), which plan from the current belief rather than computing a global policy.
- Belief compression and related dimensionality-reduction techniques that exploit structure in the reachable belief space.
POMDPs have found applications in domains where the agent operates with incomplete information:
- Robot navigation and manipulation with noisy or limited sensors.
- Spoken dialogue systems, where the user's intent is never observed directly.
- Medical diagnosis and treatment planning, where a patient's underlying condition must be inferred from tests and symptoms.
- Autonomous driving, where the intentions of other road users are hidden.
Reinforcement learning (RL) is the field of machine learning concerned with learning optimal behavior through interaction with an environment. MDPs provide the formal mathematical foundation for nearly all of reinforcement learning.
In the standard RL setting, the agent interacts with an environment that is assumed to be an MDP, but the agent does not know the transition probabilities P or the reward function R in advance. The agent must learn a good policy through trial and error, using only the states, actions, and rewards it observes during interaction.
RL methods can be broadly divided into two categories based on their relationship to the MDP model:
| Category | Description | Examples |
|---|---|---|
| Model-based RL | The agent learns an explicit model of the MDP (transition and reward functions) and uses it for planning. | Dyna, PILCO, MuZero, World Models |
| Model-free RL | The agent learns a policy or value function directly from experience without constructing an explicit model. | Q-learning, SARSA, Policy Gradient, Actor-Critic |
Key connections between MDPs and RL algorithms:
- Temporal difference methods estimate value functions using sampled versions of the Bellman expectation equation.
- Q-learning performs a sampled, incremental form of the Bellman optimality backup, converging to Q* without knowledge of P or R [10].
- Policy gradient and actor-critic methods optimize parameterized (often stochastic) policies directly against the MDP objective.
- Deep RL methods such as DQN replace value tables with neural network function approximators to handle large state spaces.
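As a concrete illustration of the model-free connection noted in the list above, the snippet below sketches the tabular Q-learning update rule; the learning rate and all numbers are illustrative.

```python
# Sketch of the tabular Q-learning update: a sampled, model-free analogue
# of the Bellman optimality backup.
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                   # learning rate and discount (illustrative)

def q_learning_update(s, a, r, s_next):
    """One step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()  # no access to P or R required
    Q[s, a] += alpha * (target - Q[s, a])

q_learning_update(s=0, a=1, r=1.0, s_next=1)
print(Q)
```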
Several landmark AI systems are built on the MDP framework:
- Deep Q-Networks (DQN), which learned to play Atari 2600 games directly from screen input.
- AlphaGo and AlphaZero, which treat Go, chess, and shogi as large two-player Markov games.
- MuZero, a model-based system that learns its own internal MDP model for planning.
Several extensions of the basic MDP framework address limitations of the standard model.
| Extension | Key Modification | Use Case |
|---|---|---|
| Continuous-state MDPs | State or action spaces are continuous rather than discrete. | Robotics, control systems, physics simulations |
| Semi-Markov Decision Processes (SMDPs) | Time between transitions follows a probability distribution rather than being fixed. | Systems where actions take variable amounts of time |
| Constrained MDPs (CMDPs) | Auxiliary cost functions and constraints are added to the standard objective. | Safety-critical systems, resource-limited settings |
| Multi-agent MDPs (Markov Games) | Multiple agents interact, each with their own actions and rewards. | Competitive and cooperative multi-agent scenarios |
| Factored MDPs | State is decomposed into multiple variables with exploitable structure. | High-dimensional problems with sparse variable interactions |
| Average-reward MDPs | Agent maximizes long-run average reward per step instead of discounted sum. | Indefinite operation with no natural discounting |
| Inverse RL / Inverse MDPs | The reward function is inferred from observed behavior rather than given. | Learning from demonstrations, apprenticeship learning |
| Continuous-time MDPs | Dynamics defined by ordinary differential equations; summation replaced by integration. | Finance (Hamilton-Jacobi-Bellman equation), continuous control |
Continuous-state MDPs. Many real-world problems have continuous state or action spaces. Linear-quadratic regulators (LQR) and Gaussian process models are common approaches for continuous MDPs.
Semi-Markov Decision Processes (SMDPs). In an SMDP, the time between transitions is not fixed but follows a probability distribution. This extension is useful for modeling systems where actions take variable amounts of time.
Constrained MDPs (CMDPs). These add auxiliary cost functions and constraints to the standard MDP objective [12]. The agent must maximize reward subject to constraints on expected costs. Unlike standard MDPs, CMDPs can only be solved via linear programming (not dynamic programming), and the optimal policy may depend on the starting state.
Multi-agent MDPs. When multiple agents interact in the same environment, the framework extends to stochastic games (also called Markov games). Each agent has its own action set and reward function, and the transition depends on the joint action of all agents.
Factored MDPs. When the state is composed of multiple variables, factored MDPs exploit the structure to represent transition and reward functions compactly, enabling more efficient computation. Recent research has shown that factored representations can help overcome the curse of dimensionality by decomposing high-dimensional MDPs into smaller, independently evolving components.
Inverse Reinforcement Learning. Rather than solving an MDP given a reward function, inverse RL infers the reward function from observed expert behavior. This approach is useful when the reward function is difficult to specify manually but demonstrations of good behavior are available.
Markov Decision Processes have been applied across a wide range of domains. The following table summarizes key application areas.
| Domain | Application Examples | Why MDPs Are Suitable |
|---|---|---|
| Reinforcement learning | Training game-playing agents, learning control policies, Q-learning, policy gradient methods | MDPs provide the formal foundation; RL algorithms solve MDPs when the model is unknown. |
| Robotics | Path planning, navigation, manipulation, grasping, locomotion | Robots operate in stochastic environments where actions have uncertain outcomes. |
| Operations research | Inventory management, supply chain optimization, queuing systems, maintenance scheduling | MDPs model sequential resource allocation decisions under demand uncertainty. |
| Healthcare | Treatment planning, clinical trial design, chronic disease management, imaging screening optimization | Patient health evolves stochastically, and treatment decisions have long-term consequences. |
| Finance | Portfolio optimization, option pricing, algorithmic trading, risk management | Financial markets involve sequential decisions under uncertainty with discounted future returns. |
| Telecommunications | Network routing, channel allocation, congestion control, resource scheduling | Network states change probabilistically and routing decisions affect future network performance. |
| Natural language processing | Dialogue systems, text generation, machine translation | Conversations are sequential decision problems where the next response depends on the current context. |
| Autonomous vehicles | Lane changing, intersection management, adaptive cruise control | Driving involves sequential decisions under uncertainty about other drivers' behavior. |
| Agriculture | Crop management, irrigation scheduling, pest control | Agricultural decisions depend on stochastic weather and crop growth dynamics. |
| Energy | Power grid management, demand response, battery storage optimization | Energy systems require sequential decisions under uncertain demand and renewable generation. |
In robotics, MDPs and their extensions are used for motion planning, task planning, and control. A robot navigating an environment can model its position as the state, movement commands as actions, and sensor noise as transition uncertainty. POMDPs are particularly relevant for robotic applications because sensors are typically noisy and the robot's state is not perfectly known. Research on robotic grasping, for instance, uses MDP formulations to plan sequences of grasp attempts that maximize the probability of successfully picking up an object. Socially assistive robots have also adopted MDP-based frameworks for modeling emotional behavior and interaction management, reducing computational complexity while improving response quality.
Operations research was one of the earliest fields to adopt MDPs. Classic problems include inventory management (deciding how much stock to order given uncertain demand), machine maintenance (deciding when to repair or replace equipment), and queuing (managing service systems with random arrivals). The MDP framework naturally captures the sequential, stochastic nature of these problems and the trade-off between immediate costs and long-term performance.
MDPs have been applied to clinical decision-making, where a physician must choose treatments for a patient whose health state evolves over time. For example, MDP models have been used to optimize screening schedules for cancer detection, determine optimal treatment sequences for HIV, and manage chronic conditions such as diabetes. The patient's health state serves as the MDP state, available treatments are the actions, health outcomes define the transitions, and patient quality of life (or survival probability) serves as the reward. During the COVID-19 pandemic, reinforcement learning methods built on MDP frameworks played a role in optimizing treatment decisions under heightened uncertainty.
In quantitative finance, MDPs model portfolio management, where the state represents the current portfolio composition and market conditions, actions correspond to trades, and the reward reflects portfolio returns. Dynamic asset allocation, option exercise decisions, and credit risk management have all been formulated as MDPs. The discount factor in financial MDPs often corresponds directly to the time value of money.
Game playing has served as one of the most visible testbeds for MDP-based methods. Board games like Go, chess, and shogi can be modeled as large MDPs (or two-player Markov games), and deep reinforcement learning systems like AlphaGo and AlphaZero have achieved superhuman performance in these domains. Video games, including the Atari 2600 suite, have similarly been modeled as MDPs and solved using Deep Q-Networks and policy gradient methods.
MDPs rely on several assumptions that may not always hold in practice.
Markov property. The future depends only on the current state and action, not on history. In many real-world problems, the state representation may not capture all relevant information, violating this assumption.
Full observability. The standard MDP assumes the agent can observe the true state of the environment at all times. When this does not hold, POMDPs or other extensions are needed.
Stationarity. The transition probabilities and reward function do not change over time. In non-stationary environments, the MDP model must be periodically updated or replaced with a non-stationary formulation.
Known model. Classical solution methods (value iteration, policy iteration, LP) require complete knowledge of P and R. When the model is unknown, reinforcement learning methods are used instead.
Curse of dimensionality. The computational cost of solving an MDP grows polynomially with the number of states and actions, but the number of states often grows exponentially with the number of state variables. This makes exact solutions intractable for high-dimensional problems and motivates the use of function approximation techniques, neural network-based approaches, and factored representations.
Reward specification. Designing the reward function correctly is both critical and challenging. A poorly designed reward function can lead to unintended behavior, a problem sometimes called reward hacking. Inverse reinforcement learning offers one approach to this challenge by learning the reward from demonstrations.
Imagine you are playing a board game where you roll a die and move your piece around the board. At every square you land on, you have to make a choice: go left or go right. Some choices lead to squares where you get candy (that is your reward), and some lead to squares where you lose candy. The tricky part is that when you roll the die, you cannot be sure exactly which square you will land on next.
A Markov Decision Process is like a set of rules that helps you figure out the very best choice to make on every single square, so that over the whole game you collect as much candy as possible. The really important thing about these rules is that they only care about which square you are on right now, not about all the squares you visited before. It does not matter if you got to your current square by going left three times or right twice; the best next move is the same either way.
Scientists and engineers use MDPs to help robots decide where to walk, to help doctors pick the best treatment for patients, and to teach computers how to play video games really well. Any time someone needs to make a series of choices where the results are a little bit unpredictable, an MDP can help find the smartest strategy.