A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in stochastic environments. MDPs provide a formal structure for reasoning about situations where an agent must choose actions over time, with outcomes that are partly random and partly under the agent's control. The framework captures the probabilistic nature of state transitions, the rewards or penalties tied to actions, and the influence of an agent's choices on the evolution of the system.
MDPs are a cornerstone of reinforcement learning, control theory, and operations research. They generalize Markov chains by introducing actions and rewards, allowing a decision-maker to influence state transitions rather than merely observing them. Since their formalization in the 1950s, MDPs have become one of the most widely studied models in artificial intelligence and applied mathematics, serving as the lingua franca across reinforcement learning, dynamic programming, and control theory.
The theoretical foundations of MDPs trace back to the work of Richard Bellman, who introduced the theory of dynamic programming while working at the RAND Corporation beginning in 1949. Bellman's first dynamic programming publication, "On the Theory of Dynamic Programming," appeared in 1952 in the Proceedings of the National Academy of Sciences. In 1957, Bellman published his landmark book Dynamic Programming, which laid out the principle of optimality and the recursive equations (now called Bellman equations) that underpin MDP solutions. That same year, he formally introduced the topic of Markovian decision processes and solution by iteration in policy space in a paper published in the Journal of Mathematics and Mechanics [1].
Bellman's principle of optimality states that an optimal policy has the property that, regardless of the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This recursive insight became the foundation for all dynamic programming algorithms.
Notably, Lloyd Shapley's 1953 paper on stochastic games contained value iteration as a special case, predating Bellman's explicit treatment of MDPs [6]. However, it was Bellman who formalized and popularized the MDP framework as a distinct area of study.
Ronald A. Howard extended this foundation in his 1960 monograph Dynamic Programming and Markov Processes, published by MIT Press [2]. Howard introduced the policy iteration algorithm, originally motivated by the practical problem of optimizing Sears catalogue mailings, and established much of the terminology still used in the field today. His work connected Bellman's dynamic programming approach with the theory of Markov chains, creating a unified framework for decision-making under uncertainty.
Francois d'Epenoux contributed the linear programming formulation for MDPs in 1963, providing a third exact solution method alongside value iteration and policy iteration [9]. Throughout the 1960s and 1970s, researchers in operations research refined the mathematical theory of MDPs, developing convergence proofs, exploring computational methods, and introducing modified policy iteration variants.
The connection between MDPs and reinforcement learning was solidified in the 1980s and 1990s through the work of researchers such as Andrew Barto, Richard Sutton, and Chris Watkins. Watkins' 1989 PhD thesis introduced Q-learning, a model-free method for solving MDPs that does not require knowledge of the transition probabilities [10]. Sutton and Barto's textbook Reinforcement Learning: An Introduction, first published in 1998 and updated in 2018, cemented the MDP framework as the standard formalism for reinforcement learning [4].
A Markov Decision Process is formally defined by a 5-tuple (S, A, P, R, gamma), where each component captures a different aspect of the decision problem.
| Component | Symbol | Description |
|---|---|---|
| State space | S | A finite (or countable) set of states representing all possible configurations of the environment. |
| Action space | A | A finite set of actions available to the decision-maker. In some formulations, the set of available actions depends on the current state, written as A(s). |
| Transition function | P(s' \| s, a) | The state-transition probability function. For each state s and action a, P(s' \| s, a) gives the probability of moving to state s' when action a is taken in state s. The transitions must satisfy: the sum of P(s' \| s, a) over all s' equals 1 for every (s, a) pair. |
| Reward function | R(s, a, s') | The immediate reward received after transitioning from state s to state s' via action a. Alternative formulations use R(s, a) or R(s) depending on whether the reward depends on the action and the next state. |
| Discount factor | gamma | A real number in the range [0, 1] that determines the present value of future rewards. A discount factor of 0 makes the agent short-sighted (caring only about immediate rewards), while a value close to 1 makes it far-sighted. |
Some formulations include additional elements such as an initial state s0 or a set of terminal states. When terminal states exist, the process ends upon reaching one of them. Different fields use varying notation: economics typically uses beta or gamma for the discount factor and emphasizes action/reward/value terminology, while engineering favors alpha for the discount and uses control/cost/cost-to-go terminology.
The defining characteristic of an MDP is the Markov property (also called the memoryless property). This property states that the probability of transitioning to any future state depends only on the current state and action, not on the sequence of states and actions that preceded it. Formally:
P(S(t+1) = s' | S(t) = s, A(t) = a, S(t-1), A(t-1), ..., S(0), A(0)) = P(S(t+1) = s' | S(t) = s, A(t) = a)
In other words, the current state contains all information relevant to predicting the future. The history of how the agent arrived at the current state does not matter. This assumption simplifies the decision problem enormously because the agent only needs to consider its current state when choosing an action, rather than maintaining a complete history of all past states and actions.
The Markov property is a reasonable approximation in many practical settings. When it does not hold exactly, practitioners often augment the state representation to include enough history that the resulting process is approximately Markovian. For instance, in a game where the current screen frame alone is insufficient to determine the full game state (e.g., the velocity of a moving object), stacking several recent frames into a single state representation restores the Markov property.
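As a minimal sketch of this technique, the Python snippet below keeps a rolling buffer of recent frames and uses the stacked array as the state. The stack size, frame shapes, and helper names are all invented for illustration.

```python
# Sketch: restoring an (approximately) Markovian state by stacking recent frames.
from collections import deque

import numpy as np

STACK_SIZE = 4  # illustrative choice: number of recent frames kept in the state
frames = deque(maxlen=STACK_SIZE)

def reset_stack(first_frame):
    """At episode start, fill the stack with copies of the first frame."""
    frames.clear()
    for _ in range(STACK_SIZE):
        frames.append(first_frame)
    return np.stack(list(frames))

def stacked_state(new_frame):
    """Append the newest frame; the oldest is dropped automatically."""
    frames.append(new_frame)
    return np.stack(list(frames))  # the stacked array serves as the MDP state

# Toy 2x2 "frames": velocity is implicit in the difference between
# consecutive frames, which no single frame can convey on its own.
state = reset_stack(np.zeros((2, 2)))
state = stacked_state(np.ones((2, 2)))
print(state.shape)  # (4, 2, 2)
```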
At each discrete time step t, the agent observes its current state s(t), selects an action a(t), and the environment responds by transitioning to a new state s(t+1) according to the transition probability P(s(t+1) | s(t), a(t)) and delivering a reward R(s(t), a(t), s(t+1)). This interaction produces a trajectory:
s(0), a(0), r(1), s(1), a(1), r(2), s(2), a(2), r(3), ...
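The loop below is a minimal Python sketch of this interaction for an invented two-state, two-action MDP; the transition table P, reward table R, and the uniform random policy are illustrative stand-ins, not part of any particular library.

```python
# Sketch of the agent-environment loop that generates a trajectory
# s(0), a(0), r(1), s(1), a(1), r(2), ... for a toy MDP.
import numpy as np

rng = np.random.default_rng(0)

S, A = 2, 2  # two states, two actions
# P[s, a, s'] = probability of moving to s' after taking a in s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# R[s, a, s'] = immediate reward for that transition
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

def random_policy(s):
    return rng.integers(A)  # uniform over actions, stands in for pi(a | s)

s = 0
trajectory = []
for t in range(5):
    a = random_policy(s)
    s_next = rng.choice(S, p=P[s, a])  # sample s' ~ P(. | s, a)
    r = R[s, a, s_next]                # reward for the transition
    trajectory.append((s, a, r))
    s = s_next
print(trajectory)
```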
The transition function P captures the dynamics of the environment. In a deterministic MDP, each state-action pair leads to exactly one next state. In a stochastic MDP, the same state-action pair can lead to different next states with different probabilities. The stochastic case is much more common in practical applications because real-world environments typically involve uncertainty.
The reward function R quantifies the immediate desirability of a transition. Rewards can be positive (indicating desirable outcomes), negative (indicating penalties or costs), or zero. The design of the reward function is critical because it encodes the objective that the agent should optimize.
A policy is a rule that specifies which action the agent should take in each state. It represents the agent's complete strategy for interacting with the environment. There are two main types of policies.
A deterministic policy is a function that maps each state to a single action:
pi: S -> A
For a given state s, the policy pi(s) always returns the same action. Deterministic policies are simpler to represent and implement. For finite MDPs with the standard objective of maximizing expected discounted return, there always exists an optimal policy that is deterministic.
A stochastic policy maps each state to a probability distribution over actions:
pi(a | s) = P(A(t) = a | S(t) = s)
For each state s and action a, pi(a | s) gives the probability that the agent selects action a when in state s. The probabilities must satisfy: the sum of pi(a | s) over all a in A equals 1 for every state s.
Stochastic policies are more general than deterministic policies (a deterministic policy is a special case where the probability is 1 for one action and 0 for all others). While they are not strictly necessary for standard MDPs (an optimal deterministic policy always exists), stochastic policies become important in several settings:
- During learning, where randomized action selection provides the exploration that trial-and-error methods require.
- In partially observable environments, where the best achievable memoryless policy may need to randomize.
- In multi-agent or adversarial settings (Markov games), where a predictable deterministic policy can be exploited by opponents.
A stationary policy does not change over time. The agent uses the same mapping from states to actions at every time step. For infinite-horizon discounted MDPs, the optimal policy is always stationary.
A non-stationary policy can change over time, so the mapping from states to actions depends on the current time step: pi(t)(s) may differ from pi(t')(s) for t not equal to t'. Non-stationary policies are necessary for finite-horizon MDPs, where the optimal action in a given state may depend on how many time steps remain.
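The Python sketch below shows these representations side by side for a hypothetical MDP with three states and two actions; all arrays and probabilities are invented for illustration.

```python
# Sketch of deterministic, stochastic, and non-stationary policy representations.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2

# Deterministic policy: one action per state (pi: S -> A).
pi_det = np.array([0, 1, 0])

# Stochastic policy: a distribution over actions per state (pi(a | s)).
pi_stoch = np.array([[0.7, 0.3],
                     [0.1, 0.9],
                     [0.5, 0.5]])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)  # rows must sum to 1

def act(s, stochastic=True):
    if stochastic:
        return rng.choice(n_actions, p=pi_stoch[s])  # sample a ~ pi(. | s)
    return pi_det[s]                                 # always the same action

# A non-stationary policy additionally indexes by time step:
# pi_t[t, s] gives the action at time t, as needed in finite-horizon MDPs.
horizon = 4
pi_t = rng.integers(n_actions, size=(horizon, n_states))
print(act(0), pi_t[0, 0])
```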
Value functions estimate how good it is for the agent to be in a given state (or to take a given action in a given state) under a particular policy. They are central to most MDP solution methods.
The state-value function V^pi(s) gives the expected cumulative discounted reward starting from state s and following policy pi thereafter:
V^pi(s) = E_pi [ sum from k=0 to infinity of gamma^k * R(t+k+1) | S(t) = s ]
Here, gamma is the discount factor and R(t+k+1) is the reward received at time step t+k+1. The expectation is taken over the stochasticity of both the policy and the environment transitions.
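As a small worked example, the snippet below computes a truncated version of this sum from a list of sampled rewards, both directly and via the backward recursion G(t) = R(t+1) + gamma * G(t+1); the reward values and gamma are illustrative.

```python
# Sketch: computing a discounted return sum_k gamma^k * R(t+k+1)
# from a finite list of sampled rewards.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R(t+1), R(t+2), ... (illustrative)

# Direct form of the sum, truncated at the trajectory length:
G = sum(gamma**k * r for k, r in enumerate(rewards))

# Equivalent backward recursion G(t) = R(t+1) + gamma * G(t+1):
G_rec = 0.0
for r in reversed(rewards):
    G_rec = r + gamma * G_rec

print(G, G_rec)  # both 3.349
```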
The value function satisfies a recursive self-consistency condition: the value of a state can be expressed through the immediate reward and the values of successor states. This relationship, the Bellman expectation equation, is developed in the section on Bellman equations below.
The action-value function (also called the Q-function) Q^pi(s, a) gives the expected cumulative discounted reward starting from state s, taking action a, and then following policy pi:
Q^pi(s, a) = E_pi [ sum from k=0 to infinity of gamma^k * R(t+k+1) | S(t) = s, A(t) = a ]
The Q-function is especially useful because it allows the agent to compare actions without needing a model of the environment's dynamics. If Q*(s, a) is known, the optimal action in state s is simply the one that maximizes Q*(s, a):
pi*(s) = argmax over a of Q*(s, a)
This property makes the Q-function the foundation of algorithms such as Q-learning and its deep learning extension, Deep Q-Networks (DQN).
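A minimal sketch of this extraction step on a tabular Q-function, with invented Q-values:

```python
# Sketch: extracting the greedy policy pi*(s) = argmax_a Q*(s, a)
# from a tabular Q-function.
import numpy as np

Q = np.array([[1.0, 2.5],    # Q(s=0, a)
              [0.3, 0.1],    # Q(s=1, a)
              [4.0, 4.0]])   # Q(s=2, a): ties broken toward the first action

pi_star = Q.argmax(axis=1)   # one action per state, no model of P needed
print(pi_star)               # [1 0 0]
```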
The state-value function and the action-value function are related by:
V^pi(s) = sum over a of pi(a | s) * Q^pi(s, a)
This equation states that the value of a state under a policy equals the expected Q-value, where the expectation is over the actions selected by the policy. For a deterministic policy, this simplifies to V^pi(s) = Q^pi(s, pi(s)).
Conversely:
Q^pi(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
The Q-value of taking action a in state s equals the expected immediate reward plus the discounted value of the next state.
The Bellman equations are recursive relationships that express the value of a state in terms of the values of its successor states. They form the mathematical backbone of nearly all MDP solution algorithms.
For a given policy pi, the Bellman expectation equation for the state-value function is:
V^pi(s) = sum over a of pi(a | s) * sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
This equation decomposes the value of a state into two parts: the expected immediate reward and the discounted expected value of the next state. It expresses the value of the current state as a weighted average over all possible actions and successor states.
Similarly, the Bellman expectation equation for the action-value function is:
Q^pi(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * sum over a' of pi(a' | s') * Q^pi(s', a') ]
These equations form a system of linear equations (one for each state or state-action pair) that can be solved directly for small state spaces.
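For concreteness, the sketch below solves that linear system with NumPy for an invented two-state MDP under a fixed deterministic policy; the arrays follow the P(s' | s, a) and R(s, a, s') layout from the definition section.

```python
# Sketch: exact policy evaluation by solving (I - gamma * P_pi) V = r_pi.
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
pi = np.array([0, 1])  # policy: action 0 in state 0, action 1 in state 1

n = P.shape[0]
P_pi = P[np.arange(n), pi]                       # P_pi[s, s'] = P(s' | s, pi(s))
r_pi = (P_pi * R[np.arange(n), pi]).sum(axis=1)  # expected one-step reward
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(V)
```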
The Bellman optimality equation characterizes the optimal value function. For the state-value function:
V*(s) = max over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V*(s') ]
This equation states that the value of a state under the optimal policy equals the maximum (over all actions) of the expected immediate reward plus the discounted value of the successor state.
For the action-value function:
Q*(s, a) = sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * max over a' of Q*(s', a') ]
Unlike the Bellman expectation equations, the Bellman optimality equations are nonlinear (due to the max operator) and cannot generally be solved in closed form. Instead, iterative methods are used.
MDPs can be classified by their time horizon, which determines the length of the decision-making process.
In a finite-horizon MDP, the agent interacts with the environment for a fixed number of time steps T (the horizon). The objective is to maximize the total expected reward over these T steps:
E [ sum from t=0 to T-1 of R(t+1) ]
Key characteristics of finite-horizon MDPs:
- The optimal policy is generally non-stationary: the best action in a state can depend on how many time steps remain.
- Discounting is optional; gamma = 1 is permissible because the sum of rewards over T steps is always finite.
- The problem can be solved exactly by backward induction, computing values from the final time step back to the first.
In an infinite-horizon MDP, there is no fixed endpoint. The agent continues making decisions indefinitely. To ensure that the total reward is finite and well-defined, a discount factor gamma strictly less than 1 is typically used:
E [ sum from t=0 to infinity of gamma^t * R(t+1) ]
Key characteristics of infinite-horizon MDPs:
- The optimal policy is stationary: a single mapping from states to actions is optimal at every time step.
- A discount factor gamma < 1 (or an average-reward objective) is needed to keep the total reward finite and well-defined.
- The classical solution methods (value iteration, policy iteration, and linear programming) apply directly.
Beyond the finite/infinite horizon distinction, MDPs are also categorized by whether the interaction naturally breaks into episodes.
An episodic task is one where the agent-environment interaction naturally breaks into separate subsequences called episodes. Each episode begins in a start state and ends when the agent reaches a terminal state. After an episode ends, the environment resets and a new episode begins independently of how the previous one ended. Playing a game of chess, navigating a maze, or a robot assembling a specific part are all examples of episodic tasks.
In episodic tasks, the return is typically the undiscounted sum of rewards within a single episode:
G(t) = R(t+1) + R(t+2) + ... + R(T)
where T is the time step at which the terminal state is reached. Because each episode has a finite length, the total reward is guaranteed to be finite even without discounting.
A continuing task has no natural endpoint. The agent-environment interaction goes on without limit, and there are no terminal states. Managing a power grid, running an ongoing trading strategy, or controlling the temperature of a chemical reactor are examples of continuing tasks.
For continuing tasks, the discounted return formulation is essential:
G(t) = R(t+1) + gamma * R(t+2) + gamma^2 * R(t+3) + ...
Without discounting (that is, with gamma = 1), the infinite sum of rewards would generally diverge, making it impossible to compare policies. An alternative objective for continuing tasks is the average reward formulation, where the agent maximizes the long-run average reward per time step rather than a discounted sum.
The distinction between episodic and continuing tasks affects which algorithms can be applied. Monte Carlo methods, which estimate value functions by averaging returns over complete episodes, require episodic tasks because they need to observe full trajectories from start to terminal state. Temporal difference (TD) methods and Q-learning, on the other hand, update estimates after every single step and work for both episodic and continuing tasks [4].
The discount factor gamma plays several important roles in MDP formulations.
Mathematical necessity. In infinite-horizon problems, discounting with gamma < 1 ensures that the infinite sum of rewards converges to a finite value. Without discounting, the total reward could be infinite, making it impossible to compare policies.
Time preference. Discounting encodes a preference for receiving rewards sooner rather than later. A reward received now is worth more than the same reward received in the distant future. This mirrors economic concepts like the time value of money and interest rates.
Modeling uncertainty. The discount factor can also be interpreted as the probability of the process continuing at each step. Under this interpretation, gamma represents the survival probability, and (1 - gamma) represents the probability that the process terminates at each time step.
Effect on behavior. The choice of gamma significantly affects the agent's learned behavior:
| Discount Factor | Agent Behavior |
|---|---|
| gamma = 0 | Completely myopic; the agent only considers immediate rewards and ignores all future consequences. |
| gamma close to 0 | Short-sighted; the agent focuses primarily on near-term rewards. |
| gamma = 0.5 | Moderate; rewards two steps ahead are worth one-quarter of immediate rewards. |
| gamma close to 1 | Far-sighted; the agent values long-term and short-term rewards nearly equally. |
| gamma = 1 | Undiscounted; only valid for finite-horizon or episodic tasks where the total reward is guaranteed to be finite. |
There are three classical exact methods for finding optimal policies in MDPs: value iteration, policy iteration, and linear programming. All three methods require full knowledge of the transition probabilities and reward function (i.e., they are model-based methods).
Value iteration, introduced by Bellman (1957), computes the optimal value function through repeated application of the Bellman optimality operator. The algorithm starts with an arbitrary initial value function V(0) and iteratively updates it:
V(k+1)(s) = max over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V(k)(s') ]
This update is applied to all states simultaneously in each iteration. The algorithm converges to the optimal value function V* as k approaches infinity. Convergence is guaranteed by the contraction mapping theorem: the Bellman optimality operator is a gamma-contraction in the max-norm, meaning each iteration brings V(k) closer to V* by at least a factor of gamma. In practice, iteration stops when the maximum change across all states falls below a small threshold epsilon:
max over s of | V(k+1)(s) - V(k)(s) | < epsilon
Once V* has been computed, the optimal policy is extracted by choosing the action that maximizes the right-hand side of the Bellman optimality equation for each state.
Properties of value iteration:
- Each iteration costs O(|S|^2 * |A|) operations, and no linear systems need to be solved.
- Convergence to V* is asymptotic; the number of iterations required grows as gamma approaches 1 or as the tolerance epsilon shrinks.
- The algorithm is simple to implement, which makes it a common first choice for small to medium MDPs.
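The sketch below implements tabular value iteration as described above, on an invented toy MDP; the tolerance and all arrays are illustrative.

```python
# Sketch of value iteration: repeated Bellman optimality backups.
import numpy as np

def value_iteration(P, R, gamma=0.9, epsilon=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new, Q.argmax(axis=1)     # V* and a greedy policy
        V = V_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```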
Policy iteration, introduced by Howard (1960), alternates between two steps [2]:
Policy evaluation: Given a fixed policy pi, compute its value function V^pi by solving the system of linear equations defined by the Bellman expectation equation. For |S| states, this involves solving a system of |S| linear equations with |S| unknowns.
Policy improvement: Construct a new policy pi' by acting greedily with respect to V^pi:
pi'(s) = argmax over a of sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V^pi(s') ]
These two steps repeat until the policy no longer changes (pi' = pi), at which point the policy is guaranteed to be optimal.
Properties of policy iteration:
- Each iteration is more expensive than a value iteration sweep (O(|S|^3 + |S|^2 * |A|), dominated by the linear solve in policy evaluation).
- The algorithm terminates after finitely many iterations: there are finitely many deterministic policies, and each improvement step yields a strictly better policy until the optimum is reached.
- In practice, convergence typically takes very few iterations (often fewer than 20).
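A minimal sketch of the full loop on the same invented toy arrays, using an exact linear solve for the evaluation step:

```python
# Sketch of policy iteration: exact evaluation, then greedy improvement.
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        idx = np.arange(n_states)
        P_pi = P[idx, pi]
        r_pi = (P_pi * R[idx, pi]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = (P * (R + gamma * V)).sum(axis=2)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V                        # policy is stable, hence optimal
        pi = pi_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
print(policy_iteration(P, R))
```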
The optimal value function of an MDP can also be found by solving a linear program (LP), as first shown by d'Epenoux (1963) [9]. The primal LP formulation is:
Minimize: sum over s of alpha(s) * V(s)
Subject to: V(s) >= sum over s' of P(s' | s, a) * [ R(s, a, s') + gamma * V(s') ] for all s in S, a in A
Here, alpha(s) > 0 is a weighting over states (often chosen as the initial state distribution). The constraints ensure that V(s) is at least as large as the expected return of every action, and the minimization drives V(s) down to the tightest constraint, which corresponds to the best action.
The dual LP formulation uses occupancy measures (also called state-action frequencies). The dual LP is:
Maximize: sum over s,a of mu(s, a) * R(s, a)
Subject to: sum over a of mu(s', a) = alpha(s') + gamma * sum over s,a of mu(s, a) * P(s' | s, a) for all s' in S, and mu(s, a) >= 0 for all s, a
The variable mu(s, a) represents the (discounted) expected number of times the agent visits state s and takes action a. The optimal policy can be recovered from the occupancy measure. One notable advantage of the LP formulation is that it naturally extends to constrained MDPs, where additional linear constraints on the occupancy measure enforce side constraints on the policy.
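As an illustration, the sketch below assembles the primal LP for an invented toy MDP and solves it with SciPy's general-purpose linprog solver; the uniform choice of alpha is an assumption made for the example.

```python
# Sketch of the primal LP: minimize alpha . V subject to
# V(s) >= r(s, a) + gamma * sum_s' P(s'|s,a) V(s') for all (s, a).
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],   # R[s, a, s'] (illustrative)
              [[0.0, 1.0], [1.0, 0.0]]])
n_states, n_actions = P.shape[0], P.shape[1]

r = (P * R).sum(axis=2)                   # r[s, a]: expected immediate reward
alpha = np.full(n_states, 1.0 / n_states)  # assumed uniform state weighting

# Rewrite each constraint as (gamma * P[s, a, :] - e_s) . V <= -r[s, a]
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_states)  # V may be negative
print(res.x)  # optimal value function V*
```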
Prioritized sweeping is an enhancement to value iteration that preferentially applies updates to states that are most likely to have changed significantly, rather than sweeping through all states uniformly. The algorithm maintains a priority queue and focuses computation where it is most needed, which can dramatically reduce the number of updates required to reach convergence in sparse or structured MDPs.
| Method | Introduced By | Time per Iteration | Number of Iterations | Total Complexity | Best Suited For |
|---|---|---|---|---|---|
| Value Iteration | Bellman (1957) | O(\|S\|^2 * \|A\|) | Depends on gamma and epsilon; can be large | Pseudo-polynomial | Small to medium MDPs; simple implementation |
| Policy Iteration | Howard (1960) | O(\|S\|^3 + \|S\|^2 * \|A\|) | Typically very few (often < 20) | Polynomial in practice | Medium MDPs where fewer iterations are preferred |
| Linear Programming | d'Epenoux (1963) | N/A (single solve) | 1 (single LP solve) | Polynomial (LP is polynomial) | Constrained MDPs; theoretical analysis |
| Modified Policy Iteration | Puterman & Shin (1978) | O(\|S\|^2 * \|A\|) | Between VI and PI | Between VI and PI | Large MDPs; practical compromise |
Decision problems based on MDPs belong to the complexity class P, meaning algorithms exist that find optimal policies in time polynomial in the size of the problem representation [8]. However, this result can be misleading in practice. The number of states in an MDP often grows exponentially with the number of state variables (the "curse of dimensionality," a term coined by Bellman himself). For example, an MDP with 20 binary state variables has 2^20 (over one million) states, even though the problem description is compact. This exponential blowup limits exact solution techniques to problems with either small state spaces or special structure that can be exploited.
To handle large or continuous state spaces, practitioners turn to approximate methods including function approximation, deep learning-based approaches (such as Deep Q-Networks), and factored MDP representations that decompose the state space into independent or weakly interacting components [5].
A Partially Observable Markov Decision Process (POMDP) generalizes the MDP framework to situations where the agent cannot directly observe the true state of the environment. Instead, the agent receives observations that provide incomplete or noisy information about the underlying state.
A POMDP is defined by a 7-tuple (S, A, P, R, Omega, O, gamma), where the first four components and the discount factor are identical to those in a standard MDP, and two new components are added:
| Component | Symbol | Description |
|---|---|---|
| State space | S | Finite set of states (as in MDP) |
| Action space | A | Finite set of actions (as in MDP) |
| Transition function | P(s' \| s, a) | State-transition probabilities (as in MDP) |
| Reward function | R(s, a, s') | Immediate reward (as in MDP) |
| Observation space | Omega | A finite set of observations the agent can receive |
| Observation function | O(o \| s', a) | The probability of observing o given that the agent took action a and arrived in state s' |
| Discount factor | gamma | Discount factor (as in MDP) |
Since the agent cannot observe the true state directly, it maintains a belief state: a probability distribution over all possible states. The belief state b(s) represents the agent's subjective probability that the true state is s. After taking action a and receiving observation o, the belief is updated using Bayes' rule:
b'(s') = [ O(o | s', a) * sum over s of P(s' | s, a) * b(s) ] / P(o | a, b)
Here, P(o | a, b) is a normalizing constant that ensures the updated belief sums to 1.
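A minimal NumPy sketch of this update for an invented two-state POMDP, where the observation table is assumed (for simplicity) to be independent of the action:

```python
# Sketch of the Bayes belief update for a toy two-state POMDP.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.1, 0.9]]])
O = np.array([[0.8, 0.2],                 # O[s', o]: assumed independent of a
              [0.3, 0.7]])

def belief_update(b, a, o):
    """Return b'(s') from belief b after taking action a and observing o."""
    predicted = b @ P[:, a, :]                # sum_s P(s' | s, a) * b(s)
    unnormalized = O[:, o] * predicted        # times O(o | s', a)
    return unnormalized / unnormalized.sum()  # divide by P(o | a, b)

b = np.array([0.5, 0.5])                      # uniform initial belief
print(belief_update(b, a=0, o=1))
```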
The key insight is that a POMDP can be converted to a fully observable MDP over the space of belief states (called a belief MDP). However, the belief space is continuous (even when the underlying state space is finite), which makes POMDPs substantially harder to solve than standard MDPs.
Solving POMDPs exactly is PSPACE-complete, and most practical algorithms rely on approximate methods [7][11]. Several important algorithmic advances have made POMDPs more tractable:
- Point-based value iteration methods, which approximate the value function over a finite set of sampled belief points rather than the full belief space.
- Online planning methods such as Monte Carlo tree search in belief space (e.g., POMCP), which plan from the current belief rather than computing a global policy.
- Belief compression and related dimensionality-reduction techniques that exploit structure in the reachable belief space.
POMDPs have found applications in domains where the agent operates with incomplete information:
- Robot navigation and manipulation with noisy or limited sensors.
- Spoken dialogue systems, where the user's intent is never observed directly.
- Medical diagnosis and treatment planning, where a patient's underlying condition must be inferred from tests and symptoms.
- Autonomous driving, where the intentions of other road users are hidden.
Reinforcement learning (RL) is the field of machine learning concerned with learning optimal behavior through interaction with an environment. MDPs provide the formal mathematical foundation for nearly all of reinforcement learning.
In the standard RL setting, the agent interacts with an environment that is assumed to be an MDP, but the agent does not know the transition probabilities P or the reward function R in advance. The agent must learn a good policy through trial and error, using only the states, actions, and rewards it observes during interaction.
RL methods can be broadly divided into two categories based on their relationship to the MDP model:
| Category | Description | Examples |
|---|---|---|
| Model-based RL | The agent learns an explicit model of the MDP (transition and reward functions) and uses it for planning. | Dyna, PILCO, MuZero, World Models |
| Model-free RL | The agent learns a policy or value function directly from experience without constructing an explicit model. | Q-learning, SARSA, Policy Gradient, Actor-Critic |
Key connections between MDPs and RL algorithms:
- Temporal difference methods estimate value functions using sampled versions of the Bellman expectation equation.
- Q-learning performs a sampled, incremental form of the Bellman optimality backup, converging to Q* without knowledge of P or R [10].
- Policy gradient and actor-critic methods optimize parameterized (often stochastic) policies directly against the MDP objective.
- Deep RL methods such as DQN replace value tables with neural network function approximators to handle large state spaces.
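As a concrete illustration of the model-free connection noted in the list above, the snippet below sketches the tabular Q-learning update rule; the learning rate and all numbers are illustrative.

```python
# Sketch of the tabular Q-learning update: a sampled, model-free analogue
# of the Bellman optimality backup.
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                   # learning rate and discount (illustrative)

def q_learning_update(s, a, r, s_next):
    """One step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()  # no access to P or R required
    Q[s, a] += alpha * (target - Q[s, a])

q_learning_update(s=0, a=1, r=1.0, s_next=1)
print(Q)
```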
Several landmark AI systems are built on the MDP framework:
- Deep Q-Networks (DQN), which learned to play Atari 2600 games directly from screen input.
- AlphaGo and AlphaZero, which treat Go, chess, and shogi as large two-player Markov games.
- MuZero, a model-based system that learns its own internal MDP model for planning.
Several extensions of the basic MDP framework address limitations of the standard model.
| Extension | Key Modification | Use Case |
|---|---|---|
| Continuous-state MDPs | State or action spaces are continuous rather than discrete. | Robotics, control systems, physics simulations |
| Semi-Markov Decision Processes (SMDPs) | Time between transitions follows a probability distribution rather than being fixed. | Systems where actions take variable amounts of time |
| Constrained MDPs (CMDPs) | Auxiliary cost functions and constraints are added to the standard objective. | Safety-critical systems, resource-limited settings |
| Multi-agent MDPs (Markov Games) | Multiple agents interact, each with their own actions and rewards. | Competitive and cooperative multi-agent scenarios |
| Factored MDPs | State is decomposed into multiple variables with exploitable structure. | High-dimensional problems with sparse variable interactions |
| Average-reward MDPs | Agent maximizes long-run average reward per step instead of discounted sum. | Indefinite operation with no natural discounting |
| Inverse RL / Inverse MDPs | The reward function is inferred from observed behavior rather than given. | Learning from demonstrations, apprenticeship learning |
| Continuous-time MDPs | Dynamics defined by ordinary differential equations; summation replaced by integration. | Finance (Hamilton-Jacobi-Bellman equation), continuous control |
Continuous-state MDPs. Many real-world problems have continuous state or action spaces. Linear-quadratic regulators (LQR) and Gaussian process models are common approaches for continuous MDPs.
Semi-Markov Decision Processes (SMDPs). In an SMDP, the time between transitions is not fixed but follows a probability distribution. This extension is useful for modeling systems where actions take variable amounts of time.
Constrained MDPs (CMDPs). These add auxiliary cost functions and constraints to the standard MDP objective [12]. The agent must maximize reward subject to constraints on expected costs. Unlike standard MDPs, CMDPs can only be solved via linear programming (not dynamic programming), and the optimal policy may depend on the starting state.
Multi-agent MDPs. When multiple agents interact in the same environment, the framework extends to stochastic games (also called Markov games). Each agent has its own action set and reward function, and the transition depends on the joint action of all agents.
Factored MDPs. When the state is composed of multiple variables, factored MDPs exploit the structure to represent transition and reward functions compactly, enabling more efficient computation. Recent research has shown that factored representations can help overcome the curse of dimensionality by decomposing high-dimensional MDPs into smaller, independently evolving components.
Inverse Reinforcement Learning. Rather than solving an MDP given a reward function, inverse RL infers the reward function from observed expert behavior. This approach is useful when the reward function is difficult to specify manually but demonstrations of good behavior are available.
Markov Decision Processes have been applied across a wide range of domains. The following table summarizes key application areas.
| Domain | Application Examples | Why MDPs Are Suitable |
|---|---|---|
| Reinforcement learning | Training game-playing agents, learning control policies, Q-learning, policy gradient methods | MDPs provide the formal foundation; RL algorithms solve MDPs when the model is unknown. |
| Robotics | Path planning, navigation, manipulation, grasping, locomotion | Robots operate in stochastic environments where actions have uncertain outcomes. |
| Operations research | Inventory management, supply chain optimization, queuing systems, maintenance scheduling | MDPs model sequential resource allocation decisions under demand uncertainty. |
| Healthcare | Treatment planning, clinical trial design, chronic disease management, imaging screening optimization | Patient health evolves stochastically, and treatment decisions have long-term consequences. |
| Finance | Portfolio optimization, option pricing, algorithmic trading, risk management | Financial markets involve sequential decisions under uncertainty with discounted future returns. |
| Telecommunications | Network routing, channel allocation, congestion control, resource scheduling | Network states change probabilistically and routing decisions affect future network performance. |
| Natural language processing | Dialogue systems, text generation, machine translation | Conversations are sequential decision problems where the next response depends on the current context. |
| Autonomous vehicles | Lane changing, intersection management, adaptive cruise control | Driving involves sequential decisions under uncertainty about other drivers' behavior. |
| Agriculture | Crop management, irrigation scheduling, pest control | Agricultural decisions depend on stochastic weather and crop growth dynamics. |
| Energy | Power grid management, demand response, battery storage optimization | Energy systems require sequential decisions under uncertain demand and renewable generation. |
In robotics, MDPs and their extensions are used for motion planning, task planning, and control. A robot navigating an environment can model its position as the state, movement commands as actions, and sensor noise as transition uncertainty. POMDPs are particularly relevant for robotic applications because sensors are typically noisy and the robot's state is not perfectly known. Research on robotic grasping, for instance, uses MDP formulations to plan sequences of grasp attempts that maximize the probability of successfully picking up an object. Socially assistive robots have also adopted MDP-based frameworks for modeling emotional behavior and interaction management, reducing computational complexity while improving response quality.
Operations research was one of the earliest fields to adopt MDPs. Classic problems include inventory management (deciding how much stock to order given uncertain demand), machine maintenance (deciding when to repair or replace equipment), and queuing (managing service systems with random arrivals). The MDP framework naturally captures the sequential, stochastic nature of these problems and the trade-off between immediate costs and long-term performance.
MDPs have been applied to clinical decision-making, where a physician must choose treatments for a patient whose health state evolves over time. For example, MDP models have been used to optimize screening schedules for cancer detection, determine optimal treatment sequences for HIV, and manage chronic conditions such as diabetes. The patient's health state serves as the MDP state, available treatments are the actions, health outcomes define the transitions, and patient quality of life (or survival probability) serves as the reward. During the COVID-19 pandemic, reinforcement learning methods built on MDP frameworks played a role in optimizing treatment decisions under heightened uncertainty.
In quantitative finance, MDPs model portfolio management, where the state represents the current portfolio composition and market conditions, actions correspond to trades, and the reward reflects portfolio returns. Dynamic asset allocation, option exercise decisions, and credit risk management have all been formulated as MDPs. The discount factor in financial MDPs often corresponds directly to the time value of money.
Game playing has served as one of the most visible testbeds for MDP-based methods. Board games like Go, chess, and shogi can be modeled as large MDPs (or two-player Markov games), and deep reinforcement learning systems like AlphaGo and AlphaZero have achieved superhuman performance in these domains. Video games, including the Atari 2600 suite, have similarly been modeled as MDPs and solved using Deep Q-Networks and policy gradient methods.
MDPs rely on several assumptions that may not always hold in practice.
Markov property. The future depends only on the current state and action, not on history. In many real-world problems, the state representation may not capture all relevant information, violating this assumption.
Full observability. The standard MDP assumes the agent can observe the true state of the environment at all times. When this does not hold, POMDPs or other extensions are needed.
Stationarity. The transition probabilities and reward function do not change over time. In non-stationary environments, the MDP model must be periodically updated or replaced with a non-stationary formulation.
Known model. Classical solution methods (value iteration, policy iteration, LP) require complete knowledge of P and R. When the model is unknown, reinforcement learning methods are used instead.
Curse of dimensionality. The computational cost of solving an MDP grows polynomially with the number of states and actions, but the number of states often grows exponentially with the number of state variables. This makes exact solutions intractable for high-dimensional problems and motivates the use of function approximation techniques, neural network-based approaches, and factored representations.
Reward specification. Designing the reward function correctly is both critical and challenging. A poorly designed reward function can lead to unintended behavior, a problem sometimes called reward hacking. Inverse reinforcement learning offers one approach to this challenge by learning the reward from demonstrations.
Imagine you are playing a board game where you roll a die and move your piece around the board. At every square you land on, you have to make a choice: go left or go right. Some choices lead to squares where you get candy (that is your reward), and some lead to squares where you lose candy. The tricky part is that when you roll the die, you cannot be sure exactly which square you will land on next.
A Markov Decision Process is like a set of rules that helps you figure out the very best choice to make on every single square, so that over the whole game you collect as much candy as possible. The really important thing about these rules is that they only care about which square you are on right now, not about all the squares you visited before. It does not matter if you got to your current square by going left three times or right twice; the best next move is the same either way.
Scientists and engineers use MDPs to help robots decide where to walk, to help doctors pick the best treatment for patients, and to teach computers how to play video games really well. Any time someone needs to make a series of choices where the results are a little bit unpredictable, an MDP can help find the smartest strategy.