See also: Reinforcement learning, Q-learning, Markov decision process
In reinforcement learning (RL), a policy is the strategy or rule that an agent uses to decide which actions to take in a given state of the environment. Formally, a policy is a mapping from states to actions (or probability distributions over actions) that defines the agent's behavior at every point in time. The policy is the central object of interest in RL: the entire goal of training an agent is to find a policy that maximizes the expected cumulative reward over time.
Policies can take many forms, from simple lookup tables in small discrete environments to complex neural networks that process high-dimensional sensory inputs. The concept of a policy is grounded in the mathematical framework of the Markov decision process (MDP), which models sequential decision-making problems as a tuple of states, actions, transition probabilities, rewards, and a discount factor. Within this framework, the policy is the component that the agent controls directly, and optimizing it is equivalent to solving the MDP.
The study of policies spans several major research areas, including policy evaluation, policy improvement, policy gradient methods, and policy search. In recent years, the concept of a policy has taken on additional significance in the context of large language model (LLM) alignment, where reinforcement learning from human feedback (RLHF) treats the language model itself as a policy that generates text token by token.
A policy operates within the framework of an MDP, defined by the tuple (S, A, P, R, gamma), where S is the set of states, A is the set of actions, P is the state-transition function, R is the reward function, and gamma is the discount factor.
A policy pi defines how the agent selects actions. There are two principal forms:
| Type | Notation | Description |
|---|---|---|
| Deterministic policy | a = mu(s) | Maps each state s to a single, fixed action a. Given the same state, the agent always takes the same action. |
| Stochastic policy | a ~ pi(a \| s) | Maps each state s to a probability distribution over actions. The agent samples an action from this distribution. |
In the deterministic case, the policy is conventionally denoted by mu, while stochastic policies are typically denoted by pi. When the policy is parameterized (for example, by a neural network with weights theta), it is written as pi_theta(a | s), indicating that the probability of selecting action a in state s depends on the learned parameters theta.
The objective of the agent is to find a policy that maximizes the expected discounted return:
J(pi) = E_pi [ sum_{t=0}^{infinity} gamma^t * R(s_t, a_t) ]
where gamma (0 <= gamma <= 1) is the discount factor that controls the trade-off between immediate and future rewards.
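As a concrete illustration, the return of a single trajectory can be computed directly from its reward sequence. The sketch below (plain Python, with an arbitrary example reward list) simply sums gamma^t * r_t; it is illustrative only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three intermediate rewards of 1.0 followed by a terminal reward of 10.0
print(discounted_return([1.0, 1.0, 1.0, 10.0]))  # 1 + 0.9 + 0.81 + 7.29 ≈ 10.0
```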
The distinction between deterministic and stochastic policies is fundamental in reinforcement learning theory and has practical implications for algorithm design.
A deterministic policy produces a single action for each state: a_t = mu(s_t). This means that given the same state, the agent will always select the same action. Deterministic policies are conceptually simple and computationally efficient, since no randomness is involved in action selection.
Deterministic policies work well in fully observable environments where the agent has complete information about the state. They are commonly found as the output of value-based methods like Q-learning, where the greedy policy selects the action with the highest Q-value: mu(s) = argmax_a Q(s, a).
However, deterministic policies have notable limitations. They do not naturally explore the environment, since the same state always produces the same action. This lack of exploration can prevent the agent from discovering better strategies. In partially observable environments or adversarial settings (such as games like rock-paper-scissors), a deterministic policy can be exploited by an opponent who predicts the agent's actions.
A stochastic policy assigns a probability to each action given a state: pi(a | s) = P(A_t = a | S_t = s). The agent samples from this distribution at each time step, which introduces controlled randomness into decision-making.
Stochastic policies offer several advantages:
| Advantage | Explanation |
|---|---|
| Built-in exploration | The agent naturally tries different actions, which helps it discover new and potentially better strategies without requiring a separate exploration mechanism. |
| Robustness in partial observability | When the agent cannot fully observe the environment state, randomizing over actions can be optimal because the same observation may correspond to different underlying states requiring different responses. |
| Smooth optimization landscape | Small changes to policy parameters produce small changes in action probabilities, making gradient-based optimization more stable. In contrast, deterministic policies can exhibit discontinuous behavior changes from small parameter shifts. |
| Game-theoretic optimality | In competitive multi-agent settings, stochastic (mixed) strategies can be optimal where deterministic strategies are not. |
Two common parameterizations of stochastic policies are the categorical (softmax) policy, which assigns a probability to each action in a discrete action set, and the Gaussian policy, which outputs the mean and standard deviation of a continuous action distribution.
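A minimal NumPy sketch of these two parameterizations is shown below; the feature vector, weight matrices, and dimensions are hypothetical, and the code is meant only to make the state-to-distribution mapping concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def categorical_policy(theta, phi):
    """Discrete actions: softmax over linear scores theta @ phi, then sample."""
    logits = theta @ phi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

def gaussian_policy(W_mean, log_std, phi):
    """Continuous actions: sample from a Gaussian whose mean depends on the state."""
    mean = W_mean @ phi
    return rng.normal(mean, np.exp(log_std))

phi = np.array([0.2, -1.0, 0.5])                                   # hypothetical state features
discrete_action, probs = categorical_policy(rng.normal(size=(4, 3)), phi)
continuous_action = gaussian_policy(rng.normal(size=(2, 3)), np.zeros(2), phi)
```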
As a general guideline, deterministic policies are preferred when the environment is fully observable and non-adversarial, and when the agent only needs to exploit a known-good strategy. Stochastic policies are preferred during training (to facilitate exploration), in partially observable environments, in multi-agent or adversarial settings, and whenever policy gradient methods are used for optimization.
It is worth noting that for any finite MDP there always exists at least one deterministic optimal policy. However, finding that optimal policy often requires the use of stochastic policies during training.
An optimal policy pi* is a policy that achieves the maximum expected cumulative reward from every state. Formally, a policy pi* is optimal if and only if, for all states s and all other policies pi:
V^{pi*}(s) >= V^{pi}(s)
where V^{pi}(s) is the state-value function under policy pi, representing the expected return starting from state s and following policy pi thereafter.
The existence of an optimal policy is guaranteed by the theory of MDPs. The Bellman optimality equation provides a recursive characterization of the optimal value function:
V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V*(s') ]
Similarly, the optimal action-value function Q* satisfies:
Q*(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q*(s', a')
Once Q* is known, the optimal deterministic policy is obtained by:
pi*(s) = argmax_a Q*(s, a)
Classical dynamic programming algorithms such as value iteration and policy iteration compute the optimal policy by iteratively applying the Bellman equation. These methods require a complete model of the environment (transition probabilities and rewards), making them suitable for planning problems but not for model-free settings where Q-learning and policy gradient methods are used instead.
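As a sketch of how the Bellman optimality equation is applied in practice, the following NumPy value iteration assumes a small tabular MDP supplied as explicit transition and reward arrays (both hypothetical); the greedy argmax it returns is exactly the deterministic policy pi*(s) = argmax_a Q*(s, a) described above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[s, a, s']: transition probabilities; R[s, a]: expected rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # optimal values and greedy policy
        V = V_new
```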
How a policy is represented computationally determines the types of environments it can handle and the algorithms that can be used to optimize it.
| Representation | Description | Suitable For |
|---|---|---|
| Lookup table | A table storing the action (or action probabilities) for each state. Each state-action pair is an independent entry. | Small, discrete state and action spaces (e.g., gridworlds, simple board games). |
| Linear function approximator | The policy is a linear function of state features: pi(a \| s) = softmax(theta^T * phi(s)). | Moderate-sized problems with well-designed features. |
| Neural network | A deep neural network maps states (or raw observations like images) to actions or action distributions. | Large, continuous, or high-dimensional state and action spaces (e.g., Atari games, robotic control, autonomous driving). |
| Decision tree or rule set | A tree or set of if-then rules that maps states to actions. | Interpretable policies in domains requiring transparency. |
In modern deep reinforcement learning, neural networks are the dominant policy representation. The network takes the current state (or observation) as input and outputs either a single action (deterministic) or parameters of a probability distribution over actions (stochastic). Convolutional neural networks are used when the input is an image, while recurrent neural networks or transformers handle sequential or partially observable inputs.
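For concreteness, here is a minimal PyTorch sketch of a stochastic neural-network policy for a discrete action space; the layer sizes, activation choice, and observation dimension are assumptions for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Maps an observation vector to a categorical distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),          # one logit per action
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

policy = DiscretePolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))      # pi_theta(. | s) for a dummy observation
action = dist.sample()             # stochastic action selection
log_prob = dist.log_prob(action)   # used later by policy gradient methods
```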
Two fundamental operations in reinforcement learning are policy evaluation and policy improvement, which together form the basis of the policy iteration algorithm.
Policy evaluation is the process of computing the value function V^{pi}(s) for a given policy pi. This tells the agent how good it is to be in each state when following the current policy. Policy evaluation can be performed exactly (using the Bellman expectation equation) or approximately (using sampling-based methods like Monte Carlo estimation or temporal-difference learning).
The Bellman expectation equation for policy evaluation is:
V^{pi}(s) = sum_a pi(a | s) * [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V^{pi}(s') ]
Iterative policy evaluation repeatedly applies this equation until the value function converges.
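A compact NumPy sketch of iterative policy evaluation under the same assumptions as the value-iteration example above (a tabular MDP given as arrays P and R, and a stochastic policy given as a state-by-action probability matrix, all hypothetical):

```python
import numpy as np

def evaluate_policy(pi, P, R, gamma=0.95, tol=1e-8):
    """pi[s, a]: action probabilities; P[s, a, s']: transitions; R[s, a]: rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # V(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') ]
        V_new = (pi * (R + gamma * (P @ V))).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```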
Given the value function from policy evaluation, policy improvement constructs a new, better policy by acting greedily with respect to the current value function:
pi'(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V^{pi}(s') ]
The policy improvement theorem guarantees that the new policy pi' is at least as good as the original policy pi, and strictly better unless pi was already optimal.
Policy iteration alternates between evaluation and improvement: the current policy is evaluated to obtain V^{pi}, a new policy is formed by acting greedily with respect to V^{pi}, and the cycle repeats until the policy no longer changes.
Policy iteration is guaranteed to converge to the optimal policy in a finite number of steps for finite MDPs. A variant called generalized policy iteration (GPI) interleaves partial evaluation and improvement steps, and most practical RL algorithms can be viewed as instances of GPI.
RL algorithms are commonly categorized by how they approach the problem of finding a good policy.
| Aspect | Value-Based Methods | Policy-Based Methods |
|---|---|---|
| What is learned | A value function (V or Q), from which a policy is derived implicitly | A parameterized policy pi_theta directly |
| Examples | Q-learning, DQN, SARSA | REINFORCE, PPO, A2C |
| Action spaces | Best suited for discrete action spaces | Handle both discrete and continuous action spaces naturally |
| Policy type | Typically deterministic (greedy w.r.t. value function) | Can represent stochastic policies |
| Optimization | Updates value estimates using Bellman equation backups | Updates policy parameters using gradient ascent on expected reward |
| Stability | Can suffer from instability with function approximation (the "deadly triad") | Policy changes are smooth; small parameter changes produce small behavior changes |
| Convergence | Strong convergence guarantees in tabular settings | Local optima possible, but strong practical performance |
Value-based methods like Q-learning learn an action-value function Q(s, a) and derive a policy by selecting the action with the highest Q-value. These methods are highly effective in discrete action spaces but face challenges in continuous spaces, where finding the argmax over a continuous action domain is computationally expensive.
Policy-based methods parameterize the policy directly and optimize it using gradient-based techniques. They handle continuous action spaces naturally, since the policy can output parameters of a continuous distribution (e.g., the mean and variance of a Gaussian). They also support stochastic policies, which can be essential for exploration and for problems where randomized behavior is optimal.
Policy gradient methods are a family of algorithms that optimize a parameterized policy pi_theta by computing the gradient of the expected return with respect to the policy parameters theta, and then performing gradient ascent.
The theoretical foundation of policy gradient methods is the policy gradient theorem, introduced by Richard Sutton, David McAllester, Satinder Singh, and Yishay Mansour in their landmark paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation," published in Advances in Neural Information Processing Systems (NIPS) in 2000.
The key challenge in computing the gradient of J(theta) is that the expected return depends on the state distribution induced by the policy, which itself depends on theta. Computing the gradient of this state distribution directly is intractable. The policy gradient theorem elegantly sidesteps this issue by showing that the gradient can be expressed as:
∇_theta J(theta) = E_pi [ Q^{pi}(s, a) * ∇_theta ln pi_theta(a | s) ]
This formulation is remarkable because it does not require computing the gradient of the state distribution. The term ∇_theta ln pi_theta(a | s) is the score function (the gradient of the log-probability of taking action a in state s), and Q^{pi}(s, a) weights this gradient by how good the action was. This expectation can be estimated from sample trajectories, making the theorem the basis for practical algorithms.
The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first practical policy gradient method. It uses Monte Carlo sampling to estimate the policy gradient from complete episodes.
The REINFORCE update rule is:
theta <- theta + alpha * gamma^t * G_t * ∇_theta ln pi_theta(a_t | s_t)
where G_t is the return (cumulative discounted reward) from time step t onward, and alpha is the learning rate.
REINFORCE is simple and unbiased, but it suffers from high variance in its gradient estimates because the return G_t can fluctuate significantly across episodes. A common remedy is to subtract a baseline b(s) from the return:
theta <- theta + alpha * (G_t - b(s_t)) * ∇_theta ln pi_theta(a_t | s_t)
A natural choice for the baseline is the state-value function V(s), which does not introduce bias but reduces variance. When the baseline is V(s), the quantity G_t - V(s_t) approximates the advantage function A(s, a) = Q(s, a) - V(s), which measures how much better an action is compared to the average.
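The sketch below shows one REINFORCE update in PyTorch for a single recorded episode, using a simple mean-return baseline rather than a learned V(s) and omitting the gamma^t factor, as is common in practice; the tiny linear policy, the episode data, and the hyperparameters are all hypothetical.

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 2)                         # logits for 2 actions from 4-dim states
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states  = torch.randn(5, 4)                      # s_0 ... s_4 from one episode
actions = torch.tensor([0, 1, 1, 0, 1])          # a_0 ... a_4
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]              # r_0 ... r_4
gamma = 0.99

# Returns G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)
baseline = returns.mean()                        # simple constant baseline b

# Loss is -(G_t - b) * ln pi_theta(a_t | s_t), summed over the episode.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -((returns - baseline) * log_probs).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```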
Actor-critic methods address the high variance of REINFORCE by introducing a learned value function (the critic) that evaluates the policy (the actor). Instead of waiting for a complete episode to compute returns, the critic provides bootstrapped value estimates that allow updates after every step.
| Component | Role | What It Learns |
|---|---|---|
| Actor | The policy pi_theta | Which actions to take in each state |
| Critic | The value function V_w(s) or Q_w(s, a) | How good states (or state-action pairs) are under the current policy |
The actor is updated using the policy gradient, with the critic's estimate replacing the Monte Carlo return:
theta <- theta + alpha_theta * A_w(s_t, a_t) * ∇_theta ln pi_theta(a_t | s_t)
where A_w is the advantage estimate from the critic. The critic is updated using temporal-difference (TD) learning:
w <- w + alpha_w * delta_t * ∇_w V_w(s_t)
where delta_t = r_t + gamma * V_w(s_{t+1}) - V_w(s_t) is the TD error.
Actor-critic methods combine the low bias of policy gradient methods with the low variance of value function estimation, producing faster and more stable learning than pure policy gradient approaches.
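A minimal PyTorch sketch of a single actor-critic update using the TD error as the advantage estimate; the linear actor and critic, the shared optimizer, and the dummy transition are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

actor  = nn.Linear(4, 2)     # policy logits for 2 actions from 4-dim states
critic = nn.Linear(4, 1)     # state-value estimate V_w(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next, done):
    v = critic(s)
    v_next = critic(s_next).detach()
    td_target = r + gamma * v_next * (1.0 - done)      # bootstrapped target
    td_error = (td_target - v).detach()                # advantage estimate A_w(s, a)

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -td_error * log_prob                  # policy gradient step for the actor
    critic_loss = (td_target - v).pow(2)               # TD learning for the critic

    opt.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    opt.step()

actor_critic_step(torch.randn(4), torch.tensor(1), 1.0, torch.randn(4), done=0.0)
```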
A2C (Advantage Actor-Critic) is a synchronous variant of the A3C (Asynchronous Advantage Actor-Critic) algorithm introduced by Mnih et al. in 2016. A2C runs multiple parallel actors that each collect experience in their own copy of the environment, then performs a single synchronized update using data from all actors.
The key features of A2C include:
PPO, introduced by John Schulman and colleagues at OpenAI in 2017, is one of the most widely used policy gradient algorithms in practice. It addresses a fundamental challenge: how to take the largest possible policy improvement step without causing a catastrophic collapse in performance.
PPO achieves this through a clipped surrogate objective. Let r(theta) = pi_theta(a | s) / pi_{theta_old}(a | s) be the probability ratio between the new and old policies. The PPO objective is:
L^{CLIP}(theta) = E [ min( r(theta) * A, clip(r(theta), 1 - epsilon, 1 + epsilon) * A ) ]
where epsilon is a hyperparameter (typically 0.1 or 0.2) and A is the advantage estimate. The clipping operation prevents the ratio r(theta) from moving too far from 1, which limits the size of each policy update.
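A short PyTorch sketch of the clipped surrogate objective written as a loss to minimize; the log-probabilities and advantages are assumed to be precomputed tensors, and in a full implementation this term would be combined with the value-function loss and entropy bonus listed in the table below.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective L^CLIP (to minimize)."""
    ratio = torch.exp(logp_new - logp_old)                          # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```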
| PPO Feature | Detail |
|---|---|
| Clipping range | epsilon = 0.1 to 0.2 (controls maximum policy change per update) |
| Data reuse | Multiple epochs of minibatch updates on collected trajectories |
| Advantage estimation | Generalized Advantage Estimation (GAE) with lambda parameter |
| Additional objectives | Value function loss and entropy bonus added to the total loss |
| Computational cost | Lower than TRPO (no conjugate gradient or line search needed) |
PPO has become the default algorithm in many RL applications due to its simplicity, stability, and strong empirical performance. It is used in robotics (OpenAI's Dactyl hand), game AI (OpenAI Five for Dota 2), and LLM alignment (the RL fine-tuning step in RLHF).
Before PPO, TRPO (Schulman et al., 2015) addressed the same problem of safe policy updates by imposing a hard constraint on the KL divergence between the old and new policies:
maximize E [ r(theta) * A ] subject to D_KL(pi_{theta_old} || pi_theta) <= delta
TRPO uses conjugate gradient methods and line search to solve this constrained optimization problem, which makes it more computationally expensive than PPO. While TRPO provides stronger theoretical guarantees on monotonic improvement, PPO's simpler clipping mechanism achieves comparable or better results in practice.
Policies play a central role in the distinction between on-policy and off-policy learning, which is one of the most important taxonomic divisions in reinforcement learning.
| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Definition | The agent learns about the same policy it uses to generate behavior | The agent learns about a target policy different from the behavior policy |
| Data source | Trajectories generated by the current policy pi | Data can come from any policy (historical data, human demonstrations, different agents) |
| Examples | SARSA, REINFORCE, A2C, PPO | Q-learning, DQN, SAC, importance-weighted policy gradient |
| Sample efficiency | Lower (data becomes stale after each policy update) | Higher (can reuse data from old policies via experience replay) |
| Stability | Generally more stable during training | Requires correction techniques (importance sampling, target networks) to remain stable |
| Exploration | The policy being learned must also explore | Exploration and learning can be separated |
On-policy methods learn the value of the policy they are currently following and can only use data generated by that policy. After each policy update, all previously collected data is typically discarded because it was generated under a different policy. This makes on-policy methods less sample-efficient but simpler and more stable.
Off-policy methods learn about one policy (the target policy) while following a different policy (the behavior policy). This allows them to reuse past experience stored in a replay buffer, learn from demonstrations, and separate exploration from exploitation. However, off-policy learning requires techniques like importance sampling to correct for the mismatch between the behavior and target policies, which can introduce variance and instability.
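As a small illustration of that correction, the per-trajectory importance weight is the product of the ratios of target-policy to behavior-policy probabilities for the actions actually taken; the probabilities and rewards below are hypothetical.

```python
import numpy as np

pi_target   = np.array([0.9, 0.7, 0.8])    # pi(a_t | s_t), the target policy
pi_behavior = np.array([0.5, 0.5, 0.5])    # b(a_t | s_t), the behavior policy
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.99

rho = np.prod(pi_target / pi_behavior)     # trajectory importance weight
G = np.sum(gamma ** np.arange(len(rewards)) * rewards)
corrected_return = rho * G                 # unbiased estimate of the target policy's return
```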
PPO is sometimes described as "on-policy" because it collects fresh data using the current policy, but it reuses each batch of data for several epochs of gradient updates under the clipped objective, blurring the boundary between strictly on-policy and off-policy methods.
Policy search refers to a broad class of methods that optimize the policy directly in policy parameter space, without necessarily computing value functions. While policy gradient methods are one form of policy search, the term also encompasses gradient-free optimization approaches.
Gradient-based policy search methods include REINFORCE, actor-critic algorithms, PPO, and TRPO. These methods compute or estimate the gradient of the expected return and use it to update the policy parameters.
Gradient-free policy search methods treat policy optimization as a black-box optimization problem. They include evolution strategies (ES), the cross-entropy method (CEM), and Bayesian optimization. These approaches evaluate entire policies by running them in the environment and use the resulting scores to guide the search. Gradient-free methods can be advantageous when the gradient is difficult to compute, the reward function is non-differentiable, or the policy parameterization is non-smooth. OpenAI demonstrated in 2017 that evolution strategies can be competitive with gradient-based methods on continuous control benchmarks while offering natural parallelization.
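A minimal sketch of an evolution-strategies update in the style of that 2017 work: the parameters are perturbed with Gaussian noise, each perturbation is scored by a black-box evaluate(theta) function (hypothetical here), and the scores are combined into a gradient estimate.

```python
import numpy as np

def es_update(theta, evaluate, n_samples=50, sigma=0.1, lr=0.02):
    """One evolution-strategies step on a 1-D parameter vector theta.

    `evaluate(theta)` is a hypothetical black-box function returning the
    episodic return of the policy with parameters theta; no gradients needed.
    """
    rng = np.random.default_rng()
    noise = rng.normal(size=(n_samples, theta.size))
    returns = np.array([evaluate(theta + sigma * eps) for eps in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalize scores
    grad_estimate = noise.T @ returns / (n_samples * sigma)
    return theta + lr * grad_estimate
```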
Guided policy search (GPS), developed by Sergey Levine and colleagues, combines trajectory optimization with supervised learning. GPS first solves the control problem using local trajectory optimization (which is easier) and then trains a neural network policy to match the optimized trajectories. This approach avoids the poor local optima that can trap direct policy gradient methods.
The concept of a policy has gained significant importance outside traditional RL through its role in reinforcement learning from human feedback (RLHF), the technique used to align large language models like ChatGPT, Claude, and Gemini with human preferences.
In the RLHF framework, the language model itself is treated as a policy:
| RL Concept | RLHF Interpretation |
|---|---|
| State | The prompt plus all tokens generated so far |
| Action | The next token to generate (from a vocabulary of tens of thousands of tokens) |
| Policy | The language model pi_theta, which outputs a probability distribution over the next token |
| Reward | A score from a trained reward model that rates how well the generated text matches human preferences |
| Episode | The generation of a complete response to a prompt |
The RLHF process involves three stages: supervised fine-tuning of a pretrained model on demonstration data, training a reward model on human preference comparisons between candidate responses, and reinforcement-learning fine-tuning of the policy (typically with PPO) against the reward model.
The RL fine-tuning objective includes a KL divergence penalty to prevent the policy from drifting too far from the reference model:
R_total = R_reward_model(response) - lambda * D_KL(pi_theta || pi_reference)
This KL penalty is critical because without it, the policy could learn to produce outputs that "hack" the reward model (achieving high reward scores through degenerate text that does not actually reflect human preferences). By constraining the policy to remain close to the reference model, RLHF preserves the language model's fluency and general capabilities while steering its behavior toward alignment with human values.
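A sketch of how the KL-penalized reward can be computed for a single generated response, assuming the per-token log-probabilities of the sampled tokens under the policy and the frozen reference model are available; the single-sample difference of log-probabilities used here is a common estimate of the KL term, and all names are illustrative.

```python
import torch

def rlhf_reward(reward_model_score, logprobs_policy, logprobs_reference, kl_coef=0.1):
    """KL-penalized reward for one response: R_rm - lambda * KL estimate."""
    # Per-token log pi_theta - log pi_ref for the tokens actually generated,
    # summed over the response (a single-sample estimate of the KL term).
    kl_estimate = (logprobs_policy - logprobs_reference).sum()
    return reward_model_score - kl_coef * kl_estimate
```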
More recent approaches simplify the RLHF pipeline in different ways. Direct Preference Optimization (DPO) eliminates both the separate reward model and the RL loop, optimizing the policy directly on preference data. Group Relative Policy Optimization (GRPO, introduced by DeepSeek) keeps reward-based RL but omits the learned value function entirely, instead sampling multiple outputs per prompt and computing relative advantages within each group.
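A sketch of the group-relative advantage computation at the heart of GRPO: several responses are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation; the tensor of scores below is hypothetical.

```python
import torch

def group_relative_advantages(group_rewards):
    """Advantage of each response relative to the other responses for the same prompt."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# Hypothetical reward-model scores for 4 responses sampled from one prompt
advantages = group_relative_advantages(torch.tensor([0.2, 1.1, -0.3, 0.8]))
```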
Imagine you are playing a maze game. A policy is like a set of instructions that tells you what to do at every spot in the maze. At this corner, turn left. At that dead end, go back. A really simple policy might be written on a piece of paper with one instruction for every spot. A more complicated policy might be a helper who looks at where you are and figures out the best move right then and there.
Some policies always tell you the same thing at the same spot (these are called deterministic). Other policies might say "turn left 70% of the time and turn right 30% of the time" (these are called stochastic), which can be helpful when you are still figuring out the maze and want to try different paths.
The goal of reinforcement learning is to find the best possible policy, the set of instructions that gets you through the maze the fastest and collects the most treats along the way. The agent tries different things, sees what works, and slowly makes its instructions better and better until it has a really good plan.