Policy gradient methods
Last reviewed
May 1, 2026
Sources
21 citations
Review status
Source-backed
Revision
v1 · 5,041 words
Policy gradient methods are a family of reinforcement learning algorithms that directly parameterise the agent's policy and optimise it by stochastic gradient ascent on the expected return. Instead of first learning a value function and then deriving a policy from it (the classic value-based approach used by Q-learning and SARSA), a policy gradient method maintains a parameterised policy π_θ(a|s), collects trajectories by acting in the environment, and adjusts θ in the direction that increases the expected cumulative reward J(θ) = E_{τ ~ π_θ}[R(τ)].
The modern lineage starts with Ronald J. Williams's REINFORCE paper in 1992 and was placed on a rigorous footing by the policy gradient theorem of Sutton, McAllester, Singh, and Mansour in 2000. Over the next two decades the family expanded to include actor-critic methods, trust region methods (TRPO), proximal optimisation (PPO), deterministic policy gradients (DPG, DDPG, TD3), and maximum-entropy methods (SAC). Today policy gradient algorithms drive a striking share of applied reinforcement learning, from OpenAI Five and AlphaStar through bipedal robot locomotion to the RLHF fine-tuning step that produced ChatGPT and Claude.
A Markov decision process (MDP) is defined by states s, actions a, transition dynamics p(s'|s,a), reward function r(s,a), and discount factor γ. The agent's goal is to choose actions that maximise the expected discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. Value-based methods such as Q-learning estimate Q(s,a), the expected return of taking action a in state s and following the optimal policy thereafter, and pick the greedy action argmax_a Q(s,a). This works well in small discrete action spaces but breaks down in two regimes that matter in practice: continuous action spaces, where the argmax is itself a non-trivial optimisation problem at every step, and stochastic policies, where the optimal behaviour is genuinely randomised (as in many partially observable or adversarial settings).
Policy gradient methods sidestep these problems by working directly with a parameterised policy. The policy can be a Gaussian over actions whose mean and standard deviation are network outputs, a softmax over discrete actions, a categorical mixture, or a deterministic function of the state. There are several reasons to want this:
The central technical result is the policy gradient theorem of Sutton, McAllester, Singh, and Mansour (NeurIPS 1999, published 2000). For a parameterised policy π_θ(a|s) and a long-run performance measure J(θ) (either an episodic start-state value or an average reward), the gradient of performance with respect to θ has a clean form:
∇_θ J(θ) = E_{s ~ d^π, a ~ π_θ}[∇_θ log π_θ(a|s) · Q^π(s, a)]
The expectation is taken over the discounted state-visitation distribution d^π and the policy itself, and Q^π is the action-value function under the current policy. The crucial property is that the gradient does not contain the term ∇_θ d^π, which would be hard to estimate because changing θ also changes which states are visited. The visitation effect cancels.
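For a single-state problem the identity can be checked numerically. The sketch below is a minimal illustration, assuming a softmax policy over three actions and hypothetical action values `q`; the function names and numbers are illustrative and not taken from any source. It compares the score-function form of the gradient with a finite-difference estimate of ∇_θ J.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta, q):
    """J(theta) = sum_a pi_theta(a) * q(a) for a one-state problem."""
    return float(softmax(theta) @ q)

def score_function_gradient(theta, q):
    """sum_a pi(a) * grad_theta log pi(a) * q(a)."""
    pi = softmax(theta)
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi (row a below).
    grad_log_pi = np.eye(len(theta)) - pi
    return (pi[:, None] * grad_log_pi * q[:, None]).sum(axis=0)

def finite_difference_gradient(theta, q, eps=1e-6):
    """Central-difference estimate of grad_theta J, used only as a reference."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_reward(theta + step, q)
                   - expected_reward(theta - step, q)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])   # hypothetical policy logits
q = np.array([1.0, 0.0, 2.0])        # hypothetical action values
pg = score_function_gradient(theta, q)
fd = finite_difference_gradient(theta, q)
print(np.allclose(pg, fd, atol=1e-6))
```

With a single state there is no visitation distribution to differentiate, so this checks only the score-function identity; the multi-state case additionally relies on the cancellation described above.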
In practice the unknown Q^π is replaced by an estimator, and the various choices give a family of algorithms:
| Estimator for Q^π(s,a) | Resulting algorithm | Bias | Variance |
|---|---|---|---|
| Monte Carlo return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} | REINFORCE | Unbiased | Very high |
| One-step bootstrapped TD: r + γ V(s') | One-step actor-critic | Biased (V is approximate) | Low |
| n-step return | n-step actor-critic | Tunable | Tunable |
| Generalised Advantage Estimation GAE(γ, λ) | A2C, A3C, TRPO, PPO with GAE | Tunable via λ | Tunable via λ |
| Advantage A(s,a) = Q(s,a) − V(s) | Actor-critic with baseline | Same as Q estimator | Lower than Q alone |
Replacing the return G_t with G_t − b(s) for any state-dependent baseline b(s) leaves the gradient unbiased while reducing variance, because E[∇_θ log π_θ(a|s) · b(s)] = 0 under π_θ. Choosing b(s) = V^π(s) gives the advantage form Q − V, which is the basis of virtually every modern actor-critic.
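The effect of a baseline can be illustrated on a toy example. The sketch below assumes a fixed three-action softmax policy and hypothetical action values (none of the numbers come from a source); it confirms that the baseline term has zero expectation and compares the sample variance of the gradient estimate with and without a value baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # fixed policy probabilities (assumed)
q = np.array([10.0, 11.0, 12.0])    # hypothetical action values
grad_log_pi = np.eye(3) - pi        # row a: grad of log pi(a) w.r.t. the logits

# Exact expectation of grad log pi(a) * b for a constant b = 7: zero.
baseline_term = (pi[:, None] * grad_log_pi).sum(axis=0) * 7.0

def sample_gradients(baseline, n=5000):
    """Draw n single-sample gradient estimates grad log pi(a) * (q(a) - b)."""
    actions = rng.choice(3, size=n, p=pi)
    return grad_log_pi[actions] * (q[actions] - baseline)[:, None]

no_baseline = sample_gradients(0.0)
with_baseline = sample_gradients(float(pi @ q))   # b = V = sum_a pi(a) q(a)

var_no = no_baseline.var(axis=0).sum()
var_with = with_baseline.var(axis=0).sum()
print(baseline_term, var_no, var_with)
```

Both estimators target the same gradient; only their spread differs, and in this toy setting the gap is large because the action values share a big common offset.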
Williams's 1992 paper Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning introduced REINFORCE, the prototypical policy gradient method. The update is starkly simple. After running an episode and observing the return G_t from each timestep, every state-action pair is updated by
θ ← θ + α · ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t))
No critic, no bootstrapping, no replay. The gradient is unbiased. The catch is variance: G_t is a sum of many noisy random rewards, so its standard deviation grows roughly as √H in horizon H, and the gradient estimate is correspondingly noisy. Williams already noted that subtracting a baseline b(s_t) leaves the expected update unchanged but can dramatically reduce variance, and that an obvious good baseline is a learned estimate of V^π(s_t). REINFORCE is rarely competitive on its own in modern deep RL benchmarks but it is still a useful pedagogical starting point and a building block: PPO, TRPO, and A3C all reduce to a variance-controlled version of REINFORCE in their inner loop.
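A minimal sketch of the update for one episode follows, assuming a tabular softmax policy stored as an array `theta[state, action]` and the convention that `rewards[t]` is the reward received after the action at step t. The function names and hyperparameter defaults are illustrative assumptions, not Williams's original notation.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = rewards[t] + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, states, actions, rewards,
                     gamma=0.99, alpha=0.1, baseline=None):
    """One REINFORCE step; gradients are evaluated at the pre-update theta."""
    returns = discounted_returns(rewards, gamma)
    new_theta = theta.copy()
    for t, (s, a) in enumerate(zip(states, actions)):
        b = 0.0 if baseline is None else baseline[s]
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0                  # one_hot(a) - pi(.|s)
        new_theta[s] += alpha * grad_log_pi * (returns[t] - b)
    return new_theta

theta = np.zeros((2, 2))
updated = reinforce_update(theta, states=[0], actions=[1], rewards=[1.0],
                           gamma=1.0, alpha=0.1)
print(updated[0])
```

The probability of the action that was followed by a positive return goes up and the alternatives go down by the same total amount, which is all the update does.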
Replacing the Monte Carlo return with a learned value estimate gives an actor-critic method. The actor is the policy π_θ, the critic is a value function V_φ(s) or Q_φ(s,a), and the two are trained concurrently. Konda and Tsitsiklis (NeurIPS 2000) gave the first formal analysis of two-time-scale actor-critic algorithms with linear critics. The key advantages are lower-variance gradients, online operation without waiting for episodes to terminate, and the ability to bootstrap off the value estimate (so credit can be assigned in environments without natural episode boundaries).
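The online character of the method shows up in a single-transition update. The sketch below is a minimal tabular version, assuming a softmax actor `theta[state, action]`, a table critic `values[state]`, and illustrative learning rates; it is not the two-time-scale algorithm analysed by Konda and Tsitsiklis.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def td_error(reward, value_s, value_next, gamma, done):
    """delta = r + gamma * V(s') - V(s); no bootstrap from terminal states."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

def actor_critic_step(theta, values, s, a, r, s_next, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.5):
    """Update critic and actor from one transition, using delta as the advantage."""
    delta = td_error(r, values[s], values[s_next], gamma, done)
    theta, values = theta.copy(), values.copy()
    values[s] += critic_lr * delta             # move V(s) toward the TD target
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0                      # one_hot(a) - pi(.|s)
    theta[s] += actor_lr * delta * grad_log_pi
    return theta, values, delta

theta0 = np.zeros((2, 2))
values0 = np.array([0.0, 1.0])
theta1, values1, delta = actor_critic_step(theta0, values0, s=0, a=0, r=1.0,
                                           s_next=1, done=False, gamma=0.9)
print(delta, values1, theta1[0])
```

No episode has to finish before the update is applied, which is the practical difference from the Monte Carlo estimator.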
Actor-critic methods come in on-policy and off-policy variants. On-policy critics are trained on data generated by the current policy and discarded after each update; this is what A2C, A3C, TRPO, and PPO do. Off-policy critics learn from a replay buffer of historical experience, which is more sample efficient but requires importance correction or special structure to remain stable; DDPG, TD3, and SAC are the canonical examples.
Mnih et al. (ICML 2016) introduced Asynchronous Advantage Actor-Critic (A3C), which runs many actor-learner threads in parallel on a single multi-core CPU. Each worker maintains its own environment and policy copy, computes gradients on short rollouts, and asynchronously applies them to a shared set of global parameters. The diversity of experience across workers decorrelates the updates, playing the stabilising role that a replay buffer plays in DQN without storing past transitions explicitly. A3C surpassed the Atari state of the art at the time while training in about half the wall-clock time of GPU-trained DQN, using only a single multi-core CPU. A2C is the simpler synchronous variant, in which workers wait for each other and a single batched update is applied; in practice A2C often matches A3C and is easier to tune.
A chronic problem with naive policy gradient steps is that even a small parameter step can cause a large policy change in regions where the policy is steep, which destabilises training. Schulman, Levine, Moritz, Jordan, and Abbeel (ICML 2015) addressed this in Trust Region Policy Optimisation (TRPO) by constraining each update to keep the new policy close to the old one in KL divergence:
maximise E[ π_θ(a|s) / π_θ_old(a|s) · A^{π_old}(s,a) ]
subject to E[ KL( π_θ_old(·|s) || π_θ(·|s) ) ] ≤ δ
TRPO solves this constrained problem by linearising the surrogate objective and quadratically approximating the KL constraint, which produces a natural policy gradient direction. The update direction is found via the conjugate gradient method (avoiding explicit storage of the Fisher information matrix), and a backtracking line search shrinks the step until the KL constraint is satisfied and the surrogate objective has improved. The trust region machinery makes TRPO stable across very different problem domains, but the conjugate gradient solve and line search make each iteration computationally heavy and the implementation fiddly.
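The conjugate gradient solve and the step scaling can be sketched in a few lines. The code below is a simplified illustration, assuming the Fisher matrix is available only through a Fisher-vector product function `fvp` (here a small explicit matrix stands in for it); the backtracking line search is omitted and all names are hypothetical.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x, with x = 0 initially
    p = g.copy()              # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trust_region_step(fvp, g, delta):
    """Scale the natural gradient direction so that 0.5 * s^T F s = delta."""
    x = conjugate_gradient(fvp, g)
    scale = np.sqrt(2.0 * delta / (x @ fvp(x)))
    return scale * x

F = np.array([[2.0, 0.0], [0.0, 1.0]])   # stand-in Fisher matrix (assumed)
g = np.array([2.0, 1.0])                 # stand-in policy gradient (assumed)
direction = conjugate_gradient(lambda v: F @ v, g)
step = trust_region_step(lambda v: F @ v, g, delta=0.01)
kl_quadratic = 0.5 * step @ F @ step
print(direction, kl_quadratic)
```

In a full implementation `fvp` would be computed by differentiating the KL divergence twice along a vector, and the proposed step would then be shortened by the line search if the true KL or the surrogate check fails.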
Schulman, Wolski, Dhariwal, Radford, and Klimov (arXiv 2017) proposed Proximal Policy Optimisation (PPO) as a much simpler alternative to TRPO. Instead of an explicit KL constraint, PPO penalises updates that move the probability ratio ρ_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) too far from 1. The clipped surrogate objective is
L^CLIP(θ) = E_t[ min( ρ_t(θ) · A_t , clip(ρ_t(θ), 1 − ε, 1 + ε) · A_t ) ]
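with ε typically set to 0.2. The clip operation flattens the objective as soon as the ratio leaves [1 − ε, 1 + ε] in the direction the advantage favours (above 1 + ε when A_t > 0, below 1 − ε when A_t < 0), removing the incentive for the optimiser to push the policy further. Taking the min with the unclipped term makes L^CLIP a pessimistic lower bound on the unclipped surrogate: the clip is ignored when it would make the objective look better and applies only when it makes the objective worse, so moves that hurt the surrogate are still fully penalised.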
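The behaviour of the clipped objective is easiest to see on the four sign combinations of ratio and advantage. The sketch below is a minimal NumPy version of L^CLIP, assuming log-probabilities and advantage estimates are already available as arrays; the function name and the example numbers are illustrative.

```python
import numpy as np

def ppo_clip_terms(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Per-sample clipped surrogate terms; their mean is L^CLIP."""
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.minimum(unclipped, clipped)

# Ratios of 1.5 and 0.5 paired with positive and negative advantages.
ratios = np.array([1.5, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0])
terms = ppo_clip_terms(np.log(ratios), np.zeros(4), advantages)
objective = terms.mean()
print(terms, objective)
```

The first and last samples are clipped (to 1.2 and −0.8), so their gradient with respect to the policy vanishes; the middle two keep their unclipped values because the ratio has moved in the direction that lowers the objective.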
Three things made PPO the dominant policy gradient algorithm in practice: it works with first-order optimisers like Adam (no conjugate gradient), it permits multiple epochs of minibatch SGD per batch of collected data (so sample efficiency is much better than vanilla A2C), and it is robust to a wide range of hyperparameters. A 2022 study by Huang et al. in the ICLR Blog Track catalogued 37 implementation details that affect PPO performance; despite this complexity, PPO remains the algorithm reached for first in deep RL projects. OpenAI Five (Dota 2), bipedal robots like Cassie, and the RLHF stage of InstructGPT and ChatGPT all used PPO or close variants, while AlphaStar (StarCraft II) used a related actor-critic with V-trace off-policy corrections rather than PPO itself.
A parallel line of work attacks the same continuous-control problem with off-policy actor-critic algorithms that learn from a replay buffer, which makes them much more sample efficient than on-policy PPO at the cost of more delicate tuning.
| Algorithm | Year | Policy type | Key idea | Notes |
|---|---|---|---|---|
| DPG | 2014 | Deterministic | Deterministic policy gradient theorem (Silver et al.) | Off-policy actor-critic with linear function approximation |
| DDPG | 2016 | Deterministic | DPG + DQN tricks (replay, target networks) | First effective deep RL for continuous control on pixels (Lillicrap et al.) |
| TD3 | 2018 | Deterministic | Twin critics, delayed actor updates, target policy smoothing | Fixes DDPG's overestimation bias (Fujimoto, van Hoof, Meger) |
| SAC | 2018 | Stochastic | Maximum-entropy RL, automatic temperature tuning | State of the art for continuous control; very stable (Haarnoja et al.) |
DDPG (Lillicrap et al., ICLR 2016) adapts the deterministic policy gradient of Silver et al. (ICML 2014) to deep networks by reusing the DQN tricks that stabilise off-policy Q-learning: a replay buffer, separate target networks for the actor and the critic, and Polyak-averaged target updates. The critic is trained by minimising the Bellman error on Q(s,a), and the deterministic actor is updated via the chain rule, ∇_θ J = E[ ∇_a Q_φ(s, a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s) ]. DDPG works on more than twenty simulated physics tasks including dexterous manipulation, legged locomotion, and end-to-end pixel control.
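The chain-rule actor update and the Polyak target update can be illustrated without neural networks. The sketch below assumes a scalar linear actor μ_θ(s) = θ·s and a hypothetical, analytically known critic Q(s, a) = −(a − 2s)², whose best action is a = 2s; both are stand-ins chosen for this example rather than anything from the DDPG paper.

```python
import numpy as np

def critic_action_gradient(states, actions):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (actions - 2.0 * states)

def deterministic_policy_gradient(theta, states):
    """Batch estimate of E[ dQ/da at a = mu_theta(s), times d mu_theta / d theta ]."""
    actions = theta * states              # mu_theta(s) = theta * s
    dq_da = critic_action_gradient(states, actions)
    dmu_dtheta = states                   # derivative of theta * s w.r.t. theta
    return np.mean(dq_da * dmu_dtheta)

def polyak_update(target, online, tau=0.005):
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    return (1.0 - tau) * target + tau * online

states = np.array([0.5, 1.0, -1.5])       # a small batch of states (assumed)
theta = 0.0
for _ in range(200):
    theta += 0.1 * deterministic_policy_gradient(theta, states)   # gradient ascent

target_theta = polyak_update(target=1.0, online=0.0, tau=0.1)
print(theta, target_theta)
```

Gradient ascent drives θ toward 2, the value at which the actor outputs the critic's preferred action in every state. In DDPG the critic is itself learned, so errors in ∇_a Q feed directly into the actor.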
TD3 (Fujimoto, van Hoof, and Meger, ICML 2018) addresses three failure modes of DDPG. First, deep Q-functions overestimate values because the max operator in the Bellman target picks up positive noise; TD3 trains two critics and uses the smaller of the two as the target. Second, errors in the critic propagate to the actor and feed back into the data; TD3 updates the actor (and the target networks) only every two critic updates. Third, deterministic policies overfit to narrow peaks of the Q-function; TD3 adds clipped noise to the target action, smoothing the Q-learning target across nearby actions. With these changes TD3 substantially outperforms DDPG on the OpenAI Gym continuous-control suite.
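The first and third fixes both live in the computation of the critic target. The sketch below is a minimal illustration, assuming the target actor and target critics are plain Python callables and that actions are bounded in [−action_limit, action_limit]; the argument names and default noise settings are assumptions made for this example.

```python
import numpy as np

def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               rng, gamma=0.99, noise_std=0.2, noise_clip=0.5, action_limit=1.0):
    """Bellman target with target policy smoothing and clipped double-Q."""
    next_action = target_actor(next_state)
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = np.clip(rng.normal(0.0, noise_std, size=np.shape(next_action)),
                    -noise_clip, noise_clip)
    smoothed = np.clip(next_action + noise, -action_limit, action_limit)
    # Clipped double-Q: take the smaller of the two target critics.
    q_min = np.minimum(target_q1(next_state, smoothed),
                       target_q2(next_state, smoothed))
    return reward + gamma * (1.0 - done) * q_min

# Toy stand-ins for the target networks (hypothetical).
actor = lambda s: 0.5 * s
q1 = lambda s, a: s + a
q2 = lambda s, a: s + a + 1.0

rng = np.random.default_rng(0)
next_state = np.array([1.0])
target_no_noise = td3_target(1.0, 0.0, next_state, actor, q1, q2, rng, noise_std=0.0)
target_terminal = td3_target(1.0, 1.0, next_state, actor, q1, q2, rng)
print(target_no_noise, target_terminal)
```

The delayed actor update is a scheduling choice in the training loop (for example, updating the actor only on every second critic step) and is not shown here.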
SAC (Haarnoja, Zhou, Abbeel, Levine, ICML 2018) takes a different route. It frames RL as maximum-entropy RL, maximising a modified objective J_MaxEnt(π) = E[ Σ_t ( r(s_t, a_t) + α H(π(·|s_t)) ) ]. The added entropy term encourages the policy to be as random as possible while still solving the task, which strongly improves exploration and produces robust policies. SAC trains a stochastic Gaussian actor and twin Q-critics off-policy from a replay buffer, with a temperature parameter α that can be tuned automatically by gradient descent towards a target entropy. Because of its sample efficiency, stability, and minimal hyperparameter tuning, SAC has become a default choice for real-world robotic learning.
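Two pieces of SAC follow directly from the entropy-augmented objective: the soft critic target and the temperature update. The sketch below is a minimal illustration, assuming next-state critic values and log-probabilities are already computed; it differentiates a temperature loss of the form L(α) = E[−α (log π(a|s) + H_target)] with α = exp(log_alpha), which is one common formulation and is treated here as an assumption.

```python
import numpy as np

def soft_q_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')) for non-terminal s'."""
    soft_value = np.minimum(q1_next, q2_next) - alpha * np.asarray(log_prob_next)
    return reward + gamma * (1.0 - done) * soft_value

def temperature_gradient(log_alpha, log_probs, target_entropy):
    """d/d(log_alpha) of E[-alpha * (log pi + target_entropy)], alpha = exp(log_alpha)."""
    alpha = np.exp(log_alpha)
    return -alpha * np.mean(np.asarray(log_probs) + target_entropy)

target = soft_q_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0,
                       log_prob_next=-1.0, alpha=0.5, gamma=0.9)

# Policy entropy estimate is 0.5, below the target entropy of 1.0.
grad = temperature_gradient(log_alpha=0.0, log_probs=[-0.5], target_entropy=1.0)
new_log_alpha = 0.0 - 0.1 * grad        # one gradient-descent step
print(target, grad, new_log_alpha)
```

When the policy's entropy falls below the target, the gradient is negative and a descent step raises α, strengthening the entropy bonus; the opposite happens when the policy is more random than required.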
Policy gradient methods scale especially well across many parallel environment workers, because their gradient is naturally an expectation that splits cleanly across actors.
The basic policy gradient estimator has very high variance, and almost every practical advance in the field can be read as a variance-reduction trick. The following table lists the main techniques and where they appear.
| Technique | What it does | Cost | Used in |
|---|---|---|---|
| State-value baseline V(s) | Subtracts a state-dependent baseline from the return | Need a learned V | REINFORCE with baseline, all actor-critics |
| Advantage estimation A(s,a) = Q − V | Centres the gradient on relative quality | Same | All modern actor-critics |
| Generalised Advantage Estimation (GAE) | Exponentially weighted multi-step advantage with parameter λ | One extra hyperparameter | TRPO, PPO, IMPALA |
| Bootstrapping with V(s') | Uses a learned value to truncate the Monte Carlo return | Bias if V is wrong | A2C, A3C, n-step actor-critic |
| Trust region constraint | Bounds policy change per step in KL divergence | Conjugate gradient solve | TRPO |
| Clipped surrogate objective | Clips probability ratio to bound the effective step | Implicit only | PPO |
| Importance sampling correction | Re-weights off-policy data to look on-policy | Variance from large ratios | Off-PAC, V-trace, Retrace |
| Twin critics | Takes minimum of two Q-estimates to fight overestimation | Double critic compute | TD3, SAC |
| Entropy regularisation | Adds α·H(π) to the objective | Need to tune α | A3C, SAC, PPO (small bonus) |
| Reward normalisation and clipping | Stabilises gradient magnitudes across tasks | Loses scale info | PPO, RND, most production code |
GAE is worth singling out. Schulman et al. (ICLR 2016) defined the advantage estimator
A^{GAE(γ,λ)}_t = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}, with δ_t = r_t + γV(s_{t+1}) − V(s_t)
which smoothly interpolates between high-variance Monte Carlo (λ → 1) and high-bias one-step TD (λ → 0). In practice a value of λ = 0.95 to 0.97 strikes the standard bias-variance balance and is the default in PPO and TRPO implementations.
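Over a finite rollout the sum is computed with a backward recursion, A_t = δ_t + γλ·A_{t+1}. The sketch below is a minimal version, assuming a single rollout segment with no episode boundary inside it and a bootstrap value `last_value` for the state after the final step (zero if that state is terminal); the function and argument names are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for one rollout segment."""
    values_ext = np.append(values, last_value)    # V(s_0), ..., V(s_T)
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running   # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
adv_mc = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
adv_td = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
print(adv_mc, adv_td)
```

With λ = 1 the result is the Monte Carlo return minus V(s_t), and with λ = 0 it is the one-step TD error, matching the two limits described above.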
The most consequential application of policy gradient methods in the 2020s is reinforcement learning from human feedback (RLHF), the technique used to align large language models with human preferences. The dominant algorithm for the RL stage of RLHF is PPO, and the link is direct enough that papers like InstructGPT (Ouyang et al., NeurIPS 2022) describe their method as "PPO on a learned reward model." The pipeline that produced ChatGPT, the original Claude, and Gemini chat models follows the same outline:
The KL penalty against the SFT reference is what prevents the policy from drifting into nonsense or reward-hacking the imperfect r_φ; without it, the optimiser will quickly find outputs that the reward model rates highly but humans hate. PPO's clipped objective adds a second layer of conservatism. Ouyang et al. reported that the 1.3B InstructGPT outputs were preferred to the 175B GPT-3 outputs despite a 100x parameter gap, an early demonstration of how powerful the alignment loop is. Subsequent work has explored alternatives such as DPO (Direct Preference Optimisation, Rafailov et al. 2023), which sidesteps the reward model and the PPO loop by training directly on preference data, but PPO remains the workhorse of production RLHF pipelines.
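One common way to implement the KL penalty is to fold it into the per-token reward that PPO sees. The sketch below assumes this shaping scheme: each generated token receives −β times the log-probability difference between the policy and the SFT reference, and the scalar reward-model score is added on the final token. The coefficient β, the function name, and the example numbers are assumptions for illustration, not details taken from any specific pipeline.

```python
import numpy as np

def kl_shaped_rewards(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Per-token rewards for one sampled response."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_reference = np.asarray(logp_reference, dtype=float)
    # Single-sample KL estimate per token: log pi(token) - log pi_ref(token).
    kl_estimate = logp_policy - logp_reference
    rewards = -beta * kl_estimate
    rewards[-1] += reward_model_score      # sequence-level score on the last token
    return rewards

rewards = kl_shaped_rewards(reward_model_score=1.0,
                            logp_policy=[-1.0, -2.0],
                            logp_reference=[-1.5, -1.5],
                            beta=0.1)
print(rewards)
```

Tokens that the policy makes more likely than the reference does are taxed, so the reward-model score has to outweigh the accumulated penalty before a large shift away from the reference pays off.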
The table below contrasts the major classes of model-free RL algorithms.
| Family | Examples | Action space | Sample efficiency | On-/off-policy | Typical stability | Where it shines |
|---|---|---|---|---|---|---|
| Pure value-based | Q-learning, DQN, SARSA | Discrete | High (off-policy + replay) | Off-policy (SARSA is on-policy) | Sometimes brittle | Atari, discrete control, gridworlds |
| Pure policy-based | REINFORCE | Discrete or continuous | Low | On-policy | Noisy but unbiased | Pedagogy, very simple problems |
| On-policy actor-critic | A2C, A3C, TRPO, PPO, IMPALA | Both | Medium | On-policy | Robust, especially PPO | Games, RLHF, large-scale parallel training |
| Off-policy actor-critic, deterministic | DDPG, TD3, D4PG | Continuous | High | Off-policy | TD3 stable; DDPG can collapse | Robotics, continuous control |
| Off-policy actor-critic, max-entropy | SAC | Continuous | Very high | Off-policy | Very stable | Real-world robotic learning |
| Hybrid with planning | AlphaZero-style policy + value + MCTS | Discrete with structure | High | Off-policy via self-play | Stable in self-play | Board games, search-amenable domains |
A loose rule of thumb: if the problem has a small discrete action set and you can collect lots of cheap experience, a value-based method like DQN is often sample efficient enough. If the actions are continuous, you almost certainly want a policy gradient method. If you also need stability and minimal tuning, start with PPO; if you need sample efficiency for real robots, start with SAC.
The theory of policy gradient methods is unusually clean for a deep learning topic. The main results are:
Policy gradient methods are notoriously sensitive to implementation choices. The following items account for most of the gap between a working and a non-working implementation.
A short list of production-quality libraries that implement policy gradient methods:
| Framework | Maintainer | Strengths | Notes |
|---|---|---|---|
| Stable-Baselines3 | DLR-RM | Reliable PPO, A2C, SAC, TD3, DDPG; easy to use | PyTorch, single-machine focus |
| RLlib (Ray) | Anyscale | Distributed scaling, multi-agent, broad algorithm coverage | Best for large clusters |
| Tianshou | THU-ML (Tsinghua) | Highly modular; fast PPO and SAC | Research-friendly |
| TorchRL | Meta AI | Modular primitives built on TensorDict; integrates with the PyTorch ecosystem | Newer; growing fast |
| CleanRL | community | Single-file, research-friendly implementations | Excellent for understanding details |
| Acme | DeepMind | JAX and TF backends, distributed | Used in DeepMind research |
| Brax | Google | JAX physics + RL, end-to-end on accelerators | Very fast for continuous control |
| Sample Factory | Petrenko et al. | High-throughput on-policy training | Used for ViZDoom and procgen leaderboards |
The 37 Implementation Details of PPO (Huang, Dossa, et al., ICLR Blog Track 2022) is the standard reference for understanding why nominally-equivalent implementations diverge in performance. A 2025 comparative study reported that Stable-Baselines3, CleanRL, and OpenAI Baselines achieved superhuman PPO performance rates around 50% in their benchmark trials, compared to under 15% for some other libraries, illustrating just how much implementation details matter.
Policy gradient methods are not a panacea. Their persistent weaknesses are: