# Policy gradient methods

> Source: https://aiwiki.ai/wiki/policy_gradient
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Policy gradient methods** are a family of [reinforcement learning](/wiki/reinforcement_learning) algorithms that directly parameterise the agent's policy and optimise it by stochastic gradient ascent on the expected return. Instead of first learning a value function and then deriving a policy from it (the classic value-based approach used by [Q-learning](/wiki/q-learning) and [SARSA](/wiki/sarsa)), a policy gradient method maintains a parameterised policy $$\pi_\theta(a \mid s)$$, collects trajectories by acting in the environment, and adjusts $$\theta$$ in the direction that increases the expected cumulative reward $$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$. The defining tool is the policy gradient theorem, which shows that this gradient equals $$\mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^\pi(s, a)]$$, an expectation that can be estimated from sampled experience without differentiating the environment dynamics. [2]

The modern lineage starts with Ronald J. Williams's REINFORCE paper in 1992 and was placed on a rigorous footing by the policy gradient theorem of Sutton, McAllester, Singh, and Mansour in 2000. [1][2] Over the next two decades the family expanded to include actor-critic methods, trust region methods (TRPO, 2015), proximal optimisation (PPO, 2017), deterministic policy gradients (DPG, DDPG, TD3), and maximum-entropy methods (SAC). [7][11] Today policy gradient algorithms drive a striking share of applied reinforcement learning, from OpenAI Five and AlphaStar through bipedal robot locomotion to the [RLHF](/wiki/rlhf) fine-tuning step that produced ChatGPT and Claude. The reinforcement-learning stage of InstructGPT, the model behind ChatGPT, was run with [Proximal Policy Optimisation (PPO)](/wiki/proximal_policy_optimization): "we fine-tuned the SFT model on our environment using PPO," the OpenAI authors write. [18]

## Background and motivation

A [Markov decision process](/wiki/markov_decision_process_mdp) (MDP) is defined by states s, actions a, transition dynamics $$p(s' \mid s, a)$$, reward function $$r(s, a)$$, and discount factor $$\gamma$$. The agent's goal is to choose actions that maximise the expected discounted return $$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$. Value-based methods such as Q-learning estimate $$Q(s, a)$$, the expected return of taking action a in state s and following the optimal policy thereafter, and pick the greedy action $$\arg\max_a Q(s, a)$$. This works well in small discrete action spaces but breaks down in two regimes that matter in practice: continuous action spaces, where the argmax is itself a non-trivial optimisation problem at every step, and stochastic policies, where the optimal behaviour is genuinely randomised (as in many partially observable or adversarial settings).

Policy gradient methods sidestep these problems by working directly with a parameterised policy. The policy can be a Gaussian over actions whose mean and standard deviation are network outputs, a softmax over discrete actions, a categorical mixture, or a deterministic function of the state. There are several reasons to want this:

- Continuous action spaces are handled naturally: the policy is just a probability density (or a deterministic point) over the real-valued action vector.
- Stochastic policies are first-class. In partially observed Markov decision processes (POMDPs) and competitive games, the optimal policy is often genuinely randomised, and a value-function-with-argmax setup cannot represent that.
- Policy parameters change smoothly under gradient updates, so small parameter changes lead to small policy changes. Value methods can swing the greedy action across a discontinuity from a tiny change in Q-values, which causes oscillation.
- Domain knowledge slips in through the policy architecture (Gaussian, mixture, autoregressive, hierarchical), the action parameterisation, and constraints baked into the network.
- Convergence guarantees under function approximation are sometimes stronger for policy methods. Sutton et al. proved local convergence of policy iteration with general differentiable function approximation, which value-based methods notoriously lack outside special cases. [2]

## What is the policy gradient theorem?

The central technical result is the **policy gradient theorem** of Sutton, McAllester, Singh, and Mansour (NeurIPS 1999, published 2000, NeurIPS proceedings pages 1057-1063). [2] For a parameterised policy π_θ(a|s) and a long-run performance measure J(θ) (either an episodic start-state value or an average reward), the gradient of performance with respect to θ has a clean form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^\pi(s, a)\right]
$$

The expectation is taken over the discounted state-visitation distribution d^π and the policy itself, and Q^π is the action-value function under the current policy. The crucial property, in the authors' words, is that "the gradient does not contain the term ∇_θ d^π," the change in the distribution of states, which would be hard to estimate because changing θ also changes which states are visited. [2] The visitation effect cancels. The identity relies on the score-function (log-derivative) trick: because ∇_θ π_θ = π_θ · ∇_θ log π_θ, the gradient of an expectation under π_θ can be rewritten as an expectation under π_θ, which is itself estimable from samples.
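As a minimal sketch of this estimator (not taken from the cited papers), the following Python compares the exact gradient with a sampled score-function estimate for a one-state problem with three actions. The softmax-over-logits policy, the fixed `q_values`, and all function names are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def exact_gradient(logits, q_values):
    """Sum over actions of pi(a) * grad log pi(a) * Q(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, q) in enumerate(zip(probs, q_values)):
        score = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += p * score[i] * q
    return grad

def sampled_gradient(logits, q_values, num_samples, rng):
    """Monte Carlo estimate: average of grad log pi(a) * Q(a) over a ~ pi."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    for a in actions:
        score = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += score[i] * q_values[a] / num_samples
    return grad

logits = [0.0, 0.5, -0.5]      # hypothetical policy parameters
q_values = [1.0, 2.0, 0.0]     # hypothetical action values, assumed known here
exact = exact_gradient(logits, q_values)
estimate = sampled_gradient(logits, q_values, 20000, random.Random(0))
print("exact   :", exact)
print("estimate:", estimate)
```

The sampled estimate never touches the environment dynamics; it only needs actions drawn from the policy and a value for each sampled action. In a real algorithm `q_values` is unknown and is replaced by one of the estimators in the table that follows.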

In practice the unknown Q^π is replaced by an estimator, and the various choices give a family of algorithms:

| Estimator for Q^π(s,a) | Resulting algorithm | Bias | Variance |
|---|---|---|---|
| Monte Carlo return $$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$ | REINFORCE | Unbiased | Very high |
| One-step bootstrapped TD: r + γ V(s') | One-step actor-critic | Biased (V is approximate) | Low |
| n-step return | n-step actor-critic | Tunable | Tunable |
| Generalised Advantage Estimation $$\mathrm{GAE}(\gamma, \lambda)$$ | A2C, A3C, TRPO, PPO with GAE | Tunable via $$\lambda$$ | Tunable via $$\lambda$$ |
| Advantage $$A(s, a) = Q(s, a) - V(s)$$ | Actor-critic with baseline | Same as Q estimator | Lower than Q alone |

Replacing the return $$G_t$$ with $$G_t - b(s)$$ for any state-dependent baseline $$b(s)$$ leaves the gradient unbiased while reducing variance, because $$\mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)] = 0$$ under $$\pi_\theta$$. Choosing $$b(s) = V^\pi(s)$$ gives the advantage form $$Q - V$$, which is the basis of virtually every modern [actor-critic](/wiki/actor_critic). [2]
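The zero-mean property of the baseline term follows in one line. For discrete actions (replace the sum with an integral for continuous ones), and using only the identity $$\nabla_\theta \pi_\theta = \pi_\theta \cdot \nabla_\theta \log \pi_\theta$$ and the fact that probabilities sum to one:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)\right]
= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1 = 0
$$

The step that pulls $$b(s)$$ out of the sum is where the state-only dependence matters: a baseline that depended on the action could not be factored out this way.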

## REINFORCE: the original Monte Carlo policy gradient

Williams's 1992 paper *Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning* (Machine Learning, volume 8, pages 229-256) introduced REINFORCE, the prototypical policy gradient method. [1] The paper presents "a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units," shown to "make weight adjustments in a direction that lies along the gradient of expected reinforcement." [1] The update is starkly simple. After running an episode and observing the return G_t from each timestep, every state-action pair is updated by

$$
\theta \leftarrow \theta + \alpha \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t))
$$

No critic, no bootstrapping, no replay. The gradient is unbiased. The catch is variance: G_t is a sum of many noisy random rewards, so its standard deviation grows roughly as √H in horizon H, and the gradient estimate is correspondingly noisy. Williams already noted that subtracting a baseline b(s_t) leaves the expected update unchanged but can dramatically reduce variance, and that an obvious good baseline is a learned estimate of V^π(s_t). [1] REINFORCE is rarely competitive on its own in modern deep RL benchmarks but it is still a useful pedagogical starting point and a building block: PPO, TRPO, and A3C all reduce to a variance-controlled version of REINFORCE in their inner loop.
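The update can be sketched in a few lines of Python. This is an illustrative tabular version, assuming a softmax policy with one list of logits per state and a constant baseline; the data layout and function names are hypothetical, and `rewards[t]` denotes the reward received after the action at step t.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t for every timestep, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, episode, gamma, alpha, baseline=0.0):
    """One REINFORCE update from a finished episode.

    theta:   dict mapping state -> list of action logits (softmax policy)
    episode: list of (state, action, reward) tuples in time order
    Returns a new parameter dict; gradients use the pre-update policy.
    """
    returns = discounted_returns([r for _, _, r in episode], gamma)
    new_theta = {s: list(logits) for s, logits in theta.items()}
    for (s, a, _), g in zip(episode, returns):
        probs = softmax(theta[s])
        for i, p in enumerate(probs):
            score = (1.0 if i == a else 0.0) - p   # d log pi(a|s) / d logit_i
            new_theta[s][i] += alpha * score * (g - baseline)
    return new_theta

theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
episode = [("s0", 1, 0.0), ("s1", 0, 1.0)]   # hypothetical two-step episode
updated = reinforce_update(theta, episode, gamma=0.9, alpha=0.1)
print(updated)
```

Both actions taken in the episode become more likely because both returns are positive, which is exactly the behaviour a baseline is meant to temper: with a baseline near the average return, only better-than-average actions are reinforced.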

## What is an actor-critic method?

Replacing the Monte Carlo return with a learned value estimate gives an **actor-critic** method. The actor is the policy π_θ, the critic is a value function V_φ(s) or Q_φ(s,a), and the two are trained concurrently. Konda and Tsitsiklis (NeurIPS 1999, proceedings pages 1008-1014) gave the first formal analysis of two-time-scale actor-critic algorithms with linear critics, in which "the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic." [3] The key advantages are lower-variance gradients, online operation without waiting for episodes to terminate, and the ability to bootstrap off the value estimate (so credit can be assigned in environments without natural episode boundaries).

Actor-critic methods come in on-policy and off-policy variants. On-policy critics are trained on data generated by the current policy and discarded after each update; this is what A2C, A3C, TRPO, and PPO do. Off-policy critics learn from a replay buffer of historical experience, which is more sample efficient but requires importance correction or special structure to remain stable; DDPG, TD3, and SAC are the canonical examples.
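A one-step on-policy actor-critic update might look like the following sketch. It assumes a tabular critic and a tabular softmax actor, which is far simpler than the neural networks used in practice; the function name, learning rates, and data layout are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(logits, values, s, a, r, s_next, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.5):
    """Update actor and critic in place from a single transition.

    logits: dict state -> list of action preferences (the actor)
    values: dict state -> float estimate of V(s)     (the critic)
    Returns the TD error, which doubles as a one-sample advantage estimate.
    """
    bootstrap = 0.0 if done else values[s_next]
    td_error = r + gamma * bootstrap - values[s]

    probs = softmax(logits[s])                 # policy before the update
    values[s] += critic_lr * td_error          # critic: TD(0) step
    for i, p in enumerate(probs):
        score = (1.0 if i == a else 0.0) - p   # d log pi(a|s) / d logit_i
        logits[s][i] += actor_lr * td_error * score
    return td_error

logits = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
values = {"A": 0.0, "B": 1.0}
delta = actor_critic_step(logits, values, "A", 0, 0.5, "B", False, gamma=0.9)
print(delta, values["A"], logits["A"])
```

Unlike the Monte Carlo update, nothing here waits for the episode to end: the bootstrap term `values[s_next]` stands in for the rest of the return.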

### A3C and A2C

Mnih et al. (ICML 2016) introduced **Asynchronous Advantage Actor-Critic (A3C)**, which runs many actor-learner threads in parallel on a single multi-core CPU. [10] Each worker maintains its own environment and policy copy, computes gradients on short rollouts, and asynchronously applies them to a single set of shared parameters. The diversity of experiences across workers acts as an implicit replay buffer and stabilises training without storing past transitions explicitly. A3C surpassed the Atari state of the art at the time while training in roughly half the wall-clock time, on a single multi-core CPU rather than the GPU used for DQN. [10] A2C is the simpler synchronous variant, where workers wait for each other and a single batched update is applied; in practice A2C often matches A3C and is easier to tune.

## How does TRPO bound each policy update?

A chronic problem with naive policy gradient steps is that even a small parameter step can cause a large policy change in regions where the policy is steep, which destabilises training. Schulman, Levine, Moritz, Jordan, and Abbeel (ICML 2015, arXiv submitted February 2015) addressed this in **Trust Region Policy Optimisation (TRPO)** by constraining each update to keep the new policy close to the old one in KL divergence: [7]

$$
\text{maximise} \quad \mathbb{E}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \cdot A^{\pi_{\text{old}}}(s, a) \right]
$$

$$
\text{subject to} \quad \mathbb{E}\left[ \mathrm{KL}\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
$$

TRPO solves this constrained problem by linearising the surrogate objective and quadratically approximating the KL constraint, which produces a natural policy gradient direction. The update direction is found via the conjugate gradient method (avoiding explicit storage of the Fisher information matrix), and a backtracking line search adjusts the step size until the KL constraint is satisfied and the surrogate objective improved. The authors report that TRPO "tends to give monotonic improvement, with little tuning of hyperparameters," across very different problem domains. [7] The cost is that the conjugate gradient solve and line search make each iteration computationally heavy and the implementation fiddly.
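The conjugate gradient step can be illustrated on a toy problem. The sketch below assumes a hand-written 2×2 matrix standing in for the Fisher information matrix and a made-up gradient vector; a real implementation would compute Fisher-vector products by automatic differentiation and would follow the step with the backtracking line search, which is omitted here.

```python
import math

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v)."""
    x = [0.0] * len(g)
    r = list(g)            # residual g - F x, with x = 0 initially
    p = list(g)            # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        step = rr / dot(p, Fp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def trust_region_step(fvp, g, delta):
    """Scale the natural gradient direction so that 0.5 * step^T F step = delta."""
    direction = conjugate_gradient(fvp, g)
    quad = dot(direction, fvp(direction))
    scale = math.sqrt(2.0 * delta / quad)
    return [scale * d for d in direction]

fisher = [[2.0, 0.5], [0.5, 1.0]]   # toy stand-in for the Fisher matrix
grad = [1.0, 1.0]                   # toy surrogate-objective gradient
fvp = lambda v: mat_vec(fisher, v)
step = trust_region_step(fvp, grad, delta=0.01)
print(step)
```

Because the solver only ever calls `fvp`, the matrix itself never has to be formed or inverted, which is what makes the approach feasible for networks with millions of parameters.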

## How does PPO differ from TRPO?

Schulman, Wolski, Dhariwal, Radford, and Klimov (arXiv:1707.06347, July 2017) proposed **Proximal Policy Optimisation (PPO)** as a much simpler alternative to TRPO that "has some of the benefits of trust region policy optimization (TRPO), but is much simpler to implement, more general, and has better sample complexity." [11] Instead of an explicit KL constraint, PPO penalises updates that move the probability ratio $$\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$$ too far from 1. The clipped surrogate objective is

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta) \cdot A_t,\; \mathrm{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon) \cdot A_t \right) \right]
$$

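with ε typically set to 0.2. [11] The clip operation flattens the objective as soon as the ratio leaves [1 − ε, 1 + ε] in the direction that would increase it (above 1 + ε when the advantage is positive, below 1 − ε when it is negative), removing the incentive for the optimiser to push the policy further. Taking the min with the unclipped term makes the objective a pessimistic lower bound on the unclipped surrogate: when the ratio has moved in the direction that lowers the objective, the unclipped term is kept, so the gradient can still pull the policy back.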
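The four cases of the clipped objective are easiest to see numerically. The sketch below evaluates the per-sample term for hand-picked ratios and advantages; the function names are illustrative, and a real implementation would operate on tensors and differentiate through the ratio.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch average of the per-sample terms (to be maximised)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Good action, ratio already above 1 + eps: gain is capped, gradient vanishes.
print(clipped_surrogate(1.5, 1.0))    # capped near 1.2
# Good action, ratio below 1 - eps: unclipped term is smaller, so it is kept.
print(clipped_surrogate(0.5, 1.0))    # 0.5
# Bad action, ratio above 1 + eps: unclipped term is smaller, so it is kept.
print(clipped_surrogate(1.5, -1.0))   # -1.5
# Bad action, ratio already below 1 - eps: capped, gradient vanishes.
print(clipped_surrogate(0.5, -1.0))   # capped near -0.8
```

In the two capped cases the term no longer depends on the ratio, so those samples contribute no gradient; in the other two the policy has moved the wrong way and the full penalty is retained.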

Three things made PPO the dominant policy gradient algorithm in practice: it works with first-order optimisers like Adam (no conjugate gradient), it permits multiple epochs of minibatch SGD per batch of collected data (so sample efficiency is much better than vanilla A2C), and it is robust to a wide range of hyperparameters. A 2022 study by Huang et al. in the ICLR Blog Track catalogued 37 implementation details that affect PPO performance; despite this complexity, PPO remains the algorithm reached for first in deep RL projects. [19] OpenAI Five (Dota 2), bipedal robots like Cassie, and the RLHF stage of InstructGPT and ChatGPT all used PPO or close variants, while AlphaStar (StarCraft II) was trained with a related off-policy actor-critic policy gradient method rather than PPO. [16][17][18]

## Off-policy actor-critic methods for continuous control

A parallel line of work attacks the same continuous-control problem with off-policy actor-critic algorithms that learn from a replay buffer, which makes them much more sample efficient than on-policy PPO at the cost of more delicate tuning.

| Algorithm | Year | Policy type | Key idea | Notes |
|---|---|---|---|---|
| DPG | 2014 | Deterministic | Deterministic policy gradient theorem (Silver et al.) | Off-policy actor-critic with linear function approximation [6] |
| [DDPG](/wiki/ddpg) | 2016 | Deterministic | DPG + DQN tricks (replay, target networks) | First effective deep RL for continuous control on pixels (Lillicrap et al.) [9] |
| [TD3](/wiki/td3) | 2018 | Deterministic | Twin critics, delayed actor updates, target policy smoothing | Fixes DDPG's overestimation bias (Fujimoto, van Hoof, Meger) [13] |
| [SAC](/wiki/soft_actor_critic) | 2018 | Stochastic | Maximum-entropy RL, automatic temperature tuning | State of the art for continuous control; very stable (Haarnoja et al.) [14] |

**DDPG** (Lillicrap et al., ICLR 2016) adapts the deterministic policy gradient of Silver et al. (ICML 2014) to deep networks by reusing the DQN tricks that stabilise off-policy Q-learning: a replay buffer, separate target networks for the actor and the critic, and Polyak-averaged target updates. [6][9] The critic is trained by minimising the Bellman error on $$Q(s, a)$$, and the deterministic actor is updated via the chain rule, $$\nabla_\theta J = \mathbb{E}\left[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s)\right]$$. DDPG works on more than twenty simulated physics tasks including dexterous manipulation, legged locomotion, and end-to-end pixel control. [9]
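The chain-rule update and the Polyak-averaged target can be sketched with a toy problem. Everything below is an illustrative assumption rather than DDPG itself: the critic is a fixed analytic function Q(s, a) = −(a − 2s)² instead of a learned network, the actor is the scalar map μ_θ(s) = θ·s, and there is no replay buffer.

```python
def critic_grad_a(s, a):
    """dQ/da for the toy critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)

def actor(theta, s):
    """Deterministic toy policy mu_theta(s) = theta * s."""
    return theta * s

def actor_grad_theta(s):
    """d mu_theta(s) / d theta for the toy policy."""
    return s

def dpg_gradient(theta, states):
    """Batch average of dQ/da evaluated at a = mu_theta(s), times d mu / d theta."""
    total = 0.0
    for s in states:
        a = actor(theta, s)
        total += critic_grad_a(s, a) * actor_grad_theta(s)
    return total / len(states)

def polyak_update(target, online, tau=0.005):
    """Move target parameters a fraction tau towards the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

theta = 0.0
states = [-1.0, 0.5, 1.0, 2.0]          # hypothetical minibatch of states
for _ in range(200):
    theta += 0.05 * dpg_gradient(theta, states)   # gradient ascent on Q
print(theta)
```

The toy critic is maximised at a = 2s, so ascent drives θ towards 2. Note that the actor never samples an action: the learning signal arrives entirely through the critic's gradient with respect to the action.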

**TD3** (Fujimoto, van Hoof, Meger, ICML 2018) addresses three failure modes of DDPG. [13] First, deep Q-functions overestimate values because the max operator in the Bellman target picks up positive noise; TD3 trains two critics and uses the smaller of the two as the target. Second, errors in the critic propagate to the actor and feed back into the data; TD3 updates the actor (and the target networks) only every two critic updates. Third, deterministic policies overfit to narrow peaks of the Q-function; TD3 adds clipped noise to the target action, smoothing the Q-learning target across nearby actions. Together these changes substantially outperform DDPG on the OpenAI Gym continuous-control suite. [13]
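A sketch of how the critic target combines two of these fixes (clipped double-Q and target policy smoothing) is given below for a scalar action. The callables `target_actor`, `target_q1`, and `target_q2` are hypothetical stand-ins for the target networks, and the noise and delay settings are commonly used values that should be treated as assumptions.

```python
import random

def clip(x, low, high):
    return max(low, min(x, high))

def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Bellman target y = r + gamma * min(Q1', Q2') at a smoothed target action."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, action_low, action_high)
    # Clipped double-Q: take the more pessimistic of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * (0.0 if done else 1.0) * q_min

# Toy stand-ins for the target networks (illustrative only).
mu = lambda s: 0.5 * s
q1 = lambda s, a: 1.0 + a
q2 = lambda s, a: 1.2 - a
y = td3_target(1.0, False, 0.4, mu, q1, q2, rng=random.Random(0))
print(y)

# Delayed updates: with policy_delay = 2 the actor is touched on every
# second critic step only.
policy_delay = 2
actor_steps = [step for step in range(1, 7) if step % policy_delay == 0]
print(actor_steps)   # [2, 4, 6]
```

Both critics are then regressed towards the same target `y`, which is what keeps a single optimistic critic from inflating the other.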

**SAC** (Haarnoja, Zhou, Abbeel, Levine, ICML 2018) takes a different route. [14] It frames RL as maximum-entropy RL, maximising a modified objective $$J_{\text{MaxEnt}}(\pi) = \mathbb{E}\left[\sum_t \left( r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right)\right]$$. The added entropy term encourages the policy to be as random as possible while still solving the task, which strongly improves exploration and produces robust policies. SAC trains a stochastic Gaussian actor and twin Q-critics off-policy from a replay buffer, with a temperature parameter $$\alpha$$ that can be tuned automatically by gradient descent towards a target entropy. Because of its sample efficiency, stability, and minimal hyperparameter tuning, SAC has become a default choice for real-world robotic learning.
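The effect of the entropy term is easiest to see in the soft state value, V(s) = E[Q(s, a) − α log π(a | s)]. The sketch below uses a discrete action set purely for readability, which is a simplification of the continuous Gaussian actor described above; the function names and numbers are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) in nats; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_state_value(probs, q_values, alpha):
    """V(s) = E_a[Q(s, a) - alpha * log pi(a|s)] = E_a[Q(s, a)] + alpha * H(pi)."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)

q_values = [1.0, 3.0]
alpha = 0.5
uniform = soft_state_value([0.5, 0.5], q_values, alpha)   # 2 + 0.5 * ln 2
greedy = soft_state_value([0.0, 1.0], q_values, alpha)    # 3, entropy is 0
print(uniform, greedy)
```

With a small α the greedy policy still scores higher here, but as α grows the entropy bonus increasingly rewards spreading probability over actions, which is the trade-off the temperature controls.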

## Asynchronous and distributed variants

Policy gradient methods scale especially well across many parallel environment workers, because their gradient is naturally an expectation that splits cleanly across actors.

- **A3C** (Mnih et al., ICML 2016): asynchronous advantage actor-critic with multiple CPU worker threads updating a shared set of parameters. [10]
- **A2C**: synchronous version of A3C; each step waits for all workers and applies a batched update.
- **IMPALA** (Espeholt et al., ICML 2018): decoupled actors and learners with the V-trace off-policy correction, achieving a throughput of about 250,000 frames per second while training on DMLab-30 and Atari-57. [12]
- **Ape-X** (Horgan et al., ICLR 2018): distributed prioritised experience replay; though formulated for DQN it generalises to off-policy actor-critics.
- **R2D2** and recurrent variants: LSTM-based policies trained over long sequences with stored hidden states.
- **MAPPO** and **MADDPG**: multi-agent extensions of PPO and DDPG.
- **D4PG** (Barth-Maron et al., ICLR 2018): distributional critic for distributed deep deterministic policy gradients.

## Variance reduction techniques

The basic policy gradient estimator has very high variance, and almost every practical advance in the field can be read as a variance-reduction trick. The following table lists the main techniques and where they appear.

| Technique | What it does | Cost | Used in |
|---|---|---|---|
| State-value baseline V(s) | Subtracts a state-dependent baseline from the return | Need a learned V | REINFORCE with baseline, all actor-critics |
| Advantage estimation A(s,a) = Q − V | Centres the gradient on relative quality | Need a learned V (as above) | All modern actor-critics |
| Generalised Advantage Estimation (GAE) | Exponentially weighted multi-step advantage with parameter λ | One extra hyperparameter | TRPO, PPO, A2C |
| Bootstrapping with V(s') | Uses a learned value to truncate the Monte Carlo return | Bias if V is wrong | A2C, A3C, n-step actor-critic |
| Trust region constraint | Bounds policy change per step in KL divergence | Conjugate gradient solve | TRPO |
| Clipped surrogate objective | Clips probability ratio to bound the effective step | No hard KL bound; the trust region is only implicit | PPO |
| Importance sampling correction | Re-weights off-policy data to look on-policy | Variance from large ratios | Off-PAC, V-trace, Retrace |
| Twin critics | Takes minimum of two Q-estimates to fight overestimation | Double critic compute | TD3, SAC |
| Entropy regularisation | Adds α·H(π) to the objective | Need to tune α | A3C, SAC, PPO (small bonus) |
| Reward normalisation and clipping | Stabilises gradient magnitudes across tasks | Loses scale info | PPO, RND, most production code |

GAE is worth singling out. Schulman, Moritz, Levine, Jordan, and Abbeel (ICLR 2016) defined "an exponentially-weighted estimator of the advantage function that is analogous to TD(λ)" and uses value functions "to substantially reduce the variance of policy gradient estimates at the cost of some bias": [8]

$$
A^{\mathrm{GAE}(\gamma, \lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \cdot \delta_{t+l}, \quad \text{with} \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

which smoothly interpolates between high-variance Monte Carlo (λ → 1) and high-bias one-step TD (λ → 0). [8] In practice a value of λ = 0.95 to 0.97 strikes the standard bias-variance balance and is the default in PPO and TRPO implementations.
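The infinite sum is computed in practice with a backward recursion over a finite rollout, A_t = δ_t + γλ·A_{t+1}. The following sketch assumes a single trajectory with no episode boundary inside it; the function name and argument layout are illustrative.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one rollout.

    rewards[t]: reward r_t received after acting in state s_t
    values[t]:  critic estimate V(s_t)
    last_value: V(s_T) used to bootstrap after the final step
                (pass 0.0 if the rollout ended in a terminal state)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
one_step = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0)
monte_carlo = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0)
print(one_step)      # the TD residuals themselves
print(monte_carlo)   # Monte Carlo return minus V(s_t)
```

Setting `lam=0.0` leaves only the one-step TD residuals, while `lam=1.0` recovers the Monte Carlo return minus the value estimate, matching the two limits described above.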

## How are policy gradients used in RLHF and LLM alignment?

The most consequential application of policy gradient methods in the 2020s is **[reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback)** (RLHF), the technique used to align large language models with human preferences. The dominant algorithm for the RL stage of RLHF is PPO. The link is direct: in InstructGPT (Ouyang et al., NeurIPS 2022), the authors state, "we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017)," and add a "per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model," with the value function "initialized from the RM." [18] The pipeline that produced [ChatGPT](/wiki/chatgpt), the original Claude, and Gemini chat models follows the same outline:

1. **Supervised fine-tuning (SFT):** the base language model is fine-tuned on demonstrations written by humans.
2. **Reward model training:** humans rank model outputs, and a reward model r_φ is trained to predict the human preference. (InstructGPT used 6B-parameter reward models, having found 175B reward-model training "could be unstable.") [18]
3. **Reinforcement learning:** the language model is treated as a stochastic policy that emits a sequence of tokens, and PPO is run with a reward derived from $$r_\phi$$, plus a per-token KL penalty $$\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$$ against the supervised reference policy.
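One common way to realise step 3, sketched here under stated assumptions rather than as the InstructGPT implementation, is to fold the KL penalty into a per-token reward: each generated token is charged β times the log-probability gap between the policy and the reference, and the reward-model score is added at the final token. The function name, the value of β, and the log-probabilities are all illustrative.

```python
def rlhf_token_rewards(policy_logprobs, ref_logprobs, reward_model_score, beta=0.02):
    """Per-token rewards for one sampled response (assumed non-empty).

    policy_logprobs[t]: log pi_theta(token_t | context) under the current policy
    ref_logprobs[t]:    log pi_ref(token_t | context) under the SFT reference
    The difference is a single-sample estimate of the per-token KL divergence
    when tokens are sampled from the current policy.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += reward_model_score   # sequence-level score lands on the last token
    return rewards

policy_lp = [-1.0, -2.0, -0.5]   # hypothetical values
ref_lp = [-1.5, -2.0, -0.1]
rewards = rlhf_token_rewards(policy_lp, ref_lp, reward_model_score=1.0, beta=0.1)
print(rewards)
```

These per-token rewards are then fed to the usual PPO machinery (value function, advantage estimation, clipped objective) exactly as environment rewards would be. Tokens the policy makes more likely than the reference are penalised, and tokens it makes less likely receive a small bonus.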

The KL penalty against the SFT reference is what prevents the policy from drifting into nonsense or reward-hacking the imperfect r_φ; without it, the optimiser will quickly find outputs that the reward model rates highly but humans hate. PPO's clipped objective adds a second layer of conservatism. Ouyang et al. reported that "outputs from the 1.3B parameter [InstructGPT](/wiki/instructgpt) model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters," an early demonstration of how powerful the alignment loop is. [18] Subsequent work has explored alternatives such as [DPO](/wiki/direct_preference_optimization_dpo) (Direct Preference Optimisation, Rafailov et al. 2023), which sidesteps the reward model and the PPO loop by training directly on preference data, but PPO remains a workhorse of production RLHF pipelines.

## How do policy gradient methods compare with value-function methods?

The table below contrasts the major classes of model-free RL algorithms.

| Family | Examples | Action space | Sample efficiency | On-/off-policy | Typical stability | Where it shines |
|---|---|---|---|---|---|---|
| Pure value-based | [Q-learning](/wiki/q-learning), DQN, [SARSA](/wiki/sarsa) | Discrete | High (off-policy + replay) | Off-policy (SARSA is on-policy) | Sometimes brittle | Atari, discrete control, gridworlds |
| Pure policy-based | REINFORCE | Discrete or continuous | Low | On-policy | Noisy but unbiased | Pedagogy, very simple problems |
| On-policy actor-critic | A2C, A3C, TRPO, PPO, IMPALA | Both | Medium | On-policy (IMPALA adds an off-policy correction) | Robust, especially PPO | Games, RLHF, large-scale parallel training |
| Off-policy actor-critic, deterministic | DDPG, TD3, D4PG | Continuous | High | Off-policy | TD3 stable; DDPG can collapse | Robotics, continuous control |
| Off-policy actor-critic, max-entropy | SAC | Continuous | Very high | Off-policy | Very stable | Real-world robotic learning |
| Hybrid with planning | AlphaZero-style policy + value + MCTS | Discrete with structure | High | Off-policy via self-play | Stable in self-play | Board games, search-amenable domains |

A loose rule of thumb: if the problem has a small discrete action set and you can collect lots of cheap experience, a value-based method like DQN is often sample efficient enough. If the actions are continuous, you almost certainly want a policy gradient method. If you also need stability and minimal tuning, start with PPO; if you need sample efficiency for real robots, start with SAC.

## Theoretical results

The theory of policy gradient methods is unusually clean for a deep learning topic. The main results are:

- **Policy gradient theorem** (Sutton, McAllester, Singh, Mansour 2000): the expression for ∇_θ J above, valid for both episodic and average-reward formulations, with and without a state-dependent baseline. [2]
- **Compatible function approximation** (same paper): if the critic is linear in features that match ∇_θ log π_θ, the substitute Q_w gives an unbiased gradient estimate. [2]
- **Convergence under linear function approximation** (Konda and Tsitsiklis 2000): two-time-scale actor-critic with a linear critic converges almost surely to a local maximum of J under standard step-size conditions. [3]
- **Conservative policy iteration** (Kakade and Langford, ICML 2002): mixing the new and old policies with a small mixture coefficient guarantees monotonic improvement, and motivates the trust-region perspective. [5]
- **Natural policy gradient** (Kakade, NeurIPS 2001): premultiplying the gradient by the inverse Fisher information matrix gives the steepest ascent direction in the natural geometry of the policy manifold and accelerates convergence. [4]
- **Mirror descent perspective** (Neu et al. 2017, Tomar et al. 2020): TRPO and PPO are special cases of mirror descent in the space of policies with a KL Bregman divergence, which clarifies why the surrogate objective works.
- **Global convergence under tabular and softmax parameterisations** (Agarwal, Kakade, Lee, Mahajan 2021, *On the Theory of Policy Gradient Methods*): finite-time guarantees for natural policy gradient and projected policy gradient in idealised settings. [21]

## Practical considerations

Policy gradient methods are notoriously sensitive to implementation choices. The following items account for most of the gap between a working and a non-working implementation.

- **Hyperparameters:** PPO's clip ratio ε, learning rate, GAE λ, number of epochs per batch, and minibatch size all matter. Defaults of ε = 0.2, lr ≈ 3e-4, λ = 0.95, 10 epochs, and minibatches of 64 samples work for many problems but rarely all. [19]
- **Reward shaping and clipping:** clipping rewards to a fixed range or normalising them with a running mean and standard deviation often makes training tractable on tasks with very different reward scales. The 2018 Ilyas et al. study *A Closer Look at Deep Policy Gradients* showed that several of PPO's benefits widely attributed to the algorithm itself in fact come from implementation details of this kind. [22]
- **Entropy regularisation:** small entropy bonuses (β ≈ 0.01) prevent premature collapse to a deterministic policy in PPO and A3C; SAC promotes entropy to a first-class objective and tunes its weight automatically. [14]
- **GAE λ:** λ = 0 reduces to one-step TD (low variance, high bias), λ = 1 to Monte Carlo (high variance, no bias). Most production work sits between 0.9 and 0.99. [8]
- **Parallel environment collection:** PPO scales nearly linearly in the number of parallel environments because the gradient is an expectation. OpenAI Five used a large fleet of CPU workers; even modest projects benefit from 16 to 128 parallel environments. [16]
- **Network initialisation:** orthogonal initialisation of the policy and value heads with small final-layer scales is a standard PPO trick that materially affects early training. [19]
- **Action space scaling:** continuous policies output actions in a normalised range and scale them to the environment range, which keeps gradients well-behaved.
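Two of the items above, running reward normalisation and action scaling, are small enough to sketch directly. The class and function names below are hypothetical, the clipping threshold is an assumed value, and the normaliser uses Welford's online algorithm for the running mean and variance.

```python
import math

class RunningNormalizer:
    """Tracks a running mean and standard deviation (Welford's algorithm)."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean
        self.epsilon = epsilon

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0         # avoid dividing by zero before any spread is seen
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x, clip=10.0):
        """Standardise x and clip the result to [-clip, clip]."""
        z = (x - self.mean) / (self.std + self.epsilon)
        return max(-clip, min(z, clip))

def scale_action(unit_action, low, high):
    """Map a policy output in [-1, 1] to the environment range [low, high]."""
    return low + 0.5 * (unit_action + 1.0) * (high - low)

normalizer = RunningNormalizer()
for reward in [1.0, 2.0, 3.0, 4.0]:
    normalizer.update(reward)
print(normalizer.mean, normalizer.std)
print(normalizer.normalize(2.5), normalizer.normalize(1e9))
print(scale_action(0.0, -2.0, 4.0))
```

Keeping the statistics online means the normaliser can be updated as rollouts arrive, without storing past rewards.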

## Frameworks and implementations

A short list of production-quality libraries that implement policy gradient methods:

| Framework | Maintainer | Strengths | Notes |
|---|---|---|---|
| Stable-Baselines3 | DLR-RM | Reliable PPO, A2C, SAC, TD3, DDPG; easy to use | PyTorch, single-machine focus |
| RLlib (Ray) | Anyscale | Distributed scaling, multi-agent, broad algorithm coverage | Best for large clusters |
| Tianshou | THU-ML (Tsinghua) | Highly modular; fast PPO and SAC | Research-friendly |
| TorchRL | Meta AI | Modular, composable primitives; integrates with PyTorch ecosystem | Newer; growing fast |
| CleanRL | community | Single-file, research-friendly implementations | Excellent for understanding details |
| Acme | DeepMind | JAX and TF backends, distributed | Used in DeepMind research |
| Brax | Google | JAX physics + RL, end-to-end on accelerators | Very fast for continuous control |
| Sample Factory | Petrenko et al. | High-throughput on-policy training | Used for ViZDoom and procgen leaderboards |

The 37 Implementation Details of PPO (Huang, Dossa, et al., ICLR Blog Track 2022) is the standard reference for understanding why nominally-equivalent implementations diverge in performance. [19]

## Real-world applications

- **Game playing:** OpenAI Five used PPO, "a policy gradient method," running on 256 GPUs and 128,000 CPU cores and playing about 180 years' worth of games per day, to defeat the Dota 2 world champions OG in San Francisco in April 2019, the first time an AI system beat an esports world champion at a professional event. [16] AlphaStar reached Grandmaster level in StarCraft II in 2019, ranking above 99.8% of human players, using a multi-agent reinforcement learning approach with population-based league training (Vinyals et al., Nature 575, 350-354). [17] AlphaGo and AlphaZero combine a policy network trained partly with policy gradient ideas with a Monte Carlo tree search planner.
- **Robotics:** PPO is a workhorse for sim-to-real bipedal locomotion. Cassie, the bipedal robot built by Agility Robotics and Oregon State University, learned to walk, run, and climb stairs from neural-network controllers trained with PPO-style reinforcement learning in simulation and transferred to hardware (Xie et al., Conference on Robot Learning 2020). [23] Quadruped locomotion (Hwangbo et al. 2019, on the ANYmal robot) and dexterous in-hand manipulation (OpenAI 2019, on the Shadow Hand) also rely on PPO-style training.
- **LLM alignment:** the [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) stage of InstructGPT, ChatGPT, the early Claude models, Gemini chat models, and a long tail of open-source instruction-tuned models all run PPO against a learned reward model with a KL penalty against the supervised reference policy. [18] This is the single largest commercial deployment of a policy gradient algorithm to date.
- **Autonomous driving:** policy gradient methods drive lane keeping, decision making, and motion planning in research and increasingly in production. Wayve, for instance, has publicly described training driving policies with deep RL.
- **Recommendation systems:** YouTube, Spotify, and others have published work on using policy gradients (REINFORCE-style methods with off-policy correction) to learn slate recommendations and next-video policies.
- **Resource scheduling and traffic control:** RL-based controllers are applied to traffic signal control and to data-centre and industrial control problems, where policies are optimised against operational cost or energy objectives.
- **Healthcare and finance:** RL-based treatment policies and portfolio optimisation use policy gradient methods, though deployment is limited by the difficulty of safe exploration.

## Limitations

Policy gradient methods are not a panacea. Their persistent weaknesses are:

- **Sample efficiency.** On-policy methods like PPO discard data after each update and require many environment interactions; for small discrete environments, value-based methods with replay are often dramatically more sample efficient.
- **High variance.** Even with GAE and baselines, the gradient estimator is inherently noisy. Hyperparameter sweeps that look fine on average can show huge run-to-run variation.
- **Reward design.** The output of policy gradient training is only as good as the reward function. Reward hacking (Krakovna et al. 2020 catalogued dozens of cases) is endemic, and shaping rewards by hand is brittle.
- **Hyperparameter sensitivity.** As Ilyas et al. [22] and "The 37 Implementation Details of Proximal Policy Optimization" [19] documented, even canonical PPO depends heavily on implementation details that the original paper did not emphasise.
- **Local optima.** Gradient ascent on a non-convex policy objective can converge to local maxima that are far from globally optimal, and exploration alone is not enough to escape them in many high-dimensional problems.
- **Catastrophic forgetting.** Off-policy actor-critics in particular can suddenly collapse during training if the replay distribution drifts; TD3 and SAC mitigate this but do not eliminate it.
- **Credit assignment over long horizons.** Discounted returns blur cause and effect over hundreds of timesteps; hierarchical and option-based extensions help but remain an active research area.
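
To illustrate the high-variance point above, and why baselines help without removing the problem, here is a small sketch on a hypothetical two-armed bandit. The policy picks action 1 with probability `sigmoid(theta)`. The reward means (10 and 11), the unit noise, and the baseline value of 10.5 (the expected reward at `theta = 0`) are all assumptions chosen for illustration. The true gradient at `theta = 0` is 0.25 in this setup.

```python
import math
import random
import statistics


def sample_gradients(theta, baseline, n_samples, seed=0):
    """Sample single-step score-function gradient estimates for a 2-armed bandit."""
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))   # probability of choosing action 1
    mean_reward = {0: 10.0, 1: 11.0}       # assumed reward means (illustrative)
    grads = []
    for _ in range(n_samples):
        action = 1 if rng.random() < p1 else 0
        reward = mean_reward[action] + rng.gauss(0.0, 1.0)
        # d/dtheta log pi(action): (1 - p1) for action 1, -p1 for action 0.
        score = (1.0 - p1) if action == 1 else -p1
        grads.append(score * (reward - baseline))
    return grads


# Same seed, so both estimators see identical actions and reward noise.
no_baseline = sample_gradients(theta=0.0, baseline=0.0, n_samples=10_000)
with_baseline = sample_gradients(theta=0.0, baseline=10.5, n_samples=10_000)

print("mean / variance without baseline:",
      statistics.mean(no_baseline), statistics.pvariance(no_baseline))
print("mean / variance with baseline:   ",
      statistics.mean(with_baseline), statistics.pvariance(with_baseline))
```

Both estimators target the same gradient. In expectation, the per-sample variance is about 27.8 without the baseline and 0.25 with it. The residual variance comes from the reward noise itself, which a constant baseline cannot remove.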

## See also

- [Reinforcement learning](/wiki/reinforcement_learning)
- [Proximal Policy Optimization (PPO)](/wiki/proximal_policy_optimization)
- [Q-learning](/wiki/q-learning)
- [SARSA](/wiki/sarsa)
- [Reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback)
- [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo)
- [Temporal difference learning](/wiki/temporal_difference_learning)
- [Importance sampling](/wiki/importance_sampling)

## References

1. Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." *Machine Learning*, 8, 229-256. https://link.springer.com/article/10.1007/BF00992696
2. Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." *Advances in Neural Information Processing Systems 12*, pp. 1057-1063. http://papers.neurips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
3. Konda, V. R. and Tsitsiklis, J. N. (2000). "Actor-Critic Algorithms." *Advances in Neural Information Processing Systems 12*, pp. 1008-1014. https://proceedings.neurips.cc/paper/1786-actor-critic-algorithms.pdf
4. Kakade, S. M. (2001). "A Natural Policy Gradient." *Advances in Neural Information Processing Systems 14*. https://homes.cs.washington.edu/~sham/papers/rl/natural.pdf
5. Kakade, S. and Langford, J. (2002). "Approximately Optimal Approximate Reinforcement Learning." *International Conference on Machine Learning*. https://www.cs.cmu.edu/~jcl/presentation/RL/RL.ps
6. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). "Deterministic Policy Gradient Algorithms." *International Conference on Machine Learning*. https://proceedings.mlr.press/v32/silver14.pdf
7. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). "Trust Region Policy Optimization." *International Conference on Machine Learning*. https://arxiv.org/abs/1502.05477
8. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2016). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." *International Conference on Learning Representations*. https://arxiv.org/abs/1506.02438
9. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). "Continuous control with deep reinforcement learning." *International Conference on Learning Representations*. https://arxiv.org/abs/1509.02971
10. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). "Asynchronous Methods for Deep Reinforcement Learning." *International Conference on Machine Learning*. https://arxiv.org/abs/1602.01783
11. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. https://arxiv.org/abs/1707.06347
12. Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. (2018). "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures." *International Conference on Machine Learning*. https://arxiv.org/abs/1802.01561
13. Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *International Conference on Machine Learning*. https://arxiv.org/abs/1802.09477
14. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *International Conference on Machine Learning*. https://arxiv.org/abs/1801.01290
15. Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, 2nd ed., Chapter 13: Policy Gradient Methods. MIT Press. http://incompleteideas.net/book/the-book-2nd.html
16. OpenAI, et al. (2019). "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv:1912.06680. https://arxiv.org/abs/1912.06680
17. Vinyals, O., et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning." *Nature*, 575(7782), 350-354. https://www.nature.com/articles/s41586-019-1724-z
18. Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
19. Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., and Wang, W. (2022). "The 37 Implementation Details of Proximal Policy Optimization." ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
20. Achiam, J. (2018). "Spinning Up in Deep Reinforcement Learning." OpenAI documentation: Vanilla Policy Gradient, TRPO, PPO, DDPG, TD3, SAC. https://spinningup.openai.com/
21. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). "On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift." *Journal of Machine Learning Research*. https://jmlr.org/papers/volume22/19-736/19-736.pdf
22. Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. (2020). "A Closer Look at Deep Policy Gradients." *International Conference on Learning Representations* (arXiv:1811.02553). https://arxiv.org/abs/1811.02553
23. Xie, Z., Clary, P., Dao, J., Morais, P., Hurst, J., and van de Panne, M. (2020). "Learning Locomotion Skills for Cassie: Iterative Design and Sim-to-Real." *Conference on Robot Learning (CoRL) 2019*, PMLR 100. https://proceedings.mlr.press/v100/xie20a.html