Policy gradient methods

Policy gradient methods are a family of reinforcement learning algorithms that directly parameterise the agent's policy and optimise it by stochastic gradient ascent on the expected return. Instead of first learning a value function and then deriving a policy from it (the classic value-based approach used by Q-learning and SARSA), a policy gradient method maintains a parameterised policy π_θ(a|s), collects trajectories by acting in the environment, and adjusts θ in the direction that increases the expected cumulative reward J(θ) = E_{τ ~ π_θ}[R(τ)].

The modern lineage starts with Ronald J. Williams's REINFORCE paper in 1992 and was placed on a rigorous footing by the policy gradient theorem of Sutton, McAllester, Singh, and Mansour in 2000. Over the next two decades the family expanded to include actor-critic methods, trust region methods (TRPO), proximal optimisation (PPO), deterministic policy gradients (DPG, DDPG, TD3), and maximum-entropy methods (SAC). Today policy gradient algorithms drive a striking share of applied reinforcement learning, from OpenAI Five and AlphaStar through bipedal robot locomotion to the RLHF fine-tuning step that produced ChatGPT and Claude.

Background and motivation

A Markov decision process (MDP) is defined by states s, actions a, transition dynamics p(s'|s,a), reward function r(s,a), and discount factor γ. The agent's goal is to choose actions that maximise the expected discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. Value-based methods such as Q-learning estimate Q(s,a), the expected return of taking action a in state s and following the optimal policy thereafter, and pick the greedy action argmax_a Q(s,a). This works well in small discrete action spaces but breaks down in two regimes that matter in practice: continuous action spaces, where the argmax is itself a non-trivial optimisation problem at every step, and stochastic policies, where the optimal behaviour is genuinely randomised (as in many partially observable or adversarial settings).

Policy gradient methods sidestep these problems by working directly with a parameterised policy. The policy can be a Gaussian over actions whose mean and standard deviation are network outputs, a softmax over discrete actions, a categorical mixture, or a deterministic function of the state. There are several reasons to want this:

Continuous action spaces are handled naturally: the policy is just a probability density (or a deterministic point) over the real-valued action vector.
Stochastic policies are first-class. In partially observed Markov decision processes (POMDPs) and competitive games, the optimal policy is often genuinely randomised, and a value-function-with-argmax setup cannot represent that.
Action probabilities vary smoothly with the policy parameters, so the policy changes gradually under gradient updates. Value methods can swing the greedy action across a discontinuity from a tiny change in Q-values, which causes oscillation.
Domain knowledge slips in through the policy architecture (Gaussian, mixture, autoregressive, hierarchical), the action parameterisation, and constraints baked into the network.
Convergence guarantees under function approximation are sometimes stronger for policy methods. Sutton et al. proved local convergence of policy iteration with general differentiable function approximation, which value-based methods notoriously lack outside special cases.

The policy gradient theorem

The central technical result is the policy gradient theorem of Sutton, McAllester, Singh, and Mansour (NeurIPS 1999, published 2000). For a parameterised policy π_θ(a|s) and a long-run performance measure J(θ) (either an episodic start-state value or an average reward), the gradient of performance with respect to θ has a clean form:

∇_θ J(θ) = E_{s ~ d^π, a ~ π_θ}[∇_θ log π_θ(a|s) · Q^π(s, a)]

The expectation is taken over the discounted state-visitation distribution d^π and the policy itself, and Q^π is the action-value function under the current policy. The crucial property is that the gradient does not contain the term ∇_θ d^π, which would be hard to estimate because changing θ also changes which states are visited. The visitation effect cancels.
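In the single-state (bandit) case the theorem can be checked numerically. The sketch below assumes a softmax policy over three actions with fixed, made-up action values q, and compares the score-function form Σ_a π(a) ∇ log π(a) q(a) against a finite-difference gradient of J(θ) = Σ_a π_θ(a) q(a). All names and numbers are illustrative.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_return(theta, q):
    # J(theta) for a one-state problem: sum_a pi(a) * q(a)
    return softmax(theta) @ q

def score_function_gradient(theta, q):
    # sum_a pi(a) * grad log pi(a) * q(a), with grad log pi(a) = onehot(a) - pi
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        score = -probs.copy()
        score[a] += 1.0
        grad += probs[a] * score * q[a]
    return grad

def finite_difference_gradient(theta, q, h=1e-6):
    # Central differences on J, used only as an independent check
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[i] = h
        grad[i] = (expected_return(theta + bump, q)
                   - expected_return(theta - bump, q)) / (2 * h)
    return grad

theta = np.array([0.2, -0.4, 1.0])   # hypothetical policy logits
q = np.array([1.0, 0.0, 2.0])        # hypothetical action values
pg = score_function_gradient(theta, q)
fd = finite_difference_gradient(theta, q)
```

The two vectors pg and fd should agree up to finite-difference error. In the multi-state case the same identity holds with the state-visitation weighting described above.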

In practice the unknown Q^π is replaced by an estimator, and the various choices give a family of algorithms:

Estimator for Q^π(s,a)	Resulting algorithm	Bias	Variance
Monte Carlo return G_t = Σ_{k≥0} γ^k r_{t+k+1}	REINFORCE	Unbiased	Very high
One-step bootstrapped TD: r + γ V(s')	One-step actor-critic	Biased (V is approximate)	Low
n-step return	n-step actor-critic	Tunable	Tunable
Generalised Advantage Estimation GAE(γ, λ)	A2C, A3C, TRPO, PPO with GAE	Tunable via λ	Tunable via λ
Advantage A(s,a) = Q(s,a) − V(s)	Actor-critic with baseline	Same as Q estimator	Lower than Q alone

Replacing the return G_t with G_t − b(s) for any state-dependent baseline b(s) leaves the gradient unbiased while reducing variance, because E[∇_θ log π_θ(a|s) · b(s)] = 0 under π_θ. Choosing b(s) = V^π(s) gives the advantage form Q − V, which is the basis of virtually every modern actor-critic.
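For a discrete action set, the zero-mean property of the baseline term follows in one line. This is a sketch; the continuous case replaces the sum with an integral, assuming the usual conditions that allow gradient and integral to be exchanged.

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0 .
```

The argument depends on b(s) not varying with the action; a baseline that depends on a would not factor out of the sum.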

REINFORCE: the original Monte Carlo policy gradient

Williams's 1992 paper Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning introduced REINFORCE, the prototypical policy gradient method. The update is starkly simple. After running an episode and observing the return G_t from each timestep, every state-action pair is updated by

θ ← θ + α · ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t))

No critic, no bootstrapping, no replay. The gradient is unbiased. The catch is variance: G_t is a sum of many noisy random rewards, so its standard deviation grows roughly as √H in horizon H, and the gradient estimate is correspondingly noisy. Williams already noted that subtracting a baseline b(s_t) leaves the expected update unchanged but can dramatically reduce variance, and that an obvious good baseline is a learned estimate of V^π(s_t). REINFORCE is rarely competitive on its own in modern deep RL benchmarks but it is still a useful pedagogical starting point and a building block: PPO, TRPO, and A3C all reduce to a variance-controlled version of REINFORCE in their inner loop.
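A minimal sketch of the update, assuming a two-armed bandit with made-up reward means, a softmax policy with one logit per arm, and a running-mean baseline. The environment, step sizes, and function names are all illustrative rather than taken from Williams's paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def discounted_returns(rewards, gamma):
    """Discounted return from each timestep; rewards[t] follows the action at step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                  # one logit per arm
    arm_means = np.array([0.0, 1.0])     # hypothetical rewards: arm 1 is better
    baseline = 0.0                       # running mean of observed returns
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choice(2, p=probs)
        reward = arm_means[action] + 0.1 * rng.standard_normal()
        G = discounted_returns([reward], gamma=0.99)[0]   # one-step episode
        grad_log_pi = -probs             # grad of log softmax: onehot(action) - probs
        grad_log_pi[action] += 1.0
        theta += alpha * grad_log_pi * (G - baseline)
        baseline += 0.05 * (G - baseline)
    return softmax(theta)

final_probs = reinforce_bandit()
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

With these settings the policy should end up placing most of its probability on the better arm. Longer episodes use the same loop, with one update per timestep and the return computed by discounted_returns.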

Actor-critic methods

Replacing the Monte Carlo return with a learned value estimate gives an actor-critic method. The actor is the policy π_θ, the critic is a value function V_φ(s) or Q_φ(s,a), and the two are trained concurrently. Konda and Tsitsiklis (NeurIPS 2000) gave the first formal analysis of two-time-scale actor-critic algorithms with linear critics. The key advantages are lower-variance gradients, online operation without waiting for episodes to terminate, and the ability to bootstrap off the value estimate (so credit can be assigned in environments without natural episode boundaries).

Actor-critic methods come in on-policy and off-policy variants. On-policy critics are trained on data generated by the current policy and discarded after each update; this is what A2C, A3C, TRPO, and PPO do. Off-policy critics learn from a replay buffer of historical experience, which is more sample efficient but requires importance correction or special structure to remain stable; DDPG, TD3, and SAC are the canonical examples.
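The basic on-policy loop can be sketched as a single tabular update in which the one-step TD error serves both as the critic's learning signal and as the actor's advantage estimate. The table shapes, step sizes, and function name below are assumptions for illustration.

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.5):
    """One tabular one-step actor-critic update (in place).

    theta: [n_states, n_actions] softmax logits (actor)
    V:     [n_states] state-value estimates (critic)
    """
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]              # also the advantage estimate
    V[s] += alpha_critic * td_error       # critic update

    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log_pi = -probs                  # onehot(a) - probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi   # actor update
    return td_error

theta = np.zeros((2, 2))
V = np.zeros(2)
td = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this one transition the critic has moved V[0] halfway toward the target, and the actor has shifted probability toward the action that produced a positive TD error.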

A3C and A2C

Mnih et al. (ICML 2016) introduced Asynchronous Advantage Actor-Critic (A3C), which runs many actor-learner threads in parallel on a single multi-core CPU. Each worker maintains its own environment and policy copy, computes gradients on short rollouts, and asynchronously applies them to a shared set of parameters. The diversity of experiences across workers acts as an implicit replay buffer and stabilises training without storing past transitions explicitly. A3C surpassed the Atari state of the art at the time while training for half the time on a single multi-core CPU rather than on a GPU, as DQN-style agents had. A2C is the simpler synchronous variant, where workers wait for each other and a single batched update is applied; in practice A2C often matches A3C and is easier to tune.

Trust region policy optimisation (TRPO)

A chronic problem with naive policy gradient steps is that even a small parameter step can cause a large policy change in regions where the policy is steep, which destabilises training. Schulman, Levine, Moritz, Jordan, and Abbeel (ICML 2015) addressed this in Trust Region Policy Optimisation (TRPO) by constraining each update to keep the new policy close to the old one in KL divergence:

maximise   E[ π_θ(a|s) / π_θ_old(a|s) · A^{π_old}(s,a) ]
subject to E[ KL( π_θ_old(·|s) || π_θ(·|s) ) ] ≤ δ

TRPO solves this constrained problem by linearising the surrogate objective and quadratically approximating the KL constraint, which produces a natural policy gradient direction. The update direction is found via the conjugate gradient method (avoiding explicit storage of the Fisher information matrix), and a backtracking line search adjusts the step size until the KL constraint is satisfied and the surrogate objective improved. The trust region machinery makes TRPO impressively stable across very different problem domains, but the conjugate gradient solve and line search make each iteration computationally heavy and the implementation fiddly.
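The conjugate gradient step can be sketched as follows. The solver only needs a function that multiplies a vector by the Fisher matrix, which is what lets TRPO avoid forming the matrix. Here a small explicit symmetric positive-definite matrix F stands in for the Fisher matrix and g for the policy gradient; both are toy values chosen for illustration.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b given only the product v -> A v (A symmetric PD)."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x, with x = 0 initially
    p = r.copy()              # search direction
    rs_old = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

F = np.array([[2.0, 0.5], [0.5, 1.0]])   # toy stand-in for the Fisher matrix
g = np.array([1.0, -1.0])                # toy policy gradient
direction = conjugate_gradient(lambda v: F @ v, g)   # approximates F^{-1} g

delta = 0.01                             # KL trust-region radius
# Scale the step so the quadratic KL approximation 0.5 * step^T F step equals delta
step = np.sqrt(2 * delta / (direction @ F @ direction)) * direction
```

In a full implementation the backtracking line search would then shrink this step until the measured KL and the surrogate improvement are acceptable.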

Proximal policy optimisation (PPO)

Schulman, Wolski, Dhariwal, Radford, and Klimov (arXiv 2017) proposed Proximal Policy Optimisation (PPO) as a much simpler alternative to TRPO. Instead of an explicit KL constraint, PPO penalises updates that move the probability ratio ρ_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) too far from 1. The clipped surrogate objective is

L^CLIP(θ) = E_t[ min( ρ_t(θ) · A_t , clip(ρ_t(θ), 1 − ε, 1 + ε) · A_t ) ]

with ε typically set to 0.2. The clip operation flattens the objective as soon as the ratio leaves [1 − ε, 1 + ε] in the direction that would increase it, removing the incentive for the optimiser to push the policy further. Taking the min with the unclipped term makes L^CLIP a pessimistic lower bound on the unclipped surrogate: a change in the ratio is ignored when it would improve the objective but still counts in full when it makes the objective worse.
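A small NumPy sketch of the clipped surrogate, evaluated on four hand-picked cases (ratio above or below the clip range, advantage positive or negative). The function name and inputs are illustrative; a real implementation would compute the same expression on tensors and differentiate it.

```python
import numpy as np

def ppo_clip_terms(logp_new, logp_old, advantages, eps=0.2):
    """Per-sample clipped surrogate min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped)

# Ratios of 1.5 and 0.5, each paired with a positive and a negative advantage
logp_old = np.zeros(4)
logp_new = np.log(np.array([1.5, 1.5, 0.5, 0.5]))
advantages = np.array([1.0, -1.0, 1.0, -1.0])

terms = ppo_clip_terms(logp_new, logp_old, advantages)
objective = terms.mean()   # quantity to maximise; negate it to use as a loss
```

The first and last cases are clipped (values 1.2 and −0.8), so their gradient with respect to the policy is zero. The middle two keep the unclipped values (−1.5 and 0.5), so the policy is still pulled back when it has moved in a harmful direction.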

Three things made PPO the dominant policy gradient algorithm in practice: it works with first-order optimisers like Adam (no conjugate gradient), it permits multiple epochs of minibatch SGD per batch of collected data (so sample efficiency is much better than vanilla A2C), and it is robust to a wide range of hyperparameters. A 2022 study by Huang et al. in the ICLR Blog Track catalogued 37 implementation details that affect PPO performance; despite this complexity, PPO remains the algorithm reached for first in deep RL projects. OpenAI Five (Dota 2), bipedal robots like Cassie, and the RLHF stage of InstructGPT and ChatGPT all used PPO or close variants, while AlphaStar (StarCraft II) used a related actor-critic method.

Off-policy actor-critic methods for continuous control

A parallel line of work attacks the same continuous-control problem with off-policy actor-critic algorithms that learn from a replay buffer, which makes them much more sample efficient than on-policy PPO at the cost of more delicate tuning.

Algorithm	Year	Policy type	Key idea	Notes
DPG	2014	Deterministic	Deterministic policy gradient theorem (Silver et al.)	Off-policy actor-critic with linear function approximation
DDPG	2016	Deterministic	DPG + DQN tricks (replay, target networks)	First effective deep RL for continuous control on pixels (Lillicrap et al.)
TD3	2018	Deterministic	Twin critics, delayed actor updates, target policy smoothing	Fixes DDPG's overestimation bias (Fujimoto, van Hoof, Meger)
SAC	2018	Stochastic	Maximum-entropy RL, automatic temperature tuning	State of the art for continuous control; very stable (Haarnoja et al.)

DDPG (Lillicrap et al., ICLR 2016) adapts the deterministic policy gradient of Silver et al. (ICML 2014) to deep networks by reusing the DQN tricks that stabilise off-policy Q-learning: a replay buffer, separate target networks for the actor and the critic, and Polyak-averaged target updates. The critic is trained by minimising the Bellman error on Q(s,a), and the deterministic actor is updated via the chain rule, ∇_θ J = E[∇_a Q_φ(s, a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s)]. DDPG works on more than twenty simulated physics tasks including dexterous manipulation, legged locomotion, and end-to-end pixel control.
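The Polyak-averaged target update is simple enough to show in full. This sketch assumes parameters are stored as a dictionary of NumPy arrays; the dictionary layout and the default τ = 0.005 are assumptions (a small τ is a common choice), not values taken from the text above.

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau of the way toward its online counterpart."""
    return {name: (1.0 - tau) * target_params[name] + tau * online_params[name]
            for name in target_params}

target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
updated = polyak_update(target, online, tau=0.1)
```

Because the target network trails the online network slowly, the bootstrapped critic targets change more smoothly than the critic itself, which is the stabilising effect the DQN-style target network is meant to provide.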

TD3 (Fujimoto, van Hoof, and Meger, ICML 2018) addresses three failure modes of DDPG. First, deep Q-functions overestimate values because the max operator in the Bellman target picks up positive noise; TD3 trains two critics and uses the smaller of the two as the target. Second, errors in the critic propagate to the actor and feed back into the data; TD3 updates the actor (and the target networks) only every two critic updates. Third, deterministic policies overfit to narrow peaks of the Q-function; TD3 adds clipped noise to the target action, smoothing the Q-learning target across nearby actions. Together these changes let TD3 substantially outperform DDPG on the OpenAI Gym continuous-control suite.
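Two of these fixes, the twin-critic minimum and target policy smoothing, both appear in the computation of the critic target. The sketch below assumes the target critics are plain Python callables on batched actions and that actions are bounded in [−act_limit, act_limit]; the noise scale and clip values are illustrative defaults.

```python
import numpy as np

def td3_target(reward, done, next_action_mean, q1_target, q2_target, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Critic target using clipped target-action noise and the minimum of two critics."""
    noise = np.clip(noise_std * rng.standard_normal(next_action_mean.shape),
                    -noise_clip, noise_clip)                 # target policy smoothing
    next_action = np.clip(next_action_mean + noise, -act_limit, act_limit)
    q_min = np.minimum(q1_target(next_action), q2_target(next_action))  # twin critics
    return reward + gamma * (1.0 - done) * q_min

# Toy critics that ignore the action, so the expected target is easy to verify
q1 = lambda a: np.full(a.shape[0], 2.0)
q2 = lambda a: np.full(a.shape[0], 1.0)

rng = np.random.default_rng(0)
targets = td3_target(reward=np.array([0.5, 0.5]),
                     done=np.array([0.0, 1.0]),
                     next_action_mean=np.zeros((2, 3)),
                     q1_target=q1, q2_target=q2, rng=rng)
```

The first transition bootstraps from the smaller critic value (0.5 + 0.99 · 1.0), while the terminal transition keeps only the reward. The delayed actor update is a scheduling choice in the training loop and is not shown.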

SAC (Haarnoja, Zhou, Abbeel, Levine, ICML 2018) takes a different route. It casts the problem as maximum-entropy RL, maximising a modified objective J_MaxEnt(π) = E[Σ_t (r(s_t, a_t) + α H(π(·|s_t)))]. The added entropy term encourages the policy to be as random as possible while still solving the task, which strongly improves exploration and produces robust policies. SAC trains a stochastic Gaussian actor and twin Q-critics off-policy from a replay buffer, with a temperature parameter α that can be tuned automatically by gradient descent towards a target entropy. Because of its sample efficiency, stability, and modest hyperparameter tuning needs, SAC has become a default choice for real-world robotic learning.
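Two pieces of this are easy to sketch in isolation: the entropy-regularised critic target and one gradient step on the temperature. The temperature loss assumed here, J(α) = E[−α (log π(a|s) + H_target)], is the form commonly used in SAC implementations. The log-α parameterisation, learning rate, and function names are illustrative assumptions.

```python
import numpy as np

def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Entropy-regularised target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    soft_value = np.minimum(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value

def temperature_step(log_alpha, logp, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    alpha = np.exp(log_alpha)
    grad_log_alpha = -alpha * np.mean(logp + target_entropy)
    return log_alpha - lr * grad_log_alpha

y = soft_td_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0,
                   logp_next=-1.0, alpha=0.5, gamma=0.9)

# Sampled entropy 0.5 is below the target 1.0, so the temperature should rise
raised = temperature_step(0.0, np.array([-0.5, -0.5]), target_entropy=1.0)
# Sampled entropy 2.0 is above the target 1.0, so the temperature should fall
lowered = temperature_step(0.0, np.array([-2.0, -2.0]), target_entropy=1.0)
```

The temperature therefore acts like a dual variable: it grows while the policy is less random than the target entropy demands and shrinks once the policy is random enough.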

Asynchronous and distributed variants

Policy gradient methods scale especially well across many parallel environment workers, because their gradient is naturally an expectation that splits cleanly across actors.

A3C (Mnih et al., ICML 2016): asynchronous advantage actor-critic with multiple CPU workers and a shared parameter server.
A2C: synchronous version of A3C; each step waits for all workers and applies a batched update.
IMPALA (Espeholt et al., ICML 2018): decoupled actors and learners with the V-trace off-policy correction, reaching a throughput of 250,000 frames per second, over 30 times that of single-machine A3C, on DMLab-30 and Atari-57.
Ape-X (Horgan et al., ICLR 2018): distributed prioritised experience replay; though formulated for DQN it generalises to off-policy actor-critics.
R2D2 and recurrent variants: LSTM-based policies trained over long sequences with stored hidden states.
MAPPO and MADDPG: multi-agent extensions of PPO and DDPG.
D4PG (Barth-Maron et al., ICLR 2018): distributional critic for distributed deep deterministic policy gradients.

Variance reduction techniques

The basic policy gradient estimator has very high variance, and almost every practical advance in the field can be read as a variance-reduction trick. The following table lists the main techniques and where they appear.

Technique	What it does	Cost	Used in
State-value baseline V(s)	Subtracts a state-dependent baseline from the return	Need a learned V	REINFORCE with baseline, all actor-critics
Advantage estimation A(s,a) = Q − V	Centres the gradient on relative quality	Same	All modern actor-critics
Generalised Advantage Estimation (GAE)	Exponentially weighted multi-step advantage with parameter λ	One extra hyperparameter	TRPO, PPO
Bootstrapping with V(s')	Uses a learned value to truncate the Monte Carlo return	Bias if V is wrong	A2C, A3C, n-step actor-critic
Trust region constraint	Bounds policy change per step in KL divergence	Conjugate gradient solve	TRPO
Clipped surrogate objective	Clips probability ratio to bound the effective step	Bound is implicit, not a hard constraint	PPO
Importance sampling correction	Re-weights off-policy data to look on-policy	Variance from large ratios	Off-PAC, V-trace, Retrace
Twin critics	Takes minimum of two Q-estimates to fight overestimation	Double critic compute	TD3, SAC
Entropy regularisation	Adds an entropy bonus α·H(π) to the objective	Need to tune α	A3C, SAC, PPO (small bonus)
Reward normalisation and clipping	Stabilises gradient magnitudes across tasks	Loses scale info	PPO, RND, most production code

GAE is worth singling out. Schulman et al. (ICLR 2016) defined the advantage estimator

A^{GAE(γ,λ)}_t = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t)

which smoothly interpolates between high-variance Monte Carlo (λ → 1) and high-bias one-step TD (λ → 0). In practice a value of λ = 0.95 to 0.97 strikes the standard bias-variance balance and is the default in PPO and TRPO implementations.
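The infinite sum is computed in practice by a backward recursion over a finite rollout, A_t = δ_t + γλ · A_{t+1}. The sketch below assumes a single rollout with per-step done flags and a bootstrap value for the state after the last step; the function and argument names are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) advantages for one rollout, computed backwards in time.

    values[t] is V(s_t); last_value is V of the state after the final step;
    dones[t] is 1.0 if the episode ended at step t (no bootstrapping past it).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(3)
dones = np.array([0.0, 0.0, 1.0])

adv_mc = compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=1.0)   # Monte Carlo limit
adv_td = compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=0.0)   # one-step TD limit
adv_mid = compute_gae(rewards, values, 0.0, dones, gamma=0.5, lam=0.5)
```

With a zero value function, λ = 1 recovers the undiscounted returns-to-go [3, 2, 1] and λ = 0 recovers the one-step TD errors [1, 1, 1], the two endpoints described above.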

Connection to RLHF and language model alignment

The most consequential application of policy gradient methods in the 2020s is reinforcement learning from human feedback (RLHF), the technique used to align large language models with human preferences. The dominant algorithm for the RL stage of RLHF is PPO, and the link is direct: InstructGPT (Ouyang et al., NeurIPS 2022) fine-tunes its policy with PPO against a learned reward model. The pipeline that produced ChatGPT, the original Claude, and Gemini chat models follows the same outline:

Supervised fine-tuning (SFT): the base language model is fine-tuned on demonstrations written by humans.
Reward model training: humans rank pairs of model outputs, and a reward model r_φ is trained to predict the human preference.
Reinforcement learning: the language model is treated as a stochastic policy that emits a sequence of tokens, and PPO is run on a reward that combines the reward-model score r_φ, assigned at the end of the sequence, with a per-token KL penalty β · KL(π_θ || π_ref) against the supervised reference policy.

The KL penalty against the SFT reference is what prevents the policy from drifting into nonsense or reward-hacking the imperfect r_φ; without it, the optimiser will quickly find outputs that the reward model rates highly but humans hate. PPO's clipped objective adds a second layer of conservatism. Ouyang et al. reported that the 1.3B InstructGPT outputs were preferred to the 175B GPT-3 outputs despite a 100x parameter gap, an early demonstration of how powerful the alignment loop is. Subsequent work has explored alternatives such as DPO (Direct Preference Optimisation, Rafailov et al. 2023), which sidesteps the reward model and the PPO loop by training directly on preference data, but PPO remains the workhorse of production RLHF pipelines.
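One common way to turn this into per-token rewards for PPO is sketched below, under stated assumptions: the KL term is estimated per token from the sampled log-probabilities as log π_θ − log π_ref, and the reward-model score is added only at the final token. The function name, the β value, and the log-probabilities are illustrative, not taken from any particular system.

```python
import numpy as np

def rlhf_token_rewards(logp_policy, logp_ref, sequence_score, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus the score on the last token."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    kl_estimate = logp_policy - logp_ref   # sample-based estimate, one value per token
    rewards = -beta * kl_estimate
    rewards[-1] += sequence_score          # reward model scores the full response
    return rewards

# Hypothetical log-probabilities of three sampled tokens under each model
rewards = rlhf_token_rewards(
    logp_policy=[-1.0, -2.0, -0.5],
    logp_ref=[-1.5, -2.0, -1.0],
    sequence_score=2.0,
    beta=0.1,
)
```

Tokens on which the policy has become more confident than the reference incur a small negative reward, so the policy pays a price for every step away from the reference model and has to earn it back through the final score.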

Comparison with value-function methods

The table below contrasts the major classes of model-free RL algorithms.

Family	Examples	Action space	Sample efficiency	On-/off-policy	Typical stability	Where it shines
Pure value-based	Q-learning, DQN, SARSA	Discrete	High (off-policy + replay)	Off-policy	Sometimes brittle	Atari, discrete control, gridworlds
Pure policy-based	REINFORCE	Discrete or continuous	Low	On-policy	Noisy but unbiased	Pedagogy, very simple problems
On-policy actor-critic	A2C, A3C, TRPO, PPO, IMPALA	Both	Medium	On-policy	Robust, especially PPO	Games, RLHF, large-scale parallel training
Off-policy actor-critic, deterministic	DDPG, TD3, D4PG	Continuous	High	Off-policy	TD3 stable; DDPG can collapse	Robotics, continuous control
Off-policy actor-critic, max-entropy	SAC	Continuous	Very high	Off-policy	Very stable	Real-world robotic learning
Hybrid with planning	AlphaZero-style policy + value + MCTS	Discrete with structure	High	Off-policy via self-play	Stable in self-play	Board games, search-amenable domains

A loose rule of thumb: if the problem has a small discrete action set and you can collect lots of cheap experience, a value-based method like DQN is often sample efficient enough. If the actions are continuous, you almost certainly want a policy gradient method. If you also need stability and minimal tuning, start with PPO; if you need sample efficiency for real robots, start with SAC.

Theoretical results

The theory of policy gradient methods is unusually clean for a deep learning topic. The main results are:

Policy gradient theorem (Sutton, McAllester, Singh, Mansour 2000): the expression for ∇_θ J above, valid for both episodic and average-reward formulations, with and without a state-dependent baseline.
Compatible function approximation (same paper): if the critic Q_w is linear in the features ∇_θ log π_θ and w minimises the mean-squared error to Q^π, substituting Q_w for Q^π leaves the policy gradient exact, so no bias is introduced.
Convergence under linear function approximation (Konda and Tsitsiklis 2000): two-time-scale actor-critic with a linear critic converges almost surely to a stationary point of J (typically a local maximum) under standard step-size conditions.
Conservative policy iteration (Kakade and Langford, ICML 2002): mixing the new and old policies with a small mixture coefficient guarantees monotonic improvement, and motivates the trust-region perspective.
Natural policy gradient (Kakade, NeurIPS 2001): premultiplying the gradient by the inverse Fisher information matrix gives the steepest ascent direction in the natural geometry of the policy manifold and accelerates convergence.
Mirror descent perspective (Neu et al. 2017, Tomar et al. 2020): TRPO and PPO can be interpreted as approximate instances of mirror descent in the space of policies with a KL Bregman divergence, which clarifies why the surrogate objective works.
Global convergence under tabular and softmax parameterisations (Agarwal, Kakade, Lee, Mahajan 2021, On the Theory of Policy Gradient Methods): finite-time guarantees for natural policy gradient and projected policy gradient in idealised settings.

Practical considerations

Policy gradient methods are notoriously sensitive to implementation choices. The following items account for most of the gap between a working and a non-working implementation.

Hyperparameters: PPO's clip ratio ε, learning rate, GAE λ, number of epochs per batch, and minibatch size all matter. Defaults of ε = 0.2, lr ≈ 3e-4, λ = 0.95, 10 epochs, and a minibatch size of 64 work for many problems but rarely all.
Reward shaping and clipping: clipping rewards to a fixed range or normalising them with a running mean and standard deviation often makes training tractable on tasks with very different reward scales. The 2019 Ilyas et al. study A Closer Look at Deep Policy Gradients showed that much of the performance widely credited to PPO's core algorithm in fact comes from implementation details of this kind.
Entropy regularisation: small entropy bonuses (β ≈ 0.01) prevent premature collapse to a deterministic policy in PPO and A3C; SAC promotes entropy to a first-class objective and tunes its weight automatically.
GAE λ: λ = 0 reduces to one-step TD (low variance, high bias), λ = 1 to Monte Carlo (high variance, no bias). Most production work sits between 0.9 and 0.99.
Parallel environment collection: PPO scales nearly linearly in the number of parallel environments because the gradient is an expectation. OpenAI Five used thousands of CPU workers; even modest projects benefit from 16 to 128 parallel environments.
Network initialisation: orthogonal initialisation of the policy and value heads with small final-layer scales is a standard PPO trick that materially affects early training.
Action space scaling: continuous policies output actions in a normalised range and scale them to the environment range, which keeps gradients well-behaved.
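Two of the items above, action scaling and running reward normalisation, are small enough to sketch directly. The tanh squashing and the Welford-style running statistics are common choices but are assumptions here, and the class and function names are hypothetical.

```python
import numpy as np

def scale_action(raw, low, high):
    """Squash an unbounded network output with tanh, then map [-1, 1] to [low, high]."""
    squashed = np.tanh(raw)
    return low + 0.5 * (squashed + 1.0) * (high - low)

class RunningNorm:
    """Running mean and variance (Welford's algorithm) for normalising scalar rewards."""
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalise(self, x):
        var = self.m2 / max(self.count, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

mid_action = scale_action(np.array([0.0]), low=-2.0, high=4.0)    # centre of the range
max_action = scale_action(np.array([50.0]), low=-2.0, high=4.0)   # saturates at the upper bound

norm = RunningNorm()
for r in [1.0, 2.0, 3.0]:
    norm.update(r)
centred = norm.normalise(2.0)
```

Many PPO implementations normalise by a running standard deviation of discounted returns rather than raw rewards; the same class can be fed either quantity.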

Frameworks and implementations

A short list of production-quality libraries that implement policy gradient methods:

Framework	Maintainer	Strengths	Notes
Stable-Baselines3	DLR-RM	Reliable PPO, A2C, SAC, TD3, DDPG; easy to use	PyTorch, single-machine focus
RLlib (Ray)	Anyscale	Distributed scaling, multi-agent, broad algorithm coverage	Best for large clusters
Tianshou	THU-ML (Tsinghua)	Highly modular; fast PPO and SAC	Research-friendly
TorchRL	Meta AI	Modular RL primitives; integrates with the PyTorch ecosystem	Newer; growing fast
CleanRL	community	Single-file, research-friendly implementations	Excellent for understanding details
Acme	DeepMind	JAX and TF backends, distributed	Used in DeepMind research
Brax	Google	JAX physics + RL, end-to-end on accelerators	Very fast for continuous control
Sample Factory	Petrenko et al.	High-throughput on-policy training	Used for ViZDoom and procgen leaderboards

The 37 Implementation Details of PPO (Huang, Dossa, et al., ICLR Blog Track 2022) is the standard reference for understanding why nominally-equivalent implementations diverge in performance. A 2025 comparative study reported that Stable-Baselines3, CleanRL, and OpenAI Baselines achieved superhuman PPO performance rates around 50% in their benchmark trials, compared to under 15% for some other libraries, illustrating just how much implementation details matter.

Real-world applications

Game playing: OpenAI Five used scaled PPO with 256 GPUs and 128,000 CPU cores to defeat the Dota 2 world champions OG in April 2019, after self-playing the equivalent of about 180 years of games per day. AlphaStar used a similar actor-critic approach with population-based training to reach Grandmaster level in StarCraft II in 2019. AlphaGo and AlphaZero combine a policy network trained partly with policy gradient ideas with a Monte Carlo tree search planner.
Robotics: PPO is the workhorse for sim-to-real bipedal locomotion. Cassie, the bipedal robot built by Agility Robotics and Oregon State University, was the first 3D bipedal robot to walk in the real world using a learned end-to-end neural network policy, trained with PPO in simulation and transferred zero-shot to hardware (Xie et al. 2019, Li et al. 2024). Boston Dynamics has used learned controllers for Atlas and Spot. Quadruped locomotion (Hwangbo et al. 2019, on the ANYmal robot) and dexterous in-hand manipulation (OpenAI 2019, on the Shadow Hand) also rely on trust-region or PPO-style policy gradient training in simulation.
LLM alignment: the reinforcement learning from human feedback stage of InstructGPT, ChatGPT, the early Claude models, Gemini chat models, and a long tail of open-source instruction-tuned models all run PPO against a learned reward model with a KL penalty against the supervised reference policy. This is the single largest commercial deployment of a policy gradient algorithm to date.
Autonomous driving: policy gradient methods drive lane keeping, decision making, and motion planning in research and increasingly in production. Wayve, for instance, has publicly described training driving policies with deep RL.
Recommendation systems: YouTube, Spotify, and others have published work on using policy gradients (REINFORCE-style methods with off-policy correction) to learn slate recommendations and next-video policies.
Resource scheduling and traffic control: data-centre cooling control (DeepMind reported a 40% reduction in Google data-centre cooling energy in 2016) and traffic signal control are frequently cited applications of learned control policies, and both have been studied with policy gradient methods.
Healthcare and finance: RL-based treatment policies and portfolio optimisation use policy gradient methods, though deployment is limited by the difficulty of safe exploration.

Limitations

Policy gradient methods are not a panacea. Their persistent weaknesses are:

Sample efficiency. On-policy methods like PPO discard data after each update and require many environment interactions; for small discrete environments, value-based methods with replay are often dramatically more sample efficient.
High variance. Even with GAE and baselines, the gradient estimator is inherently noisy. Hyperparameter sweeps that look fine on average can show huge run-to-run variation.
Reward design. The output of policy gradient training is only as good as the reward function. Reward hacking (Krakovna et al. 2020 catalogued dozens of cases) is endemic, and shaping rewards by hand is brittle.
Hyperparameter sensitivity. As Ilyas et al. and the PPO 37 Details article documented, even canonical PPO depends heavily on implementation details that the original paper did not emphasise.
Local optima. Gradient ascent on a non-convex policy objective can converge to local maxima that are far from globally optimal, and exploration alone is not enough to escape them in many high-dimensional problems.
Catastrophic forgetting. Off-policy actor-critics in particular can suddenly collapse during training if the replay distribution drifts; TD3 and SAC mitigate this but do not eliminate it.
Credit assignment over long horizons. Discounted returns blur cause and effect over hundreds of timesteps; hierarchical and option-based extensions help but remain an active research area.

References

Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." *Machine Learning*, 8, 229-256. https://link.springer.com/article/10.1007/BF00992696
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." *Advances in Neural Information Processing Systems 12*. http://papers.neurips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
Konda, V. R. and Tsitsiklis, J. N. (2000). "Actor-Critic Algorithms." *Advances in Neural Information Processing Systems 12*. https://proceedings.neurips.cc/paper/1786-actor-critic-algorithms.pdf
Kakade, S. M. (2001). "A Natural Policy Gradient." *Advances in Neural Information Processing Systems 14*. https://homes.cs.washington.edu/~sham/papers/rl/natural.pdf
Kakade, S. and Langford, J. (2002). "Approximately Optimal Approximate Reinforcement Learning." *International Conference on Machine Learning*. https://www.cs.cmu.edu/~jcl/presentation/RL/RL.ps
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). "Deterministic Policy Gradient Algorithms." *International Conference on Machine Learning*. https://proceedings.mlr.press/v32/silver14.pdf
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). "Trust Region Policy Optimization." *International Conference on Machine Learning*. https://arxiv.org/abs/1502.05477
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2016). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." *International Conference on Learning Representations*. https://arxiv.org/abs/1506.02438
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). "Continuous control with deep reinforcement learning." *International Conference on Learning Representations*. https://arxiv.org/abs/1509.02971
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). "Asynchronous Methods for Deep Reinforcement Learning." *International Conference on Machine Learning*. https://arxiv.org/abs/1602.01783
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. https://arxiv.org/abs/1707.06347
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. (2018). "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures." *International Conference on Machine Learning*. https://arxiv.org/abs/1802.01561
Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *International Conference on Machine Learning*. https://arxiv.org/abs/1802.09477
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *International Conference on Machine Learning*. https://arxiv.org/abs/1801.01290
Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, 2nd ed., Chapter 13: Policy Gradient Methods. MIT Press. http://incompleteideas.net/book/the-book-2nd.html
OpenAI, et al. (2019). "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv:1912.06680. https://cdn.openai.com/dota-2.pdf
Vinyals, O., et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning." *Nature*. https://deepmind.google/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning/
Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*. https://arxiv.org/abs/2203.02155
Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., and Wang, W. (2022). "The 37 Implementation Details of Proximal Policy Optimization." ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
OpenAI Spinning Up Documentation: Vanilla Policy Gradient, TRPO, PPO, DDPG, TD3, SAC. https://spinningup.openai.com/
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). "On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift." *Journal of Machine Learning Research*. https://jmlr.org/papers/volume22/19-736/19-736.pdf
