Proximal Policy Optimization (PPO)
Last reviewed
May 16, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 3,998 words
Proximal Policy Optimization (PPO) is an on-policy policy gradient reinforcement learning algorithm introduced in July 2017 by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI. The paper "Proximal Policy Optimization Algorithms" (arXiv:1707.06347) proposed a family of objective functions that approximate the trust region behavior of TRPO using only first-order optimization. PPO became OpenAI's default reinforcement learning algorithm in 2017 and remained one of the most widely used policy gradient methods for nearly a decade across robotics, video game agents, and large language model alignment.
PPO is best known outside the RL community as the workhorse algorithm of reinforcement learning from human feedback (RLHF) for large language models. It was the RL algorithm behind OpenAI's InstructGPT (2022), ChatGPT, and GPT-4, and behind early versions of Anthropic's Claude and Google's Gemini. From 2022 through 2024, almost every aligned production LLM was post-trained with PPO. Starting in 2023, simpler alternatives began displacing PPO for many RLHF workloads: first the offline Direct Preference Optimization (DPO), then, from 2024, the critic-free but still on-policy Group Relative Policy Optimization (GRPO). PPO remains the standard baseline and is still used inside many frontier post-training stacks.
Reinforcement learning trains an agent to maximize cumulative reward by interacting with an environment. In the policy gradient family, the agent's behavior is a parameterized stochastic policy pi_theta(a|s) mapping states s to a distribution over actions a. The parameters theta are updated by ascending the gradient of the expected return J(theta) = E[sum_t gamma^t r_t]. The classic REINFORCE estimator, due to Ronald Williams in 1992, uses Monte Carlo returns to estimate this gradient. While unbiased, REINFORCE has high variance, and its single-step updates can produce destructively large policy changes that collapse training.
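As a small illustration of the discounted sum inside J(theta), the sketch below computes the discounted return from every step of one episode. The function name `discounted_returns` and the sample rewards are assumptions made for this example, not part of any particular library.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_{k>=0} gamma^k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode with a single reward at the end.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

The first entry is the quantity whose expectation defines J(theta); REINFORCE-style estimators weight each log-probability gradient by a return of this kind.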
Actor-critic methods reduce variance by introducing a learned value function V_phi(s) as a baseline. The policy is updated using the advantage A_t = Q(s_t, a_t) - V(s_t), which estimates how much better an action is than the policy's average behavior in that state. Schulman, Moritz, Levine, Jordan, and Abbeel formalized a flexible family of advantage estimators in 2015 with Generalized Advantage Estimation (GAE), which mixes multi-step temporal-difference errors via an exponentially weighted average controlled by the parameter lambda. GAE has since become a near-universal component of practical policy gradient pipelines, including every standard implementation of PPO.
The immediate predecessor of PPO is Trust Region Policy Optimization (TRPO), introduced in February 2015 by Schulman, Levine, Moritz, Jordan, and Abbeel (arXiv:1502.05477). TRPO addressed the destructive-update problem by maximizing a surrogate objective subject to a constraint on the average KL divergence between the new and old policies:
maximize E_t [ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A_t ]
subject to E_t [ KL( pi_theta_old(.|s_t) || pi_theta(.|s_t) ) ] <= delta
The constraint defines a "trust region" inside which the linear approximation of the objective is reliable. TRPO solves the constrained problem with a second-order method using the Fisher information matrix and conjugate gradients to find the natural gradient direction, followed by a line search to enforce the KL constraint.
TRPO offers monotonic improvement guarantees in theory and stable training in practice, but has serious engineering drawbacks. The second-order step is awkward with deep networks that use dropout, batch norm, or shared policy-value parameters, and it is difficult to parallelize across GPUs. These limitations motivated the search for a simpler first-order method, which is what PPO delivered.
The paper "Proximal Policy Optimization Algorithms" was submitted to arXiv on July 20, 2017 by five authors all at OpenAI: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. It proposed two related variants of a clipped or penalized surrogate objective that could be optimized with standard stochastic gradient ascent on minibatches, allowing multiple epochs of updates from a single batch of trajectories.
The paper's empirical claim was that PPO outperformed prior online policy gradient methods on a suite of benchmark tasks (MuJoCo continuous control and Atari 2600 games) while being far simpler to implement than TRPO. PPO matched or exceeded TRPO on most MuJoCo tasks. On Atari it clearly outperformed A2C and was competitive with actor-critic with experience replay (ACER): PPO won more games when scored by average reward over all of training, while ACER won more when scored by final performance. The simplicity argument was arguably more important: PPO could be implemented in roughly fifty lines of code on top of any automatic differentiation framework, with no need for second-order optimization or specialized matrix-vector product routines.
PPO defines a probability ratio between the new and old policies:
r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
At the start of each optimization phase pi_theta_old is set to pi_theta, so initially r_t = 1. As the policy parameters change, r_t drifts away from one. The naive surrogate objective is E_t [ r_t(theta) * A_t ], which is the same first-order term that appears in TRPO. Maximizing this objective without a constraint can drive r_t to extreme values and destabilize training.
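In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities, which avoids underflow for very unlikely actions. The following is a minimal sketch; the function name and the example probabilities are assumptions chosen for illustration.

```python
import math


def probability_ratio(new_logp, old_logp):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probabilities."""
    return math.exp(new_logp - old_logp)


old_logp = math.log(0.25)  # log-probability recorded during the rollout

# Before any gradient step the two policies coincide, so the ratio is 1.
print(probability_ratio(old_logp, old_logp))

# After an update that raises the action's probability from 0.25 to 0.30,
# the ratio is about 1.2.
print(probability_ratio(math.log(0.30), old_logp))
```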
The more popular variant, often called PPO-Clip, defines the clipped surrogate objective:
L_CLIP(theta) = E_t [ min( r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t ) ]
The clip function restricts the ratio to [1 - epsilon, 1 + epsilon], and the outer min makes the objective a pessimistic lower bound on the unclipped surrogate. When the advantage is positive, the objective stops increasing once r_t > 1 + epsilon, removing the incentive to raise the action's probability further. When the advantage is negative, the objective stops improving once r_t < 1 - epsilon, removing the incentive to lower it further. Samples whose ratio has already left the clip range in the direction the advantage favors therefore contribute zero gradient, while samples whose ratio has moved the wrong way stay unclipped so the update can still correct them.
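The four cases of the clipped objective can be checked numerically. This is a per-sample sketch in plain Python; the function name `clipped_surrogate` and the sample values are assumptions for illustration only.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(unclipped, clipped_ratio * advantage)


# Positive advantage, ratio already too high: capped at (1 + eps) * A.
print(clipped_surrogate(1.5, 1.0))
# Positive advantage, ratio too low: unclipped, so the update can still raise it.
print(clipped_surrogate(0.5, 1.0))
# Negative advantage, ratio already too low: capped at (1 - eps) * A.
print(clipped_surrogate(0.5, -1.0))
# Negative advantage, ratio too high: unclipped, so the update can still lower it.
print(clipped_surrogate(1.5, -1.0))
```

In the two capped cases the value no longer depends on the ratio, which is why those samples contribute no gradient.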
The epsilon hyperparameter controls trust region size. The original paper used epsilon = 0.2 and reported PPO-Clip was robust across a wide range of values. This simple clipping trick is what most modern implementations refer to as "PPO."
The second variant, PPO-Penalty, replaces the clip with an explicit KL divergence penalty:
L_KLPEN(theta) = E_t [ r_t(theta) * A_t - beta * KL( pi_theta_old(.|s_t) || pi_theta(.|s_t) ) ]
The coefficient beta is adapted online. After each policy update the empirical KL divergence d is compared with a target d_targ: beta is doubled if d > 1.5 * d_targ and halved if d < d_targ / 1.5. This adaptive scheme approximates a KL constraint without solving a constrained optimization problem.
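The adaptation rule is short enough to write out directly. The sketch below assumes the measured KL is supplied by the caller; the function and argument names are hypothetical.

```python
def adapt_kl_coefficient(beta, measured_kl, target_kl):
    """Double beta when KL overshoots the target by 1.5x, halve it when it undershoots by 1.5x."""
    if measured_kl > 1.5 * target_kl:
        return beta * 2.0
    if measured_kl < target_kl / 1.5:
        return beta / 2.0
    return beta  # within the tolerance band: leave beta unchanged


beta = 1.0
beta = adapt_kl_coefficient(beta, measured_kl=0.02, target_kl=0.01)   # overshoot: beta becomes 2.0
beta = adapt_kl_coefficient(beta, measured_kl=0.005, target_kl=0.01)  # undershoot: beta back to 1.0
print(beta)
```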
The original paper reported that PPO-Penalty performed worse than PPO-Clip on MuJoCo, which is why the clipped variant became the default. A related idea later proved essential in LLM RLHF, where the KL penalty is taken against a frozen reference policy rather than against the previous policy iterate.
In the actor-critic implementation used in the paper and in essentially every modern codebase, the full objective combines three terms:
L(theta) = L_CLIP(theta) - c1 * L_VF(theta) + c2 * S(pi_theta)
where L_VF = (V_phi(s_t) - V_target_t)^2 is the value loss, S is the entropy of the policy distribution, and c1, c2 are scalar coefficients. The value loss supervises the critic, and the entropy bonus encourages exploration.
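To make the sign conventions concrete, the following sketch evaluates the combined objective on a tiny hand-written batch. It is an illustration under assumptions: the function name, the list-based inputs, and the coefficient values c1 = 0.5 and c2 = 0.01 are chosen for the example, and a real implementation would compute these terms on tensors with automatic differentiation.

```python
def ppo_objective(ratios, advantages, values, value_targets, entropies,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch estimate of L = L_CLIP - c1 * L_VF + c2 * S (a quantity to maximize)."""
    n = len(ratios)
    l_clip = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        l_clip += min(ratio * adv, clipped_ratio * adv)
    l_clip /= n
    # Squared-error value loss, averaged over the batch.
    l_vf = sum((v - target) ** 2 for v, target in zip(values, value_targets)) / n
    # Mean policy entropy, supplied per sample by the caller.
    entropy = sum(entropies) / n
    return l_clip - c1 * l_vf + c2 * entropy


objective = ppo_objective(
    ratios=[1.0, 1.0], advantages=[1.0, -1.0],
    values=[0.0, 0.0], value_targets=[1.0, 1.0],
    entropies=[0.5, 0.5],
)
loss = -objective  # what a minimizing optimizer such as Adam would receive
print(objective, loss)
```

Because optimizers minimize, code usually works with the negated form, in which the value loss enters with a plus sign and the entropy bonus with a minus sign.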
The original paper's MuJoCo hyperparameters, with small modifications, remain the recommended defaults in most modern PPO implementations.
| Hyperparameter | Original paper value | Typical range |
|---|---|---|
| Clip parameter epsilon | 0.2 | 0.1 to 0.3 |
| Discount factor gamma | 0.99 | 0.95 to 0.999 |
| GAE lambda | 0.95 | 0.9 to 0.99 |
| Learning rate (Adam) | 3e-4 | 1e-5 to 5e-4 |
| Number of actors | 1 to 32 | 1 to thousands |
| Horizon T per actor | 2048 | 128 to 4096 |
| Optimization epochs per batch | 10 | 3 to 30 |
| Minibatch size | 64 | 32 to 4096 |
| Value loss coefficient c1 | 1.0 | 0.5 to 1.0 |
| Entropy coefficient c2 | 0.0 | 0.0 to 0.01 |
| Target KL (for early stopping) | not in paper | 0.01 to 0.05 |
| Gradient clipping (global norm) | not in paper (0.5 in OpenAI Baselines) | 0.5 to 1.0 |
An important fact missing from the original paper is that low-level implementation details matter enormously in practice. A 2020 study by Logan Engstrom and colleagues at MIT, "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO," showed that much of PPO's empirical edge over TRPO came from engineering choices such as value function clipping, observation normalization, reward scaling, orthogonal initialization, and learning rate annealing. The clipping objective contributed less than the scaffolding around it. CleanRL's widely cited "The 37 Implementation Details of Proximal Policy Optimization" (Huang et al., 2022) catalogs the engineering tricks any production-grade PPO must respect.
A typical PPO training loop looks like the following. The overall structure is broadly the same across CleanRL, Stable-Baselines3, RLlib, and TRL, although the details differ.
    initialize policy parameters theta and value parameters phi
    for iteration = 1, 2, ... do
        for actor = 1 to N do
            run policy pi_theta in the environment for T timesteps
            record states, actions, rewards, log-probs, and value predictions
        compute advantages A_hat_t with GAE(gamma, lambda)
        compute returns R_t = A_hat_t + V_phi(s_t)
        normalize advantages (mean zero, unit variance)
        for epoch = 1 to K do
            shuffle data and split into minibatches
            for each minibatch do
                compute ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
                compute clipped surrogate loss L_CLIP
                compute value loss L_VF
                compute entropy bonus S
                compute gradient of (- L_CLIP + c1 * L_VF - c2 * S)
                optionally clip global gradient norm
                take an Adam step
            optionally stop the epoch loop early if mean KL exceeds target
        set theta_old = theta
The inner double loop (epochs over minibatches) is what distinguishes PPO from vanilla policy gradient: each batch is reused for multiple gradient steps, which gives PPO its sample efficiency. The clipping makes the multi-step updates safe; without it the ratio would drift further from one with each step and the policy gradient estimate would become unreliable.
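The reuse of one batch over several clipped updates can be seen in a deliberately tiny setting. The sketch below is an illustration under strong simplifying assumptions, not a reference implementation: the environment is a hypothetical two-armed bandit, the policy is a softmax over two logits with hand-derived gradients, the baseline is the batch mean reward instead of a learned critic with GAE, and each epoch uses the full batch rather than minibatches.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_ppo(iterations=30, batch_size=64, epochs=4, lr=1.0,
                     epsilon=0.2, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]  # hypothetical environment: arm 1 is the better arm
    logits = [0.0, 0.0]       # policy parameters theta
    for iteration in range(iterations):
        # Rollout phase: sample actions from the current policy and freeze it as pi_theta_old.
        old_probs = softmax(logits)
        actions = [1 if rng.random() < old_probs[1] else 0 for _ in range(batch_size)]
        rewards = [arm_rewards[a] for a in actions]
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]
        # Optimization phase: several gradient-ascent epochs on the same batch.
        for epoch in range(epochs):
            probs = softmax(logits)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                # Clipped samples contribute zero gradient.
                if (adv > 0 and ratio > 1 + epsilon) or (adv < 0 and ratio < 1 - epsilon):
                    continue
                # d(ratio * adv)/d(logit_j) = adv * ratio * (1[j == a] - probs[j])
                for j in range(2):
                    indicator = 1.0 if j == a else 0.0
                    grad[j] += adv * ratio * (indicator - probs[j]) / batch_size
            logits = [x + lr * g for x, g in zip(logits, grad)]
    return softmax(logits)


final_probs = train_bandit_ppo()
print(final_probs)  # probability mass should have shifted toward arm 1
```

Within one iteration the ratio for each sample can move only until it leaves the clip range, after which further epochs on that batch stop changing the policy; fresh rollouts are then needed to make more progress.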
GAE is so tightly coupled with PPO that practitioners often treat them as one algorithm. The advantage at time t is A_hat_t = sum_{l>=0} (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V_phi(s_{t+1}) - V_phi(s_t) is the one-step TD error (here r_t denotes the reward at step t, not the probability ratio r_t(theta)). The parameter lambda interpolates between high-variance Monte Carlo returns (lambda = 1) and high-bias one-step TD estimates (lambda = 0). The GAE paper recommends lambda in the range 0.92 to 0.98, with 0.95 the most common default.
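The sum above is normally evaluated with a backward recursion, A_hat_t = delta_t + gamma * lambda * A_hat_{t+1}. The sketch below assumes a single uninterrupted trajectory segment with no episode boundaries inside it; the function name and the `last_value` bootstrap argument are assumptions for this example.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over one segment.

    values[t] is V(s_t) for t = 0..T-1 and last_value is V(s_T), used to
    bootstrap past the end of the segment (use 0.0 if the episode ended there).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # backward recursion
        advantages[t] = gae
        next_value = values[t]
    # Value targets, matching R_t = A_hat_t + V(s_t).
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# With lambda = 1 and gamma = 1 the advantage is the Monte Carlo return minus V(s_t).
print(compute_gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0))
# With lambda = 0 it collapses to the one-step TD error.
print(compute_gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=0.0))
```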
The table below contrasts PPO with its predecessor TRPO and with the two algorithms most often used to replace it in LLM alignment: DPO and GRPO.
| Property | TRPO (2015) | PPO (2017) | DPO (2023) | GRPO (2024) |
|---|---|---|---|---|
| Family | On-policy RL | On-policy RL | Offline supervised | On-policy RL |
| Trust region method | Second-order, KL constraint | First-order, clipped ratio | Implicit via KL regularizer | First-order, clipped ratio |
| Requires reward model | Yes (in RLHF) | Yes (in RLHF) | No | Yes (in RLHF) |
| Requires value model (critic) | Yes | Yes | No | No |
| Models in memory (RLHF) | 4 (policy, value, reward, reference) | 4 | 2 (policy, reference) | 3 (policy, reward, reference) |
| Optimization | Conjugate gradient + line search | Adam | Adam | Adam |
| Sample efficiency | High | High | Highest (offline) | High |
| Implementation complexity | Very high | Moderate | Low | Low |
| Hyperparameter sensitivity | High | High | Moderate | Moderate |
| Memory overhead vs PPO | Same | Baseline | -50% | -25 to -33% |
| First major LLM use | None | InstructGPT, ChatGPT, GPT-4 | Zephyr, Llama 3 | DeepSeek-R1, Qwen, Llama post-training |
| Period of dominance | Never (replaced by PPO) | 2017 to 2024 | 2023 to 2025 | 2024 onward |
TRPO and PPO share the same on-policy actor-critic skeleton and differ only in how they enforce the trust region: a hard KL constraint solved by second-order methods (TRPO) versus a soft clipped objective solved by Adam (PPO). DPO is not a reinforcement learning algorithm at all; it reformulates the RLHF objective into a closed-form supervised loss on preference pairs. GRPO keeps the on-policy structure of PPO but replaces the learned critic with a group-based baseline from multiple sampled completions of the same prompt. Other variants such as KTO (Kahneman-Tversky Optimization) explore alternative offline preference objectives.
The most consequential use of PPO has been the alignment of large language models via reinforcement learning from human feedback. The LLM is the policy pi_theta(a|s); the state s is the prompt plus the tokens generated so far, the action a is the next token, and the reward comes from a separate reward model trained on human preference data. To prevent the policy from drifting too far from the pretrained base, RLHF augments the reward with a per-token KL penalty against a frozen reference policy, typically the supervised fine-tuned model.
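One common way to realize the per-token KL penalty is to give every generated token a reward of minus the scaled log-ratio between the policy and the reference, and to add the reward-model score at the final token. The sketch below follows that convention as an assumption; the function name, the coefficient value, and the placement of the score on the last token are illustrative choices rather than a description of any specific system.

```python
def kl_shaped_rewards(policy_logps, reference_logps, sequence_reward, kl_coef=0.1):
    """Per-token rewards for one sampled completion.

    policy_logps[t] and reference_logps[t] are the log-probabilities that the
    current policy and the frozen reference assign to the t-th generated token.
    """
    if not policy_logps:
        return []
    # The log-ratio on sampled tokens is a single-sample estimate of the KL term.
    rewards = [-kl_coef * (p - q) for p, q in zip(policy_logps, reference_logps)]
    # The reward model scores the whole completion, so its score lands on the last token.
    rewards[-1] += sequence_reward
    return rewards


# Hypothetical two-token completion: the policy has drifted on the first token only.
print(kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], sequence_reward=1.0))
```

These shaped rewards then feed the usual advantage estimation, so the penalty discourages drift token by token while the sequence-level score still drives the policy toward preferred completions.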
The foundational paper is OpenAI's "Training language models to follow instructions with human feedback," published in March 2022 by Long Ouyang and colleagues (arXiv:2203.02155). It described a three-stage pipeline: supervised fine-tuning on demonstration data, reward model training on pairwise human preferences, and PPO fine-tuning against the learned reward. The PPO stage added a per-token KL penalty to the reward; the variant that also mixes a pretraining loss into the update is called PPO-ptx. The same recipe powered ChatGPT (November 2022) and GPT-4 (March 2023).
The InstructGPT paper became the practical template for industry RLHF. Anthropic adopted a similar PPO-based pipeline for early Claude, augmented with Constitutional AI. DeepMind's Sparrow used PPO with KL regularization. Google used PPO for instruction tuning of LaMDA and PaLM-derived models that became Bard and early Gemini. Llama 2 Chat and most other open RLHF replications relied on PPO.
Several factors contributed to PPO's dominance in LLM RLHF from 2022 to 2024. PPO was already the de facto RL standard with mature implementations. The clip-based trust region behaves well even with huge transformers and sparse reward. The per-token KL penalty against a reference policy fits naturally into PPO-Penalty style training. PPO can be combined with a value head sharing parameters with the LLM, keeping the critic relatively cheap.
A widely cited practical reference on PPO for LLMs is the 2023 paper "Secrets of RLHF in Large Language Models Part I: PPO" by the MOSS-RLHF team at Fudan (arXiv:2307.04964), which documents the implementation details (reward normalization, value clipping, KL coefficient warmup, per-token KL penalty design) that determine whether PPO trains stably on an LLM. Without these tricks, PPO-on-LLM training is notoriously brittle.
Despite its success, PPO has several characteristics that became liabilities at frontier-model scale. It requires four large model copies (policy, value, reward, reference) resident in GPU memory. The value model is typically initialized from the pretrained backbone, doubling the trainable memory cost. The on-policy sampling loop is computationally expensive for billion-parameter LLMs that must generate long sequences. Finally, the KL coefficient and other hyperparameters remain genuinely sensitive to tuning, with small perturbations producing dramatic performance swings.
These pain points motivated two waves of replacements. DPO (Rafailov et al., May 2023) eliminated reinforcement learning entirely by recognizing that the constrained reward maximization underlying RLHF admits a closed-form solution, allowing the reward function to be reparameterized in terms of the optimal policy itself. The result is a simple binary cross-entropy loss over preference pairs, optimized offline with no reward model, no value model, and no sampling. By late 2023, Zephyr 7B and Mixtral 8x7B Instruct had adopted DPO-style training, and Phi-3 followed in 2024.
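For a single preference pair, the DPO loss is the negative log-sigmoid of a scaled difference of log-ratios between the policy and the reference. The sketch below takes summed sequence log-probabilities as plain floats; the function and argument names and the value beta = 0.1 are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2), about 0.693.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy prefers the chosen answer more than the reference does: the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```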
GRPO (Shao et al., February 2024) kept the on-policy RL structure but eliminated the value model. Instead of a learned critic, GRPO samples multiple completions per prompt and uses the group's mean reward as a baseline. This cuts memory cost by roughly a quarter to a third. DeepSeek-R1 used GRPO to train a frontier reasoning model in late 2024 and early 2025, and after that release GRPO and its descendants (DAPO, Dr. GRPO, GSPO, REINFORCE++) became the dominant post-training algorithms for open reasoning models. Sebastian Raschka summarized 2025 as the year of "RLVR plus GRPO" (RLVR being reinforcement learning with verifiable rewards), in contrast to the 2022 era of "RLHF plus PPO."
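The group baseline can be sketched in a few lines. Subtracting the group mean is what the text above describes; dividing by the group standard deviation is an additional normalization used in the DeepSeekMath formulation and is included here as an assumption, as are the function name, the use of the population standard deviation, and the small `eps` guard.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for several completions of the same prompt, with no critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Every token of completion i shares the same advantage value.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four completions scored 1 (correct) or 0 (incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every completion gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

These advantages replace the GAE output in the clipped surrogate, which is why GRPO needs no value model in memory.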
| Year | System | Organization | RL algorithm | Notes |
|---|---|---|---|---|
| 2018 to 2019 | OpenAI Five (Dota 2) | OpenAI | PPO | First PPO demonstration at scale; defeated the reigning world champions in 2019 |
| 2019 | Fine-tuning LMs from human preferences | OpenAI | PPO | Ziegler et al., first PPO-on-LM paper |
| 2020 | Learning to summarize from human feedback | OpenAI | PPO | Stiennon et al., showed RLHF beat MLE |
| 2022 | InstructGPT | OpenAI | PPO-PTX | Ouyang et al., template for industry RLHF |
| 2022 | ChatGPT | OpenAI | PPO | Built on InstructGPT pipeline |
| 2022 | Sparrow | DeepMind | PPO | Glaese et al. |
| 2022 | Claude (v1) | Anthropic | PPO | Combined with Constitutional AI |
| 2023 | GPT-4 | OpenAI | PPO | Refined RLHF pipeline based on InstructGPT |
| 2023 | Llama 2 Chat | Meta | PPO | Open weights with RLHF |
| 2023 | Bard / early Gemini | Google | PPO | Built on PaLM 2 with RLHF |
| 2023 | Zephyr 7B | Hugging Face | DPO | First major open model trained purely with DPO |
| 2023 | Mixtral 8x7B Instruct | Mistral AI | DPO | DPO superseded PPO for many open models |
| 2024 | Llama 3 Instruct | Meta | DPO + PPO + Rejection Sampling | Hybrid pipeline |
| 2024 | DeepSeekMath | DeepSeek | GRPO | Paper that introduced GRPO |
| 2025 | DeepSeek-R1 | DeepSeek | GRPO | Frontier reasoning model with critic-free RL |
| 2025 | Qwen 2.5 / Qwen 3 reasoners | Alibaba | GRPO variants | DAPO and GSPO variants |
| 2025 | Llama 4 post-training | Meta | GRPO + DPO | Mixed pipeline with GRPO for reasoning |
PPO is implemented in essentially every RL library released since 2018. The most widely used implementations are:
| Library | Maintainer | Primary use case | Notes |
|---|---|---|---|
| OpenAI Baselines | OpenAI | Research reference | Original implementation, now archived |
| Stable-Baselines3 (SB3) | DLR-RM | Single-machine RL | PyTorch, beginner-friendly, most cited baseline |
| RLlib | Anyscale (Ray) | Distributed RL | Large clusters and multi-agent setups |
| CleanRL | Costa Huang et al. | Research and education | Transparent single-file implementations |
| TRL | Hugging Face | LLM RLHF | PPOTrainer and PPOv2Trainer, integrates with PEFT |
| trlX | CarperAI | LLM RLHF at scale | Distributed PPO for billion-parameter LMs |
| OpenRLHF | OpenLLMAI | LLM RLHF at scale | DeepSpeed-based; supports PPO, DPO, GRPO |
| verl | ByteDance | LLM RLHF at scale | Hybrid engine, used at ByteDance |
| Tianshou | THU-ML | Modular RL research | Lightweight PyTorch library |
A 2025 paper by Hundal and colleagues, "On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations" (arXiv:2503.22575), found dramatic performance differences across these libraries on identical benchmarks: Stable-Baselines3, CleanRL, and the original Baselines formed a high-performing group, while RLlib and Tianshou underperformed substantially on some Atari tasks. The lesson is that the abstract PPO algorithm is not a single object but a family of closely related procedures whose performance depends sensitively on dozens of implementation details.
PPO's enduring popularity rests on a small number of robust empirical properties. The clipped trust region prevents catastrophic policy collapse, which is the most common failure mode of vanilla policy gradients. The objective is only a few lines of code on top of any automatic differentiation framework, with no second-order solvers or specialized linear algebra. Multiple epochs of minibatch updates per environment rollout give PPO better sample efficiency than single-step policy gradients. The same algorithm trains Atari agents, MuJoCo robots, Dota 2 bots, and language models pretrained on trillions of tokens with only minor changes to the reward and the rollout machinery.
The weaknesses of PPO have also been catalogued extensively, especially in the LLM RLHF setting. PPO requires careful tuning of the KL coefficient, clip range, learning rate, and value function settings, and implementations vary substantially in performance. In LLM RLHF it requires four large models in memory simultaneously, which makes training very large models prohibitively expensive. The on-policy sampling loop means fresh rollouts are required for every gradient update batch, which is expensive for billion-parameter LLMs generating long sequences. The value function is hard to train when rewards arrive only at the end of long sequences, and a poorly trained critic can drag down the policy. Like all reward-model-based RLHF methods, PPO can exploit artifacts in the reward model rather than improving the underlying behavior. Engstrom et al. (2020) further showed that most of PPO's advantage over TRPO comes from engineering tricks rather than the clipped objective itself.
These limitations were the proximate cause of the post-2023 migration toward DPO, GRPO, and related methods for LLM alignment. PPO remains the dominant choice for environments where preference data is unavailable or where on-policy interaction is essential, such as game-playing agents, robotics, and code execution feedback loops.
Reinforcement learning, policy gradient, TRPO, GAE, RLHF, InstructGPT, DPO, GRPO, KTO, John Schulman, OpenAI.