Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is an on-policy policy gradient reinforcement learning algorithm introduced in July 2017 by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI. The paper "Proximal Policy Optimization Algorithms" (arXiv:1707.06347) proposed a family of objective functions that approximate the trust region behavior of TRPO using only first-order optimization. PPO became OpenAI's default reinforcement learning algorithm in 2017 and remained one of the most widely used policy gradient methods for nearly a decade across robotics, video game agents, and large language model alignment.

PPO is most famous outside the RL community as the workhorse algorithm of reinforcement learning from human feedback (RLHF) for large language models. It was the optimizer behind OpenAI's InstructGPT (2022), ChatGPT, and GPT-4, and behind early versions of Anthropic's Claude and Google's Gemini. From 2022 through 2024, almost every aligned production LLM was post-trained with PPO. Starting in 2023, simpler alternatives began displacing PPO for many RLHF workloads: first the offline Direct Preference Optimization (DPO), then from 2024 the on-policy but critic-free Group Relative Policy Optimization (GRPO). PPO nevertheless remains the standard baseline and is still used inside many frontier post-training stacks.

Background

Policy gradient methods

Reinforcement learning trains an agent to maximize cumulative reward by interacting with an environment. In the policy gradient family, the agent's behavior is a parameterized stochastic policy pi_theta(a|s) mapping states s to a distribution over actions a. The parameters theta are updated by ascending the gradient of the expected return J(theta) = E[sum_t gamma^t r_t]. The classic REINFORCE estimator, due to Ronald Williams in 1992, uses Monte Carlo returns to estimate this gradient. While unbiased, REINFORCE has high variance, and its single-step updates can produce destructively large policy changes that collapse training.
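The Monte Carlo return that REINFORCE weights each log-probability gradient by can be sketched in a few lines. This is a minimal illustration, assuming a single finished episode stored as a list of rewards; the function name `discounted_returns` is hypothetical and not taken from any library.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_{k>=0} gamma^k * r_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # With gamma = 0.5: G_2 = 2.0, G_1 = 0 + 0.5 * 2.0, G_0 = 1 + 0.5 * 1.0
    print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Because every G_t sums all later rewards of a single sampled trajectory, its value changes a lot from episode to episode, which is the source of the high variance mentioned above.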

Actor-critic methods reduce variance by introducing a learned value function V_phi(s) as a baseline. The policy is updated using the advantage A_t = Q(s_t, a_t) - V(s_t), which estimates how much better an action is than the policy's average behavior in that state. Schulman, Moritz, Levine, Jordan, and Abbeel formalized a flexible family of advantage estimators in 2015 with Generalized Advantage Estimation (GAE), which mixes multi-step temporal-difference errors via an exponentially weighted average controlled by the parameter lambda. GAE has since become a near-universal component of practical policy gradient pipelines, including every standard implementation of PPO.

Trust region policy optimization

The immediate predecessor of PPO is Trust Region Policy Optimization (TRPO), introduced in February 2015 by Schulman, Levine, Moritz, Jordan, and Abbeel (arXiv:1502.05477). TRPO addressed the destructive-update problem by maximizing a surrogate objective subject to a constraint on the average KL divergence between the new and old policies:

maximize    E_t [ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A_t ]
subject to  E_t [ KL( pi_theta_old(.|s_t) || pi_theta(.|s_t) ) ] <= delta

The constraint defines a "trust region" inside which the linear approximation of the objective is reliable. TRPO solves the constrained problem with a second-order method using the Fisher information matrix and conjugate gradients to find the natural gradient direction, followed by a line search to enforce the KL constraint.
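To make the constraint concrete, the sketch below evaluates the average KL divergence for discrete action distributions and checks it against delta. It only tests whether a candidate policy lies inside the trust region; it does not perform TRPO's conjugate-gradient step or line search. The function names and the value delta = 0.01 are illustrative assumptions.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def mean_kl(old_dists, new_dists):
    """Average the per-state KL over a batch of visited states."""
    total = sum(categorical_kl(o, n) for o, n in zip(old_dists, new_dists))
    return total / len(old_dists)


def within_trust_region(old_dists, new_dists, delta=0.01):
    """True if the candidate policy satisfies the average-KL constraint."""
    return mean_kl(old_dists, new_dists) <= delta


if __name__ == "__main__":
    old = [[0.5, 0.5]]
    print(within_trust_region(old, [[0.52, 0.48]]))  # small change: True
    print(within_trust_region(old, [[0.90, 0.10]]))  # large change: False
```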

TRPO produces monotonic improvement guarantees in theory and stable training in practice, but has serious engineering drawbacks. The second-order step is awkward with deep networks that use dropout, batch norm, or shared policy-value parameters, and it is difficult to parallelize across GPUs. These limitations motivated the search for a simpler first-order method, which is exactly what PPO delivered.

The PPO paper

The paper "Proximal Policy Optimization Algorithms" was submitted to arXiv on July 20, 2017 by five authors all at OpenAI: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. It proposed two related variants of a clipped or penalized surrogate objective that could be optimized with standard stochastic gradient ascent on minibatches, allowing multiple epochs of updates from a single batch of trajectories.

The paper's empirical claim was that PPO outperformed prior online policy gradient methods on a suite of benchmark tasks (MuJoCo continuous control and Atari 2600 games) while being far simpler to implement than TRPO. PPO matched or exceeded TRPO on most MuJoCo tasks. On Atari it was competitive with actor-critic with experience replay (ACER): PPO won on more games when scored by average reward over the whole of training, while ACER won on more games when scored by final performance. The simplicity argument was arguably more important: PPO could be implemented in roughly fifty lines of code on top of any automatic differentiation framework, with no need for second-order optimization or specialized matrix-vector product routines.

The clipped surrogate objective

PPO defines a probability ratio between the new and old policies:

r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)

At the start of each optimization phase pi_theta_old is set to pi_theta, so initially r_t = 1. As the policy parameters change, r_t drifts away from one. The naive surrogate objective is E_t [ r_t(theta) * A_t ], which is the same first-order term that appears in TRPO. Maximizing this objective without a constraint can drive r_t to extreme values and destabilize training.

PPO-Clip

The more popular variant, often called PPO-Clip, defines the clipped surrogate objective:

L_CLIP(theta) = E_t [ min( r_t(theta) * A_t,
                            clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t ) ]

The clip function squashes the ratio into [1 - epsilon, 1 + epsilon], and the outer min makes the objective a pessimistic lower bound on the unclipped surrogate. When the advantage is positive, the objective stops increasing once r_t > 1 + epsilon, removing the incentive to raise the action's probability further. When the advantage is negative, the objective stops increasing once r_t < 1 - epsilon, removing the incentive to lower it further. Samples whose ratio has already left the clip range in the direction the advantage favors contribute zero gradient, while samples whose ratio has moved in the wrong direction are still penalized in full.

The epsilon hyperparameter controls trust region size. The original paper used epsilon = 0.2 and reported PPO-Clip was robust across a wide range of values. This simple clipping trick is what most modern implementations refer to as "PPO."
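The clipped objective can be written directly from the formula. The sketch below is a plain-Python illustration that takes per-sample log-probabilities and advantages as lists; the function name `clipped_surrogate` is an assumption for illustration, and real implementations compute the same quantity on tensors so that it can be differentiated.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean PPO-Clip objective (to be maximized) over a batch of samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # r_t = pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Pessimistic choice between the unclipped and the clipped term.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # r = 1.5, A = +2: the gain is capped at (1 + epsilon) * A.
    print(clipped_surrogate([math.log(1.5)], [0.0], [2.0]))
    # r = 1.5, A = -1: moving the wrong way is not clipped.
    print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

In the first call the result is capped near 2.4 rather than 3.0; in the second the full value near -1.5 is kept, which shows the asymmetry introduced by the outer min.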

PPO-Penalty

The second variant, PPO-Penalty, replaces the clip with an explicit KL divergence penalty:

L_KLPEN(theta) = E_t [ r_t(theta) * A_t - beta * KL( pi_theta_old(.|s_t) || pi_theta(.|s_t) ) ]

The coefficient beta is adapted online: after each policy update the empirical KL is measured, and beta is doubled if the KL exceeds 1.5 times the target or halved if it falls below the target divided by 1.5. This adaptive scheme approximates a KL constraint without solving a constrained optimization problem.
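The adaptation rule is small enough to state as code. This is a sketch of the doubling and halving rule described above; the function name `adapt_kl_coefficient` is hypothetical.

```python
def adapt_kl_coefficient(beta, measured_kl, target_kl):
    """Update the KL penalty coefficient after one policy update."""
    if measured_kl > 1.5 * target_kl:
        return beta * 2.0  # policy moved too far: penalize KL more strongly
    if measured_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta            # inside the tolerance band: keep beta unchanged


if __name__ == "__main__":
    print(adapt_kl_coefficient(1.0, measured_kl=0.020, target_kl=0.01))  # 2.0
    print(adapt_kl_coefficient(1.0, measured_kl=0.005, target_kl=0.01))  # 0.5
    print(adapt_kl_coefficient(1.0, measured_kl=0.010, target_kl=0.01))  # 1.0
```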

The original paper reported that PPO-Penalty performed worse than PPO-Clip on MuJoCo, which is why the clipped variant became the default. KL penalties later became central to LLM RLHF, although there the penalty is measured against a frozen reference policy rather than the previous policy iterate.

Full objective with value and entropy terms

In the actor-critic implementation used in the paper and in essentially every modern codebase, the full objective combines three terms:

L(theta) = L_CLIP(theta) - c1 * L_VF(theta) + c2 * S(pi_theta)

where L_VF = (V_phi(s_t) - V_target_t)^2 is the value loss, S is the entropy of the policy distribution, and c1, c2 are scalar coefficients. The value loss supervises the critic, and the entropy bonus encourages exploration.
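The sketch below combines the three terms into the scalar loss that an optimizer would minimize, which is the negative of L(theta). It assumes per-sample lists as inputs and that the per-state policy entropy has already been computed; the function name `ppo_loss` and the default coefficients c1 = 0.5 and c2 = 0.01 are illustrative choices, not values prescribed by the paper.

```python
import math


def ppo_loss(logp_new, logp_old, advantages, values, value_targets,
             entropies, epsilon=0.2, c1=0.5, c2=0.01):
    """Return -(L_CLIP - c1 * L_VF + c2 * S), averaged over the batch."""
    n = len(advantages)
    l_clip = l_vf = entropy = 0.0
    for i in range(n):
        ratio = math.exp(logp_new[i] - logp_old[i])
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        l_clip += min(ratio * advantages[i], clipped * advantages[i])
        l_vf += (values[i] - value_targets[i]) ** 2  # squared critic error
        entropy += entropies[i]                      # entropy bonus S
    objective = (l_clip - c1 * l_vf + c2 * entropy) / n
    return -objective  # minimizing the loss maximizes the objective


if __name__ == "__main__":
    loss = ppo_loss([0.0], [0.0], [1.0], [1.0], [0.0], [0.5])
    print(loss)  # about -(1.0 - 0.5 * 1.0 + 0.01 * 0.5) = -0.505
```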

Default hyperparameters

The original paper's MuJoCo hyperparameters, with small modifications, remain the recommended defaults in most modern PPO implementations.

Hyperparameter	Original paper value	Typical range
Clip parameter `epsilon`	0.2	0.1 to 0.3
Discount factor `gamma`	0.99	0.95 to 0.999
GAE lambda	0.95	0.9 to 0.99
Learning rate (Adam)	3e-4	1e-5 to 5e-4
Number of actors	1 to 32	1 to thousands
Horizon `T` per actor	2048	128 to 4096
Optimization epochs per batch	10	3 to 30
Minibatch size	64	32 to 4096
Value loss coefficient `c1`	1.0	0.5 to 1.0
Entropy coefficient `c2`	0.0	0.0 to 0.01
Target KL (for early stopping)	not in paper	0.01 to 0.05
Gradient clipping (global norm)	not in paper (0.5 in OpenAI Baselines)	0.5 to 1.0

An important fact missing from the original paper is that low-level implementation details matter enormously in practice. A 2020 study by Logan Engstrom and colleagues at MIT, "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO," showed that much of PPO's empirical edge over TRPO came from engineering choices such as value function clipping, observation normalization, reward scaling, orthogonal initialization, and learning rate annealing. The clipping objective contributed less than the scaffolding around it. CleanRL's widely cited "The 37 Implementation Details of Proximal Policy Optimization" (Huang et al., 2022) catalogs the engineering tricks any production-grade PPO must respect.

Algorithm pseudocode

A typical PPO training loop looks like the following. The overall structure is broadly the same across CleanRL, Stable-Baselines3, RLlib, and TRL, although the details differ.

initialize policy parameters theta and value parameters phi
for iteration = 1, 2, ... do
  for actor = 1 to N do
    run policy pi_theta in the environment for T timesteps
    record states, actions, rewards, log-probs, and value predictions
  compute advantages A_hat_t with GAE(gamma, lambda)
  compute returns R_t = A_hat_t + V_phi(s_t)
  normalize advantages (mean zero, unit variance)
  for epoch = 1 to K do
    shuffle data and split into minibatches
    for each minibatch do
      compute ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
      compute clipped surrogate loss L_CLIP
      compute value loss L_VF
      compute entropy bonus S
      gradient step on (- L_CLIP + c1 * L_VF - c2 * S) with Adam
      optionally clip global gradient norm
      optionally early-stop epoch if mean KL exceeds target
  set theta_old = theta

The inner double loop (epochs over minibatches) is what distinguishes PPO from vanilla policy gradient: each batch is reused for multiple gradient steps, which makes PPO more sample-efficient than methods that take a single gradient step per batch. The clipping makes the multi-step updates safe; without it the ratio would drift further from one with each step and the policy gradient estimate would become unreliable.
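The loop structure can be exercised end to end on a toy problem. The sketch below is an assumption-heavy illustration rather than a reference implementation: the environment is a hypothetical two-armed bandit, the policy is a two-logit softmax, the mean batch reward stands in for the critic and GAE, plain gradient ascent replaces Adam, and the gradient of the clipped objective is written out by hand. All names and hyperparameter values are illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train_bandit_ppo(iterations=50, batch_size=64, epochs=4, minibatch_size=16,
                     epsilon=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]       # logits of a two-action softmax policy
    arm_reward = [0.0, 1.0]  # hypothetical environment: arm 1 is better
    for _ in range(iterations):
        # Rollout phase: sample from the current policy, store old log-probs.
        probs_old = softmax(theta)
        actions = [0 if rng.random() < probs_old[0] else 1
                   for _ in range(batch_size)]
        logp_old = [math.log(probs_old[a]) for a in actions]
        rewards = [arm_reward[a] for a in actions]
        baseline = sum(rewards) / batch_size  # stands in for the critic
        advantages = [r - baseline for r in rewards]
        # Optimization phase: several epochs of minibatch gradient ascent.
        indices = list(range(batch_size))
        for _ in range(epochs):
            rng.shuffle(indices)
            for start in range(0, batch_size, minibatch_size):
                batch = indices[start:start + minibatch_size]
                probs = softmax(theta)
                grad = [0.0, 0.0]
                for i in batch:
                    a, adv = actions[i], advantages[i]
                    ratio = math.exp(math.log(probs[a]) - logp_old[i])
                    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
                    # If min() selects the clipped term, that term is constant
                    # in theta, so the sample contributes no gradient.
                    if ratio * adv <= clipped * adv:
                        for k in range(2):
                            dlogp = (1.0 if k == a else 0.0) - probs[k]
                            grad[k] += adv * ratio * dlogp / len(batch)
                for k in range(2):
                    theta[k] += lr * grad[k]
    return softmax(theta)


if __name__ == "__main__":
    print(train_bandit_ppo())  # probability mass should concentrate on arm 1
```

Within one iteration the update stalls once the ratios leave the clip range, so later minibatches in the same iteration contribute little; fresh rollouts in the next iteration reset the ratio to one.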

Generalized advantage estimation

GAE is so tightly coupled with PPO that practitioners often treat them as one algorithm. The advantage at time t is A_hat_t = sum_{l>=0} (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V_phi(s_{t+1}) - V_phi(s_t) is the one-step TD error. The parameter lambda interpolates between high-variance Monte Carlo returns (lambda = 1) and high-bias one-step TD estimates (lambda = 0). The GAE paper recommends lambda in the range 0.92 to 0.98, with 0.95 the most common default.
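The infinite sum is usually computed with a backward recursion, A_hat_t = delta_t + gamma * lambda * A_hat_{t+1}. The sketch below assumes a single rollout segment with no episode termination inside it and a bootstrap value for the state after the last step; the function name `compute_gae` and its argument names are illustrative.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout segment of length T."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # backward recursion
        advantages[t] = gae
    # Value targets, matching R_t = A_hat_t + V_phi(s_t) in the pseudocode.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    # lambda = 1 with a zero critic recovers Monte Carlo returns.
    print(compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
    # lambda = 0 keeps only the one-step TD error at each step.
    print(compute_gae([1.0, 1.0], [0.5, 0.5], 0.5, gamma=1.0, lam=0.0))
```

A production implementation would also zero the bootstrap term and reset the recursion at episode boundaries, which this sketch omits.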

Comparison with related algorithms

The table below contrasts PPO with its predecessor TRPO and with the two algorithms most often used to replace it in LLM alignment: DPO and GRPO.

Property	TRPO (2015)	PPO (2017)	DPO (2023)	GRPO (2024)
Family	On-policy RL	On-policy RL	Offline supervised	On-policy RL
Trust region method	Second-order, KL constraint	First-order, clipped ratio	Implicit via KL regularizer	First-order, clipped ratio
Requires reward model	Yes (in RLHF)	Yes (in RLHF)	No	Yes (in RLHF)
Requires value model (critic)	Yes	Yes	No	No
Models in memory (RLHF)	4 (policy, value, reward, reference)	4	2 (policy, reference)	3 (policy, reward, reference)
Optimization	Conjugate gradient + line search	Adam SGD	Adam SGD	Adam SGD
Sample efficiency	High	High	Highest (offline)	High
Implementation complexity	Very high	Moderate	Low	Low
Hyperparameter sensitivity	High	High	Moderate	Moderate
Memory overhead vs PPO	Same	Baseline	-50%	-25 to -33%
First major LLM use	None	InstructGPT, ChatGPT, GPT-4	Zephyr, Llama 3	DeepSeek-R1, Qwen, Llama post-training
Year of dominance	Never (replaced by PPO)	2017 to 2024	2023 to 2025	2024 onward

TRPO and PPO share the same on-policy actor-critic skeleton and differ only in how they enforce the trust region: a hard KL constraint solved by second-order methods (TRPO) versus a soft clipped objective solved by Adam (PPO). DPO is not a reinforcement learning algorithm at all; it reformulates the RLHF objective into a closed-form supervised loss on preference pairs. GRPO keeps the on-policy structure of PPO but replaces the learned critic with a group-based baseline from multiple sampled completions of the same prompt. Other variants such as KTO (Kahneman-Tversky Optimization) explore alternative offline preference objectives.

Application to RLHF

The most consequential use of PPO has been the alignment of large language models via reinforcement learning from human feedback. The LLM is the policy pi_theta(a|s); the state s is the prompt plus the tokens generated so far, the action a is the next token, and the reward comes from a separate reward model trained on human preference data. To prevent the policy from drifting too far from the pretrained base, RLHF augments the reward with a per-token KL penalty against a frozen reference policy, typically the supervised fine-tuned model.
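One common way to build the per-token reward is sketched below, under stated assumptions: the KL term is estimated per token by the log-probability difference between policy and reference on the sampled token, and the reward model's scalar score is added on the final token only. The function name `rlhf_token_rewards`, the argument names, and the coefficient value are hypothetical; individual RLHF codebases differ in these details.

```python
def rlhf_token_rewards(logp_policy, logp_reference, sequence_score, kl_coef=0.1):
    """Per-token rewards for one non-empty sampled completion.

    logp_policy / logp_reference: log-probabilities that the policy and the
    frozen reference model assign to each generated token.
    sequence_score: scalar from the reward model for the whole completion.
    """
    # Penalize each token by the sampled estimate of the KL divergence.
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_reference)]
    # The reward model judges the full sequence, so its score lands on the end.
    rewards[-1] += sequence_score
    return rewards


if __name__ == "__main__":
    print(rlhf_token_rewards([-1.0, -2.0], [-1.5, -2.0], 1.0, kl_coef=0.5))
    # [-0.25, 1.0]
```

These per-token rewards then feed into advantage estimation in the same way environment rewards do in ordinary PPO.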

InstructGPT and ChatGPT

The foundational paper is OpenAI's "Training language models to follow instructions with human feedback," published in March 2022 by Long Ouyang and colleagues (arXiv:2203.02155). It described a three-stage pipeline: supervised fine-tuning on demonstration data, reward model training on pairwise human preferences, and PPO fine-tuning against the learned reward. The PPO stage added a per-token KL penalty to the reward; the variant that additionally mixes a pretraining loss into the PPO objective is called PPO-ptx. The same recipe powered ChatGPT (November 2022) and GPT-4 (March 2023).

The InstructGPT paper became the practical template for industry RLHF. Anthropic adopted a similar PPO-based pipeline for early Claude, augmented with Constitutional AI. DeepMind's Sparrow used PPO with KL regularization. Google used PPO for instruction tuning of LaMDA and PaLM-derived models that became Bard and early Gemini. Llama 2 Chat and most other open RLHF replications relied on PPO.

Why PPO worked for LLMs

Several factors contributed to PPO's dominance in LLM RLHF from 2022 to 2024. PPO was already the de facto RL standard with mature implementations. The clip-based trust region behaves well even with huge transformers and sparse reward. The per-token KL penalty against a reference policy fits naturally into PPO-Penalty style training. PPO can be combined with a value head sharing parameters with the LLM, keeping the critic relatively cheap.

A widely cited reference on these details is the 2023 paper "Secrets of RLHF in Large Language Models Part I: PPO" by the MOSS-RLHF team at Fudan (arXiv:2307.04964), which documents the implementation details (reward normalization, value clipping, KL coefficient warmup, per-token KL penalty design) that determine whether PPO trains stably on an LLM. Without these tricks, PPO training on an LLM is notoriously brittle.

Why PPO was eventually replaced

Despite its success, PPO has several characteristics that became liabilities at frontier-model scale. It requires four large model copies (policy, value, reward, reference) resident in GPU memory. When the value model is a separate network rather than a head on the policy, it is typically initialized from the reward model or the fine-tuned backbone and is comparable in size to the policy, roughly doubling the trainable memory cost. The on-policy sampling loop is computationally expensive for billion-parameter LLMs that must generate long sequences. Finally, the KL coefficient and other hyperparameters remain sensitive to tuning; small changes can produce large performance swings.

These pain points motivated two waves of replacements. DPO (Rafailov et al., May 2023) eliminated reinforcement learning entirely by recognizing that the constrained reward maximization underlying RLHF admits a closed-form solution, allowing the reward function to be reparameterized in terms of the optimal policy itself. The result is a simple binary cross-entropy loss over preference pairs, optimized offline with no reward model, no value model, and no sampling. Zephyr 7B and Mixtral 8x7B Instruct (both late 2023) and later Phi-3 (2024) adopted DPO-style training.
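The binary cross-entropy loss can be sketched for a single preference pair as follows. The inputs are assumed to be summed log-probabilities of the chosen and rejected completions under the policy and under the frozen reference model; the function and argument names and the value beta = 0.1 are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen gap - rejected gap))."""
    chosen_gap = policy_chosen_logp - ref_chosen_logp        # implicit reward, chosen
    rejected_gap = policy_rejected_logp - ref_rejected_logp  # implicit reward, rejected
    margin = beta * (chosen_gap - rejected_gap)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen completion more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The second call returns a smaller loss than the first, reflecting that the policy already ranks the chosen completion above the rejected one relative to the reference.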

GRPO (Shao et al., February 2024) kept the on-policy RL structure but eliminated the value model. Instead of a learned critic, GRPO samples multiple completions per prompt and uses the group's mean reward as a baseline, normalizing by the group's standard deviation. This cuts memory cost by roughly a quarter to a third. DeepSeek used GRPO to train its frontier reasoning model DeepSeek-R1, released in January 2025, and after that release GRPO and its descendants (DAPO, Dr. GRPO, GSPO, REINFORCE++) became the dominant post-training algorithms for open reasoning models. In 2025, Sebastian Raschka characterized the year as "RLVR plus GRPO," replacing the 2022 era of "RLHF plus PPO."
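The group-based baseline can be sketched in a few lines. The example assumes one scalar reward per sampled completion of the same prompt and, following the outcome-reward formulation in the GRPO paper, also divides by the group's standard deviation; the function name and the small constant `eps` are illustrative assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own sampling group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions of one prompt, two judged correct (reward 1).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Each completion's advantage is then applied to all of its tokens inside the same clipped-ratio objective that PPO uses, with no critic involved.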

Application history in LLM RLHF

Year	System	Organization	RL algorithm	Notes
2018	OpenAI Five (Dota 2)	OpenAI	PPO	Large-scale PPO demonstration (not an LLM); defeated the reigning world champions in 2019
2019	Fine-tuning LMs from human preferences	OpenAI	PPO	Ziegler et al., first PPO-on-LM paper
2020	Learning to summarize from human feedback	OpenAI	PPO	Stiennon et al., showed RLHF beat MLE
2022	InstructGPT	OpenAI	PPO-PTX	Ouyang et al., template for industry RLHF
2022	ChatGPT	OpenAI	PPO	Built on InstructGPT pipeline
2022	Sparrow	DeepMind	PPO	Glaese et al.
2022	Claude (v1)	Anthropic	PPO	Combined with Constitutional AI
2023	GPT-4	OpenAI	PPO	Refined RLHF pipeline based on InstructGPT
2023	Llama 2 Chat	Meta	PPO	Open weights with RLHF
2023	Bard / early Gemini	Google	PPO	Built on PaLM-2 with RLHF
2023	Zephyr 7B	Hugging Face	DPO	First major open model trained purely with DPO
2023	Mixtral 8x7B Instruct	Mistral AI	DPO	DPO superseded PPO for many open models
2024	Llama 3 Instruct	Meta	DPO + PPO + Rejection Sampling	Hybrid pipeline
2024	DeepSeekMath	DeepSeek	GRPO	First paper to introduce GRPO
2025	DeepSeek-R1	DeepSeek	GRPO	Frontier reasoning model with critic-free RL
2025	Qwen 2.5 / Qwen 3 reasoners	Alibaba	GRPO variants	DAPO and GSPO variants
2025	Llama 4 post-training	Meta	GRPO + DPO	Mixed pipeline with GRPO for reasoning

Implementations

PPO is implemented in essentially every RL library released since 2018. The most widely used implementations are:

Library	Maintainer	Primary use case	Notes
OpenAI Baselines	OpenAI	Research reference	Original implementation, now archived
Stable-Baselines3 (SB3)	DLR-RM	Single-machine RL	PyTorch, beginner-friendly, most cited baseline
RLlib	Anyscale (Ray)	Distributed RL	Large clusters and multi-agent setups
CleanRL	Costa Huang et al.	Research and education	Transparent single-file implementations
TRL	Hugging Face	LLM RLHF	PPOTrainer and PPOv2Trainer, integrates with PEFT
trlX	CarperAI	LLM RLHF at scale	Distributed PPO for billion-parameter LMs
OpenRLHF	OpenLLMAI	LLM RLHF at scale	DeepSpeed-based; supports PPO, DPO, GRPO
verl	ByteDance	LLM RLHF at scale	Hybrid engine, used at ByteDance
Tianshou	THU-ML	Modular RL research	Lightweight PyTorch library

A 2025 paper by Dissanayaka and colleagues, "On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations" (arXiv:2503.22575), found dramatic performance differences across these libraries on identical benchmarks: Stable-Baselines3, CleanRL, and the original Baselines formed a high-performing group, while RLlib and Tianshou underperformed substantially on some Atari tasks. The lesson is that the abstract PPO algorithm is not a single object but a family of closely related procedures whose performance depends sensitively on dozens of implementation details.

Strengths and limitations

PPO's enduring popularity rests on a small number of robust empirical properties. The clipped trust region prevents catastrophic policy collapse, which is the most common failure mode of vanilla policy gradients. The objective is only a few lines of code on top of any automatic differentiation framework, with no second-order solvers or specialized linear algebra. Multiple epochs of minibatch updates per environment rollout give PPO better sample efficiency than single-step policy gradients. The same algorithm trains Atari agents, MuJoCo robots, Dota 2 bots, and billion-parameter language models with only minor changes to the reward and the rollout machinery.

The weaknesses of PPO have also been catalogued extensively, especially in the LLM RLHF setting. PPO requires careful tuning of the KL coefficient, clip range, learning rate, and value function settings, and implementations vary substantially in performance. In LLM RLHF it requires four large models in memory simultaneously, which makes training very large models prohibitively expensive. The on-policy sampling loop means fresh rollouts are required for every gradient update batch, which is expensive for billion-parameter LLMs generating long sequences. The value function is hard to train when rewards arrive only at the end of long sequences, and a poorly trained critic can drag down the policy. Like all reward-model-based RLHF methods, PPO can exploit artifacts in the reward model rather than improving the underlying behavior. Engstrom et al. (2020) further showed that most of PPO's advantage over TRPO comes from engineering tricks rather than the clipped objective itself.

These limitations were the proximate cause of the post-2023 migration toward DPO, GRPO, and related methods for LLM alignment. PPO remains the dominant choice for environments where preference data is unavailable or where on-policy interaction is essential, such as game-playing agents, robotics, and code execution feedback loops.

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P. (2015). "Trust Region Policy Optimization." ICML 2015. arXiv:1502.05477.
Schulman, J., Moritz, P., Levine, S., Jordan, M.I., Abbeel, P. (2015). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv:1506.02438.
Williams, R.J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8: 229-256.
Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." arXiv:2203.02155.
Ziegler, D.M., et al. (2019). "Fine-Tuning Language Models from Human Preferences." arXiv:1909.08593.
Stiennon, N., et al. (2020). "Learning to Summarize from Human Feedback." arXiv:2009.01325.
Engstrom, L., et al. (2020). "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO." ICLR 2020. arXiv:2005.12729.
Huang, S., Dossa, R.F.J., Raffin, A., Kanervisto, A., and Wang, W. (2022). "The 37 Implementation Details of Proximal Policy Optimization." ICLR Blog Post Track.
Zheng, R., et al. (2023). "Secrets of RLHF in Large Language Models Part I: PPO." arXiv:2307.04964.
Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. arXiv:2305.18290.
Shao, Z., et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300.
Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
Dubey, A., et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783.
Raffin, A., et al. (2021). "Stable-Baselines3: Reliable Reinforcement Learning Implementations." JMLR 22(268): 1-8. https://github.com/DLR-RM/stable-baselines3
Huang, S., et al. (2022). "CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms." JMLR 23(274): 1-18. https://github.com/vwxyzjn/cleanrl
Liang, E., et al. (2018). "RLlib: Abstractions for Distributed Reinforcement Learning." ICML 2018.
Hugging Face. "TRL: Transformer Reinforcement Learning library." https://github.com/huggingface/trl
OpenAI (2017). "Proximal Policy Optimization." OpenAI Blog. https://openai.com/index/openai-baselines-ppo/
Berner, C., et al. (2019). "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv:1912.06680.
Glaese, A., et al. (2022). "Improving Alignment of Dialogue Agents via Targeted Human Judgements." arXiv:2209.14375.
Bai, Y., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862.
Dissanayaka, R., et al. (2025). "On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations." arXiv:2503.22575.
Ahmadian, A., et al. (2024). "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs." ACL 2024. arXiv:2402.14740.
