RLOO (REINFORCE Leave-One-Out)

Updated Jul 23, 2026

RLOO (REINFORCE Leave-One-Out) is an online reinforcement learning algorithm for aligning large language models with reward signals such as those derived from human preferences. The method samples $k$ completions for each prompt, then for each completion uses the mean reward of the other $k-1$ completions as a baseline in a REINFORCE-style policy gradient update. The leave-one-out baseline is a classical variance-reduction trick that traces back to Williams's 1992 paper introducing REINFORCE and was adapted for sequence models by Kool, van Hoof and Welling in 2019.^[1]^[2] Its modern application to large language model alignment was popularized by Arash Ahmadian and colleagues at Cohere For AI in the February 2024 paper "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs," which argued that PPO is overengineered for the RLHF setting and that a simpler REINFORCE variant with the leave-one-out baseline outperforms both PPO and DPO on standard alignment benchmarks.^[3] RLOO is implemented in the HuggingFace TRL library as RLOOTrainer and in the OpenRLHF framework, and it has been adopted as a training building block by Cohere and by independent teams reproducing reasoning models.^[4]^[5]

Background

RLHF and the role of PPO

Reinforcement learning from human feedback was introduced for language models by Christiano and colleagues in 2017 and scaled to instruction-following with InstructGPT by Ouyang and colleagues in 2022.^[6] In the canonical formulation, an initial language model is fine-tuned on demonstrations, a reward model is trained on pairwise human preferences, and a policy is then optimized against that reward model using a reinforcement learning algorithm. For several years, Proximal Policy Optimization (PPO) was the dominant choice for that final step, used in InstructGPT, GPT-4, Claude, and many open implementations.^[6]^[7] The Ahmadian et al. paper that introduced the LLM-era version of RLOO opens by noting that PPO "has been positioned by recent literature as the canonical method for the RL part of RLHF."^[3]

PPO originated in continuous-control settings where state-action trajectories are long, rewards are sparse and assigned per step, and gradient updates must be cautious to avoid catastrophic policy divergence.^[7] These conditions led to PPO's central design choices: a learned value network that produces per-token baselines, a generalized advantage estimator (GAE) that interpolates between bias and variance, and a clipped surrogate objective that limits how much the policy can change between updates. Each of these components has a cost.^[3]^[4]

Cost of a value network

The value network is the most expensive of these. In a typical PPO-for-RLHF setup, four full-size model copies must reside in GPU memory at training time: the policy being updated, the frozen reference policy used for the KL penalty, the reward model, and the value (critic) network.^[4] Loading four copies of, say, a 70-billion-parameter model creates significant memory pressure even on H100 clusters. Beyond memory, the value network must itself be trained, requiring a separate loss term, additional gradients, and tuning of its learning rate and update schedule. Ahmadian et al. and the HuggingFace blog post by Costa Huang and Arash Ahmadian both report that the value network is among the most sensitive components of the PPO-for-RLHF pipeline to get right, and that its quality directly affects the quality of the advantage estimates that drive the policy update.^[3]^[4]

The value network is also initialized in a specific way for PPO-RLHF: typically from the reward model rather than from the policy backbone, on the assumption that a model that has learned to predict rewards is a better starting point than one that has learned to predict tokens.^[3] This means the value network must either be loaded separately from disk (consuming extra disk and host memory) or constructed by adding a scalar head to a reward-model checkpoint. Either route adds engineering surface area, and bugs in this initialization have historically been a source of training instability.^[3]^[4]

Mismatch between PPO's assumptions and RLHF

Several authors observed that the assumptions justifying PPO's full apparatus do not hold cleanly in RLHF. Generations from an instruction-tuned policy are short by RL standards (typically under a thousand tokens), the only "true" reward usually arrives at the end of the sequence from the reward model, and the policy is initialized from a high-quality supervised fine-tune that is already close to the optimum. In that regime, treating each token as a separate action and learning a per-token value function may add noise rather than reduce it.^[3] Ahmadian et al. revisit these assumptions and propose treating the entire generation as a single action, removing the value network, and using a Monte Carlo baseline computed from sibling samples.^[3]

In particular, when reward is delivered only at the EOS token, the per-token advantages constructed by GAE in PPO are largely the result of bootstrapping through the learned value function. If that value function is mis-specified or undertrained, the per-token advantages can carry noise that the policy then chases. Ahmadian et al. report ablations showing that several of PPO's defenses against this noise (clipping, value-loss clipping, entropy bonuses) are individually unhelpful in the RLHF setting, suggesting they exist primarily to compensate for problems that RLOO sidesteps by not creating those problems in the first place.^[3]
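The dependence on the critic can be made concrete with a minimal sketch of GAE for a single finished sequence. The function name, the default `gamma` and `lam` values, and both sets of value estimates are hypothetical and chosen only for illustration; the sketch assumes the value after the final token is zero.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for one finished sequence.

    rewards[t] is the reward received after token t and values[t] is the
    critic's estimate V(s_t). The value after the last token is assumed 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD residual: for every token before EOS the reward is 0, so this
        # term is built entirely from the critic's own estimates.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Reward arrives only at the final (EOS) token.
rewards = [0.0, 0.0, 0.0, 1.0]
good_critic = [0.7, 0.8, 0.9, 1.0]  # hypothetical, smoothly rising estimates
bad_critic = [0.1, 0.9, 0.2, 0.6]   # hypothetical, mis-specified estimates

adv_good = gae_advantages(rewards, good_critic)
adv_bad = gae_advantages(rewards, bad_critic)
```

With the same terminal reward, the two critics produce very different per-token advantages, which is the noise channel described above. Setting `gamma = lam = 1` collapses the estimate to the terminal reward minus `values[t]`.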

Origins of the leave-one-out baseline

The variance-reduction technique RLOO uses was not new. Williams's 1992 paper, which introduced REINFORCE, already noted that the gradient estimator $\nabla_\theta \log \pi_\theta(a \mid s) \cdot R$ can be made unbiased and lower-variance by subtracting any baseline that does not depend on the sampled action.^[1] In structured prediction, Kool, van Hoof and Welling's 2019 ICLR workshop paper "Buy 4 REINFORCE Samples, Get a Baseline for Free!" applied the leave-one-out idea to neural combinatorial optimization, showing that drawing several samples per input and using the mean of the others as a baseline removed the need for a learned critic on the Travelling Salesman Problem.^[2] Ahmadian et al. carry that observation across to LLM alignment.^[3]

The Kool, van Hoof and Welling paper also derived a variant of the estimator for sampling without replacement using Stochastic Beam Search, which can be advantageous when the policy's distribution concentrates and naive multinomial sampling produces near-duplicates.^[2] In LLM alignment, the standard practice is sampling with replacement at a non-zero temperature, but the without-replacement variant is occasionally used in research code when group diversity is the bottleneck. Earlier work on REINFORCE-with-baseline for sequence models, including self-critical sequence training by Rennie and colleagues in 2017, used a greedy decoding of the same policy as the baseline. The leave-one-out baseline is a strict generalization that does not require a separate decoding pass.^[1]^[2]

Algorithm

For each prompt $x$ in a training batch, RLOO draws $k$ completions $y_1, y_2, \ldots, y_k$ from the current policy $\pi_\theta(\cdot \mid x)$. A reward model $R$ scores each completion, producing scalar rewards $R(y_i, x)$. A KL penalty against a frozen reference policy $\pi_{\text{ref}}$ is folded into the reward, giving an adjusted reward $r_i = R(y_i, x) - \beta \cdot \mathrm{KL}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big]$, where $\beta$ controls the KL strength.^[5] The leave-one-out baseline for completion $i$ is the mean reward over the other $k-1$ completions of the same prompt:
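As a rough illustration of the adjusted reward, the sketch below estimates the KL term from the sampled tokens alone, as the sum of per-token log-probability differences between the policy and the reference. That single-sample estimator, the function name, the default `beta`, and all numbers are assumptions of this sketch rather than details taken from the cited sources.

```python
def kl_penalized_reward(score, policy_logprobs, ref_logprobs, beta=0.05):
    """Adjusted reward r = R - beta * KL_estimate for one completion.

    The KL term is estimated on the sampled tokens only, as
    sum_t [log pi_theta(y_t) - log pi_ref(y_t)]. This estimator is an
    assumption of the sketch.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return score - beta * kl_estimate


# Hypothetical per-token log-probabilities for a three-token completion.
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.7, -1.5, -0.4]
r = kl_penalized_reward(2.0, policy_lp, ref_lp, beta=0.1)
```

Here the policy assigns its own sample more probability than the reference does, so the estimated KL is positive and the reward-model score is reduced.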

$b_i = \frac{1}{k-1} \sum_{j \ne i} r_j$

The advantage assigned to completion $i$ is the difference between its own reward and this baseline:

$A_i = r_i - b_i$

The REINFORCE loss is then

$\mathcal{L}_{\text{RLOO}}(\theta) = -\frac{1}{k} \sum_{i=1}^{k} A_i \cdot \log \pi_\theta(y_i \mid x)$

where $\log \pi_\theta(y_i \mid x)$ is the sum of token log-probabilities over the entire completion.^[3]^[5] Crucially, RLOO treats each sampled completion as one action: there is no per-token advantage and no per-token baseline. This is the design choice that lets RLOO drop the value network entirely.^[3]^[4]
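The three formulas above can be sketched for a single prompt in plain Python. The function names and the numbers are hypothetical. In an autograd framework the advantages would be treated as constants and only the log-probabilities would carry gradients.

```python
def rloo_advantages(rewards):
    """A_i = r_i - mean of the other k - 1 rewards for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def rloo_loss(rewards, seq_logprobs):
    """-(1/k) * sum_i A_i * log pi(y_i | x), with sequence-level log-probs."""
    advantages = rloo_advantages(rewards)
    k = len(rewards)
    return -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / k


# Hypothetical adjusted rewards and summed token log-probs for k = 4.
rewards = [1.0, 0.0, 0.5, 0.5]
seq_logprobs = [-12.0, -15.0, -11.0, -14.0]

advantages = rloo_advantages(rewards)
loss = rloo_loss(rewards, seq_logprobs)
```

The advantages within a group always sum to zero, so completions that score at the group average (the two with reward 0.5 here) contribute nothing to the update.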

Because the baseline $b_i$ does not depend on the value of the corresponding sample $y_i$ (only on its siblings), subtracting it leaves the gradient unbiased.^[1]^[2] The estimator is also strictly lower variance than the naive REINFORCE estimator with no baseline whenever the rewards across siblings are positively correlated, which is the typical case when all $k$ completions share a prompt.^[2]^[3]
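Unbiasedness can be checked exactly on a toy problem. The sketch below assumes a two-action Bernoulli policy with a single logit, enumerates every possible group of $k$ samples, and computes the exact mean and variance of the gradient estimator with and without the leave-one-out baseline. The policy, rewards, and function name are all hypothetical.

```python
import itertools
import math


def exact_moments(theta, r0, r1, k, use_baseline):
    """Exact mean and variance of the k-sample policy-gradient estimator.

    Policy: P(a = 1) = sigmoid(theta). Rewards: r0 for a = 0, r1 for a = 1.
    Score function: d/dtheta log pi(a) = a - p.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    reward = {0: r0, 1: r1}
    mean = 0.0
    second_moment = 0.0
    for actions in itertools.product((0, 1), repeat=k):
        prob = math.prod(p if a == 1 else 1.0 - p for a in actions)
        rs = [reward[a] for a in actions]
        total = sum(rs)
        grad = 0.0
        for a, r in zip(actions, rs):
            baseline = (total - r) / (k - 1) if use_baseline else 0.0
            grad += (r - baseline) * (a - p)
        grad /= k
        mean += prob * grad
        second_moment += prob * grad * grad
    return mean, second_moment - mean * mean


theta, r0, r1, k = 0.3, 1.0, 2.0, 4
p = 1.0 / (1.0 + math.exp(-theta))
true_grad = p * (1.0 - p) * (r1 - r0)  # d/dtheta of p * r1 + (1 - p) * r0

mean_loo, var_loo = exact_moments(theta, r0, r1, k, use_baseline=True)
mean_plain, var_plain = exact_moments(theta, r0, r1, k, use_baseline=False)
```

Both estimators have expectation equal to `true_grad`, and for these particular rewards the leave-one-out version has the smaller variance. The comparison is specific to this toy setup and is not a general proof.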

A vectorized implementation, used in HuggingFace TRL, reshapes the reward tensor and computes the baseline in a single pass:

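# After the reshape, rows index the k samples and columns index the prompts.
rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
# Column sum minus the sample's own reward, averaged over the k - 1 siblings.
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
advantages = rlhf_reward - baseline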

This avoids an inner loop and exposes the computation to standard tensor parallelism.^[4]
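The memory layout behind that reshape is easy to get wrong, so the following dependency-free sketch spells it out. It assumes the flat reward list is ordered sample-major (all prompts for sample 0, then all prompts for sample 1, and so on), which is what a reshape to `(k, num_prompts)` implies. The function name and numbers are hypothetical.

```python
def rloo_advantages_flat(flat_rewards, k, num_prompts):
    """Leave-one-out advantages for a flat, sample-major reward list.

    Assumed layout: flat_rewards[s * num_prompts + p] is the reward of
    sample s for prompt p, mirroring a reshape to (k, num_prompts).
    """
    if len(flat_rewards) != k * num_prompts:
        raise ValueError("expected k * num_prompts rewards")
    # Per-prompt totals, i.e. the column sums of the (k, num_prompts) view.
    totals = [
        sum(flat_rewards[s * num_prompts + p] for s in range(k))
        for p in range(num_prompts)
    ]
    advantages = []
    for s in range(k):
        for p in range(num_prompts):
            r = flat_rewards[s * num_prompts + p]
            advantages.append(r - (totals[p] - r) / (k - 1))
    return advantages


# Two prompts, k = 2: sample 0 scores (1, 5), sample 1 scores (3, 9).
flat = [1.0, 5.0, 3.0, 9.0]
adv = rloo_advantages_flat(flat, k=2, num_prompts=2)
```

If the rewards were instead stored prompt-major, the same reshape would silently mix completions from different prompts into one group, so the ordering is worth asserting in real code.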

The Ahmadian et al. paper experiments with $k=2$ and $k=4$.^[3] With $k=2$, RLOO degenerates to a paired comparison: the advantage for each sample is the signed reward gap to its single sibling, which the $1/k$ factor in the loss then halves. With $k=4$, the baseline averages three siblings and is correspondingly less noisy, at the cost of four completions per prompt, twice the sampling work of $k=2$. The HuggingFace TRL implementation reports that $k=2$ already beats PPO on the TL;DR summarization benchmark with a 6.9B Pythia base model and that the 6.9B $k=2$ checkpoint reached a 78.7% preferred rate against a reference, exceeding a prior 77.9% benchmark at $k=4$.^[4]
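As a worked instance of the definitions above, the $k=2$ case can be written out directly. Substituting $b_1 = r_2$ and $b_2 = r_1$ gives

$A_1 = r_1 - r_2, \qquad A_2 = r_2 - r_1 = -A_1$

and the loss becomes

$\mathcal{L}_{\text{RLOO}}(\theta) = -\tfrac{1}{2}\big[A_1 \log \pi_\theta(y_1 \mid x) + A_2 \log \pi_\theta(y_2 \mid x)\big] = -\tfrac{1}{2}(r_1 - r_2)\big[\log \pi_\theta(y_1 \mid x) - \log \pi_\theta(y_2 \mid x)\big]$

so each log-probability ends up weighted by half the reward gap, and the update raises the likelihood of the higher-reward completion relative to its sibling.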

In TRL versions before 0.22, the parameter controlling the number of samples per prompt was named rloo_k; in 0.22 and later it is num_generations, harmonizing naming with the GRPO trainer.^[5] The current TRL trainer also supports a clipped importance-sampling correction for the case when multiple gradient steps are taken on the same generation batch, in which case the off-policy ratio $\pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$ is clipped to $[1 - \epsilon, 1 + \epsilon]$. With a single gradient step per generation (the default), this ratio is identically one and the loss reduces to plain REINFORCE.^[5]
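A minimal sketch of such a sequence-level clipped term follows. It uses the standard PPO-style pessimistic surrogate, which is an assumption about the general shape of the objective and not a copy of TRL's code; the function name, the default `epsilon`, and the numbers are hypothetical.

```python
import math


def clipped_sequence_loss(advantage, logprob_new, logprob_old, epsilon=0.2):
    """PPO-style clipped surrogate applied to one whole completion.

    logprob_new and logprob_old are summed token log-probabilities of the
    same completion under the current and the sampling-time policy.
    """
    ratio = math.exp(logprob_new - logprob_old)
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Pessimistic (minimum) surrogate, negated because optimizers minimize.
    return -min(ratio * advantage, clipped * advantage)


# On-policy: the ratio is exactly 1, so the value is simply -advantage.
on_policy = clipped_sequence_loss(0.5, -12.0, -12.0)
# Off-policy: the ratio is e, so it is capped at 1 + epsilon.
off_policy = clipped_sequence_loss(0.5, -11.0, -12.0)
```

In the on-policy case the gradient of `ratio * advantage` with respect to the policy parameters matches that of `advantage * logprob_new`, which is why a single step per generation batch recovers the plain REINFORCE update.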

Comparison to PPO

The structural differences between RLOO and PPO in the LLM setting are the value network and the granularity of the action.^[3]^[4]

PPO models each token as an action, learns a per-token value function $V(s_t)$ on top of the policy backbone (often sharing a body with the policy and adding a scalar head), and computes per-token advantages using GAE. A clipped surrogate objective bounds the per-token importance-sampling ratio.^[7] The value network is updated jointly with the policy, requiring its own loss term and learning rate.^[4]

RLOO models each completion as a single action, uses the leave-one-out baseline computed across siblings as a closed-form Monte Carlo estimate of the expected return, and (in the default online setting) takes one gradient step per batch so that no importance-sampling clipping is needed.^[3]^[4]^[5] In the multi-step or off-policy regime, RLOO can optionally adopt the same clipped objective as PPO, but applied at the sequence level rather than the token level.^[5]

The practical effect on resources is significant. The TRL "Putting RL back in RLHF" blog post by Huang and Ahmadian, which accompanied the initial TRL implementation in June 2024, reports that RLOO uses approximately 50% to 70% less GPU memory (VRAM) than PPO depending on model size, runs roughly 2x faster than PPO with a 1-billion-parameter model, and runs approximately 3x faster than PPO with a 6.9-billion-parameter model.^[4] Memory consumption drops because RLOO needs only three model copies in memory (policy, reference, reward) versus PPO's four (policy, reference, reward, value).^[4] The speed gain comes partly from the missing value-network forward and backward passes and partly from RLOO's effective batch being amortized across $k$ generations, which improves throughput on the generation step.^[4]

Hyperparameter complexity also drops. PPO's behavior is sensitive to the value-network learning rate, the GAE $\lambda$ parameter, the clip range $\epsilon$, the value-loss coefficient, and the entropy bonus.^[7] RLOO has effectively three knobs: the number of samples per prompt $k$, the KL coefficient $\beta$, and the policy learning rate.^[3]^[5] Ahmadian et al. frame this hyperparameter reduction as the central practical benefit of the method.^[3]

Comparison to GRPO

GRPO (Group Relative Policy Optimization) was introduced by Shao and colleagues at DeepSeek in the DeepSeekMath paper of February 2024 and reached wide attention through the DeepSeek-R1 paper of January 2025.^[8] Like RLOO, GRPO removes the value network and computes advantages from a group of sibling samples. Unlike RLOO, GRPO normalizes the advantage by both the group mean and the group standard deviation:

$A_i^{\text{GRPO}} = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}$

where $G$ is the group size.^[8]^[9] By contrast, RLOO subtracts only the leave-one-out mean and does not divide by the standard deviation:

$A_i^{\text{RLOO}} = r_i - \frac{1}{k-1}\sum_{j \ne i} r_j$

and the mean it subtracts excludes the current sample.^[3]^[5]

These differences are small in code but consequential in theory. Excluding the current sample from its own baseline (the leave-one-out choice) makes the RLOO estimator strictly unbiased; including it in the mean, as GRPO does, introduces a small bias that vanishes only as $G \to \infty$.^[2]^[9] Dividing by the standard deviation, on the other hand, is a normalization that rescales the advantage and tends to stabilize the gradient magnitude when reward scales vary across prompts.^[8]^[9] In practice GRPO has been the preferred algorithm for reasoning-style training where rewards are binary correctness signals (the standard deviation across a group then sharpens the gradient on prompts where the model is partially correct), while RLOO has been more widely used for open-ended preference-style alignment where the reward is a dense scalar from a learned reward model.^[4]^[8]
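The size of the mean-inclusion effect can be seen numerically. Algebraically, subtracting the full group mean gives exactly $(G-1)/G$ times the leave-one-out advantage, so the effect is a uniform shrink of the gradient that disappears as $G$ grows. The sketch below checks that identity and also computes a standard-deviation-normalized variant; the use of the population standard deviation, the small `eps`, and the reward values are assumptions of this sketch.

```python
import statistics


def rloo_adv(rewards):
    """Leave-one-out advantage: r_i minus the mean of the other rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def mean_centered_adv(rewards):
    """Advantage with the full group mean (current sample included)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


def grpo_adv(rewards, eps=1e-8):
    """Mean-centered advantage divided by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary correctness rewards for a group of five completions.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
k = len(rewards)

loo = rloo_adv(rewards)
centered = mean_centered_adv(rewards)
normalized = grpo_adv(rewards)
```

For this group, `centered[i]` equals `(k - 1) / k * loo[i]` for every `i`, while `normalized` additionally rescales the whole group to roughly unit spread.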

Both methods have been criticized for their reliance on prompt-level normalization. The REINFORCE++ paper of January 2025 by Jian Hu and colleagues at OpenRLHF argues that "critic-free algorithms like GRPO and RLOO typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and is a theoretically biased estimator." REINFORCE++ proposes Global Advantage Normalization across the entire training batch as a corrective.^[10]

Empirical results

The Ahmadian et al. paper evaluates RLOO on two standard RLHF benchmarks: TL;DR summarization (Stiennon et al.) and Anthropic's Helpful and Harmless (HH) preference dataset. The base models are Llama-7B and Pythia-6.9B, with reward models trained on the corresponding preference data.^[3]

Reported win rates against PPO on the TL;DR task are approximately ten percentage points higher for RLOO under matched compute budgets, and on the HH dataset RLOO with the Pythia-6.9B base model wins against PPO by roughly 14.5 percentage points, with Llama-7B showing an even larger gap.^[3] Against DPO, RLOO is reported as comparable or superior across all tested configurations. The paper also includes RAFT (Reward-rAnked Fine-Tuning), an offline rejection-sampling baseline; RLOO with $k=2$ matches or exceeds the performance of RAFT with $k=4$, halving the sampling cost.^[3]

The TL;DR summarization benchmark consists of approximately 116,000 instructions paired with 93,000 human preference pairs, and the HH preference dataset contributes 112,000 training pairs.^[3] All comparisons are conducted with matched reward models, matched KL coefficients, and matched batch sizes, so the gaps reported by Ahmadian et al. are attributable to the algorithm rather than to hyperparameter choices. Win rates are reported as fractions of the time that a GPT-4-style judge prefers the RLOO output to the comparator, with both outputs blinded.^[3]

The TRL implementation, which used Pythia 1B and Pythia 6.9B as policies and EleutherAI's reward model checkpoints on TL;DR summarization, reproduced the qualitative ordering. The 1B RLOO checkpoint achieves a GPT-4-judged win rate of 40.1% against SFT, versus 21.3% for the SFT model against itself, and the 6.9B checkpoint reaches 78.7% with $k=2$ . These numbers are reported with vLLM-accelerated generation and Weights and Biases tracking.^[4]

Beyond the original paper, the TRL benchmark scripts also recorded the training-time GPU footprint. On a single A100 80GB at the 1B scale, PPO training required approximately 70 to 80 GB of memory while RLOO training fit comfortably under 40 GB, allowing for larger batch sizes or longer sequences on the same hardware.^[4] At the 6.9B scale, the memory delta translated to a difference between requiring eight GPUs (for PPO) versus four GPUs (for RLOO) at the same effective batch size in the TRL benchmark configurations.^[4]

The paper "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs" was published in the proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics in August 2024 (Volume 1: Long Papers, pages 12248 to 12267).^[11]

Adoption

Frameworks

RLOO is implemented in the HuggingFace TRL library as RLOOTrainer, contributed initially by Costa Huang and Arash Ahmadian in June 2024 and later refactored by Shirin Yamani to align with the GRPO trainer.^[4]^[5] The current trainer supports custom reward functions, vision-language models including Qwen2.5-VL and Gemma 3, and large-scale training with DeepSpeed ZeRO Stage 3 and vLLM generation servers.^[5] The TRL 0.22 release reorganized the configuration interface and renamed several legacy parameters: rloo_k became num_generations, cliprange became epsilon, kl_coef became beta, num_ppo_epochs became num_iterations, and response_length became max_completion_length.^[5] These renamings reflect the convergence between RLOO and GRPO in TRL's internals; both trainers now share a common backbone.^[5]
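For codebases that still carry pre-0.22 configuration dictionaries, the renames listed above can be applied mechanically. The helper below is a hypothetical convenience and not part of TRL; it only renames keys and makes no claim that defaults or semantics are otherwise unchanged between versions. The example values are invented.

```python
# Renames listed above for the TRL 0.22 configuration interface.
LEGACY_TO_CURRENT = {
    "rloo_k": "num_generations",
    "cliprange": "epsilon",
    "kl_coef": "beta",
    "num_ppo_epochs": "num_iterations",
    "response_length": "max_completion_length",
}


def migrate_rloo_kwargs(kwargs):
    """Return a copy of kwargs with legacy keys renamed (hypothetical helper)."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = LEGACY_TO_CURRENT.get(key, key)
        if new_key in migrated:
            raise ValueError(f"both legacy and current name given for {new_key!r}")
        migrated[new_key] = value
    return migrated


# Hypothetical legacy settings; keys not in the table pass through untouched.
old_config = {"rloo_k": 4, "kl_coef": 0.05, "learning_rate": 3e-6}
new_config = migrate_rloo_kwargs(old_config)
```

Raising on a duplicate avoids silently choosing between a legacy value and its renamed counterpart when both are present.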

OpenRLHF, the Ray-based RLHF framework developed by Jian Hu and collaborators, implements RLOO alongside PPO, GRPO, REINFORCE++, and DAPO. In OpenRLHF, RLOO can be selected through the --algo.advantage.estimator rloo flag and is integrated with the framework's per-token KL plus PPO-clip loss machinery.^[10]^[12] OpenRLHF separates the actor, reward, reference, and critic models across different GPU groups using Ray, allowing scalable training of models in the 70B-plus parameter range. With RLOO selected, the critic is simply absent from the deployment graph.^[12]

Other frameworks that include RLOO or close variants include veRL, Volcano Engine's RL toolkit, and several research codebases derived from OpenRLHF. The algorithm is also reproducible in a few hundred lines of plain PyTorch given access to a generation engine, a reward model, and the policy itself, which has contributed to its adoption in academic settings.^[12]

Production use

The Cohere blog and TRL documentation both attribute the algorithm to Cohere For AI, the Cohere research lab where the Ahmadian et al. paper was produced. Cohere has used REINFORCE-style optimization in its preference-tuning pipelines for the Command R family, though the company describes its post-training as "a mixture of supervised and preference-based fine-tuning" without specifying the exact algorithm version per model release.^[13]

Cross-method studies of RLHF for Llama 3 have reported that Meta used a combination of supervised fine-tuning, rejection sampling, PPO, and DPO rather than RLOO specifically, so attributions of RLOO to particular flagship LLMs should be made with care.^[14] In open-source reproduction projects, RLOO is most often seen as a memory-light alternative to PPO on Pythia and Qwen base models in the 0.5B to 8B range, and as a baseline against which GRPO is benchmarked on reasoning tasks.^[4]^[5]^[10]

The OLMo and Tülu projects from AI2, as well as various reasoning-model reproductions in the Open-R1 ecosystem, have used GRPO rather than RLOO as their primary online RL algorithm, citing GRPO's stability on reasoning rewards. Where RLOO appears in published preference-style training runs, it is typically for instruction-following or summarization tasks rather than for math or code reasoning. The choice reflects empirical observation rather than a sharp theoretical line; the two methods can be interchanged with small code changes and similar compute footprints.^[4]^[8]

Limitations

RLOO inherits several limitations from its REINFORCE foundations and from its sibling-sampling design.

The leave-one-out baseline only reduces variance when sibling completions share enough structure that their rewards are correlated. On prompts where the policy already produces mostly identical completions (low entropy), all $k$ samples receive similar rewards, the leave-one-out baseline closely approximates each sample's own reward, and the resulting advantages are near zero. The TRL trainer logs frac_reward_zero_std to surface this case, which indicates that for some prompts there is little diversity and no learning signal.^[5] Curriculum or sampling-temperature schedules can mitigate this but are not part of the base algorithm.
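A diagnostic in the spirit of that metric can be sketched as the fraction of prompts whose group of rewards has zero spread. The metric name comes from the TRL logs mentioned above, but the computation here is an assumption about what it measures, and the reward groups are invented.

```python
import statistics


def frac_zero_std(reward_groups, tol=1e-12):
    """Fraction of prompts whose k rewards are all (numerically) equal."""
    if not reward_groups:
        raise ValueError("need at least one reward group")
    zero = sum(1 for group in reward_groups if statistics.pstdev(group) <= tol)
    return zero / len(reward_groups)


# Hypothetical rewards for four prompts with k = 4 completions each.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # identical rewards: every advantage is zero
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],  # identical rewards: every advantage is zero
    [0.2, 0.9, 0.4, 0.1],
]
fraction = frac_zero_std(groups)
```

Half of the prompts in this batch would contribute no gradient, which is the situation a rising value of the logged metric would signal.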

Computing the policy gradient by summing per-token log probabilities into a single sequence-level log probability creates numerical issues in bf16. The TRL maintainers documented that RLOO clips between 20% and 40% of batch gradients due to bf16 precision loss compounding across the entire sequence (versus roughly 3% for PPO, which works at the token level). This is tracked in HuggingFace Transformers issue 31267 and is a known stability concern for large-context RLOO training.^[4]

RLOO is on-policy in its default form. Each generation batch is used for one gradient step, then discarded. PPO's clipped surrogate objective was originally designed to allow several gradient steps per generation batch, which improves sample efficiency in classical RL. Modern RLOO implementations bring this back via an optional clipped importance-sampling ratio applied at the sequence level, but they then re-introduce the clip-range hyperparameter that RLOO was meant to eliminate.^[5]

Because RLOO's advantage is computed only from rewards within a single prompt's group, the magnitude of the advantage depends on the local reward variance. Two prompts with different reward scales contribute differently to the gradient, even when both have similar within-group structure. The REINFORCE++ paper formalizes this objection and proposes global normalization across the batch as a remedy.^[10]
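The scale dependence is straightforward to demonstrate. In the sketch below, two hypothetical prompts have the same within-group structure but rewards that differ by a factor of ten; the leave-one-out advantages scale with the rewards, whereas a standard-deviation-normalized advantage, shown only for contrast, does not.

```python
import statistics


def rloo_adv(rewards):
    """Leave-one-out advantage: r_i minus the mean of the other rewards."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def std_normalized_adv(rewards):
    """Mean-centered advantage divided by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std for r in rewards]


# Same ranking and relative gaps, different reward scale (hypothetical).
prompt_a = [0.1, 0.3, 0.2, 0.4]
prompt_b = [10.0 * r for r in prompt_a]

adv_a = rloo_adv(prompt_a)
adv_b = rloo_adv(prompt_b)
norm_a = std_normalized_adv(prompt_a)
norm_b = std_normalized_adv(prompt_b)
```

Prompt B therefore pulls on the gradient ten times harder than prompt A under RLOO, even though the two groups rank their completions identically.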

Finally, RLOO requires drawing $k$ completions per prompt at training time. For $k=4$ , this is a 4x increase in generation FLOPs versus single-sample DPO. The wall-clock savings reported by Huang and Ahmadian come from the absence of the value-network forward and backward passes, not from cheaper generation.^[4]^[15] For very short prompts and very long completions, the generation overhead can dominate total training time, and PPO with a smaller batch may end up faster in some regimes.

References

[1] Ronald J. Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning", Machine Learning, Springer, 1992. link.springer.com/...BF00992696. Accessed 2026-05-26.
[2] Wouter Kool, Herke van Hoof, Max Welling, "Buy 4 REINFORCE Samples, Get a Baseline for Free!", ICLR 2019 Deep RL meets Structured Prediction Workshop, 2019-05. openreview.net/forum. Accessed 2026-05-26.
[3] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker, "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs", arXiv:2402.14740, 2024-02-22. arxiv.org/...2402.14740. Accessed 2026-05-26.
[4] Shengyi Costa Huang, Michael Noukhovitch, Arash Ahmadian, Kashif Rasul, Lewis Tunstall, "Putting RL back in RLHF", HuggingFace Blog, 2024-06-12. huggingface.co/...putting_rl_back_in_rlhf_with_rloo. Accessed 2026-05-26.
[5] HuggingFace, "RLOO Trainer", TRL Documentation, 2026. huggingface.co/...rloo_trainer. Accessed 2026-05-26.
[6] Long Ouyang et al., "Training language models to follow instructions with human feedback", arXiv:2203.02155, 2022-03-04. arxiv.org/...2203.02155. Accessed 2026-05-26.
[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv:1707.06347, 2017-07-20. arxiv.org/...1707.06347. Accessed 2026-05-26.
[8] Zhihong Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", arXiv:2402.03300, 2024-02-05. arxiv.org/...2402.03300. Accessed 2026-05-26.
[9] DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948, 2025-01-22. arxiv.org/...2501.12948. Accessed 2026-05-26.
[10] Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen, "REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization", arXiv:2501.03262, 2025-01-04. arxiv.org/...2501.03262. Accessed 2026-05-26.
[11] Arash Ahmadian et al., "Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs", Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248-12267, 2024-08. aclanthology.org/2024.acl-long.662. Accessed 2026-05-26.
[12] OpenRLHF Contributors, "OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework", GitHub Repository, 2026. github.com/...OpenRLHF. Accessed 2026-05-26.
[13] Cohere, "Cohere's Command R Model", Cohere Documentation, 2026. docs.cohere.com/...command-r. Accessed 2026-05-26.
[14] Sebastian Raschka, "How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?", Ahead of AI, 2024-04-20. magazine.sebastianraschka.com/...latest-open-llms. Accessed 2026-05-26.
[15] Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv:2305.18290, 2023-05-29. arxiv.org/...2305.18290. Accessed 2026-05-26.
