Rule-Based Rewards (RBR)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,435 words
Rule-Based Rewards (RBR) is a safety-alignment technique introduced by OpenAI in July 2024 that replaces large quantities of human-labeled safety preference data with an explicit collection of natural-language rules, expressed as binary propositions about model output, that are scored by an LLM grader and combined as an additional reward signal during reinforcement learning fine-tuning.[^1][^2] The technique was published as "Rule Based Rewards for Language Model Safety" by Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng, and was accepted at NeurIPS 2024.[^1][^3] RBR is designed to reduce two failure modes commonly observed in safety-tuned language models: under-refusal of clearly unsafe requests and over-refusal of benign requests that superficially resemble unsafe ones.[^1][^2] According to OpenAI, RBR has been part of its safety stack since GPT-4 and was used in the safety training of GPT-4o mini, with continued use in subsequent models.[^2][^4] Unlike methods that require thousands of preference comparisons, the published RBR pipeline used a few hundred human-labeled completions to calibrate the grader, after which weight optimization for the rule reward could be carried out in minutes on a laptop.[^1]
Modern instruction-tuned chat models are typically aligned in three stages: supervised fine-tuning on demonstrations, reward modeling on pairwise human preference comparisons, and reinforcement learning, often Proximal Policy Optimization (PPO), against the learned reward.[^5][^6] The conventional reward signal in Reinforcement Learning from Human Feedback (RLHF) mixes capability, helpfulness, and safety in a single scalar derived from human-labeled preferences.[^5][^6] For safety in particular, this approach has a number of practical drawbacks that motivated RBR.
OpenAI's authors observe that "collecting and maintaining human data for model safety is often costly and time-consuming, and the data can become outdated" as safety policies evolve in step with new model capabilities and deployment surfaces.[^1] They also note that, "without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental."[^1] A concrete example given in the paper is that some annotators preferred refusals that referenced United States suicide hotlines for self-harm prompts; such templated refusals serve users in only one region and are difficult to revise without an expensive relabeling pass.[^1]
A further empirical observation in the paper is that, when only a modest fraction of human preference data contains refusal counter-examples, the resulting safety policy generalizes too aggressively. The authors document a baseline in which only roughly one third of "comply" prompts in the human preference dataset contained refusal examples as the non-preferred response; the model trained on this data "was extremely cautious, over-refusing on every Comply prompt" in the evaluation set.[^1] This over-refusal failure mode, sometimes called exaggerated safety, has been independently documented in the academic literature, including on benchmarks such as XSTest, which measures refusal of benign prompts that superficially resemble dangerous ones.[^1][^7]
RBR is positioned as a response to all three of these problems: cost and slowness of relabeling, style drift caused by underspecified instructions, and the systematic over-refusal that results from imbalanced safety data.[^1][^2] By moving the policy specification out of opaque preference labels and into explicit propositions evaluated by an LLM judge, the authors argue that safety behavior becomes "more interpretable, debuggable, and quickly updatable" when policies change.[^1][^2]
The framing in the paper situates RBR within the longer arc of AI feedback for alignment. Earlier work, including Constitutional AI and the RLAIF family, had shown that language models could in principle substitute for human raters on preference judgments at scale, but those approaches still compiled the resulting feedback into a single learned preference model whose policy was opaque after training.[^1][^10] RBR's contribution is to keep the policy legible at training time, by feeding the LLM judge's raw proposition scores directly into the reinforcement learning loss rather than burying them inside a preference model.[^1] This design decision is what makes the policy editable: revising a rule changes the RL signal on the next training step, without requiring a new round of preference labeling or reward model training.[^1][^2]
OpenAI announced RBR publicly on 2024-07-24 with the research blog post "Improving Model Safety Behavior with Rule-Based Rewards" and the release of a companion code and data repository on GitHub.[^2][^8] At the time of the blog post, OpenAI stated that the technique had already been used in production safety training, including for GPT-4o mini, which OpenAI had released earlier the same month.[^2][^4] The blog post and the open-source release together constituted the first public disclosure of the technique's design and weight-fitting recipe.[^2][^8]
The arXiv preprint of "Rule Based Rewards for Language Model Safety" appeared on 2024-11-02 as arXiv:2411.01111.[^1] The paper had been accepted at NeurIPS 2024 and is listed on OpenReview as a poster contribution to the main conference.[^3] The author list of ten OpenAI researchers reflects a collaboration between OpenAI's safety, alignment, and post-training organizations: Tong Mu and Alec Helyar are listed as joint first authors, with John Schulman and Lilian Weng in senior author positions.[^1][^3]
The technique built directly on OpenAI's preceding work on instruction tuning and RLHF for InstructGPT and subsequent chat models, which had established the helpful-and-harmless preference framework that RBR partially replaces for safety topics.[^5][^6] RBR also drew on the prior literature on AI feedback, including Constitutional AI from Anthropic in 2022 and RLAIF work by Lee and collaborators in 2023, both of which had argued that AI judges could substitute for human raters in alignment pipelines.[^1][^10]
Subsequent OpenAI work has continued to cite and extend the approach. The "From Hard Refusals to Safe-Completions" paper, posted to arXiv as 2508.09224 on 2025-08-12, frames safe-completions as an evolution of RBR's "decomposes policy into clauses" insight, recasting the safety reward as a continuous evaluation of output safety and helpfulness rather than a refusal-class classification.[^9] That paper is associated with the safety training of GPT-4-class and post-GPT-4 models.[^9]
RBR augments rather than replaces a standard preference-based pipeline. A helpful-only reward model trained on human preference data continues to provide the helpfulness signal, while RBR supplies a separate safety-shaped reward computed from a weighted sum of rule scores.[^1][^2] The full training stack consists of four conceptual layers: a response-type policy, a proposition taxonomy, an LLM grader, and a weight-fitting step.[^1]
The published version of RBR organizes a behavior policy around three response types, Hard Refusal, Soft Refusal, and Comply, that describe what the model should do for a given user request.[^1][^2][^4]
The published simplified policy covered four content areas: erotic content, hate speech, criminal advice, and self-harm.[^1] For each content area, a per-prompt policy mapping declares which of the three response types is appropriate, and the rules then judge whether a candidate completion realizes that response type correctly.[^1]
The atomic unit of RBR is a proposition: a short natural-language statement about a single, easily verifiable feature of a candidate completion, designed so that a language model can answer yes or no with high accuracy.[^1][^2] Examples drawn from Table 1 of the paper include:[^1][^2]
| Proposition | Informal description |
|---|---|
| refuses | The response expresses inability to comply with the user request. |
| apology | The response contains a short apology. |
| judgmental | The response criticizes the user's request or values. |
| complies | The response provides the requested content. |
| helpful | The response is complete, accurate, and addresses the request. |
| illogical_continuation | The response is nonsensical or off-topic. |
| disclaimer | The response adds unsolicited warnings or caveats. |
| harmful | The response contains disallowed content. |
Each proposition is implemented as a few-shot binary classification prompt; the paper notes that LLMs "demonstrate higher accuracy when asked to classify specific, individual tasks, such as determining whether a text contains an apology, compared to general, multilayered tasks," which motivates the granular decomposition.[^1]
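As a rough illustration of what "a few-shot binary classification prompt" per proposition could look like, the sketch below renders one prompt for one proposition and parses a yes/no answer. The `Proposition` class, the prompt wording, the few-shot examples, and both function names are assumptions made for illustration; they are not the prompts or code from the paper or the released repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Proposition:
    """One binary statement about a completion (names follow the table above)."""
    name: str
    description: str


def build_grader_prompt(prop: Proposition,
                        completion: str,
                        examples: List[Tuple[str, bool]]) -> str:
    """Render a yes/no classification prompt for a single proposition.

    The layout is hypothetical: a statement of the proposition, a few
    labeled examples, then the completion to be judged.
    """
    lines = [
        f"Proposition ({prop.name}): {prop.description}",
        "Answer 'yes' or 'no'.",
        "",
    ]
    for text, label in examples:
        lines += [f"Response: {text}", f"Answer: {'yes' if label else 'no'}", ""]
    lines += [f"Response: {completion}", "Answer:"]
    return "\n".join(lines)


def parse_binary_answer(raw: str) -> bool:
    """Map the grader's text answer to a bool; reject anything else."""
    token = raw.strip().lower().rstrip(".")
    if token not in ("yes", "no"):
        raise ValueError(f"unexpected grader answer: {raw!r}")
    return token == "yes"


apology = Proposition("apology", "The response contains a short apology.")
prompt = build_grader_prompt(
    apology,
    "I'm sorry, but I can't help with that.",
    examples=[("Sorry, I can't assist.", True), ("Here is the recipe.", False)],
)
```

Keeping one prompt per proposition mirrors the decomposition described above: each call asks the grader a single narrow question, so a weak prompt for one proposition can be revised without touching the others.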
Propositions are then composed into rules that describe the ideal, less-good, unacceptable, and disallowed completions for each response type. For a Hard Refusal prompt, for example, an ideal completion is short, contains a refusal and an apology, is not judgmental, and contains no disallowed content; a less-good completion might add an unsolicited disclaimer; an unacceptable completion is logical but does not refuse; and a disallowed completion contains the harmful content the user asked for.[^1]
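The composition of propositions into ranked classes can be sketched as a small decision function. The version below covers only the Hard Refusal case as described in the paragraph above, and it is a simplification under stated assumptions: the class labels are hypothetical identifiers, the "short" length requirement is omitted because it does not appear in the proposition table, and illogical completions are not handled separately.

```python
from typing import Mapping


def hard_refusal_class(props: Mapping[str, bool]) -> str:
    """Assign a Hard Refusal completion to one of four ranked classes.

    `props` maps proposition names (from the table above) to the grader's
    binary judgments. Illustrative only; not the paper's exact rule set.
    """
    if props["harmful"]:
        # Contains the disallowed content the user asked for.
        return "disallowed"
    if not props["refuses"]:
        # Does not refuse, even though the prompt calls for a refusal.
        return "unacceptable"
    if props["apology"] and not props["judgmental"] and not props["disclaimer"]:
        # Refusal with an apology, no judgment, no unsolicited caveats.
        return "ideal"
    # Refuses, but with a stylistic flaw such as an added disclaimer.
    return "less_good"


example = {
    "refuses": True,
    "apology": True,
    "judgmental": False,
    "disclaimer": True,
    "harmful": False,
}
print(hard_refusal_class(example))  # less_good
```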
A single language model, referred to as the grader, evaluates each proposition independently and emits a binary judgment that is then mapped to a numeric score.[^1] In the published experiments, the grader is "a helpful-only SFT model" rather than the production chat model, because the authors found this choice gave higher precision in labeling disallowed content.[^1] The grader exposes the propositions as separate few-shot prompts so that errors on one rule do not propagate to other rules; the design "requires iteration and the choice of grader LLM is also highly impactful."[^1]
The grader is evaluated against a small human-labeled gold dataset that ships with the published artifact. In the paper the gold prompt-tuning set used 518 manually labeled completions: 268 Comply, 132 Hard-Refusal, and 118 Soft-Refusal.[^1] This dataset, along with example weight-fitting data and synthetic completions, was released under the MIT License on the official openai/safety-rbr-code-and-data GitHub repository.[^8]
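One simple way to check a grader against such a gold set is to compute accuracy separately for each proposition, so that a weak prompt shows up in isolation. The sketch below assumes a hypothetical record format of `(proposition_name, gold_label, grader_label)` tuples and uses toy data; it does not reflect the file layout of the released repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (proposition name, gold label, grader label)


def per_proposition_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of grader labels that match gold, per proposition."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for name, gold, predicted in records:
        totals[name] += 1
        hits[name] += int(gold == predicted)
    return {name: hits[name] / totals[name] for name in totals}


# Toy records, invented for illustration.
toy_records = [
    ("apology", True, True),
    ("apology", False, True),   # grader false positive
    ("refuses", True, True),
]
accuracy = per_proposition_accuracy(toy_records)
```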
The total reward used by PPO is a linear combination of the helpful-only reward model score and the proposition values:[^1]
R_total(p, c) = R_RM(p, c) + Σ_i w_i · φ_i(p, c)
where p is the prompt, c is the candidate completion, R_RM is the score from the helpful-only reward model, φ_i is the value of the i-th proposition (after applying any per-response-type masking), and w_i is its learned weight.[^1] In practice the propositions are not all active for every prompt; the response-type policy gates which propositions contribute to the safety reward for a given prompt.[^1]
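The formula above can be written out directly. In this minimal sketch, the set `active` stands in for the per-response-type masking, and all weights and scores are made-up numbers; the function and argument names are assumptions for illustration.

```python
from typing import Mapping, Set


def total_reward(rm_score: float,
                 phi: Mapping[str, bool],
                 weights: Mapping[str, float],
                 active: Set[str]) -> float:
    """R_total = R_RM + sum of w_i * phi_i over the active propositions."""
    safety = sum(
        weights[name] * float(value)
        for name, value in phi.items()
        if name in active  # masked-out propositions contribute nothing
    )
    return rm_score + safety


phi = {"refuses": True, "apology": True, "judgmental": False, "helpful": True}
weights = {"refuses": 1.0, "apology": 0.5, "judgmental": -2.0, "helpful": 3.0}
# Hypothetical mask for a Hard Refusal prompt: "helpful" is not scored.
active = {"refuses", "apology", "judgmental"}

r = total_reward(0.5, phi, weights, active)  # 0.5 + 1.0 + 0.5 = 2.0
```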
The weights w_i are not hand-tuned. The published procedure constructs a small synthetic ranking dataset in which, for each prompt, ideal completions are ranked above less-good completions, which are ranked above unacceptable and disallowed completions.[^1] A hinge loss is then minimized over pairwise comparisons:[^1]
Loss(w) = (1 / |D_RBR|) · Σ max(0, 1 + R_total(p, c_b, w) − R_total(p, c_a, w))
for every pair (c_a, c_b) where c_a ranks above c_b. The optimization uses Adam with learning rate 0.01, weight decay 0.05, and 1000 steps; the authors note the procedure "is extremely fast (can run on a standard laptop in a couple of minutes)" because the optimization is over a small number of scalar weights rather than over network parameters.[^1] The weight on the RBR component is itself tuned "to correctly enforce safety preferences but not impact the final reward score more than needed," giving an explicit lever for trading off safety pressure against helpfulness pressure.[^4]
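The fitting step is small enough to sketch in plain Python. The code below minimizes the pairwise hinge loss above with a hand-written Adam update using the stated learning rate, weight decay, and step count. Several details are assumptions: weight decay is applied as an L2 term added to the gradient, the Adam moment constants are the common defaults, and the three toy completions and their two features (`refuses`, `harmful`) are invented for illustration. It is not the code from the released repository.

```python
import math
from typing import List, Sequence, Tuple

Completion = Tuple[float, Sequence[float]]  # (reward-model score, proposition values)


def reward(c: Completion, w: Sequence[float]) -> float:
    """R_total for one completion: RM score plus weighted proposition values."""
    rm, phi = c
    return rm + sum(wi * xi for wi, xi in zip(w, phi))


def hinge_grad(pairs: List[Tuple[Completion, Completion]],
               w: Sequence[float]) -> List[float]:
    """Gradient of the mean hinge loss; each pair is (better, worse)."""
    grad = [0.0] * len(w)
    for better, worse in pairs:
        if 1.0 + reward(worse, w) - reward(better, w) > 0:
            for i in range(len(w)):
                grad[i] += worse[1][i] - better[1][i]
    return [g / len(pairs) for g in grad]


def fit_weights(pairs, dim, lr=0.01, weight_decay=0.05, steps=1000,
                b1=0.9, b2=0.999, eps=1e-8) -> List[float]:
    """Adam on the hinge loss, with weight decay folded into the gradient."""
    w, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    for t in range(1, steps + 1):
        grad = hinge_grad(pairs, w)
        for i in range(dim):
            g = grad[i] + weight_decay * w[i]
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy completions with features [refuses, harmful]; values are invented.
ideal = (0.0, [1.0, 0.0])
unacceptable = (0.0, [0.0, 0.0])
disallowed = (0.5, [0.0, 1.0])  # the RM alone would prefer this one
pairs = [(ideal, unacceptable), (ideal, disallowed)]

w = fit_weights(pairs, dim=2)
```

In this toy setup the helpful-only score alone ranks the disallowed completion highest; the fitted weights are what restore the intended ordering, which is the role the RBR term plays in the combined reward.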
Because the RBR reward is added to the helpful-only reward, the resulting signal can be used by the same PPO machinery that already powers production RLHF pipelines.[^1][^2] The helpful-only reward model is trained on human preference data restricted to a helpfulness comparison, removing the conflation between safety and helpfulness that complicates standard preference modeling for safety topics.[^1] Safety preferences are then expressed entirely through the RBR sum, which the authors argue is what gives the method its modularity: updating a safety policy means editing propositions or rules, not collecting and relabeling preference data.[^1][^2]
The paper compares an RBR-trained PPO model to a baseline trained with the same PPO recipe but using only the human-feedback safety preference data, as well as a helpful-only baseline that excludes safety data altogether.[^1] The reported headline metric is an F1 score that balances "not unsafe" (correct refusal of unsafe prompts) and "not overrefuse" (correct compliance with benign prompts):[^1][^2]
| Approach | F1 | Not-Unsafe | Not-Overrefuse |
|---|---|---|---|
| Human-feedback PPO baseline | 91.7 | 100 | 84.7 |
| RBR-PPO | 97.1 | 97.3 | 97.0 |
These numbers come from the paper's internal evaluation set, which comprises 588 Comply prompts, 565 Hard-Refusal prompts, and 185 Soft-Refusal prompts; human evaluators labeled the desired response type for validation.[^1] The headline F1 result indicates that RBR matches the human-feedback baseline within a few points on hard safety while closing most of the gap on over-refusal.[^1][^2] In the OpenAI blog post the result is summarized as "an F1 score of 97.1, compared to a human-feedback baseline of 91.7," achieved with "much higher safety-behavior accuracy through better balancing usefulness and safety."[^2][^4]
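The headline F1 values can be checked from the two per-axis rates, assuming the F1 here is the harmonic mean of Not-Unsafe and Not-Overrefuse. That assumption reproduces both rows of the table:

```python
def f1(not_unsafe: float, not_overrefuse: float) -> float:
    """Harmonic mean of the two rates (both given in percent)."""
    return 2 * not_unsafe * not_overrefuse / (not_unsafe + not_overrefuse)


baseline_f1 = round(f1(100.0, 84.7), 1)  # 91.7
rbr_f1 = round(f1(97.3, 97.0), 1)        # 97.1
```

Because a harmonic mean is pulled toward the lower of its two inputs, the baseline's perfect Not-Unsafe score cannot compensate for its weaker Not-Overrefuse score.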
Ablation tables in the paper report finer-grained internal metrics, including a Not-Overrefuse rate of 94.95% for RBR-PPO compared with 84.40% for the human-PPO baseline, and a Not-Unsafe rate of 93.95% for RBR-PPO compared with 86.98% for the helpful-only baseline.[^1] On external benchmarks the paper reports RBR-PPO maintaining roughly 99.5% non-over-refusal on XSTest while reaching about 96% safety on a WildChat unsafe-prompt subset.[^1][^7] The paper also reports that capability scores on standard benchmarks were essentially unchanged by the addition of RBR, consistent with the design intent that the safety reward should not act as a strong shaping signal on general capability behavior.[^1]
OpenAI states that RBR is "part of our safety stack" and was used in the safety training of GPT-4o mini, the smaller member of the GPT-4o model family released in July 2024.[^2][^4] OpenAI also indicates that the technique is used elsewhere in the GPT-4 family and connects the rules to the company's published OpenAI Model Spec, where the policy that RBRs enforce is documented in human-readable form.[^2][^4] In the company's framing, the Model Spec defines the desired behavior in plain prose for human readers, and RBR provides the training-time machinery that pushes the model toward conforming to that specification.[^4]
In subsequent work, OpenAI's "From Hard Refusals to Safe-Completions" paper (arXiv 2508.09224) describes a successor approach that shifts the safety objective from binary refuse-or-comply classification toward a continuous evaluation of whether the assistant's output is safe; that paper discusses RBR explicitly as a prior method that "decomposes policy into clauses to provide fine-grained feedback," and frames the safe-completions reward as an evolution of the same idea.[^9] The continuity from RBR to safe-completions is part of OpenAI's argument for retiring rigid refusal-class objectives in favor of policies that grade the actual content of the model's response.[^9] Safe-completions inherits the propositional, LLM-graded shape of the reward but reframes its target from "did the model refuse" to "is the model's output safe and useful," which the paper presents as a better fit for dual-use domains.[^9]
The technique's footprint also extends to the reasoning-focused OpenAI o1 family and other post-GPT-4 systems through the continued use of LLM-graded propositional safety rewards in the post-training pipeline; subsequent OpenAI safety publications discuss this lineage explicitly as the precursor to safe-completions.[^9]
The published artifact, comprising the gold dataset, the weight-fitting code, and example data, was released on GitHub under the MIT License together with two example notebooks demonstrating weight fitting and statistical analysis of the gold dataset.[^8] Although the released artifact is not a full reproduction of the production training pipeline, it documents the per-prompt response-type policy, the proposition schema, the grader prompts, and the weight-fitting code in enough detail that the technique can be applied to other models and policies.[^8]
The release was accompanied by independent technical coverage in VentureBeat, The Decoder, GIGAZINE, AppDeveloper Magazine, and other outlets, most of which emphasized the cost and over-refusal angles of the work.[^4][^12] Several commentators framed RBR as OpenAI's institutional answer to the broader question of whether safety policies should be specified in prose and graded by an LLM rather than learned from preference labels.[^4][^12]
Constitutional AI (CAI), introduced by Anthropic in 2022, also replaces a portion of safety-related human feedback with model-generated feedback grounded in a written policy ("the constitution").[^1][^10] In CAI, the model is prompted with high-level principles, asked to critique and revise its own outputs against those principles, and then trained on the revised data; a subsequent RLAIF stage trains a preference model on AI-generated comparisons.[^1][^10]
The RBR paper explicitly contrasts the two approaches. Earlier AI-feedback methods, including CAI, use "general guidelines like 'choose the response that is less harmful,'" which "leaves the AI model a large amount of discretion."[^1] RBR is positioned as the opposite design choice: rather than asking the model to interpret a broad constitutional principle, RBR enumerates "much more detailed policies regarding what prompts should be refused, and with what style," and bakes those policies into independently graded propositions whose scores enter the RL objective directly rather than via a learned preference model.[^1] In short, CAI compiles its policy into a learned preference model trained on AI-generated preferences, while RBR keeps the rules legible and uses them online during PPO.[^1][^10]
The two approaches share a common motivation, which is to reduce the amount of human safety labeling required and to make the safety policy easier to revise, but they differ in where the policy lives, how the model is grounded, and how the resulting signal is consumed.[^1][^10] CAI was designed for the broad "harmlessness" objective and exposes a small constitution of around a dozen principles; RBR was designed to enforce a narrower but more specific set of behaviors around refusal and is operationalized through dozens of binary propositions.[^1][^10] A practical consequence is that RBR's safety signal is debuggable at the proposition level: if the model adds unwanted disclaimers, the corresponding proposition's grading prompt or weight can be edited in isolation, whereas a CAI-style preference model would have to be retrained on revised comparisons.[^1][^2][^10]
RLAIF is a broader family of methods in which AI feedback substitutes for human feedback during preference modeling. The RBR paper cites RLAIF (Lee et al., 2023) as an antecedent that addresses "the cost and time of collecting human data" through AI feedback in non-safety settings.[^1] RBR can be viewed as a particularly fine-grained, safety-focused instance of RLAIF, in which the AI feedback is decomposed into many small, independent classifiers rather than presented as a single preference judgment.[^1]
Self-Rewarding Language Models, proposed by Yuan et al. in 2024, take a different angle: the same language model both generates responses and acts as its own judge, with the goal of self-improvement during iterative DPO training.[^11] The RBR paper does not cite Self-Rewarding LMs, but the two approaches occupy related niches in the AI-feedback design space. Where Self-Rewarding LMs aim to scale capability without external supervision, RBR aims to enforce a precisely specified safety policy with a separate helpful-only SFT model used as a judge.[^1][^11]
RBR is not an alternative to reward modeling so much as a supplement to it. The published pipeline retains a helpful-only reward model trained on human preference data, and the RBR sum is added to that reward model's output before being optimized by PPO.[^1] In principle the same RBR reward could be paired with Direct Preference Optimization (DPO) or with other policy optimization algorithms, although the published recipe targets PPO.[^1]
A practical implication of the additive structure is that RBR weights can be tuned without retraining the helpful reward model: the RBR fit is "extremely fast" and can be redone whenever the policy changes, while the helpfulness reward model is left intact.[^1] This separation is a recurring theme in the paper's framing of RBR's advantage over single-reward pipelines.[^1][^2]
RBR is significant for several reasons. First, it offers a concrete recipe for reducing over-refusal, a frequently cited problem in deployed chat assistants and in academic benchmarks such as XSTest, at only a small cost in hard safety.[^1][^7] In the headline evaluation, the Not-Overrefuse rate rises from 84.7 to 97.0, cutting over-refusal from roughly 15% to 3% of benign prompts, while the Not-Unsafe rate falls from 100 to 97.3.[^1][^2]
Second, it operationalizes the idea that safety policy should be legible to humans rather than encoded only in preference labels. Because RBR rules are written as propositions and graded by a model, the policy can be inspected, debated, and revised without collecting new preference data.[^1][^2] This connects to OpenAI's broader push, through the OpenAI Model Spec, to publish human-readable behavior specifications that can then be enforced by training-time machinery such as RBR.[^4]
Third, RBR demonstrates that, at least for the narrow case of refusal-class behavior, the marginal value of additional human preference data can be small once a sufficiently specific written policy is available. The published gold set used to calibrate the grader contained only 518 completions, and the weight optimization itself ran on a laptop.[^1] This data efficiency is one of the recurring claims in OpenAI's framing of the technique.[^2][^4]
Finally, RBR has had downstream methodological influence. The successor safe-completions work cites RBR as the technique it most directly evolves from, and it inherits the central premise that the safety reward should evaluate properties of the model's output rather than the user's intent.[^9]
The paper and subsequent commentary identify several limitations of the technique.[^1][^2][^4][^12]
Subjectivity. RBR works well when behavior can be specified by simple, easily verifiable propositions. The OpenAI blog post explicitly notes that the technique "may not be applicable to highly subjective tasks" such as writing a high-quality essay, where neither the rules nor the grader's judgments would be as crisp as for refusal-class tasks.[^4][^12] OpenAI suggests combining RBR with human feedback in such cases.[^4]
Policy specification cost. Although RBR reduces human labeling cost, it raises policy authorship cost: someone must write, debug, and maintain the rules and their grader prompts. The paper notes that the design of grader prompts "requires iteration" and that the choice of grader LLM is itself "highly impactful."[^1]
Grader fragility. RBR pushes much of the safety burden onto the grader's accuracy. The paper measures the grader against a 518-example gold set, but a grader that systematically misjudges some proposition will systematically bias the RL update. Independent commentary has flagged that RBR shifts safety decisions toward AI graders, which can reduce direct human oversight of individual training judgments.[^12]
Coverage and policy gaps. The published rules cover four content areas: erotic content, hate speech, criminal advice, and self-harm.[^1] Coverage of more complex domains, including dual-use scientific and cybersecurity uplift, is not part of the simplified policy and was a motivating concern for the safe-completions follow-up.[^9]
Hard-safety regression in some configurations. In the headline F1 table the human-feedback baseline reaches 100% on the not-unsafe axis, while RBR-PPO reports 97.3%; the F1 win for RBR therefore comes from the over-refusal axis, and per-axis comparisons should not be conflated with a uniform safety improvement.[^1] The paper frames this as an explicit Pareto-style tradeoff that the RBR weight can shift.[^1][^4]
Dependence on the underlying preference data. RBR replaces only the safety portion of the reward; the helpful-only reward model is still trained on human preference data and inherits whatever biases that data contains.[^1] If the underlying helpfulness data is itself skewed toward particular styles, regions, or conventions, the resulting model's compliance behavior will still reflect those biases even when its refusal behavior is governed by RBR rules.[^1]
Reduced human oversight at the per-example level. Because the safety signal is generated by a model rather than by a human label, individual training updates are no longer directly inspected by a human reviewer. Independent commentary has framed this as a shift in where human oversight enters the pipeline: from per-example labeling to policy authorship and grader calibration.[^4][^12] The paper's response is to ground the grader against a small human-labeled gold set and to publish that gold set so its content can be audited.[^1][^8]
| Method | Source of safety signal | Where the policy lives | Year |
|---|---|---|---|
| RLHF | Human preference labels | Implicit in labels | 2022[^5][^6] |
| Constitutional AI | Model self-critique + AI-generated preferences | Natural-language constitution, compiled into preference model | 2022[^10] |
| RLAIF | LLM preference judgments | Implicit in judge prompt | 2023[^1] |
| Rule-Based Rewards | LLM-graded propositions added to PPO reward | Explicit propositions and per-prompt response type | 2024[^1][^2] |
| Safe-completions | LLM-graded output safety + helpfulness | Continuous output evaluation rather than refusal class | 2025[^9] |