Self-Rewarding Language Models
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,297 words
Self-Rewarding Language Models (SRLM) is an iterative alignment method in which a single large language model alternately plays the role of policy (generating candidate responses to user prompts) and reward model (scoring those responses through LLM-as-a-Judge-style prompting), then trains itself on the resulting preference pairs with Direct Preference Optimization (DPO) before repeating the cycle.[^1] The procedure was introduced by Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston of Meta AI and New York University in the paper "Self-Rewarding Language Models", posted to arXiv on 18 January 2024 (arXiv:2401.10020) and later presented at ICML 2024.[^1][^2] Starting from Llama 2 70B and a seed of approximately 3,200 instruction examples plus a small evaluation-tuning set, three rounds of self-rewarding raised the model's AlpacaEval 2.0 win rate against GPT-4-Turbo from 9.94% to 20.44%, surpassing reference systems such as Claude 2, Gemini Pro, and GPT-4 0613 in the same benchmark snapshot.[^1][^3] The contribution is widely cited as the first publicly demonstrated case where the policy and the reward model are not only the same network but co-improve through training, opening a research direction extended by Meta-Rewarding Language Models, Direct Nash Optimization, CREAM, SCIR, and related work.[^4][^5][^6][^7]
Modern aligned chat assistants are typically produced by combining supervised fine-tuning on instruction data with a preference-learning stage. The dominant pipeline, Reinforcement Learning from Human Feedback (RLHF), first collects pairwise human comparisons of model outputs, trains a separate Bradley-Terry-style reward model on those comparisons, and finally optimises the policy against the frozen reward model with proximal policy optimisation or a related algorithm.[^8] Direct Preference Optimization, introduced by Rafailov et al. at NeurIPS 2023, removes the explicit reward model and on-policy sampling step by re-deriving the RLHF objective as a closed-form classification loss over preference pairs, but the underlying preference data and the implicit reward model induced from it are still typically fixed.[^9]
Yuan and colleagues observed two consequences of this design that they treat as the main motivation for self-rewarding training. First, because the reward model is trained once on a finite human-annotated dataset and then frozen, alignment quality is capped at "human level": once the policy outperforms the data the reward model has been trained on, the reward signal becomes uninformative and may actively mislead optimisation.[^1] They write that "to achieve superhuman agents, future models require superhuman feedback" and argue that frozen reward models cannot supply it.[^1] Second, decoupling the policy from the reward model wastes information: the same network that generates a response often contains relevant evaluative knowledge (criteria, style, factual checks) that a frozen reward model trained from scratch must rediscover.[^1] The self-rewarding setup makes the reward model co-evolve with the policy by tying them to the same parameters, so each new iteration that improves response quality also (in principle) improves judgement quality.
The approach builds on a cluster of antecedents. Constitutional AI and RLAIF, introduced by Bai et al. at Anthropic, replaced human harmlessness labels with AI-generated critiques and preferences guided by a written constitution, but kept a separate (AI-derived) preference model for the RL stage.[^10] The LLM-as-a-judge methodology, formalised by Zheng et al. for MT-Bench and Chatbot Arena, showed that strong LLMs agree with human preferences on chatbot evaluation at over 80%, comparable to inter-annotator agreement, and identified position, verbosity, and self-enhancement biases as the main failure modes.[^11] Iterative DPO and rejection-sampling-style self-distillation pipelines such as Mistral's reinforced rejection sampling and Llama 2's chat training had already established the engineering pattern of "generate, score, retrain"; the missing piece was using the same model for both the generation and the scoring.[^1][^11]
A self-rewarding training run is parameterised by a base model $M_0$ (in the original paper, Llama 2 70B), a small seed of instruction-following examples (Instruction Fine-Tuning data, IFT), an equally small seed of evaluation examples (Evaluation Fine-Tuning data, EFT), and a fixed scoring prompt template.[^1] Training proceeds in successive iterations $M_1, M_2, \dots, M_T$:
The authors emphasise that step 5 is a single DPO run on the freshly generated preference pairs, not an online policy-gradient update, so the procedure inherits DPO's training stability while still being on-policy in the sense that the preference pairs come from the current model.[^1][^9]
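The generate, judge, and retrain cycle can be outlined in a few lines. This is a minimal sketch rather than the authors' code: `generate_prompts`, `build_pairs`, and `dpo_train` are hypothetical callables standing in for prompt creation, self-judged pair construction, and a single DPO run. `dpo_train` receives the current model but is free to initialise from whichever checkpoint the recipe calls for.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
# A preference pair: (prompt, chosen response, rejected response).
Pair = Tuple[str, str, str]


def self_rewarding_loop(
    m1: Model,
    generate_prompts: Callable[[Model], List[str]],
    build_pairs: Callable[[Model, List[str]], List[Pair]],
    dpo_train: Callable[[Model, List[Pair]], Model],
    num_iterations: int = 2,
) -> List[Model]:
    """Return [M_1, M_2, ...]; each new model is trained on pairs that the
    current model both generated and judged."""
    models = [m1]
    for _ in range(num_iterations):
        current = models[-1]
        prompts = generate_prompts(current)
        # The current model acts as policy and judge inside build_pairs.
        pairs = build_pairs(current, prompts)
        # One offline DPO run per iteration, not an online policy-gradient update.
        models.append(dpo_train(current, pairs))
    return models
```

With toy stand-ins (integers as "models"), `self_rewarding_loop(1, lambda m: ["p"], lambda m, ps: [(p, "a", "b") for p in ps], lambda m, pairs: m + len(pairs))` returns a list of three models, mirroring the $M_1$, $M_2$, $M_3$ sequence.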
The seed IFT consists of about 3,200 first-turn examples from the Open Assistant dataset, restricted to English and to conversations whose first reply was rated highest in the original ranking.[^1] The seed EFT contains 1,630 training and 541 evaluation examples constructed by reformatting Open Assistant ranked responses into "instruction, response, score" tuples, where the score is produced by prompting GPT-4 with a fixed 0-to-5 additive rubric for relevance, coverage, usefulness, clarity, and expertise.[^1]
The scoring template, reproduced from Figure 2 of the paper, awards points cumulatively:
The judge is asked to produce a brief justification and a single integer in the range 0 to 5 on a designated final line, which can be parsed deterministically. Because the rubric is additive rather than holistic, scores tend to cluster in the middle of the range, which reduces variance from minor wording differences.[^1]
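Deterministic parsing of the final line might look like the sketch below. The exact wording of the final line is an assumption here (a line of the form `Score: 4`), and `parse_score` is a hypothetical helper, not part of any published code.

```python
import re
from typing import Optional

# Assumed final-line format, e.g. "Score: 4"; the real template may differ.
_SCORE_LINE = re.compile(r"^\s*Score:\s*([0-5])\s*$")


def parse_score(judge_output: str) -> Optional[int]:
    """Return the 0-5 integer from the last non-empty line, or None if absent."""
    lines = [line for line in judge_output.splitlines() if line.strip()]
    if not lines:
        return None
    match = _SCORE_LINE.match(lines[-1])
    return int(match.group(1)) if match else None
```

Returning `None` instead of raising lets the caller drop malformed judgements before averaging, which is one reasonable way to handle a judge that ignores the output format.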
For each new prompt the system samples $N = 4$ candidate responses with temperature $T = 0.7$ and nucleus $p = 0.9$, then scores each candidate $K = 3$ times under the rubric and averages.[^1] If at least one candidate strictly outscores another, the highest-scoring response becomes the "chosen" example and the lowest-scoring becomes the "rejected" example, forming a single preference pair. Prompts where all candidates receive identical scores are discarded, which biases the resulting dataset towards harder cases on which the model has discriminative signal.
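The pair-selection rule described above can be sketched as follows. The function name and input layout are illustrative assumptions: `judge_scores[i]` is taken to hold the $K$ repeated judge scores for `candidates[i]`.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def build_preference_pair(
    candidates: Sequence[str],
    judge_scores: Sequence[Sequence[float]],
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when all mean scores tie."""
    if len(candidates) != len(judge_scores) or not candidates:
        raise ValueError("need one score list per candidate")
    # Average the K repeated judgements for each candidate.
    means = [mean(scores) for scores in judge_scores]
    best = max(range(len(means)), key=means.__getitem__)
    worst = min(range(len(means)), key=means.__getitem__)
    if means[best] == means[worst]:
        return None  # no discriminative signal: discard this prompt
    return candidates[best], candidates[worst]
```

For four candidates with score lists `[3, 4, 3]`, `[5, 5, 4]`, `[1, 2, 1]`, and `[3, 3, 3]`, the second candidate is chosen and the third is rejected; the two middle candidates are unused.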
The size of the AIFT set grows across iterations because more discriminable pairs survive: 3,964 pairs were used to train $M_2$ from $M_1$, and 6,942 pairs to train $M_3$ from $M_2$.[^1] Prompts for sampling are drawn from a held-out portion of Open Assistant and supplemented by automatically generated instructions following the Self-Instruct paradigm.
Each new $M_{t+1}$ is initialised from its predecessor $M_t$ (so $M_2$ starts from the EFT-augmented SFT model $M_1$, and $M_3$ starts from $M_2$), then fine-tuned with DPO on the freshly produced AIFT preference pairs. Reported hyperparameters include a learning rate of 1e-6 linearly decayed to 1e-7, a batch size of 16, and a DPO temperature parameter $\beta = 0.1$.[^1] Because each iteration continues from the previous one, gains accumulate across iterations, but so can any noise in the self-generated labels.
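For reference, the standard per-pair DPO loss from Rafailov et al. is $-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{\text{ref}}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{\text{ref}}(y_l|x))]\big)$. The sketch below evaluates it from summed token log-probabilities. The function and argument names are illustrative, and in practice the loss would be computed on batched tensors by a training library.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair from sequence log-probabilities."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed stably as softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference model the loss is $\log 2$; it falls as the policy raises the chosen response's log-probability relative to the rejected one.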
A distinctive aspect of the recipe is that the very first SFT stage combines two objectives in a single training pass. The IFT portion of the seed teaches the model to follow instructions in the standard way; the EFT portion teaches it to apply the evaluation rubric to (instruction, response) pairs and emit a numeric score with reasoning. The paper shows that this Evaluation Fine-Tuning is the key ingredient that allows the model to act as a reliable judge for itself: a model trained on IFT alone reaches only 65.1% pairwise accuracy at predicting human preferences on the held-out EFT set, whereas adding EFT raises this to 78.7% even before any self-rewarding loop is run.[^1] After three iterations the figure rises to 81.7%, supporting the paper's central claim that the model's judge improves in lock-step with its instruction-following ability.[^1]
The paper's headline metric is the AlpacaEval 2.0 win rate against GPT-4-Turbo (gpt-4-1106-preview) as judged by GPT-4-Turbo itself, computed on 805 instructions and reported under the original (non-length-controlled) protocol used at the time of submission.[^1] Length-controlled AlpacaEval, which fits a logistic regression to remove the well-known length bias in autoannotators, was published by Dubois et al. only after the Self-Rewarding paper.[^12]
| Model | AlpacaEval 2.0 win rate vs GPT-4-Turbo |
|---|---|
| SFT baseline (IFT only) | 9.94% |
| $M_1$ (IFT + EFT) | 9.94% |
| $M_2$ (one self-rewarding iteration) | 15.38% |
| $M_3$ (two self-rewarding iterations) | 20.44% |
[^1]
For comparison, the same paper's leaderboard snapshot reported Claude 2 at 17.19%, Gemini Pro at 16.85%, and GPT-4 0613 at 15.76%, all below the third iteration of Self-Rewarding Llama 2 70B.[^1] Head-to-head GPT-4 evaluations between iterations showed $M_3$ beating $M_2$ in 47.7% of cases versus losses in 12.5% (the rest were ties), and beating the SFT baseline in 62.5% of cases versus 9.8%.[^3]
A second axis of evaluation tracks how well the model predicts the same Open Assistant pairwise ranking the EFT data was derived from. The reported pairwise accuracy rises monotonically:
| Model | Pairwise reward accuracy |
|---|---|
| SFT baseline (IFT only) | 65.1% |
| $M_1$ (IFT + EFT) | 78.7% |
| $M_2$ | 80.4% |
| $M_3$ | 81.7% |
[^1]
The pattern is the paper's key empirical claim: training on self-generated preference pairs not only improves response quality but also tightens the model's agreement with human evaluators, contradicting the intuition that pure self-distillation should make the judge collapse onto its own biases.
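Pairwise accuracy of this kind can be computed as in the sketch below. The input layout and the treatment of ties as misses are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the model scores the human-preferred response higher.

    Each item is (score_of_human_preferred, score_of_other). Ties count as
    misses here; that tie-handling choice is an assumption.
    """
    if not scored_pairs:
        raise ValueError("scored_pairs must not be empty")
    correct = sum(1 for preferred, other in scored_pairs if preferred > other)
    return correct / len(scored_pairs)
```

For example, the pairs `[(4, 2), (3, 3), (1, 5), (5, 0)]` contain two correct orderings out of four, giving an accuracy of 0.5.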
MT-Bench, an 80-question, two-turn benchmark (160 turns in total) scored by GPT-4 on a 1-to-10 scale, shows a smaller but consistent climb across self-rewarding iterations: 6.85 for the SFT baseline (without EFT), 6.78 for $M_1$, 7.01 for $M_2$, and 7.25 for $M_3$.[^1] The small drop from baseline to $M_1$ reflects the dilution of IFT data by EFT scoring examples and is recovered by the first self-rewarding step.
A complementary set of zero-shot tasks (ARC, HellaSwag, MMLU, OpenBookQA, GSM8K, TriviaQA in the published evaluation) was used to test for capability regression. The reported numbers are essentially flat across $M_0$ through $M_3$, suggesting that self-rewarding modifies the model's chat behaviour without degrading its underlying knowledge or reasoning at the scale tested.[^1] The authors interpret this as evidence that self-rewarding adjusts surface style and instruction-following while leaving the base distribution largely intact, though they do not claim it improves these benchmarks either.
A persistent caveat is that average response length grew sharply across iterations: roughly 1,092 characters for $M_1$, 1,552 for $M_2$, and 2,552 for $M_3$.[^1] Because GPT-4 evaluators favour longer responses in AlpacaEval-style protocols, an unknown fraction of the headline win-rate gains may be attributable to length rather than substance. The authors acknowledge the issue and later researchers explicitly target it (see Meta-Rewarding and length-controlled AlpacaEval, below).[^4][^12]
Wu et al. (Meta FAIR, UC Berkeley, NYU) introduced Meta-Rewarding Language Models in July 2024 (arXiv:2407.19594).[^4] The procedure adds a third role to the actor/judge dyad: a meta-judge that compares pairs of judgements produced by the same model and assigns preference labels to them, which are then used to fine-tune the model's judging behaviour in addition to its acting behaviour.[^4] The meta-judge stage explicitly addresses the rapid saturation of pure self-rewarding by training the model to produce better evaluations, not only better responses.
The paper also introduces a length-control mechanism. A quality tier parameter $\rho \in [0,1]$ selects the top fraction of candidates by score, then "opt[s] for the shortest response within this top tier" as the chosen example, with the longest low-scoring candidate as the rejected example, counteracting the judge's length bias.[^4] Judge-level filtering additionally drops preference pairs whose chosen judgements exceed a length threshold (varying from about 1,100 to 1,000 characters across iterations).[^4]
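One way to read the selection rule is sketched below. Two interpretive assumptions are made and should be checked against the paper: the top tier is treated as the score band from $s_{\max} - \rho\,(s_{\max} - s_{\min})$ up to $s_{\max}$, and the rejected example is the longest response scoring strictly below the chosen one. The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def length_controlled_pair(
    responses: Sequence[str],
    scores: Sequence[float],
    rho: float,
) -> Optional[Tuple[str, str]]:
    """Pick (chosen, rejected), preferring short chosen responses."""
    if len(responses) != len(scores) or not responses:
        raise ValueError("need one score per response")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    s_max, s_min = max(scores), min(scores)
    # Assumed tier definition: scores within rho of the way from s_max to s_min.
    threshold = s_max - rho * (s_max - s_min)
    tier = [i for i, s in enumerate(scores) if s >= threshold]
    chosen = min(tier, key=lambda i: len(responses[i]))  # shortest in the top tier
    lower = [i for i, s in enumerate(scores) if s < scores[chosen]]
    if not lower:
        return None  # nothing scores below the chosen response
    rejected = max(lower, key=lambda i: len(responses[i]))  # longest lower-scoring
    return responses[chosen], responses[rejected]
```

With $\rho = 0$ the tier contains only the top-scoring responses, so the rule reduces to ordinary best-versus-worse selection; larger $\rho$ trades some judged quality for brevity.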
Applied to Llama-3-8B-Instruct (with a seed model whose AlpacaEval 2 length-controlled win rate is 22.92%), four iterations of meta-rewarding raised the length-controlled win rate to 39.44% and the raw win rate to 39.45%, a gain of 16.52 percentage points in length-controlled win rate over the seed.[^4] On Arena-Hard the same model rose from 20.6% to 29.1%. MT-Bench Turn-1 score increased from 8.319 to 8.738 across the same four iterations, with less than 0.1 reduction in Turn-2 score. The paper reports that Meta-Rewarding outperforms a Self-Rewarding baseline with the same length-control mechanism by about 3.95 LC points at iteration 4, isolating the contribution of the meta-judge.[^4]
Microsoft Research's Direct Nash Optimization (DNO), introduced by Rosset et al. in April 2024 (arXiv:2404.03715), reformulates self-improvement as approaching a Nash equilibrium of a general-preference game rather than maximising a scalar reward.[^5] DNO replaces the Bradley-Terry assumption underlying DPO with a more general pairwise preference function, then iteratively trains the policy against preferences elicited by the current model itself. The authors report that a 7B Orca-2.5 model aligned by DNO reached a 33% AlpacaEval 2.0 win rate against GPT-4-Turbo, surpassing a 70B Self-Rewarding model on the same benchmark and matching Mistral Large.[^5] DNO is often grouped with Self-Rewarding LMs as evidence that the policy/reward duality can be exploited beyond the strict DPO framework.
Two ICLR-class follow-ups attack the reward-noise problem directly. CREAM ("Consistency Regularized Self-Rewarding Language Models", arXiv:2410.12735) measures the consistency of self-assigned rankings across training iterations and adds a regularisation term that penalises preference pairs on which earlier and later versions of the model disagree, on the grounds that those pairs are likely noise.[^6] SCIR ("Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models", arXiv:2502.08922) generalises the idea to enforce agreement among multiple internal reward-extraction strategies within a single model.[^7] Both report that reward consistency rises from roughly 0.4-0.5 in vanilla Self-Rewarding to above 0.9 with their regularisation, with corresponding gains on AlpacaEval-style metrics; both also explicitly attribute the rapid saturation of vanilla Self-Rewarding to noise in the self-generated preference labels.[^6][^7]
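Agreement between two versions of a judge can be quantified with a rank-correlation statistic. The sketch below uses Kendall's tau-a as an illustrative choice; whether it matches the exact consistency metric used in either paper is not established here, and tau ranges over $[-1, 1]$, so its values are not directly comparable to the figures quoted above.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Kendall tau-a between two score lists over the same responses."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1  # both judges order the pair the same way
        elif product < 0:
            discordant += 1  # the judges disagree on the order
    total_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / total_pairs
```

Here `scores_a` and `scores_b` would be the scores that an earlier and a later model iteration assign to the same candidate responses; pairs on which the two disagree are the ones a consistency regulariser would treat as likely noise.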
Earlier work on SPIN (Self-Play Fine-Tuning, Chen et al. 2024) shares the iterative DPO scaffold but replaces the self-judge with a fixed objective: distinguish the previous-iteration policy's outputs from a reference distribution.[^13] SPIN is therefore complementary, since it does not require any evaluator at all but does not improve the model's judgement.
Constitutional AI and the broader RLAIF family use AI-generated preferences but retain a separate, frozen preference model trained on AI-labelled comparisons; they predate Self-Rewarding LMs and inspired the LLM-as-a-judge step.[^10] Several 2024-2025 systems combine the two ideas, replacing the static constitution with an evolving self-judge prompt while retaining safety-oriented critique prompts as guard rails.
Self-rewarding has been positioned as a route to extend alignment training beyond what is feasible with hand-collected preference data. Concrete uses fall into three broad categories:
self-rewarding-lm-pytorch) makes the procedure available with a Hugging Face Transformers / TRL backend, and several open chat models published between mid-2024 and 2025 report using a self-rewarding stage on top of standard DPO, although exact recipes vary.[^14]
The paper's results have been replicated qualitatively in the smaller-model regime by the Meta-Rewarding study on Llama-3-8B-Instruct and by community fine-tunes on Mistral 7B-class base models, although smaller models tend to plateau sooner without explicit consistency or length controls.[^4][^6]
The most repeated critique is that Self-Rewarding plateaus quickly. The original paper itself reports diminishing returns by iteration 3, and follow-up work by Wu et al. (Meta-Rewarding) and CREAM frames this as the expected outcome of training a judge only via its actor role: each iteration produces preference pairs that the judge has already learned to discriminate, so the marginal informational content of the AIFT set falls.[^4][^6] Meta-Rewarding's explicit judge training and CREAM's consistency regularisation each extend the productive iteration count, but neither claims indefinite improvement.
Because the judge and the policy share parameters, biases in the judge propagate directly into the policy. The most visible such bias is verbosity: GPT-4-style evaluators (including the model itself) favour longer responses on AlpacaEval-style protocols, and self-rewarding monotonically inflates response length, which more than doubled from $M_1$ to $M_3$ in the original experiments.[^1][^11] This is a textbook example of reward hacking, in which the policy exploits a defect in the reward function (here, length preference) without improving on the underlying objective.[^1] Subsequent benchmarks (length-controlled AlpacaEval, Arena-Hard's length-aware judging) and length-control mechanisms inside Meta-Rewarding were designed in part to discount the inflation.[^4][^12]
A related concern is the self-enhancement bias identified by Zheng et al.: LLM judges systematically prefer answers in their own style, which means a self-rewarding model effectively rewards its own idiosyncrasies on every iteration.[^11] To the extent that those idiosyncrasies do not correlate with human preferences, the procedure can drift away from human judgement even while raising its self-reported scores. The original paper attempts to bound this risk by also reporting pairwise accuracy against the original Open Assistant rankings, which remains a fixed external anchor.[^1]
More generally, Self-Rewarding LMs make explicit a tension long discussed in AI alignment: optimising hard against any imperfect proxy reward eventually causes the proxy and the true objective to diverge (the proxy goes up while the true objective stagnates or falls), a pattern captured by Goodhart's law. Self-rewarding makes the proxy and the optimiser the same artefact, which amplifies both the upside (faster adaptation) and the risk (no external check). The literature on scalable oversight frequently uses Self-Rewarding as a case study in this trade-off, sometimes pairing it with AI safety via debate or recursive reward modelling as complementary mechanisms for keeping the proxy honest.[^15]
The recipe is sensitive to the quality and breadth of seed data. The reported gains use 3,200 IFT and roughly 1,600 EFT examples derived from Open Assistant; the paper does not show whether equivalent gains are achievable from scratch on a model without any prior chat tuning, and replication attempts on bases with weaker initial instruction-following often saturate at lower win rates.[^1][^6] Because the seed is small relative to the base model, the procedure is also brittle to distribution mismatch between the seed evaluator's notion of quality and the held-out prompt distribution.
Because AlpacaEval 2.0, MT-Bench, and the head-to-head comparisons are all scored by GPT-4-family judges, the headline numbers do not establish that the gains hold under a different judge family. Subsequent work has emphasised the importance of cross-judge evaluation, replacing GPT-4 with stronger or differently aligned judges to triangulate gains; for example, Meta-Rewarding reports both AlpacaEval and Arena-Hard scores in part to address this concern.[^4]
Although standard NLP benchmarks remain flat in the published experiments, several follow-up studies report that aggressive self-rewarding can degrade structured-reasoning behaviour on tasks outside the seed's distribution, presumably because the judge does not reward step-by-step reasoning unless the seed taught it to.[^6] Combining self-rewarding with process-style supervision such as a process reward model (PRM) is one proposed remedy.
| Method | Reward source | Reward updated during training? | Policy update rule | Representative reference |
|---|---|---|---|---|
| Classical RLHF (InstructGPT-style) | Bradley-Terry reward model from human pairs | No (frozen) | PPO against frozen RM | Ouyang et al. 2022[^8] |
| DPO | Implicit reward induced by preference pairs | No (fixed dataset) | DPO classification loss | Rafailov et al. 2023[^9] |
| Constitutional AI / RLAIF | AI-generated preferences from constitution | Reward model retrained but separate from policy | PPO against AI-trained RM | Bai et al. 2022[^10] |
| Self-Rewarding LM | Same model as policy via LLM-as-a-Judge | Yes, jointly with policy | DPO on self-labelled pairs | Yuan et al. 2024[^1] |
| Meta-Rewarding LM | Same model, with a meta-judge | Yes, with explicit judge training | DPO on self-labelled pairs with length control | Wu et al. 2024[^4] |
| Direct Nash Optimization | Same model via general pairwise preferences | Yes, approaches Nash equilibrium | Contrastive regression objective | Rosset et al. 2024[^5] |
| SPIN | Self vs reference distribution discriminator | Implicit in self-play loss | DPO-style self-play loss | Chen et al. 2024[^13] |
The table emphasises the central novelty of Self-Rewarding LMs: not the use of AI feedback (Constitutional AI and RLAIF already did that), nor the choice of DPO as optimiser (now standard), but the explicit fusion of policy and reward parameters into one network whose weights are updated each iteration.