Self-Rewarding Language Models

Self-Rewarding Language Models (SRLM) is an iterative alignment method in which a single large language model alternately plays the role of policy (generating candidate responses to user prompts) and reward model (scoring those responses through LLM-as-a-Judge-style prompting), then trains itself on the resulting preference pairs with Direct Preference Optimization (DPO) before repeating the cycle.[^1] The procedure was introduced by Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston of Meta AI and New York University in the paper "Self-Rewarding Language Models", posted to arXiv on 18 January 2024 (arXiv:2401.10020) and later presented at ICML 2024.[^1][^2] Starting from Llama 2 70B and a seed of approximately 3,200 instruction examples plus a small evaluation-tuning set, three training iterations (one supervised fine-tuning stage followed by two self-rewarding DPO rounds) raised the model's AlpacaEval 2.0 win rate against GPT-4-Turbo from 9.94% to 20.44%, surpassing reference systems such as Claude 2, Gemini Pro, and GPT-4 0613 in the same benchmark snapshot.[^1][^3] The contribution is widely cited as the first publicly demonstrated case where the policy and the reward model are not only the same network but co-improve through training, opening a research direction extended by Meta-Rewarding Language Models, Direct Nash Optimization, CREAM, SCIR, and related work.[^4][^5][^6][^7]

Background and motivation

Modern aligned chat assistants are typically produced by combining supervised fine-tuning on instruction data with a preference-learning stage. The dominant pipeline, Reinforcement Learning from Human Feedback (RLHF), first collects pairwise human comparisons of model outputs, trains a separate Bradley-Terry-style reward model on those comparisons, and finally optimises the policy against the frozen reward model with proximal policy optimisation or a related algorithm.[^8] Direct Preference Optimization, introduced by Rafailov et al. at NeurIPS 2023, removes the explicit reward model and on-policy sampling step by re-deriving the RLHF objective as a closed-form classification loss over preference pairs, but the underlying preference data and the implicit reward model induced from it are still typically fixed.[^9]
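For reference, the two objects named above can be written out explicitly. The notation is the standard one from the DPO literature rather than anything specific to this article: $x$ is a prompt, $y_w$ and $y_l$ are the preferred and dispreferred responses, $r$ is a scalar reward, $\sigma$ is the logistic function, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ is a temperature. The Bradley-Terry model assumes

$$
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
$$

and DPO replaces the explicit reward with a log-ratio against the reference policy, giving the classification loss

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].
$$

In standard DPO the dataset $\mathcal{D}$ is fixed in advance; self-rewarding training regenerates it from the current model at every iteration.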

Yuan and colleagues observed two consequences of this design that they treat as the main motivation for self-rewarding training. First, because the reward model is trained once on a finite human-annotated dataset and then frozen, alignment quality is capped at "human level": once the policy outperforms the data the reward model has been trained on, the reward signal becomes uninformative and may actively mislead optimisation.[^1] They write that "to achieve superhuman agents, future models require superhuman feedback" and argue that frozen reward models cannot supply it.[^1] Second, decoupling the policy from the reward model wastes information: the same network that generates a response often contains relevant evaluative knowledge (criteria, style, factual checks) that a frozen reward model trained from scratch must rediscover.[^1] The self-rewarding setup makes the reward model co-evolve with the policy by tying them to the same parameters, so each new iteration that improves response quality also (in principle) improves judgement quality.

The approach builds on a cluster of antecedents. Constitutional AI and RLAIF, introduced by Bai et al. at Anthropic, replaced human harmlessness labels with AI-generated critiques and preferences guided by a written constitution, but kept a separate (AI-derived) preference model for the RL stage.[^10] The LLM-as-a-judge methodology, formalised by Zheng et al. for MT-Bench and Chatbot Arena, showed that strong LLMs agree with human preferences on chatbot evaluation at over 80%, comparable to inter-annotator agreement, and identified position, verbosity, and self-enhancement biases as the main failure modes.[^11] Iterative DPO and rejection-sampling-style self-distillation pipelines such as Mistral's reinforced rejection sampling and Llama 2's chat training had already established the engineering pattern of "generate, score, retrain"; the missing piece was using the same model for both the generation and the scoring.[^1][^11]

How it works

Pipeline overview

A self-rewarding training run is parameterised by a base model $M_0$ (in the original paper, Llama 2 70B), a small seed of instruction-following examples (Instruction Fine-Tuning data, IFT), an equally small seed of evaluation examples (Evaluation Fine-Tuning data, EFT), and a fixed scoring prompt template.[^1] Training proceeds in successive iterations $M_1, M_2, \dots, M_T$:

1. Build $M_1$ by supervised fine-tuning $M_0$ on the union of IFT and EFT. This step gives the model both the ability to follow instructions and the ability to score responses on a 0-to-5 rubric.[^1]
2. For each prompt drawn from a held-out pool, sample $N$ candidate responses from the current model $M_t$ with stochastic decoding.
3. Use the same model $M_t$, prompted with the evaluation template, to assign a numeric score to each candidate; repeat the scoring $K$ times and average to reduce noise.
4. Pair the highest-scoring and lowest-scoring candidates per prompt into a preference dataset called AI Feedback Training data (AIFT).
5. Train $M_{t+1}$ by running DPO on AIFT, initialising from $M_t$ (in the paper, $M_2$ starts from $M_1$ and $M_3$ starts from $M_2$).
6. Loop back to step 2 with $M_{t+1}$.

The authors emphasise that step 5 is a single DPO run on the freshly generated preference pairs, not an online policy-gradient update, so the procedure inherits DPO's training stability while still being on-policy in the sense that the preference pairs come from the current model.[^1][^9]
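The outer loop can be sketched as follows. This is a minimal illustration, not the authors' code: `build_aift` and `dpo_train` are hypothetical callables standing in for the generate-score-pair stage and for a DPO training run, and the models are treated as opaque objects.

```python
from typing import Any, Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen response, rejected response)


def self_rewarding_loop(
    m1: Any,
    prompts: Sequence[str],
    build_aift: Callable[[Any, Sequence[str]], List[Pair]],  # hypothetical helper
    dpo_train: Callable[[Any, List[Pair]], Any],             # hypothetical helper
    rounds: int = 2,
) -> List[Any]:
    """Return [M1, M2, ...]; each model labels the data that trains its successor."""
    models = [m1]
    for _ in range(rounds):
        current = models[-1]
        # The current model both generates candidates and judges them.
        pairs = build_aift(current, prompts)
        # One offline DPO run per round, not an online policy-gradient update.
        models.append(dpo_train(current, pairs))
    return models
```

With `rounds=2` and the SFT model as `m1`, the returned list corresponds to $M_1$, $M_2$, $M_3$.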

Seed data and the LLM-as-a-Judge prompt

The seed IFT consists of about 3,200 first-turn examples from the Open Assistant dataset, restricted to English and to conversations whose first reply was rated highest in the original ranking.[^1] The seed EFT contains 1,630 training and 541 evaluation examples constructed by reformatting Open Assistant ranked responses into "instruction, response, score" tuples, where the score is produced by prompting GPT-4 with a fixed 0-to-5 additive rubric for relevance, coverage, usefulness, clarity, and expertise.[^1]

The scoring template, reproduced from Figure 2 of the paper, awards points cumulatively:

1 point if the response is relevant and provides some information related to the user's inquiry.
1 more point if the response addresses a substantial portion of the user's question.
1 more point if the response answers the basic elements of the user's question in a useful way.
1 more point if the response is clearly written from an AI assistant's perspective, addressing the question directly and comprehensively.
1 final point for a response impeccably tailored to the user's question, without extraneous information.[^1]

The judge is asked to produce a brief justification and a single integer in the range 0 to 5 on a designated final line, which can be parsed deterministically. Because the rubric is additive rather than holistic, scores tend to cluster in the middle of the range, which reduces variance from minor wording differences.[^1]
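Deterministic parsing of the final line might look like the sketch below. The exact wording of the score line is an assumption here (a line of the form `Score: <integer>`); adapt the pattern to whatever the template actually requests.

```python
import re
from typing import Optional

# ASSUMPTION: the judge ends its output with a line such as "Score: 4".
SCORE_RE = re.compile(r"score:\s*([0-5])\b", re.IGNORECASE)


def parse_score(judgement: str) -> Optional[int]:
    """Return the 0-5 score found on the last non-empty line, or None."""
    lines = [line.strip() for line in judgement.splitlines() if line.strip()]
    if not lines:
        return None
    match = SCORE_RE.search(lines[-1])
    return int(match.group(1)) if match else None
```

Returning `None` rather than a default value lets the caller drop or re-sample judgements that do not follow the format, instead of silently injecting a spurious score.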

Sampling and pair construction

For each new prompt the system samples $N = 4$ candidate responses with temperature $T = 0.7$ and nucleus $p = 0.9$, then scores each candidate $K = 3$ times under the rubric and averages.[^1] If at least one candidate strictly outscores another, the highest-scoring response becomes the "chosen" example and the lowest-scoring becomes the "rejected" example, forming a single preference pair. Prompts where all candidates receive identical scores are discarded, which biases the resulting dataset towards harder cases on which the model has discriminative signal.
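The pairing rule for a single prompt can be sketched as below. The function name and the dictionary layout (response text mapped to its $K$ sampled scores) are illustrative choices, not taken from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def build_pair(scores: Dict[str, List[float]]) -> Optional[Tuple[str, str]]:
    """Map {response: [K scores]} to (chosen, rejected), or None if all averages tie."""
    if not scores:
        return None
    averaged = {response: mean(values) for response, values in scores.items()}
    chosen = max(averaged, key=averaged.get)
    rejected = min(averaged, key=averaged.get)
    if averaged[chosen] == averaged[rejected]:
        return None  # no discriminative signal, so the prompt is discarded
    return chosen, rejected
```

For example, with four candidates scored three times each, `build_pair({"a": [4, 5, 4], "b": [2, 3, 2], "c": [3, 3, 3], "d": [4, 4, 4]})` selects `"a"` as chosen and `"b"` as rejected.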

The size of the AIFT set grows across iterations because more discriminable pairs survive: 3,964 pairs were used to train $M_2$ from $M_1$, and 6,942 pairs to train $M_3$ from $M_2$.[^1] Prompts for sampling are drawn from a held-out portion of Open Assistant and supplemented by automatically generated instructions following the Self-Instruct paradigm.

DPO training step

Each new $M_{t+1}$ is initialised from its immediate predecessor $M_t$ (so $M_2$ starts from $M_1$, the EFT-augmented SFT model, and $M_3$ starts from $M_2$), then fine-tuned with DPO on the AIFT preference pairs that $M_t$ produced. Reported hyperparameters include a learning rate of 1e-6 linearly decayed to 1e-7, a batch size of 16, and a DPO temperature parameter $\beta = 0.1$.[^1] Because each round builds on the previous one, gains can accumulate across iterations, but so can any noise or bias in the self-assigned labels.
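The per-pair DPO loss is $-\log \sigma(\beta \cdot \text{margin})$, where the margin is the policy-versus-reference log-ratio of the chosen response minus that of the rejected response. The sketch below computes it from four summed sequence log-probabilities; it is a plain-Python illustration of the formula, not a training implementation, and the argument names are made up for this example.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is $\log 2$; it falls below that value once the policy assigns relatively more probability to the chosen response than the reference does.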

Two-stage SFT: instruction following and evaluation

A distinctive aspect of the recipe is that the very first SFT stage combines two objectives in a single training pass. The IFT examples teach the model to follow instructions in the standard way; the EFT examples teach it to apply the evaluation rubric to (instruction, response) pairs and emit a numeric score with reasoning. The paper shows that this Evaluation Fine-Tuning is the key ingredient that allows the model to act as a reliable judge for itself: a model trained on IFT alone reaches only 65.1% pairwise accuracy at predicting human preferences on the held-out EFT set, whereas adding EFT raises this to 78.7% even before any self-rewarding loop is run.[^1] After three iterations the figure rises to 81.7%, supporting the paper's central claim that the model's judging ability improves in lock-step with its instruction-following ability.[^1]
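Pairwise accuracy of this kind can be computed as in the sketch below. The input layout and the treatment of ties (counted as errors) are assumptions made for illustration; the paper's exact evaluation script is not reproduced here.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Each item is (judge score of the human-preferred response, score of the other)."""
    if not pairs:
        raise ValueError("need at least one pair")
    # ASSUMPTION: a tie is counted as a miss, since it does not recover the human ordering.
    correct = sum(1 for preferred, other in pairs if preferred > other)
    return correct / len(pairs)
```

For instance, `pairwise_accuracy([(4, 2), (3, 3), (1, 5), (5, 0)])` gives 0.5: two orderings match the human ranking, one is reversed, and one is a tie.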

Results in the original paper

AlpacaEval 2.0

The paper's headline metric is the AlpacaEval 2.0 win rate against GPT-4-Turbo (gpt-4-1106-preview) as judged by GPT-4-Turbo itself, computed on 805 instructions and reported under the original (non-length-controlled) protocol used at the time of submission.[^1] Length-controlled AlpacaEval, which fits a logistic regression to remove the well-known length bias in autoannotators, was published by Dubois et al. only after the Self-Rewarding paper.[^12]

Model	AlpacaEval 2.0 win rate vs GPT-4-Turbo
SFT baseline (IFT only)	9.94%
$M_1$ (IFT + EFT)	9.94%
$M_2$ (one self-rewarding iteration)	15.38%
$M_3$ (two self-rewarding iterations)	20.44%

[^1]

For comparison, the same paper's leaderboard snapshot reported Claude 2 at 17.19%, Gemini Pro at 16.85%, and GPT-4 0613 at 15.76%, all below the third iteration of Self-Rewarding Llama 2 70B.[^1] Head-to-head GPT-4 evaluations between iterations showed $M_3$ beating $M_2$ in 47.7% of cases versus losses in 12.5% (the rest were ties), and beating the SFT baseline in 62.5% of cases versus 9.8%.[^3]

Reward modelling accuracy

A second axis of evaluation tracks how well the model predicts the same Open Assistant pairwise ranking the EFT data was derived from. The reported pairwise accuracy rises monotonically:

Model	Pairwise reward accuracy
SFT baseline (IFT only)	65.1%
$M_1$ (IFT + EFT)	78.7%
$M_2$	80.4%
$M_3$	81.7%

[^1]

The pattern is the paper's key empirical claim: training on self-generated preference pairs not only improves response quality but also tightens the model's agreement with human evaluators, contradicting the intuition that pure self-distillation should make the judge collapse onto its own biases.

MT-Bench

MT-Bench, an 80-question two-turn benchmark (160 turns in total) scored by GPT-4 on a 1-to-10 scale, shows a smaller but consistent climb across iterations: 6.85 for the SFT baseline (without EFT), 6.78 for $M_1$, 7.01 for $M_2$, and 7.25 for $M_3$.[^1] The small drop from baseline to $M_1$ reflects the dilution of IFT data by EFT scoring examples and is recovered by the first self-rewarding step.

Standard NLP benchmark behaviour

A complementary set of zero-shot tasks (ARC, HellaSwag, MMLU, OpenBookQA, GSM8K, TriviaQA in the published evaluation) was used to test for capability regression. The reported numbers are essentially flat across $M_0$ through $M_3$, suggesting that self-rewarding modifies the model's chat behaviour without degrading its underlying knowledge or reasoning at the scale tested.[^1] The authors interpret this as evidence that self-rewarding adjusts surface style and instruction-following while leaving the base distribution largely intact, though they do not claim it improves these benchmarks either.

Response length

A persistent caveat is that average response length grew sharply across iterations: roughly 1,092 characters for $M_1$, 1,552 for $M_2$, and 2,552 for $M_3$.[^1] Because GPT-4 evaluators favour longer responses in AlpacaEval-style protocols, an unknown fraction of the headline win-rate gains may be attributable to length rather than substance. The authors acknowledge the issue and later researchers explicitly target it (see Meta-Rewarding and length-controlled AlpacaEval, below).[^4][^12]
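The scale of the growth is easier to see as a ratio. The snippet below only restates the character counts quoted above relative to $M_1$; the helper name is illustrative.

```python
# Average response lengths in characters, as quoted in the text above.
avg_chars = {"M1": 1092, "M2": 1552, "M3": 2552}


def growth_vs(baseline: str, lengths: dict) -> dict:
    """Length of each model's responses relative to the baseline (1.0 = unchanged)."""
    base = lengths[baseline]
    return {name: round(chars / base, 2) for name, chars in lengths.items()}


print(growth_vs("M1", avg_chars))  # {'M1': 1.0, 'M2': 1.42, 'M3': 2.34}
```

Responses from $M_3$ are therefore about 2.3 times as long as those from $M_1$, over the same span in which the win rate roughly doubled.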

Variants and follow-ups

Meta-Rewarding Language Models

Wu et al. (Meta FAIR, UC Berkeley, NYU) introduced Meta-Rewarding Language Models in July 2024 (arXiv:2407.19594).[^4] The procedure adds a third role to the actor/judge dyad: a meta-judge that compares pairs of judgements produced by the same model and assigns preference labels to them, which are then used to fine-tune the model's judging behaviour in addition to its acting behaviour.[^4] The meta-judge stage explicitly addresses the rapid saturation of pure self-rewarding by training the model to produce better evaluations, not only better responses.

The paper also introduces a length-control mechanism. A quality tier parameter $\rho \in [0,1]$ selects the top fraction of candidates by score, then "opt[s] for the shortest response within this top tier" as the chosen example, with the longest low-scoring candidate as the rejected example, counteracting the judge's length bias.[^4] Judge-level filtering additionally drops preference pairs whose chosen judgements exceed a length threshold (varying from about 1,100 to 1,000 characters across iterations).[^4]
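A sketch of the selection rule follows. It makes two assumptions that should be checked against the Meta-Rewarding paper before use: the top tier is taken to be the score band from $(1-\rho)\,S_{\max} + \rho\,S_{\min}$ up to $S_{\max}$, and the rejected example is the longest candidate scoring below that band. The function and type names are illustrative.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (response text, averaged judge score)


def length_controlled_pair(
    candidates: List[Candidate], rho: float
) -> Optional[Tuple[str, str]]:
    """Chosen = shortest response in the top tier; rejected = longest one below it."""
    if not candidates or not 0.0 <= rho <= 1.0:
        return None
    scores = [score for _, score in candidates]
    s_max, s_min = max(scores), min(scores)
    # ASSUMPTION: the tier is a score band; rho = 0 keeps only the top scorers.
    threshold = (1.0 - rho) * s_max + rho * s_min
    top = [c for c in candidates if c[1] >= threshold]
    rest = [c for c in candidates if c[1] < threshold]
    if not rest:
        return None  # nothing scored below the tier, so there is no contrast
    chosen = min(top, key=lambda c: len(c[0]))
    rejected = max(rest, key=lambda c: len(c[0]))
    return chosen[0], rejected[0]
```

With $\rho = 0$ the rule reduces to picking a top-scoring response regardless of length; larger values trade a little score for brevity, which is what counteracts the judge's preference for long answers.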

Applied to Llama-3-8B-Instruct (with a seed model whose AlpacaEval 2 length-controlled win rate is 22.92%), four iterations of meta-rewarding raised the length-controlled win rate to 39.44% and the raw win rate to 39.45%, a gain of 16.52 length-controlled percentage points over the seed.[^4] On Arena-Hard the same model rose from 20.6% to 29.1%. MT-Bench Turn-1 score increased from 8.319 to 8.738 across the same four iterations, with less than 0.1 reduction in Turn-2 score. The paper reports that Meta-Rewarding outperforms a Self-Rewarding baseline with the same length-control mechanism by about 3.95 LC points at iteration 4, isolating the contribution of the meta-judge.[^4]

Direct Nash Optimization

Microsoft Research's Direct Nash Optimization (DNO), introduced by Rosset et al. in April 2024 (arXiv:2404.03715), reformulates self-improvement as approaching a Nash equilibrium of a general-preference game rather than maximising a scalar reward.[^5] DNO replaces the Bradley-Terry assumption underlying DPO with a more general pairwise preference function, then iteratively trains the policy against preferences elicited by the current model itself. The authors report that a 7B Orca-2.5 model aligned by DNO reached a 33% AlpacaEval 2.0 win rate against GPT-4-Turbo, surpassing a 70B Self-Rewarding model on the same benchmark and matching Mistral Large.[^5] DNO is often grouped with Self-Rewarding LMs as evidence that the policy/reward duality can be exploited beyond the strict DPO framework.

CREAM and SCIR

Two follow-ups attack the reward-noise problem directly. CREAM ("Consistency Regularized Self-Rewarding Language Models", arXiv:2410.12735) measures the consistency of self-assigned rankings across training iterations and adds a regularisation term that penalises preference pairs on which earlier and later versions of the model disagree, on the grounds that those pairs are likely noise.[^6] SCIR ("Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models", arXiv:2502.08922) generalises the idea to enforce agreement among multiple internal reward-extraction strategies within a single model.[^7] Both report that reward consistency rises from roughly 0.4-0.5 in vanilla Self-Rewarding to above 0.9 with their regularisation, with corresponding gains on AlpacaEval-style metrics; both also explicitly attribute the rapid saturation of vanilla Self-Rewarding to noise in the self-generated preference labels.[^6][^7]
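One simple way to quantify agreement between two versions of a judge is the fraction of candidate pairs that both versions order the same way. The sketch below implements that Kendall-style measure as an illustration only; it is not claimed to be the exact consistency metric used in CREAM or SCIR, and the consistency figures quoted above should not be read against it.

```python
from itertools import combinations
from typing import Sequence


def ranking_consistency(
    prev_scores: Sequence[float], curr_scores: Sequence[float]
) -> float:
    """Fraction of candidate pairs ordered the same way by both score lists."""
    if len(prev_scores) != len(curr_scores) or len(prev_scores) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    agree = total = 0
    for i, j in combinations(range(len(prev_scores)), 2):
        d_prev = prev_scores[i] - prev_scores[j]
        d_curr = curr_scores[i] - curr_scores[j]
        total += 1
        if d_prev * d_curr > 0:  # same strict ordering; ties count as disagreement
            agree += 1
    return agree / total
```

Here `prev_scores[i]` and `curr_scores[i]` are the scores that the earlier and later model assign to the same candidate; pairs with low agreement are the ones a consistency regulariser would down-weight.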

Self-play and constitutional variants

Earlier work on SPIN (Self-Play Fine-Tuning, Chen et al. 2024) shares the iterative DPO scaffold but replaces the self-judge with a fixed objective: distinguish the previous-iteration policy's outputs from a reference distribution.[^13] SPIN is therefore complementary, since it does not require any evaluator at all but does not improve the model's judgement.

Constitutional AI and the broader RLAIF family use AI-generated preferences but retain a separate, frozen preference model trained on AI-labelled comparisons; they predate Self-Rewarding LMs and inspired the LLM-as-a-judge step.[^10] Several 2024-2025 systems combine the two ideas, replacing the static constitution with an evolving self-judge prompt while retaining safety-oriented critique prompts as guard rails.

Applications and adoption

Self-rewarding has been positioned as a route to extend alignment training beyond what is feasible with hand-collected preference data. Concrete uses fall into three broad categories:

Open-source post-training pipelines. A widely starred open implementation by lucidrains (self-rewarding-lm-pytorch) makes the procedure available with a Hugging Face Transformers / TRL backend, and several open chat models published between mid-2024 and 2025 report using a self-rewarding stage on top of standard DPO, although exact recipes vary.[^14]
Specialised judge training. Because Evaluation Fine-Tuning produces a model that is simultaneously an actor and a judge, researchers have used Self-Rewarding-style models as drop-in replacements for separate judge LLMs in tasks such as instruction-tuning data curation, rejection sampling of synthetic data, and reward modelling for downstream RL.[^4][^6]
Foundations for scalable-oversight experiments. The procedure is cited as an empirical reference point in the scalable oversight literature: it demonstrates concretely that a model can improve its own evaluator, providing one of the few experimental data points for proposals such as recursive reward modelling and weak-to-strong generalisation.[^1]

The paper's results have been replicated qualitatively in the smaller-model regime by the Meta-Rewarding study on Llama-3-8B-Instruct and by community fine-tunes on Mistral 7B-class base models, although smaller models tend to plateau sooner without explicit consistency or length controls.[^4][^6]

Limitations and criticisms

Reward saturation and self-distillation collapse

The most repeated critique is that Self-Rewarding plateaus quickly. The original paper itself reports diminishing returns by iteration 3, and follow-up work by Wu et al. (Meta-Rewarding) and CREAM frames this as the expected outcome of training a judge only via its actor role: each iteration produces preference pairs that the judge has already learned to discriminate, so the marginal informational content of the AIFT set falls.[^4][^6] Meta-Rewarding's explicit judge training and CREAM's consistency regularisation each extend the productive iteration count, but neither claims indefinite improvement.

Reward hacking and length bias

Because the judge and the policy share parameters, biases in the judge propagate directly into the policy. The most visible such bias is verbosity: GPT-4-style evaluators (including the model itself) favour longer responses on AlpacaEval-style protocols, and self-rewarding monotonically inflates response length, more than doubling from $M_1$ to $M_3$ in the original experiments.[^1][^11] The same effect is the textbook example of reward hacking, in which the policy exploits a defect in the reward function (here, length preference) without improving on the underlying objective.[^1] Subsequent benchmarks (length-controlled AlpacaEval, Arena-Hard's length-aware judging) and length-control mechanisms inside Meta-Rewarding were designed in part to discount the inflation.[^4][^12]
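A quick diagnostic for this bias is to check how often the chosen response in a preference set is simply the longer one. The helper below is a hypothetical check, not something taken from the cited papers; a rate far above one half suggests that length, rather than content, may be driving the labels.

```python
from typing import Sequence, Tuple


def chosen_longer_rate(pairs: Sequence[Tuple[str, str]]) -> float:
    """Share of (chosen, rejected) pairs in which the chosen text is strictly longer."""
    if not pairs:
        raise ValueError("need at least one pair")
    longer = sum(1 for chosen, rejected in pairs if len(chosen) > len(rejected))
    return longer / len(pairs)
```

Tracking this rate across iterations alongside average response length gives an early signal of verbosity drift before it shows up in benchmark scores.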

A related concern is the self-enhancement bias identified by Zheng et al.: LLM judges systematically prefer answers in their own style, which means a self-rewarding model effectively rewards its own idiosyncrasies on every iteration.[^11] To the extent that those idiosyncrasies do not correlate with human preferences, the procedure can drift away from human judgement even while raising its self-reported scores. The original paper attempts to bound this risk by also reporting pairwise accuracy against the original Open Assistant rankings, which remains a fixed external anchor.[^1]

Inheritance of Goodhart's law

More generally, Self-Rewarding LMs make explicit a tension long discussed in AI alignment: optimising hard against any imperfect proxy reward eventually drives the proxy and the true objective apart (the proxy goes up while the true objective stagnates or falls), a pattern captured by Goodhart's law. Self-rewarding makes the proxy and the optimiser the same artefact, which amplifies both the upside (faster adaptation) and the risk (no external check). The literature on scalable oversight frequently uses Self-Rewarding as a case study in this trade-off, sometimes pairing it with AI safety via debate or recursive reward modelling as complementary mechanisms for keeping the proxy honest.[^15]

Dependence on the seed and on Open Assistant

The recipe is sensitive to the quality and breadth of seed data. The reported gains use 3,200 IFT and roughly 1,600 EFT examples derived from Open Assistant; the paper does not show whether equivalent gains are achievable from scratch on a model without any prior chat tuning, and replication attempts on bases with weaker initial instruction-following often saturate at lower win rates.[^1][^6] Because the seed is small relative to the base model, the procedure is also brittle to distribution mismatch between the seed evaluator's notion of quality and the held-out prompt distribution.

Benchmark contamination and evaluator self-reference

Because AlpacaEval 2.0 uses GPT-4-Turbo both as the reference opponent and as the judge, the headline numbers show improvement only as measured by a single judge family and do not establish that the gains would hold under a different evaluator. Subsequent work has emphasised the importance of cross-judge evaluation, replacing GPT-4 with stronger or differently aligned judges to triangulate gains; for example, Meta-Rewarding reports both AlpacaEval and Arena-Hard scores in part to address this concern.[^4]

Capability narrowing

Although standard NLP benchmarks remain flat in the published experiments, several follow-up studies report that aggressive self-rewarding can degrade structured-reasoning behaviour on tasks outside the seed's distribution, presumably because the judge does not reward step-by-step reasoning unless the seed taught it to.[^6] Combining self-rewarding with process-style supervision such as a process reward model (PRM) is one proposed remedy.

Comparison with adjacent methods

Method	Reward source	Reward updated during training?	Policy update rule	Representative reference
Classical RLHF (InstructGPT-style)	Bradley-Terry reward model from human pairs	No (frozen)	PPO against frozen RM	Ouyang et al. 2022[^8]
DPO	Implicit reward induced by preference pairs	No (fixed dataset)	DPO classification loss	Rafailov et al. 2023[^9]
Constitutional AI / RLAIF	AI-generated preferences from constitution	Reward model retrained but separate from policy	PPO against AI-trained RM	Bai et al. 2022[^10]
Self-Rewarding LM	Same model as policy via LLM-as-a-Judge	Yes, jointly with policy	DPO on self-labelled pairs	Yuan et al. 2024[^1]
Meta-Rewarding LM	Same model, with a meta-judge	Yes, with explicit judge training	DPO on self-labelled pairs with length control	Wu et al. 2024[^4]
Direct Nash Optimization	Same model via general pairwise preferences	Yes, approaches Nash equilibrium	Contrastive regression objective	Rosset et al. 2024[^5]
SPIN	Self vs reference distribution discriminator	Implicit in self-play loss	DPO-style self-play loss	Chen et al. 2024[^13]

The table emphasises the central novelty of Self-Rewarding LMs: not the use of AI feedback (Constitutional AI and RLAIF already did that), nor the choice of DPO as optimiser (now standard), but the explicit fusion of policy and reward parameters into one network whose weights are updated each iteration.

Related work

Direct Preference Optimization (DPO) is the optimiser used inside every Self-Rewarding iteration.[^9]
Reinforcement Learning from Human Feedback (RLHF) is the baseline alignment procedure that Self-Rewarding aims to extend past the human-data ceiling.[^8]
Constitutional AI and RLAIF are the immediate methodological precursors for AI-generated preference labels.[^10]
MT-Bench and AlpacaEval are the principal benchmarks on which Self-Rewarding gains are reported.[^11][^12]
SPIN (Self-Play Fine-Tuning) is a contemporaneous iterative self-improvement method that uses self-play rather than self-judging.[^13]
Reward hacking and Goodhart's law frame the central failure mode that Self-Rewarding amplifies, and scalable oversight / recursive reward modelling are the broader research agendas it informs.[^15]
Chatbot Arena is the human-vote leaderboard most often used to triangulate AlpacaEval results.[^11]
Rejection sampling and synthetic data pipelines are the engineering context in which the procedure is most readily slotted.[^14]

References
