Self-Taught Evaluator
Last reviewed: Jun 3, 2026 · Sources: 7 citations · Review status: Source-backed · Revision: v1 · 1,332 words
Self-Taught Evaluator is a method for training a strong LLM-as-a-judge without any human preference annotations, using synthetic training data and an iterative self-improvement loop. It was introduced in the paper "Self-Taught Evaluators" by Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li at Meta AI's Fundamental AI Research (FAIR) group, posted to arXiv on 5 August 2024.[1][2] Starting from Llama 3 70B-Instruct as the seed model, the approach reported 88.3% accuracy on RewardBench (88.7% with majority voting), up from the seed model's 75.4%, matching reward models trained on human-labeled data and exceeding general-purpose judges such as GPT-4.[1]
Model-based evaluation underpins much of modern large language model development. Evaluator models are used as reward models during reinforcement learning from human feedback, and as cheaper, faster stand-ins for human judges when comparing model outputs. The standard way to build such an evaluator is to collect large quantities of human preference judgments over pairs of responses. That data is expensive to gather, and it goes stale as the underlying models improve and their failure modes shift.[1][3]
The Self-Taught Evaluator paper asks whether a competitive evaluator can be trained without that human-labeled data at all. The authors argue that a sufficiently capable base model can generate its own training signal: produce response pairs where one member is known to be worse by construction, then learn to explain and confirm that judgment, and repeat the process on progressively harder cases.[1][4]
The training scheme uses only unlabeled instructions plus a seed instruction-tuned model. No human rates any response. The pipeline runs in a loop, and each pass produces data used to fine-tune the next iteration's model.[1]
| Step | Description |
|---|---|
| 1. Instruction selection | Sample unlabeled user instructions from a public corpus (the paper uses WildChat) and classify them by category with an LLM. The main experiments concentrate on the reasoning category, using 20,582 instructions to focus on hard inputs. |
| 2. Response pair construction | For each instruction, the seed model first writes a baseline response (treated as the "good" answer). The model is then prompted to write a modified instruction that is similar but not semantically identical, and to answer that modified instruction. This second answer is high quality for the wrong question, so it serves as the "bad" response to the original instruction, giving a synthetic preference pair with a known winner. |
| 3. Judgment generation | The current evaluator model is prompted LLM-as-a-judge style to compare the pair and produce a reasoning trace ending in a verdict. The paper samples N=15 judgments per example. |
| 4. Rejection sampling | Only judgments whose final verdict matches the constructed (known) preference are kept; one correct trace is retained per example. Examples for which no sampled judgment reaches the correct verdict are discarded. |
| 5. Fine-tune and iterate | The retained reasoning-plus-judgment data is used to fine-tune the seed model, and the loop repeats with the improved model generating fresh judgments. |
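Steps 2 through 4 of the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: `build_pair`, `rejection_sample`, the prompt wording, and the toy `generate` and `judge` callables are all hypothetical stand-ins for real model calls. The sketch also always shows the "good" response as option A and keeps the first correct trace, both of which are simplifications.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # answer to the original instruction (the "good" response)
    rejected: str  # answer to a modified instruction (the "bad" response)


def build_pair(instruction: str, generate: Callable[[str], str]) -> PreferencePair:
    """Step 2: build a pair whose winner is known by construction."""
    chosen = generate(instruction)
    # Hypothetical prompt wording; only the idea comes from the article.
    modified = generate(
        "Write a similar but not semantically identical instruction to: " + instruction
    )
    rejected = generate(modified)
    return PreferencePair(instruction, chosen, rejected)


def rejection_sample(
    pair: PreferencePair,
    judge: Callable[[PreferencePair, int], Tuple[str, str]],
    n_samples: int = 15,
) -> Optional[str]:
    """Steps 3-4: sample judgments, keep one trace with the correct verdict."""
    for i in range(n_samples):
        trace, verdict = judge(pair, i)
        if verdict == "A":  # "A" denotes the chosen response in this sketch
            return trace
    return None  # no sample was correct, so the example is discarded


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt: str) -> str:
    return f"response to <{prompt}>"


def toy_judge(pair: PreferencePair, sample_index: int) -> Tuple[str, str]:
    # Pretend the judge is only right on its third sample.
    verdict = "A" if sample_index == 2 else "B"
    return f"trace {sample_index}", verdict


pair = build_pair("Sort a list in Python", toy_generate)
kept_trace = rejection_sample(pair, toy_judge)
print(kept_trace)
```

In a real pipeline the kept traces would become the fine-tuning set for step 5, and the `None` cases would simply be dropped.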
Because the "bad" answer is generated as a genuinely good response to a slightly different prompt rather than as obviously broken text, the contrast is subtle and forces the evaluator to reason about relevance to the actual instruction rather than surface quality. The iteration acts as an automatic curriculum: as the evaluator improves, rejection sampling admits harder examples it can now solve, so the training distribution sharpens over successive rounds.[1][4] The paper runs five iterations, with most of the gain on RewardBench arriving in the first one or two before the curve flattens.[1]
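The curriculum effect can be illustrated with a purely hypothetical toy model. Nothing below comes from the paper: the integer difficulty scale, the `skill` threshold, and the halving `gain` are assumptions chosen only to show how a stronger judge lets more (and harder) examples through rejection sampling.

```python
from typing import List


def simulate_curriculum(
    difficulties: List[int], skill: int, gain: int, n_iters: int
) -> List[int]:
    """Return how many examples survive rejection sampling in each iteration."""
    kept_per_iter = []
    for _ in range(n_iters):
        # An example is kept only if the current judge can reach the correct verdict.
        kept = [d for d in difficulties if d <= skill]
        kept_per_iter.append(len(kept))
        # Toy stand-in for fine-tuning: skill rises with diminishing gains (assumption).
        if kept:
            skill += gain
            gain //= 2
    return kept_per_iter


history = simulate_curriculum(difficulties=list(range(1, 11)), skill=3, gain=4, n_iters=5)
print(history)  # [3, 7, 9, 10, 10]
```

The count of admitted examples grows quickly at first and then levels off, which mirrors the qualitative shape described above but says nothing about the actual numbers in the paper.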
This procedure builds on the broader line of self-improvement work from the same group, including iterative methods where a model trains on its own filtered outputs.[1]
The headline evaluation is RewardBench, a benchmark that measures how often an evaluator prefers the better of two responses across chat, hard-chat, safety, and reasoning subsets. The Self-Taught Evaluator raised Llama-3-70B-Instruct from 75.4% to 88.3% overall after five iterations, and to 88.7% when 32 judgment samples are combined by majority vote.[1]
| Evaluator | Overall | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|
| Llama-3-70B-Instruct (seed) | 75.4 | 97.6 | 58.9 | 69.2 | 78.5 |
| Self-Taught Evaluator (iter. 5) | 88.3 | 96.6 | 84.2 | 91.5 | 81.0 |
| Self-Taught Evaluator (+ majority vote) | 88.7 | 96.9 | 84.0 | 91.5 | 82.5 |
| Llama-3-70B-Instruct + HelpSteer2 (labeled) | 85.6 | 96.9 | 70.0 | 88.8 | 86.7 |
| GPT-4 (0125) | 84.3 | 95.3 | 74.3 | 87.2 | 86.9 |
| Gemini 1.5 Pro | 88.1 | 92.3 | 80.6 | 87.5 | 92.0 |
Two comparisons are central to the paper's claim. First, the synthetic-only evaluator beats a version of the same base model fine-tuned as an LLM-as-a-judge on the HelpSteer2 dataset of roughly 10,000 human preference annotations, which scored 85.6%.[1] Second, it exceeds GPT-4 (0125) at 84.3% as a judge, and reaches the level of strong proprietary judges like Gemini 1.5 Pro.[1] The largest improvements over the seed model come on the Chat Hard and Safety subsets, where reasoning about subtle differences matters most, while the Reasoning subset shows the seed and trained models closer together.
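The per-subset gains can be checked directly from the table. The short script below only restates the seed and iteration-5 rows and subtracts them; the variable names are illustrative.

```python
# Scores copied from the seed and "iter. 5" rows of the RewardBench table.
seed = {"Overall": 75.4, "Chat": 97.6, "Chat Hard": 58.9, "Safety": 69.2, "Reasoning": 78.5}
trained = {"Overall": 88.3, "Chat": 96.6, "Chat Hard": 84.2, "Safety": 91.5, "Reasoning": 81.0}

# Difference in percentage points, rounded to one decimal to hide float noise.
deltas = {name: round(trained[name] - seed[name], 1) for name in seed}

# Subsets ordered from largest to smallest gain.
ranking = sorted(deltas, key=deltas.get, reverse=True)

for name in ranking:
    print(f"{name}: {deltas[name]:+.1f}")
```

Chat Hard (+25.3 points) and Safety (+22.3) account for most of the 12.9-point overall gain, Reasoning moves by 2.5 points, and Chat drops by 1.0.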
Beyond RewardBench, the paper reports results on additional evaluation settings. On MT-Bench human-agreement, the Self-Taught Evaluator reached about 78.9% agreement with human judgments (79.5% with majority voting), close to GPT-4-0125 at roughly 79.1%.[1] The authors also run ablations using a different seed model (a Mixtral 8x22B Instruct model) for the initial synthetic data generation, showing the recipe is not specific to a single base model.[1]
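Majority voting over sampled judgments, used for the "+ majority vote" figures, amounts to counting verdicts. The sketch below is an illustration rather than the paper's implementation: `majority_verdict` is a hypothetical helper, and the tie-break rule (fall back to the earliest sampled verdict among the tied options) is an assumption, since the article does not say how ties are handled.

```python
from collections import Counter
from typing import List


def majority_verdict(verdicts: List[str]) -> str:
    """Return the most frequent verdict among sampled judgments."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    top = max(counts.values())
    winners = {v for v, c in counts.items() if c == top}
    # Tie-break (assumption): the earliest sampled verdict among the tied options.
    return next(v for v in verdicts if v in winners)


# 32 sampled verdicts for one comparison: 20 prefer A, 12 prefer B.
samples = ["A"] * 20 + ["B"] * 12
final = majority_verdict(samples)
print(final)  # A
```

An even sample count such as 32 can produce a 16-16 split, so any real implementation needs some explicit tie rule.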
Meta released model weights, training data, and code alongside the research. The public model, facebook/Self-taught-evaluator-llama3.1-70B, is built on Llama-3.1-70B-Instruct (a slightly newer base than the Llama-3-70B-Instruct used for the main paper experiments). According to its model card, it is trained with a combination of supervised fine-tuning and direct preference optimization, the latter using a combined DPO and negative-log-likelihood loss on preference pairs.[5] The released model is distributed under a custom Self-Taught Evaluator research license that gates access.[5] The preprocessed preference data is published as the facebook/Self-taught-evaluator-DPO-data dataset, and the training and evaluation code lives in Meta's facebookresearch/RAM repository under the self_taught_evaluator project.[6][7] The model release was part of a batch of FAIR open-source artifacts announced in 2024.[3]
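One common way to combine a DPO term with a negative-log-likelihood term on the chosen response is sketched below for a single preference pair. The article states only that the two losses are combined; the exact form here, including the `beta` and `alpha` weights and the length normalization of the NLL term, is an assumption and may differ from what the released training code does.

```python
import math


def dpo_nll_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_len: int,
    beta: float = 0.1,   # DPO temperature (assumed value)
    alpha: float = 1.0,  # weight on the NLL term (assumed value)
) -> float:
    """DPO loss plus length-normalized NLL on the chosen response, for one pair.

    All *_logp arguments are summed token log-probabilities of a full response.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    dpo = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    nll = -policy_chosen_logp / chosen_len
    return dpo + alpha * nll


# With policy equal to reference the margin is zero, so the DPO term is log(2).
baseline = dpo_nll_loss(-10.0, -12.0, -10.0, -12.0, chosen_len=5)
improved = dpo_nll_loss(-8.0, -12.0, -10.0, -12.0, chosen_len=5)
print(baseline, improved)
```

Raising the policy's log-probability of the chosen response lowers both terms, which is why `improved` comes out smaller than `baseline`.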
The work is part of a wider shift toward synthetic-data and self-improvement pipelines that reduce dependence on human annotation for both training and evaluation, alongside related ideas such as self-rewarding language models from the same lab. By removing the human-labeling bottleneck, the method makes it cheaper to keep evaluators current as frontier models change.
Several caveats apply. The recipe presumes a seed model strong enough to write competent responses and usable reasoning traces, so it is less applicable to weak base models. The synthetic "bad" responses are generated by a specific trick (answering an altered instruction), which targets relevance errors and may not cover every kind of quality difference a human cares about, such as factual correctness or safety nuances not represented in the source instructions. RewardBench accuracy also measures agreement on constructed comparisons rather than downstream training outcomes, so a high score does not by itself guarantee that using the model as a reward signal yields better-aligned policies. The authors frame the approach as complementary to, not a wholesale replacement for, human oversight.[1][4]