LLM-as-a-judge
Last reviewed: May 31, 2026
Sources: 16 citations
Review status: Source-backed
Revision: v1 · 3,568 words
LLM-as-a-judge is the practice of using a large language model to evaluate the outputs of other models (or of itself), standing in for a human annotator. Instead of scoring an answer with a fixed metric like BLEU or exact match, you hand the answer to a capable model such as GPT-4 and ask it to grade quality, pick the better of two responses, or check the work against a reference. The idea took off after Lianmin Zheng and colleagues published "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" at NeurIPS 2023, which showed that a strong judge model agrees with human preferences more than 80% of the time, about the rate at which two humans agree with each other. [1] That single result reframed how the field thinks about model evaluation, and within a couple of years LLM judges were running the leaderboards, labeling preference data for training, and supplying reward signals to reasoning models.
The appeal is mostly economic. Collecting human preference judgments at the scale modern development needs is slow and expensive. A judge model costs a fraction of a cent per comparison, runs in seconds, and never gets tired or distracted. It also handles open-ended generation, where there is no single correct string to match against, far better than n-gram overlap metrics ever could. The catch, and it is a real one, is that the judge is itself a fallible language model. It carries its own biases, it can be gamed, and it is weakest exactly where evaluation matters most: hard reasoning, subtle factual errors, and adversarial inputs. Most of the interesting work in this area is about measuring those failure modes and patching them.
Evaluating chat assistants is genuinely hard. The questions people ask are open-ended, the good answers are diverse, and quality depends on helpfulness, correctness, tone, and format all at once. Traditional automatic metrics were built for narrower tasks. BLEU and ROUGE compare a candidate against reference text and reward surface overlap, which punishes a correct answer phrased differently from the reference and rewards fluent nonsense that happens to share words. They correlate poorly with human judgment on generation tasks. [2]
Human evaluation is the gold standard, but it does not scale. Running a careful human study for every checkpoint during training, or for every prompt variant during prompt engineering, is not feasible. This is the gap LLM judges fill. They approximate human preference cheaply enough to use in a tight development loop, and they generalize across task types without per-task metric engineering. The bet is that a model good enough to produce strong answers is also good enough to recognize strong answers, and for many tasks that bet pays off.
There is a second motivation that has nothing to do with cost. A judge model can explain itself. Ask it for a verdict with reasoning and you get a critique you can read, audit, and use as a training signal. A scalar reward model gives you a number with no rationale; a generative judge gives you a number plus an argument. That interpretability is part of why the technique spread into reinforcement learning pipelines.
Zheng et al. distinguished three ways to set up an LLM judge, and the taxonomy has stuck. [1]
| Mode | What the judge sees | What it returns | Best for |
|---|---|---|---|
| Single-answer grading | One question, one answer | An absolute score, often 1 to 10 | Cheap large-scale scoring, no pairing needed |
| Pairwise comparison | One question, two answers (A and B) | A winner, or a tie | Building leaderboards, preference data |
| Reference-guided grading | Question, answer, and a reference solution | Score or verdict relative to the reference | Math and tasks with a known correct answer |
Single-answer grading is the simplest and cheapest. It does not require pairing responses, so it scales linearly, but absolute scores drift: a judge's notion of an "8 out of 10" is not stable across prompts or sessions, which makes raw scores hard to compare. Pairwise comparison sidesteps that by only asking which of two answers is better, a relative judgment that models tend to make more reliably. The downside is cost, since comparing N models head to head is quadratic (N(N-1)/2 pairings) unless you sample matchups. Reference-guided grading was Zheng et al.'s fix for math and reasoning, where judges otherwise fail badly: given a worked solution to compare against, the judge's error rate on the MT-bench math questions dropped sharply. [1]
Most practical systems also ask the judge to reason before it decides, a chain-of-thought style prompt, because a verdict produced after explicit reasoning tends to be better calibrated than a bare label.
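As a rough illustration of how a pairwise prompt with reasoning-before-verdict might be assembled and parsed, here is a minimal sketch. The template wording, the `[[A]]`/`[[B]]`/`[[C]]` verdict tags, and the function names are assumptions made for this example, not a prescribed format; passing a reference solution turns the same prompt into a reference-guided one.

```python
import re

# Hypothetical template; the [[A]]/[[B]]/[[C]] tags are an assumed convention.
PAIRWISE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question.\n"
    "Explain your reasoning first, then end with a verdict tag:\n"
    "[[A]] if answer A is better, [[B]] if answer B is better, [[C]] for a tie.\n\n"
    "Question:\n{question}\n\n"
    "Answer A:\n{answer_a}\n\n"
    "Answer B:\n{answer_b}\n"
)

VERDICT_RE = re.compile(r"\[\[([ABC])\]\]")


def build_pairwise_prompt(question, answer_a, answer_b, reference=None):
    """Fill the template; append a reference solution for reference-guided grading."""
    prompt = PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    if reference is not None:
        prompt += f"\nReference solution:\n{reference}\n"
    return prompt


def parse_verdict(judge_output):
    """Return 'A', 'B', 'tie', or None if the judge emitted no verdict tag."""
    tags = VERDICT_RE.findall(judge_output)
    if not tags:
        return None
    last = tags[-1]  # the reasoning may quote tags; the final one is the verdict
    return "tie" if last == "C" else last
```

Returning `None` for a missing tag, instead of guessing, lets the caller retry or discard malformed judgments explicitly.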
A large slice of the modern evaluation stack depends on LLM judges. The technique is not a research curiosity; it is load-bearing infrastructure.
MT-bench is the benchmark Zheng et al. shipped alongside the method. It consists of 80 multi-turn questions across eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge in STEM, and knowledge in humanities and social science. Each question has two turns, and a judge (originally GPT-4) scores the responses. [1] MT-bench was designed to test the judge as much as the models being judged.
AlpacaEval from Stanford is a single-turn instruction-following benchmark of 805 prompts. A judge model compares each candidate answer against a baseline (originally text-davinci-003, later GPT-4 outputs) and reports a win rate. [3] AlpacaEval became popular because it is fast and cheap, but it had a glaring verbosity problem: models could climb the leaderboard just by writing longer. The fix was Length-Controlled AlpacaEval, which fits a regression to estimate what the preference would have been if both answers were the same length. That single change raised the leaderboard's Spearman correlation with Chatbot Arena from 0.94 to 0.98 and made the metric much harder to game with verbosity. [4]
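The length-control idea can be sketched with a deliberately simplified model. This is not AlpacaEval's actual regression, which includes further terms; it is a two-parameter logistic fit, assumed here for illustration, in which `theta` stands for answer quality and `beta` for the pull of length. The length-controlled win rate is the predicted preference when the length difference is set to zero. All names and the toy data are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def length_controlled_win_rate(prefs, len_diffs, steps=2000, lr=0.5):
    """prefs[i] is 1 if the judge preferred the candidate, else 0.
    len_diffs[i] is candidate length minus baseline length."""
    n = len(prefs)
    mean = sum(len_diffs) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in len_diffs) / n) or 1.0
    feats = [math.tanh(d / std) for d in len_diffs]  # squashed length feature

    theta, beta = 0.0, 0.0  # theta: quality term, beta: length effect
    for _ in range(steps):  # plain gradient ascent on the log-likelihood
        g_theta = g_beta = 0.0
        for y, x in zip(prefs, feats):
            err = y - sigmoid(theta + beta * x)
            g_theta += err
            g_beta += err * x
        theta += lr * g_theta / n
        beta += lr * g_beta / n

    # Counterfactual: predicted preference when the length difference is zero.
    return sigmoid(theta)


# Toy data: the candidate is longer in 6 of 10 comparisons and mostly wins those.
len_diffs = [300] * 6 + [-300] * 4
prefs = [1, 1, 1, 1, 1, 0] + [1, 0, 0, 0]
raw = sum(prefs) / len(prefs)
lc = length_controlled_win_rate(prefs, len_diffs)
```

On this toy data the raw win rate is 0.6, and the length-controlled estimate comes out lower because part of the candidate's advantage is attributed to length.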
Arena-Hard-Auto, from the team behind Chatbot Arena, goes further. Its BenchBuilder pipeline mines roughly 200,000 real user queries from the Arena, scores them on seven quality indicators (specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, real-world application), clusters them, and keeps the 500 hardest prompts. A GPT-4-Turbo judge then compares each model against a GPT-4-0314 baseline. [5] The result is striking for an automatic benchmark: Arena-Hard-v0.1 reports 87.4% separability (its ability to confidently tell models apart) and 89.1% agreement with the human Chatbot Arena rankings, with confidence intervals about three times tighter than MT-bench, at roughly $25 per model. [5] Later judge configurations push agreement with Arena rankings toward the high nineties.
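Separability can be made concrete with a small sketch. It assumes the metric is the fraction of model pairs whose score confidence intervals do not overlap, and that the intervals have already been estimated (for example by bootstrapping); the model names and numbers below are invented.

```python
from itertools import combinations


def separability(intervals):
    """intervals maps model name -> (low, high) confidence interval of its score.
    Returns the fraction of model pairs with non-overlapping intervals."""
    pairs = list(combinations(intervals, 2))
    if not pairs:
        return 0.0
    separated = sum(
        1
        for a, b in pairs
        if intervals[a][1] < intervals[b][0] or intervals[b][1] < intervals[a][0]
    )
    return separated / len(pairs)


# Hypothetical intervals for three models.
example = {
    "model-1": (0.70, 0.76),
    "model-2": (0.60, 0.66),
    "model-3": (0.64, 0.71),
}
score = separability(example)  # only model-1 vs model-2 is cleanly separated
```

Tighter intervals raise separability directly, which is why a benchmark with less noisy per-model scores can distinguish more pairs at the same budget.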
Chatbot Arena itself is the human anchor all of these chase. It is a crowdsourced platform where users chat with two anonymous models, vote for the better one, and the votes feed an Elo-style rating. The automatic benchmarks above are, in effect, attempts to predict Arena rankings without waiting for tens of thousands of human votes.
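For readers unfamiliar with Elo-style ratings, the following is a minimal sketch of the standard online Elo update applied to pairwise votes. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not the Arena's actual configuration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, model_a, model_b, outcome, k=32, initial=1000.0):
    """outcome is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie."""
    r_a = ratings.get(model_a, initial)
    r_b = ratings.get(model_b, initial)
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (outcome - e_a)
    ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


# Hypothetical votes: (model shown as A, model shown as B, outcome for A).
votes = [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5)]
ratings = {}
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)
```

Each update moves the two ratings by equal and opposite amounts, so an upset against a higher-rated model shifts ratings more than an expected win.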
The headline number from Zheng et al. is that GPT-4 reaches over 80% agreement with human preferences, matching human-human agreement. [1] The details are worth knowing because they qualify that claim in useful ways.
On MT-bench, under a setup that excludes tie votes, GPT-4 pairwise judgments agreed with human votes about 85% of the time, while two humans agreed about 81 to 82% of the time. [1] When ties are allowed back in, agreement for both falls to the mid-60s, because ties are genuinely ambiguous and everyone disagrees about them. On Chatbot Arena data, GPT-4 reached roughly 87% agreement, the same as the human baseline. [1] So the "matches humans" claim is real, but it is measured on broad chat-quality preferences and it softens considerably on close calls.
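The effect of including or excluding ties is easy to see in a toy calculation. The sketch below assumes the no-tie setup drops any comparison where either voter chose a tie; the votes are made up for illustration.

```python
def agreement(judge_votes, human_votes, include_ties=True):
    """Fraction of comparisons where judge and human gave the same vote.
    Votes are 'A', 'B', or 'tie'."""
    pairs = list(zip(judge_votes, human_votes))
    if not include_ties:
        # Assumed no-tie setup: drop a comparison if either side voted tie.
        pairs = [(j, h) for j, h in pairs if j != "tie" and h != "tie"]
    if not pairs:
        return float("nan")
    return sum(j == h for j, h in pairs) / len(pairs)


# Hypothetical votes on six comparisons.
judge = ["A", "B", "tie", "A", "B", "A"]
human = ["A", "B", "A", "tie", "B", "B"]
with_ties = agreement(judge, human, include_ties=True)      # 3 of 6 match
without_ties = agreement(judge, human, include_ties=False)  # 3 of 4 match
```

The same set of votes yields 0.5 with ties and 0.75 without, so any agreement figure should be reported together with the tie policy behind it.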
The more sobering result is that agreement is not uniform across task types. Judges are strong on writing, roleplay, and general helpfulness, and weak on math and hard reasoning, where they confidently endorse wrong answers unless given a reference solution. [1] A judge that is 85% reliable on writing and much worse on math cannot be summarized by one overall reliability number; any single figure hides the split. This task dependence is the single most important caveat to keep in mind.
Weaker models make poor judges. GPT-3.5 in the original study showed far lower consistency and stronger biases than GPT-4. [1] The general pattern across the literature is that judge quality tracks model capability: the judge needs to be at least as capable as the models it grades, and ideally more so.
The reason LLM judges are a research topic and not a solved problem is that they fail in systematic, exploitable ways. These are not random errors; they are directional biases that show up reliably and that a motivated party can exploit to inflate scores.
| Bias | What it is | Evidence |
|---|---|---|
| Position bias | Preferring the answer in a particular slot (often the first) regardless of content | GPT-3.5 gave a consistent verdict after swapping order only 46.2% of the time; GPT-4 reached 65% [1] |
| Verbosity / length bias | Favoring longer answers even when length adds nothing | In a repetitive-list attack, GPT-3.5 and Claude-v1 were fooled 91.3% of the time; GPT-4 only 8.7% [1] |
| Self-preference / self-enhancement | Rating one's own outputs (or stylistically similar ones) higher than humans do | GPT-4 favored its own answers by about 10 percentage points of win rate, Claude-v1 by about 25 [1]; models can recognize their own generations and favor them [9] |
| Format / style bias | Rewarding markdown, lists, confident tone, and structure over substance | LLM judges penalized a sarcastic tone with 96% score loss but an incorrect answer with only 13% [8] |
| Sycophancy | Agreeing with assertions or flattering framings in the prompt | Documented across instruction-tuned models and inherited by judges [10] |
| Limited reasoning | Confidently grading wrong math and logic as correct | Judges fail MT-bench math without a reference solution [1] |
Position bias is the cleanest to demonstrate. Present the same two answers in opposite orders and a biased judge changes its mind, which means its verdict encodes slot position, not quality. Verbosity bias is the one that most directly corrupts leaderboards, because length is trivially easy to manufacture; the "repetitive list attack" in the original paper padded answers with redundant lists that added no information, and the weaker judges fell for it almost every time. [1]
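The swap test can be sketched as follows. The `judge` callable and its `'A'`/`'B'`/`'tie'` return values are assumptions for this example, and treating an inconsistent pair as a tie is one conservative convention, not the only possible one. The two stub judges exist only to show the behavior.

```python
def judge_both_orders(judge, question, answer_1, answer_2):
    """judge(question, a, b) returns 'A', 'B', or 'tie' for the slots it was shown.
    Returns (winner, consistent), where winner is 'answer_1', 'answer_2', or 'tie'."""
    forward = judge(question, answer_1, answer_2)
    backward = judge(question, answer_2, answer_1)
    # Translate slot labels back to the underlying answers.
    first = {"A": "answer_1", "B": "answer_2", "tie": "tie"}[forward]
    second = {"A": "answer_2", "B": "answer_1", "tie": "tie"}[backward]
    consistent = first == second
    return (first if consistent else "tie"), consistent


def consistency_rate(judge, items):
    """Share of (question, answer_1, answer_2) items with a swap-stable verdict."""
    flags = [judge_both_orders(judge, q, a1, a2)[1] for q, a1, a2 in items]
    return sum(flags) / len(flags)


# Stub judges for demonstration only.
def first_slot_judge(question, a, b):
    return "A"  # always prefers whatever is shown first


def longer_answer_judge(question, a, b):
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


items = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
biased_rate = consistency_rate(first_slot_judge, items)
stable_rate = consistency_rate(longer_answer_judge, items)
```

Note that the second stub is perfectly swap-consistent while being purely length-driven: passing the swap test rules out position bias but says nothing about the other biases in the table.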
Self-preference is the most worrying for fairness. If a lab uses its own flagship model to judge a leaderboard, that model may rank itself above competitors not because it is better but because it likes its own style. Wataoka and colleagues argued the mechanism is partly perplexity: judges rate low-perplexity text (text they find familiar and would have generated themselves) more highly than humans do, which produces self-preference as a side effect. [9] Panickssery and colleagues separately showed that models like GPT-4 can recognize their own outputs with non-trivial accuracy and that this self-recognition correlates with the size of the bias. [9]
The style-over-substance failure is arguably the most damning, because it inverts what we want. The "Style Outweighs Substance" study built SOS-Bench from 19 benchmarks and found that LLM-judge preferences correlate almost perfectly with stylistic features and only weakly with safety, world knowledge, and instruction-following. A response made sarcastic lost 96% of its score; a response made factually wrong lost only 13%. [8] A judge that punishes tone more than falsehood is optimizing for the wrong thing, and any training loop built on it will drift toward confident, well-formatted, possibly incorrect output.
None of these biases is fatal, and the field has accumulated a reasonable toolkit. The honest framing is that mitigations reduce bias rather than remove it, and most carry a cost in compute or complexity.
| Bias | Mitigation | How it helps |
|---|---|---|
| Position | Swap order, run both ways, average or require agreement | A verdict that survives a swap is not position-driven; disagreement signals a coin-flip case [1] |
| Verbosity / length | Length-controlled win rates, length-penalized prompts | Statistically removes the length advantage from the score [4] |
| Self-preference | Use a different model family as judge, or a panel | A judge cannot self-favor outputs it did not generate [6][9] |
| Reasoning failure | Reference-guided grading, chain-of-thought prompts | Gives the judge a correct solution to check against [1] |
| General miscalibration | Few-shot examples, explicit rubrics, score probabilities | Anchors the judge's scoring scale and evaluation steps [1][7] |
| Over-trust | Selective judging with human-agreement guarantees | Judge abstains and escalates when unsure [11] |
Position swapping is standard practice now: run each comparison in both orders and only count a clear win if the judge agrees with itself both times, otherwise call it a tie. Length control, as AlpacaEval demonstrated, can be done after the fact with a regression that conditions on equal length. [4] For reasoning, the reference-guided mode is the reliable fix, though it only works when you have a reference. For calibration, G-Eval introduced a useful trick: instead of taking the judge's single output score, read the probabilities it assigns to each rating token and compute a weighted average, which smooths out the coarse integer scores and improved correlation with humans on summarization to a Spearman of 0.514, well above prior metrics. [7]
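The probability-weighted scoring trick amounts to taking an expectation over rating tokens: the score is the sum of each rating multiplied by its probability. A minimal sketch follows, assuming the judge API exposes log-probabilities for the rating tokens and that renormalizing over just those tokens is acceptable; the input dictionary is hypothetical.

```python
import math


def expected_rating(token_logprobs, scale=(1, 2, 3, 4, 5)):
    """token_logprobs maps rating tokens such as '4' to log-probabilities.
    Returns the probability-weighted average rating over the given scale."""
    probs = {
        s: math.exp(token_logprobs[str(s)])
        for s in scale
        if str(s) in token_logprobs
    }
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no rating tokens found in token_logprobs")
    # Renormalize over rating tokens only (an assumption of this sketch).
    return sum(s * p for s, p in probs.items()) / total


# Hypothetical judge output: mass split between '3' and '4', a little on '5'.
logprobs = {"3": math.log(0.45), "4": math.log(0.45), "5": math.log(0.10)}
score = expected_rating(logprobs)
```

A judge that would emit a bare "3" or "4" almost at random here yields a score between the two, which is the smoothing effect the technique is after.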
The most principled line of work treats reliability as a guarantee rather than a hope. "Trust or Escalate" and similar selective-evaluation methods have the judge estimate its own confidence and abstain on low-confidence cases, escalating those to a stronger model or a human, which lets you bound the human-agreement rate of the whole pipeline. [11] That is the right mental model: a judge is not a metric you trust unconditionally, it is a metric whose error you should measure and manage.
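A selective-evaluation cascade can be sketched in a few lines. This is an illustration of the abstain-and-escalate control flow only: the stage names, the confidence values, and the thresholds are all hypothetical, and in practice the thresholds would need to be calibrated against human-labeled data to support any agreement guarantee.

```python
def selective_evaluate(item, stages, human):
    """stages is a list of (name, judge_fn, threshold), cheapest judge first.
    judge_fn(item) returns (verdict, confidence) with confidence in [0, 1].
    The first judge confident enough decides; otherwise a human does."""
    for name, judge_fn, threshold in stages:
        verdict, confidence = judge_fn(item)
        if confidence >= threshold:
            return verdict, name
    return human(item), "human"


# Stub judges and human for demonstration only.
def small_judge(item):
    return ("A", 0.95 if item == "easy" else 0.55)


def large_judge(item):
    return ("B", 0.60 if item == "ambiguous" else 0.90)


def human(item):
    return "tie"


stages = [("small", small_judge, 0.90), ("large", large_judge, 0.85)]
results = {x: selective_evaluate(x, stages, human) for x in ("easy", "hard", "ambiguous")}
```

Logging which stage decided each item also gives a running estimate of how much of the workload still needs the expensive judge or a human.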
One mitigation deserves its own section because it attacks several biases at once. Instead of a single large judge, use a panel of several smaller models and aggregate their votes. Verga and colleagues called this PoLL, a Panel of LLM evaluators. [6]
The argument is both statistical and economic. A panel drawn from disjoint model families cannot collectively self-favor, because no single family's style dominates, so intra-model bias largely cancels out. And a handful of small models can be cheaper than one frontier model while correlating better with human judgment. PoLL was evaluated across three settings (single-hop QA, multi-hop QA, and Chatbot Arena) over six datasets, and a panel of smaller models from different families correlated better with human judgments than a single GPT-4 judge while costing more than seven times less. [6] The intuition is the same one behind ensembles everywhere: diverse, independent errors average out, and the aggregate is steadier than any member.
The practical knobs are which models to include, how to aggregate (majority vote, average score, or max), and how to handle disagreement. Disagreement among panelists is itself a useful signal: a prompt the panel splits on is probably genuinely ambiguous and may deserve human review.
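The aggregation choices can be illustrated with a short sketch. The function names are hypothetical, and resolving a split at the top as a tie is an assumption of this example, not a rule from the PoLL paper.

```python
from collections import Counter
from statistics import mean


def panel_vote(verdicts):
    """Majority vote over panelist verdicts ('A', 'B', or 'tie').
    Returns (winner, unanimous); a split at the top resolves to 'tie'."""
    counts = Counter(verdicts)
    top_count = max(counts.values())
    leaders = [v for v, n in counts.items() if n == top_count]
    winner = leaders[0] if len(leaders) == 1 else "tie"
    unanimous = len(counts) == 1
    return winner, unanimous


def panel_score(scores, how="mean"):
    """Aggregate numeric panelist scores by average or max."""
    if how == "mean":
        return mean(scores)
    if how == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical three-judge panel.
winner, unanimous = panel_vote(["A", "A", "B"])
needs_review = not unanimous  # disagreement can be routed to a human
```

Keeping the `unanimous` flag alongside the winner makes it cheap to queue split decisions for human review.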
The most consequential use of LLM-as-a-judge is not benchmarking, it is training. The judge has become a source of reward.
In reinforcement learning from human feedback, the classic pipeline trains a scalar reward model on human preference pairs, then optimizes the policy against that reward. LLM-as-a-judge slots into this in two ways. First, it can generate the preference labels themselves, replacing or augmenting human annotators, which is the core idea behind RLAIF (reinforcement learning from AI feedback) and Anthropic's Constitutional AI, where a model critiques and ranks outputs according to written principles instead of human votes. [12] Second, the judge can serve directly as the reward function, with no separate scalar model in between.
This blurs the line between a judge and a reward model. A generative reward model is essentially an LLM-as-a-judge used as a reward signal: rather than emitting an opaque scalar, it reasons in text and then produces a preference, which can be more accurate and is far more interpretable. The GenRM line of work showed that training a model to produce judgments with reasoning traces can match Bradley-Terry-style reward models in-distribution and beat them out-of-distribution, while also beating a plain LLM judge. [13] Self-Rewarding Language Models pushed the idea to its limit: a model judges its own generated responses, uses those judgments to build preference data, and trains on it, then repeats, so the same model is both policy and judge across iterations. [14]
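The step where judge scores become preference data can be sketched as below. It assumes a simple policy of pairing the highest-scored response against the lowest-scored one and skipping prompts where the judge sees no difference; `score_fn` stands in for a judge call and is hypothetical.

```python
def build_preference_pair(prompt, responses, score_fn):
    """Score each candidate response with the judge and return a
    {'prompt', 'chosen', 'rejected'} record, or None if there is no signal."""
    scored = [(score_fn(prompt, r), r) for r in responses]
    best = max(scored, key=lambda t: t[0])
    worst = min(scored, key=lambda t: t[0])
    if best[0] == worst[0]:
        return None  # all candidates scored the same: no usable preference
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}


# Stub judge for demonstration only: scores by word count.
def stub_score(prompt, response):
    return len(response.split())


pair = build_preference_pair(
    "Explain recursion.",
    ["It calls itself.", "A function that calls itself on a smaller input."],
    stub_score,
)
```

The stub scorer here rewards length on purpose: whatever the judge favors, including its biases, is exactly what ends up labeled as "chosen" in the training data.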
In reasoning models, the judge often becomes a verifier. For math and code there is frequently a ground-truth check (a unit test, a numeric answer), and the cleaner signal comes from rule-based verification rather than a judge's opinion, which is why DeepSeek-R1 and similar systems lean on verifiable rewards for those domains. For open-ended tasks where no automatic verifier exists, a generative judge fills in. The two are complementary: verifiers are reliable but narrow, judges are broad but noisy.
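The division of labor between verifier and judge can be sketched as a small routing function. This is a toy under stated assumptions: the "last number in the response" extraction is a crude stand-in for a real answer parser, and `judge` is a hypothetical callable returning a score in [0, 1].

```python
import re


def extract_final_number(text):
    """Crude answer extraction: the last number in the text, or None.
    Commas are stripped so that '1,250' parses as 1250."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def reward(response, ground_truth=None, judge=None):
    """Use a rule-based check when a ground-truth answer exists,
    otherwise fall back to a judge score."""
    if ground_truth is not None:
        predicted = extract_final_number(response)
        correct = predicted is not None and abs(predicted - ground_truth) < 1e-6
        return 1.0 if correct else 0.0
    if judge is None:
        raise ValueError("no ground truth and no judge supplied")
    return judge(response)


# Hypothetical examples.
math_reward = reward("Adding the two parts, the answer is 42.", ground_truth=42)
essay_reward = reward("An open-ended essay...", judge=lambda r: 0.7)
```

The verifiable branch returns a binary signal that cannot be swayed by style, while the judge branch returns whatever the judge's biases allow, which is the reliability trade-off described above.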
Using a judge as reward inherits a specific danger: reward hacking. If the judge has a verbosity or format bias, the policy being trained will discover and exploit it, producing longer or more formatted outputs that the judge loves and humans do not. This is the bias catalog above turned into a failure of training rather than measurement, and it is a recurring reason teams keep humans in the loop and re-validate judge-driven gains against held-out human evaluation.
It is worth being blunt about the limits, because the convenience of LLM judges makes it easy to over-rely on them.
Do not trust a judge alone on hard reasoning, math, or anything with a checkable answer; use a real verifier or supply a reference. [1] Do not trust an absolute single-answer score as a stable, comparable number across prompts; prefer pairwise comparison or calibrate carefully. Do not let a model judge a leaderboard that includes itself or its own family without controlling for self-preference. [9] Do not assume bias mitigations have fully worked; confidence-estimation methods tend to overestimate human agreement even for strong judges, so the safe assumption is that residual bias remains. [11]
And do not mistake correlation with human preference for correlation with quality. The "Style Outweighs Substance" result is the cautionary tale: judge preferences can correlate almost perfectly with style and weakly with the things we actually care about, so a model can win on the judge's terms while getting worse on safety and factual accuracy. [8] The recurring recommendation across the survey literature is to treat the judge as one instrument among several, to report which judge and which prompt template were used (results are sensitive to both), and to periodically re-anchor against human evaluation rather than letting the judge run unchecked. [15][16]
None of this means the technique is unsound. It means LLM-as-a-judge is a measuring instrument with known systematic error, and the right way to use any such instrument is to characterize its error and correct for it, not to pretend it reads true.