LLM-as-a-judge

Updated Jun 23, 2026

LLM-as-a-judge is the practice of using a strong large language model to evaluate the outputs of other models, or of itself, in place of a human annotator. Instead of scoring an answer with a fixed metric like BLEU or exact match, you hand the answer to a capable model such as GPT-4 and ask it to grade quality, pick the better of two responses, or check the work against a reference. The technique was put on a quantitative footing by Lianmin Zheng and colleagues in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023), which reported that a strong judge such as GPT-4 reaches over 80% agreement with human preferences, "the same level of agreement between humans." ^[1] That single result reframed how the field thinks about model evaluation, and within a couple of years LLM judges were running the leaderboards, labeling preference data for training, and supplying reward signals to reasoning models through RLAIF and related pipelines.

The appeal is mostly economic. Collecting human preference judgments at the scale modern development needs is slow and expensive. A judge model costs a fraction of a cent per comparison, runs in seconds, and never gets tired or distracted. It also handles open-ended generation, where there is no single correct string to match against, far better than n-gram overlap metrics ever could. The catch, and it is a real one, is that the judge is itself a fallible language model. It carries its own biases, it can be gamed, and it is weakest exactly where evaluation matters most: hard reasoning, subtle factual errors, and adversarial inputs. Most of the interesting work in this area is about measuring those failure modes and patching them.

Why use a model to grade a model?

Evaluating chat assistants is genuinely hard. The questions people ask are open-ended, the good answers are diverse, and quality depends on helpfulness, correctness, tone, and format all at once. Traditional automatic metrics were built for narrower tasks. BLEU and ROUGE compare a candidate against reference text and reward surface overlap, which punishes a correct answer phrased differently from the reference and rewards fluent nonsense that happens to share words. They correlate poorly with human judgment on generation tasks. ^[2]

Human evaluation is the gold standard, but it does not scale. Running a careful human study for every checkpoint during training, or for every prompt variant during prompt engineering, is not feasible. This is the gap LLM judges fill. They approximate human preference cheaply enough to use in a tight development loop, and they generalize across task types without per-task metric engineering. The bet is that a model good enough to produce strong answers is also good enough to recognize strong answers, and for many tasks that bet pays off.

There is a second motivation that has nothing to do with cost. A judge model can explain itself. Ask it for a verdict with reasoning and you get a critique you can read, audit, and use as a training signal. A scalar reward model gives you a number with no rationale; a generative judge gives you a number plus an argument. That interpretability is part of why the technique spread into reinforcement learning pipelines.

What are the modes of LLM judging?

Zheng et al. distinguished three ways to set up an LLM judge, and the taxonomy has stuck. ^[1]

Mode	What the judge sees	What it returns	Best for
Single-answer grading	One question, one answer	An absolute score, often 1 to 10	Cheap large-scale scoring, no pairing needed
Pairwise comparison	One question, two answers (A and B)	A winner, or a tie	Building leaderboards, preference data
Reference-guided grading	Question, answer, and a reference solution	Score or verdict relative to the reference	Math and tasks with a known correct answer

Single-answer grading is the simplest and cheapest. It does not require pairing responses, so it scales linearly, but absolute scores drift: a judge's notion of an "8 out of 10" is not stable across prompts or sessions, which makes raw scores hard to compare. Pairwise comparison sidesteps that by only asking which of two answers is better, a relative judgment that models tend to make more reliably. The downside is cost: comparing N models head to head takes N(N-1)/2 matchups per prompt, which grows quadratically unless you sample matchups. Reference-guided grading was Zheng et al.'s fix for math and reasoning, where judges otherwise fail badly: given a worked solution to compare against, the judge's error rate on the MT-bench math questions dropped sharply. ^[1]

Most practical systems also ask the judge to reason before it decides, a chain-of-thought style prompt, because a verdict produced after explicit reasoning tends to be better calibrated than a bare label.
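
The three modes differ mainly in what goes into the prompt and what verdict format is requested. A minimal sketch in Python, assuming an illustrative template: the wording, the `[[...]]` verdict markers, and the names `build_judge_prompt` and `parse_verdict` are hypothetical and are not claimed to match the prompts used in the paper.

```python
import re
from typing import Optional


def build_judge_prompt(question: str, answer_a: str,
                       answer_b: Optional[str] = None,
                       reference: Optional[str] = None) -> str:
    """Assemble an illustrative judge prompt.

    - answer_b given   -> pairwise comparison (verdict A, B, or C for a tie)
    - answer_b omitted -> single-answer grading (score from 1 to 10)
    - reference given  -> the reference solution is added to either mode
    """
    parts = [f"[Question]\n{question}"]
    if reference is not None:
        parts.append(f"[Reference answer]\n{reference}")
    if answer_b is not None:
        parts.append(f"[Assistant A]\n{answer_a}")
        parts.append(f"[Assistant B]\n{answer_b}")
        task = ("Explain your reasoning first, then finish with exactly one "
                "of: [[A]], [[B]], or [[C]] for a tie.")
    else:
        parts.append(f"[Assistant answer]\n{answer_a}")
        task = ("Explain your reasoning first, then finish with a score "
                "from 1 to 10 in the form [[score]].")
    parts.append(task)
    return "\n\n".join(parts)


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return the last [[...]] verdict in the judge's reply, or None."""
    matches = re.findall(r"\[\[(A|B|C|\d+)\]\]", judge_output)
    return matches[-1] if matches else None
```

Asking for the reasoning before the verdict, and reading only the last marker, keeps the parser from picking up a label the judge mentions while it is still thinking.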

Which benchmarks and leaderboards run on judges?

A large slice of the modern evaluation stack depends on LLM judges. The technique is not a research curiosity; it is load-bearing infrastructure.

MT-bench is the benchmark Zheng et al. shipped alongside the method. It is 80 multi-turn questions across eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge in STEM, and knowledge in humanities and social science. Each question has two turns, and a judge (originally GPT-4) scores the responses. ^[1] MT-bench was designed specifically to test the judge as much as the models being judged.

AlpacaEval from Stanford is a single-turn instruction-following benchmark of 805 prompts. A judge model compares each candidate answer against a baseline (originally text-davinci-003, later GPT-4 outputs) and reports a win rate. ^[3] AlpacaEval became popular because it is fast and cheap, but it had a glaring verbosity problem: models could climb the leaderboard just by writing longer. The fix was Length-Controlled AlpacaEval, which fits a regression to estimate what the preference would have been if both answers were the same length. That single change raised the leaderboard's Spearman correlation with Chatbot Arena from 0.94 to 0.98 and made the metric much harder to game with verbosity. ^[4]
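
The idea of conditioning on equal length can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not the AlpacaEval implementation: it fits a two-parameter logistic regression of the judge's preference on a squashed length difference, then reads off the predicted win rate at a length difference of zero. The function name, the `tanh` squashing, the `scale` value, and the plain gradient-descent fit are all illustrative choices.

```python
import math
from typing import List, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(comparisons: List[Tuple[float, int]],
                               scale: float = 100.0,
                               lr: float = 0.5,
                               steps: int = 5000) -> float:
    """Estimate the win rate the candidate would have at equal length.

    comparisons: (length_diff, won) pairs, where length_diff is candidate
    length minus baseline length and won is 1 if the judge preferred the
    candidate, else 0.
    Model (toy assumption): P(win) = sigmoid(b0 + b1 * tanh(length_diff / scale)).
    """
    xs = [math.tanh(diff / scale) for diff, _ in comparisons]
    ys = [float(won) for _, won in comparisons]
    n = len(xs)
    b0 = b1 = 0.0
    for _ in range(steps):  # gradient descent on the mean log-loss
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = _sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    # With length_diff = 0 the length term vanishes, leaving only b0.
    return _sigmoid(b0)
```

On data where most wins come from comparisons in which the candidate was longer, the length-controlled estimate lands below the raw win rate, which is the direction of correction the benchmark is after.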

Arena-Hard-Auto, from the team behind Chatbot Arena, goes further. Its BenchBuilder pipeline mines roughly 200,000 real user queries from the Arena, scores them on seven quality indicators (specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, real-world application), clusters them, and keeps the 500 hardest prompts. A GPT-4-Turbo judge then compares each model against a GPT-4-0314 baseline. ^[5] The result is striking for an automatic benchmark: Arena-Hard-v0.1 reports 87.4% separability (its ability to confidently tell models apart) and 89.1% agreement with the human Chatbot Arena rankings, with confidence intervals about three times tighter than MT-bench, at roughly $25 per model. ^[5] Later judge configurations push agreement with Arena rankings toward the high nineties.
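
Separability can be made concrete with a small calculation. The sketch below assumes one common reading of the term, namely the fraction of model pairs whose score confidence intervals do not overlap; the text above only describes it as the ability to confidently tell models apart, so treat the exact definition, the function name, and the interval format as assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple


def separability(intervals: Dict[str, Tuple[float, float]]) -> float:
    """Fraction of model pairs with non-overlapping (low, high) intervals."""
    pairs = list(combinations(intervals, 2))
    if not pairs:
        raise ValueError("need at least two models")
    separated = 0
    for a, b in pairs:
        low_a, high_a = intervals[a]
        low_b, high_b = intervals[b]
        # Two intervals are separated when one ends before the other starts.
        if high_a < low_b or high_b < low_a:
            separated += 1
    return separated / len(pairs)
```

Under this reading, tighter confidence intervals raise separability directly, because fewer pairs of models overlap.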

Chatbot Arena itself is the human anchor all of these chase. It is a crowdsourced platform where users chat with two anonymous models, vote for the better one, and the votes feed an Elo-style rating. The automatic benchmarks above are, in effect, attempts to predict Arena rankings without waiting for tens of thousands of human votes.
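
To show how pairwise votes turn into a rating, here is a minimal sketch of the classic online Elo update. It illustrates the general "Elo-style" idea only; the K-factor, the starting rating, and the function names are assumptions, and the Arena's actual rating computation is not specified here.

```python
from typing import Dict


def expected_win(rating_a: float, rating_b: float) -> float:
    """Elo's predicted probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings: Dict[str, float], model_a: str, model_b: str,
                outcome: float, k: float = 32.0,
                initial: float = 1000.0) -> None:
    """Update ratings in place after one vote.

    outcome: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    r_a = ratings.get(model_a, initial)
    r_b = ratings.get(model_b, initial)
    e_a = expected_win(r_a, r_b)
    # The winner gains what the loser gives up, scaled by how surprising it was.
    ratings[model_a] = r_a + k * (outcome - e_a)
    ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
```

An upset moves ratings more than an expected result, which is why many votes are needed before the ordering settles.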

How well do judges agree with humans?

The headline number from Zheng et al. is that GPT-4 reaches over 80% agreement with human preferences, matching human-human agreement. ^[1] The details are worth knowing because they qualify that claim in useful ways.

On MT-bench, under a setup that excludes tie votes, GPT-4 pairwise judgments agreed with human votes about 85% of the time, while two humans agreed about 81 to 82% of the time. ^[1] When ties are allowed back in, agreement for both falls to the mid-60s, because ties are genuinely ambiguous and everyone disagrees about them. On Chatbot Arena data, GPT-4 reached roughly 87% agreement, the same as the human baseline. ^[1] So the "matches humans" claim is real, but it is measured on broad chat-quality preferences and it softens considerably on close calls.
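
The gap between the two setups is easy to reproduce on toy data. The sketch below computes agreement with and without ties; it assumes that "excluding ties" means dropping every comparison where either voter called a tie, and the label strings and function name are illustrative.

```python
from typing import List


def agreement_rate(judge_votes: List[str], human_votes: List[str],
                   include_ties: bool = True) -> float:
    """Share of comparisons where judge and human gave the same vote.

    Votes are "A", "B", or "tie". With include_ties=False, any comparison
    where either side voted "tie" is dropped before counting.
    """
    pairs = list(zip(judge_votes, human_votes))
    if not include_ties:
        pairs = [(j, h) for j, h in pairs if j != "tie" and h != "tie"]
    if not pairs:
        raise ValueError("no comparisons left to score")
    return sum(j == h for j, h in pairs) / len(pairs)
```

For example, with judge votes `["A", "B", "tie", "A", "B"]` and human votes `["A", "B", "A", "tie", "A"]`, agreement is 2 of 5 when ties count and 2 of 3 when they are dropped, so the same judge looks much better under the no-tie setup.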

The more sobering result is that agreement is not uniform across task types. Judges are strong on writing, roleplay, and general helpfulness, and weak on math and hard reasoning, where they confidently endorse wrong answers unless given a reference solution. ^[1] A judge that is 85% reliable on writing and much worse on math has no single overall reliability figure; any one number hides the split. This task dependence is the single most important caveat to keep in mind.

Weaker models make poor judges. GPT-3.5 in the original study showed far lower consistency and stronger biases than GPT-4. ^[1] The general pattern across the literature is that judge quality tracks model capability: the judge needs to be at least as capable as the models it grades, and ideally more so.

What biases do LLM judges have?

The reason LLM judges are a research topic and not a solved problem is that they fail in systematic, exploitable ways. These are not random errors; they are directional biases that show up reliably and that a motivated party can exploit to inflate scores.

Bias	What it is	Evidence
Position bias	Preferring the answer in a particular slot (often the first) regardless of content	GPT-3.5 gave a consistent verdict after swapping order only 46.2% of the time; GPT-4 reached 65% ^[1]
Verbosity / length bias	Favoring longer answers even when length adds nothing	In a repetitive-list attack, GPT-3.5 and Claude-v1 were fooled 91.3% of the time; GPT-4 only 8.7% ^[1]
Self-preference / self-enhancement	Rating one's own outputs (or stylistically similar ones) higher than humans do	GPT-4 favored its own answers by about 10 percentage points of win rate, Claude-v1 by about 25 ^[1]; models can recognize their own generations and favor them ^[9]
Format / style bias	Rewarding markdown, lists, confident tone, and structure over substance	LLM judges penalized a sarcastic tone with 96% score loss but an incorrect answer with only 13% ^[8]
Sycophancy	Agreeing with assertions or flattering framings in the prompt	Documented across instruction-tuned models and inherited by judges ^[10]
Limited reasoning	Confidently grading wrong math and logic as correct	Judges fail MT-bench math without a reference solution ^[1]

Position bias is the cleanest to demonstrate. Present the same two answers in opposite orders and a biased judge changes its mind, which means its verdict encodes slot position, not quality. Verbosity bias is the one that most directly corrupts leaderboards, because length is trivially easy to manufacture; the "repetitive list attack" in the original paper padded answers with redundant lists that added no information, and the weaker judges fell for it almost every time. ^[1]
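
The order-swap test can be written down in a few lines. In this sketch the judge is any callable that takes a question and two answers in display order and returns `"first"`, `"second"`, or `"tie"`; that interface, the function name, and the two stub judges are hypothetical and stand in for a real model call.

```python
from typing import Callable, List, Tuple

# (question, answer shown first, answer shown second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge,
                         items: List[Tuple[str, str, str]]) -> float:
    """Fraction of items where the judge names the same answer in both orders."""
    flip = {"first": "second", "second": "first", "tie": "tie"}
    consistent = 0
    for question, a, b in items:
        forward = judge(question, a, b)    # a is shown first
        backward = judge(question, b, a)   # b is shown first
        # Translate the swapped verdict back into the original ordering.
        if forward == flip[backward]:
            consistent += 1
    return consistent / len(items)


# Stub judges for illustration only.
def always_first(question: str, x: str, y: str) -> str:
    return "first"


def prefers_longer(question: str, x: str, y: str) -> str:
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"
```

A judge driven purely by slot position scores 0.0 here, while one driven purely by content scores 1.0, even if the content feature it uses (length, in the second stub) is itself a bias.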

Self-preference is the most worrying for fairness. If a lab uses its own flagship model to judge a leaderboard, that model may rank itself above competitors not because it is better but because it likes its own style. Wataoka and colleagues argued the mechanism is partly perplexity: judges rate low-perplexity text (text they find familiar and would have generated themselves) more highly than humans do, which produces self-preference as a side effect. ^[9] Panickssery and colleagues separately showed that models like GPT-4 can recognize their own outputs with non-trivial accuracy and that this self-recognition correlates with the size of the bias. ^[9]

The style-over-substance failure is arguably the most damning, because it inverts what we want. The "Style Outweighs Substance" study built SOS-Bench from 19 benchmarks and found that LLM-judge preferences correlate almost perfectly with stylistic features and only weakly with safety, world knowledge, and instruction-following. A response made sarcastic lost 96% of its score; a response made factually wrong lost only 13%. ^[8] A judge that punishes tone more than falsehood is optimizing for the wrong thing, and any training loop built on it will drift toward confident, well-formatted, possibly incorrect output.

How do you mitigate judge bias?

None of these biases is fatal, and the field has accumulated a reasonable toolkit. The honest framing is that mitigations reduce bias rather than remove it, and most carry a cost in compute or complexity.

Bias	Mitigation	How it helps
Position	Swap order, run both ways, average or require agreement	A verdict that survives a swap is not position-driven; disagreement signals a coin-flip case ^[1]
Verbosity / length	Length-controlled win rates, length-penalized prompts	Statistically removes the length advantage from the score ^[4]
Self-preference	Use a different model family as judge, or a panel	A judge cannot self-favor outputs it did not generate ^[6]^[9]
Reasoning failure	Reference-guided grading, chain-of-thought prompts	Gives the judge a correct solution to check against ^[1]
General miscalibration	Few-shot examples, explicit rubrics, score probabilities	Anchors the judge's scale and evaluation steps ^[1]^[7]
Over-trust	Selective judging with human-agreement guarantees	Judge abstains and escalates when unsure ^[11]

Position swapping is standard practice now: run each comparison in both orders and only count a clear win if the judge agrees with itself both times, otherwise call it a tie. Length control, as AlpacaEval demonstrated, can be done after the fact with a regression that conditions on equal length. ^[4] For reasoning, the reference-guided mode is the reliable fix, though it only works when you have a reference. For calibration, G-Eval introduced a useful trick: instead of taking the judge's single output score, read the probabilities it assigns to each rating token and compute a weighted average, which smooths out the coarse integer scores and improved correlation with humans on summarization to a Spearman of 0.514, well above prior metrics. ^[7]
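
Two of these mitigations are short enough to sketch directly. The code below assumes a hypothetical judge callable that returns `"first"`, `"second"`, or `"tie"` for answers in display order, and assumes the rating-token probabilities have already been extracted from the judge's output distribution; the function names are illustrative, and the normalization step is an added assumption for the case where the probabilities do not sum to one.

```python
from typing import Callable, Dict

Judge = Callable[[str, str, str], str]  # -> "first" | "second" | "tie"


def swap_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if both orders agree."""
    forward = judge(question, a, b)    # A shown first
    backward = judge(question, b, a)   # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def probability_weighted_score(rating_probs: Dict[int, float]) -> float:
    """Average the ratings, weighting each by the probability of its token."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("rating probabilities must have positive mass")
    return sum(r * p for r, p in rating_probs.items()) / total
```

For a judge that puts probability 0.1, 0.2, 0.5, and 0.2 on ratings 2, 3, 4, and 5, the weighted score is 3.8, where a single sampled output would have reported a flat 4.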

The most principled line of work treats reliability as a guarantee rather than a hope. "Trust or Escalate" and similar selective-evaluation methods have the judge estimate its own confidence and abstain on low-confidence cases, escalating those to a stronger model or a human, which lets you bound the human-agreement rate of the whole pipeline. ^[11] That is the right mental model: a judge is not a metric you trust unconditionally, it is a metric whose error you should measure and manage.
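
The abstain-and-escalate pattern can be sketched as follows. This is a simplified illustration and carries no statistical guarantee: it picks a confidence threshold from the empirical agreement rate on a calibration set, whereas a method with provable guarantees would need a proper statistical test at this step. The function names and the calibration format are assumptions.

```python
from typing import List, Optional, Tuple


def pick_threshold(calibration: List[Tuple[float, bool]],
                   target: float) -> Optional[float]:
    """Lowest confidence cutoff whose accepted cases meet the target agreement.

    calibration: (judge confidence, judge agreed with human) pairs.
    Returns None if no cutoff reaches the target.
    """
    for cutoff in sorted({conf for conf, _ in calibration}):
        accepted = [agreed for conf, agreed in calibration if conf >= cutoff]
        if sum(accepted) / len(accepted) >= target:
            return cutoff
    return None


def judge_or_escalate(verdict: str, confidence: float,
                      threshold: float) -> Tuple[Optional[str], str]:
    """Keep the judge's verdict when it is confident enough, else hand off."""
    if confidence >= threshold:
        return verdict, "judge"
    return None, "escalate"  # route to a stronger model or a human
```

Choosing the lowest qualifying cutoff keeps as many cases as possible with the cheap judge while holding the accepted set to the target agreement on the calibration data.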

What is a panel of LLM judges (jury)?

One mitigation deserves its own section because it attacks several biases at once. Instead of a single large judge, use a panel of several smaller models and aggregate their votes. Verga and colleagues called this PoLL, a Panel of LLM evaluators. ^[6]

The argument is both statistical and economic. A panel drawn from disjoint model families cannot collectively self-favor, because no single family's style dominates, so intra-model bias largely cancels out. And a handful of small models can be cheaper than one frontier model while correlating better with human judgment. PoLL was evaluated across three settings (single-hop QA, multi-hop QA, and Chatbot Arena) over six datasets, and a panel of smaller models from different families correlated better with human judgments than a single GPT-4 judge while costing more than seven times less. ^[6] The intuition is the same one behind ensembles everywhere: diverse, independent errors average out, and the aggregate is steadier than any member.

The practical knobs are which models to include, how to aggregate (majority vote, average score, or max), and how to handle disagreement. Disagreement among panelists is itself a useful signal: a prompt the panel splits on is probably genuinely ambiguous and may deserve human review.
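
Majority voting with a disagreement signal is simple to express. The sketch below assumes each panelist has already returned a label such as `"A"`, `"B"`, or `"tie"`; the function name and the `"split"` sentinel for a deadlocked panel are illustrative choices, not part of PoLL.

```python
from collections import Counter
from typing import List, Tuple


def panel_verdict(votes: List[str]) -> Tuple[str, float]:
    """Return (majority label or "split", share of votes for the top label)."""
    if not votes:
        raise ValueError("panel returned no votes")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    # A tie for first place means the panel has no majority.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "split", agreement
    return top_label, agreement
```

A low agreement share, or a `"split"` result, is the cue to send the prompt for human review instead of trusting the aggregate.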

How are judges used as reward and verifier signals?

The most consequential use of LLM-as-a-judge is not benchmarking, it is training. The judge has become a source of reward.

In reinforcement learning from human feedback, the classic pipeline trains a scalar reward model on human preference pairs, then optimizes the policy against that reward. LLM-as-a-judge slots into this in two ways. First, it can generate the preference labels themselves, replacing or augmenting human annotators, which is the core idea behind RLAIF (reinforcement learning from AI feedback) and Anthropic's Constitutional AI, where a model critiques and ranks outputs according to written principles instead of human votes. ^[12] Second, the judge can serve directly as the reward function, with no separate scalar model in between.

The evidence that AI-generated labels can substitute for human ones is now fairly strong. In the RLAIF study by Lee and colleagues, models trained on AI feedback were preferred over a supervised baseline 71% of the time on summarization, statistically indistinguishable from the 73% won by human-feedback training, and on harmless dialogue RLAIF reached an 88% harmless rate versus 76% for RLHF and 64% for the supervised baseline. ^[17] When RLAIF and RLHF were compared head to head, neither won at a rate distinguishable from 50%. ^[17] In other words, a capable judge labeling preferences can match human labelers on these tasks, which is the practical license behind much of the field's move to AI feedback.

This blurs the line between a judge and a reward model. A generative reward model is essentially an LLM-as-a-judge used as a reward signal: rather than emitting an opaque scalar, it reasons in text and then produces a preference, which can be more accurate and is far more interpretable. The GenRM line of work showed that training a model to produce judgments with reasoning traces can match Bradley-Terry-style reward models in-distribution and beat them out-of-distribution, while also beating a plain LLM judge. ^[13] Self-Rewarding Language Models pushed the idea to its limit: a model judges its own generated responses, uses those judgments to build preference data, and trains on it, then repeats, so the same model is both policy and judge across iterations. ^[14]

In reasoning models, the judge often becomes a verifier. For math and code there is frequently a ground-truth check (a unit test, a numeric answer), and the cleaner signal comes from rule-based verification rather than a judge's opinion, which is why DeepSeek-R1 and similar systems lean on verifiable rewards for those domains. For open-ended tasks where no automatic verifier exists, a generative judge fills in. The two are complementary: verifiers are reliable but narrow, judges are broad but noisy.
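
The division of labor between verifier and judge can be sketched as a single reward function. This is an illustration under stated assumptions, not any particular system's reward code: the numeric check simply compares the last number in the response to a known answer, and `judge` is a hypothetical callable that returns a score for open-ended responses.

```python
import re
from typing import Callable, Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands separators."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def reward(response: str,
           ground_truth: Optional[float] = None,
           judge: Optional[Callable[[str], float]] = None) -> float:
    """Use a rule-based check when an answer key exists, else ask the judge."""
    if ground_truth is not None:
        predicted = extract_final_number(response)
        correct = predicted is not None and abs(predicted - ground_truth) < 1e-6
        return 1.0 if correct else 0.0
    if judge is None:
        raise ValueError("no ground truth and no judge supplied")
    return judge(response)
```

The verifier branch gives a binary signal with no opinion involved, while the judge branch returns whatever graded score the judge produces, with all the biases that come with it.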

Using a judge as reward inherits a specific danger: reward hacking. If the judge has a verbosity or format bias, the policy being trained will discover and exploit it, producing longer or more formatted outputs that the judge loves and humans do not. This is the bias catalog above turned into a failure of training rather than measurement, and it is a recurring reason teams keep humans in the loop and re-validate judge-driven gains against held-out human evaluation.

When should you not trust an LLM judge?

It is worth being blunt about the limits, because the convenience of LLM judges makes it easy to over-rely on them.

Do not trust a judge alone on hard reasoning, math, or anything with a checkable answer; use a real verifier or supply a reference. ^[1] Do not trust an absolute single-answer score as a stable, comparable number across prompts; prefer pairwise comparison or calibrate carefully. Do not let a model judge a leaderboard that includes itself or its own family without controlling for self-preference. ^[9] Do not assume bias mitigations have fully worked; confidence-estimation methods tend to overestimate human agreement even for strong judges, so the safe assumption is that residual bias remains. ^[11]

And do not mistake correlation with human preference for correlation with quality. The "Style Outweighs Substance" result is the cautionary tale: judge preferences can correlate almost perfectly with style and weakly with the things we actually care about, so a model can win on the judge's terms while getting worse on safety and factual accuracy. ^[8] The recurring recommendation across the survey literature is to treat the judge as one instrument among several, to report which judge and which prompt template were used (results are sensitive to both), and to periodically re-anchor against human evaluation rather than letting the judge run unchecked. ^[15]^[16]

None of this means the technique is unsound. It means LLM-as-a-judge is a measuring instrument with known systematic error, and the right way to use any such instrument is to characterize its error and correct for it, not to pretend it reads true.

References

1. Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. https://arxiv.org/abs/2306.05685
2. Liu, Chia-Wei; Lowe, Ryan; Serban, Iulian; et al. "How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation." EMNLP, 2016. https://arxiv.org/abs/1603.08023
3. Li, Xuechen; Zhang, Tianyi; Dubois, Yann; et al. "AlpacaEval: An Automatic Evaluator of Instruction-following Models." GitHub repository, Stanford, 2023. https://github.com/tatsu-lab/alpaca_eval
4. Dubois, Yann; Galambosi, Balazs; Liang, Percy; Hashimoto, Tatsunori B. "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators." arXiv preprint, 2024. https://arxiv.org/abs/2404.04475
5. Li, Tianle; Chiang, Wei-Lin; Frick, Evan; et al. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline." arXiv preprint, 2024. https://arxiv.org/abs/2406.11939
6. Verga, Pat; Hofstatter, Sebastian; Althammer, Sophia; et al. "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models." arXiv preprint, 2024. https://arxiv.org/abs/2404.18796
7. Liu, Yang; Iter, Dan; Xu, Yichong; Wang, Shuohang; Xu, Ruochen; Zhu, Chenguang. "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP, 2023. https://aclanthology.org/2023.emnlp-main.153/
8. Feuer, Benjamin; Goldblum, Micah; Datta, Teresa; et al. "Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking." arXiv preprint, 2024. https://arxiv.org/abs/2409.15268
9. Panickssery, Arjun; Bowman, Samuel R.; Feng, Shi. "LLM Evaluators Recognize and Favor Their Own Generations." NeurIPS, 2024. https://arxiv.org/abs/2404.13076
10. Sharma, Mrinank; Tong, Meg; Korbak, Tomasz; et al. "Towards Understanding Sycophancy in Language Models." ICLR, 2024. https://arxiv.org/abs/2310.13548
11. Jung, Jaehun; Brahman, Faeze; Choi, Yejin. "Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement." arXiv preprint, 2024. https://arxiv.org/abs/2407.18370
12. Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint, Anthropic, 2022. https://arxiv.org/abs/2212.08073
13. Mahan, Dakota; Van Phung, Duy; Rafailov, Rafael; et al. "Generative Reward Models." arXiv preprint, 2024. https://arxiv.org/abs/2410.12832
14. Yuan, Weizhe; Pang, Richard Yuanzhe; Cho, Kyunghyun; et al. "Self-Rewarding Language Models." ICML, 2024. https://arxiv.org/abs/2401.10020
15. Li, Haitao; Dong, Qian; Chen, Junjie; et al. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods." arXiv preprint, 2024. https://arxiv.org/abs/2412.05579
16. Gu, Jiawei; Jiang, Xuhui; Shi, Zhichao; et al. "A Survey on LLM-as-a-Judge." arXiv preprint, 2024. https://arxiv.org/abs/2411.15594
17. Lee, Harrison; Phatale, Samrat; Mansoor, Hassan; et al. "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." International Conference on Machine Learning (ICML), 2024. https://arxiv.org/abs/2309.00267
