MT-Bench (Multi-Turn Benchmark) is a benchmark for evaluating large language models (LLMs) on their ability to handle multi-turn conversations and follow complex instructions. Introduced in June 2023 by researchers from LMSYS (Large Model Systems Organization) and affiliated universities, MT-Bench consists of 80 carefully crafted multi-turn questions spanning eight categories. The benchmark is best known for formalizing the LLM-as-a-Judge paradigm, in which a strong language model such as GPT-4 serves as an automated evaluator, scoring responses on a scale of 1 to 10.
The accompanying paper, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," was published at the NeurIPS 2023 Datasets and Benchmarks Track. It has since become one of the most cited papers in the LLM evaluation literature, accumulating over 6,000 citations by early 2026 according to Semantic Scholar. MT-Bench and its associated methodology have shaped how the AI research community measures and compares the quality of conversational AI systems.
As LLM-based chat assistants grew more capable throughout 2022 and 2023, existing evaluation methods struggled to keep pace. Traditional natural language processing benchmarks such as MMLU, HellaSwag, and TruthfulQA rely on multiple-choice or short-answer formats. While useful for measuring factual knowledge and reasoning on closed-ended tasks, these benchmarks fail to capture the open-ended, conversational qualities that users value in chat assistants, including coherence across multiple turns, creativity, nuance in instruction following, and the ability to handle follow-up requests that modify or build on earlier context.
Human evaluation remains the gold standard for measuring these qualities, but it is slow, expensive, and difficult to scale. Collecting reliable pairwise preference judgments from human annotators can cost thousands of dollars and take weeks to complete, making it impractical for the rapid iteration cycles of modern LLM development. The research team engaged 58 expert-level human labelers to produce ground-truth annotations for validation purposes, underscoring the resource-intensive nature of human evaluation. The LMSYS team set out to address this gap by developing both a targeted benchmark (MT-Bench) and an automated evaluation methodology (LLM-as-a-Judge) that could approximate human preferences at a fraction of the cost.
Another motivation was the need to evaluate models on multi-turn interactions specifically. Most existing benchmarks at the time tested models on isolated, single-turn prompts. Real-world usage of chat assistants, however, involves extended conversations where users ask follow-up questions, refine their requests, or challenge the model's previous answers. The ability to maintain coherence and contextual awareness across turns is a distinct capability that single-turn benchmarks cannot measure.
The paper was authored by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. The research team drew members from several institutions, including UC Berkeley, Stanford, UC San Diego, Carnegie Mellon University, and MBZUAI.
Many of the same researchers were responsible for building Vicuna, the FastChat platform, and Chatbot Arena, all of which are closely related to MT-Bench. LMSYS itself originated as a multi-university research collaboration in 2023 and was later incorporated as a 501(c)(3) non-profit in September 2024, with a mission to make large AI models accessible through open-source development, datasets, and evaluation tools.
MT-Bench contains 80 multi-turn questions, each consisting of exactly two conversational turns. The first turn presents an initial prompt, and the second turn introduces a follow-up that tests the model's ability to build on its previous response. Follow-up turns are designed to be challenging: they may ask the model to refine, extend, contradict, or reformat its first answer. This two-turn structure tests not only the quality of individual responses but also the model's capacity for contextual continuity and instruction compliance across turns.
All 80 questions were manually designed by the research team. The questions intentionally target areas where weaker models tend to break down, including complex reasoning chains, mathematical problem solving, code generation with constraints, and tasks that require maintaining a specific persona or format.
The 80 questions are evenly distributed across eight categories, with 10 questions per category:
| Category | Description | Example Task Types |
|---|---|---|
| Writing | Creative and structured text generation | Essays, emails, letters, persuasive writing |
| Roleplay | Maintaining a character or persona | Acting as a historical figure, fictional character, or professional |
| Extraction | Pulling structured information from text | Identifying key facts, summarizing, reformatting data |
| Reasoning | Logical and commonsense reasoning | Deductive puzzles, hypothetical scenarios, argument analysis |
| Math | Mathematical problem solving | Arithmetic, algebra, word problems, proofs |
| Coding | Programming tasks and code analysis | Writing functions, debugging, explaining code, algorithm design |
| STEM (Knowledge I) | Science, technology, engineering knowledge | Physics concepts, biology questions, engineering principles |
| Humanities (Knowledge II) | Humanities and social science knowledge | History, philosophy, economics, social science questions |
This category design ensures that MT-Bench tests a broad range of capabilities rather than focusing narrowly on a single skill. Pairing knowledge-oriented categories (STEM, Humanities) with skill-oriented ones (Coding, Math, Reasoning) yields a balanced assessment profile, and splitting knowledge into two domains helps reveal whether a model's coverage is uneven across them.
The following examples, drawn from the publicly available question set in the FastChat repository, illustrate how each category pairs an initial prompt with a challenging follow-up.
Writing (Question 81). The first turn asks: "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." The second turn then changes the constraint entirely: "Rewrite your previous response. Start every sentence with the letter A." This forces the model to recall its own output and restructure it under a strict formatting rule.
Roleplay (Question 91). The first turn instructs: "Pretend yourself to be Elon Musk in all the following conversations. Speak like Elon Musk as much as possible. Why do we need to go to Mars?" The follow-up shifts topic while maintaining the persona: "How do you like dancing? Can you teach me?" The model must stay in character even when the topic moves away from the persona's typical domain.
Extraction (Question 131). The first turn provides three movie reviews and asks: "Evaluate the following movie reviews on a scale of 1 to 5... Return the answer as a JSON array of integers." The second turn adds a new requirement: "Update your previous reply by including the release date as part of the JSON content." This tests both structured output generation and the ability to incrementally modify a previous response.
Reasoning (Question 101). The first turn poses a classic logic puzzle: "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position?" The second turn modifies the premise: "If the 'second person' is changed to 'last person' in the above question, what would the answer be?" Many models fail on the second turn because the modified question requires careful re-analysis rather than simple substitution.
Math (Question 111). The first turn asks: "The vertices of a triangle are at points (0, 0), (-1, 1), and (3, 3). What is the area of the triangle?" The second turn builds on the geometry: "What's the area of the circle circumscribing the triangle?" Solving the follow-up requires using the first answer as an intermediate step.
Coding (Question 121). The first turn requests: "Develop a Python program that reads all the text files under a directory and returns top-5 words with the most number of occurrences." The follow-up adds a performance constraint: "Can you parallelize it?" The model must modify its own code while preserving correctness.
STEM (Question 143). The first turn asks: "Photosynthesis is a vital process for life on Earth. Could you outline the two main stages of photosynthesis, including where they take place within the chloroplast?" The second turn demands quantitative reasoning: "How much energy can a tree produce through photosynthesis in its lifetime? Please provide an estimate using actual numerical values."
Humanities (Question 151). The first turn asks: "Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators." The second turn tests communication flexibility: "Now, explain them again like I'm five."
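The Math follow-up above (Question 111) can be checked numerically. A minimal sketch using the shoelace formula and the standard circumcenter construction — the formulas are textbook coordinate geometry, not drawn from the benchmark itself:

```python
import math

def triangle_area(p1, p2, p3):
    """Shoelace formula for the area of a triangle."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

def circumcircle_area(p1, p2, p3):
    """Area of the circle through all three vertices.

    The circumcenter (ux, uy) is equidistant from the vertices; the closed
    form below comes from solving the two perpendicular-bisector equations.
    """
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r_sq = (ax - ux) ** 2 + (ay - uy) ** 2  # squared circumradius
    return math.pi * r_sq

vertices = [(0, 0), (-1, 1), (3, 3)]
print(triangle_area(*vertices))      # 3.0
print(circumcircle_area(*vertices))  # 5*pi, about 15.708
```

The circumcenter works out to (1, 2) with squared radius 5, so the follow-up answer is 5π — exactly the kind of multi-step dependence on the first turn that the category is designed to probe.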
In the default single-answer grading mode, the LLM judge evaluates each model response independently on a scale of 1 to 10, where 1 indicates a completely unhelpful or incorrect response and 10 indicates a near-perfect response. The judge provides a score for each of the two turns separately. A model's overall MT-Bench score is the average across all 160 individual turn scores (80 questions multiplied by 2 turns).
The scoring prompt instructs the judge to consider several quality dimensions, including helpfulness, relevance, accuracy, depth, creativity, and level of detail. For certain categories such as Math and Coding, the system can optionally supply a reference answer to guide the judge's evaluation (reference-guided grading). The judge also provides a written explanation for each score, making the evaluation process interpretable and allowing researchers to audit individual judgments.
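Parsing these verdicts and aggregating them per turn might look like the following sketch. The "Rating: [[X]]" verdict format follows the convention used by the FastChat judge prompts; `extract_rating` and the sample scores are illustrative, not FastChat's actual parser.

```python
import re
from statistics import mean

def extract_rating(judgment: str):
    """Pull the 1-10 score out of a judge verdict like 'Rating: [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None

# Hypothetical per-question scores for one model: (turn 1, turn 2) pairs.
scores = [
    (extract_rating("Rating: [[9]]"), extract_rating("Rating: [[8]]")),
    (extract_rating("Rating: [[7]]"), extract_rating("Rating: [[6]]")),
]

turn1_avg = mean(s[0] for s in scores)              # first-turn average
turn2_avg = mean(s[1] for s in scores)              # second-turn average
overall = mean(v for pair in scores for v in pair)  # average over all turns
print(turn1_avg, turn2_avg, overall)                # 8.0 7.0 7.5
```

With the full benchmark, `scores` would hold 80 pairs and `overall` would be the average over all 160 turn scores described above.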
The most influential contribution of the MT-Bench paper is the systematic study and validation of using strong LLMs as automated judges for evaluating other LLMs. The paper examines three distinct judging approaches:
In pairwise comparison mode, the judge model receives a question along with two candidate responses (from different models) and must determine which response is better, or declare a tie. This approach mirrors the format used in Chatbot Arena, where human users compare anonymous model outputs side by side. Pairwise comparison tends to produce higher-quality judgments because the judge can directly contrast the two responses, but it scales quadratically with the number of models being evaluated.
In single-answer grading, the judge model receives a single question-response pair and assigns a numerical score on a 1 to 10 scale. This is the recommended default mode for MT-Bench because it is simpler, cheaper (requiring only one judge call per response rather than pairwise comparisons), and produces scores that are easy to aggregate and compare across models. Single-answer grading scales linearly with the number of models, making it practical for evaluating large numbers of candidates.
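The scaling difference is easy to quantify. A back-of-the-envelope sketch, assuming one judge call per turn and a full round-robin for pairwise mode:

```python
# MT-Bench has 80 questions x 2 turns = 160 judged turns per model.
def single_answer_calls(n_models: int, turns: int = 160) -> int:
    """One judge call per turn per model: linear in the number of models."""
    return n_models * turns

def pairwise_calls(n_models: int, turns: int = 160) -> int:
    """One judge call per turn per model pair: quadratic in the number
    of models (before any position-swapped re-judging)."""
    return n_models * (n_models - 1) // 2 * turns

for n in (5, 10, 30):
    print(n, single_answer_calls(n), pairwise_calls(n))
# 5 models:  800 vs 1,600 calls; 30 models: 4,800 vs 69,600 calls
```

The gap widens quickly, which is why single-answer grading became the default for leaderboard-scale evaluation.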
In reference-guided grading, used for questions with objectively correct answers (particularly in Math and Coding), a reference solution is provided to the judge alongside the model's response. This helps the judge assess correctness more accurately, addressing a known weakness of LLM judges in verifying mathematical and logical reasoning. The paper tested Math question grading and found that the default prompt had a 70% failure rate (14 out of 20 incorrect judgments), chain-of-thought prompting reduced this to 30% (6 out of 20), and reference-guided grading further reduced it to 15% (3 out of 20). The improvement was most pronounced on Math questions; other categories where answers are more subjective showed smaller gains.
The research team conducted extensive validation of the LLM-as-a-Judge approach using both expert annotators and crowdsourced evaluations. Key findings from the paper's validation experiments include:
MT-Bench validation (Setup S2, second-turn comparisons):
| Judge Pair | Agreement Rate | Number of Votes |
|---|---|---|
| GPT-4 pairwise vs. human experts | 85% | 864 |
| GPT-4 single-answer vs. human experts | 84% | 776 |
| Human-to-human inter-rater agreement | 82% | 474 |
Chatbot Arena validation (Setup S2):
| Judge Pair | Agreement Rate | Number of Votes |
|---|---|---|
| GPT-4 pairwise vs. human | 95% | 1,967 |
| GPT-4 single-answer vs. human | 85% | 1,761 |
| Human-to-human agreement | 87% | 1,944 |
The central finding is that GPT-4 as a judge achieves agreement with human experts at a level that matches or slightly exceeds the agreement rate among human annotators themselves. This result validated the idea that strong LLMs can serve as practical, scalable substitutes for human evaluation in many settings.
The research team also collected 3,000 expert votes and made 30,000 conversations with human preferences publicly available through their GitHub repository to support reproducibility and further research. In a supplementary analysis, when humans disagreed with GPT-4's judgments, they deemed the GPT-4 judgments reasonable in 75% of cases and actually changed their own votes in 34% of those disagreements.
The paper explored whether few-shot examples could improve GPT-4's consistency as a judge. By providing a small number of labeled examples in the judge prompt, GPT-4's position-bias consistency improved from 65.0% in the zero-shot setting to 77.5% with few-shot prompting. This finding suggests that judge quality can be further enhanced through careful prompt engineering.
The original MT-Bench evaluation tested a range of proprietary and open-source models. The following table shows the scores reported in the LMSYS leaderboard announcement (June 2023), with GPT-4 grading as the judge:
| Model | MT-Bench Score | MMLU (5-shot) |
|---|---|---|
| GPT-4 | 8.99 | 86.4 |
| GPT-3.5-turbo | 7.94 | 70.0 |
| Claude-v1 | 7.90 | - |
| Claude-instant-v1 | 7.85 | - |
| Vicuna-33B | 7.12 | - |
| WizardLM-30B | 7.01 | - |
| Guanaco-33B | 6.53 | - |
| Tulu-30B | 6.43 | - |
| Guanaco-65B | 6.41 | - |
| OpenAssistant-LLaMA-30B | 6.41 | - |
| PaLM-Chat-Bison-001 | 6.40 | - |
| Vicuna-13B | 6.39 | 52.1 |
| MPT-30B-Chat | 6.39 | - |
| WizardLM-13B | 6.35 | - |
| Vicuna-7B | 6.00 | 47.1 |
| Baize-v2-13B | 5.75 | - |
| Nous-Hermes-13B | 5.51 | - |
| MPT-7B-Chat | 5.42 | - |
| GPT4All-13B-Snoozy | 5.41 | - |
| Koala-13B | 5.35 | - |
| MPT-30B-Instruct | 5.22 | - |
| Falcon-40B-Instruct | 5.17 | - |
| H2O-Oasst-OpenLLaMA-13B | 4.63 | - |
| Alpaca-13B | 4.53 | 48.1 |
| ChatGLM-6B | 4.50 | - |
| OpenAssistant-Pythia-12B | 4.32 | - |
| RWKV-4-Raven-14B | 3.98 | - |
| Dolly-V2-12B | 3.28 | - |
| FastChat-T5-3B | 3.04 | - |
| StableLM-Tuned-Alpha-7B | 2.75 | - |
| LLaMA-13B | 2.61 | 47.0 |
Several patterns stand out from these results. GPT-4 held a clear lead at 8.99, more than a full point ahead of GPT-3.5-turbo and Claude-v1. Among open-source models, Vicuna-33B led at 7.12, demonstrating that fine-tuned open models could approach (though not match) proprietary systems. Base models without instruction tuning, such as LLaMA-13B, scored near the bottom at 2.61, confirming that fine-tuning and RLHF are critical for chat performance. The gap between instruction-tuned variants (e.g., Vicuna-13B at 6.39) and their base counterparts (LLaMA-13B at 2.61) provided clear quantitative evidence of the impact of alignment training.
Notably, MMLU scores did not correlate strongly with MT-Bench scores across models. For example, LLaMA-13B scored 47.0 on MMLU but only 2.61 on MT-Bench, while Vicuna-7B scored 47.1 on MMLU yet achieved 6.00 on MT-Bench. This discrepancy highlighted that knowledge (as measured by MMLU) and conversational ability (as measured by MT-Bench) are distinct capabilities.
The paper analyzed per-category performance using win rates from Chatbot Arena (the percentage of pairwise comparisons won against all other models):
| Model | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
|---|---|---|---|---|---|---|---|---|
| GPT-4 | 61.2% | 67.9% | 49.3% | 66.1% | 56.3% | 66.2% | 76.6% | 72.2% |
| GPT-3.5 | 50.9% | 60.6% | 32.6% | 63.8% | 55.0% | 48.8% | 52.8% | 53.8% |
| Vicuna-13B | 39.7% | 39.2% | 20.1% | 18.0% | 36.9% | 29.2% | 47.0% | 47.5% |
| LLaMA-13B | 15.1% | 15.1% | 7.8% | 7.5% | 2.1% | 9.3% | 6.8% | 10.1% |
GPT-4 achieved its highest win rates in STEM (76.6%) and Humanities (72.2%) and its lowest in Reasoning (49.3%). Vicuna-13B's win rates dropped sharply in Math (18.0%) and Reasoning (20.1%) compared to its Writing (39.7%) and Humanities (47.5%) performance, revealing that multi-turn reasoning and mathematical tasks present the greatest challenge for smaller open-source models. LLaMA-13B, without instruction tuning, scored in the single digits across most categories, with its lowest performance in Coding at just 2.1%.
A notable finding was the performance gap between first-turn and second-turn responses. The paper reported specific first-turn and second-turn scores for key models:
| Model | First Turn | Second Turn | Average |
|---|---|---|---|
| GPT-4 | 8.96 | 9.03 | 8.99 |
| GPT-3.5-turbo | 8.08 | 7.81 | 7.94 |
| Claude-v1 | 8.15 | 7.65 | 7.90 |
GPT-4 was unusual in that it actually scored slightly higher on the second turn (9.03) than the first (8.96), suggesting robust multi-turn capabilities. GPT-3.5 and Claude-v1 both showed modest declines from their first-turn to second-turn performance. Many open-source models experienced substantially larger drops on the second turn, suggesting weaker ability to maintain context and follow up on earlier responses. Models like Vicuna-7B and WizardLM-13B showed particularly pronounced degradation, indicating that their instruction-following ability was more fragile when asked to build on previous context. This performance degradation on follow-up turns became a key metric for assessing a model's conversational robustness.
MT-Bench and Chatbot Arena were developed as complementary evaluation approaches and were presented together in the same paper. While MT-Bench provides a controlled, reproducible benchmark with fixed questions, Chatbot Arena offers a crowd-sourced evaluation platform where users submit their own prompts and vote on anonymous model responses.
In Chatbot Arena, two models are randomly selected to generate responses to a user's query. The user then votes for the better response without knowing which model produced it. These pairwise preferences are aggregated using the Bradley-Terry model to compute Elo-like ratings for each model. While the system was initially described using chess-style Elo ratings, the Arena later adopted the Bradley-Terry model to better handle the complexity of thousands of simultaneous matchups. By mid-2023, the Arena had collected over 42,000 anonymous votes from users.
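The chess-style Elo update originally used to turn pairwise votes into ratings can be sketched as follows. The K factor and initial rating here are illustrative choices, and the production Arena now fits a Bradley-Terry model instead of running this online update:

```python
def elo_update(ra: float, rb: float, winner: str, k: float = 4.0):
    """Update ratings for models A and B after one pairwise vote.

    `winner` is "A", "B", or "tie". The expected score uses the standard
    logistic Elo curve with a 400-point scale.
    """
    ea = 1 / (1 + 10 ** ((rb - ra) / 400))       # expected score for A
    sa = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]  # actual score for A
    return ra + k * (sa - ea), rb + k * ((1 - sa) - (1 - ea))

# Start both models at the same rating and replay a few votes.
ra, rb = 1000.0, 1000.0
for vote in ["A", "A", "tie", "B", "A"]:
    ra, rb = elo_update(ra, rb, vote)
print(round(ra, 1), round(rb, 1))  # A ends above B after winning 3 of 5
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved — only relative standing changes.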
MT-Bench scores showed high correlation with Chatbot Arena Elo ratings, which provided external validation for both systems. The LMSYS leaderboard, launched alongside the MT-Bench paper, displayed three metrics side by side: Chatbot Arena Elo rating, MT-Bench score, and MMLU score. This combination allowed researchers and practitioners to compare models from multiple angles, with each metric capturing a different aspect of model quality.
The two evaluation methods serve different purposes. MT-Bench excels at standardized, reproducible comparison with low variance, while Chatbot Arena captures real-world user preferences across an unconstrained range of prompts. Together, they established LMSYS as the leading authority on LLM evaluation during 2023 and 2024.
The MT-Bench paper was notable for its transparency about the limitations of the LLM-as-a-Judge approach. The authors identified and studied several systematic biases:
In pairwise comparison mode, LLM judges tend to favor the response presented in a particular position. The paper tested multiple judge models and prompt variants, producing detailed results:
| Judge | Prompt Variant | Consistency | First-Position Bias | Second-Position Bias |
|---|---|---|---|---|
| GPT-4 | default | 65.0% | 30.0% | 5.0% |
| GPT-4 | rename | 66.2% | 28.7% | 5.0% |
| GPT-3.5 | default | 46.2% | 50.0% | 1.2% |
| GPT-3.5 | rename | 51.2% | 38.8% | 6.2% |
| Claude-v1 | default | 23.8% | 75.0% | 0.0% |
| Claude-v1 | rename | 56.2% | 11.2% | 28.7% |
"Consistency" indicates the percentage of cases where the judge gave the same verdict regardless of response order. GPT-4 achieved the highest consistency at 65.0%, while Claude-v1 in the default prompt setting showed extreme first-position bias (75.0%) and only 23.8% consistency. Claude-v1 also exhibited a notable name bias, favoring "Assistant A" regardless of content. Renaming the assistants in the prompt substantially improved Claude-v1's consistency to 56.2%.
The recommended mitigation is to run each comparison twice with swapped positions and only declare a winner when the preference is consistent across both orderings. If the results conflict after swapping, the outcome is recorded as a tie.
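This swap-and-compare logic is simple to implement. In the sketch below, `judge` is a stand-in for a real LLM judge call, and the toy biased judge is purely illustrative:

```python
def consistent_verdict(judge, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' using position-swapped double judging.

    The pair is judged twice with the response order flipped; a winner is
    declared only when both orderings agree, otherwise the result is a tie.
    """
    first = judge(question, answer_a, answer_b)    # A shown first
    second = judge(question, answer_b, answer_a)   # positions swapped
    # Map the second verdict back to the original labels.
    second = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second else "tie"

# A toy judge with extreme position bias: it always prefers whichever
# answer is shown first. The mitigation collapses its verdicts to ties.
biased_judge = lambda q, a, b: "A"
print(consistent_verdict(biased_judge, "q", "x", "y"))  # tie
```

A content-sensitive judge, by contrast, gives the same verdict under both orderings, so its winners survive the swap.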
LLM judges tend to prefer longer, more detailed responses even when shorter answers are equally or more accurate. The paper tested this with a "repetitive list" attack: 23 MT-Bench model answers were made unnecessarily verbose by asking GPT-4 to rephrase their list items without adding new information, then inserting the rephrased list at the beginning of the original response. The failure rates under this attack were:
| Judge | Failure Rate |
|---|---|
| Claude-v1 | 91.3% |
| GPT-3.5 | 91.3% |
| GPT-4 | 8.7% |
GPT-4 was notably more resistant to verbosity manipulation, failing on only 8.7% of test cases, while both Claude-v1 and GPT-3.5 failed on 91.3%. This bias can unfairly penalize models that produce concise, focused outputs and reward models that pad responses with unnecessary elaboration.
LLM judges show a measurable tendency to favor responses generated by themselves. When serving as judge, GPT-4 displayed a roughly 10% higher win rate for its own outputs compared to the rate assigned by human evaluators. Claude-v1 showed an even more pronounced self-enhancement effect, favoring its own responses with approximately 25% higher win rate. GPT-3.5, by contrast, did not exhibit a statistically significant self-enhancement bias. This finding raises concerns about the objectivity of LLM-based evaluation, particularly when the judge model is also a competitor in the evaluation.
LLM judges can struggle to correctly evaluate responses in domains that require precise verification, particularly Mathematics and formal logic. The paper demonstrated this concretely: using the default prompt, GPT-4 as judge made incorrect assessments on 70% of Math question evaluations (14 out of 20). Chain-of-thought prompting reduced the error rate to 30%, and reference-guided grading brought it down to 15%. A judge may assign high scores to plausible-sounding but incorrect mathematical solutions because it cannot reliably verify multi-step computations.
The benchmark emphasizes helpfulness as its primary evaluation criterion but largely neglects safety considerations, including harmful content generation and factual reliability. Within helpfulness, multiple dimensions (accuracy, relevance, creativity, depth) are collapsed into a single 1-to-10 score, which can obscure important trade-offs between these qualities. The authors acknowledged this limitation and suggested that a more comprehensive evaluation framework separating these dimensions could be developed in future work.
With only 80 fixed questions, MT-Bench is vulnerable to contamination. Model developers can optimize for the specific questions in the benchmark, and as the questions became widely known, the risk of overfitting grew over time. The small question set also limits the statistical power of the benchmark, particularly for distinguishing between models with similar capabilities.
In April 2024, the LMSYS team released Arena-Hard, a next-generation benchmark designed to address many of MT-Bench's limitations. Arena-Hard was built using a data pipeline that draws prompts from real user interactions in Chatbot Arena rather than relying on hand-crafted questions.
The Arena-Hard construction process began with approximately 200,000 user queries collected from Chatbot Arena. The team applied topic modeling (using BERTopic) to identify over 4,000 topic clusters and scored prompts on seven quality criteria: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. The final benchmark consists of 500 prompts selected from 250 high-scoring topic clusters, with 2 prompts sampled from each cluster.
| Metric | MT-Bench | Arena-Hard |
|---|---|---|
| Number of prompts | 80 (160 turns) | 500 (1,000 judgments) |
| Model separability (95% CI) | 22.6% | 87.4% |
| Agreement with Chatbot Arena | Lower | 89.1% |
| Correlation with Chatbot Arena | ~91.3% (Spearman) | 98.6% |
| Prompt source | Hand-crafted | Real user queries |
| Update frequency | Static | Periodically refreshed |
| Cost per evaluation run | ~$1-5 (depending on judge) | ~$25 |
Arena-Hard's most significant improvement is separability: the ability to statistically distinguish between models of different quality levels. MT-Bench achieved only 22.6% separability when measured with 95% confidence intervals, meaning that many model pairs could not be reliably ranked. This contrasts sharply with MT-Bench's ~91.3% Spearman correlation with Chatbot Arena rankings, a figure that looks strong only because rank correlation ignores the variance in each model's estimated position. Arena-Hard raised separability to 87.4%, making it far more useful for fine-grained model comparison.
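The separability criterion can be illustrated with a small sketch: a model pair counts as separable when the two models' 95% bootstrap confidence intervals do not overlap. The scores and interval method below are illustrative, not Arena-Hard's exact pipeline:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-prompt scores."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def separable(scores_a, scores_b):
    """True when the two 95% intervals do not overlap."""
    lo_a, hi_a = bootstrap_ci(scores_a)
    lo_b, hi_b = bootstrap_ci(scores_b, seed=1)
    return hi_a < lo_b or hi_b < lo_a

strong = [8, 9, 9, 8, 9, 8, 9, 9, 8, 9]  # hypothetical per-prompt scores
weak   = [4, 5, 4, 5, 4, 4, 5, 5, 4, 4]
print(separable(strong, weak))  # True
```

With only 80 prompts, MT-Bench's intervals are wide, so closely matched models frequently fail this test; Arena-Hard's 500 prompts and 1,000 judgments tighten the intervals considerably.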
Arena-Hard uses GPT-4-Turbo as the judge model with pairwise comparison against a fixed baseline (GPT-4-0314). It incorporates chain-of-thought reasoning and position-swapping to mitigate bias, and generates 1,000 judgments per model for statistical robustness. Results are fitted using the Bradley-Terry model to produce win-rate estimates with confidence intervals.
MT-Bench is fully open source and available through the FastChat repository on GitHub (lm-sys/FastChat). The fastchat/llm_judge module contains everything needed to run the benchmark, including the question set, the judge prompts, and scripts to generate model answers, generate judgments, and display aggregated results.
Installation is straightforward using pip: `pip install -e ".[model_worker,llm_judge]"`. Researchers and developers can run MT-Bench evaluations against any model that exposes a chat API, and the modular design allows for substituting different judge models beyond GPT-4.
The cost of running a complete MT-Bench evaluation varies by judge model. Using GPT-4 (8K context) as the judge costs approximately $5.10 per full evaluation run, while GPT-4-Turbo costs around $1.85 and GPT-4o approximately $0.93. These low costs compared to human evaluation (which can run into thousands of dollars) are a primary reason for MT-Bench's widespread adoption.
The success of MT-Bench inspired several extensions and derivative benchmarks:
Published at ACL 2024, MT-Bench-101 is a fine-grained evaluation benchmark that expands on the original MT-Bench with a three-tier hierarchical ability taxonomy covering Perceptivity, Adaptability, and Interactivity. It contains 4,208 turns across 1,388 multi-turn dialogues covering 13 distinct tasks. The benchmark evaluated 21 LLMs and found that common alignment techniques did not produce consistent improvements in multi-turn capabilities, revealing a gap between single-turn and multi-turn performance that persists across model families.
Community contributors have translated MT-Bench questions into multiple languages, including Japanese, Chinese, Russian, German, French, Indonesian, Vietnamese, and Polish. A dedicated multilingual fork (MT-Bench-X) provides professionally edited translations for German, Spanish, Italian, and French, enabling standardized multi-turn evaluation across languages. These translations were manually reviewed by fluent speakers to ensure correctness and natural phrasing.
Published at ICLR 2025, FairMT-Bench addresses the absence of fairness evaluation in multi-turn dialogue settings. The benchmark formulates a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. It includes 10,000 multi-turn dialogue data points covering two bias types and six bias attributes. A distilled subset, FairMT-1K, provides a lighter evaluation option. Experiments on 15 state-of-the-art LLMs revealed that models are more likely to generate biased responses in multi-turn scenarios compared to single-turn settings.
Released in October 2025, MT-Video-Bench extends the multi-turn evaluation paradigm to video-grounded dialogue. It consists of 987 dialogues comprising 5,805 question-answer pairs sourced from 135 videos across five domains, enabling evaluation of multimodal LLMs on their ability to maintain video-referenced conversations across multiple turns.
Researchers have adapted the MT-Bench framework for specific domains by creating custom question sets while retaining the LLM-as-a-Judge scoring methodology. These adaptations cover areas such as medical reasoning, legal analysis, and software engineering, allowing practitioners to evaluate LLMs on tasks specific to their field of interest.
MT-Bench exists within a broader ecosystem of LLM evaluation benchmarks, each capturing different aspects of model quality:
| Benchmark | Tasks | Evaluation Method | Strengths |
|---|---|---|---|
| MT-Bench | 80 multi-turn questions | LLM-as-a-Judge (1-10 scale) | Standardized, reproducible, low cost |
| Arena-Hard | 500 real-user prompts | LLM-as-a-Judge (pairwise) | High separability, real-world prompts |
| AlpacaEval | 805 single-turn tasks | LLM-as-a-Judge (pairwise) | Large task set, length-controlled variant |
| WildBench | 1,024 real-user tasks | LLM-as-a-Judge (WB-Score/Reward) | Highest correlation (0.98) with Arena |
| Chatbot Arena | Open-ended user prompts | Human pairwise voting | Gold standard for human preferences |
| MMLU | 14,042 multiple-choice | Accuracy | Broad knowledge coverage |
AlpacaEval, with 805 tasks drawn from alignment datasets, uses simpler prompts than MT-Bench but covers a wider range. WildBench, published at ICLR 2025, uses 1,024 tasks selected from over one million human-chatbot conversation logs and achieves 0.98 Pearson correlation with Chatbot Arena Elo ratings, surpassing both Arena-Hard (0.91) and AlpacaEval 2.0 (0.87). Each benchmark fills a different niche: MT-Bench for quick, standardized multi-turn testing; Arena-Hard for statistically robust automated evaluation; and Chatbot Arena for ground-truth human preferences.
MT-Bench and the LLM-as-a-Judge paradigm have had a significant impact on how the AI community evaluates language models:
Standardization of automated evaluation. Before MT-Bench, there was no widely accepted methodology for using LLMs to evaluate other LLMs on open-ended tasks. The paper provided both a theoretical framework and empirical validation, giving the community confidence to adopt LLM-based evaluation at scale. By 2025, LLM-as-a-Judge had become a standard component of model development pipelines at most major AI labs, including OpenAI, Anthropic, Google, and Meta. Two comprehensive survey papers on LLM-as-a-Judge were published in late 2024 (Gu et al., arXiv 2411.15594; Li et al., arXiv 2412.05579), documenting the widespread adoption of the paradigm across fields including text generation, question answering, dialogue systems, education, and peer review.
Shift toward conversational benchmarks. MT-Bench helped shift the evaluation paradigm from single-turn, closed-ended benchmarks toward multi-turn, open-ended evaluations that better reflect real-world usage of chat assistants. This influenced the design of subsequent benchmarks including Arena-Hard, WildBench, AlpacaEval, and FairMT-Bench.
Enabling rapid model iteration. By demonstrating that GPT-4 judgments closely track human preferences, MT-Bench made it practical for model developers to run evaluation cycles in hours rather than weeks. This accelerated the pace of LLM development and fine-tuning research, enabling faster experimentation with training data mixes, hyperparameters, and alignment techniques.
Transparency about evaluation limitations. The paper's thorough analysis of judge biases (position, verbosity, self-enhancement) set a standard for transparency in evaluation research. Subsequent work on LLM evaluation has consistently referenced and built upon these findings, and bias mitigation strategies proposed in the paper (such as position swapping) have become standard practice.
Community infrastructure. The LMSYS leaderboard, which prominently featured MT-Bench scores alongside Chatbot Arena Elo ratings and MMLU scores, became a central resource for tracking the state of the art in conversational AI. The open-source release of all benchmark materials through FastChat lowered the barrier for researchers worldwide to participate in LLM evaluation.
Despite its contributions, MT-Bench has faced several criticisms from the research community, chief among them its small, static question set (which invites contamination and overfitting), its dependence on a single proprietary judge model, and the judge biases documented in the paper itself.
Many of these criticisms motivated the development of Arena-Hard, WildBench, and other next-generation benchmarks that use larger, dynamically updated question sets derived from real user interactions.