MT-Bench (Multi-Turn Benchmark) is a benchmark for evaluating large language models (LLMs) on their ability to handle multi-turn conversations and follow complex instructions. Introduced in June 2023 by researchers from LMSYS (Large Model Systems Organization) and affiliated universities, MT-Bench consists of 80 carefully crafted multi-turn questions spanning eight categories. The benchmark is best known for formalizing the LLM-as-a-Judge paradigm, in which a strong language model such as GPT-4 serves as an automated evaluator, scoring responses on a scale of 1 to 10.
The accompanying paper, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," was published at the NeurIPS 2023 Datasets and Benchmarks Track. It has since become one of the most cited papers in the LLM evaluation literature, accumulating over 6,000 citations by early 2026 according to Semantic Scholar. MT-Bench and its associated methodology have shaped how the AI research community measures and compares the quality of conversational AI systems.
As LLM-based chat assistants grew more capable throughout 2022 and 2023, existing evaluation methods struggled to keep pace. Traditional natural language processing benchmarks such as MMLU, HellaSwag, and TruthfulQA rely on multiple-choice or short-answer formats. While useful for measuring factual knowledge and reasoning on closed-ended tasks, these benchmarks fail to capture the open-ended, conversational qualities that users value in chat assistants, including coherence across multiple turns, creativity, nuance in instruction following, and the ability to handle follow-up requests that modify or build on earlier context.
Human evaluation remains the gold standard for measuring these qualities, but it is slow, expensive, and difficult to scale. Collecting reliable pairwise preference judgments from human annotators can cost thousands of dollars and take weeks to complete, making it impractical for the rapid iteration cycles of modern LLM development. The research team engaged 58 expert-level human labelers to produce ground-truth annotations for validation purposes, underscoring the resource-intensive nature of human evaluation. The LMSYS team set out to address this gap by developing both a targeted benchmark (MT-Bench) and an automated evaluation methodology (LLM-as-a-Judge) that could approximate human preferences at a fraction of the cost.
Another motivation was the need to evaluate models on multi-turn interactions specifically. Most existing benchmarks at the time tested models on isolated, single-turn prompts. Real-world usage of chat assistants, however, involves extended conversations where users ask follow-up questions, refine their requests, or challenge the model's previous answers. The ability to maintain coherence and contextual awareness across turns is a distinct capability that single-turn benchmarks cannot measure.
The paper was authored by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. The research team drew members from several institutions, including UC Berkeley, Stanford, UC San Diego, Carnegie Mellon University, and MBZUAI.
Many of the same researchers were responsible for building Vicuna, the FastChat platform, and Chatbot Arena, all of which are closely related to MT-Bench. LMSYS itself originated as a multi-university research collaboration in 2023 and was later incorporated as a 501(c)(3) non-profit in September 2024, with a mission to make large AI models accessible through open-source development, datasets, and evaluation tools.
MT-Bench contains 80 multi-turn questions, each consisting of exactly two conversational turns. The first turn presents an initial prompt, and the second turn introduces a follow-up that tests the model's ability to build on its previous response. Follow-up turns are designed to be challenging: they may ask the model to refine, extend, contradict, or reformat its first answer. This two-turn structure tests not only the quality of individual responses but also the model's capacity for contextual continuity and instruction compliance across turns.
All 80 questions were manually designed by the research team. The questions intentionally target areas where weaker models tend to break down, including complex reasoning chains, mathematical problem solving, code generation with constraints, and tasks that require maintaining a specific persona or format.
The 80 questions are evenly distributed across eight categories, with 10 questions per category:
| Category | Description | Example Task Types |
|---|---|---|
| Writing | Creative and structured text generation | Essays, emails, letters, persuasive writing |
| Roleplay | Maintaining a character or persona | Acting as a historical figure, fictional character, or professional |
| Extraction | Pulling structured information from text | Identifying key facts, summarizing, reformatting data |
| Reasoning | Logical and commonsense reasoning | Deductive puzzles, hypothetical scenarios, argument analysis |
| Math | Mathematical problem solving | Arithmetic, algebra, word problems, proofs |
| Coding | Programming tasks and code analysis | Writing functions, debugging, explaining code, algorithm design |
| STEM (Knowledge I) | Science, technology, engineering knowledge | Physics concepts, biology questions, engineering principles |
| Humanities (Knowledge II) | Humanities and social science knowledge | History, philosophy, economics, social science questions |
This category design ensures that MT-Bench tests a broad range of capabilities rather than focusing narrowly on a single skill. Pairing knowledge-oriented categories (STEM, Humanities) with skill-oriented ones (Coding, Math, Reasoning) yields a balanced assessment profile, and splitting knowledge into two domains helps reveal whether a model's coverage is uneven across them.
The following examples, drawn from the publicly available question set in the FastChat repository, illustrate how each category pairs an initial prompt with a challenging follow-up.
Writing (Question 81). The first turn asks: "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." The second turn then changes the constraint entirely: "Rewrite your previous response. Start every sentence with the letter A." This forces the model to recall its own output and restructure it under a strict formatting rule.
Roleplay (Question 91). The first turn instructs: "Pretend yourself to be Elon Musk in all the following conversations. Speak like Elon Musk as much as possible. Why do we need to go to Mars?" The follow-up shifts topic while maintaining the persona: "How do you like dancing? Can you teach me?" The model must stay in character even when the topic moves away from the persona's typical domain.
Extraction (Question 131). The first turn provides three movie reviews and asks: "Evaluate the following movie reviews on a scale of 1 to 5... Return the answer as a JSON array of integers." The second turn adds a new requirement: "Update your previous reply by including the release date as part of the JSON content." This tests both structured output generation and the ability to incrementally modify a previous response.
Reasoning (Question 101). The first turn poses a classic logic puzzle: "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position?" The second turn modifies the premise: "If the 'second person' is changed to 'last person' in the above question, what would the answer be?" Many models fail on the second turn because the modified question requires careful re-analysis rather than simple substitution.
Math (Question 111). The first turn asks: "The vertices of a triangle are at points (0, 0), (-1, 1), and (3, 3). What is the area of the triangle?" The second turn builds on the geometry: "What's the area of the circle circumscribing the triangle?" Solving the follow-up requires using the first answer as an intermediate step.
Coding (Question 121). The first turn requests: "Develop a Python program that reads all the text files under a directory and returns top-5 words with the most number of occurrences." The follow-up adds a performance constraint: "Can you parallelize it?" The model must modify its own code while preserving correctness.
STEM (Question 143). The first turn asks: "Photosynthesis is a vital process for life on Earth. Could you outline the two main stages of photosynthesis, including where they take place within the chloroplast?" The second turn demands quantitative reasoning: "How much energy can a tree produce through photosynthesis in its lifetime? Please provide an estimate using actual numerical values."
Humanities (Question 151). The first turn asks: "Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators." The second turn tests communication flexibility: "Now, explain them again like I'm five."
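The Math follow-up above (Question 111) can be checked numerically. A minimal sketch using the shoelace formula and the standard circumcenter construction — the formulas are textbook coordinate geometry, not drawn from the benchmark itself:

```python
import math

def triangle_area(p1, p2, p3):
    """Shoelace formula for the area of a triangle."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

def circumcircle_area(p1, p2, p3):
    """Area of the circle through all three vertices.

    The circumcenter (ux, uy) is equidistant from the vertices; the closed
    form below comes from solving the two perpendicular-bisector equations.
    """
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r_sq = (ax - ux) ** 2 + (ay - uy) ** 2  # squared circumradius
    return math.pi * r_sq

vertices = [(0, 0), (-1, 1), (3, 3)]
print(triangle_area(*vertices))      # 3.0
print(circumcircle_area(*vertices))  # 5*pi, about 15.708
```

The circumcenter works out to (1, 2) with squared radius 5, so the follow-up answer is 5π — exactly the kind of multi-step dependence on the first turn that the category is designed to probe.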
In the default single-answer grading mode, the LLM judge evaluates each model response independently on a scale of 1 to 10, where 1 indicates a completely unhelpful or incorrect response and 10 indicates a near-perfect response. The judge provides a score for each of the two turns separately. A model's overall MT-Bench score is the average across all 160 individual turn scores (80 questions multiplied by 2 turns).
The scoring prompt instructs the judge to consider several quality dimensions, including helpfulness, relevance, accuracy, depth, creativity, and level of detail. For certain categories such as Math and Coding, the system can optionally supply a reference answer to guide the judge's evaluation (reference-guided grading). The judge also provides a written explanation for each score, making the evaluation process interpretable and allowing researchers to audit individual judgments.
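Parsing these verdicts and aggregating them per turn might look like the following sketch. The "Rating: [[X]]" verdict format follows the convention used by the FastChat judge prompts; `extract_rating` and the sample scores are illustrative, not FastChat's actual parser.

```python
import re
from statistics import mean

def extract_rating(judgment: str):
    """Pull the 1-10 score out of a judge verdict like 'Rating: [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None

# Hypothetical per-question scores for one model: (turn 1, turn 2) pairs.
scores = [
    (extract_rating("Rating: [[9]]"), extract_rating("Rating: [[8]]")),
    (extract_rating("Rating: [[7]]"), extract_rating("Rating: [[6]]")),
]

turn1_avg = mean(s[0] for s in scores)              # first-turn average
turn2_avg = mean(s[1] for s in scores)              # second-turn average
overall = mean(v for pair in scores for v in pair)  # average over all turns
print(turn1_avg, turn2_avg, overall)                # 8.0 7.0 7.5
```

With the full benchmark, `scores` would hold 80 pairs and `overall` would be the average over all 160 turn scores described above.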
The most influential contribution of the MT-Bench paper is the systematic study and validation of using strong LLMs as automated judges for evaluating other LLMs. The paper examines three distinct judging approaches:
In pairwise comparison mode, the judge model receives a question along with two candidate responses (from different models) and must determine which response is better, or declare a tie. This approach mirrors the format used in Chatbot Arena, where human users compare anonymous model outputs side by side. Pairwise comparison tends to produce higher-quality judgments because the judge can directly contrast the two responses, but it scales quadratically with the number of models being evaluated.
In single-answer grading, the judge model receives a single question-response pair and assigns a numerical score on a 1 to 10 scale. This is the recommended default mode for MT-Bench because it is simpler, cheaper (requiring only one judge call per response rather than pairwise comparisons), and produces scores that are easy to aggregate and compare across models. Single-answer grading scales linearly with the number of models, making it practical for evaluating large numbers of candidates.
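The scaling difference is easy to quantify. A back-of-the-envelope sketch, assuming one judge call per turn and a full round-robin for pairwise mode:

```python
# MT-Bench has 80 questions x 2 turns = 160 judged turns per model.
def single_answer_calls(n_models: int, turns: int = 160) -> int:
    """One judge call per turn per model: linear in the number of models."""
    return n_models * turns

def pairwise_calls(n_models: int, turns: int = 160) -> int:
    """One judge call per turn per model pair: quadratic in the number
    of models (before any position-swapped re-judging)."""
    return n_models * (n_models - 1) // 2 * turns

for n in (5, 10, 30):
    print(n, single_answer_calls(n), pairwise_calls(n))
# 5 models:  800 vs 1,600 calls; 30 models: 4,800 vs 69,600 calls
```

The gap widens quickly, which is why single-answer grading became the default for leaderboard-scale evaluation.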
In reference-guided grading, used for questions with objectively correct answers (particularly in Math and Coding), a reference solution is provided to the judge alongside the model's response. This helps the judge assess correctness more accurately, addressing a known weakness of LLM judges in verifying mathematical and logical reasoning. The paper tested Math question grading and found that the default prompt had a 70% failure rate (14 out of 20 incorrect judgments), chain-of-thought prompting reduced this to 30% (6 out of 20), and reference-guided grading further reduced it to 15% (3 out of 20). The improvement was most pronounced on Math questions; other categories where answers are more subjective showed smaller gains.
The research team conducted extensive validation of the LLM-as-a-Judge approach using both expert annotators and crowdsourced evaluations. Key findings from the paper's validation experiments include:
MT-Bench validation (Setup S2, second-turn comparisons):
| Judge Pair | Agreement Rate | Number of Votes |
|---|---|---|
| GPT-4 pairwise vs. human experts | 85% | 864 |
| GPT-4 single-answer vs. human experts | 84% | 776 |
| Human-to-human inter-rater agreement | 82% | 474 |
Chatbot Arena validation (Setup S2):
| Judge Pair | Agreement Rate | Number of Votes |
|---|---|---|
| GPT-4 pairwise vs. human | 95% | 1,967 |
| GPT-4 single-answer vs. human | 85% | 1,761 |
| Human-to-human agreement | 87% | 1,944 |
The central finding is that GPT-4 as a judge achieves agreement with human experts at a level that matches or slightly exceeds the agreement rate among human annotators themselves. This result validated the idea that strong LLMs can serve as practical, scalable substitutes for human evaluation in many settings.
The research team also collected 3,000 expert votes and made 30,000 conversations with human preferences publicly available through their GitHub repository to support reproducibility and further research. In a supplementary analysis, when humans disagreed with GPT-4's judgments, they deemed the GPT-4 judgments reasonable in 75% of cases and actually changed their own votes in 34% of those disagreements.
The paper explored whether few-shot examples could improve GPT-4's consistency as a judge. By providing a small number of labeled examples in the judge prompt, GPT-4's position-bias consistency improved from 65.0% in the zero-shot setting to 77.5% with few-shot prompting. This finding suggests that judge quality can be further enhanced through careful prompt engineering.
The original MT-Bench evaluation tested a range of proprietary and open-source models. The following table shows the scores reported in the LMSYS leaderboard announcement (June 2023), with GPT-4 grading as the judge:
| Model | MT-Bench Score | MMLU (5-shot) |
|---|---|---|
| GPT-4 | 8.99 | 86.4 |
| GPT-3.5-turbo | 7.94 | 70.0 |
| Claude-v1 | 7.90 | - |
| Claude-instant-v1 | 7.85 | - |
| Vicuna-33B | 7.12 | - |
| WizardLM-30B | 7.01 | - |
| Guanaco-33B | 6.53 | - |
| Tulu-30B | 6.43 | - |
| Guanaco-65B | 6.41 | - |
| OpenAssistant-LLaMA-30B | 6.41 | - |
| PaLM-Chat-Bison-001 | 6.40 | - |
| Vicuna-13B | 6.39 | 52.1 |
| MPT-30B-Chat | 6.39 | - |
| WizardLM-13B | 6.35 | - |
| Vicuna-7B | 6.00 | 47.1 |
| Baize-v2-13B | 5.75 | - |
| Nous-Hermes-13B | 5.51 | - |
| MPT-7B-Chat | 5.42 | - |
| GPT4All-13B-Snoozy | 5.41 | - |
| Koala-13B | 5.35 | - |
| MPT-30B-Instruct | 5.22 | - |
| Falcon-40B-Instruct | 5.17 | - |
| H2O-Oasst-OpenLLaMA-13B | 4.63 | - |
| Alpaca-13B | 4.53 | 48.1 |
| ChatGLM-6B | 4.50 | - |
| OpenAssistant-Pythia-12B | 4.32 | - |
| RWKV-4-Raven-14B | 3.98 | - |
| Dolly-V2-12B | 3.28 | - |
| FastChat-T5-3B | 3.04 | - |
| StableLM-Tuned-Alpha-7B | 2.75 | - |
| LLaMA-13B | 2.61 | 47.0 |
Several patterns stand out from these results. GPT-4 held a clear lead at 8.99, more than a full point ahead of GPT-3.5-turbo and Claude-v1. Among open-source models, Vicuna-33B led at 7.12, demonstrating that fine-tuned open models could approach (though not match) proprietary systems. Base models without instruction tuning, such as LLaMA-13B, scored near the bottom at 2.61, confirming that fine-tuning and RLHF are critical for chat performance. The gap between instruction-tuned variants (e.g., Vicuna-13B at 6.39) and their base counterparts (LLaMA-13B at 2.61) provided clear quantitative evidence of the impact of alignment training.
Notably, MMLU scores did not correlate strongly with MT-Bench scores across models. For example, LLaMA-13B scored 47.0 on MMLU but only 2.61 on MT-Bench, while Vicuna-7B scored 47.1 on MMLU yet achieved 6.00 on MT-Bench. This discrepancy highlighted that knowledge (as measured by MMLU) and conversational ability (as measured by MT-Bench) are distinct capabilities.
The paper analyzed per-category performance using win rates from Chatbot Arena (the percentage of pairwise comparisons won against all other models):
| Model | Writing | Roleplay | Reasoning | Math | Coding | Extraction | STEM | Humanities |
|---|---|---|---|---|---|---|---|---|
| GPT-4 | 61.2% | 67.9% | 49.3% | 66.1% | 56.3% | 66.2% | 76.6% | 72.2% |
| GPT-3.5 | 50.9% | 60.6% | 32.6% | 63.8% | 55.0% | 48.8% | 52.8% | 53.8% |
| Vicuna-13B | 39.7% | 39.2% | 20.1% | 18.0% | 36.9% | 29.2% | 47.0% | 47.5% |
| LLaMA-13B | 15.1% | 15.1% | 7.8% | 7.5% | 2.1% | 9.3% | 6.8% | 10.1% |
GPT-4 achieved its highest win rates in STEM (76.6%) and Humanities (72.2%) and its lowest in Reasoning (49.3%). Vicuna-13B's win rates dropped sharply in Math (18.0%) and Reasoning (20.1%) compared to its Writing (39.7%) and Humanities (47.5%) performance, revealing that multi-turn reasoning and mathematical tasks present the greatest challenge for smaller open-source models. LLaMA-13B, without instruction tuning, scored in the single digits across most categories, with its lowest performance in Coding at just 2.1%.
A notable finding was the performance gap between first-turn and second-turn responses. The paper reported specific first-turn and second-turn scores for key models:
| Model | First Turn | Second Turn | Average |
|---|---|---|---|
| GPT-4 | 8.96 | 9.03 | 8.99 |
| GPT-3.5-turbo | 8.08 | 7.81 | 7.94 |
| Claude-v1 | 8.15 | 7.65 | 7.90 |
GPT-4 was unusual in that it actually scored slightly higher on the second turn (9.03) than the first (8.96), suggesting robust multi-turn capabilities. GPT-3.5 and Claude-v1 both showed modest declines from their first-turn to second-turn performance. Many open-source models experienced substantially larger drops on the second turn, suggesting weaker ability to maintain context and follow up on earlier responses. Models like Vicuna-7B and WizardLM-13B showed particularly pronounced degradation, indicating that their instruction-following ability was more fragile when asked to build on previous context. This performance degradation on follow-up turns became a key metric for assessing a model's conversational robustness.
MT-Bench and Chatbot Arena were developed as complementary evaluation approaches and were presented together in the same paper. While MT-Bench provides a controlled, reproducible benchmark with fixed questions, Chatbot Arena offers a crowd-sourced evaluation platform where users submit their own prompts and vote on anonymous model responses.
In Chatbot Arena, two models are randomly selected to generate responses to a user's query. The user then votes for the better response without knowing which model produced it. These pairwise preferences are aggregated using the Bradley-Terry model to compute Elo-like ratings for each model. While the system was initially described using chess-style Elo ratings, the Arena later adopted the Bradley-Terry model to better handle the complexity of thousands of simultaneous matchups. By mid-2023, the Arena had collected over 42,000 anonymous votes from users.
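The chess-style Elo update originally used to turn pairwise votes into ratings can be sketched as follows. The K factor and initial rating here are illustrative choices, and the production Arena now fits a Bradley-Terry model instead of running this online update:

```python
def elo_update(ra: float, rb: float, winner: str, k: float = 4.0):
    """Update ratings for models A and B after one pairwise vote.

    `winner` is "A", "B", or "tie". The expected score uses the standard
    logistic Elo curve with a 400-point scale.
    """
    ea = 1 / (1 + 10 ** ((rb - ra) / 400))       # expected score for A
    sa = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]  # actual score for A
    return ra + k * (sa - ea), rb + k * ((1 - sa) - (1 - ea))

# Start both models at the same rating and replay a few votes.
ra, rb = 1000.0, 1000.0
for vote in ["A", "A", "tie", "B", "A"]:
    ra, rb = elo_update(ra, rb, vote)
print(round(ra, 1), round(rb, 1))  # A ends above B after winning 3 of 5
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved — only relative standing changes.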
MT-Bench scores showed high correlation with Chatbot Arena Elo ratings, which provided external validation for both systems. The LMSYS leaderboard, launched alongside the MT-Bench paper, displayed three metrics side by side: Chatbot Arena Elo rating, MT-Bench score, and MMLU score. This combination allowed researchers and practitioners to compare models from multiple angles, with each metric capturing a different aspect of model quality.
The two evaluation methods serve different purposes. MT-Bench excels at standardized, reproducible comparison with low variance, while Chatbot Arena captures real-world user preferences across an unconstrained range of prompts. Together, they established LMSYS as the leading authority on LLM evaluation during 2023 and 2024.
The MT-Bench paper was notable for its transparency about the limitations of the LLM-as-a-Judge approach. The authors identified and studied several systematic biases:
In pairwise comparison mode, LLM judges tend to favor the response presented in a particular position. The paper tested multiple judge models and prompt variants, producing detailed results:
| Judge | Prompt Variant | Consistency | First-Position Bias | Second-Position Bias |
|---|---|---|---|---|
| GPT-4 | default | 65.0% | 30.0% | 5.0% |
| GPT-4 | rename | 66.2% | 28.7% | 5.0% |
| GPT-3.5 | default | 46.2% | 50.0% | 1.2% |
| GPT-3.5 | rename | 51.2% | 38.8% | 6.2% |
| Claude-v1 | default | 23.8% | 75.0% | 0.0% |
| Claude-v1 | rename | 56.2% | 11.2% | 28.7% |
"Consistency" indicates the percentage of cases where the judge gave the same verdict regardless of response order. GPT-4 achieved the highest consistency at 65.0%, while Claude-v1 in the default prompt setting showed extreme first-position bias (75.0%) and only 23.8% consistency. Claude-v1 also exhibited a notable name bias, favoring "Assistant A" regardless of content. Renaming the assistants in the prompt substantially improved Claude-v1's consistency to 56.2%.
The recommended mitigation is to run each comparison twice with swapped positions and only declare a winner when the preference is consistent across both orderings. If the results conflict after swapping, the outcome is recorded as a tie.
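This swap-and-compare logic is simple to implement. In the sketch below, `judge` is a stand-in for a real LLM judge call, and the toy biased judge is purely illustrative:

```python
def consistent_verdict(judge, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' using position-swapped double judging.

    The pair is judged twice with the response order flipped; a winner is
    declared only when both orderings agree, otherwise the result is a tie.
    """
    first = judge(question, answer_a, answer_b)    # A shown first
    second = judge(question, answer_b, answer_a)   # positions swapped
    # Map the second verdict back to the original labels.
    second = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second else "tie"

# A toy judge with extreme position bias: it always prefers whichever
# answer is shown first. The mitigation collapses its verdicts to ties.
biased_judge = lambda q, a, b: "A"
print(consistent_verdict(biased_judge, "q", "x", "y"))  # tie
```

A content-sensitive judge, by contrast, gives the same verdict under both orderings, so its winners survive the swap.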
LLM judges tend to prefer longer, more detailed responses even when shorter answers are equally or more accurate. The paper tested this with a "repetitive list" attack: 23 MT-Bench model answers were made unnecessarily verbose by asking GPT-4 to rephrase their list items without adding new information, then inserting the rephrased list at the beginning of the original response. The failure rates under this attack were:
| Judge | Failure Rate |
|---|---|
| Claude-v1 | 91.3% |
| GPT-3.5 | 91.3% |
| GPT-4 | 8.7% |
GPT-4 was notably more resistant to verbosity manipulation, failing on only 8.7% of test cases, while both Claude-v1 and GPT-3.5 failed on 91.3%. This bias can unfairly penalize models that produce concise, focused outputs and reward models that pad responses with unnecessary elaboration.
LLM judges show a measurable tendency to favor responses generated by themselves. When serving as judge, GPT-4 displayed a roughly 10% higher win rate for its own outputs compared to the rate assigned by human evaluators. Claude-v1 showed an even more pronounced self-enhancement effect, favoring its own responses with approximately 25% higher win rate. GPT-3.5, by contrast, did not exhibit a statistically significant self-enhancement bias. This finding raises concerns about the objectivity of LLM-based evaluation, particularly when the judge model is also a competitor in the evaluation.
LLM judges can struggle to correctly evaluate responses in domains that require precise verification, particularly Mathematics and formal logic. The paper demonstrated this concretely: using the default prompt, GPT-4 as judge made incorrect assessments on 70% of Math question evaluations (14 out of 20). Chain-of-thought prompting reduced the error rate to 30%, and reference-guided grading brought it down to 15%. A judge may assign high scores to plausible-sounding but incorrect mathematical solutions because it cannot reliably verify multi-step computations.
The benchmark emphasizes helpfulness as its primary evaluation criterion but largely neglects safety considerations, including harmful content generation and factual reliability. Within helpfulness, multiple dimensions (accuracy, relevance, creativity, depth) are collapsed into a single 1-to-10 score, which can obscure important trade-offs between these qualities. The authors acknowledged this limitation and suggested that a more comprehensive evaluation framework separating these dimensions could be developed in future work.
With only 80 fixed questions, MT-Bench is vulnerable to contamination. Model developers can optimize for the specific questions in the benchmark, and as the questions became widely known, the risk of overfitting grew over time. The small question set also limits the statistical power of the benchmark, particularly for distinguishing between models with similar capabilities.
In April 2024, the LMSYS team released Arena-Hard, a next-generation benchmark designed to address many of MT-Bench's limitations. Arena-Hard was built using a data pipeline that draws prompts from real user interactions in Chatbot Arena rather than relying on hand-crafted questions.
The Arena-Hard construction process began with approximately 200,000 user queries collected from Chatbot Arena. The team applied topic modeling (using BERTopic) to identify over 4,000 topic clusters and scored prompts on seven quality criteria: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. The final benchmark consists of 500 prompts selected from 250 high-scoring topic clusters, with 2 prompts sampled from each cluster.
| Metric | MT-Bench | Arena-Hard |
|---|---|---|
| Number of prompts | 80 (160 turns) | 500 (1,000 judgments) |
| Model separability (95% CI) | 22.6% | 87.4% |
| Agreement with Chatbot Arena | Lower | 89.1% |
| Correlation with Chatbot Arena | ~91.3% (Spearman) | 98.6% |
| Prompt source | Hand-crafted | Real user queries |
| Update frequency | Static | Periodically refreshed |
| Cost per evaluation run | ~$1-5 (depending on judge) | ~$25 |
Arena-Hard's most significant improvement is separability: the ability to statistically distinguish between models of different quality levels. MT-Bench achieved only 22.6% separability when measured with 95% confidence intervals, meaning that many model pairs could not be reliably ranked. This contrasts sharply with MT-Bench's ~91.3% Spearman correlation with Chatbot Arena rankings, a figure that looks strong only because rank correlation ignores the variance in each model's estimated position. Arena-Hard raised separability to 87.4%, making it far more useful for fine-grained model comparison.
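The separability criterion can be illustrated with a small sketch: a model pair counts as separable when the two models' 95% bootstrap confidence intervals do not overlap. The scores and interval method below are illustrative, not Arena-Hard's exact pipeline:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-prompt scores."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def separable(scores_a, scores_b):
    """True when the two 95% intervals do not overlap."""
    lo_a, hi_a = bootstrap_ci(scores_a)
    lo_b, hi_b = bootstrap_ci(scores_b, seed=1)
    return hi_a < lo_b or hi_b < lo_a

strong = [8, 9, 9, 8, 9, 8, 9, 9, 8, 9]  # hypothetical per-prompt scores
weak   = [4, 5, 4, 5, 4, 4, 5, 5, 4, 4]
print(separable(strong, weak))  # True
```

With only 80 prompts, MT-Bench's intervals are wide, so closely matched models frequently fail this test; Arena-Hard's 500 prompts and 1,000 judgments tighten the intervals considerably.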
Arena-Hard uses GPT-4-Turbo as the judge model with pairwise comparison against a fixed baseline (GPT-4-0314). It incorporates chain-of-thought reasoning and position-swapping to mitigate bias, and generates 1,000 judgments per model for statistical robustness. Results are fitted using the Bradley-Terry model to produce win-rate estimates with confidence intervals.
MT-Bench is fully open source and available through the FastChat repository on GitHub (lm-sys/FastChat). The fastchat/llm_judge module contains everything needed to run the benchmark, including the question set, the judge prompts, and scripts to generate model answers, generate judgments, and display aggregated results.
Installation is straightforward using pip: `pip install -e ".[model_worker,llm_judge]"`. Researchers and developers can run MT-Bench evaluations against any model that exposes a chat API, and the modular design allows for substituting different judge models beyond GPT-4.
The cost of running a complete MT-Bench evaluation varies by judge model. Using GPT-4 (8K context) as the judge costs approximately $5.10 per full evaluation run, while GPT-4-Turbo costs around $1.85 and GPT-4o approximately $0.93. These low costs compared to human evaluation (which can run into thousands of dollars) are a primary reason for MT-Bench's widespread adoption.
The success of MT-Bench inspired several extensions and derivative benchmarks:
Published at ACL 2024, MT-Bench-101 is a fine-grained evaluation benchmark that expands on the original MT-Bench with a three-tier hierarchical ability taxonomy covering Perceptivity, Adaptability, and Interactivity. It contains 4,208 turns across 1,388 multi-turn dialogues covering 13 distinct tasks. The benchmark evaluated 21 LLMs and found that common alignment techniques did not produce consistent improvements in multi-turn capabilities, revealing a gap between single-turn and multi-turn performance that persists across model families.
Community contributors have translated MT-Bench questions into multiple languages, including Japanese, Chinese, Russian, German, French, Indonesian, Vietnamese, and Polish. A dedicated multilingual fork (MT-Bench-X) provides professionally edited translations for German, Spanish, Italian, and French, enabling standardized multi-turn evaluation across languages. These translations were manually reviewed by fluent speakers to ensure correctness and natural phrasing.
Published at ICLR 2025, FairMT-Bench addresses the absence of fairness evaluation in multi-turn dialogue settings. The benchmark formulates a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. It includes 10,000 multi-turn dialogue data points covering two bias types and six bias attributes. A distilled subset, FairMT-1K, provides a lighter evaluation option. Experiments on 15 state-of-the-art LLMs revealed that models are more likely to generate biased responses in multi-turn scenarios compared to single-turn settings.
Released in October 2025, MT-Video-Bench extends the multi-turn evaluation paradigm to video-grounded dialogue. It consists of 987 dialogues comprising 5,805 question-answer pairs sourced from 135 videos across five domains, enabling evaluation of multimodal LLMs on their ability to maintain video-referenced conversations across multiple turns.
Researchers have adapted the MT-Bench framework for specific domains by creating custom question sets while retaining the LLM-as-a-Judge scoring methodology. These adaptations cover areas such as medical reasoning, legal analysis, and software engineering, allowing practitioners to evaluate LLMs on tasks specific to their field of interest.
MT-Bench exists within a broader ecosystem of LLM evaluation benchmarks, each capturing different aspects of model quality:
| Benchmark | Tasks | Evaluation Method | Strengths |
|---|---|---|---|
| MT-Bench | 80 multi-turn questions | LLM-as-a-Judge (1-10 scale) | Standardized, reproducible, low cost |
| Arena-Hard | 500 real-user prompts | LLM-as-a-Judge (pairwise) | High separability, real-world prompts |
| AlpacaEval | 805 single-turn tasks | LLM-as-a-Judge (pairwise) | Large task set, length-controlled variant |
| WildBench | 1,024 real-user tasks | LLM-as-a-Judge (WB-Score/Reward) | Highest correlation (0.98) with Arena |
| Chatbot Arena | Open-ended user prompts | Human pairwise voting | Gold standard for human preferences |
| MMLU | 14,042 multiple-choice | Accuracy | Broad knowledge coverage |
AlpacaEval, with 805 tasks drawn from alignment datasets, uses simpler prompts than MT-Bench but covers a wider range. WildBench, published at ICLR 2025, uses 1,024 tasks selected from over one million human-chatbot conversation logs and achieves 0.98 Pearson correlation with Chatbot Arena Elo ratings, surpassing both Arena-Hard (0.91) and AlpacaEval 2.0 (0.87). Each benchmark fills a different niche: MT-Bench for quick, standardized multi-turn testing; Arena-Hard for statistically robust automated evaluation; and Chatbot Arena for ground-truth human preferences.
MT-Bench and the LLM-as-a-Judge paradigm have had a significant impact on how the AI community evaluates language models:
Standardization of automated evaluation. Before MT-Bench, there was no widely accepted methodology for using LLMs to evaluate other LLMs on open-ended tasks. The paper provided both a theoretical framework and empirical validation, giving the community confidence to adopt LLM-based evaluation at scale. By 2025, LLM-as-a-Judge had become a standard component of model development pipelines at most major AI labs, including OpenAI, Anthropic, Google, and Meta. Two comprehensive survey papers on LLM-as-a-Judge were published in late 2024 (Gu et al., arXiv 2411.15594; Li et al., arXiv 2412.05579), documenting the widespread adoption of the paradigm across fields including text generation, question answering, dialogue systems, education, and peer review.
Shift toward conversational benchmarks. MT-Bench helped shift the evaluation paradigm from single-turn, closed-ended benchmarks toward multi-turn, open-ended evaluations that better reflect real-world usage of chat assistants. This influenced the design of subsequent benchmarks including Arena-Hard, WildBench, AlpacaEval, and FairMT-Bench.
Enabling rapid model iteration. By demonstrating that GPT-4 judgments closely track human preferences, MT-Bench made it practical for model developers to run evaluation cycles in hours rather than weeks. This accelerated the pace of LLM development and fine-tuning research, enabling faster experimentation with training data mixes, hyperparameters, and alignment techniques.
Transparency about evaluation limitations. The paper's thorough analysis of judge biases (position, verbosity, self-enhancement) set a standard for transparency in evaluation research. Subsequent work on LLM evaluation has consistently referenced and built upon these findings, and bias mitigation strategies proposed in the paper (such as position swapping) have become standard practice.
Community infrastructure. The LMSYS leaderboard, which prominently featured MT-Bench scores alongside Chatbot Arena Elo ratings and MMLU scores, became a central resource for tracking the state of the art in conversational AI. The open-source release of all benchmark materials through FastChat lowered the barrier for researchers worldwide to participate in LLM evaluation.
Despite its contributions, MT-Bench has faced several criticisms from the research community, chief among them its small, static question set (which invites contamination and overfitting), its dependence on a single proprietary judge model, and the judge biases documented in the paper itself.
Many of these criticisms motivated the development of Arena-Hard, WildBench, and other next-generation benchmarks that use larger, dynamically updated question sets derived from real user interactions.