# AlpacaEval

> Source: https://aiwiki.ai/wiki/alpacaeval
> Updated: 2026-06-21
> Categories: AI Benchmarks, Large Language Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

AlpacaEval is an automatic evaluation framework for instruction-following [large language models](/wiki/large_language_model) (LLMs) developed by Stanford University's Tatsu Lab. It measures how well a model follows open-ended user instructions by having an LLM-based judge compute a win rate against a fixed reference model across 805 instructions.[7] The project describes itself as "an automatic evaluator for instruction-following language models" that is "human-validated, high-quality, cheap, and fast."[7] AlpacaEval has become one of the most widely adopted benchmarks in the LLM development pipeline due to its speed, low cost, and high correlation with human preference judgments. The latest version, AlpacaEval 2.0 with length-controlled (LC) win rates, achieves a 0.98 Spearman correlation with [Chatbot Arena](/wiki/lmsys_chatbot_arena) rankings while costing less than $10 in OpenAI credits and running in under three minutes.[2][7]

## What is AlpacaEval used for?

AlpacaEval is used as a fast, low-cost stand-in for human evaluation during LLM development. Instead of waiting for crowdsourced human votes to accumulate, researchers run AlpacaEval to get same-day feedback on whether a training run or [fine-tuning](/wiki/fine_tuning) approach has improved a model's instruction-following quality. Because AlpacaEval 2.0 LC rankings have a 0.98 Spearman correlation with [Chatbot Arena](/wiki/lmsys_chatbot_arena) rankings, a single automated run that costs under $10 can approximate the ranking signal that would otherwise require thousands of human pairwise votes.[2] Model release announcements frequently report AlpacaEval scores alongside knowledge benchmarks such as [MMLU](/wiki/mmlu) and coding benchmarks such as [HumanEval](/wiki/humaneval).

## When was AlpacaEval released?

AlpacaEval emerged from the broader Stanford Alpaca project, which began in early 2023 as an effort to create accessible instruction-following models. The original [Stanford Alpaca](/wiki/stanford_alpaca) model was a [LLaMA](/wiki/llama) 7B model [fine-tuned](/wiki/fine_tuning) on 52,000 instruction-following demonstrations generated by OpenAI's text-davinci-003.[3] The project was led by researchers including Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto.[3]

As the need for reliable and scalable evaluation of instruction-following models grew, the team developed AlpacaFarm, a simulation framework for methods that learn from human feedback. The AlpacaFarm paper was published at NeurIPS 2023, authored by Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto.[1] AlpacaFarm addressed three fundamental challenges in instruction-following model development: the high cost of human feedback data collection, the lack of trustworthy automated evaluation, and the absence of standardized reference implementations. The framework could replicate the [RLHF](/wiki/reinforcement_learning_from_human_feedback) process in under 24 hours for less than $200.[1]

AlpacaEval was originally designed as the evaluation component of AlpacaFarm. It was then released as a standalone benchmark and leaderboard, allowing any researcher or developer to evaluate their instruction-following model against a common standard. The evaluation set, auto-annotator methodology, and leaderboard are maintained as an open-source project on GitHub under the Apache 2.0 code license and CC BY NC 4.0 data license.[7]

## What is in the AlpacaEval evaluation set?

The AlpacaEval evaluation set consists of 805 instructions drawn from a simplified version of the AlpacaFarm evaluation set.[7] These instructions were chosen to be representative of the types of interactions users had with the original Alpaca web demo. The dataset aggregates instructions from five distinct sources:

| Source | Number of Instructions | Description |
|--------|----------------------|-------------|
| Self-Instruct | 252 | Test set from the self-instruct method for generating instruction data |
| Open Assistant (OASST) | 188 | Test set from the Open Assistant conversational dataset |
| Anthropic Helpful | 129 | Test set from Anthropic's helpfulness-focused human feedback data |
| Vicuna | 80 | Test set curated by the Vicuna project team |
| Koala | 156 | Test set from the Koala conversational model project |

To create the AlpacaEval set, the original AlpacaFarm evaluation instructions and input fields were merged into a single instruction field. This modification affected roughly one quarter of the examples, all originating from the self-instruct subset. Reference outputs from text-davinci-003 were also regenerated without imposing length limits on the responses.[7]

The 805 instructions span a wide range of tasks including creative writing, classification, open-ended question answering, brainstorming, coding, summarization, and general knowledge. This diversity is intended to provide a broad assessment of a model's instruction-following capabilities across different domains. Because the same fixed, publicly released set of 805 instructions is reused for every submission, crafted adversarial outputs that game the benchmark can transfer across submissions, a property later exploited by null-model attacks.[6]

## How does AlpacaEval work?

### Pairwise Comparison with an LLM Judge

At its core, AlpacaEval works by having a powerful LLM (the "auto-annotator" or "judge") compare two outputs for the same instruction: one generated by the model being evaluated and one generated by a reference (baseline) model. For each of the 805 instructions, the judge decides which output it prefers. The model's score is the percentage of instructions for which the judge prefers its output over the reference output, expressed as a win rate.[7]

The evaluation pipeline follows these steps:

1. The model under evaluation generates responses to all 805 instructions.
2. Each model response is paired with the corresponding reference model response.
3. The auto-annotator (judge LLM) receives both responses for each instruction and determines which one is better.
4. The win rate is calculated as the fraction of comparisons where the judge prefers the evaluated model's output.
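The final step can be illustrated with a minimal sketch. It assumes each comparison has already been reduced to a number between 0 and 1 (1.0 when the judge prefers the evaluated model, 0.0 when it prefers the reference); the function name `win_rate` and the example values are hypothetical and are not taken from the AlpacaEval codebase.

```python
def win_rate(preferences):
    """Return the win rate (in percent) from per-instruction preferences.

    Each entry is the judge's preference for the evaluated model:
    1.0 = evaluated model preferred, 0.0 = reference preferred.
    Intermediate values (for example 0.5 for a tie) are also allowed.
    """
    if not preferences:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(preferences) / len(preferences)


# Hypothetical judge decisions for five instructions (a real run has 805).
example_preferences = [1.0, 0.0, 1.0, 1.0, 0.0]
print(win_rate(example_preferences))  # 60.0
```

With three wins out of five comparisons, the sketch reports a 60% win rate.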

### Auto-Annotators

AlpacaEval supports multiple auto-annotator configurations. The two primary annotators are:

**alpaca_eval_gpt4 (AlpacaEval 1.0 default):** This annotator uses [GPT-4](/wiki/gpt4) to make binary preference judgments. It achieves approximately 69.2% agreement with human annotators and costs around $13.60 per 1,000 evaluations. It attains a 0.97 Spearman correlation with Chatbot Arena at the leaderboard level.[7]

**weighted_alpaca_eval_gpt4_turbo (AlpacaEval 2.0 default):** This improved annotator uses [GPT-4](/wiki/gpt4) Turbo. Rather than generating a binary preference token, it extracts the log-probabilities (logprobs) of each preference option and computes a continuous, weighted preference score. This approach means the model outputs a single token per comparison, which reduces cost and latency considerably. The use of weighted logprobs also had the unexpected benefit of reducing the annotator's length bias compared to binary judgments. The decoding temperature is set to 1.0 when using logprobs.[2]
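The idea of turning logprobs into a continuous preference can be sketched as follows. This is an illustration under stated assumptions, not the annotator's actual code: it assumes the judge returns one log-probability for the token that selects the evaluated model and one for the token that selects the reference, and it renormalizes over just those two options. The function name `weighted_preference` is hypothetical.

```python
import math


def weighted_preference(logprob_model, logprob_reference):
    """Continuous preference for the evaluated model, in [0, 1].

    Assumption: the probability mass is renormalized over the two
    preference options only, which is equivalent to a logistic
    function of the log-probability difference.
    """
    return 1.0 / (1.0 + math.exp(logprob_reference - logprob_model))


# Hypothetical judge output: 70% mass on the model, 20% on the reference.
pref = weighted_preference(math.log(0.7), math.log(0.2))
print(round(pref, 3))  # 0.778
```

A binary judge would record this comparison as a full win (1.0), while the weighted score records roughly 0.78, which is the extra granularity the logprob approach provides.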

Other annotators have been tested as well. For example, an annotator based on [Claude](/wiki/claude) achieved 65.3% human agreement at a cost of roughly $3.30 per 1,000 evaluations, with a Spearman correlation of 0.93 with Chatbot Arena.[7]

### Human Validation

The automatic annotators were validated against a substantial corpus of human preference data. The team collected approximately 20,000 human preference annotations, including 2,500 cross-annotations where four separate human annotators evaluated each of 650 instructions selected through stratified sampling across models and datasets.[1] These human annotations serve as the ground truth for measuring auto-annotator quality.

### Technical Features

AlpacaEval incorporates several design choices to ensure reliability and reproducibility:

- **Output randomization:** The order in which the two responses (model vs. reference) are presented to the judge is randomized to mitigate position bias.
- **Caching:** Previously computed annotations are cached to avoid redundant API calls and ensure reproducibility.
- **Seeding:** Random seeds are derived from the instruction text to make results deterministic across runs.
- **Batching:** Multiple comparisons can be batched together for cost efficiency.
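Output randomization and instruction-derived seeding can be combined as in the sketch below. The hashing scheme (SHA-256 truncated to 32 bits) and the helper names `seed_from_instruction` and `order_outputs` are assumptions made for illustration; the source does not specify how AlpacaEval derives its seeds.

```python
import hashlib
import random


def seed_from_instruction(instruction):
    """Derive a stable integer seed from the instruction text (assumed scheme)."""
    digest = hashlib.sha256(instruction.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)


def order_outputs(instruction, model_output, reference_output):
    """Return (first, second, model_is_first) in a deterministic random order."""
    rng = random.Random(seed_from_instruction(instruction))
    model_is_first = rng.random() < 0.5
    if model_is_first:
        return model_output, reference_output, True
    return reference_output, model_output, False


first_run = order_outputs("Write a haiku about autumn.", "A", "B")
second_run = order_outputs("Write a haiku about autumn.", "A", "B")
print(first_run == second_run)  # True
```

Because the seed depends only on the instruction, the presentation order varies across instructions but is identical every time the same instruction is evaluated.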

## AlpacaEval 1.0

The first version of AlpacaEval used text-davinci-003 (OpenAI's instruction-tuned GPT-3.5 variant) as the reference model and the alpaca_eval_gpt4 annotator as the default judge. In this setup, the win rate represents the percentage of the 805 instructions for which the judge prefers the evaluated model's output over text-davinci-003's output.[7]

AlpacaEval 1.0 established the basic framework and demonstrated that LLM-based automatic evaluation could serve as a practical proxy for human judgment. However, it had a notable weakness: the evaluator exhibited a systematic preference for longer outputs. Models that produced verbose responses scored higher even when the additional length did not contribute meaningful content. This length bias made it possible to inflate scores by prompting models to generate longer answers.[2]

### Notable AlpacaEval 1.0 Scores

| Model | Win Rate (%) | Standard Error |
|-------|-------------|----------------|
| [GPT-4](/wiki/gpt4) | 95.3 | 0.7 |
| [Claude](/wiki/claude) | 88.4 | 1.1 |
| [ChatGPT](/wiki/chatgpt) | 86.1 | 1.2 |
| Guanaco 65B | 71.8 | 1.6 |

Because the reference model (text-davinci-003) was relatively weak by the standards of later models, top-tier models achieved very high win rates, compressing the useful range of the leaderboard near the ceiling.
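The standard errors in the table are consistent with a simple binomial approximation over 805 binary comparisons, which also shows why the uncertainty shrinks as win rates approach the ceiling. The sketch below is an approximation for illustration and may differ from how the AlpacaEval package computes its reported standard errors; `binomial_standard_error` is a hypothetical helper.

```python
import math

N_INSTRUCTIONS = 805  # size of the AlpacaEval evaluation set


def binomial_standard_error(win_rate_percent, n=N_INSTRUCTIONS):
    """Approximate standard error (in percentage points) of a win rate."""
    p = win_rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


for name, rate in [("GPT-4", 95.3), ("Claude", 88.4),
                   ("ChatGPT", 86.1), ("Guanaco 65B", 71.8)]:
    print(name, round(binomial_standard_error(rate), 1))
# GPT-4 0.7
# Claude 1.1
# ChatGPT 1.2
# Guanaco 65B 1.6
```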

## What changed in AlpacaEval 2.0?

AlpacaEval 2.0 was released to address the shortcomings of the original version. It introduced two major changes: a stronger reference model and an improved auto-annotator.[2]

### Stronger Reference Model

AlpacaEval 2.0 replaced text-davinci-003 with GPT-4 Turbo (specifically the November 6, 2023 preview version, gpt-4-1106-preview) as the baseline reference model.[7] Using a much stronger reference model spread the distribution of scores across the leaderboard, since even competitive models frequently lose comparisons against GPT-4 Turbo outputs. This eliminated the ceiling effect that plagued the first version and provided better discrimination between high-performing models.

In AlpacaEval 2.0, GPT-4 Preview (11/06) serves as both the baseline and the auto-annotator. As the baseline, it is anchored at a 50.00% win rate by definition, since it is compared against itself.[7]

### Improved Auto-Annotator

The weighted_alpaca_eval_gpt4_turbo annotator uses GPT-4 Turbo with logprob-based scoring instead of binary preference generation. This single-token output approach is both cheaper and faster than the original GPT-4-based annotator. The continuous preference scores computed from logprobs provide more granular signal than binary win/loss decisions.[2]

### Length Bias Mitigation

The most significant methodological contribution of AlpacaEval 2.0 is the introduction of length-controlled (LC) win rates, described in the paper "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators" by Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. The paper was published at COLM 2024.[2]

The core problem is that LLM-based judges tend to prefer longer outputs, even when the extra length does not improve quality. The original AlpacaEval designers acknowledged that "other factors like length and style effects were not controlled for," noting that humans exhibited similar biases in their initial analysis.[2] However, as the benchmark gained popularity as a leaderboard, these uncontrolled biases became targets for exploitation.

The severity of the length bias was dramatic. In experiments, GPT-4 Turbo's win rate ranged from 22.9% when it was prompted to be more concise to 64.3% when it was prompted to be more verbose. Even weaker models could achieve significant score increases through verbosity alone.[2]

## How do length-controlled win rates work?

Length-controlled (LC) win rates address the length bias through a regression-based debiasing method. The approach answers a counterfactual question: "What would the preference be if the model's output and the baseline's output had the same length?"[2] Applying this correction raised AlpacaEval 2.0's Spearman correlation with [Chatbot Arena](/wiki/lmsys_chatbot_arena) from 0.94 to 0.98 while sharply reducing the benchmark's gameability through verbosity.[2]

### Methodology

The debiasing technique fits a generalized linear model (GLM) that predicts the auto-annotator's preference based on three components:

1. **Model term:** Captures the inherent quality difference between the evaluated model and the baseline, independent of length.
2. **Length term:** Models the effect of the length difference between the model's output and the baseline's output on the annotator's preference. A tanh transformation is applied to the normalized length difference to account for diminishing returns of additional length.
3. **Instruction term:** Accounts for task-specific difficulty variations across different instructions.

The length-controlled win rate is then obtained by setting the length term to zero, effectively predicting what the preference would be if both outputs had equal length.[2] Formally, the length-controlled preference is computed as:

> LC win rate = 100 * E[logistic(model_term + instruction_term)]

where the expectation is taken over all instructions.
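The formula can be sketched in code as follows. This is a simplified illustration, not the published implementation: it assumes the GLM has already been fitted, and the coefficient values, the normalized length difference, and the function names are all hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def predicted_preference(model_term, length_coef, normalized_length_diff,
                         instruction_term):
    """GLM prediction: model term + length term + instruction term."""
    length_term = length_coef * math.tanh(normalized_length_diff)
    return logistic(model_term + length_term + instruction_term)


def lc_win_rate(model_term, instruction_terms):
    """Average prediction with the length term set to zero, in percent."""
    prefs = [predicted_preference(model_term, 0.0, 0.0, t)
             for t in instruction_terms]
    return 100.0 * sum(prefs) / len(prefs)


# Hypothetical fitted values: a longer output raises the raw prediction...
raw = predicted_preference(0.2, 0.8, 1.5, 0.0)
# ...but the length-controlled prediction ignores the length advantage.
controlled = predicted_preference(0.2, 0.0, 0.0, 0.0)
print(raw > controlled)  # True

# With every term at zero (baseline against itself) the score is 50%.
print(lc_win_rate(0.0, [0.0, 0.0, 0.0]))  # 50.0
```

Zeroing the length coefficient is what implements the counterfactual of "equal length": the model and instruction terms are left untouched, so only the contribution of verbosity is removed.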

### Statistical Properties

The method preserves several desirable properties:

- **Identity:** Comparing the baseline model to itself yields exactly 50%.
- **Symmetry:** Swapping the model and baseline produces scores that sum to 100%.
- **Bounded:** Scores remain within the [0%, 100%] range.

### Implementation Details

- A separate logistic regression is fitted for each model, so adding a new model to the leaderboard does not retroactively change any previously computed scores.
- Five-fold cross-validation is used during fitting.
- L2 regularization is applied to prevent overfitting, with weak regularization on the length coefficient to guard against adversarial truncation attacks.
- The GLM uses cross-entropy loss for optimization.

## How well does AlpacaEval correlate with Chatbot Arena?

One of AlpacaEval's strongest selling points is its high correlation with [Chatbot Arena](/wiki/lmsys_chatbot_arena) (also known as LMSYS Chatbot Arena), a human preference evaluation platform where users vote on pairwise comparisons of anonymous model outputs.[4]

| Metric | Spearman Correlation with Chatbot Arena |
|--------|----------------------------------------|
| AlpacaEval 1.0 (raw win rate, text-davinci-003 baseline) | ~0.93 |
| AlpacaEval 2.0 (raw win rate, GPT-4 Turbo baseline) | ~0.94 |
| AlpacaEval 2.0 LC (length-controlled win rate) | **0.98** |
| [MT-Bench](/wiki/mt_bench) | Lower than both AlpacaEval versions |

The 0.98 Spearman correlation makes AlpacaEval 2.0 LC the automatic benchmark with the highest reported correlation with Chatbot Arena among publicly available evaluation frameworks. The Spearman rank correlation is used rather than Pearson correlation because win-rate probabilities are log-linearly related to Elo ratings rather than linearly related. Correlation calculations were performed on benchmarks evaluating at least 25 models present in the Chatbot Arena, providing a sample of 38 overlapping models for AlpacaEval and 34 for MT-Bench.[2]
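For readers unfamiliar with the metric, the sketch below computes a Spearman rank correlation from scratch using only the standard library. The scores are hypothetical and are not taken from either leaderboard; the helper names are illustrative.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical benchmark scores and Arena ratings for five models.
benchmark = [57.5, 50.0, 40.5, 34.9, 22.9]
arena_rating = [1250, 1255, 1200, 1180, 1150]
print(round(spearman(benchmark, arena_rating), 2))  # 0.9
```

Only the ordering of the models matters here: swapping the top two models lowers the correlation from 1.0 to 0.9, regardless of how far apart their scores are.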

The high correlation matters because it validates AlpacaEval as a practical stand-in for expensive, slow human evaluations during model development. Researchers can use AlpacaEval to get rapid feedback on model quality without waiting for sufficient Chatbot Arena votes to accumulate.

## Leaderboard

The AlpacaEval 2.0 leaderboard ranks models by their length-controlled win rates against GPT-4 Turbo. The leaderboard includes over 120 models.[7] Below is a selection of notable scores:

### Selected AlpacaEval 2.0 Scores

| Model | LC Win Rate (%) | Raw Win Rate (%) | Avg. Output Length (chars) |
|-------|----------------|-------------------|---------------------------|
| GPT-4 Omni (05/13) | 57.46 | 51.33 | 1,873 |
| GPT-4 Turbo (04/09) | 55.02 | 46.12 | 1,802 |
| [Claude 3.5 Sonnet](/wiki/claude) (06/20) | 52.37 | 40.56 | 1,488 |
| Yi-Large Preview | 51.89 | 57.47 | 2,335 |
| GPT-4o Mini (07/18) | 50.73 | 44.65 | 1,861 |
| GPT-4 Preview (11/06) (baseline) | 50.00 | 50.00 | 2,049 |
| Qwen1.5 110B Chat | 43.91 | 33.78 | 1,631 |
| [Claude 3 Opus](/wiki/claude) (02/29) | 40.51 | 29.11 | 1,388 |
| [Llama 3.1](/wiki/llama3) 405B Instruct | 39.26 | 39.11 | 1,988 |
| [GPT-4](/wiki/gpt4) (original) | 38.13 | 23.58 | 1,365 |
| Qwen2 72B Instruct | 38.07 | 29.85 | 1,626 |
| [Claude 3 Sonnet](/wiki/claude) (02/29) | 34.87 | 25.56 | 1,420 |
| [Llama 3](/wiki/llama3) 70B Instruct | 34.42 | 33.18 | 1,919 |
| [Mistral](/wiki/mistral) Large (24/02) | 32.65 | 21.44 | 1,362 |
| Mixtral 8x22B v0.1 | 30.88 | 22.21 | 1,445 |
| [Claude 2](/wiki/claude) | 28.16 | 17.19 | 1,069 |
| Gemini Pro | 24.38 | 18.18 | 1,456 |
| [Llama 3](/wiki/llama3) 8B Instruct | 22.92 | 22.57 | 1,899 |
| GPT-3.5 Turbo (06/13) | 22.72 | 14.13 | 1,328 |
| [Llama 2](/wiki/llama) Chat 70B | 14.69 | 13.89 | 1,790 |
| Alpaca 7B | 5.88 | 2.59 | 396 |

The leaderboard reveals several patterns. Models with high raw win rates but disproportionately long outputs (such as Nanbeige Plus Chat v0.1, which scored 56.70% raw but only 44.46% LC) are penalized by length control. Conversely, models with concise but high-quality outputs (such as Claude 3.5 Sonnet, which achieved a 52.37% LC rate despite a relatively modest 40.56% raw rate) benefit from the debiasing.[7]

Community-developed techniques such as Mixture of Agents (MoA) and best-of-N sampling have also been applied to push scores higher. For example, the SelfMoA approach combined with gemma-2-9b-it-WPO-HB achieved a 78.54% LC win rate, though such methods involve generating and selecting among multiple candidate outputs rather than evaluating a single model's direct responses.[7]

## How do you run AlpacaEval?

AlpacaEval is distributed as a Python package (requiring Python 3.10 or higher) and can be installed via pip:[7]

```bash
pip install alpaca-eval
```

Basic evaluation requires an OpenAI API key (for the default GPT-4-based auto-annotator) and a JSON file containing the model's outputs:

```bash
export OPENAI_API_KEY=<your-key>
alpaca_eval --model_outputs 'outputs.json'
```

The outputs JSON file must contain entries with at least two fields: `instruction` (the prompt) and `output` (the model's response). AlpacaEval provides several commands:

| Command | Description |
|---------|-------------|
| `evaluate` | Score model outputs against the reference baseline |
| `evaluate_from_model` | Generate outputs from a model endpoint and then evaluate them |
| `make_leaderboard` | Precompute and display the full leaderboard |
| `analyze_evaluators` | Compute quality metrics for auto-annotator configurations |

Users can also specify custom annotators, reference models, and evaluation sets through command-line parameters, making the framework flexible for specialized evaluation scenarios.[7]
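A model outputs file with the two required fields can be produced with a few lines of Python. This is a minimal sketch: the instructions and responses are placeholders, and the file name `outputs.json` simply matches the command shown earlier in this section.

```python
import json

# Placeholder records; a real file would hold one entry per evaluation
# instruction, with the response generated by the model being evaluated.
model_outputs = [
    {"instruction": "Name three primary colors.",
     "output": "Red, yellow, and blue."},
    {"instruction": "Write a one-line greeting.",
     "output": "Hello, and welcome!"},
]

# Check the two required fields before writing the file.
for record in model_outputs:
    missing = {"instruction", "output"} - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")

with open("outputs.json", "w", encoding="utf-8") as f:
    json.dump(model_outputs, f, ensure_ascii=False, indent=2)
```

The resulting file can then be passed to the `--model_outputs` option.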

## What are the limitations of AlpacaEval?

Despite its popularity and strong correlation with human judgments, AlpacaEval has several recognized limitations. The maintainers are explicit that the debiasing is partial, noting that "we are only controlling for length bias" and that auto-annotators give "more importance to the style of the output than its content (e.g. factuality)."[7]

### Residual Length and Style Biases

While length-controlled win rates significantly reduce length bias, they do not eliminate all confounding factors. The auto-annotator still exhibits preferences for certain stylistic patterns, including a bias toward outputs formatted as lists and a bias toward outputs that resemble the annotator model's own generation style.[2] Because GPT-4 Turbo serves as both the annotator and the baseline in AlpacaEval 2.0, models fine-tuned on GPT-4 outputs may receive an inherent advantage.

### Instruction Set Representativeness

The 805 instructions in AlpacaEval were derived from user interactions with an early Alpaca web demo and from several 2022-2023 era datasets. These instructions may not reflect how users interact with modern, more capable models. Advanced use cases such as multi-turn dialogue, complex reasoning chains, tool use, and long-document analysis are underrepresented or absent from the evaluation set.

### Adversarial Vulnerability

Research has demonstrated that AlpacaEval can be exploited by adversarial strategies. A 2024 study showed that deliberately crafted "null models," which return the same adversarially constructed response to every instruction regardless of its content, could achieve LC win rates as high as 86.5% on AlpacaEval 2.0.[6] While the length-controlled metric reduces this vulnerability compared to raw win rates, it does not fully prevent gaming.

### Single-Turn Only

AlpacaEval evaluates only single-turn instruction following. It does not assess a model's ability to maintain context over multi-turn conversations, follow up on previous responses, or handle complex interactive workflows.

### Factuality Not Directly Measured

The LLM-based judge evaluates perceived quality and helpfulness rather than factual accuracy. A model that generates fluent, well-structured but factually incorrect responses may score higher than a model that provides correct but less polished answers. AlpacaEval does not incorporate fact-checking or grounding against external knowledge sources.

### Cost Dependency on OpenAI API

The default auto-annotators rely on OpenAI's GPT-4 API, which means evaluation costs are subject to OpenAI's pricing and the benchmark's reproducibility depends on continued API availability. Changes to the GPT-4 API (such as model updates or deprecations) could affect score consistency over time.

## How does AlpacaEval differ from MT-Bench and Chatbot Arena?

AlpacaEval occupies a specific niche in the LLM evaluation landscape. The following table compares it with other commonly used benchmarks:

| Benchmark | Type | Scope | Cost | Correlation with Human Preference |
|-----------|------|-------|------|----------------------------------|
| AlpacaEval 2.0 LC | Automatic (LLM judge) | 805 open-ended instructions | <$10 | 0.98 (Chatbot Arena) |
| [MT-Bench](/wiki/mt_bench) | Automatic (LLM judge) | 80 multi-turn questions across 8 categories | <$5 | Lower than AlpacaEval |
| [Chatbot Arena](/wiki/lmsys_chatbot_arena) | Human preference (crowdsourced) | Open-ended, user-submitted | Free (crowdsourced) | Reference standard |
| [MMLU](/wiki/mmlu) | Multiple-choice knowledge test | 57 academic subjects | Free (local) | Measures knowledge, not preference |
| [HumanEval](/wiki/humaneval) | Code generation | 164 Python programming problems | Free (local) | Measures coding ability specifically |

The key distinction is that AlpacaEval uses a single-turn, win-rate format judged by an LLM, [MT-Bench](/wiki/mt_bench) scores multi-turn answers on an absolute 1-to-10 scale, and [Chatbot Arena](/wiki/lmsys_chatbot_arena) collects live human votes. AlpacaEval's advantage lies in its balance of speed, cost, and correlation with human preferences. It is particularly useful as a development-time metric when researchers need rapid feedback on whether a training run or fine-tuning approach has improved instruction-following quality. For definitive rankings, human evaluation through platforms like Chatbot Arena remains the gold standard.

## Impact and Adoption

AlpacaEval has had a significant influence on the LLM research and development community since its release. It is frequently cited in research papers introducing new models or training techniques, and many model release announcements include AlpacaEval scores alongside other benchmark results.

The benchmark helped popularize the concept of "LLM-as-a-judge" evaluation, where one language model is used to assess the quality of another model's outputs.[5] This paradigm has since been adopted by other evaluation frameworks and has become a standard practice in the field.

The length-controlled debiasing methodology introduced in AlpacaEval 2.0 has influenced thinking about bias correction in automated evaluations more broadly. The authors have encouraged future work to apply similar debiasing techniques during reward model training in [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) (RLHF) pipelines, where length biases can similarly distort training signals.[2]

The open-source nature of the project, with its code, evaluation set, and leaderboard all publicly available, has contributed to reproducibility and transparency in LLM evaluation.[7] Researchers can submit their models' outputs to the leaderboard, inspect the evaluation methodology, and even develop custom annotator configurations for specialized use cases.

## See Also

- [Chatbot Arena](/wiki/lmsys_chatbot_arena)
- [MT-Bench](/wiki/mt_bench)
- [MMLU](/wiki/mmlu)
- [Stanford Alpaca](/wiki/stanford_alpaca)
- [Large Language Models](/wiki/large_language_model)
- [Reinforcement Learning from Human Feedback](/wiki/reinforcement_learning_from_human_feedback)
- [GPT-4](/wiki/gpt4)
- [Fine-Tuning](/wiki/fine_tuning)

## References

1. Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). "AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback." *Advances in Neural Information Processing Systems 36 (NeurIPS 2023)*. arXiv:2305.14387.

2. Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators." *Conference on Language Modeling (COLM 2024)*. arXiv:2404.04475.

3. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). "Stanford Alpaca: An Instruction-following LLaMA Model." GitHub repository, Stanford University.

4. Chiang, W. L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., & Stoica, I. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132.

5. Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." *Advances in Neural Information Processing Systems 36 (NeurIPS 2023)*. arXiv:2306.05685.

6. Zheng, X., Pang, T., Du, C., Liu, Q., Jiang, J., & Lin, M. (2024). "Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates." arXiv:2410.07137.

7. tatsu-lab/alpaca_eval. GitHub repository. https://github.com/tatsu-lab/alpaca_eval