AlpacaEval is an automatic evaluation framework for instruction-following large language models (LLMs). Developed by Stanford University's Tatsu Lab, it measures how well a model follows open-ended user instructions by comparing its outputs against a reference model using an LLM-based judge. AlpacaEval has become one of the most widely adopted benchmarks in the LLM development pipeline due to its speed, low cost, and high correlation with human preference judgments. The latest version, AlpacaEval 2.0 with length-controlled win rates, achieves a 0.98 Spearman correlation with Chatbot Arena rankings while costing less than $10 and running in under three minutes.
AlpacaEval emerged from the broader Stanford Alpaca project, which began in early 2023 as an effort to create accessible instruction-following models. The original Stanford Alpaca model was a LLaMA 7B model fine-tuned on 52,000 instruction-following demonstrations generated by OpenAI's text-davinci-003. The project was led by researchers including Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto.
As the need for reliable and scalable evaluation of instruction-following models grew, the team developed AlpacaFarm, a simulation framework for methods that learn from human feedback. The AlpacaFarm paper was published at NeurIPS 2023, authored by Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm addressed three fundamental challenges in instruction-following model development: the high cost of human feedback data collection, the lack of trustworthy automated evaluation, and the absence of standardized reference implementations. The framework could replicate the RLHF process in under 24 hours for less than $200.
AlpacaEval was originally designed as the evaluation component of AlpacaFarm and was later released as a standalone benchmark and leaderboard, allowing any researcher or developer to evaluate their instruction-following model against a common standard. The evaluation set, auto-annotator methodology, and leaderboard are maintained as an open-source project on GitHub, with code under the Apache 2.0 license and data under CC BY-NC 4.0.
The AlpacaEval evaluation set consists of 805 instructions drawn from a simplified version of the AlpacaFarm evaluation set. These instructions were chosen to be representative of the types of interactions users had with the original Alpaca web demo. The dataset aggregates instructions from five distinct sources:
| Source | Number of Instructions | Description |
|---|---|---|
| Self-Instruct | 252 | Test set from the self-instruct method for generating instruction data |
| Open Assistant (OASST) | 188 | Test set from the Open Assistant conversational dataset |
| Anthropic Helpful | 129 | Test set from Anthropic's helpfulness-focused human feedback data |
| Vicuna | 80 | Test set curated by the Vicuna project team |
| Koala | 156 | Test set from the Koala conversational model project |
To create the AlpacaEval set, the original AlpacaFarm evaluation instructions and input fields were merged into a single instruction field. This modification affected roughly one quarter of the examples, all originating from the self-instruct subset. Reference outputs from text-davinci-003 were also regenerated without imposing length limits on the responses.
The 805 instructions span a wide range of tasks including creative writing, classification, open-ended question answering, brainstorming, coding, summarization, and general knowledge. This diversity is intended to provide a broad assessment of a model's instruction-following capabilities across different domains.
At its core, AlpacaEval works by having a powerful LLM (the "auto-annotator" or "judge") compare two outputs for the same instruction: one generated by the model being evaluated and one generated by a reference (baseline) model. For each of the 805 instructions, the judge decides which output it prefers. The model's score is the percentage of instructions for which the judge prefers its output over the reference output, expressed as a win rate.
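The final aggregation step is simple averaging. A minimal sketch (function name is illustrative):

```python
def win_rate(preferences):
    """Each entry is the judge's preference for one instruction:
    1.0 if it preferred the evaluated model's output, 0.0 if it
    preferred the reference's.  Weighted annotators may contribute
    fractional values in between."""
    return 100.0 * sum(preferences) / len(preferences)

# Judge prefers the evaluated model on 3 of 4 instructions
print(win_rate([1, 0, 1, 1]))  # -> 75.0
```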
The evaluation pipeline follows these steps:

1. Generate the evaluated model's output for each of the 805 instructions.
2. Pair each output with the reference model's output for the same instruction.
3. Ask the auto-annotator which output in each pair it prefers.
4. Aggregate the per-instruction preferences into a win rate (and, in AlpacaEval 2.0, a length-controlled win rate).
AlpacaEval supports multiple auto-annotator configurations. The two primary annotators are:
alpaca_eval_gpt4 (AlpacaEval 1.0 default): This annotator uses GPT-4 to make binary preference judgments. It achieves approximately 69.2% agreement with human annotators and costs around $13.60 per 1,000 evaluations. It attains a 0.97 Spearman correlation with Chatbot Arena at the leaderboard level.
weighted_alpaca_eval_gpt4_turbo (AlpacaEval 2.0 default): This improved annotator uses GPT-4 Turbo. Rather than generating a binary preference token, it extracts the log-probabilities (logprobs) of each preference option and computes a continuous, weighted preference score. This approach means the model outputs a single token per comparison, which reduces cost and latency considerably. The use of weighted logprobs also had the unexpected benefit of reducing the annotator's length bias compared to binary judgments. The decoding temperature is set to 1.0 when using logprobs.
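A minimal sketch of converting the two preference-token logprobs into a continuous score. Which literal tokens the annotator uses is a configuration detail assumed here, not taken from the source:

```python
import math

def weighted_preference(logprob_ref: float, logprob_model: float) -> float:
    """Turn the judge's logprobs for its two preference tokens into a
    continuous score in [0, 1], where 1.0 means the judge fully prefers
    the evaluated model's output over the reference's."""
    p_ref, p_model = math.exp(logprob_ref), math.exp(logprob_model)
    return p_model / (p_ref + p_model)  # renormalize over the two options

# Judge puts probability 0.55 on the model token, 0.20 on the reference token
score = weighted_preference(math.log(0.2), math.log(0.55))  # ~0.733
```

Averaging these continuous scores over the 805 instructions yields the weighted win rate directly, with no second sampling pass.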
Other annotators have been tested as well. For example, an annotator based on Claude achieved 65.3% human agreement at a cost of roughly $3.30 per 1,000 evaluations, with a Spearman correlation of 0.93 with Chatbot Arena.
The automatic annotators were validated against a substantial corpus of human preference data. The team collected approximately 20,000 human preference annotations, including 2,500 cross-annotations where four separate human annotators evaluated each of 650 instructions selected through stratified sampling across models and datasets. These human annotations serve as the ground truth for measuring auto-annotator quality.
AlpacaEval incorporates several design choices to ensure reliability and reproducibility. The order in which the two outputs are presented to the judge is randomized to counteract position bias, the annotator's decoding parameters are fixed, and completed annotations are cached so that repeated runs are deterministic and do not incur duplicate API costs.
The first version of AlpacaEval used text-davinci-003 (OpenAI's instruction-tuned GPT-3.5 variant) as the reference model and the alpaca_eval_gpt4 annotator as the default judge. In this setup, the win rate represents the percentage of the 805 instructions for which the judge prefers the evaluated model's output over text-davinci-003's output.
AlpacaEval 1.0 established the basic framework and demonstrated that LLM-based automatic evaluation could serve as a practical proxy for human judgment. However, it had a notable weakness: the evaluator exhibited a systematic preference for longer outputs. Models that produced verbose responses scored higher even when the additional length did not contribute meaningful content. This length bias made it possible to inflate scores by prompting models to generate longer answers.
Representative win rates from the AlpacaEval 1.0 leaderboard (against text-davinci-003):

| Model | Win Rate (%) | Standard Error |
|---|---|---|
| GPT-4 | 95.3 | 0.7 |
| Claude | 88.4 | 1.1 |
| ChatGPT | 86.1 | 1.2 |
| Guanaco 65B | 71.8 | 1.6 |
Because the reference model (text-davinci-003) was relatively weak by the standards of later models, top-tier models achieved very high win rates, compressing the useful range of the leaderboard near the ceiling.
AlpacaEval 2.0 was released to address the shortcomings of the original version. It introduced two major changes: a stronger reference model and an improved auto-annotator.
AlpacaEval 2.0 replaced text-davinci-003 with GPT-4 Turbo (specifically the November 6, 2023 preview version, gpt-4-1106-preview) as the baseline reference model. Using a much stronger reference model spread the distribution of scores across the leaderboard, since even competitive models frequently lose comparisons against GPT-4 Turbo outputs. This eliminated the ceiling effect that plagued the first version and provided better discrimination between high-performing models.
In AlpacaEval 2.0, GPT-4 Preview (11/06) serves as both the baseline and the auto-annotator, anchored at a 50.00% win rate by definition (since it is compared against itself).
The weighted_alpaca_eval_gpt4_turbo annotator uses GPT-4 Turbo with logprob-based scoring instead of binary preference generation. This single-token output approach is both cheaper and faster than the original GPT-4-based annotator. The continuous preference scores computed from logprobs provide more granular signal than binary win/loss decisions.
The most significant methodological contribution of AlpacaEval 2.0 is the introduction of length-controlled (LC) win rates, described in the paper "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators" by Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. The paper was published at COLM 2024.
The core problem is that LLM-based judges tend to prefer longer outputs, even when the extra length does not improve quality. The original AlpacaEval designers acknowledged that "other factors like length and style effects were not controlled for," noting that humans exhibited similar biases in their initial analysis. However, as the benchmark gained popularity as a leaderboard, these uncontrolled biases became targets for exploitation.
The severity of the length bias was dramatic. In experiments, GPT-4 Turbo's win rate fluctuated from 22.9% to 64.3% simply by prompting it to be more verbose or more concise. Even weaker models could achieve significant score increases through verbosity alone.
Length-controlled (LC) win rates address the length bias through a regression-based debiasing method. The approach answers a counterfactual question: "What would the preference be if the model's output and the baseline's output had the same length?"
The debiasing technique fits a generalized linear model (GLM) that predicts the auto-annotator's preference based on three components:

- a model term, capturing how much the annotator prefers the evaluated model's content over the baseline's;
- a length term, based on the difference in length between the two outputs;
- an instruction term, capturing instruction-specific difficulty.
The length-controlled win rate is then obtained by setting the length term to zero, effectively predicting what the preference would be if both outputs had equal length. Formally, the length-controlled preference is computed as:
LC win rate = 100 * E[logistic(model_term + instruction_term)]
where the expectation is taken over all instructions.
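A toy sketch of this idea, with heavy simplifications: the model and instruction terms are collapsed into a single intercept, the length feature is a tanh of the standardized length difference, and the GLM is fit by plain gradient descent rather than the paper's estimator:

```python
import math

def fit_lc_winrate(prefs, len_model, len_baseline, lr=0.1, steps=2000):
    """Fit a toy logistic GLM  p(prefer model) = sigmoid(b0 + b1 * length_feat)
    and return the length-controlled win rate obtained by zeroing the
    length term (the counterfactual where both outputs have equal length)."""
    n = len(prefs)
    diffs = [lm - lb for lm, lb in zip(len_model, len_baseline)]
    scale = (sum(d * d for d in diffs) / n) ** 0.5 or 1.0
    feats = [math.tanh(d / scale) for d in diffs]  # bounded length feature
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for y, x in zip(prefs, feats):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    # counterfactual "equal lengths": length feature = 0, only b0 remains
    return 100.0 / (1.0 + math.exp(-b0))
```

On data where output length and preference are uncorrelated, the length coefficient fits to zero and the LC win rate matches the raw win rate; when wins are driven partly by verbosity, the length coefficient absorbs that effect and the LC rate moves toward what content alone would predict.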
The method preserves several desirable properties:

- The metric remains interpretable as a win rate between 0% and 100%, with the baseline still anchored at exactly 50% against itself.
- Gaming the metric through verbosity prompting produces far smaller score swings than with raw win rates.
- Agreement with human rankings improves: the Spearman correlation with Chatbot Arena rises from roughly 0.94 to 0.98.
One of AlpacaEval's strongest selling points is its high correlation with Chatbot Arena (also known as LMSYS Chatbot Arena), a human preference evaluation platform where users vote on pairwise comparisons of anonymous model outputs.
| Metric | Spearman Correlation with Chatbot Arena |
|---|---|
| AlpacaEval 1.0 (raw win rate, text-davinci-003 baseline) | ~0.93 |
| AlpacaEval 2.0 (raw win rate, GPT-4 Turbo baseline) | ~0.94 |
| AlpacaEval 2.0 LC (length-controlled win rate) | 0.98 |
| MT-Bench | Lower than both AlpacaEval versions |
The 0.98 Spearman correlation makes AlpacaEval 2.0 LC the automatic benchmark with the highest reported correlation with Chatbot Arena among publicly available evaluation frameworks. Spearman rank correlation is used rather than Pearson because win-rate probabilities are related to Elo ratings through a logistic (log-linear) transformation rather than a linear one, so only rank order is directly comparable across the two scales. Correlation calculations were performed on benchmarks evaluating at least 25 models present in the Chatbot Arena, giving a sample of 38 overlapping models for AlpacaEval and 34 for MT-Bench.
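Because Spearman depends only on rank order, any monotone transform of the scores leaves it unchanged. A minimal tie-free implementation as a sketch:

```python
def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties: the Pearson
    correlation computed on the ranks rather than the raw values."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Rank order is identical even though the relationship is nonlinear
print(spearman([1, 2, 3, 4], [10, 100, 1000, 10000]))  # -> 1.0
```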
The high correlation matters because it validates AlpacaEval as a practical stand-in for expensive, slow human evaluations during model development. Researchers can use AlpacaEval to get rapid feedback on model quality without waiting for sufficient Chatbot Arena votes to accumulate.
The AlpacaEval 2.0 leaderboard ranks models by their length-controlled win rates against GPT-4 Turbo. The leaderboard includes over 120 models. Below is a selection of notable scores:
| Model | LC Win Rate (%) | Raw Win Rate (%) | Avg. Output Length (chars) |
|---|---|---|---|
| GPT-4 Omni (05/13) | 57.46 | 51.33 | 1,873 |
| GPT-4 Turbo (04/09) | 55.02 | 46.12 | 1,802 |
| Claude 3.5 Sonnet (06/20) | 52.37 | 40.56 | 1,488 |
| Yi-Large Preview | 51.89 | 57.47 | 2,335 |
| GPT-4o Mini (07/18) | 50.73 | 44.65 | 1,861 |
| GPT-4 Preview (11/06) (baseline) | 50.00 | 50.00 | 2,049 |
| Qwen1.5 110B Chat | 43.91 | 33.78 | 1,631 |
| Claude 3 Opus (02/29) | 40.51 | 29.11 | 1,388 |
| Llama 3.1 405B Instruct | 39.26 | 39.11 | 1,988 |
| GPT-4 (original) | 38.13 | 23.58 | 1,365 |
| Qwen2 72B Instruct | 38.07 | 29.85 | 1,626 |
| Claude 3 Sonnet (02/29) | 34.87 | 25.56 | 1,420 |
| Llama 3 70B Instruct | 34.42 | 33.18 | 1,919 |
| Mistral Large (24/02) | 32.65 | 21.44 | 1,362 |
| Mixtral 8x22B v0.1 | 30.88 | 22.21 | 1,445 |
| Claude 2 | 28.16 | 17.19 | 1,069 |
| Gemini Pro | 24.38 | 18.18 | 1,456 |
| Llama 3 8B Instruct | 22.92 | 22.57 | 1,899 |
| GPT-3.5 Turbo (06/13) | 22.72 | 14.13 | 1,328 |
| Llama 2 Chat 70B | 14.69 | 13.89 | 1,790 |
| Alpaca 7B | 5.88 | 2.59 | 396 |
The leaderboard reveals several patterns. Models with high raw win rates but disproportionately long outputs (such as Nanbeige Plus Chat v0.1, which scored 56.70% raw but only 44.46% LC) are penalized by length control. Conversely, models with concise but high-quality outputs (such as Claude 3.5 Sonnet, which achieved a 52.37% LC rate despite a relatively modest 40.56% raw rate) benefit from the debiasing.
Community-developed techniques such as Mixture of Agents (MoA) and best-of-N sampling have also been applied to push scores higher. For example, the SelfMoA approach combined with gemma-2-9b-it-WPO-HB achieved a 78.54% LC win rate, though such methods involve generating and selecting among multiple candidate outputs rather than evaluating a single model's direct responses.
AlpacaEval is distributed as a Python package (requiring Python 3.10 or higher) and can be installed via pip:
    pip install alpaca-eval
Basic evaluation requires an OpenAI API key (for the default GPT-4-based auto-annotator) and a JSON file containing the model's outputs:
    export OPENAI_API_KEY=<your-key>
    alpaca_eval --model_outputs 'outputs.json'
The outputs JSON file must contain entries with at least two fields: instruction (the prompt) and output (the model's response). AlpacaEval provides several commands:
| Command | Description |
|---|---|
| evaluate | Score model outputs against the reference baseline |
| evaluate_from_model | Generate outputs from a model endpoint and then evaluate them |
| make_leaderboard | Precompute and display the full leaderboard |
| analyze_evaluators | Compute quality metrics for auto-annotator configurations |
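For reference, a minimal script that writes a valid outputs file; the instructions, responses, and model name below are invented for illustration:

```python
import json

# Each entry must contain at least "instruction" and "output".
model_outputs = [
    {
        "instruction": "Name three primary colors.",
        "output": "Red, yellow, and blue.",
        "generator": "my-model",  # optional label for the evaluated model
    },
    {
        "instruction": "Summarize the water cycle in one sentence.",
        "output": "Water evaporates, condenses into clouds, and falls "
                  "back to the surface as precipitation.",
        "generator": "my-model",
    },
]

with open("outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)
```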
Users can also specify custom annotators, reference models, and evaluation sets through command-line parameters, making the framework flexible for specialized evaluation scenarios.
Despite its popularity and strong correlation with human judgments, AlpacaEval has several recognized limitations.
While length-controlled win rates significantly reduce length bias, they do not eliminate all confounding factors. The auto-annotator still exhibits preferences for certain stylistic patterns, including a bias toward outputs formatted as lists and a bias toward outputs that resemble the annotator model's own generation style. Because GPT-4 serves as both the annotator and the baseline in AlpacaEval 2.0, models fine-tuned on GPT-4 outputs may receive an inherent advantage.
The 805 instructions in AlpacaEval were derived from user interactions with an early Alpaca web demo and from several 2022-2023 era datasets. These instructions may not reflect how users interact with modern, more capable models. Advanced use cases such as multi-turn dialogue, complex reasoning chains, tool use, and long-document analysis are underrepresented or absent from the evaluation set.
Research has demonstrated that AlpacaEval can be exploited by adversarial strategies. A 2024 study showed that deliberately crafted "null models," which prepend specific adversarial prefixes to responses, could achieve LC win rates as high as 86.5% on AlpacaEval 2.0 regardless of the actual answer content. While the length-controlled metric reduces this vulnerability compared to raw win rates, it does not fully prevent gaming.
AlpacaEval evaluates only single-turn instruction following. It does not assess a model's ability to maintain context over multi-turn conversations, follow up on previous responses, or handle complex interactive workflows.
The LLM-based judge evaluates perceived quality and helpfulness rather than factual accuracy. A model that generates fluent, well-structured but factually incorrect responses may score higher than a model that provides correct but less polished answers. AlpacaEval does not incorporate fact-checking or grounding against external knowledge sources.
The default auto-annotators rely on OpenAI's GPT-4 API, which means evaluation costs are subject to OpenAI's pricing and the benchmark's reproducibility depends on continued API availability. Changes to the GPT-4 API (such as model updates or deprecations) could affect score consistency over time.
AlpacaEval occupies a specific niche in the LLM evaluation landscape. The following table compares it with other commonly used benchmarks:
| Benchmark | Type | Scope | Cost | Correlation with Human Preference |
|---|---|---|---|---|
| AlpacaEval 2.0 LC | Automatic (LLM judge) | 805 open-ended instructions | <$10 | 0.98 (Chatbot Arena) |
| MT-Bench | Automatic (LLM judge) | 80 multi-turn questions across 8 categories | <$5 | Lower than AlpacaEval |
| Chatbot Arena | Human preference (crowdsourced) | Open-ended, user-submitted | Free (crowdsourced) | Reference standard |
| MMLU | Multiple-choice knowledge test | 57 academic subjects | Free (local) | Measures knowledge, not preference |
| HumanEval | Code generation | 164 Python programming problems | Free (local) | Measures coding ability specifically |
AlpacaEval's advantage lies in its balance of speed, cost, and correlation with human preferences. It is particularly useful as a development-time metric when researchers need rapid feedback on whether a training run or fine-tuning approach has improved instruction-following quality. For definitive rankings, human evaluation through platforms like Chatbot Arena remains the gold standard.
AlpacaEval has had a significant influence on the LLM research and development community since its release. It is frequently cited in research papers introducing new models or training techniques, and many model release announcements include AlpacaEval scores alongside other benchmark results.
The benchmark helped popularize the concept of "LLM-as-a-judge" evaluation, where one language model is used to assess the quality of another model's outputs. This paradigm has since been adopted by other evaluation frameworks and has become a standard practice in the field.
The length-controlled debiasing methodology introduced in AlpacaEval 2.0 has influenced thinking about bias correction in automated evaluations more broadly. The authors have encouraged future work to apply similar debiasing techniques during reward model training in reinforcement learning from human feedback (RLHF) pipelines, where length biases can similarly distort training signals.
The open-source nature of the project, with its code, evaluation set, and leaderboard all publicly available, has contributed to reproducibility and transparency in LLM evaluation. Researchers can submit their models' outputs to the leaderboard, inspect the evaluation methodology, and even develop custom annotator configurations for specialized use cases.