Alpaca (model)
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v2 · 1,804 words
Alpaca is an instruction-following language model released on March 13, 2023 by Stanford University's Center for Research on Foundation Models (CRFM). It was built by fine-tuning Meta's 7-billion-parameter LLaMA model on 52,000 instruction-following demonstrations generated automatically with OpenAI's text-davinci-003, at a total cost the team estimated at under $600 [1]. In a small blind human evaluation, Alpaca won 90 of 179 head-to-head comparisons against text-davinci-003, a far larger commercial model. The authors described the two models' performance as "very similar" despite Alpaca's much smaller size, a result widely read as evidence that useful instruction-tuned models could be reproduced cheaply from open weights [1]. Alpaca's web demo was taken offline about a week after launch, with hosting costs and inadequate content filters cited as reasons, and the model was restricted to non-commercial research use. Even so, its recipe and 52K dataset catalyzed the 2023 wave of open instruction-tuned models, including Vicuna and Koala, and gave its name to the AlpacaEval benchmark lineage [1][4][5][8].
Alpaca appeared in the months following the launch of ChatGPT, when capable instruction-following models were available almost exclusively as closed commercial APIs, leaving academics without an accessible model whose failure modes they could study and try to fix [1]. Meta AI had released LLaMA in February 2023 under a non-commercial research license, providing strong open base models but not instruction-tuned ones.
The project was carried out in the lab of Stanford assistant professor Tatsunori Hashimoto by five student researchers who contributed equally (Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li), together with faculty co-authors Carlos Guestrin, Percy Liang, and Hashimoto [1][2]. The team framed the release narrowly as a research artifact, stating that "Alpaca is intended only for academic research and any commercial use is prohibited" [1]. The initial release included the 52,000-example training dataset, the data generation code, and the training recipe; a fine-tuned weight diff against LLaMA was published later [2].
Alpaca combined two ingredients: a strong pretrained base model (LLaMA-7B) and inexpensive machine-generated instruction tuning data. To produce the data, the team adapted the self-instruct method of Wang et al. (December 2022), in which a language model is prompted with a small pool of human-written seed tasks and asked to invent new instructions and corresponding outputs [3]. Starting from self-instruct's 175 human-written seed instruction-output pairs, the Stanford team prompted text-davinci-003 to generate further examples and simplified the original pipeline in three ways: instructions were generated 20 at a time through more aggressive batching, the distinction between classification and non-classification tasks was dropped, and only a single instance was generated per instruction rather than two or three [2]. The process yielded 52,000 unique instruction-following demonstrations for under $500 in OpenAI API fees [1].
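The batched generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual pipeline: the prompt wording, the helper names `build_batch_prompt` and `min_requests`, the number of in-context seed examples (three), and the toy seed tasks are all hypothetical. Only the batch size of 20 and the 52,000-example target come from the text above. No API call is made; the sketch only shows how a prompt might be assembled and how many requests the target implies at best.

```python
import math
import random


def build_batch_prompt(seed_tasks, batch_size=20, num_shots=3, rng=None):
    """Assemble one prompt that asks a teacher model for a batch of new tasks.

    seed_tasks: list of dicts with "instruction" and "output" keys (assumed schema).
    num_shots:  how many seed tasks to show as examples (assumed value).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    shots = rng.sample(seed_tasks, num_shots)
    parts = ["Here are example tasks:"]
    for task in shots:
        parts.append(f"Instruction: {task['instruction']}\nOutput: {task['output']}")
    parts.append(
        f"Write {batch_size} new, diverse tasks in the same format, "
        "one Instruction and one Output each."
    )
    return "\n\n".join(parts)


def min_requests(target_examples, batch_size=20):
    """Lower bound on requests if every generated task were usable."""
    return math.ceil(target_examples / batch_size)


# Toy seed pool standing in for the 175 human-written seed tasks.
seed_tasks = [
    {"instruction": "Translate 'good morning' into French.", "output": "Bonjour."},
    {"instruction": "List three primary colors.", "output": "Red, yellow, blue."},
    {"instruction": "Give a synonym for 'quick'.", "output": "Fast."},
    {"instruction": "What is 2 + 2?", "output": "4"},
]

prompt = build_batch_prompt(seed_tasks)
requests_lower_bound = min_requests(52_000)
print(prompt)
print(requests_lower_bound)
```

In practice a pipeline of this kind would also need to parse the model's reply and discard malformed or duplicate tasks, so the real number of requests would be higher than this lower bound.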
Fine-tuning was performed with Hugging Face's training framework as plain supervised learning on the generated demonstrations. Training the 7B model took about three hours on eight 80 GB NVIDIA A100 GPUs, which the team estimated at less than $100 on most cloud providers, putting the entire project under $600 [1]. The repository also documents a configuration for a 13B variant [2].
| Hyperparameter | Alpaca-7B | Alpaca-13B |
|---|---|---|
| Base model | LLaMA-7B | LLaMA-13B |
| Batch size | 128 | 128 |
| Learning rate | 2e-5 | 1e-5 |
| Epochs | 3 | 5 |
| Max sequence length | 512 | 512 |
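The cost and training figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this article and treats the stated upper bounds ("under $500", "less than $100") as if they were exact. The step counts additionally assume that the table's batch size is the effective global batch and that the final partial batch of each epoch is kept; both are assumptions rather than details confirmed by the source.

```python
import math

# Figures quoted in the article, treated as upper bounds.
DATA_COST_USD = 500     # OpenAI API fees for data generation
TRAIN_COST_USD = 100    # cloud compute for fine-tuning the 7B model
NUM_EXAMPLES = 52_000
NUM_GPUS = 8
TRAIN_HOURS = 3

total_cost = DATA_COST_USD + TRAIN_COST_USD        # project ceiling in USD
gpu_hours = NUM_GPUS * TRAIN_HOURS                 # A100 GPU-hours for the 7B run
implied_rate = TRAIN_COST_USD / gpu_hours          # USD per GPU-hour implied by the ceiling
cost_per_example = DATA_COST_USD / NUM_EXAMPLES    # USD per generated demonstration


def optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return math.ceil(num_examples / batch_size) * epochs


steps_7b = optimizer_steps(NUM_EXAMPLES, batch_size=128, epochs=3)
steps_13b = optimizer_steps(NUM_EXAMPLES, batch_size=128, epochs=5)

print(f"total cost ceiling: ${total_cost}")
print(f"GPU-hours: {gpu_hours}, implied rate: ${implied_rate:.2f}/GPU-hour")
print(f"data cost per example: ${cost_per_example:.4f}")
print(f"optimizer steps: 7B={steps_7b}, 13B={steps_13b}")
```

Under these assumptions the data works out to roughly one cent per demonstration, and the 7B run to a little over a thousand optimizer steps.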
Each training example followed a simple instruction, optional input, and output schema. This "Alpaca format" became a de facto standard for community instruction datasets, and the 52K dataset itself was widely reused, cleaned, extended, and translated by other projects [2].
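A minimal sketch of how a record in this schema can be turned into a training prompt is shown below. The field names `instruction`, `input`, and `output` follow the schema described above. The template strings follow the wording commonly reproduced from the Stanford repository, but treat the exact text as an assumption to verify against the repository itself. The helper `format_example` and the sample records are hypothetical.

```python
# Template wording is an assumption; verify against the original repository.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_example(example):
    """Return (prompt, target) for one instruction / input / output record."""
    extra_input = (example.get("input") or "").strip()
    if extra_input:
        prompt = PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=extra_input
        )
    else:
        # The input field is optional; fall back to the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    return prompt, example["output"]


with_input = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "I loved the film.",
    "output": "Positive",
}
without_input = {
    "instruction": "Name the largest planet in the solar system.",
    "input": "",
    "output": "Jupiter",
}

prompt_a, target_a = format_example(with_input)
prompt_b, target_b = format_example(without_input)
print(prompt_a)
print(prompt_b)
```

Supervised fine-tuning then trains the model to continue each prompt with its target text.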
The authors evaluated Alpaca with a blind pairwise human comparison against text-davinci-003, conducted by the five student authors on inputs from the self-instruct evaluation set, which focuses on everyday user-oriented instructions such as email writing, social media, and productivity tasks [1]. Alpaca won 90 of the 179 comparisons, against 89 wins for text-davinci-003, which the team interpreted as the two models performing very similarly despite Alpaca's much smaller size [1]. They attributed the result partly to the small evaluation set and noted that Alpaca's answers tended to mirror the style of its teacher, typically shorter than ChatGPT's output because text-davinci-003 itself produces concise answers [1].
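To see how little a 90-to-89 split says about which model is better, the win rate can be given a rough uncertainty interval. The sketch below uses a normal-approximation (Wald) interval, which is a choice made here for illustration and not an analysis the authors reported. It also assumes the 179 comparisons were independent, which may not hold for ratings made by a small group of annotators.

```python
import math


def win_rate_interval(wins, total, z=1.96):
    """Win rate with a normal-approximation (Wald) interval; z=1.96 gives about 95%."""
    rate = wins / total
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, rate - half_width, rate + half_width


rate, low, high = win_rate_interval(wins=90, total=179)
print(f"win rate = {rate:.3f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

For 90 wins out of 179 the interval spans roughly 43% to 58%, so it comfortably includes 50%. The comparison is consistent with the two models being rated about equally on this prompt set, and it cannot distinguish a small advantage for either one.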
The team was explicit that Alpaca shared the standard deficiencies of language models, including hallucination, toxicity, and stereotyped outputs, and described hallucination as a particularly common failure mode even relative to text-davinci-003 [1]. In one published example, the model asserted that the capital of Tanzania is Dar es Salaam, which is the country's largest city but ceased to be its capital in 1974, when Dodoma replaced it. The team also showed that the model could produce fluent, well-written misinformation when prompted [1].
Alpaca launched alongside an interactive web demo intended to make the research accessible and to surface unexpected behaviors. The demo shipped with two mitigations: a content filter built on OpenAI's moderation API, and watermarking of all model outputs so that Alpaca-generated text could later be detected [1].
The demo attracted heavy traffic, and users quickly publicized examples of hallucinations and other failures. Around March 21, 2023, roughly a week after launch, the researchers took the demo offline [4]. A spokesperson for the Stanford Institute for Human-Centered AI explained: "The original goal of releasing a demo was to disseminate our research in an accessible way. We feel that we have mostly achieved this goal, and given the hosting costs and the inadequacies of our content filters, we decided to bring down the demo" [4]. The takedown applied only to the hosted demo; the dataset, data generation code, and training code remained available on GitHub [2][4].
Alpaca's components were released under different terms: the code under Apache 2.0, and both the 52K dataset and the model weight diff under the non-commercial Creative Commons CC BY-NC 4.0 license [2]. The fine-tuned weights were distributed only as a diff, so reconstructing the model required the original LLaMA weights from Meta. Because of these restrictions, Alpaca is best described as open-recipe and research-only rather than fully open source.
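The idea of distributing a weight diff can be illustrated with a toy example: the published file holds the element-wise difference between the fine-tuned and base parameters, and anyone who already has the base weights adds the two back together. The sketch below uses plain Python lists in place of real tensors, and the functions `make_diff` and `apply_diff` are hypothetical names. The actual release operates on full model checkpoints and may involve further steps, such as tokenizer or vocabulary changes, that are not modelled here.

```python
def make_diff(tuned, base):
    """Element-wise (tuned - base) for every named parameter."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in tuned
    }


def apply_diff(base, diff):
    """Rebuild the fine-tuned parameters as (base + diff)."""
    if base.keys() != diff.keys():
        raise ValueError("base weights and diff must cover the same parameters")
    return {
        name: [b + d for b, d in zip(base[name], diff[name])]
        for name in base
    }


# Toy parameters standing in for full model checkpoints.
base = {"layer.weight": [0.5, -1.0, 2.0]}
tuned = {"layer.weight": [0.75, -1.25, 2.0]}

diff = make_diff(tuned, base)        # what would be published
recovered = apply_diff(base, diff)   # what a licensed LLaMA holder reconstructs
print(diff)
print(recovered == tuned)
```

The diff alone is of no use without the base weights, which is why this form of distribution kept the release within the terms of the LLaMA license.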
The team gave three reasons why the model could not be used commercially: Alpaca is based on LLaMA, which carried a non-commercial research license; the instruction data came from text-davinci-003, whose OpenAI terms of use prohibited developing models that compete with OpenAI; and the model lacked safety measures adequate for general deployment [1].
Alpaca's central claim, that a few hundred dollars of API calls plus a few hours of GPU time could approximate an expensive commercial instruction-following model, reshaped expectations about the economics of building such systems. Within days, the community project Alpaca-LoRA reproduced the model using low-rank adaptation on a single consumer GPU, making the recipe accessible to hobbyists [10]. A leaked internal Google memo published by SemiAnalysis in May 2023, often summarized as "we have no moat," pointed to the Alpaca and LoRA lineage as evidence that open models were rapidly closing the gap with proprietary ones [11].
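A rough parameter count shows why low-rank adaptation fits on much smaller hardware. LoRA freezes a weight matrix of shape d_out × d_in and trains only two small factors, B (d_out × r) and A (r × d_in), whose product is added to the frozen matrix. The sketch below counts trainable parameters for a single matrix. The matrix size of 4096 × 4096 and the rank of 8 are illustrative assumptions, not settings confirmed by the source for Alpaca-LoRA.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096    # assumed size of one square projection matrix (illustrative)
rank = 8    # assumed LoRA rank (illustrative)

full_params = d * d
lora_params = lora_trainable_params(d, d, rank)
fraction = lora_params / full_params

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters ({fraction:.2%} of the full matrix)")
```

Because gradients and optimizer state are kept only for the small factors, the memory needed for training falls sharply compared with updating every weight.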
A succession of academic chatbots followed the same teacher-distillation pattern, often continuing the camelid and animal naming theme.
| Model | Developer | Released | Base model | Training data |
|---|---|---|---|---|
| Alpaca | Stanford CRFM | March 13, 2023 | LLaMA-7B | 52K text-davinci-003 demonstrations |
| Alpaca-LoRA | Open-source community | March 2023 | LLaMA-7B | Alpaca's 52K dataset, via LoRA |
| Vicuna-13B | LMSYS (UC Berkeley, CMU, Stanford, UC San Diego, MBZUAI) | March 30, 2023 | LLaMA-13B | About 70K user-shared ChatGPT conversations from ShareGPT (training cost about $300) [5] |
| Koala | Berkeley AI Research (BAIR) | April 3, 2023 | LLaMA-13B | Web dialogue data, including ChatGPT outputs [6] |
The project also seeded a durable evaluation lineage from the same Stanford lab. AlpacaFarm (May 2023) used the Alpaca setting to build a low-cost simulation framework for studying methods that learn from human feedback, such as RLHF [9]. AlpacaEval, released in mid-2023, turned the approach into a widely used automatic benchmark in which a strong judge model compares a candidate's responses against a reference model on a fixed set of 805 instructions to produce a win rate [8]. A 2024 length-controlled version corrected for judges' bias toward verbose answers, raising the benchmark's Spearman correlation with Chatbot Arena from 0.94 to 0.98 while making the win rate harder to game with longer outputs [12]. More broadly, "alpaca-style" became shorthand for distilling instruction-following data from a stronger teacher model, a practice that has remained both popular and contested under model providers' terms of service.
The strongest academic pushback came in "The False Promise of Imitating Proprietary LLMs" (May 2023) by Arnav Gudibande, Eric Wallace, and colleagues at UC Berkeley, who trained a series of imitation models across base-model sizes and amounts of imitation data [7]. They found that such models successfully mimic the style of their teacher, and can initially fool human crowdworkers, but do little to close the underlying capability gap on factuality and reasoning benchmarks; the authors concluded that imitation was a "false promise" and that improving open base models matters more than cheap distillation [7].
Other criticisms targeted the evaluation and the data. Alpaca's headline comparison rested on 179 prompts rated by the model's own authors, a limitation the team acknowledged [1]. Community users also found that the 52K machine-generated dataset contained noisy and incorrect outputs, which prompted cleaned and curated derivatives. The authors themselves emphasized Alpaca's propensity to hallucinate, a weakness that users of the public demo documented within its first week [1][4].