Alpaca (model)

Updated Jun 23, 2026

Alpaca is an instruction-following language model released on March 13, 2023 by Stanford University's Center for Research on Foundation Models (CRFM). It was built by fine-tuning Meta's 7-billion-parameter LLaMA model on 52,000 instruction-following demonstrations generated automatically with OpenAI's text-davinci-003, at a total cost the team estimated at under $600 ^[1]. In a small blind human evaluation, Alpaca won 90 of 179 head-to-head comparisons against text-davinci-003, a far larger commercial model, leading the authors to conclude that the two performed very similarly despite Alpaca's much smaller size. The result was widely read as evidence that useful instruction-tuned models could be reproduced cheaply from open weights ^[1]. Alpaca's web demo was taken offline about a week after launch over hallucination and hosting concerns, and the model was restricted to non-commercial research use. Even so, its recipe and 52K dataset catalyzed the 2023 wave of open instruction-tuned models, including Vicuna and Koala, and gave its name to the AlpacaEval benchmark lineage ^[1]^[4]^[5]^[8].

Overview

Alpaca appeared in the months following the launch of ChatGPT, when capable instruction-following models were available almost exclusively as closed commercial APIs, leaving academics without an accessible model whose failure modes they could study and try to fix ^[1]. Meta AI had released LLaMA in February 2023 under a non-commercial research license, providing strong open base models but not instruction-tuned ones.

The project was carried out in the lab of Stanford assistant professor Tatsunori Hashimoto by five student researchers who contributed equally (Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li), together with faculty co-authors Carlos Guestrin, Percy Liang, and Hashimoto ^[1]^[2]. The team framed the release narrowly as a research artifact, stating that "Alpaca is intended only for academic research and any commercial use is prohibited" ^[1]. The initial release included the 52,000-example training dataset, the data generation code, and the training recipe; a fine-tuned weight diff against LLaMA was published later ^[2].

How was Alpaca built?

Alpaca combined two ingredients: a strong pretrained base model (LLaMA-7B) and inexpensive machine-generated instruction tuning data. To produce the data, the team adapted the self-instruct method of Wang et al. (December 2022), in which a language model is prompted with a small pool of human-written seed tasks and asked to invent new instructions and corresponding outputs ^[3]. Starting from self-instruct's 175 human-written seed instruction-output pairs, the Stanford team prompted text-davinci-003 to generate further examples, simplifying the original pipeline in several ways: instructions were generated in aggressive batches of 20, the distinction between classification and non-classification tasks was dropped, and only a single instance was generated per instruction rather than two or three ^[2]. The process yielded 52,000 unique instruction-following demonstrations for under $500 in OpenAI API fees ^[1].
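
The batched generation step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual pipeline: the prompt wording, the choice of three seed examples per prompt, the function names, and the exact-match deduplication rule are all assumptions made for the example. The call to the teacher model is deliberately left out.

```python
import random


def build_batch_prompt(seed_tasks, batch_size=20, num_examples=3, rng=None):
    """Build one prompt asking a teacher model for `batch_size` new tasks.

    The wording and `num_examples` are hypothetical; only the batch size of 20
    and the single output per instruction come from the article text.
    """
    rng = rng or random.Random(0)
    examples = rng.sample(seed_tasks, num_examples)
    lines = [
        f"Come up with {batch_size} diverse task instructions, each with one output.",
        "",
    ]
    for i, task in enumerate(examples, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        lines.append(f"{i}. Output: {task['output']}")
    # Leave the next item open so the teacher model continues the list.
    lines.append(f"{num_examples + 1}. Instruction:")
    return "\n".join(lines)


def dedupe(records):
    """Keep the first record per instruction (assumed rule: case-insensitive exact match)."""
    seen = set()
    unique = []
    for rec in records:
        key = rec["instruction"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

A real pipeline would send each prompt to the teacher model, parse the numbered items out of the completion, and filter near-duplicates with a similarity measure rather than exact matching.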

Fine-tuning was performed with Hugging Face's training framework as plain supervised learning on the generated demonstrations. Training the 7B model took about three hours on eight 80 GB NVIDIA A100 GPUs, which the team estimated at less than $100 on most cloud providers, putting the entire project under $600 ^[1]. The repository also documents a configuration for a 13B variant ^[2].

Hyperparameter	Alpaca-7B	Alpaca-13B
Base model	LLaMA-7B	LLaMA-13B
Batch size	128	128
Learning rate	2e-5	1e-5
Epochs	3	5
Max sequence length	512	512
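
As a rough check on these figures, the arithmetic below derives the number of optimizer steps and the GPU-hour rate implied by the team's cost estimate. It assumes that the batch size of 128 is the effective global batch size, that a final partial batch is kept, and that "about three hours" means exactly three; none of these details are confirmed by the sources cited here.

```python
import math

NUM_EXAMPLES = 52_000        # size of the instruction dataset
BATCH_SIZE = 128             # assumed to be the effective global batch size
EPOCHS_7B = 3
NUM_GPUS = 8
TRAIN_HOURS = 3              # "about three hours", taken as exactly 3 here
COMPUTE_BUDGET_USD = 100     # upper bound quoted for training
DATA_BUDGET_USD = 500        # upper bound quoted for API fees

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)   # 407
total_steps = steps_per_epoch * EPOCHS_7B                # 1221
gpu_hours = NUM_GPUS * TRAIN_HOURS                       # 24
implied_max_rate = COMPUTE_BUDGET_USD / gpu_hours        # about 4.17 USD per GPU-hour
total_budget = COMPUTE_BUDGET_USD + DATA_BUDGET_USD      # 600

print(f"{total_steps} optimizer steps, {gpu_hours} GPU-hours, "
      f"under ${implied_max_rate:.2f}/GPU-hour, under ${total_budget} overall")
```

Under these assumptions the 7B run amounts to roughly 1,200 optimizer steps, which helps explain why the fine-tuning stage was the cheaper of the two cost components.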

Each training example followed a simple instruction, optional input, and output schema. This "Alpaca format" became a de facto standard for community instruction datasets, and the 52K dataset itself was widely reused, cleaned, extended, and translated by other projects ^[2].
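
The schema can be illustrated with a small formatter that turns one record into a prompt and a target string. The prompt wording below is modeled on the template published in the project repository, but the exact wording and whitespace should be treated as assumptions and checked against the repository; the function name `format_example` is hypothetical.

```python
# Templates modeled on the "Alpaca format"; wording and whitespace are assumptions.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_example(record):
    """Return (prompt, target) for one instruction / optional input / output record."""
    has_input = bool(record.get("input"))
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
    )
    return prompt, record["output"]


record = {
    "instruction": "Add the two numbers.",
    "input": "2 and 3",
    "output": "5",
}
prompt, target = format_example(record)
```

During supervised fine-tuning the model is trained to continue `prompt` with `target`; records with an empty `input` field use the shorter template.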

How good was Alpaca?

The authors evaluated Alpaca with a blind pairwise human comparison against text-davinci-003, conducted by the five student authors on inputs from the self-instruct evaluation set, which focuses on everyday user-oriented instructions such as email writing, social media, and productivity tasks ^[1]. Alpaca won 90 of the 179 comparisons, against 89 wins for text-davinci-003, which the team interpreted as the two models performing very similarly despite Alpaca's much smaller size ^[1]. They attributed the result partly to the small evaluation set and noted that Alpaca's answers tended to mirror the style of its teacher, typically shorter than ChatGPT's output because text-davinci-003 itself produces concise answers ^[1].
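
To make the "very similar" reading concrete, the sketch below computes the win rate and an exact two-sided binomial p-value under the null hypothesis that each comparison is a fair coin flip. The test is an illustration added here, not an analysis reported by the authors, and it uses the simple tail-doubling convention for the two-sided p-value.

```python
from math import comb


def win_rate(wins, total):
    """Fraction of pairwise comparisons won."""
    return wins / total


def two_sided_binomial_p(wins, total):
    """Exact two-sided p-value for H0: P(win) = 0.5, by doubling the larger tail."""
    k = max(wins, total - wins)
    upper_tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * upper_tail)


rate = win_rate(90, 179)                 # about 0.503
p_value = two_sided_binomial_p(90, 179)  # 1.0
print(f"win rate = {rate:.3f}, p = {p_value:.3f}")
```

A 90 to 89 split is the closest possible outcome for 179 comparisons, so the evaluation provides no statistical evidence that either model was preferred.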

The team was explicit that Alpaca shared the standard deficiencies of language models, including hallucination, toxicity, and stereotyped outputs, and described hallucination as a particularly common failure mode even relative to text-davinci-003 ^[1]. In one published example, the model asserted that the capital of Tanzania is Dar es Salaam, which is the country's largest city but ceased to be its capital in 1974, when Dodoma replaced it. The team also showed that the model could produce fluent, well-written misinformation when prompted ^[1].

Why was the Alpaca demo taken down?

Alpaca launched alongside an interactive web demo intended to make the research accessible and to surface unexpected behaviors. The demo shipped with two mitigations: a content filter built on OpenAI's moderation API, and watermarking of all model outputs so that Alpaca-generated text could later be detected ^[1].

The demo attracted heavy traffic, and users quickly publicized examples of hallucinations and other failures. Around March 21, 2023, roughly a week after launch, the researchers took the demo offline ^[4]. A spokesperson for the Stanford Institute for Human-Centered AI explained: "The original goal of releasing a demo was to disseminate our research in an accessible way. We feel that we have mostly achieved this goal, and given the hosting costs and the inadequacies of our content filters, we decided to bring down the demo" ^[4]. The takedown applied only to the hosted demo; the dataset, data generation code, and training code remained available on GitHub ^[2]^[4].

Is Alpaca open source?

Alpaca's components were released under different terms: the code under Apache 2.0, and both the 52K dataset and the model weight diff under the non-commercial Creative Commons CC BY-NC 4.0 license ^[2]. The fine-tuned weights were distributed only as a diff, so reconstructing the model required the original LLaMA weights from Meta. Because of these restrictions Alpaca is best described as open-recipe and research-only rather than fully open source.
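
The idea of distributing weights as a diff can be sketched with plain Python lists standing in for tensors. This is a toy illustration under stated assumptions: the function names `make_diff` and `apply_diff` are hypothetical, the diff is assumed to be a simple element-wise difference, and the real release operates on full model checkpoints.

```python
def make_diff(base, tuned):
    """Element-wise difference (tuned - base) for every named parameter."""
    return {
        name: [t - b for b, t in zip(base[name], tuned[name])]
        for name in tuned
    }


def apply_diff(base, diff):
    """Recover fine-tuned weights by adding the diff back onto the base weights."""
    if base.keys() != diff.keys():
        raise ValueError("base weights and diff must cover the same parameters")
    return {
        name: [b + d for b, d in zip(base[name], diff[name])]
        for name in diff
    }


base = {"layer.weight": [1.0, 2.0]}    # stands in for the original LLaMA weights
tuned = {"layer.weight": [1.5, 1.75]}  # stands in for the fine-tuned weights
diff = make_diff(base, tuned)          # this is the only part that is published
recovered = apply_diff(base, diff)
```

Because the diff is meaningless without `base`, only users who already hold the original LLaMA weights under Meta's license can reconstruct the fine-tuned model.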

The team gave three reasons why the model could not be used commercially: Alpaca is based on LLaMA, which carried a non-commercial research license; the instruction data came from text-davinci-003, whose OpenAI terms of use prohibited developing models that compete with OpenAI; and the model lacked safety measures adequate for general deployment ^[1].

Why did Alpaca matter?

Alpaca's central claim, that a few hundred dollars of API calls plus a few hours of GPU time could approximate an expensive commercial instruction-following model, reshaped expectations about the economics of building such systems. Within days, the community project Alpaca-LoRA reproduced the model using low-rank adaptation on a single consumer GPU, making the recipe accessible to hobbyists ^[10]. A leaked internal Google memo published by SemiAnalysis in May 2023, often summarized as "we have no moat," pointed to the Alpaca and LoRA lineage as evidence that open models were rapidly closing the gap with proprietary ones ^[11].
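
A short calculation shows why low-rank adaptation fits on much smaller hardware. For a weight matrix of shape d_out by d_in, LoRA freezes the matrix and trains two factors of rank r, which adds r × (d_out + d_in) trainable parameters. The hidden size of 4096 corresponds to LLaMA-7B; the rank of 8 is an assumption chosen for illustration, not a value taken from the sources cited here.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters added by one LoRA adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


HIDDEN_SIZE = 4096   # LLaMA-7B hidden dimension
RANK = 8             # assumed LoRA rank, for illustration only

full_params = HIDDEN_SIZE * HIDDEN_SIZE                              # 16,777,216
lora_params = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)  # 65,536
fraction = lora_params / full_params                                 # 0.00390625

print(f"LoRA trains {lora_params:,} of {full_params:,} parameters "
      f"({fraction:.2%}) for one square projection matrix")
```

Training under half a percent of the parameters in each adapted matrix sharply reduces the memory needed for gradients and optimizer state, which is what brings fine-tuning within reach of a single consumer GPU.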

A succession of academic chatbots followed the same teacher-distillation pattern, often continuing the camelid and animal naming theme.

Model	Developer	Released	Base model	Training data
Alpaca	Stanford CRFM	March 13, 2023	LLaMA-7B	52K text-davinci-003 demonstrations
Alpaca-LoRA	Open-source community	March 2023	LLaMA-7B	Alpaca's 52K dataset, via LoRA
Vicuna-13B	LMSYS (UC Berkeley, CMU, Stanford, UC San Diego, MBZUAI)	March 30, 2023	LLaMA-13B	About 70K user-shared ChatGPT conversations from ShareGPT; reported training cost about $300 ^[5]
Koala	Berkeley AI Research (BAIR)	April 3, 2023	LLaMA-13B	Web dialogue data, including ChatGPT outputs ^[6]

The project also seeded a durable evaluation lineage from the same Stanford lab. AlpacaFarm (May 2023) used the Alpaca setting to build a low-cost simulation framework for studying methods that learn from human feedback, such as RLHF ^[9]. AlpacaEval, released in mid-2023, turned the approach into a widely used automatic benchmark in which a strong judge model compares a candidate's responses against a reference model on a fixed set of 805 instructions to produce a win rate ^[8]. A 2024 length-controlled version corrected for judges' bias toward verbose answers, raising the benchmark's Spearman correlation with Chatbot Arena from 0.94 to 0.98 while making the win rate harder to game with longer outputs ^[12]. More broadly, "alpaca-style" became shorthand for distilling instruction-following data from a stronger teacher model, a practice that has remained both popular and contested under model providers' terms of service.
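
The win rate at the heart of an AlpacaEval-style benchmark can be sketched as follows. This is a simplified illustration rather than the benchmark's implementation: the label names, the function name `judged_win_rate`, and the convention of counting a tie as half a win are assumptions, and the length-controlled variant adds a correction that is not shown.

```python
def judged_win_rate(preferences):
    """Win rate of a candidate model from per-instruction judge verdicts.

    `preferences` holds one label per instruction: "candidate", "reference",
    or "tie". Counting a tie as half a win is an assumption of this sketch.
    """
    if not preferences:
        raise ValueError("at least one judged comparison is required")
    score = {"candidate": 1.0, "reference": 0.0, "tie": 0.5}
    return sum(score[p] for p in preferences) / len(preferences)


# Four toy verdicts; the real benchmark uses a fixed set of 805 instructions.
verdicts = ["candidate", "reference", "tie", "candidate"]
rate = judged_win_rate(verdicts)   # 0.625
```

In the benchmark itself each verdict would come from a strong judge model comparing the candidate's response with the reference model's response to the same instruction.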

What were the main criticisms of Alpaca?

The strongest academic pushback came in "The False Promise of Imitating Proprietary LLMs" (May 2023) by Arnav Gudibande, Eric Wallace, and colleagues at UC Berkeley, who trained a series of imitation models across base-model sizes and amounts of imitation data ^[7]. They found that such models successfully mimic the style of their teacher, and can initially fool human crowdworkers, but do little to close the underlying capability gap on factuality and reasoning benchmarks; the authors concluded that imitation was a "false promise" and that improving open base models matters more than cheap distillation ^[7].

Other criticisms targeted the evaluation and the data. Alpaca's headline comparison rested on 179 prompts rated by the model's own authors, a limitation the team acknowledged ^[1]. Community users also found that the 52K machine-generated dataset contained noisy and incorrect outputs, which prompted cleaned and curated derivatives. The authors themselves emphasized Alpaca's propensity to hallucinate, a weakness that the demo and its takedown made publicly visible within about a week of launch ^[1]^[4].

References

1. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. "Alpaca: A Strong, Replicable Instruction-Following Model." Stanford CRFM, March 13, 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html
2. tatsu-lab/stanford_alpaca, GitHub repository. https://github.com/tatsu-lab/stanford_alpaca
3. Yizhong Wang et al. "Self-Instruct: Aligning Language Models with Self-Generated Instructions." arXiv:2212.10560, December 2022. https://arxiv.org/abs/2212.10560
4. "Stanford takes costly, risky Alpaca AI model offline." The Register, March 21, 2023. https://www.theregister.com/2023/03/21/stanford_ai_alpaca_taken_offline/
5. LMSYS Org. "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality." March 30, 2023. https://lmsys.org/blog/2023-03-30-vicuna/
6. Xinyang Geng et al. "Koala: A Dialogue Model for Academic Research." Berkeley Artificial Intelligence Research Blog, April 3, 2023. https://bair.berkeley.edu/blog/2023/04/03/koala/
7. Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song. "The False Promise of Imitating Proprietary LLMs." arXiv:2305.15717, May 2023. https://arxiv.org/abs/2305.15717
8. tatsu-lab/alpaca_eval, GitHub repository. https://github.com/tatsu-lab/alpaca_eval
9. Yann Dubois et al. "AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback." arXiv:2305.14387, May 2023. https://arxiv.org/abs/2305.14387
10. tloen/alpaca-lora, GitHub repository. https://github.com/tloen/alpaca-lora
11. "Google: We Have No Moat, And Neither Does OpenAI." SemiAnalysis, May 4, 2023. https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
12. Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto. "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators." arXiv:2404.04475, April 2024. https://arxiv.org/abs/2404.04475
