Vicuna (language model)
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,410 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,410 words
Add missing citations, update stale details, or suggest a clearer explanation.
Vicuna is a family of open-source chat-tuned large language models released by LMSYS (the Large Model Systems Organization), produced by fine-tuning Meta's LLaMA base models on roughly 70,000 to 125,000 user-shared dialogues scraped from the ShareGPT website. The first model, Vicuna-13B, dropped on March 30, 2023, less than four weeks after LLaMA itself leaked out, and it became the early-2023 reference point for what a small academic team could do with a few hundred dollars of GPU time and a clever data source. The accompanying LMSYS blog post claimed the 13B model reached "more than 90% of the quality" of ChatGPT on a custom 80-question benchmark judged by GPT-4, a claim that simultaneously kicked off the modern open-LLM ecosystem and a long-running fight over LLM-as-a-judge methodology.
Vicuna mattered for three reasons that have very little to do with its raw quality today. First, it was the proof of concept that real conversational data, even messy stuff scraped from a Chrome extension, beats the synthetic Self-Instruct text Alpaca had used a few weeks earlier. Second, the project shipped with FastChat, a serving and training framework that became the substrate for the Chatbot Arena and for a generation of community fine-tunes. Third, the same team turned around and published the MT-Bench and Chatbot Arena papers, which together became the dominant evaluation regime for chat models from mid-2023 onward. Vicuna itself has long since been outclassed by post-LLaMA-2 and post-LLaMA-3 models, but the infrastructure and methodology it dragged into existence are still load bearing.
| Field | Value |
|---|---|
| Developer | LMSYS (UC Berkeley, Carnegie Mellon University, Stanford, UC San Diego, MBZUAI) |
| Initial release | March 30, 2023 (Vicuna-13B v0) |
| Latest weights release | August 1, 2023 (Vicuna v1.5) |
| Base model | LLaMA 1 (v0, v1.1, v1.3); LLaMA 2 (v1.5) |
| Sizes | 7B, 13B, 33B parameters |
| Architecture | Decoder-only transformer (inherits from LLaMA) |
| Training data | ~70K (v0/v1.1) and ~125K (v1.3+) ShareGPT conversations |
| Training objective | Supervised fine-tuning with assistant-only loss masking |
| Hardware | 8x NVIDIA A100 80GB |
| Reported training cost | ~$140 (7B), ~$300 (13B) on cloud spot instances |
| Code license | Apache 2.0 (FastChat) |
| Weights license | LLaMA Research License (v0/v1.1/v1.3); LLaMA 2 Community License (v1.5) |
| Weights | lmsys/ org on Hugging Face |
The spring of 2023 was a strange time. ChatGPT had been out for four months, GPT-4 had been announced two weeks earlier, and on February 24 Meta had published LLaMA, a set of 7B/13B/33B/65B foundation models that were technically gated to academic researchers but in practice had been on torrent trackers within days. By mid-March, Stanford's Alpaca had shown that a 7B LLaMA fine-tuned on 52K Self-Instruct examples generated by text-davinci-003 could imitate ChatGPT well enough to be embarrassing. The recipe was clear and the GPUs were cheap, but the data was synthetic and a bit thin.
A loose collaboration of PhD students and faculty from UC Berkeley, Carnegie Mellon, Stanford, UC San Diego, and MBZUAI noticed that ShareGPT, a site where people pasted their ChatGPT conversations to brag, had accumulated a large pile of organic, multi-turn dialogue against a much stronger teacher model. Real conversations, free, in the wild, with the kind of messy back-and-forth that synthetic instruction data does not produce. The team, which went on to become LMSYS, scraped roughly 70,000 conversations, cleaned them, and ran a one-day fine-tune of LLaMA-7B and LLaMA-13B on a single 8x A100 box. They posted the results to a blog under the title "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality."
The author list on the original blog post and the later paper reads like a who's-who of the systems-and-ML crowd that has produced most of the open-LLM tooling since: Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Lianmin Zheng and Hao Zhang are the listed primary contacts. The asterisk on "90%" was load-bearing and the team flagged it as preliminary up front. People mostly ignored the asterisk.
The original recipe is, by 2026 standards, almost quaint. Take LLaMA-13B, do supervised fine-tuning on around 70K ShareGPT conversations with a standard cross-entropy objective, run for three epochs on 8x NVIDIA A100 (80GB) GPUs, lean heavily on PyTorch's FSDP for memory sharding, use gradient checkpointing and flash attention to fit longer sequences, and serve the model from a small custom framework called FastChat. Total reported wall-clock cost: about $300 for the 13B model and around $140 for the 7B, both achieved by running on managed spot instances via SkyPilot, a Berkeley project from the same lab.
The specific hyperparameters published with the v0 release line up tightly with what later became the de facto LLaMA-derivative SFT recipe.
| Hyperparameter | Vicuna-7B | Vicuna-13B |
|---|---|---|
| Base checkpoint | LLaMA 7B | LLaMA 13B |
| Epochs | 3 | 3 |
| Effective batch size | 128 | 128 |
| Peak learning rate | 2e-5 | 2e-5 |
| Schedule | Cosine decay with linear warmup | Cosine decay with linear warmup |
| Warmup ratio | 0.03 | 0.03 |
| Max sequence length | 2048 | 2048 |
| Optimizer | AdamW (no weight decay on biases/norms) | AdamW (no weight decay on biases/norms) |
| Mixed precision | bf16 | bf16 |
| Hardware | 8x A100 80GB on spot | 8x A100 80GB on spot |
| Reported spot cost | ~$140 | ~$300 |
Three details are worth flagging because they made the difference between Vicuna and Alpaca. First, the team extended the maximum sequence length from Alpaca's 512 tokens to 2048, which is the entire reason multi-turn ShareGPT data was usable at all. Second, the loss was masked over the user prompts and computed only on the assistant turns, so the model was learning to imitate the assistant rather than memorizing the questions. Third, the chat template put the conversation through USER: ... ASSISTANT: ... markers, which sounds trivial but turned out to matter quite a bit for how the model behaved at inference time and for compatibility with downstream tooling. The v1.1 release in April 2023 swapped the segment separator from ### to the EOS token </s>, which made stop-token logic cleaner and more consistent with how libraries like transformers expected to handle generation halts.
The ShareGPT data itself went through several cleaning steps before it hit the trainer. HTML was stripped, conversations longer than 2048 tokens were either truncated at the boundary or split into multiple training examples, non-English-only conversations were filtered for the early releases, and the team manually inspected and removed obvious test prompts and red-team probes. The released text was deduplicated against the test prompts used for evaluation. By the v1.3 release, the dataset had grown to around 125,000 conversations after community contributors fed in additional ShareGPT scrapes.
Weights were not released as full checkpoints in the v0 era. The LLaMA license forbade redistribution, so LMSYS published "delta" weights instead: tensors representing the difference between the fine-tuned Vicuna and the original LLaMA. To run Vicuna v0, you had to obtain LLaMA weights yourself (officially or otherwise) and apply the delta locally. This was annoying but it was the only legally defensible path for almost everything in the 2023 LLaMA-derivative ecosystem. The v1.3 release in June 2023 switched to publishing merged weights directly, on the grounds that the LLaMA weights had by then been distributed widely enough that the legal fiction was no longer holding anyone back.
LMSYS shipped four major weight versions over roughly four months. Each one moved the line on something specific.
| Version | Released | Base model | Sizes | Notable changes |
|---|---|---|---|---|
| v0 | March 30, 2023 | LLaMA 1 | 7B-delta, 13B-delta | Initial release. Distributed as delta weights. ~70K ShareGPT conversations. 2048-token context. |
| v1.1 | April 12, 2023 | LLaMA 1 | 7B, 13B | Replaced ### separator with EOS token </s>. Fixed an SFT loss bug that had been hurting quality. Stable chat template. |
| v1.3 | June 22, 2023 | LLaMA 1 | 7B, 13B, 33B | First time a 33B Vicuna shipped. Trained on roughly 2x the prior ShareGPT data (around 125K conversations). Merged weights, no delta needed. |
| v1.5 | August 1, 2023 | LLaMA 2 | 7B, 13B | Switched base to LLaMA 2 under the LLaMA 2 Community License. Inherits LLaMA 2's 4K native context. |
| v1.5-16k | August 1, 2023 | LLaMA 2 | 7B-16k, 13B-16k | Long-context variants extended to 16,384 tokens via linear RoPE scaling. |
There is no v1.2 or v1.4 in the public lineage. The numbering jumps because internal release candidates with those tags were never promoted. v1.5 is the last weight release; LMSYS effectively wound down direct Vicuna training as their attention shifted to running Chatbot Arena and to other research, and the broader community moved on to Llama 2 Chat, Mistral, and the Llama 3 family.
The most quoted single fact about Vicuna is that the original blog post claimed Vicuna-13B reached "more than 90% of the quality of OpenAI ChatGPT and Google Bard" while beating LLaMA and Alpaca in over 90% of pairwise comparisons. The methodology behind this number is what made the project culturally important and also what made it controversial.
The team built a small evaluation set of 80 questions across eight categories: Fermi problems, counterfactuals, roleplay, generic open-ended writing, knowledge, common sense, math, and coding. Each of five models (LLaMA-13B, Alpaca-13B, Vicuna-13B, Bard, ChatGPT) answered all 80 questions. GPT-4 then scored each answer on a 1-to-10 scale and was also asked to do pairwise comparisons. Vicuna's total score divided by ChatGPT's total score came out to about 92%, hence the headline number with the asterisk.
The specific aggregate numbers from the original blog post were as follows.
| Model | GPT-4 total score (out of 800) | Score relative to ChatGPT |
|---|---|---|
| ChatGPT | 693 | 100% |
| Bard | 664 | 96% |
| Vicuna-13B | 638 | 92% |
| Alpaca-13B | 583 | 84% |
| LLaMA-13B | 489 | 71% |
It is hard to overstate how much pushback this provoked. Three objections came up immediately and have not really gone away. The judge is the same model family as one of the contestants, so position bias and self-enhancement bias are baked in. GPT-4 is known to prefer longer, more confident answers, and Vicuna had been trained on ChatGPT outputs, so it spoke ChatGPT's language. And 80 questions is a small sample for any conclusion ending in a percent sign. The blog post itself acknowledged this, calling the evaluation "non-rigorous" and labeling it as a starting point rather than a ranking. Few people quoted the caveats.
The interesting thing is that the LMSYS team took the criticism on board and turned it into the next two papers. Both were aimed directly at the question "can you actually use an LLM as a judge, and if so, under what conditions does that match human preference?"
MT-Bench was the formal answer to the 80-question demo. Introduced in the June 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., later accepted at NeurIPS 2023 Datasets and Benchmarks), it is also a small benchmark by design: 80 questions, eight categories (writing, roleplay, extraction, reasoning, math, coding, knowledge in STEM, knowledge in humanities/social science), but each item is two turns rather than one. The judge, typically GPT-4, scores each model answer 1 through 10 and also runs pairwise comparisons with prompted instructions intended to dampen position and verbosity biases.
The paper's headline finding was that strong LLM judges like GPT-4 agreed with crowdsourced human preference about 80% of the time, which is roughly the agreement rate between two human annotators on the same task. That number is what made MT-Bench an acceptable substitute for human evaluation in lab settings, and for two years roughly every chat model release ran an MT-Bench number whether or not the score actually meant anything in absolute terms. The paper also catalogued the failure modes of LLM-as-a-judge: position bias (judges prefer the first answer they see), verbosity bias (judges prefer longer answers regardless of substance), self-enhancement bias (a judge prefers outputs that look like its own), and limited reasoning ability on math and coding judgment tasks. The proposed mitigations included swapping positions, calibrating with reference answers, and using chain-of-thought prompting.
The Vicuna family produced the following MT-Bench scores (GPT-4 judge, 1-10 scale, averaged across the 80 two-turn questions). The numbers were reported in Zheng et al. (2023) and on the LMSYS leaderboard.
| Model | MT-Bench score | Notes |
|---|---|---|
| GPT-4 (March 2023) | 8.99 | Reference upper bound |
| GPT-3.5 / ChatGPT | 7.94 | Reference frontier-proprietary |
| Vicuna-33B-v1.3 | 7.12 | LLaMA-1-based, largest variant |
| Llama-2-13B-Chat | 6.65 | Meta's first-party tune, for context |
| Vicuna-13B-v1.5 | 6.57 | LLaMA-2-based |
| Vicuna-13B-v1.3 | 6.39 | LLaMA-1-based |
| Vicuna-7B-v1.5 | 6.17 | LLaMA-2-based |
| Vicuna-7B-v1.3 | 5.95 | LLaMA-1-based |
| Alpaca-13B | 4.53 | For comparison |
| LLaMA-13B (raw) | 2.61 | Untuned base, floor |
MT-Bench scores still get cited in academic and industry papers in 2026, though by now they saturate near the top end and have been supplemented by larger arena-style benchmarks and more focused capability evals.
The second prong of the Vicuna team's evaluation work was Chatbot Arena, launched on May 3, 2023 (with operations starting in late April) as an open web platform that lets anyone visit chat.lmsys.org, type a prompt, see two anonymous model responses side by side, and vote which is better. Models are revealed only after the vote.
The data feeds into a leaderboard. Initially the platform used an online Elo rating system, the same one used in chess. In December 2023, LMSYS switched the official ratings to a Bradley-Terry maximum-likelihood estimate computed in batch from the full vote history, while still presenting the result as an Elo-style number for continuity. The Bradley-Terry model is the statistically principled cousin of incremental Elo and produces tighter confidence intervals.
The full Chatbot Arena methodology was written up the following year by Chiang et al. in "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference" (arXiv:2403.04132, ICML 2024 oral). That paper documented more than 240,000 votes collected by early 2024 across 50+ models, with statistical analyses showing that the crowdsourced preference data correlated strongly with expert MT-Bench ratings and with academic benchmarks like MMLU. Vicuna featured throughout the paper as one of the original baseline models that anchored the early leaderboard.
Vicuna was one of the first models in the arena and for several months it sat near the top of the open-source side of the leaderboard, behind only proprietary models. By the time GPT-4 Turbo, Claude 2, Llama 2 Chat, and Mistral 7B Instruct had landed, Vicuna had dropped well down the list, but it stayed in the arena as a reference point for a while. The arena itself outgrew its origins, was rebranded as LMArena and spun out from LMSYS into a separate organization, and continues to operate.
FastChat is the open-source repository at lm-sys/FastChat that holds the training scripts, the inference server, the multi-model chat UI, and the arena backend. It was released alongside Vicuna v0 and grew into the de facto reference implementation for serving chat-tuned LLaMA-family models in 2023. The code includes:
FastChat's importance was indirect but enormous. The OpenAI-compatible endpoint pattern in particular became the default way self-hosted models exposed themselves, and the arena code ended up under the hood of LMArena. The repository, per the project README, has powered Chatbot Arena across more than 10 million chat requests and 70+ models, with over 1.5 million human preference votes recorded.
The Vicuna team did not set out to build a movement, but the spring of 2023 produced a small zoo of LLaMA-derivative chat models that all looked like variations on the Vicuna recipe. The data source changed, the training framework was usually FastChat or a fork of Stanford's Alpaca code, and the evaluation was usually a blog post comparison against ChatGPT.
| Model | Released | Base model | Training data | Released by |
|---|---|---|---|---|
| Alpaca | March 13, 2023 | LLaMA 7B | 52K Self-Instruct examples generated by text-davinci-003 | Stanford CRFM |
| Vicuna v0 | March 30, 2023 | LLaMA 7B/13B | ~70K ShareGPT conversations | LMSYS (Berkeley/CMU/Stanford/UCSD/MBZUAI) |
| Koala | April 3, 2023 | LLaMA 13B | Mix of ShareGPT, HC3, Alpaca, OIG, Anthropic HH | UC Berkeley BAIR |
| Dolly v2 | April 12, 2023 | Pythia 12B (not LLaMA) | 15K human-written instruction pairs (databricks-dolly-15k) | Databricks |
| GPT4All | March 28, 2023 | LLaMA 7B | ~800K GPT-3.5-Turbo conversations | Nomic AI |
| WizardLM | June 2023 | LLaMA | Evol-Instruct generated data | Microsoft Research/Peking University |
Dolly v2 is the odd one out: it deliberately used the non-LLaMA Pythia base and a hand-written dataset to be properly commercially licensable, exactly the constraint Vicuna chose to ignore. WizardLM took the opposite tack and pushed the Vicuna recipe further with synthetic complexity escalation. Most of the post-LLaMA-2 chat fine-tunes (Llama-2-Chat itself, OpenChat, Zephyr, Mistral-7B-Instruct, even later distillation projects) inherit conventions traceable to Vicuna and FastChat: the USER/ASSISTANT chat template, OpenAI-compatible serving, MT-Bench in the eval table.
Vicuna is also responsible, indirectly, for the LLaVA line of vision-language models, which use Vicuna as the language backbone in early versions; for the various Chinese-Vicuna and other localized fine-tunes; and for several rounds of dataset cleanup work after the original ShareGPT site quietly took down its public conversation feed in mid-2023, which forced the community to maintain its own cleaned ShareGPT mirrors.
Vicuna sits inside a wider cluster of open-research artifacts that came out of LMSYS in 2023.
vllm-project/vllm originated in the same Berkeley Sky Computing Lab and shares contributors with FastChat. By late 2023 vLLM had largely replaced FastChat's own model-worker code as the default backend for serving LLaMA-family models at scale, including in Chatbot Arena.Vicuna's licensing situation has always been awkward and the awkwardness changed with each base model.
Vicuna v0, v1.1, and v1.3 inherit Meta's original LLaMA license, which restricts the weights to non-commercial research use. The ShareGPT data also carries OpenAI's terms-of-service constraints, since the conversations are user-paste copies of ChatGPT outputs, and OpenAI's terms prohibit using model output to train competing models. Both sets of restrictions were widely ignored in practice, and LMSYS itself was clear that the model was a research preview only. The FastChat code is Apache 2.0, a clean license; the weights are not.
Vicuna v1.5 sits on the LLaMA 2 Community License, which is closer to permissive but is not OSI-approved and includes a clause kicking in only for organizations whose products had more than 700 million monthly active users at the LLaMA 2 release date, plus an acceptable use policy. That license is comfortable for the vast majority of users including most commercial ones, but it is not technically open source under the standard definition. Anyone redeploying Vicuna v1.5 should still read the LLaMA 2 license and the acceptable use policy directly.
The ShareGPT data remains the messiest piece. The original dataset was scraped without permission from a third-party site, the source has since limited public access, and the relationship to OpenAI's terms is, charitably, unresolved. Most modern downstream projects that reference Vicuna's training data are pointing at one of several community-maintained cleaned mirrors rather than at an official LMSYS dataset.
The original Vicuna blog post listed the limitations honestly, and most of them held up:
The v1.5 era patched some of these by inheriting LLaMA 2's stronger pretraining, but it never closed the alignment gap with proprietary models or the specialized capability gap with later open weights like Llama 3 Instruct, Mistral, or Qwen.
LMSYS stopped doing major Vicuna releases after v1.5 in August 2023. The reasons were not dramatic: the academic team running it had finite bandwidth, Llama 2 Chat shipped its own decent first-party chat tune, the Chatbot Arena had become a substantial operational project on its own, and the most interesting research was migrating to architectural and inference-time work rather than yet another SFT-on-ShareGPT recipe.
In 2026 Vicuna is mostly historical. The default open chat models are Llama 3.x Instruct, Mistral and the various Mixtral derivatives, Qwen 2.5 Instruct, Gemma, DeepSeek's chat models, and a long tail of community fine-tunes that descend from those. Vicuna weights still live on Hugging Face under the lmsys/ organization and they are still useful as a baseline or for reproducing 2023 papers, but nobody serves them in production. The artifacts of the project that are still in active use are FastChat (still maintained as the reference serving framework for many open models), MT-Bench (still cited but increasingly saturated), and the Chatbot Arena (now LMArena, still arguably the most-watched public LLM leaderboard).
What made Vicuna interesting in 2023 was not really the model. It was that a handful of grad students with eight A100s and a scraper produced something close enough to ChatGPT that it was worth arguing about, and then turned the argument into the field's evaluation infrastructure. Almost everything that came after, the open-model release cadence, the MT-Bench-on-the-leaderboard convention, the OpenAI-compatible self-hosting pattern, the side-by-side battle UI, traces back to that one weekend in late March 2023.