Vicuna (language model)

Artificial Intelligence Large Language Models Open Source AI Research Organizations

22 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

14 citations

Revision

v4 · 4,494 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Vicuna is a family of open-source chat-tuned large language models released by LMSYS (the Large Model Systems Organization) in 2023, produced by fine-tuning Meta's LLaMA base models on roughly 70,000 to 125,000 user-shared dialogues scraped from the ShareGPT website.^[1] The first model, Vicuna-13B, was released on March 30, 2023, less than four weeks after LLaMA itself leaked, and the accompanying LMSYS blog post reported that the 13B model reached "more than 90%* of the quality of OpenAI ChatGPT and Google Bard" on a custom 80-question benchmark judged by GPT-4, at a reported training cost of about $300 in cloud spot-instance time.^[1] That single claim simultaneously kicked off the modern open-LLM ecosystem and a long-running argument over LLM-as-a-judge methodology.^[1]

Vicuna mattered for three reasons that have very little to do with its raw quality today. First, it was the proof of concept that real conversational data, even messy stuff scraped from a Chrome extension, beats the synthetic Self-Instruct text Alpaca had used a few weeks earlier.^[1] Second, the project shipped with FastChat, a serving and training framework that became the substrate for the Chatbot Arena and for a generation of community fine-tunes.^[8] Third, the same team turned around and published the MT-Bench and Chatbot Arena papers, which together became the dominant evaluation regime for chat models from mid-2023 onward.^[2]^[3] Vicuna itself has long since been outclassed by post-LLaMA-2 and post-LLaMA-3 models, but the infrastructure and methodology it dragged into existence are still load bearing.

Quick facts

Field	Value
Developer	LMSYS (UC Berkeley, Carnegie Mellon University, Stanford, UC San Diego, MBZUAI)
Initial release	March 30, 2023 (Vicuna-13B v0)
Latest weights release	August 1, 2023 (Vicuna v1.5)
Base model	LLaMA 1 (v0, v1.1, v1.3); LLaMA 2 (v1.5)
Sizes	7B, 13B, 33B parameters
Architecture	Decoder-only transformer (inherits from LLaMA)
Training data	~70K (v0/v1.1) and ~125K (v1.3+) ShareGPT conversations
Training objective	Supervised fine-tuning with assistant-only loss masking
Hardware	8x NVIDIA A100 80GB
Reported training cost	~$140 (7B), ~$300 (13B) on cloud spot instances
Code license	Apache 2.0 (FastChat)
Weights license	LLaMA Research License (v0/v1.1/v1.3); LLaMA 2 Community License (v1.5)
Weights	`lmsys/` org on Hugging Face

What is Vicuna and where did it come from?

The spring of 2023 was a strange time. ChatGPT had been out for four months, GPT-4 had been announced two weeks earlier, and on February 24 Meta had published LLaMA, a set of 7B/13B/33B/65B foundation models that were technically gated to academic researchers but in practice had been on torrent trackers within days.^[11] By mid-March, Stanford's Alpaca had shown that a 7B LLaMA fine-tuned on 52K Self-Instruct examples generated by text-davinci-003 could imitate ChatGPT well enough to be embarrassing.^[13] The recipe was clear and the GPUs were cheap, but the data was synthetic and a bit thin.

A loose collaboration of PhD students and faculty from UC Berkeley, Carnegie Mellon, Stanford, UC San Diego, and MBZUAI noticed that ShareGPT, a site where people pasted their ChatGPT conversations to brag, had accumulated a large pile of organic, multi-turn dialogue against a much stronger teacher model. Real conversations, free, in the wild, with the kind of messy back-and-forth that synthetic instruction data does not produce. The team, which went on to become LMSYS, scraped roughly 70,000 conversations, cleaned them, and ran a one-day fine-tune of LLaMA-7B and LLaMA-13B on a single 8x A100 box.^[1] They posted the results to a blog under the title "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality."^[1]

The author list on the original blog post and the later paper reads like a who's-who of the systems-and-ML crowd that has produced most of the open-LLM tooling since: Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing.^[1] Lianmin Zheng and Hao Zhang are the listed primary contacts.^[1] The asterisk on "90%" was load-bearing and the team flagged it as preliminary up front, describing the comparison as "a fun and non-scientific evaluation with GPT-4" that needed "further rigorous evaluation."^[1] People mostly ignored the asterisk.

How was Vicuna trained?

The original recipe is, by 2026 standards, almost quaint. Take LLaMA-13B, do supervised fine-tuning on around 70K ShareGPT conversations with a standard cross-entropy objective, run for three epochs on 8x NVIDIA A100 (80GB) GPUs, lean heavily on PyTorch's FSDP for memory sharding, use gradient checkpointing and flash attention to fit longer sequences, and serve the model from a small custom framework called FastChat.^[1] Total reported wall-clock cost: about $300 for the 13B model and around $140 for the 7B, both achieved by running on managed spot instances via SkyPilot, a Berkeley project from the same lab.^[1] The team noted that SkyPilot's managed spot support with auto-recovery and zone switching cut the 13B training bill from roughly $1,000 to about $300.^[1]

The specific hyperparameters published with the v0 release line up tightly with what later became the de facto LLaMA-derivative SFT recipe.

Hyperparameter	Vicuna-7B	Vicuna-13B
Base checkpoint	LLaMA 7B	LLaMA 13B
Epochs	3	3
Effective batch size	128	128
Peak learning rate	2e-5	2e-5
Schedule	Cosine decay with linear warmup	Cosine decay with linear warmup
Warmup ratio	0.03	0.03
Max sequence length	2048	2048
Optimizer	AdamW (no weight decay on biases/norms)	AdamW (no weight decay on biases/norms)
Mixed precision	bf16	bf16
Hardware	8x A100 80GB on spot	8x A100 80GB on spot
Reported spot cost	~$140	~$300

Three details are worth flagging because they made the difference between Vicuna and Alpaca. First, the team extended the maximum sequence length from Alpaca's 512 tokens to 2048, which is the entire reason multi-turn ShareGPT data was usable at all.^[1] Second, the loss was masked over the user prompts and computed only on the assistant turns, so the model was learning to imitate the assistant rather than memorizing the questions.^[1] Third, the chat template put the conversation through USER: ... ASSISTANT: ... markers, which sounds trivial but turned out to matter quite a bit for how the model behaved at inference time and for compatibility with downstream tooling. The v1.1 release in April 2023 swapped the segment separator from ### to the EOS token </s>, which made stop-token logic cleaner and more consistent with how libraries like transformers expected to handle generation halts.^[9]

The ShareGPT data itself went through several cleaning steps before it hit the trainer. HTML was stripped, conversations longer than 2048 tokens were either truncated at the boundary or split into multiple training examples, non-English-only conversations were filtered for the early releases, and the team manually inspected and removed obvious test prompts and red-team probes.^[1] The released text was deduplicated against the test prompts used for evaluation. By the v1.3 release, the dataset had grown to around 125,000 conversations after community contributors fed in additional ShareGPT scrapes.^[9]

Weights were not released as full checkpoints in the v0 era. The LLaMA license forbade redistribution, so LMSYS published "delta" weights instead: tensors representing the difference between the fine-tuned Vicuna and the original LLaMA.^[9] To run Vicuna v0, you had to obtain LLaMA weights yourself (officially or otherwise) and apply the delta locally.^[9] This was annoying but it was the only legally defensible path for almost everything in the 2023 LLaMA-derivative ecosystem. The v1.3 release in June 2023 switched to publishing merged weights directly, on the grounds that the LLaMA weights had by then been distributed widely enough that the legal fiction was no longer holding anyone back.^[9]

What versions of Vicuna were released?

LMSYS shipped four major weight versions over roughly four months. Each one moved the line on something specific.

Version	Released	Base model	Sizes	Notable changes
v0	March 30, 2023	LLaMA 1	7B-delta, 13B-delta	Initial release. Distributed as delta weights. ~70K ShareGPT conversations. 2048-token context.
v1.1	April 12, 2023	LLaMA 1	7B, 13B	Replaced `###` separator with EOS token `</s>`. Fixed an SFT loss bug that had been hurting quality. Stable chat template.
v1.3	June 22, 2023	LLaMA 1	7B, 13B, 33B	First time a 33B Vicuna shipped. Trained on roughly 2x the prior ShareGPT data (around 125K conversations). Merged weights, no delta needed.
v1.5	August 1, 2023	LLaMA 2	7B, 13B	Switched base to LLaMA 2 under the LLaMA 2 Community License. Inherits LLaMA 2's 4K native context.
v1.5-16k	August 1, 2023	LLaMA 2	7B-16k, 13B-16k	Long-context variants extended to 16,384 tokens via linear RoPE scaling.

There is no v1.2 or v1.4 in the public lineage.^[9] The numbering jumps because internal release candidates with those tags were never promoted. v1.5 is the last weight release; LMSYS effectively wound down direct Vicuna training as their attention shifted to running Chatbot Arena and to other research, and the broader community moved on to Llama 2 Chat, Mistral, and the Llama 3 family.

What was the "90% ChatGPT quality" claim?

The most quoted single fact about Vicuna is that the original blog post claimed Vicuna-13B reached "more than 90% of the quality of OpenAI ChatGPT and Google Bard" while beating LLaMA and Alpaca in over 90% of pairwise comparisons.^[1] The methodology behind this number is what made the project culturally important and also what made it controversial.

The team built a small evaluation set of 80 questions across eight categories: Fermi problems, counterfactuals, roleplay, generic open-ended writing, knowledge, common sense, math, and coding.^[1] Each of five models (LLaMA-13B, Alpaca-13B, Vicuna-13B, Bard, ChatGPT) answered all 80 questions. GPT-4 then scored each answer on a 1-to-10 scale and was also asked to do pairwise comparisons.^[1] Vicuna's total score divided by ChatGPT's total score came out to about 92%, hence the headline number with the asterisk.^[1]

The specific aggregate numbers from the original blog post were as follows.

Model	GPT-4 total score (out of 800)	Score relative to ChatGPT
ChatGPT	693	100%
Bard	664	96%
Vicuna-13B	638	92%
Alpaca-13B	583	84%
LLaMA-13B	489	71%

It is hard to overstate how much pushback this provoked. Three objections came up immediately and have not really gone away. The judge is the same model family as one of the contestants, so position bias and self-enhancement bias are baked in.^[2] GPT-4 is known to prefer longer, more confident answers, and Vicuna had been trained on ChatGPT outputs, so it spoke ChatGPT's language.^[2] And 80 questions is a small sample for any conclusion ending in a percent sign. The blog post itself acknowledged this, calling the evaluation "non-rigorous" and labeling it as a starting point rather than a ranking.^[1] Few people quoted the caveats.

The interesting thing is that the LMSYS team took the criticism on board and turned it into the next two papers.^[2]^[3] Both were aimed directly at the question "can you actually use an LLM as a judge, and if so, under what conditions does that match human preference?"

What is MT-Bench?

MT-Bench was the formal answer to the 80-question demo. Introduced in the June 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., later accepted at NeurIPS 2023 Datasets and Benchmarks), it is also a small benchmark by design: 80 questions, eight categories (writing, roleplay, extraction, reasoning, math, coding, knowledge in STEM, knowledge in humanities/social science), but each item is two turns rather than one.^[2] The judge, typically GPT-4, scores each model answer 1 through 10 and also runs pairwise comparisons with prompted instructions intended to dampen position and verbosity biases.^[2]

The paper's headline finding was that strong LLM judges like GPT-4 "can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans."^[2] That number is what made MT-Bench an acceptable substitute for human evaluation in lab settings, and for two years roughly every chat model release ran an MT-Bench number whether or not the score actually meant anything in absolute terms. The paper also catalogued the failure modes of LLM-as-a-judge: position bias (judges prefer the first answer they see), verbosity bias (judges prefer longer answers regardless of substance), self-enhancement bias (a judge prefers outputs that look like its own), and limited reasoning ability on math and coding judgment tasks.^[2] The proposed mitigations included swapping positions, calibrating with reference answers, and using chain-of-thought prompting.^[2]

The Vicuna family produced the following MT-Bench scores (GPT-4 judge, 1-10 scale, averaged across the 80 two-turn questions). The numbers were reported in Zheng et al. (2023) and on the LMSYS leaderboard.^[2]

Model	MT-Bench score	Notes
GPT-4 (March 2023)	8.99	Reference upper bound
GPT-3.5 / ChatGPT	7.94	Reference frontier-proprietary
Vicuna-33B-v1.3	7.12	LLaMA-1-based, largest variant
Llama-2-13B-Chat	6.65	Meta's first-party tune, for context
Vicuna-13B-v1.5	6.57	LLaMA-2-based
Vicuna-13B-v1.3	6.39	LLaMA-1-based
Vicuna-7B-v1.5	6.17	LLaMA-2-based
Vicuna-7B-v1.3	5.95	LLaMA-1-based
Alpaca-13B	4.53	For comparison
LLaMA-13B (raw)	2.61	Untuned base, floor

MT-Bench scores still get cited in academic and industry papers in 2026, though by now they saturate near the top end and have been supplemented by larger arena-style benchmarks and more focused capability evals.

The second prong of the Vicuna team's evaluation work was Chatbot Arena, launched on May 3, 2023 (with operations starting in late April) as an open web platform that lets anyone visit chat.lmsys.org, type a prompt, see two anonymous model responses side by side, and vote which is better.^[4] Models are revealed only after the vote.^[4]

The data feeds into a leaderboard. Initially the platform used an online Elo rating system, the same one used in chess.^[4] In December 2023, LMSYS switched the official ratings to a Bradley-Terry maximum-likelihood estimate computed in batch from the full vote history, while still presenting the result as an Elo-style number for continuity.^[5] The Bradley-Terry model is the statistically principled cousin of incremental Elo and produces tighter confidence intervals.^[5]

The full Chatbot Arena methodology was written up the following year by Chiang et al. in "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference" (arXiv:2403.04132, ICML 2024 oral).^[3] That paper documented more than 240,000 votes from over 90,000 users collected by January 2024 across 50+ models, with statistical analyses showing that the crowdsourced preference data correlated strongly with expert MT-Bench ratings and with academic benchmarks like MMLU.^[3] By early 2026 the platform, rebranded as LMArena, had collected over 6 million user votes across hundreds of models.^[3] Vicuna featured throughout the paper as one of the original baseline models that anchored the early leaderboard.^[3]

Vicuna was one of the first models in the arena and for several months it sat near the top of the open-source side of the leaderboard, behind only proprietary models.^[4] By the time GPT-4 Turbo, Claude 2, Llama 2 Chat, and Mistral 7B Instruct had landed, Vicuna had dropped well down the list, but it stayed in the arena as a reference point for a while. The arena itself outgrew its origins, was rebranded as LMArena and spun out from LMSYS into a separate organization, and continues to operate.

What is FastChat?

FastChat is the open-source repository at lm-sys/FastChat that holds the training scripts, the inference server, the multi-model chat UI, and the arena backend.^[8] It was released alongside Vicuna v0 and grew into the de facto reference implementation for serving chat-tuned LLaMA-family models in 2023.^[8] The code includes:

Supervised fine-tuning scripts for LLaMA, LLaMA 2, T5, and a long list of derivatives.
An OpenAI-compatible REST API server, which let any tool that already spoke the OpenAI Chat Completions format point at a self-hosted Vicuna with one URL change.
A distributed multi-model worker setup with a controller, model workers, and web frontends for both single-model and side-by-side battle UIs.
The MT-Bench evaluation pipeline, including the GPT-4 judge prompts and the result aggregation scripts.

FastChat's importance was indirect but enormous. The OpenAI-compatible endpoint pattern in particular became the default way self-hosted models exposed themselves, and the arena code ended up under the hood of LMArena. The repository, per the project README, has powered Chatbot Arena across more than 10 million chat requests and 70+ models, with over 1.5 million human preference votes recorded.^[8]

How did Vicuna influence the open-LLM ecosystem?

The Vicuna team did not set out to build a movement, but the spring of 2023 produced a small zoo of LLaMA-derivative chat models that all looked like variations on the Vicuna recipe. The data source changed, the training framework was usually FastChat or a fork of Stanford's Alpaca code, and the evaluation was usually a blog post comparison against ChatGPT.

Model	Released	Base model	Training data	Released by
Alpaca	March 13, 2023	LLaMA 7B	52K Self-Instruct examples generated by `text-davinci-003`	Stanford CRFM
Vicuna v0	March 30, 2023	LLaMA 7B/13B	~70K ShareGPT conversations	LMSYS (Berkeley/CMU/Stanford/UCSD/MBZUAI)
Koala	April 3, 2023	LLaMA 13B	Mix of ShareGPT, HC3, Alpaca, OIG, Anthropic HH	UC Berkeley BAIR
Dolly v2	April 12, 2023	Pythia 12B (not LLaMA)	15K human-written instruction pairs (`databricks-dolly-15k`)	Databricks
GPT4All	March 28, 2023	LLaMA 7B	~800K GPT-3.5-Turbo conversations	Nomic AI
WizardLM	June 2023	LLaMA	Evol-Instruct generated data	Microsoft Research/Peking University

Dolly v2 is the odd one out: it deliberately used the non-LLaMA Pythia base and a hand-written dataset to be properly commercially licensable, exactly the constraint Vicuna chose to ignore. WizardLM took the opposite tack and pushed the Vicuna recipe further with synthetic complexity escalation. Most of the post-LLaMA-2 chat fine-tunes (Llama-2-Chat itself, OpenChat, Zephyr, Mistral-7B-Instruct, even later distillation projects) inherit conventions traceable to Vicuna and FastChat: the USER/ASSISTANT chat template, OpenAI-compatible serving, MT-Bench in the eval table.

Vicuna is also responsible, indirectly, for the LLaVA line of vision-language models, which use Vicuna as the language backbone in early versions; for the various Chinese-Vicuna and other localized fine-tunes; and for several rounds of dataset cleanup work after the original ShareGPT site quietly took down its public conversation feed in mid-2023, which forced the community to maintain its own cleaned ShareGPT mirrors.

Vicuna sits inside a wider cluster of open-research artifacts that came out of LMSYS in 2023.

LongChat (June 2023). A pair of long-context fine-tunes (LongChat-7B, LongChat-13B) that extended the LLaMA context window to 16,384 tokens via condensed RoPE scaling, plus the LongEval test for evaluating long-context recall.^[7] LongChat used the same FastChat training stack as Vicuna and informed the v1.5-16k variants.^[7]
LMSYS-Chat-1M (September 2023). A public dataset of one million real-world conversations collected from Chatbot Arena and adjacent LMSYS endpoints across 25 popular models, released by Zheng et al. with the paper "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset".^[6] The dataset is one of the largest publicly available collections of in-the-wild prompt-response data and is used for content moderation, distillation, and behavior studies.^[6]
SkyPilot. Although technically a separate Berkeley project, SkyPilot is the cloud-orchestration layer that LMSYS used to run Vicuna training on multi-cloud spot instances at the published $300 cost; the integration was tight enough that the FastChat README points users to SkyPilot recipes.^[8]
vLLM. The high-throughput inference engine vllm-project/vllm originated in the same Berkeley Sky Computing Lab and shares contributors with FastChat. By late 2023 vLLM had largely replaced FastChat's own model-worker code as the default backend for serving LLaMA-family models at scale, including in Chatbot Arena.

Is Vicuna open source?

Vicuna's licensing situation has always been awkward and the awkwardness changed with each base model.

Vicuna v0, v1.1, and v1.3 inherit Meta's original LLaMA license, which restricts the weights to non-commercial research use.^[11] The ShareGPT data also carries OpenAI's terms-of-service constraints, since the conversations are user-paste copies of ChatGPT outputs, and OpenAI's terms prohibit using model output to train competing models. Both sets of restrictions were widely ignored in practice, and LMSYS itself was clear that the model was a research preview only.^[1] The FastChat code is Apache 2.0, a clean license; the weights are not.^[8]

Vicuna v1.5 sits on the LLaMA 2 Community License, which is closer to permissive but is not OSI-approved and includes a clause kicking in only for organizations whose products had more than 700 million monthly active users at the LLaMA 2 release date, plus an acceptable use policy.^[12] That license is comfortable for the vast majority of users including most commercial ones, but it is not technically open source under the standard definition. Anyone redeploying Vicuna v1.5 should still read the LLaMA 2 license and the acceptable use policy directly.^[12]

The ShareGPT data remains the messiest piece. The original dataset was scraped without permission from a third-party site, the source has since limited public access, and the relationship to OpenAI's terms is, charitably, unresolved. Most modern downstream projects that reference Vicuna's training data are pointing at one of several community-maintained cleaned mirrors rather than at an official LMSYS dataset.

What are Vicuna's limitations?

The original Vicuna blog post listed the limitations honestly, and most of them held up:^[1]

Reasoning and math. The model was trained on conversational text, which contains very little correctly worked math. It does not reliably do multi-step arithmetic and it confabulates on logic puzzles.^[1]
Hallucination. Vicuna will produce confident, plausible factual claims that are wrong. The fine-tune data made the surface fluency much better than the base LLaMA, but the underlying knowledge was still LLaMA's frozen pretraining cutoff and the SFT did not add new facts so much as a more confident voice.^[1]
Safety. No RLHF or constitutional-style alignment work was performed. The team noted that the model could produce toxic or biased outputs and recommended downstream filtering.^[1]
Code. Coding ability is strictly worse than ChatGPT's at the time, and far worse than what dedicated code models like CodeLlama or DeepSeek-Coder would later achieve.^[1]
Multilingual. LLaMA's pretraining corpus was English-heavy, ShareGPT conversations skew heavily English, and the model performs noticeably worse outside English.^[1]

The v1.5 era patched some of these by inheriting LLaMA 2's stronger pretraining, but it never closed the alignment gap with proprietary models or the specialized capability gap with later open weights like Llama 3 Instruct, Mistral, or Qwen.

Successors and current status

LMSYS stopped doing major Vicuna releases after v1.5 in August 2023. The reasons were not dramatic: the academic team running it had finite bandwidth, Llama 2 Chat shipped its own decent first-party chat tune, the Chatbot Arena had become a substantial operational project on its own, and the most interesting research was migrating to architectural and inference-time work rather than yet another SFT-on-ShareGPT recipe.

In 2026 Vicuna is mostly historical. The default open chat models are Llama 3.x Instruct, Mistral and the various Mixtral derivatives, Qwen 2.5 Instruct, Gemma, DeepSeek's chat models, and a long tail of community fine-tunes that descend from those. Vicuna weights still live on Hugging Face under the lmsys/ organization and they are still useful as a baseline or for reproducing 2023 papers, but nobody serves them in production.^[10] The artifacts of the project that are still in active use are FastChat (still maintained as the reference serving framework for many open models), MT-Bench (still cited but increasingly saturated), and the Chatbot Arena (now LMArena, still arguably the most-watched public LLM leaderboard).

What made Vicuna interesting in 2023 was not really the model. It was that a handful of grad students with eight A100s and a scraper produced something close enough to ChatGPT that it was worth arguing about, and then turned the argument into the field's evaluation infrastructure. Almost everything that came after, the open-model release cadence, the MT-Bench-on-the-leaderboard convention, the OpenAI-compatible self-hosting pattern, the side-by-side battle UI, traces back to that one weekend in late March 2023.

References

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality." LMSYS Blog, March 30, 2023.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023 Datasets and Benchmarks Track, arXiv:2306.05685.
Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., and Stoica, I. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML 2024 (oral), arXiv:2403.04132.
Zheng, L., Chiang, W.-L., et al. (2023). "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings." LMSYS Blog, May 3, 2023.
LMSYS. "Chatbot Arena: New Models and Elo System Update." LMSYS Blog, December 7, 2023.
Zheng, L., Chiang, W.-L., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." arXiv:2309.11998.
Li, D., Shao, R., Xie, A., Sheng, Y., Zheng, L., Gonzalez, J. E., Stoica, I., Ma, X., and Zhang, H. (2023). "How Long Can Open-Source LLMs Truly Promise on Context Length?" LMSYS Blog (LongChat), June 29, 2023.
LMSYS / lm-sys. FastChat repository on GitHub.
LMSYS / lm-sys. Vicuna weights version notes.
Hugging Face: lmsys/vicuna-7b-v1.5, lmsys/vicuna-13b-v1.5, lmsys/vicuna-13b-v1.5-16k, lmsys/vicuna-33b-v1.3, lmsys/vicuna-13b-delta-v0.
Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). "Stanford Alpaca: An Instruction-following LLaMA model." Stanford CRFM.
Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S., and Song, D. (2023). "Koala: A Dialogue Model for Academic Research." UC Berkeley BAIR Blog, April 3, 2023.

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

Vicuna (language model)

Quick facts

What is Vicuna and where did it come from?

How was Vicuna trained?

What versions of Vicuna were released?

What was the "90% ChatGPT quality" claim?

What is MT-Bench?

What is FastChat?

How did Vicuna influence the open-LLM ecosystem?

Is Vicuna open source?

What are Vicuna's limitations?

Successors and current status

References

Improve this article

What links here (24 of 28)

What links here (24 of 28)

Quick facts

What is Vicuna and where did it come from?

How was Vicuna trained?

What versions of Vicuna were released?

What was the "90% ChatGPT quality" claim?

What is MT-Bench?

What is Chatbot Arena and how is it related to Vicuna?

What is FastChat?

How did Vicuna influence the open-LLM ecosystem?

Related LMSYS projects

Is Vicuna open source?

What are Vicuna's limitations?

Successors and current status

References

Improve this article

Related Articles

GPT-J

OLMo 2

OLMo 3

OLMoE

DeepSeek

Mistral AI

What links here (24 of 28)

Related Articles

GPT-J

OLMo 2

OLMo 3

OLMoE

DeepSeek

Mistral AI

What links here (24 of 28)