LLaVA (Large Language and Vision Assistant)

Updated Jun 23, 2026

LLaVA (Large Language and Vision Assistant) is an open-source family of multimodal large language models that connects a frozen vision encoder to a pre-trained large language model through a single learned projection layer, trained with a technique its authors call visual instruction tuning. It was introduced in the April 2023 paper Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, a collaboration between the University of Wisconsin-Madison, Microsoft Research, and Columbia University, and accepted as an oral presentation at NeurIPS 2023.^[1] On a synthetic multimodal instruction-following dataset the first LLaVA reached an 85.1% relative score compared with GPT-4, and when fine-tuned on ScienceQA it set a new state of the art of 92.53% accuracy.^[1] The paper has been cited more than 7,300 times, with over 1,300 of those flagged as highly influential, making it one of the most-cited open multimodal works of the 2023-2024 wave.^[15]

LLaVA's central contribution is the visual instruction tuning recipe: using language-only GPT-4 to synthesize multimodal instruction-following dialogue from existing image annotations, then fine-tuning a vision language model on that data.^[1] The paper describes itself as "the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data," and defines its model as "LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding."^[1] The result is a system that follows free-form natural language instructions about images while remaining inexpensive enough to train on modest academic hardware. Subsequent releases (LLaVA-1.5 in October 2023, LLaVA-NeXT in January 2024, and LLaVA-OneVision in August 2024) refined the recipe and extended it to multi-image and video inputs.^[2]^[3]^[4]

Because the original architecture and training pipeline were so simple, LLaVA became a reference implementation for the open-source multimodal model ecosystem. Hundreds of follow-on projects (LLaVA-Med for biomedicine, LLaVA-Plus for tool use, LLaVA-RLHF for hallucination mitigation, MoE-LLaVA, LLaVA-Phi, and many others) have adopted the same CLIP-plus-projection-plus-LLM template and the visual instruction tuning data format.

What is the LLaVA architecture?

LLaVA's design is intentionally minimal. It has three components: a frozen vision encoder, a small trainable projection module, and an autoregressive language model. The original 2023 paper showed that this stripped-down combination is competitive with much heavier architectures such as Flamingo's gated cross-attention layers or BLIP-2's Q-Former, provided that visual instruction tuning data is used during fine-tuning.^[1] The follow-up paper found that "the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient."^[2]

Vision encoder

All LLaVA versions through 1.6 use the CLIP ViT-L/14 image encoder from Radford et al. (2021), released by OpenAI as openai/clip-vit-large-patch14 (224x224 input) and openai/clip-vit-large-patch14-336 (336x336 input, used from LLaVA-1.5 onward).^[10] With a patch size of 14 pixels, the encoder produces a sequence of patch embeddings (256 tokens at 224x224, 576 tokens at 336x336) that are projected and fed into the language model. The vision encoder is frozen during stage one of training and is either kept frozen or unfrozen during stage two, depending on the version. LLaVA-OneVision (Li et al., 2024) switched to the SigLIP encoder, which the authors found yielded higher LMM performance among open vision encoders.^[4]
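The token counts follow directly from the patch size: a ViT with 14-pixel patches splits a square image into (size / 14) squared patches. The short sketch below checks that arithmetic; the function name is illustrative, and it counts patch tokens only, leaving out the encoder's class token.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch embeddings a ViT produces for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    for size in (224, 336):
        print(f"{size}x{size}: {num_patch_tokens(size)} patch tokens")
```

Moving from 224x224 to 336x336 raises the visual sequence length by a factor of 2.25, which is part of the compute cost of higher resolution.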

Projection module

The projection maps each image patch embedding into the language model's word-embedding space, so the LLM can attend to visual tokens as if they were text tokens. In LLaVA-1.0 the projection was a single linear layer; in LLaVA-1.5 it was upgraded to a two-layer MLP, which the ablations in Improved Baselines (Liu et al., 2023b) showed gave consistent improvements at negligible cost.^[2] The projection is the only module that is trained during the alignment stage. LLaVA-OneVision retained the two-layer MLP design.^[4]
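As a rough illustration of the two connector designs, the sketch below implements a single linear projection and a two-layer MLP in plain Python on toy dimensions. The GELU activation between the two layers, the initialisation scale, and the tiny sizes (8 and 12 instead of the real encoder and LLM hidden sizes) are assumptions made for readability, not details taken from this article.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b for one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(values):
    """Exact GELU, computed with the Gaussian error function."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in values]


def init_layer(out_dim, in_dim, rng):
    """Small random weights and zero bias (illustrative initialisation)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear_projector(patch, layer):
    """LLaVA-1.0 style connector: one linear map into the LLM embedding space."""
    return linear(patch, *layer)


def mlp_projector(patch, layer1, layer2):
    """LLaVA-1.5 style connector: linear, non-linearity (GELU assumed), linear."""
    return linear(gelu(linear(patch, *layer1)), *layer2)


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 12  # toy sizes; real models use far larger widths
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(4)]

single = init_layer(LLM_DIM, VISION_DIM, rng)
first, second = init_layer(LLM_DIM, VISION_DIM, rng), init_layer(LLM_DIM, LLM_DIM, rng)

tokens_v1 = [linear_projector(p, single) for p in patches]
tokens_v15 = [mlp_projector(p, first, second) for p in patches]
```

Either way, each patch embedding becomes one vector of the LLM's embedding width, so the number of visual tokens is unchanged by the connector.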

Language model backbone

The original LLaVA used Vicuna-7B and Vicuna-13B (an instruction-tuned LLaMA variant) as the LLM.^[1] LLaVA-1.5 also used Vicuna 7B and 13B (v1.5 weights).^[2] LLaVA-NeXT extended the family to Mistral-7B, NousResearch's Nous-Hermes-2-Yi-34B, and later Llama-3 8B and Qwen-1.5 72B/110B.^[3] LLaVA-OneVision uses Qwen2 at 0.5B, 7B, and 72B parameter sizes.^[4] The language model receives a sequence that interleaves projected image tokens with the text prompt, and generates the response autoregressively in the same way as a text-only LLM.
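The interleaving step can be pictured as splicing the projected image tokens into the prompt at a placeholder position. The sketch below uses strings as stand-ins for embedding vectors; the `<image>` placeholder name, the word-level tokenisation, and the helper name are illustrative assumptions.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumed marker for where visual tokens go


def splice_image_tokens(prompt_tokens, image_tokens):
    """Replace each placeholder with the full run of projected image tokens."""
    sequence = []
    for token in prompt_tokens:
        if token == IMAGE_PLACEHOLDER:
            sequence.extend(image_tokens)
        else:
            sequence.append(token)
    return sequence


# Toy prompt: real models operate on embedding vectors, not strings.
prompt = ["USER:", IMAGE_PLACEHOLDER, "What", "is", "shown", "?", "ASSISTANT:"]
image_tokens = [f"<img_{i}>" for i in range(576)]  # one entry per 336x336 patch

sequence = splice_image_tokens(prompt, image_tokens)
print(len(sequence))
```

The language model then treats the combined sequence like any other context and generates the answer token by token after the final prompt token.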

What is visual instruction tuning?

The defining contribution of the LLaVA paper is the data and training recipe rather than the architecture. Liu et al. (2023a) observed that no large-scale dataset of multimodal instruction-following dialogues existed, so they used GPT-4 (text-only, with image content represented as captions and bounding boxes from COCO annotations) to synthesize one.^[1]

Synthetic data generation

GPT-4 was prompted with the image's textual representation (caption plus bounding boxes for objects) and asked to produce three styles of dialogue:

Conversation: a short turn-by-turn Q&A about the image's contents
Detailed description: a long descriptive paragraph
Complex reasoning: a multi-step reasoning question and answer that requires going beyond simple recognition

The original LLaVA-Instruct-158K dataset contains 158,000 examples in this format, derived from COCO images.^[1]
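The generation step can be sketched as follows: the image is rendered as plain text (captions plus one line per object box) and paired with an instruction for one of the three response styles. The function names, the instruction wording, and the normalised box format here are hypothetical illustrations, not the prompts used by the authors.

```python
# Hypothetical instruction text for each response style named above.
RESPONSE_STYLES = {
    "conversation": "Write a multi-turn Q&A between a user and an assistant about the image.",
    "detailed_description": "Write a long, detailed description of the image.",
    "complex_reasoning": "Write a question that needs multi-step reasoning, then answer it.",
}


def build_image_context(captions, boxes):
    """Render an image as text for a language-only model.

    boxes is a list of (category, (x1, y1, x2, y2)) with coordinates in [0, 1].
    """
    lines = list(captions)
    for category, (x1, y1, x2, y2) in boxes:
        lines.append(f"{category}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]")
    return "\n".join(lines)


def build_request(style, captions, boxes):
    """Pair a style instruction with the textual image context."""
    return {
        "system": RESPONSE_STYLES[style],
        "user": build_image_context(captions, boxes),
    }


request = build_request(
    "conversation",
    captions=["A dog runs on a beach."],
    boxes=[("dog", (0.1, 0.25, 0.5, 0.9))],
)
print(request["user"])
```

Because the text-only model never sees pixels, the quality of the synthetic dialogue depends on how much of the image the captions and boxes capture.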

Two-stage training

The original LLaVA paper used a two-stage training procedure:

Feature alignment: The vision encoder and LLM are frozen. Only the projection is trained, on filtered image-caption pairs from CC3M (about 595,000 pairs in the 1.0 release). The objective is straightforward next-token prediction of the caption given the image embedding. This teaches the projection to map image features into a region of embedding space the LLM understands.
End-to-end fine-tuning: The vision encoder remains frozen, while the projection and the language model are jointly trained on the LLaVA-Instruct-158K dataset. This step teaches the model to follow user instructions about images.^[1]
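The two stages differ mainly in which modules receive gradient updates. A minimal way to record that, assuming illustrative names for the stages and modules, is a small configuration table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    train_vision_encoder: bool
    train_projection: bool
    train_llm: bool

    def trainable(self):
        """Names of the modules that are updated in this stage."""
        flags = {
            "vision_encoder": self.train_vision_encoder,
            "projection": self.train_projection,
            "llm": self.train_llm,
        }
        return [name for name, enabled in flags.items() if enabled]


# Original LLaVA recipe as described above (dataset labels are descriptive only).
STAGES = [
    Stage("feature_alignment", "filtered CC3M, about 595K pairs", False, True, False),
    Stage("end_to_end_finetuning", "LLaVA-Instruct-158K", False, True, True),
]

for stage in STAGES:
    print(stage.name, "->", stage.trainable())
```

In a real training script the same flags would typically be applied by toggling gradient tracking on each module's parameters before building the optimizer.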

LLaVA-1.5 kept the two-stage structure but expanded the instruction tuning mix to 665K examples by adding academic VQA data (VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA), region-level grounding data (RefCOCO, Visual Genome), and a small ShareGPT text-only slice. The pretraining mix was reduced to a 558K LAION-CC-SBU subset re-captioned with BLIP. The 13B variant uses roughly 1.2M total publicly available examples and trains in about one day on a single 8x A100 node (Liu et al., 2023b).^[2]
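The "roughly 1.2M" figure is simply the sum of the two data mixes quoted above. The snippet below works through that arithmetic and compares it with the original release; the constant names are illustrative.

```python
# Example counts quoted in the text (rounded to the nearest thousand).
LLAVA_1_0 = {"pretrain": 595_000, "instruct": 158_000}
LLAVA_1_5 = {"pretrain": 558_000, "instruct": 665_000}

total_1_0 = sum(LLAVA_1_0.values())
total_1_5 = sum(LLAVA_1_5.values())

print(f"LLaVA-1.0 total: {total_1_0:,}")
print(f"LLaVA-1.5 total: {total_1_5:,}")
print(f"Instruction data growth: {LLAVA_1_5['instruct'] / LLAVA_1_0['instruct']:.1f}x")
```

Most of the growth comes from the instruction-tuning mix rather than from pretraining data, which is slightly smaller in LLaVA-1.5.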

When was each version released?

Version	Release date	Vision encoder	LLM backbone(s)	Resolution	Notable changes
LLaVA-1.0	April 2023 (NeurIPS 2023)	CLIP ViT-L/14	Vicuna 7B / 13B	224x224	First release; visual instruction tuning recipe; 158K GPT-4-generated examples
LLaVA-1.5	October 5, 2023	CLIP ViT-L/14-336px	Vicuna 7B / 13B (v1.5)	336x336	MLP projection (2-layer); academic VQA mix; SOTA on 11 of 12 benchmarks at release; ~1 day on 8x A100
LLaVA-NeXT (1.6)	January 30, 2024	CLIP ViT-L/14-336px	Vicuna 7B / 13B; Mistral 7B; Nous-Hermes-2-Yi-34B	up to 672x672 (AnyRes)	4x more pixels via dynamic resolution patching; better OCR and reasoning data; first version to exceed Gemini Pro on several benchmarks
LLaVA-NeXT-Video	April 30, 2024	CLIP ViT-L/14-336px	Vicuna / Mistral / Yi	per-frame AnyRes	Zero-shot transfer to video by treating frames as a sequence of images
LLaVA-NeXT (Stronger)	May 10, 2024	CLIP ViT-L/14-336px	Llama-3 8B; Qwen-1.5 72B / 110B	up to 672x672	New LLM backbones; pushed open-source MLLMs further
LLaVA-OneVision	August 6, 2024	SigLIP	Qwen2 0.5B / 7B / 72B	Higher AnyRes (up to 5x base tokens)	Single unified model for single-image, multi-image, and video; curriculum learning over four stages

How does LLaVA perform on benchmarks?

LLaVA-1.5 set the standard for open-source MLLM benchmarking on release. Below are key scores reported in the Improved Baselines paper (Liu et al., 2023b), using the standard 13B and 7B configurations.^[2]

Benchmark	LLaVA-1.5 7B	LLaVA-1.5 13B	InstructBLIP 13B	Qwen-VL-Chat 7B	BLIP-2 14B
VQAv2	78.5	80.0	49.5	78.2*	65.0
GQA	62.0	63.3	49.2	57.5*	41.0
VizWiz	50.0	53.6	33.4	38.9	19.6
ScienceQA-IMG	66.8	71.6	63.1	68.2	61.0
TextVQA	58.2	61.3	50.7	61.5	42.5
POPE (random)	87.3	87.1	78.9	not reported	not reported
MM-Vet	31.1	36.1	25.6	not reported	22.4
MMBench (en)	64.3	67.7	not reported	60.6	not reported
MMBench-CN	58.3	63.6	not reported	56.7	not reported
SEED-Bench	58.6	62.6	not reported	58.2	not reported
LLaVA-Wild	65.4	72.5	58.2	not reported	not reported
MME (perception)	1510.7	1531.3	1212.8	1487.5	1293.8

At release, LLaVA-1.5 13B achieved state-of-the-art results among open-source MLLMs on 11 of 12 benchmarks reported, using only publicly available data.^[2] (* indicates a training-set overlap or a different evaluation protocol; see the paper for caveats.)

LLaVA-NeXT continued the trend. The 34B variant (built on Nous-Hermes-2-Yi-34B) reported 51.1 on the MMMU validation set (up from 36.4 for LLaVA-1.5 13B), 44.7 on the MMMU test set, and 79.3 on MMBench-English (up from 67.7), and the project team wrote that "LLaVA-1.6 even exceeds Gemini Pro on several benchmarks" at the January 2024 release.^[3] LLaVA-OneVision 72B approaches GPT-4o on chart and document tasks, scoring 85.6 on AI2D against GPT-4V's 78.2, and matches commercial models on VideoMME (66.2) according to Li et al. (2024).^[4]

How does LLaVA differ from other vision-language models?

The table below situates the LLaVA family against the open and closed multimodal models that were active in the same 2023-2024 window.

Model	Vision encoder	LLM backbone	Pretraining data scale	Open weights	Year
LLaVA-1.0	CLIP ViT-L/14	Vicuna 7B / 13B	595K image-text + 158K instruction	Yes	2023
LLaVA-1.5	CLIP ViT-L/14-336px	Vicuna 7B / 13B (v1.5)	558K image-text + 665K instruction	Yes	2023
LLaVA-NeXT	CLIP ViT-L/14-336px	Vicuna / Mistral / Yi-34B	~1.3M total	Yes	2024
LLaVA-OneVision	SigLIP	Qwen2 0.5B / 7B / 72B	~9M (incl. multi-image, video)	Yes	2024
MiniGPT-4	EVA-CLIP ViT-G + Q-Former	Vicuna 7B / 13B	~5M image-text + 3.5K curated	Yes	2023
BLIP-2	EVA-CLIP / CLIP + Q-Former	OPT 2.7B-6.7B; FLAN-T5	~129M image-text	Yes	2023
InstructBLIP	EVA-CLIP + instruction-aware Q-Former	Vicuna 7B / 13B; FLAN-T5	26 datasets in instruction format	Yes	2023
Qwen-VL / Qwen-VL-Chat	OpenCLIP ViT-bigG	Qwen 7B	1.4B image-text pairs	Yes	2023
CogVLM	EVA2-CLIP-E	Vicuna-1.5 7B (with visual expert layers)	~1.5B image-text	Yes	2023
GPT-4V	not disclosed	GPT-4	not disclosed	No	2023
Gemini 1.0 / 1.5	not disclosed	Gemini	not disclosed	No	2023-2024

In short, the LLaVA lineage uses the lightest connector (a small MLP) and the smallest pretraining set, and relies on instruction tuning quality to make up the gap. BLIP-2 and InstructBLIP route image features through a heavier Q-Former and were pretrained on much more data.^[7]^[8] Qwen-VL and CogVLM were trained on more than a billion image-text pairs. The notable empirical result is that LLaVA-1.5 matched or exceeded those models on most evaluation benchmarks despite the gap in data scale, which is the central claim of the Improved Baselines paper.^[2]

Why was LLaVA significant?

LLaVA mattered for several overlapping reasons.

First, it showed that a simple architecture combining a frozen vision transformer encoder, a thin trainable projection, and a chat-tuned LLM is enough to build a competent visual assistant, given the right instruction data.^[1] Before LLaVA the assumption in the open-source community was that you needed Q-Formers, gated cross-attention, or comparable specialised modules. LLaVA pushed back on that assumption with hard benchmark numbers.

Second, it introduced visual instruction tuning as a named training paradigm (Liu et al., 2023a) and released the LLaVA-Instruct-158K dataset.^[1] The recipe (use a strong text-only LLM to bootstrap synthetic multimodal dialogues from existing image annotations) is now standard in the open MLLM literature. Almost every open vision-language assistant released after mid-2023 uses LLaVA's data, replicates the methodology with its own LLM, or extends it with new dialogue types.

Third, LLaVA was cheap. LLaVA-1.5 13B finishes its full training pipeline in about one day on a single 8x A100 node (Liu et al., 2023b), and the LLaVA-NeXT team reported that the compute and training-data cost of the recipe is "100-1000 times smaller than others."^[2]^[3] That cost profile put serious MLLM research within reach of academic labs that could not afford the kind of compute Google and OpenAI were spending on Gemini and GPT-4V. (The project blog has separately cited a figure near $200 of compute for the original 13B model; this lower figure is not documented in the peer-reviewed papers.)

Fourth, LLaVA became the template for an entire family of derivatives:

LLaVA-Med (Li et al., 2023): a biomedical assistant trained on PubMed Central figure-caption pairs in less than 15 hours on 8x A100; accepted as a NeurIPS 2023 Datasets and Benchmarks spotlight^[5]
LLaVA-RLHF: applied factually-augmented RLHF on 10K human preference labels to reduce hallucination
LLaVA-Plus (Liu et al., 2023, arXiv:2311.05437): added tool use, allowing the assistant to call external vision and vision-language models from a skill repository during a conversation^[6]
MoE-LLaVA: replaced the dense LLM with a Mixture-of-Experts variant
LLaVA-Phi: scaled the recipe down to a Phi-2 backbone for edge deployment
LLaVA-NeXT-Video and LLaVA-Video: extended the family to video understanding

The original LLaVA paper has been cited more than 7,300 times since publication, of which Semantic Scholar flags over 1,300 as highly influential citations.^[15]

What are LLaVA's limitations?

LLaVA inherits the failure modes of its components and adds a few of its own.

Hallucinations on detailed visual content. The model can confidently describe objects that are not in the image, especially for long-form descriptions. Liu et al. (2023b) added the POPE (Polling-based Object Probing Evaluation) benchmark to measure this and found that LLaVA-1.5 reduced but did not eliminate the problem.^[2] LLaVA-RLHF and follow-on hallucination evaluation work like On Evaluating Hallucinations in MLLMs explored this further.
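POPE-style evaluation reduces hallucination measurement to yes/no questions about whether an object is present, so the scores are ordinary binary classification metrics. The sketch below computes them for a toy set of answers; the function name and the example data are illustrative and do not come from the benchmark's own code.

```python
def yes_no_metrics(predictions, labels):
    """Accuracy, precision, recall, F1 and yes-ratio, with "yes" as the positive class."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),  # a high value suggests over-affirming
    }


# Toy example: the model says "yes" to one object that is not in the image.
labels = ["yes", "yes", "no", "no"]
predictions = ["yes", "yes", "yes", "no"]
metrics = yes_no_metrics(predictions, labels)
print(metrics)
```

A model that hallucinates objects tends to show a yes-ratio well above the share of true "yes" labels, together with reduced precision.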

Limited multi-image and video reasoning in pre-OneVision versions. LLaVA-1.0 through 1.6 were trained almost exclusively on single-image inputs; multi-image conversations and video understanding required either crude frame concatenation or separate models. LLaVA-OneVision (August 2024) was, in the authors' words, "the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios."^[4]

Resolution limits in early versions. LLaVA-1.5 used a fixed 336x336 input, which is too low for fine-grained OCR, chart reading, or dense scene understanding.^[2] The AnyRes patching scheme in LLaVA-NeXT supports up to 672x672, 336x1344, and 1344x336 resolutions for landscape and portrait images, which addressed this but added compute cost.^[3] LLaVA-OneVision's Higher AnyRes pushed this further by allocating up to 5x base tokens for single images while reducing per-frame tokens for video to keep memory tractable.^[4]
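The idea behind dynamic-resolution patching can be sketched as choosing, from a fixed set of tile grids, the one that preserves the most image area with the least padding, then encoding each 336x336 tile plus a downscaled overview. The code below is a simplified sketch under stated assumptions: the candidate list reads the three resolutions above as (width, height), the selection rule is a plausible heuristic rather than the project's exact code, and the token count is an upper bound that ignores padding removal and separator tokens.

```python
TILE = 336
TOKENS_PER_TILE = (TILE // 14) ** 2  # 576 patch tokens per 336x336 tile
CANDIDATES = [(672, 672), (336, 1344), (1344, 336)]  # assumed (width, height)


def select_resolution(width, height, candidates=CANDIDATES):
    """Pick the grid that keeps the most image area, breaking ties by least padding."""
    best, best_key = None, None
    for cand_w, cand_h in candidates:
        scale = min(cand_w / width, cand_h / height)  # fit inside, keep aspect ratio
        fit_w, fit_h = round(width * scale), round(height * scale)
        effective = min(fit_w * fit_h, width * height)  # no credit for upscaling
        wasted = cand_w * cand_h - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (cand_w, cand_h), key
    return best


def max_visual_tokens(resolution):
    """Upper bound: one set of tokens per tile plus one for the overview image."""
    cand_w, cand_h = resolution
    tiles = (cand_w // TILE) * (cand_h // TILE)
    return (tiles + 1) * TOKENS_PER_TILE


for size in [(1600, 400), (400, 1600), (800, 800)]:
    grid = select_resolution(*size)
    print(size, "->", grid, max_visual_tokens(grid), "tokens")
```

Under these assumptions every candidate grid holds four tiles, so the visual sequence grows from 576 tokens to at most 2,880, which is where the added compute cost comes from.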

English-centric training data in the original versions. LLaVA-1.5 reported an MMBench-CN score of 63.6 for the 13B model, but its Chinese capability was emergent rather than directly trained.^[2] LLaVA-NeXT introduced explicit zero-shot Chinese support, and LLaVA-OneVision, built on Qwen2 backbones, is stronger on non-English tasks.^[3]

The vision encoder is frozen in stage one and often in stage two. This means the model inherits whatever biases and blind spots CLIP brings, including limited spatial reasoning and difficulty with text rendered inside images. SigLIP in OneVision helps, but the architectural choice to keep the encoder small and frozen is a deliberate trade-off for training efficiency rather than maximum visual capability.

Is LLaVA open source and where can you get it?

LLaVA model weights are available on Hugging Face under the liuhaotian namespace, including liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b, liuhaotian/llava-v1.6-mistral-7b, and liuhaotian/llava-v1.6-34b. LLaVA-OneVision checkpoints (llava-hf/llava-onevision-qwen2-0.5b-ov-hf, llava-hf/llava-onevision-qwen2-7b-ov-hf, llava-hf/llava-onevision-qwen2-72b-ov-hf) are hosted under the llava-hf organisation.

The training and inference code is released under Apache 2.0 at github.com/haotian-liu/LLaVA.^[11] Hugging Face Transformers integrates LLaVA via the LlavaForConditionalGeneration class (and LlavaNextForConditionalGeneration for 1.6); LLaVA-NeXT-Video and LLaVA-OneVision have their own documentation pages.^[12]^[13] Quantized GGUF versions are available for local inference via llama.cpp, and Ollama distributes pre-built LLaVA models that can be pulled and run with a single command. The complete model zoo, training scripts, and instruction data are documented in the GitHub repository.^[11]
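For orientation, a minimal inference sketch with Hugging Face Transformers might look like the following. It rests on several assumptions not stated in this article: the Transformers-format checkpoint id llava-hf/llava-1.5-7b-hf, the "USER: ... ASSISTANT:" prompt template for LLaVA-1.5, and an environment with torch, transformers, accelerate, and Pillow installed plus enough GPU memory for a 7B model in half precision. The helper names are illustrative.

```python
def build_prompt(question: str) -> str:
    """Assumed LLaVA-1.5 chat format; <image> marks where visual tokens are inserted."""
    return f"USER: <image>\n{question} ASSISTANT:"


def describe_image(image_path: str, question: str,
                   model_id: str = "llava-hf/llava-1.5-7b-hf") -> str:
    """Load the model, run one question about one image, and return the decoded text."""
    # Heavy imports live inside the function so this module loads without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_prompt(question), return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)

    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # "example.jpg" is a placeholder path; replace it with a real image file.
    print(describe_image("example.jpg", "What is shown in this image?"))
```

The decoded string includes the prompt as well as the answer, so callers usually split on "ASSISTANT:" to isolate the reply. The model documentation pages cited above list the prompt format expected by each checkpoint.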

Users must comply with the original licenses for the underlying components: OpenAI's CLIP terms, the LLaMA and Vicuna community licenses for the LLM backbone, and the OpenAI Terms of Use for any data generated from GPT-4 (this is why the LLaVA-Instruct-158K dataset is research-only).

How influential is LLaVA?

LLaVA is one of the most-cited and most-replicated open-source vision language model projects of the 2023-2024 wave.^[14]^[15] The visual instruction tuning recipe is now standard in nearly every open MLLM release: Qwen-VL, InternVL, MiniCPM-V, DeepSeek-VL, Phi-3-Vision, Idefics, Molmo, and many others adopted variants of the same data format and training pipeline.

The LLaVA-Instruct-158K dataset and its successors (the academic VQA mix from 1.5, the OCR and reasoning extensions from 1.6, the multi-image and video data from OneVision) are widely reused as ingredients in other training mixtures. GitHub issues and Hugging Face discussions for new MLLM releases routinely cite LLaVA for ablation comparisons, recipe inspiration, or as a starting point for fine-tuning.

For anyone studying the open-source multimodal AI lineage, LLaVA is the project that demonstrated the field could move quickly and cheaply, and that simple recipes plus careful data curation can compete with much more elaborate approaches.

References

[1] Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023a). Visual Instruction Tuning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023, oral). arXiv:2304.08485. https://arxiv.org/abs/2304.08485
[2] Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023b). Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 (CVPR 2024 highlight). https://arxiv.org/abs/2310.03744
[3] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. Project blog, January 30, 2024. https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
[4] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., & Li, C. (2024). LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326. https://arxiv.org/abs/2408.03326
[5] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2023). LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. NeurIPS 2023 Datasets and Benchmarks (spotlight). arXiv:2306.00890. https://arxiv.org/abs/2306.00890
[6] Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., Zhang, L., Gao, J., & Li, C. (2023). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv:2311.05437. https://arxiv.org/abs/2311.05437
[7] Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597. https://arxiv.org/abs/2301.12597
[8] Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500. https://arxiv.org/abs/2305.06500
[9] Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592. https://arxiv.org/abs/2304.10592
[10] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020. https://arxiv.org/abs/2103.00020
[11] haotian-liu/LLaVA GitHub repository. https://github.com/haotian-liu/LLaVA
[12] Hugging Face Transformers documentation: LLaVA model. https://huggingface.co/docs/transformers/model_doc/llava
[13] Hugging Face Transformers documentation: LLaVA-NeXT model. https://huggingface.co/docs/transformers/model_doc/llava_next
[14] LLaVA project page. https://llava-vl.github.io/
[15] Semantic Scholar entry for Visual Instruction Tuning (Liu et al., 2023), citation and highly-influential-citation counts. https://www.semanticscholar.org/paper/Visual-Instruction-Tuning-Liu-Li/a5036f31f0e629dc661f120b8c3b1f374d479ab8
