# LLaVA (Large Language and Vision Assistant)

> Source: https://aiwiki.ai/wiki/llava
> Updated: 2026-06-23
> Categories: Multimodal AI, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**LLaVA** (Large Language and Vision Assistant) is an open-source family of [multimodal](/wiki/multimodal_ai) large language models that connects a frozen vision encoder to a pre-trained large language model through a single learned projection layer, trained with a technique its authors call visual instruction tuning. It was introduced in the April 2023 paper *Visual Instruction Tuning* by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, a collaboration between the University of Wisconsin-Madison, Microsoft Research, and Columbia University, and accepted as an oral presentation at NeurIPS 2023.[1] On a synthetic multimodal instruction-following dataset the first LLaVA reached an 85.1% relative score compared with GPT-4, and when fine-tuned on ScienceQA it set a new state of the art of 92.53% accuracy.[1] The paper has been cited more than 7,300 times, with over 1,300 of those flagged as highly influential, making it one of the most-cited open multimodal works of the 2023-2024 wave.[15]

LLaVA's central contribution is the visual instruction tuning recipe: using language-only GPT-4 to synthesize multimodal instruction-following dialogue from existing image annotations, then fine-tuning a [vision language model](/wiki/vision_language_model) on that data.[1] The paper describes itself as "the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data," and defines its model as "LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding."[1] The result is a system that follows free-form natural language instructions about images while remaining inexpensive enough to train on modest academic hardware. Subsequent releases (LLaVA-1.5 in October 2023, LLaVA-NeXT in January 2024, and LLaVA-OneVision in August 2024) refined the recipe and extended it to multi-image and video inputs.[2][3][4]

Because the original architecture and training pipeline were so simple, LLaVA became a reference implementation for the open-source [multimodal model](/wiki/multimodal_model) ecosystem. Hundreds of follow-on projects (LLaVA-Med for biomedicine, LLaVA-Plus for tool use, LLaVA-RLHF for hallucination mitigation, MoE-LLaVA, LLaVA-Phi, and many others) have adopted the same CLIP-plus-projection-plus-LLM template and the visual instruction tuning data format.

## What is the LLaVA architecture?

LLaVA's design is intentionally minimal. It has three components: a frozen vision encoder, a small trainable projection module, and an autoregressive language model. The original 2023 paper showed that this stripped-down combination is competitive with much heavier architectures such as Flamingo's gated cross-attention layers or BLIP-2's Q-Former, provided that visual instruction tuning data is used during fine-tuning.[1] The follow-up paper found that "the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient."[2]

### Vision encoder

All LLaVA versions through 1.6 use the [CLIP](/wiki/clip) ViT-L/14 image encoder from Radford et al. (2021), specifically the OpenAI checkpoint released as `openai/clip-vit-large-patch14`.[10] The encoder produces a sequence of patch embeddings (576 tokens at 224x224, 1,176 tokens at 336x336) that are fed into the language model. The vision encoder is frozen during stage one of training and either kept frozen or unfrozen during stage two, depending on the version. LLaVA-OneVision (Li et al., 2024) switched to the SigLIP encoder, which the authors found yielded higher LMM performance among open vision encoders.[4]

### Projection module

The projection maps each image patch embedding into the language model's word-embedding space, so the LLM can attend to visual tokens as if they were text tokens. In LLaVA-1.0 the projection was a single linear layer; in LLaVA-1.5 it was upgraded to a two-layer MLP, which the ablations in *Improved Baselines* (Liu et al., 2023b) showed gave consistent improvements at negligible cost.[2] The projection is the only module that is trained during the alignment stage. LLaVA-OneVision retained the two-layer MLP design.[4]

### Language model backbone

The original LLaVA used Vicuna-7B and Vicuna-13B (an instruction-tuned LLaMA variant) as the LLM.[1] LLaVA-1.5 also used Vicuna 7B and 13B (v1.5 weights).[2] LLaVA-NeXT extended the family to Mistral-7B, NousResearch's Nous-Hermes-2-Yi-34B, and later Llama-3 8B and Qwen-1.5 72B/110B.[3] LLaVA-OneVision uses Qwen2 at 0.5B, 7B, and 72B parameter sizes.[4] The language model receives a sequence that interleaves projected image tokens with the text prompt, and generates the response autoregressively in the same way as a text-only LLM.

## What is visual instruction tuning?

The defining contribution of the LLaVA paper is the data and training recipe rather than the architecture. Liu et al. (2023a) observed that no large-scale dataset of multimodal instruction-following dialogues existed, so they used GPT-4 (text-only, with image content represented as captions and bounding boxes from COCO annotations) to synthesize one.[1]

### Synthetic data generation

GPT-4 was prompted with the image's textual representation (caption plus bounding boxes for objects) and asked to produce three styles of dialogue:

- **Conversation**: a short turn-by-turn Q&A about the image's contents
- **Detailed description**: a long descriptive paragraph
- **Complex reasoning**: a multi-step reasoning question and answer that requires going beyond simple recognition

The original LLaVA-Instruct-158K dataset contains 158,000 examples in this format, derived from COCO images.[1]

### Two-stage training

The original LLaVA paper used a two-stage training procedure:

1. **Feature alignment**: The vision encoder and LLM are frozen. Only the projection is trained, on filtered image-caption pairs from CC3M (about 595,000 pairs in the 1.0 release). The objective is straightforward next-token prediction of the caption given the image embedding. This teaches the projection to map image features into a region of embedding space the LLM understands.
2. **End-to-end fine-tuning**: The vision encoder remains frozen, while the projection and the language model are jointly trained on the LLaVA-Instruct-158K dataset. This step teaches the model to follow user instructions about images.[1]

LLaVA-1.5 kept the two-stage structure but expanded the instruction tuning mix to 665K examples by adding academic VQA data (VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA), region-level grounding data (RefCOCO, Visual Genome), and a small ShareGPT text-only slice. The pretraining mix was reduced to a 558K LAION-CC-SBU subset re-captioned with BLIP. The 13B variant uses roughly 1.2M total publicly available examples and trains in about one day on a single 8x A100 node (Liu et al., 2023b).[2]

## When was each version released?

| Version | Release date | Vision encoder | LLM backbone(s) | Resolution | Notable changes |
|---|---|---|---|---|---|
| LLaVA-1.0 | April 2023 (NeurIPS 2023) | CLIP ViT-L/14 | Vicuna 7B / 13B | 224x224 | First release; visual instruction tuning recipe; 158K GPT-4-generated examples |
| LLaVA-1.5 | October 5, 2023 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 336x336 | MLP projection (2-layer); academic VQA mix; SOTA on 11 of 12 benchmarks at release; ~1 day on 8x A100 |
| LLaVA-NeXT (1.6) | January 30, 2024 | CLIP ViT-L/14-336px | Vicuna 7B / 13B; Mistral 7B; Nous-Hermes-2-Yi-34B | up to 672x672 (AnyRes) | 4x more pixels via dynamic resolution patching; better OCR and reasoning data; first version to exceed Gemini Pro on several benchmarks |
| LLaVA-NeXT (Stronger) | May 10, 2024 | CLIP ViT-L/14-336px | Llama-3 8B; Qwen-1.5 72B / 110B | up to 672x672 | New LLM backbones; pushed open-source MLLMs further |
| LLaVA-NeXT-Video | April 30, 2024 | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi | per-frame AnyRes | Zero-shot transfer to video by treating frames as a sequence of images |
| LLaVA-OneVision | August 6, 2024 | SigLIP | Qwen2 0.5B / 7B / 72B | Higher AnyRes (up to 5x base tokens) | Single unified model for single-image, multi-image, and video; curriculum learning over four stages |

## How does LLaVA perform on benchmarks?

LLaVA-1.5 set the standard for open-source MLLM benchmarking on release. Below are key scores reported in the *Improved Baselines* paper (Liu et al., 2023b), using the standard 13B and 7B configurations.[2]

| Benchmark | LLaVA-1.5 7B | LLaVA-1.5 13B | InstructBLIP 13B | Qwen-VL-Chat 7B | BLIP-2 14B |
|---|---|---|---|---|---|
| VQAv2 | 78.5 | 80.0 | 49.5 | 78.2* | 65.0 |
| GQA | 62.0 | 63.3 | 49.2 | 57.5* | 41.0 |
| VizWiz | 50.0 | 53.6 | 33.4 | 38.9 | 19.6 |
| ScienceQA-IMG | 66.8 | 71.6 | 63.1 | 68.2 | 61.0 |
| TextVQA | 58.2 | 61.3 | 50.7 | 61.5 | 42.5 |
| POPE (random) | 87.3 | 87.1 | 78.9 | not reported | not reported |
| MM-Vet | 31.1 | 36.1 | 25.6 | not reported | 22.4 |
| MMBench (en) | 64.3 | 67.7 | not reported | 60.6 | not reported |
| MMBench-CN | 58.3 | 63.6 | not reported | 56.7 | not reported |
| SEED-Bench | 58.6 | 62.6 | not reported | 58.2 | not reported |
| LLaVA-Wild | 65.4 | 72.5 | 58.2 | not reported | not reported |
| MME (perception) | 1510.7 | 1531.3 | 1212.8 | 1487.5 | 1293.8 |

At release, LLaVA-1.5 13B achieved state-of-the-art results among open-source MLLMs on 11 of 12 benchmarks reported, using only publicly available data.[2] (* indicates a training-set overlap or a different evaluation protocol; see the paper for caveats.)

LLaVA-NeXT continued the trend. The 34B variant (built on Nous-Hermes-2-Yi-34B) reported 51.1 on the MMMU validation set (44.7 on the test set, up from 36.4 for LLaVA-1.5 13B) and 79.3 on MMBench-English (up from 67.7), and the project team wrote that "LLaVA-1.6 even exceeds Gemini Pro on several benchmarks" at the January 2024 release.[3] LLaVA-OneVision 72B approaches GPT-4o on chart and document tasks, scoring 85.6 on AI2D against GPT-4V's 78.2, and matches commercial models on VideoMME (66.2) according to Li et al. (2024).[4]

## How does LLaVA differ from other vision-language models?

The table below situates LLaVA-1.5 against the open and closed multimodal models that were active in the same 2023-2024 window.

| Model | Vision encoder | LLM backbone | Pretraining data scale | Open weights | Year |
|---|---|---|---|---|---|
| LLaVA-1.0 | CLIP ViT-L/14 | Vicuna 7B / 13B | 595K image-text + 158K instruction | Yes | 2023 |
| LLaVA-1.5 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 558K image-text + 665K instruction | Yes | 2023 |
| LLaVA-NeXT | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi-34B | ~1.3M total | Yes | 2024 |
| LLaVA-OneVision | SigLIP | Qwen2 0.5B / 7B / 72B | ~9M (incl. multi-image, video) | Yes | 2024 |
| MiniGPT-4 | EVA-CLIP ViT-G + Q-Former | Vicuna 7B / 13B | ~5M image-text + 3.5K curated | Yes | 2023 |
| BLIP-2 | EVA-CLIP / CLIP + Q-Former | OPT 2.7B-6.7B; FLAN-T5 | ~129M image-text | Yes | 2023 |
| InstructBLIP | EVA-CLIP + instruction-aware Q-Former | Vicuna 7B / 13B; FLAN-T5 | 26 datasets in instruction format | Yes | 2023 |
| Qwen-VL / Qwen-VL-Chat | OpenCLIP ViT-bigG | Qwen 7B | 1.4B image-text pairs | Yes | 2023 |
| CogVLM | EVA2-CLIP-E | Vicuna-1.5 7B (with visual expert layers) | ~1.5B image-text | Yes | 2023 |
| GPT-4V | not disclosed | GPT-4 | not disclosed | No | 2023 |
| Gemini 1.0 / 1.5 | not disclosed | Gemini | not disclosed | No | 2023-2024 |

The shorthand summary: LLaVA's lineage uses the lightest connector (a small MLP) and the smallest pretraining set, and relies on instruction tuning quality to make up the gap. BLIP-2 and InstructBLIP route image features through a heavier Q-Former and were pretrained on much more data.[7][8] Qwen-VL and CogVLM trained on roughly a billion image-text pairs. The interesting empirical result is that LLaVA-1.5 was able to match or exceed those models on most evaluation benchmarks despite the data scale gap, which is the central claim of the *Improved Baselines* paper.[2]

## Why was LLaVA significant?

LLaVA mattered for several reasons that overlap.

First, it showed that a simple architecture combining a frozen [vision transformer](/wiki/vision_transformer_vit) encoder, a thin trainable projection, and a chat-tuned LLM is enough to build a competent visual assistant, given the right instruction data.[1] Before LLaVA the assumption in the open-source community was that you needed Q-Formers, gated cross-attention, or comparable specialised modules. LLaVA pushed back on that assumption with hard benchmark numbers.

Second, it introduced visual instruction tuning as a named training paradigm (Liu et al., 2023a) and released the LLaVA-Instruct-158K dataset.[1] The recipe (use a strong text-only LLM to bootstrap synthetic multimodal dialogues from existing image annotations) is now standard in the open MLLM literature. Almost every open vision-language assistant released after mid-2023 either uses LLaVA's data, replicates the methodology with their own LLM, or extends it with new dialogue types.

Third, LLaVA was cheap. LLaVA-1.5 13B finishes its full training pipeline in about one day on a single 8x A100 node (Liu et al., 2023b), and the LLaVA-NeXT team reported that the compute and training-data cost of the recipe is "100-1000 times smaller than others."[2][3] That cost profile put serious MLLM research within reach of academic labs that could not afford the kind of compute Google and OpenAI were spending on Gemini and GPT-4V. (The project blog has separately cited a figure near $200 of compute for the original 13B model; this lower figure is not documented in the peer-reviewed papers.)

Fourth, LLaVA became the template for an entire family of derivatives:

- **LLaVA-Med** (Li et al., 2023): a biomedical assistant trained on PubMed Central figure-caption pairs in less than 15 hours on 8x A100; accepted as a NeurIPS 2023 Datasets and Benchmarks spotlight[5]
- **LLaVA-RLHF**: applied factually-augmented RLHF on 10K human preference labels to reduce hallucination
- **LLaVA-Plus** (Liu et al., 2023, arXiv:2311.05437): added tool use, allowing the assistant to call external vision and vision-language models from a skill repository during a conversation[6]
- **MoE-LLaVA**: replaced the dense LLM with a Mixture-of-Experts variant
- **LLaVA-Phi**: scaled the recipe down to a Phi-2 backbone for edge deployment
- **LLaVA-NeXT-Video** and **LLaVA-Video**: extended the family to video understanding

The original LLaVA paper has been cited more than 7,300 times since publication, of which Semantic Scholar flags over 1,300 as highly influential citations.[15]

## What are LLaVA's limitations?

LLaVA inherits the failure modes of its components and adds a few of its own.

**Hallucinations on detailed visual content**. The model can confidently describe objects that are not in the image, especially for long-form descriptions. Liu et al. (2023b) added the POPE (Polling-based Object Probing Evaluation) benchmark to measure this and found that LLaVA-1.5 reduced but did not eliminate the problem.[2] LLaVA-RLHF and follow-on hallucination evaluation work like *On Evaluating Hallucinations in MLLMs* explored this further.

**Limited multi-image and video reasoning** in pre-OneVision versions. LLaVA-1.0 through 1.6 were trained almost exclusively on single-image inputs; multi-image conversations and video understanding required either crude frame concatenation or separate models. LLaVA-OneVision (August 2024) was, in the authors' words, "the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios."[4]

**Resolution limits in early versions**. LLaVA-1.5 used a fixed 336x336 input, which is too low for fine-grained OCR, chart reading, or dense scene understanding.[2] The AnyRes patching scheme in LLaVA-NeXT supports up to 672x672, 336x1344, and 1344x336 resolutions for landscape and portrait images, which addressed this but added compute cost.[3] LLaVA-OneVision's Higher AnyRes pushed this further by allocating up to 5x base tokens for single images while reducing per-frame tokens for video to keep memory tractable.[4]

**English-centric training data** in the original versions. LLaVA-1.5 reported MMBench-CN scores of 63.6 (13B), but Chinese capability was emergent rather than directly trained.[2] LLaVA-NeXT introduced explicit zero-shot Chinese support; later versions with Qwen2 backbones in OneVision are stronger on non-English tasks.[3]

**The vision encoder is frozen** in stage one and often in stage two. This means the model inherits whatever biases and blind spots CLIP brings, including limited spatial reasoning and difficulty with text rendered inside images. SigLIP in OneVision helps, but the architectural choice to keep the encoder small and frozen is a deliberate trade-off for training efficiency rather than maximum visual capability.

## Is LLaVA open source and where can you get it?

LLaVA model weights are available on Hugging Face under the `liuhaotian` namespace, including `liuhaotian/llava-v1.5-7b`, `liuhaotian/llava-v1.5-13b`, `liuhaotian/llava-v1.6-vicuna-7b`, `liuhaotian/llava-v1.6-vicuna-13b`, `liuhaotian/llava-v1.6-mistral-7b`, and `liuhaotian/llava-v1.6-34b`. LLaVA-OneVision checkpoints (`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-72b-ov-hf`) are hosted under the `llava-hf` organisation.

The training and inference code is released under Apache 2.0 at github.com/haotian-liu/LLaVA.[11] Hugging Face Transformers integrates LLaVA via the `LlavaForConditionalGeneration` class (and `LlavaNextForConditionalGeneration` for 1.6), with LLaVA-NeXT-Video and LLaVA-OneVision having their own model docs.[12][13] Quantized GGUF versions are available for local inference via llama.cpp; Ollama ships pre-built LLaVA images for one-line installation. The complete model zoo, training scripts, and instruction data are documented in the GitHub repository.[11]

Users must comply with the original licenses for the underlying components: OpenAI's CLIP terms, the LLaMA and Vicuna community licenses for the LLM backbone, and the OpenAI Terms of Use for any data generated from GPT-4 (this is why the LLaVA-Instruct-158K dataset is research-only).

## How influential is LLaVA?

LLaVA is one of the most-cited and most-replicated open-source [vision language model](/wiki/vision_language_model) projects of the 2023-2024 wave.[14][15] The visual instruction tuning recipe is now standard in nearly every open MLLM release: Qwen-VL, InternVL, MiniCPM-V, DeepSeek-VL, Phi-3-Vision, Idefics, Molmo, and many others adopted variants of the same data format and training pipeline.

The LLaVA-Instruct-158K dataset and its successors (the academic VQA mix from 1.5, the OCR and reasoning extensions from 1.6, the multi-image and video data from OneVision) are widely reused as ingredients in other training mixtures. GitHub issues and Hugging Face discussions for new MLLM releases routinely cite LLaVA for ablation comparisons, recipe inspiration, or as a starting point for fine-tuning.

For anyone studying the open-source multimodal AI lineage, LLaVA is the project that demonstrated the field could move quickly and cheaply, and that simple recipes plus careful data curation can compete with much more elaborate approaches.

## References

1. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023a). Visual Instruction Tuning. *Advances in Neural Information Processing Systems 36 (NeurIPS 2023, oral)*. arXiv:2304.08485. https://arxiv.org/abs/2304.08485
2. Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023b). Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744 (CVPR 2024 highlight). https://arxiv.org/abs/2310.03744
3. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. Project blog, January 30, 2024. https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
4. Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., & Li, C. (2024). LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326. https://arxiv.org/abs/2408.03326
5. Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2023). LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. NeurIPS 2023 Datasets and Benchmarks (spotlight). arXiv:2306.00890. https://arxiv.org/abs/2306.00890
6. Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., Zhang, L., Gao, J., & Li, C. (2023). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv:2311.05437. https://arxiv.org/abs/2311.05437
7. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. *ICML 2023*. arXiv:2301.12597. https://arxiv.org/abs/2301.12597
8. Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., & Hoi, S. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500. https://arxiv.org/abs/2305.06500
9. Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592. https://arxiv.org/abs/2304.10592
10. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). *ICML 2021*. arXiv:2103.00020. https://arxiv.org/abs/2103.00020
11. haotian-liu/LLaVA GitHub repository. https://github.com/haotian-liu/LLaVA
12. Hugging Face Transformers documentation: LLaVA model. https://huggingface.co/docs/transformers/model_doc/llava
13. Hugging Face Transformers documentation: LLaVA-NeXT model. https://huggingface.co/docs/transformers/model_doc/llava_next
14. LLaVA project page. https://llava-vl.github.io/
15. Semantic Scholar entry for Visual Instruction Tuning (Liu et al., 2023), citation and highly-influential-citation counts. https://www.semanticscholar.org/paper/Visual-Instruction-Tuning-Liu-Li/a5036f31f0e629dc661f120b8c3b1f374d479ab8

