LLaVA (Large Language and Vision Assistant)
Last reviewed
Sources
15 citations
Review status
Source-backed
Revision
v4 ยท 3,295 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
15 citations
Review status
Source-backed
Revision
v4 ยท 3,295 words
Add missing citations, update stale details, or suggest a clearer explanation.
LLaVA (Large Language and Vision Assistant) is an open-source family of multimodal large language models that connects a frozen vision encoder to a pre-trained large language model through a single learned projection layer, trained with a technique its authors call visual instruction tuning. It was introduced in the April 2023 paper Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, a collaboration between the University of Wisconsin-Madison, Microsoft Research, and Columbia University, and accepted as an oral presentation at NeurIPS 2023.[1] On a synthetic multimodal instruction-following dataset the first LLaVA reached an 85.1% relative score compared with GPT-4, and when fine-tuned on ScienceQA it set a new state of the art of 92.53% accuracy.[1] The paper has been cited more than 7,300 times, with over 1,300 of those flagged as highly influential, making it one of the most-cited open multimodal works of the 2023-2024 wave.[15]
LLaVA's central contribution is the visual instruction tuning recipe: using language-only GPT-4 to synthesize multimodal instruction-following dialogue from existing image annotations, then fine-tuning a vision language model on that data.[1] The paper describes itself as "the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data," and defines its model as "LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding."[1] The result is a system that follows free-form natural language instructions about images while remaining inexpensive enough to train on modest academic hardware. Subsequent releases (LLaVA-1.5 in October 2023, LLaVA-NeXT in January 2024, and LLaVA-OneVision in August 2024) refined the recipe and extended it to multi-image and video inputs.[2][3][4]
Because the original architecture and training pipeline were so simple, LLaVA became a reference implementation for the open-source multimodal model ecosystem. Hundreds of follow-on projects (LLaVA-Med for biomedicine, LLaVA-Plus for tool use, LLaVA-RLHF for hallucination mitigation, MoE-LLaVA, LLaVA-Phi, and many others) have adopted the same CLIP-plus-projection-plus-LLM template and the visual instruction tuning data format.
LLaVA's design is intentionally minimal. It has three components: a frozen vision encoder, a small trainable projection module, and an autoregressive language model. The original 2023 paper showed that this stripped-down combination is competitive with much heavier architectures such as Flamingo's gated cross-attention layers or BLIP-2's Q-Former, provided that visual instruction tuning data is used during fine-tuning.[1] The follow-up paper found that "the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient."[2]
All LLaVA versions through 1.6 use the CLIP ViT-L/14 image encoder from Radford et al. (2021), specifically the OpenAI checkpoint released as openai/clip-vit-large-patch14.[10] The encoder produces a sequence of patch embeddings (576 tokens at 224x224, 1,176 tokens at 336x336) that are fed into the language model. The vision encoder is frozen during stage one of training and either kept frozen or unfrozen during stage two, depending on the version. LLaVA-OneVision (Li et al., 2024) switched to the SigLIP encoder, which the authors found yielded higher LMM performance among open vision encoders.[4]
The projection maps each image patch embedding into the language model's word-embedding space, so the LLM can attend to visual tokens as if they were text tokens. In LLaVA-1.0 the projection was a single linear layer; in LLaVA-1.5 it was upgraded to a two-layer MLP, which the ablations in Improved Baselines (Liu et al., 2023b) showed gave consistent improvements at negligible cost.[2] The projection is the only module that is trained during the alignment stage. LLaVA-OneVision retained the two-layer MLP design.[4]
The original LLaVA used Vicuna-7B and Vicuna-13B (an instruction-tuned LLaMA variant) as the LLM.[1] LLaVA-1.5 also used Vicuna 7B and 13B (v1.5 weights).[2] LLaVA-NeXT extended the family to Mistral-7B, NousResearch's Nous-Hermes-2-Yi-34B, and later Llama-3 8B and Qwen-1.5 72B/110B.[3] LLaVA-OneVision uses Qwen2 at 0.5B, 7B, and 72B parameter sizes.[4] The language model receives a sequence that interleaves projected image tokens with the text prompt, and generates the response autoregressively in the same way as a text-only LLM.
The defining contribution of the LLaVA paper is the data and training recipe rather than the architecture. Liu et al. (2023a) observed that no large-scale dataset of multimodal instruction-following dialogues existed, so they used GPT-4 (text-only, with image content represented as captions and bounding boxes from COCO annotations) to synthesize one.[1]
GPT-4 was prompted with the image's textual representation (caption plus bounding boxes for objects) and asked to produce three styles of dialogue:
The original LLaVA-Instruct-158K dataset contains 158,000 examples in this format, derived from COCO images.[1]
The original LLaVA paper used a two-stage training procedure:
LLaVA-1.5 kept the two-stage structure but expanded the instruction tuning mix to 665K examples by adding academic VQA data (VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA), region-level grounding data (RefCOCO, Visual Genome), and a small ShareGPT text-only slice. The pretraining mix was reduced to a 558K LAION-CC-SBU subset re-captioned with BLIP. The 13B variant uses roughly 1.2M total publicly available examples and trains in about one day on a single 8x A100 node (Liu et al., 2023b).[2]
| Version | Release date | Vision encoder | LLM backbone(s) | Resolution | Notable changes |
|---|---|---|---|---|---|
| LLaVA-1.0 | April 2023 (NeurIPS 2023) | CLIP ViT-L/14 | Vicuna 7B / 13B | 224x224 | First release; visual instruction tuning recipe; 158K GPT-4-generated examples |
| LLaVA-1.5 | October 5, 2023 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 336x336 | MLP projection (2-layer); academic VQA mix; SOTA on 11 of 12 benchmarks at release; ~1 day on 8x A100 |
| LLaVA-NeXT (1.6) | January 30, 2024 | CLIP ViT-L/14-336px | Vicuna 7B / 13B; Mistral 7B; Nous-Hermes-2-Yi-34B | up to 672x672 (AnyRes) | 4x more pixels via dynamic resolution patching; better OCR and reasoning data; first version to exceed Gemini Pro on several benchmarks |
| LLaVA-NeXT (Stronger) | May 10, 2024 | CLIP ViT-L/14-336px | Llama-3 8B; Qwen-1.5 72B / 110B | up to 672x672 | New LLM backbones; pushed open-source MLLMs further |
| LLaVA-NeXT-Video | April 30, 2024 | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi | per-frame AnyRes | Zero-shot transfer to video by treating frames as a sequence of images |
| LLaVA-OneVision | August 6, 2024 | SigLIP | Qwen2 0.5B / 7B / 72B | Higher AnyRes (up to 5x base tokens) | Single unified model for single-image, multi-image, and video; curriculum learning over four stages |
LLaVA-1.5 set the standard for open-source MLLM benchmarking on release. Below are key scores reported in the Improved Baselines paper (Liu et al., 2023b), using the standard 13B and 7B configurations.[2]
| Benchmark | LLaVA-1.5 7B | LLaVA-1.5 13B | InstructBLIP 13B | Qwen-VL-Chat 7B | BLIP-2 14B |
|---|---|---|---|---|---|
| VQAv2 | 78.5 | 80.0 | 49.5 | 78.2* | 65.0 |
| GQA | 62.0 | 63.3 | 49.2 | 57.5* | 41.0 |
| VizWiz | 50.0 | 53.6 | 33.4 | 38.9 | 19.6 |
| ScienceQA-IMG | 66.8 | 71.6 | 63.1 | 68.2 | 61.0 |
| TextVQA | 58.2 | 61.3 | 50.7 | 61.5 | 42.5 |
| POPE (random) | 87.3 | 87.1 | 78.9 | not reported | not reported |
| MM-Vet | 31.1 | 36.1 | 25.6 | not reported | 22.4 |
| MMBench (en) | 64.3 | 67.7 | not reported | 60.6 | not reported |
| MMBench-CN | 58.3 | 63.6 | not reported | 56.7 | not reported |
| SEED-Bench | 58.6 | 62.6 | not reported | 58.2 | not reported |
| LLaVA-Wild | 65.4 | 72.5 | 58.2 | not reported | not reported |
| MME (perception) | 1510.7 | 1531.3 | 1212.8 | 1487.5 | 1293.8 |
At release, LLaVA-1.5 13B achieved state-of-the-art results among open-source MLLMs on 11 of 12 benchmarks reported, using only publicly available data.[2] (* indicates a training-set overlap or a different evaluation protocol; see the paper for caveats.)
LLaVA-NeXT continued the trend. The 34B variant (built on Nous-Hermes-2-Yi-34B) reported 51.1 on the MMMU validation set (44.7 on the test set, up from 36.4 for LLaVA-1.5 13B) and 79.3 on MMBench-English (up from 67.7), and the project team wrote that "LLaVA-1.6 even exceeds Gemini Pro on several benchmarks" at the January 2024 release.[3] LLaVA-OneVision 72B approaches GPT-4o on chart and document tasks, scoring 85.6 on AI2D against GPT-4V's 78.2, and matches commercial models on VideoMME (66.2) according to Li et al. (2024).[4]
The table below situates LLaVA-1.5 against the open and closed multimodal models that were active in the same 2023-2024 window.
| Model | Vision encoder | LLM backbone | Pretraining data scale | Open weights | Year |
|---|---|---|---|---|---|
| LLaVA-1.0 | CLIP ViT-L/14 | Vicuna 7B / 13B | 595K image-text + 158K instruction | Yes | 2023 |
| LLaVA-1.5 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 558K image-text + 665K instruction | Yes | 2023 |
| LLaVA-NeXT | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi-34B | ~1.3M total | Yes | 2024 |
| LLaVA-OneVision | SigLIP | Qwen2 0.5B / 7B / 72B | ~9M (incl. multi-image, video) | Yes | 2024 |
| MiniGPT-4 | EVA-CLIP ViT-G + Q-Former | Vicuna 7B / 13B | ~5M image-text + 3.5K curated | Yes | 2023 |
| BLIP-2 | EVA-CLIP / CLIP + Q-Former | OPT 2.7B-6.7B; FLAN-T5 | ~129M image-text | Yes | 2023 |
| InstructBLIP | EVA-CLIP + instruction-aware Q-Former | Vicuna 7B / 13B; FLAN-T5 | 26 datasets in instruction format | Yes | 2023 |
| Qwen-VL / Qwen-VL-Chat | OpenCLIP ViT-bigG | Qwen 7B | 1.4B image-text pairs | Yes | 2023 |
| CogVLM | EVA2-CLIP-E | Vicuna-1.5 7B (with visual expert layers) | ~1.5B image-text | Yes | 2023 |
| GPT-4V | not disclosed | GPT-4 | not disclosed | No | 2023 |
| Gemini 1.0 / 1.5 | not disclosed | Gemini | not disclosed | No | 2023-2024 |
The shorthand summary: LLaVA's lineage uses the lightest connector (a small MLP) and the smallest pretraining set, and relies on instruction tuning quality to make up the gap. BLIP-2 and InstructBLIP route image features through a heavier Q-Former and were pretrained on much more data.[7][8] Qwen-VL and CogVLM trained on roughly a billion image-text pairs. The interesting empirical result is that LLaVA-1.5 was able to match or exceed those models on most evaluation benchmarks despite the data scale gap, which is the central claim of the Improved Baselines paper.[2]
LLaVA mattered for several reasons that overlap.
First, it showed that a simple architecture combining a frozen vision transformer encoder, a thin trainable projection, and a chat-tuned LLM is enough to build a competent visual assistant, given the right instruction data.[1] Before LLaVA the assumption in the open-source community was that you needed Q-Formers, gated cross-attention, or comparable specialised modules. LLaVA pushed back on that assumption with hard benchmark numbers.
Second, it introduced visual instruction tuning as a named training paradigm (Liu et al., 2023a) and released the LLaVA-Instruct-158K dataset.[1] The recipe (use a strong text-only LLM to bootstrap synthetic multimodal dialogues from existing image annotations) is now standard in the open MLLM literature. Almost every open vision-language assistant released after mid-2023 either uses LLaVA's data, replicates the methodology with their own LLM, or extends it with new dialogue types.
Third, LLaVA was cheap. LLaVA-1.5 13B finishes its full training pipeline in about one day on a single 8x A100 node (Liu et al., 2023b), and the LLaVA-NeXT team reported that the compute and training-data cost of the recipe is "100-1000 times smaller than others."[2][3] That cost profile put serious MLLM research within reach of academic labs that could not afford the kind of compute Google and OpenAI were spending on Gemini and GPT-4V. (The project blog has separately cited a figure near $200 of compute for the original 13B model; this lower figure is not documented in the peer-reviewed papers.)
Fourth, LLaVA became the template for an entire family of derivatives:
The original LLaVA paper has been cited more than 7,300 times since publication, of which Semantic Scholar flags over 1,300 as highly influential citations.[15]
LLaVA inherits the failure modes of its components and adds a few of its own.
Hallucinations on detailed visual content. The model can confidently describe objects that are not in the image, especially for long-form descriptions. Liu et al. (2023b) added the POPE (Polling-based Object Probing Evaluation) benchmark to measure this and found that LLaVA-1.5 reduced but did not eliminate the problem.[2] LLaVA-RLHF and follow-on hallucination evaluation work like On Evaluating Hallucinations in MLLMs explored this further.
Limited multi-image and video reasoning in pre-OneVision versions. LLaVA-1.0 through 1.6 were trained almost exclusively on single-image inputs; multi-image conversations and video understanding required either crude frame concatenation or separate models. LLaVA-OneVision (August 2024) was, in the authors' words, "the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios."[4]
Resolution limits in early versions. LLaVA-1.5 used a fixed 336x336 input, which is too low for fine-grained OCR, chart reading, or dense scene understanding.[2] The AnyRes patching scheme in LLaVA-NeXT supports up to 672x672, 336x1344, and 1344x336 resolutions for landscape and portrait images, which addressed this but added compute cost.[3] LLaVA-OneVision's Higher AnyRes pushed this further by allocating up to 5x base tokens for single images while reducing per-frame tokens for video to keep memory tractable.[4]
English-centric training data in the original versions. LLaVA-1.5 reported MMBench-CN scores of 63.6 (13B), but Chinese capability was emergent rather than directly trained.[2] LLaVA-NeXT introduced explicit zero-shot Chinese support; later versions with Qwen2 backbones in OneVision are stronger on non-English tasks.[3]
The vision encoder is frozen in stage one and often in stage two. This means the model inherits whatever biases and blind spots CLIP brings, including limited spatial reasoning and difficulty with text rendered inside images. SigLIP in OneVision helps, but the architectural choice to keep the encoder small and frozen is a deliberate trade-off for training efficiency rather than maximum visual capability.
LLaVA model weights are available on Hugging Face under the liuhaotian namespace, including liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b, liuhaotian/llava-v1.6-mistral-7b, and liuhaotian/llava-v1.6-34b. LLaVA-OneVision checkpoints (llava-hf/llava-onevision-qwen2-0.5b-ov-hf, llava-hf/llava-onevision-qwen2-7b-ov-hf, llava-hf/llava-onevision-qwen2-72b-ov-hf) are hosted under the llava-hf organisation.
The training and inference code is released under Apache 2.0 at github.com/haotian-liu/LLaVA.[11] Hugging Face Transformers integrates LLaVA via the LlavaForConditionalGeneration class (and LlavaNextForConditionalGeneration for 1.6), with LLaVA-NeXT-Video and LLaVA-OneVision having their own model docs.[12][13] Quantized GGUF versions are available for local inference via llama.cpp; Ollama ships pre-built LLaVA images for one-line installation. The complete model zoo, training scripts, and instruction data are documented in the GitHub repository.[11]
Users must comply with the original licenses for the underlying components: OpenAI's CLIP terms, the LLaMA and Vicuna community licenses for the LLM backbone, and the OpenAI Terms of Use for any data generated from GPT-4 (this is why the LLaVA-Instruct-158K dataset is research-only).
LLaVA is one of the most-cited and most-replicated open-source vision language model projects of the 2023-2024 wave.[14][15] The visual instruction tuning recipe is now standard in nearly every open MLLM release: Qwen-VL, InternVL, MiniCPM-V, DeepSeek-VL, Phi-3-Vision, Idefics, Molmo, and many others adopted variants of the same data format and training pipeline.
The LLaVA-Instruct-158K dataset and its successors (the academic VQA mix from 1.5, the OCR and reasoning extensions from 1.6, the multi-image and video data from OneVision) are widely reused as ingredients in other training mixtures. GitHub issues and Hugging Face discussions for new MLLM releases routinely cite LLaVA for ablation comparisons, recipe inspiration, or as a starting point for fine-tuning.
For anyone studying the open-source multimodal AI lineage, LLaVA is the project that demonstrated the field could move quickly and cheaply, and that simple recipes plus careful data curation can compete with much more elaborate approaches.