LLaVA (Large Language and Vision Assistant)
LLaVA (Large Language and Vision Assistant) is an open-source family of multimodal large language models that combines a pre-trained vision encoder with a pre-trained large language model through a learned projection layer. The model was introduced in the April 2023 paper Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, a collaboration between the University of Wisconsin-Madison, Microsoft Research, and Columbia University. The paper was accepted as an oral presentation at NeurIPS 2023 (Liu et al., 2023a).
LLaVA's central contribution is the visual instruction tuning recipe: using language-only GPT-4 to synthesize multimodal instruction-following dialogue from existing image annotations, then fine-tuning a vision-language model on that data. The result is a system that follows free-form natural language instructions about images while remaining inexpensive enough to train on modest academic hardware. Subsequent releases (LLaVA-1.5 in October 2023, LLaVA-NeXT in January 2024, and LLaVA-OneVision in August 2024) refined the recipe and extended it to multi-image and video inputs.
Because the original architecture and training pipeline were so simple, LLaVA became a reference implementation for the open-source multimodal model ecosystem. Hundreds of follow-on projects (LLaVA-Med for biomedicine, LLaVA-Plus for tool use, LLaVA-RLHF for hallucination mitigation, MoE-LLaVA, LLaVA-Phi, and many others) have adopted the same CLIP-plus-projection-plus-LLM template and the visual instruction tuning data format.
Architecture
LLaVA's design is intentionally minimal. It has three components: a frozen vision encoder, a small trainable projection module, and an autoregressive language model. The original 2023 paper showed that this stripped-down combination is competitive with much heavier architectures such as Flamingo's gated cross-attention layers or BLIP-2's Q-Former, provided that visual instruction tuning data is used during fine-tuning.
Vision encoder
All LLaVA versions through 1.6 use the CLIP ViT-L/14 image encoder from Radford et al. (2021), specifically the OpenAI checkpoints openai/clip-vit-large-patch14 (224x224 input, LLaVA-1.0) and openai/clip-vit-large-patch14-336 (336x336 input, LLaVA-1.5 onward). With 14x14-pixel patches, the encoder produces a sequence of patch embeddings (256 tokens at 224x224, 576 tokens at 336x336) that are projected and fed into the language model. The vision encoder is frozen during stage one of training and either kept frozen or unfrozen during stage two, depending on the version. LLaVA-OneVision (Li et al., 2024) switched to the SigLIP encoder, which the authors found yielded higher LMM performance among open vision encoders.
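The number of visual tokens follows directly from the patch size. The following is a minimal sketch of that arithmetic, assuming square inputs and counting only patch embeddings (the class token is left out); the function name is hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch embeddings a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size  # patches along one edge
    return per_side * per_side


for size in (224, 336):
    print(size, num_patch_tokens(size))
```

Because the count grows with the square of the resolution, moving from 224x224 to 336x336 more than doubles the visual sequence length the language model has to process.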
Projection module
The projection maps each image patch embedding into the language model's word-embedding space, so the LLM can attend to visual tokens as if they were text tokens. In LLaVA-1.0 the projection was a single linear layer; in LLaVA-1.5 it was upgraded to a two-layer MLP, which the ablations in Improved Baselines (Liu et al., 2023b) showed gave consistent improvements at negligible cost. The projection is the only module that is trained during the alignment stage. LLaVA-OneVision retained the two-layer MLP design.
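The difference between the two projector designs can be illustrated with a small dependency-free sketch. The toy dimensions, the initialisation, and the GELU placed between the two layers are assumptions made for illustration, not a reproduction of the released code.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero bias (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def make_projector(rng, vision_dim, llm_dim, mlp=True):
    """Return a function mapping patch embeddings into the LLM embedding space."""
    w1, b1 = init_linear(rng, vision_dim, llm_dim)
    if not mlp:
        # Single linear layer, as in the first release.
        return lambda patches: [linear(p, w1, b1) for p in patches]
    w2, b2 = init_linear(rng, llm_dim, llm_dim)

    def project(patches):
        # Two-layer MLP: linear, non-linearity, linear.
        hidden = [[gelu(v) for v in linear(p, w1, b1)] for p in patches]
        return [linear(h, w2, b2) for h in hidden]

    return project


rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 patches, toy vision_dim=8
projected = make_projector(rng, vision_dim=8, llm_dim=16)(patches)
print(len(projected), len(projected[0]))  # 4 16
```

Either variant keeps one output vector per patch, so the number of visual tokens is unchanged; only the width of each vector is adapted to the language model.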
Language model backbone
The original LLaVA used Vicuna-7B and Vicuna-13B (an instruction-tuned LLaMA variant) as the LLM. LLaVA-1.5 also used Vicuna 7B and 13B (v1.5 weights). LLaVA-NeXT extended the family to Mistral-7B, NousResearch's Nous-Hermes-2-Yi-34B, and later Llama-3 8B and Qwen-1.5 72B/110B. LLaVA-OneVision uses Qwen2 at 0.5B, 7B, and 72B parameter sizes. The language model receives a sequence that interleaves projected image tokens with the text prompt, and generates the response autoregressively in the same way as a text-only LLM.
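How the interleaved sequence is assembled can be sketched as follows. Strings stand in for embedding vectors, and the placeholder token and helper name are hypothetical choices made for this example.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image goes


def build_input_sequence(prompt_tokens, image_tokens):
    """Replace each placeholder in the prompt with the projected image tokens."""
    sequence = []
    for tok in prompt_tokens:
        if tok == IMAGE_PLACEHOLDER:
            sequence.extend(image_tokens)  # splice in all visual tokens
        else:
            sequence.append(tok)
    return sequence


prompt = ["USER:", "<image>", "What", "is", "shown", "?", "ASSISTANT:"]
image_tokens = [f"<img_{i}>" for i in range(4)]  # stand-ins for projected patch embeddings
seq = build_input_sequence(prompt, image_tokens)
print(seq)
```

From the language model's point of view the result is an ordinary sequence, which is why no change to the decoder itself is needed.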
Visual instruction tuning
The defining contribution of the LLaVA paper is the data and training recipe rather than the architecture. Liu et al. (2023a) observed that no large-scale dataset of multimodal instruction-following dialogues existed, so they used GPT-4 (text-only, with image content represented as captions and bounding boxes from COCO annotations) to synthesize one.
Synthetic data generation
GPT-4 was prompted with the image's textual representation (caption plus bounding boxes for objects) and asked to produce three styles of dialogue:
- Conversation: a short turn-by-turn Q&A about the image's contents
- Detailed description: a long descriptive paragraph
- Complex reasoning: a multi-step reasoning question and answer that requires going beyond simple recognition
The original LLaVA-Instruct-158K dataset contains 158,000 examples in this format, derived from COCO images.
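The idea of describing an image to a text-only model can be sketched as below. The prompt wording, field names, and box format are illustrative assumptions and are not the prompts used by the authors.

```python
def image_to_text_context(captions, boxes):
    """Render captions and bounding boxes as plain text for a language-only model."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (category: [x1, y1, x2, y2], normalized):")
    for category, (x1, y1, x2, y2) in boxes:
        lines.append(f"- {category}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]")
    return "\n".join(lines)


# Hypothetical instructions, one per dialogue style.
RESPONSE_STYLES = {
    "conversation": "Write a short multi-turn Q&A about the image.",
    "detailed_description": "Write a long, detailed description of the image.",
    "complex_reasoning": "Write a question that needs multi-step reasoning, then answer it.",
}


def build_request(style, captions, boxes):
    """Pair a style instruction with the textual stand-in for the image."""
    return {"system": RESPONSE_STYLES[style], "user": image_to_text_context(captions, boxes)}


request = build_request(
    "conversation",
    ["A dog runs along a beach."],
    [("dog", (0.10, 0.20, 0.50, 0.90))],
)
print(request["user"])
```

The generating model never sees pixels; everything it knows about the image comes from annotations like these, so the quality of the synthetic dialogue is bounded by the quality of the annotations.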
Two-stage training
The original LLaVA paper used a two-stage training procedure:
- Feature alignment: The vision encoder and LLM are frozen. Only the projection is trained, on filtered image-caption pairs from CC3M (about 595,000 pairs in the 1.0 release). The objective is straightforward next-token prediction of the caption given the image embedding. This teaches the projection to map image features into a region of embedding space the LLM understands.
- End-to-end fine-tuning: The vision encoder remains frozen, while the projection and the language model are jointly trained on the LLaVA-Instruct-158K dataset. This step teaches the model to follow user instructions about images.
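The two stages differ only in which modules receive gradient updates. A minimal sketch of that schedule, with hypothetical module and stage names, might look like this:

```python
# True means the module's weights are updated in that stage.
STAGES = {
    "feature_alignment": {
        "vision_encoder": False,
        "projection": True,
        "language_model": False,
    },
    "end_to_end_finetuning": {
        "vision_encoder": False,
        "projection": True,
        "language_model": True,
    },
}


def trainable_modules(stage):
    """Names of the modules that are trained in the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


for stage in STAGES:
    print(stage, trainable_modules(stage))
```

In a real training script the same table would typically drive whether each module's parameters require gradients.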
LLaVA-1.5 kept the two-stage structure but expanded the instruction tuning mix to 665K examples by adding academic VQA data (VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA), region-level grounding data (RefCOCO, Visual Genome), and a small ShareGPT text-only slice. The pretraining mix was reduced to a 558K LAION-CC-SBU subset re-captioned with BLIP. The 13B variant trains in roughly one day on a single 8x A100 node (Liu et al., 2023b).
Versions and timeline
| Version | Release date | Vision encoder | LLM backbone(s) | Resolution | Notable changes |
|---|---|---|---|---|---|
| LLaVA-1.0 | April 2023 (NeurIPS 2023) | CLIP ViT-L/14 | Vicuna 7B / 13B | 224x224 | First release; visual instruction tuning recipe; 158K GPT-4-generated examples |
| LLaVA-1.5 | October 5, 2023 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 336x336 | MLP projection (2-layer); academic VQA mix; SOTA on 11 of 12 benchmarks at release; ~1 day on 8x A100 |
| LLaVA-NeXT (1.6) | January 30, 2024 | CLIP ViT-L/14-336px | Vicuna 7B / 13B; Mistral 7B; Nous-Hermes-2-Yi-34B | up to 672x672 (AnyRes) | 4x more pixels via dynamic resolution patching; better OCR and reasoning data; first version to exceed Gemini Pro on several benchmarks |
| LLaVA-NeXT (Stronger) | May 10, 2024 | CLIP ViT-L/14-336px | Llama-3 8B; Qwen-1.5 72B / 110B | up to 672x672 | New LLM backbones; pushed open-source MLLMs further |
| LLaVA-NeXT-Video | April 30, 2024 | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi | per-frame AnyRes | Zero-shot transfer to video by treating frames as a sequence of images |
| LLaVA-OneVision | August 6, 2024 | SigLIP | Qwen2 0.5B / 7B / 72B | Higher AnyRes (up to 5x base tokens) | Single unified model for single-image, multi-image, and video; curriculum learning over four stages |
Benchmarks
LLaVA-1.5 set the standard for open-source MLLM benchmarking on release. Below are key scores reported in the Improved Baselines paper (Liu et al., 2023b), using the standard 13B and 7B configurations.
| Benchmark | LLaVA-1.5 7B | LLaVA-1.5 13B | InstructBLIP 13B | Qwen-VL-Chat 7B | BLIP-2 14B |
|---|---|---|---|---|---|
| VQAv2 | 78.5 | 80.0 | 49.5 | 78.2* | 65.0 |
| GQA | 62.0 | 63.3 | 49.2 | 57.5* | 41.0 |
| VizWiz | 50.0 | 53.6 | 33.4 | 38.9 | 19.6 |
| ScienceQA-IMG | 66.8 | 71.6 | 63.1 | 68.2 | 61.0 |
| TextVQA | 58.2 | 61.3 | 50.7 | 61.5 | 42.5 |
| POPE (random) | 87.3 | 87.1 | 78.9 | not reported | not reported |
| MM-Vet | 31.1 | 36.1 | 25.6 | not reported | 22.4 |
| MMBench (en) | 64.3 | 67.7 | not reported | 60.6 | not reported |
| MMBench-CN | 58.3 | 63.6 | not reported | 56.7 | not reported |
| SEED-Bench | 58.6 | 62.6 | not reported | 58.2 | not reported |
| LLaVA-Wild | 65.4 | 72.5 | 58.2 | not reported | not reported |
| MME (perception) | 1510.7 | 1531.3 | 1212.8 | 1487.5 | 1293.8 |
At release, LLaVA-1.5 13B achieved state-of-the-art results among open-source MLLMs on 11 of 12 benchmarks reported. (* indicates a training-set overlap or a different evaluation protocol; see the paper for caveats.)
LLaVA-NeXT continued the trend. The 34B variant (built on Nous-Hermes-2-Yi-34B) reported 51.1 on MMMU (up from 36.4 for LLaVA-1.5 13B) and 79.3 on MMBench-English (up from 67.7), and outperformed Gemini Pro on several benchmarks at the January 2024 blog release. LLaVA-OneVision 72B approaches GPT-4o on chart and document tasks, scoring 85.6 on AI2D, and matches commercial models on VideoMME (66.2) according to Li et al. (2024).
Comparison with other vision-language models
The table below situates the LLaVA family against the open and closed multimodal models that were active in the same 2023-2024 window.
| Model | Vision encoder | LLM backbone | Pretraining data scale | Open weights | Year |
|---|---|---|---|---|---|
| LLaVA-1.0 | CLIP ViT-L/14 | Vicuna 7B / 13B | 595K image-text + 158K instruction | Yes | 2023 |
| LLaVA-1.5 | CLIP ViT-L/14-336px | Vicuna 7B / 13B (v1.5) | 558K image-text + 665K instruction | Yes | 2023 |
| LLaVA-NeXT | CLIP ViT-L/14-336px | Vicuna / Mistral / Yi-34B | ~1.3M total | Yes | 2024 |
| LLaVA-OneVision | SigLIP | Qwen2 0.5B / 7B / 72B | ~9M (incl. multi-image, video) | Yes | 2024 |
| MiniGPT-4 | EVA-CLIP ViT-G + Q-Former | Vicuna 7B / 13B | ~5M image-text + 3.5K curated | Yes | 2023 |
| BLIP-2 | EVA-CLIP / CLIP + Q-Former | OPT 2.7B-6.7B; FLAN-T5 | ~129M image-text | Yes | 2023 |
| InstructBLIP | EVA-CLIP + instruction-aware Q-Former | Vicuna 7B / 13B; FLAN-T5 | 26 datasets in instruction format | Yes | 2023 |
| Qwen-VL / Qwen-VL-Chat | OpenCLIP ViT-bigG | Qwen 7B | 1.4B image-text pairs | Yes | 2023 |
| CogVLM | EVA2-CLIP-E | Vicuna-1.5 7B (with visual expert layers) | ~1.5B image-text | Yes | 2023 |
| GPT-4V | not disclosed | GPT-4 | not disclosed | No | 2023 |
| Gemini 1.0 / 1.5 | not disclosed | Gemini | not disclosed | No | 2023-2024 |
In short, the LLaVA lineage uses the lightest connector (a small MLP) and the smallest pretraining set, and relies on instruction tuning quality to close the gap. BLIP-2 and InstructBLIP route image features through a heavier Q-Former and were pretrained on much more data. Qwen-VL and CogVLM trained on more than a billion image-text pairs. The key empirical result, and the central claim of the Improved Baselines paper, is that LLaVA-1.5 matched or exceeded those models on most evaluation benchmarks despite the data scale gap.
Significance
LLaVA mattered for several overlapping reasons.
First, it showed that a simple architecture combining a frozen vision transformer encoder, a thin trainable projection, and a chat-tuned LLM is enough to build a competent visual assistant, given the right instruction data. Before LLaVA the assumption in the open-source community was that you needed Q-Formers, gated cross-attention, or comparable specialised modules. LLaVA pushed back on that assumption with hard benchmark numbers.
Second, it introduced visual instruction tuning as a named training paradigm (Liu et al., 2023a) and released the LLaVA-Instruct-158K dataset. The recipe (use a strong text-only LLM to bootstrap synthetic multimodal dialogues from existing image annotations) is now standard in the open MLLM literature. Almost every open vision-language assistant released after mid-2023 either uses LLaVA's data, replicates the methodology with their own LLM, or extends it with new dialogue types.
Third, LLaVA was cheap. The original 13B model was trained for roughly $200 worth of compute according to the project blog, and LLaVA-1.5 13B finishes its full training pipeline in about one day on a single 8x A100 node (Liu et al., 2023b). That cost profile put serious MLLM research within reach of academic labs that could not afford the kind of compute Google and OpenAI were spending on Gemini and GPT-4V.
Fourth, LLaVA became the template for an entire family of derivatives:
- LLaVA-Med (Li et al., 2023): a biomedical assistant trained on PubMed Central figure-caption pairs in less than 15 hours on 8x A100; accepted as a NeurIPS 2023 Datasets and Benchmarks spotlight
- LLaVA-RLHF: applied factually-augmented RLHF on 10K human preference labels to reduce hallucination
- LLaVA-Plus (Liu et al., 2023, arXiv:2311.05437): added tool use, allowing the assistant to call external vision and vision-language models from a skill repository during a conversation
- MoE-LLaVA: replaced the dense LLM with a Mixture-of-Experts variant
- LLaVA-Phi: scaled the recipe down to a Phi-2 backbone for edge deployment
- LLaVA-NeXT-Video and LLaVA-Video: extended the family to video understanding
The original LLaVA paper has been cited many thousands of times since publication.
Limitations
LLaVA inherits the failure modes of its components and adds a few of its own.
Hallucinations on detailed visual content. The model can confidently describe objects that are not in the image, especially in long-form descriptions. Liu et al. (2023b) reported results on the POPE (Polling-based Object Probing Evaluation) benchmark to measure this and found that LLaVA-1.5 reduced but did not eliminate the problem. LLaVA-RLHF and follow-on hallucination evaluation work such as On Evaluating Hallucinations in MLLMs explored this further.
Limited multi-image and video reasoning in pre-OneVision versions. LLaVA-1.0 through 1.6 were trained almost exclusively on single-image inputs; multi-image conversations and video understanding required either crude frame concatenation or separate models. LLaVA-OneVision (August 2024) was the first version of LLaVA explicitly designed to handle single-image, multi-image, and video inputs in a single checkpoint.
Resolution limits in early versions. LLaVA-1.5 used a fixed 336x336 input, which is too low for fine-grained OCR, chart reading, or dense scene understanding. The AnyRes patching scheme in LLaVA-NeXT (up to 672x672, or elongated 336x1344 and 1344x336 grids for tall or wide images) addressed this at the cost of additional compute. LLaVA-OneVision's Higher AnyRes pushed this further by allocating up to 5x base tokens for single images while reducing per-frame tokens for video to keep memory tractable.
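The trade-off between resolution and compute can be made concrete with a rough sketch of dynamic grid selection. The selection rule (keep the most image area, then waste the least padding), the width-by-height ordering of the candidate grids, and the token budget formula (one downscaled overview plus one 336x336 tile per grid cell, ignoring any separator tokens) are all assumptions for illustration.

```python
BASE = 336             # encoder input size per tile
TOKENS_PER_TILE = 576  # (336 / 14) ** 2 patch tokens

# Candidate grids as (width, height); the ordering is an assumption.
CANDIDATE_GRIDS = [(672, 672), (336, 1344), (1344, 336)]


def select_grid(image_size, candidates=CANDIDATE_GRIDS):
    """Pick the grid that preserves the most image area, then wastes the least padding."""
    orig_w, orig_h = image_size
    best, best_key = None, None
    for grid_w, grid_h in candidates:
        # Largest aspect-preserving resize that still fits inside the grid.
        scale = min(grid_w / orig_w, grid_h / orig_h)
        fit_w, fit_h = int(orig_w * scale), int(orig_h * scale)
        effective = min(fit_w * fit_h, orig_w * orig_h)  # never count upscaled pixels
        wasted = grid_w * grid_h - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (grid_w, grid_h), key
    return best


def visual_token_budget(grid):
    """Tokens for all tiles in the grid plus one downscaled overview image."""
    tiles = (grid[0] // BASE) * (grid[1] // BASE)
    return (tiles + 1) * TOKENS_PER_TILE


for size in [(1000, 800), (2000, 400), (400, 2000)]:
    grid = select_grid(size)
    print(size, grid, visual_token_budget(grid))
```

Under these assumptions every four-tile grid costs five times the tokens of a single 336x336 input, which illustrates where the added compute comes from.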
English-centric training data in the original versions. LLaVA-1.5 reported MMBench-CN scores of 63.6 (13B), but Chinese capability was emergent rather than directly trained. LLaVA-NeXT introduced explicit zero-shot Chinese support; later versions with Qwen2 backbones in OneVision are stronger on non-English tasks.
The vision encoder is frozen in stage one and often in stage two. This means the model inherits whatever biases and blind spots CLIP brings, including limited spatial reasoning and difficulty with text rendered inside images. SigLIP in OneVision helps, but the architectural choice to keep the encoder small and frozen is a deliberate trade-off for training efficiency rather than maximum visual capability.
Software and availability
LLaVA model weights are available on Hugging Face under the liuhaotian namespace, including liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b, liuhaotian/llava-v1.6-mistral-7b, and liuhaotian/llava-v1.6-34b. LLaVA-OneVision checkpoints (llava-hf/llava-onevision-qwen2-0.5b-ov-hf, llava-hf/llava-onevision-qwen2-7b-ov-hf, llava-hf/llava-onevision-qwen2-72b-ov-hf) are hosted under the llava-hf organisation.
The training and inference code is released under Apache 2.0 at github.com/haotian-liu/LLaVA. Hugging Face Transformers integrates LLaVA via the LlavaForConditionalGeneration class (and LlavaNextForConditionalGeneration for 1.6), with LLaVA-NeXT-Video and LLaVA-OneVision having their own model docs. Quantized GGUF versions are available for local inference via llama.cpp; Ollama ships pre-built LLaVA images for one-line installation. The complete model zoo, training scripts, and instruction data are documented in the GitHub repository.
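A minimal inference sketch with Hugging Face Transformers might look like the following. The checkpoint name llava-hf/llava-1.5-7b-hf (a Transformers-format conversion), the USER/ASSISTANT prompt layout, and the helper names are assumptions not stated above; the model download is several gigabytes, so the heavy imports are kept inside the function.

```python
def build_prompt(question: str) -> str:
    """Single-turn prompt with an image placeholder (assumed LLaVA-1.5 chat layout)."""
    return f"USER: <image>\n{question} ASSISTANT:"


def describe_image(image_path: str, question: str,
                   model_id: str = "llava-hf/llava-1.5-7b-hf") -> str:
    """Load the model, run one generation, and return the decoded text."""
    # Imported lazily so the prompt helper works without these packages installed.
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]


print(build_prompt("What is shown in this image?"))
```

Calling describe_image("photo.jpg", "What is shown in this image?") with a local image file (the path here is a placeholder) would return the prompt followed by the model's answer; in practice a GPU and half-precision weights are advisable for a model of this size.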
Users must comply with the original licenses for the underlying components: OpenAI's CLIP terms, the LLaMA and Vicuna community licenses for the LLM backbone, and the OpenAI Terms of Use for any data generated from GPT-4 (this is why the LLaVA-Instruct-158K dataset is research-only).
Influence
LLaVA is one of the most-cited and most-replicated open-source vision-language model projects of the 2023-2024 wave. The visual instruction tuning recipe is now standard in nearly every open MLLM release: Qwen-VL, InternVL, MiniCPM-V, DeepSeek-VL, Phi-3-Vision, Idefics, Molmo, and many others adopted variants of the same data format and training pipeline.
The LLaVA-Instruct-158K dataset and its successors (the academic VQA mix from 1.5, the OCR and reasoning extensions from 1.6, the multi-image and video data from OneVision) are widely reused as ingredients in other training mixtures. GitHub issues and Hugging Face discussions for new MLLM releases routinely cite LLaVA for ablation comparisons, recipe inspiration, or as a starting point for fine-tuning.
For anyone studying the open-source multimodal AI lineage, LLaVA is the project that demonstrated the field could move quickly and cheaply, and that simple recipes plus careful data curation can compete with much more elaborate approaches.
References