Vision language model
Last reviewed
May 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 3,535 words
A vision-language model (VLM) is a class of multimodal artificial intelligence model that jointly processes visual inputs (typically still images, sometimes video frames or documents) and natural-language text, most often producing text as output. The category spans contrastive image-text encoders such as CLIP and SigLIP, generative systems such as LLaVA, BLIP-2, Flamingo, GPT-4V, Claude 3, and Gemini, and the closely related vision-language-action models used in robotics. Both the full phrase "vision-language model" and the acronym "VLM" appear interchangeably in the research literature, which is why the aiwiki maintains separate slugs that resolve to the same content. Broader coverage of the field, including audio and video modalities and the general theory of multimodal fusion, is at multimodal model; the robotics extension is documented at vision-language-action model and VLA.
The label "vision-language model" covers any neural network whose forward pass conditions on at least one image and at least one text token sequence. In current practice researchers distinguish two broad families.
The first family is contrastive VLMs, which learn a shared embedding space where matching image-text pairs have high similarity and non-matching pairs have low similarity. CLIP (Contrastive Language-Image Pre-training) from OpenAI established this template in 2021 by training on 400 million image-text pairs scraped from the public web.[1] ALIGN from Google scaled the same recipe to roughly 1.8 billion noisy alt-text pairs.[2] SigLIP from Google DeepMind in 2023 replaced the standard softmax contrastive loss with a per-pair sigmoid loss, allowing larger batch sizes and stronger small-batch performance.[3] Contrastive VLMs do not generate text; they score similarity, which makes them ideal as zero-shot classifiers, retrieval systems, and frozen vision encoders for the larger generative models described below.
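The way a contrastive model acts as a zero-shot classifier can be sketched in a few lines. This is a minimal illustration, not CLIP's actual API: the three-dimensional embeddings are made-up stand-ins for the outputs of an image encoder and a text encoder, and the function names are hypothetical.

```python
import math

def l2_normalize(vector):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label prompt whose text embedding best matches the image."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy embeddings (hypothetical values, not produced by a real encoder).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

No classifier head is trained here: the candidate classes are expressed as text prompts, and changing the label set only requires embedding new prompts.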
The second family is generative VLMs, sometimes called large multimodal models (LMMs) or multimodal large language models, which produce free-form text output conditioned on an image plus a prompt. DeepMind's Flamingo (2022) was an early example demonstrating few-shot in-context learning over interleaved image-text sequences.[4] BLIP-2 (Salesforce, 2023) bootstrapped from frozen image encoders and frozen language models, connected by a lightweight "Querying Transformer."[5] LLaVA (Visual Instruction Tuning, 2023) popularized the simpler "vision encoder plus projector plus LLM" pattern that now dominates open-source releases.[6] MiniGPT-4 appeared in the same month as LLaVA (April 2023) with a similar frozen-encoder-plus-projection design.[7] Proprietary entries include GPT-4V from OpenAI (September 2023),[8] Gemini from Google DeepMind (December 2023),[10] and Claude 3 from Anthropic (March 2024).[9]
A related category, vision-language-action models (VLAs), extends the generative VLM stack with an action head that emits robot motor commands or trajectories. These systems are treated as a distinct topic at vision-language-action model.
Most modern generative VLMs converge on a three-part pipeline: a vision encoder that turns an image into a sequence of visual features, a connector (also called a bridge, adapter, or projector) that maps those features into a form the language model can consume, and a large language model backbone that autoregressively generates text conditioned on the combined visual and text tokens. The design choices at each stage define the major architectural families.
The vision encoder is almost always a Vision Transformer (ViT), and in most open systems it is a frozen, pretrained encoder reused from a contrastive model. The two most common choices are a CLIP-trained ViT-L/14 (often at 336 by 336 pixel resolution) and a SigLIP encoder.[6][3] LLaVA-1.5, for example, uses OpenAI's CLIP ViT-L/14 at 336 pixels and keeps it frozen.[11] Some systems train a bespoke high-capacity encoder: InternVL pairs its language model with InternViT-6B, a roughly six-billion-parameter vision transformer.[12]
Because a fixed 224 or 336 pixel input is too coarse for dense text and documents, recent systems support higher and variable resolution. InternVL-1.5 introduced a dynamic tiling scheme that splits an image into 448 by 448 pixel tiles (1 to 12 tiles at training time, scalable to 40 tiles, roughly 4K, at inference) and applies a pixel-shuffle operation to compress each tile to 256 visual tokens.[13] Qwen2-VL and Qwen2.5-VL instead use a native dynamic-resolution ViT, so the encoder can ingest images at their original size; Qwen2.5-VL adds window attention so that most layers scale linearly with the number of patches.[14][15] Pixtral 12B from Mistral uses a custom 400-million-parameter encoder, Pixtral-ViT, trained from scratch with 2D rotary position embeddings (RoPE-2D) to handle variable image sizes and aspect ratios.[16]
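The token arithmetic behind tiling can be made concrete with a rough sketch. The tile size and the 256 tokens per tile come from the description above; the patch size of 14, the 2x2 pixel-shuffle merge, and the grid-selection rule (closest aspect ratio, ties broken by closest pixel area) are simplifying assumptions for illustration and are not claimed to match InternVL's actual preprocessing.

```python
TILE_SIZE = 448        # tile edge in pixels, from the text
TOKENS_PER_TILE = 256  # visual tokens per tile after pixel shuffle, from the text
PATCH_SIZE = 14        # assumption: ViT patch size, not stated in the text
SHUFFLE_FACTOR = 2     # assumption: pixel shuffle merges each 2x2 block of patches

def tokens_per_tile():
    # 448 / 14 = 32 patches per side; merging 2x2 blocks leaves 16 x 16 tokens.
    patches_per_side = TILE_SIZE // PATCH_SIZE
    return (patches_per_side // SHUFFLE_FACTOR) ** 2

def choose_tile_grid(width, height, max_tiles=12):
    """Pick (cols, rows) with cols * rows <= max_tiles.

    Hypothetical rule: minimise aspect-ratio error first, then the gap
    between the tiled area and the image area.
    """
    target_ratio = width / height
    image_area = width * height
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            ratio_error = abs(cols / rows - target_ratio)
            area_error = abs(cols * rows * TILE_SIZE ** 2 - image_area)
            key = (ratio_error, area_error)
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid

def visual_token_count(width, height, max_tiles=12):
    cols, rows = choose_tile_grid(width, height, max_tiles)
    return cols * rows * TOKENS_PER_TILE

print(choose_tile_grid(1344, 896), visual_token_count(1344, 896))
```

Under these assumptions a 1344 by 896 image maps to a 3 by 2 grid, so the language model receives 6 x 256 = 1,536 visual tokens; raising `max_tiles` to 40 lets larger images use more tiles at the cost of a longer context.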
The connector determines how visual information reaches the language model. Three patterns dominate.
Projection (MLP) connectors flatten the encoder's patch features and pass them through a small linear layer or multilayer perceptron into the LLM's token embedding space, where they are concatenated with the text tokens. The original LLaVA used a single linear projection; LLaVA-1.5 upgraded it to a two-layer MLP with a GELU activation, which improved benchmark results.[11] This approach is simple, cheap to train, and now the default for most open-weights models, including the InternVL family, which uses a randomly initialized MLP between InternViT and its language backbone.[13]
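A two-layer MLP projector of the kind described above can be sketched in plain Python. The dimensions are toy values chosen for readability, the weights are random rather than trained, and the helper names are hypothetical; real implementations use a tensor library and much larger widths.

```python
import math
import random

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def init_linear(in_dim, out_dim, rng):
    # Random weights of shape [out_dim][in_dim] and a zero bias.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def mlp_projector(patch_features, layer1, layer2):
    """Map each vision-encoder patch feature into the LLM embedding width."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4  # toy sizes, not real model widths

layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

visual_tokens = mlp_projector(patch_features, layer1, layer2)
text_tokens = [[0.0] * LLM_DIM for _ in range(3)]  # stand-in for embedded prompt tokens
llm_input = visual_tokens + text_tokens            # concatenated along the sequence axis
print(len(llm_input), len(llm_input[0]))
```

Because every patch becomes one LLM token, the sequence length grows with the number of patches, which is the trade-off noted for projection connectors.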
Query-based connectors insert a small transformer between the frozen encoder and the LLM. BLIP-2's Querying Transformer (Q-Former) is the canonical example: it carries 32 learnable query embeddings of dimension 768 that attend to the frozen image features through cross-attention and to each other through self-attention, acting as an information bottleneck that extracts only the most text-relevant visual content.[5] This compresses an image to a small fixed number of tokens regardless of resolution, which keeps the LLM context short.
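The fixed-token property of a query-based connector follows directly from cross-attention: the output has one row per query, whatever the number of image features. The sketch below is a heavily simplified single-head version with no learned projections, no self-attention among queries, and a toy width of 16 instead of 768, so it illustrates the shape behaviour only and is not the Q-Former itself.

```python
import math
import random

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_features):
    """Each query takes a softmax-weighted average of the image features."""
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * f for q, f in zip(query, feature)) / math.sqrt(dim)
            for feature in image_features
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * feature[d] for w, feature in zip(weights, image_features))
            for d in range(dim)
        ])
    return outputs

rng = random.Random(0)
NUM_QUERIES, DIM = 32, 16  # 32 queries as in the text; DIM is a toy width

def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]

queries = random_vectors(NUM_QUERIES)   # would be learned parameters in practice
low_res_features = random_vectors(64)   # e.g. features from a small image
high_res_features = random_vectors(256) # e.g. features from a larger image

out_low = cross_attend(queries, low_res_features)
out_high = cross_attend(queries, high_res_features)
print(len(out_low), len(out_high))
```

Both calls return 32 vectors, which is why the LLM context cost stays constant as input resolution grows.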
Cross-attention connectors leave the visual tokens outside the LLM's main token stream and instead inject them through new attention layers added inside the frozen language model. Flamingo pioneered this with a Perceiver Resampler that maps variable visual features to a fixed set of tokens, followed by gated cross-attention dense layers interleaved between the frozen language-model blocks; a tanh gate initialized to zero ensures the model starts out identical to the original text-only LLM and learns to use vision gradually.[4] Meta's Llama 3.2 Vision follows the same philosophy, training a vision adapter of cross-attention layers that feed image-encoder representations into a frozen Llama 3.1 text model (the 11B vision model is built on Llama 3.1 8B, and the 90B on Llama 3.1 70B).[17]
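The effect of the zero-initialized tanh gate can be shown with a small sketch. It assumes the gated layer adds `tanh(alpha) * cross_attention_output` to the residual stream, with `alpha` a learnable scalar; the function name and toy vectors are illustrative only.

```python
import math

def gated_cross_attention_residual(hidden, cross_attn_out, alpha):
    """Add visual information to the text hidden state through a tanh gate."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]          # hidden state from a frozen language-model block
cross_attn_out = [3.0, 3.0, 3.0]   # stand-in for the cross-attention output

at_init = gated_cross_attention_residual(hidden, cross_attn_out, alpha=0.0)
after_training = gated_cross_attention_residual(hidden, cross_attn_out, alpha=0.5)

print(at_init == hidden)  # the gate is closed at initialization
```

With `alpha = 0.0` the gate evaluates to zero and the layer is an identity on the text pathway, so the model begins training with the behaviour of the original text-only LLM; as `alpha` moves away from zero, the visual contribution is mixed in.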
| Connector | Representative model | Mechanism | Trade-off |
|---|---|---|---|
| Linear / MLP projection | LLaVA, LLaVA-1.5, InternVL | Project patch features into LLM token space, concatenate with text | Simple and cheap; token count grows with image resolution |
| Q-Former (query transformer) | BLIP-2 | 32 learnable queries cross-attend to frozen image features | Fixed small token count; extra module to pretrain |
| Perceiver resampler + gated cross-attention | Flamingo | Resample to fixed tokens, inject via gated cross-attention into frozen LM | Handles interleaved image-text; more architectural complexity |
| Cross-attention adapter | Llama 3.2 Vision | Cross-attention layers feed image features into frozen LLM | Keeps text weights intact; adapter must be trained on image-text pairs |
The backbone ranges from open models such as Vicuna, LLaMA, Qwen2, OLMo, and Mistral NeMo in research and open-weights systems to undisclosed proprietary frontier models in commercial products. Open systems frequently advertise their pairing explicitly: Molmo-7B-D is built on Qwen2 7B and Molmo-72B on Qwen2 72B, while MolmoE-1B uses the OLMoE mixture-of-experts backbone.[18] In cross-attention designs the backbone weights are typically frozen and only the adapter is trained, whereas in projection designs the backbone is often unfrozen during instruction tuning.
Training a generative VLM typically proceeds in two or three stages, and contrastive encoders are trained separately beforehand.
Contrastive image-text pretraining produces the vision encoder. A dual-encoder model maximizes the cosine similarity of matching image-text pairs and minimizes it for non-matching pairs across a large web-scale corpus. CLIP used a symmetric softmax cross-entropy loss over each batch on 400 million pairs; SigLIP replaced this with a per-pair sigmoid loss that does not require a global normalization over the batch.[1][3] The resulting encoder is then reused, usually frozen, inside generative VLMs.
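The difference between the two objectives can be sketched on a small similarity matrix, where entry `[i][j]` is the cosine similarity between image `i` and text `j` and matching pairs sit on the diagonal. The `scale` and `bias` defaults are illustrative assumptions rather than values taken from the cited papers.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def cross_entropy(logits, target):
    # -log softmax(logits)[target], via a stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

def softmax_contrastive_loss(sims, scale=10.0):
    """Symmetric softmax loss: normalizes over every row and every column."""
    n = len(sims)
    logits = [[scale * s for s in row] for row in sims]
    columns = [list(col) for col in zip(*logits)]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def sigmoid_pairwise_loss(sims, scale=10.0, bias=-10.0):
    """Per-pair sigmoid loss: each pair is an independent binary decision."""
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sims[i][j] + bias))
    return -total / n

aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
misaligned = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

print(softmax_contrastive_loss(aligned) < softmax_contrastive_loss(misaligned))
print(sigmoid_pairwise_loss(aligned) < sigmoid_pairwise_loss(misaligned))
```

The softmax version needs the full row and column to compute each normalizer, while the sigmoid version sums independent per-pair terms, which is what removes the batch-wide normalization.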
Alignment (feature pretraining) is the first stage of generative training. Only the connector is trained, on image-caption pairs, so that its output becomes interpretable to the language model while the encoder and the LLM stay frozen. The original LLaVA used roughly 595,000 filtered image-caption pairs for this stage, and LLaVA-1.5 used about 558,000.[6][11]
Multimodal instruction tuning is the second stage. The connector and often the LLM are fine-tuned on conversational, visual-question-answering, and document data so the model can follow instructions about images. LLaVA's influential demonstration in 2023 used GPT-4-generated visual instruction-tuning data (150,000 multimodal instruction-following examples), later augmented in LLaVA-1.5 with academic VQA datasets, to produce a capable assistant with modest compute.[6][11]
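The two generative stages differ mainly in which modules receive gradient updates. The following configuration sketch records that schedule as described above; the module and stage names are hypothetical labels, and marking the LLM as trainable in the second stage reflects the common case rather than a universal rule.

```python
from dataclasses import dataclass

MODULES = ("vision_encoder", "connector", "llm")

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset

STAGES = [
    # Stage 1: only the connector learns; encoder and LLM stay frozen.
    Stage("alignment", "image-caption pairs", frozenset({"connector"})),
    # Stage 2: connector plus (often) the LLM are fine-tuned.
    Stage(
        "instruction_tuning",
        "conversation, VQA, and document data",
        frozenset({"connector", "llm"}),
    ),
]

def frozen_modules(stage):
    """List the modules that receive no gradient updates in this stage."""
    return [module for module in MODULES if module not in stage.trainable]

for stage in STAGES:
    print(stage.name, "-> frozen:", frozen_modules(stage))
```

In a training framework, this kind of table would typically be applied by disabling gradients on the frozen modules before each stage begins.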
Preference optimization (RLHF and DPO), where applied, is a further stage aimed mainly at reducing hallucination and improving helpfulness. LLaVA-RLHF collected roughly 10,000 human preference pairs and applied a factually augmented reward to curb hallucinated content.[19] RLHF-V instead gathered fine-grained, segment-level human corrections on hallucinated spans and applied dense direct preference optimization (DPO); using about 1,400 annotated samples it reported a 34.8 percent reduction in the base model's hallucination rate.[20] DPO is attractive in this setting because it learns directly from preference pairs without training a separate reward model.
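The standard DPO objective for a single preference pair can be written in a few lines. The sketch takes summed log-probabilities of the preferred and rejected responses under the policy being trained and under a frozen reference model; the `beta` value is an illustrative default. It shows plain DPO only and does not implement the segment-level weighting that RLHF-V's dense variant adds.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the faithful answer and lowers the hallucinated one.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)

print(neutral, improved)
```

The implicit reward is the log-probability ratio between policy and reference, so the loss falls as the policy shifts probability toward the preferred response without any separately trained reward model.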
Generative VLMs support a recognizable set of tasks that also serve as the field's evaluation targets.
The table below lists widely cited VLMs with details verifiable from primary sources. Release years refer to the first public release of the named model.
| Model | Organization | Year | Type | Verifiable details |
|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Contrastive | Dual-encoder trained on 400M image-text pairs; strong zero-shot transfer[1] |
| ALIGN | Google | 2021 | Contrastive | Dual-encoder (EfficientNet + BERT) scaled to ~1.8B noisy pairs[2] |
| Flamingo | DeepMind | 2022 | Generative | Frozen Chinchilla LM + gated cross-attention; sizes 3B, 9B, 80B; few-shot learning[4] |
| BLIP-2 | Salesforce | 2023 | Generative | Q-Former with 32 queries bridging frozen image encoder and frozen LLM[5] |
| LLaVA | U. Wisconsin-Madison et al. | 2023 | Generative | CLIP ViT + projection + Vicuna; visual instruction tuning[6] |
| MiniGPT-4 | KAUST | 2023 | Generative | Frozen encoder + single projection layer to a frozen LLM[7] |
| SigLIP | Google DeepMind | 2023 | Contrastive | Sigmoid loss for scalable contrastive pretraining[3] |
| Qwen-VL | Alibaba | 2023 | Generative | ViT-bigG visual receptor + Qwen LLM; 448px input; grounding and OCR[26] |
| GPT-4V | OpenAI | 2023 | Generative | Vision added to GPT-4; architecture undisclosed[8] |
| Gemini | Google DeepMind | 2023 | Generative | Natively multimodal; Ultra, Pro, Nano sizes[10] |
| Claude 3 | Anthropic | 2024 | Generative | Vision across Haiku, Sonnet, Opus[9] |
| GPT-4o | OpenAI | 2024 | Generative | "Omni" model over text, image, and audio; announced May 13, 2024[27] |
| InternVL-1.5 | Shanghai AI Lab | 2024 | Generative | InternViT-6B + MLP + InternLM2; dynamic tiling to ~4K[13] |
| Llama 3.2 Vision | Meta | 2024 | Generative | Cross-attention adapter on frozen Llama 3.1; 11B and 90B variants[17] |
| Pixtral 12B | Mistral | 2024 | Generative | 400M Pixtral-ViT (RoPE-2D) + Mistral NeMo 12B decoder[16] |
| Molmo | Allen Institute for AI | 2024 | Generative | CLIP ViT + MLP + OLMo/Qwen2 backbones; native pointing; PixMo data[18] |
| Qwen2-VL / Qwen2.5-VL | Alibaba | 2024 / 2025 | Generative | Native dynamic-resolution ViT; long-video and grounding support[14][15] |
| InternVL2.5 / InternVL3 | Shanghai AI Lab | 2024 / 2025 | Generative | ViT-MLP-LLM; InternVL2.5-78B first open model above 70% on MMMU[12] |
Evaluation of VLMs draws on a layered set of public benchmarks that probe perception, reading, reasoning, and robustness.
| Benchmark | Year | What it tests | Notable details |
|---|---|---|---|
| VQAv2 | 2017 | General visual question answering | Over 1M questions on real images; balanced to reduce language priors[21] |
| TextVQA | 2019 | Reading text in natural images | 45,336 questions over 28,408 images requiring OCR[22] |
| DocVQA | 2021 | Document understanding | ~50,000 questions over 12,000+ document images[23] |
| ChartQA | 2022 | Chart question answering | Visual and logical reasoning over bar, line, and pie charts[24] |
| MME | 2023 | Perception and cognition | 14 subtasks; max 2000 (perception) + 800 (cognition); yes/no format[28] |
| MMBench | 2023 | Fine-grained ability suite | Multiple-choice across many abilities; CircularEval to reduce option bias[29] |
| SEED-Bench | 2023 | Comprehensive MLLM evaluation | Multiple-choice across spatial and temporal dimensions[30] |
| MMMU | 2023 | College-level multimodal reasoning | ~11,500 questions, 6 disciplines, 30 subjects, 183 subfields, 30 image types[25] |
| MathVista | 2023 | Visual mathematical reasoning | 6,141 examples combining prior datasets with new ones[31] |
| OCRBench | 2023 | OCR-centric capabilities | Aggregates text recognition and reading tasks across settings[32] |
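The MME maxima in the table follow from simple arithmetic. The sketch below assumes the commonly described MME protocol, in which each image has two yes/no questions and a subtask score is the sum of question-level accuracy and image-level "accuracy+" (both questions correct), each expressed as a percentage. The 10/4 split between perception and cognition subtasks is likewise an assumption consistent with the stated totals.

```python
def mme_subtask_score(results):
    """results: one (first_correct, second_correct) pair of booleans per image."""
    num_images = len(results)
    correct_answers = sum(int(a) + int(b) for a, b in results)
    both_correct = sum(1 for a, b in results if a and b)
    accuracy = 100.0 * correct_answers / (2 * num_images)
    accuracy_plus = 100.0 * both_correct / num_images
    return accuracy + accuracy_plus

MAX_SUBTASK_SCORE = 200        # 100 (accuracy) + 100 (accuracy+)
PERCEPTION_SUBTASKS = 10       # assumption, consistent with the 2000 maximum
COGNITION_SUBTASKS = 4         # assumption, consistent with the 800 maximum

perfect = mme_subtask_score([(True, True), (True, True)])
partial = mme_subtask_score([(True, False), (True, True)])

print(perfect, partial)
print(PERCEPTION_SUBTASKS * MAX_SUBTASK_SCORE, COGNITION_SUBTASKS * MAX_SUBTASK_SCORE)
```

Under this scoring a model that answers "yes" to everything gets half the questions right but no image fully right, so it scores 50 rather than 100 per subtask.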
MMMU has become the headline benchmark for multimodal reasoning, and a harder, contamination-resistant follow-up, MMMU-Pro, was released afterward.[25] By late 2024, InternVL2.5-78B became the first open-weights model to exceed 70 percent on MMMU, a level previously reached only by leading proprietary systems, and InternVL3-78B reported scores above 72 percent.[12] These figures still sit below the expert human ceiling originally reported for the benchmark, which illustrates how far multimodal reasoning has to go.
Despite rapid progress, VLMs share several persistent weaknesses.
Hallucination. Generative VLMs frequently describe objects, attributes, or text that are not present in the image. This is measured by benchmarks such as CHAIR, which scores hallucinated objects in captions, and POPE (Polling-based Object Probing Evaluation), which shows that models tend to confirm the presence of objects that merely co-occur frequently with what is actually shown.[33] Hallucination arises from cross-modal misalignment, over-reliance on language priors that override visual evidence, and spurious correlations in training data. Preference-optimization methods such as RLHF-V and LLaVA-RLHF are among the leading mitigations.[19][20]
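A CHAIR-style score can be sketched as follows. The example assumes object mentions have already been extracted from each caption and that ground-truth object sets are available; the real metric derives both from annotated datasets with synonym matching, which is omitted here. The two returned values correspond to the per-mention and per-caption variants of the metric.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Return (per-mention rate, per-caption rate) of hallucinated objects.

    mentioned_objects: list of lists, objects named in each caption.
    ground_truth_objects: list of sets, objects actually in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_objects, ground_truth_objects):
        hallucinated = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    per_mention = hallucinated_mentions / total_mentions if total_mentions else 0.0
    per_caption = (
        captions_with_hallucination / len(mentioned_objects) if mentioned_objects else 0.0
    )
    return per_mention, per_caption

# Toy data: the first caption invents a frisbee that is not in the image.
mentioned = [["dog", "frisbee"], ["table", "chair", "fork"], ["cat"]]
present = [{"dog"}, {"table", "chair", "fork", "plate"}, {"cat", "sofa"}]

chair_i, chair_s = chair_scores(mentioned, present)
print(chair_i, chair_s)
```

Here one of six mentions is hallucinated and one of three captions contains a hallucination; objects that are present but unmentioned (the plate, the sofa) do not count against the model.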
Counting. Accurately counting objects remains one of the weakest areas, consistently ranking among the lowest-accuracy tasks across evaluated VLMs. Molmo's explicit point-then-count strategy was introduced partly to address this gap.[18]
Spatial reasoning. Judging relative positions, orientation, and geometric relationships (left of, behind, above) is unreliable, in part because contrastive encoders are trained to match images to holistic captions rather than to represent precise spatial layout.
Fine-grained perception. Reading small text, distinguishing visually similar objects, and resolving fine detail historically suffered from low-resolution inputs; the move to dynamic and native high resolution in InternVL, Qwen2-VL, and Pixtral was a direct response.[13][14][16]
Modality gap. Even after contrastive training, image and text embeddings tend to occupy separate regions of the shared space rather than fully intermixing, which can bias cross-modal retrieval and classification. The general theory of this effect is discussed at multimodal model.
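One simple way to quantify the gap, used here as an illustrative assumption rather than a definition drawn from this article's sources, is the Euclidean distance between the centroid of the normalized image embeddings and the centroid of the normalized text embeddings. The two-dimensional vectors are toy values.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def centroid(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def modality_gap(image_embeddings, text_embeddings):
    """Distance between the mean image embedding and the mean text embedding."""
    image_center = centroid([l2_normalize(v) for v in image_embeddings])
    text_center = centroid([l2_normalize(v) for v in text_embeddings])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image_center, text_center)))

# Toy embeddings: images cluster near one axis, texts near the other.
image_embeddings = [[1.0, 0.1], [1.0, -0.1]]
text_embeddings = [[0.1, 1.0], [-0.1, 1.0]]

gap = modality_gap(image_embeddings, text_embeddings)
print(round(gap, 3))
```

A gap of zero would mean the two modalities share a common center; a large value indicates that images and texts sit in separate regions even when individual pairs are correctly ranked.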
VLMs underpin a growing set of products and research tools: image and document question answering, automated captioning and alt-text generation for accessibility, OCR and intelligent document processing (invoices, forms, receipts), chart and screenshot understanding for analytics and UI agents, visual search and retrieval, content moderation, medical-image report drafting (under regulatory oversight), and the perception front end of robotic vision-language-action models. Many of these reuse the same contrastive encoders and instruction-tuned backbones described above, so progress on core capabilities propagates quickly into applications.
The literature uses the full phrase "vision-language model," the unhyphenated "vision language model," and the acronym "VLM" interchangeably. Flamingo's 2022 title called itself "a Visual Language Model for Few-Shot Learning,"[4] using yet another close variant. Subsequent papers, model cards, and benchmark write-ups settled on "VLM" as the dominant short form while keeping the spelled-out version in formal titles. The aiwiki maintains both slugs so that readers searching either form land on the same content, with broader multimodal coverage at multimodal model.