Image-to-Text Models

Machine Learning Multimodal AI

36 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

69 citations

Revision

v5 · 7,183 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

See also: Multimodal Models and Tasks

Image-to-text models are machine learning systems that take an image as input and produce natural language text as output. The category covers image captioning (writing a sentence that describes what is in a picture), visual question answering (answering a question about an image), document understanding (reading scanned forms, receipts, or charts), OCR (transcribing printed or handwritten text), and the broader class of vision language models (VLMs) that combine an image encoder with a large language model to support open-ended visual reasoning. Image-to-text is one of the core task directions within the wider field of multimodal models. Modern systems such as GPT-4 with vision, Gemini, LLaVA, Qwen-VL, and Pixtral treat images as just another token stream that the language model can attend to alongside text. This article is a catalog of notable image-to-text models; for the architecture and theory of how image and text modalities are fused, see the dedicated vision language model page linked above.

Overview

An image-to-text model accepts pixel data (one image, several images, or a video frame sequence) and emits a text string. The task family includes:

Image captioning. Produce a short description of an image, often a single sentence. Classic benchmarks use the COCO dataset and report BLEU, METEOR, CIDEr, and SPICE scores.^[54]
Visual question answering (VQA). Given an image and a natural language question, return an answer. Datasets include VQAv2, GQA, OK-VQA, and ScienceQA.^[5]^[6]^[7]
Document understanding. Parse receipts, invoices, scientific papers, slide decks, and screenshots. Benchmarks include DocVQA, ChartQA, InfographicVQA, and TextVQA.^[51]^[52]
OCR with reasoning. Transcribe text and also act on it (translate it, summarize it, fill out a form, or extract structured data). Systems such as GOT-OCR2.0 and olmOCR specialize here.^[45]^[46]
Visual chain-of-thought. Step by step reasoning grounded in an image, often for math problems (MathVista), scientific diagrams (ScienceQA), or multi step puzzles.^[48]
Open ended visual dialogue. Multi turn conversation about one or more images, the format pioneered by LLaVA and now standard in commercial chat assistants.^[15]

A single modern VLM typically handles all of these tasks at once. The same checkpoint that captions a photo can also read a receipt, answer a question about a chart, and explain a meme. This consolidation is recent. Before 2022, image captioning models, VQA models, and OCR systems were usually separate research lines with their own architectures.

History

Pre deep learning era

Early image captioning relied on hand crafted pipelines: detect objects, retrieve sentences from a fixed corpus, and stitch them together with templates. The Flickr8k and Flickr30k datasets, released by Hodosh, Young, and Hockenmaier in 2013 and 2014, were the standard benchmarks. The MS COCO Captions dataset (Chen et al. 2015) expanded the field with around 330,000 images, each paired with five human written captions, and remains the most cited captioning benchmark.

Show and Tell era (2015 to 2017)

The first widely cited deep learning captioning model was "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google (2015).^[1] It paired a convolutional neural network image encoder (Inception) with a long short term memory (LSTM) decoder.^[1]

A few months later, "Show, Attend and Tell" by Kelvin Xu and collaborators (2015) added soft and hard visual attention so the decoder could look at different image regions for different output words.^[2] The bottom up top down attention model from Peter Anderson et al. (2018) replaced grid features with region features from Faster R-CNN and was the basis for several VQA Challenge winners.^[3]

VQA itself became a benchmark with the VQA dataset (Antol et al. 2015) and its rebalanced successor VQAv2 (Goyal et al. 2017).^[4]^[5] Models in this era used custom fusion modules such as MUTAN, BAN, and MCAN to combine image and question features.

Contrastive era (2021)

CLIP (Contrastive Language Image Pretraining), released by OpenAI in January 2021 (Radford et al.), trained an image encoder and a text encoder jointly on 400 million image text pairs scraped from the web.^[8] The model learned to match images and captions in a shared embedding space. CLIP itself does not generate captions, but its image encoder became the standard vision backbone for most later generative VLMs (LLaVA, BLIP-2, MiniGPT-4, and others all use a CLIP ViT-L/14 or SigLIP variant). ALIGN, published by Chao Jia and collaborators at Google in 2021, scaled the same contrastive recipe to roughly 1.8 billion noisy pairs.^[9]

Generative VLM era (2022)

In early 2022 Salesforce Research released BLIP (Bootstrapping Language Image Pre-training) by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP introduced a unified vision language pretraining objective and a captioning filter that bootstrapped its own training data.^[10] The successor BLIP-2 (January 2023) was more influential. It froze a vision encoder (typically a CLIP ViT) and froze an LLM (OPT or FLAN-T5), then trained a small lightweight Querying Transformer (Q-Former) to bridge the two.^[11] With only the Q-Former being trainable, BLIP-2 reached strong VQA and captioning numbers at a fraction of the training cost of end to end models.^[11]

DeepMind's Flamingo (Alayrac et al., April 2022) took a different route. A frozen Chinchilla LLM was extended with new gated cross attention layers that attended to features from a frozen NFNet image encoder.^[12] Flamingo was the first VLM to support interleaved image text prompts with strong few shot performance.^[12]

Other 2022 entries included Microsoft's GIT (Generative Image to Text) by Jianfeng Wang and collaborators (May 2022),^[13] Alibaba's OFA (Wang et al. 2022) which framed every vision language task as text to text,^[14] and Google's PaLI (Chen et al. 2022). Google extended the PaLI line with PaLI-X, a 55 billion parameter multilingual model with an improved OCR training recipe (Chen et al., May 2023), and PaLI-3 (Chen et al., October 2023), a much smaller 5 billion parameter model that swapped the classification pretrained vision backbone for a contrastively pretrained SigLIP encoder and matched models roughly ten times its size on many benchmarks.[^pali3]

LLaVA and the open VLM explosion (2023)

The model that opened the floodgates was LLaVA (Large Language and Vision Assistant) by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee at the University of Wisconsin and Microsoft Research, first posted to arXiv in April 2023.^[15] LLaVA combined a CLIP ViT-L/14 image encoder with the Vicuna LLM via a simple linear projection layer.^[15] Crucially, the authors used GPT-4 to generate 158,000 multimodal instruction following examples from COCO captions and bounding box annotations, then fine tuned the model on this synthetic instruction data.^[15] The recipe was simple, the code was open, the results were good, and within a few months dozens of similar systems appeared.

LLaVA-1.5 (October 2023) replaced the linear projector with a two layer MLP, added academic VQA datasets to the training mix, and reached state of the art numbers among open 7B and 13B models.^[16] LLaVA-NeXT (also called LLaVA-1.6, January 2024) added higher input resolution via image tiling.^[17] LLaVA-OneVision (August 2024) extended the family to single image, multi image, and video inputs in one checkpoint.^[18]

MiniGPT-4 by Deyao Zhu and collaborators (April 2023) followed a similar recipe with a Q-Former plug to Vicuna.^[19] MiniGPT-v2 (October 2023) generalized it to multiple visual tasks.

Microsoft Research pursued a parallel "multimodal large language model" line under the Kosmos name. Kosmos-1 (Huang et al., February 2023), subtitled "Language Is Not All You Need," trained a language model from scratch on interleaved image and text data and could perform captioning, VQA, and even a nonverbal IQ style test (Raven's Progressive Matrices) in a zero shot setting. Kosmos-2 (Peng et al., June 2023) added grounding: it represented referring expressions as markdown style links carrying bounding box location tokens, trained on a large grounded image text dataset called GrIT, so the model could point to the specific image region it was describing.[^kosmos]

Hugging Face released Idefics (Image aware Decoder Enhanced a la Flamingo) in August 2023, an open reproduction of Flamingo in 9B and 80B sizes.^[20] Idefics2 (April 2024) replaced the cross attention design with a more modern "perceiver resampler plus LLM" stack and pushed open benchmark scores significantly higher.^[21] Idefics3 (August 2024) reached parity with much larger closed models on document and chart benchmarks.^[22]

Alibaba's Qwen-VL (Bai et al., August 2023) and its instruction tuned variant Qwen-VL-Chat were the first credibly bilingual (English and Chinese) open VLMs.^[23] Qwen2-VL (August 2024) introduced dynamic resolution, allowing the model to process images of varying sizes natively, and added long video support.^[24] Qwen2.5-VL (January 2025) extended the lineup further with stronger document parsing across 3B, 7B, 32B, and 72B sizes;^[25] a refreshed Qwen2.5-VL-32B followed on March 24, 2025 with outputs tuned more closely to human preferences.[^qwen25vl] In September 2025 Alibaba released Qwen3-VL, available in both dense (2B, 4B, 8B, 32B) and mixture of experts (30B-A3B, 235B-A22B) variants, with native context up to 256K tokens (extendable toward 1M), OCR across 32 languages, and long video understanding.[^qwen3vl]

InternVL from the Shanghai AI Lab and OpenGVLab (Chen et al., December 2023) scaled the vision encoder itself to 6 billion parameters, far larger than the ViT-L/14 used by most VLMs.^[26] InternVL 1.5 (April 2024) added native dynamic resolution and a 26B parameter combined model.^[27] InternVL 2 (July 2024) introduced progressive alignment training. InternVL 2.5 (December 2024) became one of the first open models to surpass 70% on the MMMU benchmark.^[28] InternVL3 (Zhu et al., April 2025) switched to a native multimodal pretraining recipe that jointly learns vision and language in a single stage rather than bolting vision onto a finished text model; the 78B variant reached 72.2 on MMMU, a state of the art among open weight models at release.[^internvl3]

2024 wave of small open VLMs

Microsoft released Phi-3-Vision (4.2 billion parameters) in May 2024, the multimodal version of its Phi-3 small language model family.^[29] Phi-3.5-Vision (August 2024) added multi image support. The Phi-Vision models were aimed at on device and edge use cases.

OpenBMB and Tsinghua released the MiniCPM-V family. MiniCPM-Llama3-V 2.5 (May 2024) packed a Llama 3 8B LLM with a SigLIP vision encoder into a model that ran on a phone.^[37] MiniCPM-V 2.6 (August 2024) added single image, multi image, and video understanding in one 8B model.^[37] The line later extended to MiniCPM-o for on device multimodal interaction.

DeepSeek released DeepSeek-VL in March 2024 and DeepSeek-VL2 in December 2024.^[38]^[39] The VL2 release used a mixture of experts (MoE) language model backbone, with the largest active expert configuration around 4.5B active parameters.^[39]

Mistral AI's first multimodal release, Pixtral 12B, appeared on September 11, 2024.^[35] Pixtral introduced a 400M parameter vision encoder trained from scratch and used a Mistral Nemo 12B language backbone.^[35] Pixtral Large (November 18, 2024) scaled the same architecture with a Mistral Large 2 backbone to roughly 124B parameters.^[36]

Meta released Llama 3.2 11B Vision and Llama 3.2 90B Vision on September 25, 2024, alongside the smaller text only Llama 3.2 1B and 3B models.^[30] The vision variants used cross attention adapters in the style of Flamingo rather than a projector and concat design, and were the first Llama models with native image input.^[30]

The Allen Institute for AI released Molmo (Multimodal Open Language Model) by Matt Deitke, Christopher Clark, and collaborators in September 2024.^[40] Molmo was trained on a fully open new dataset called PixMo that included spoken caption transcripts collected from human annotators.^[40] The 72B model reached numbers competitive with proprietary systems while keeping the entire data, weights, and code open.^[40] NYU and Meta's Cambrian-1, led by Yann LeCun's group (June 2024), focused on vision centric design choices and benchmarked many vision backbones in one study.^[41]

Google also entered the open VLM space with PaliGemma (Beyer et al., released May 2024, paper July 2024), a versatile sub-3B base model that paired a 400M SigLIP-So400m vision encoder with a 2B Gemma language model and was meant to be fine tuned for transfer to specific tasks.[^paligemma] PaliGemma 2 (December 2024) kept the SigLIP encoder but upgraded to Gemma 2 backbones, offered 3B, 10B, and 28B sizes at 224, 448, and 896 pixel resolutions, and reported strong results on specialized tasks such as chemical formula recognition, music score recognition, and chest X-ray report generation.[^paligemma2]

Other notable 2024 open releases included 01.AI's Yi-VL (January 2024), Kuaishou's KwaiVGI, Shanghai AI Lab's InternLM-XComposer line, Tencent's HunyuanVL, and ByteDance's CogVLM and CogVLM2.

2025 wave of open VLMs

Open multimodal releases accelerated through 2025, with several major labs folding vision directly into their flagship open weight model families.

Google DeepMind's Gemma 3 (announced March 12, 2025) made the Gemma line multimodal for the first time. The 4B, 12B, and 27B variants share a single frozen SigLIP based vision encoder, handle 128K token contexts, and can answer questions about images and read text inside them, while the 1B variant stays text only.[^gemma3]

Meta's Llama 4 herd (April 5, 2025) was the first Llama generation built as natively multimodal, with an early fusion design that mixes text and vision tokens during joint pretraining, and a mixture of experts architecture. Llama 4 Scout used 17B active parameters with 16 experts and an unusually long context window, while Llama 4 Maverick used 17B active parameters with 128 experts.[^llama4]

Mistral AI folded vision into its small open model line with Mistral Small 3.1 (March 17, 2025), an Apache 2.0 licensed 24B model adding image understanding and a 128K token context window.[^mistralsmall31] Moonshot AI released Kimi-VL (April 2025), a mixture of experts VLM that activates only about 2.8B parameters in its language decoder (the Kimi-VL-A3B configuration) yet competes with larger efficient VLMs; a long thinking variant, Kimi-VL-A3B-Thinking, followed in June 2025.[^kimivl] These joined the continuing Qwen3-VL and InternVL3 families described above as the leading open weight options of 2025.

Closed frontier vision models (2023 to 2025)

OpenAI added vision input to GPT-4 with the GPT-4V release on September 25, 2023, and rolled it out to ChatGPT Plus users a few weeks later.^[31] GPT-4o ("omni"), released May 13, 2024, was OpenAI's first natively multimodal model, trained end to end on text, image, and audio tokens in one network.^[32] GPT-4o was followed by GPT-4o-mini (July 2024) and o1 with vision (December 2024). OpenAI's later flagship GPT-5 (released August 7, 2025) continued to accept text, image, and file input as a unified multimodal model.[^gpt5]

Google introduced Gemini 1.0 Ultra, Pro, and Nano in December 2023, with Gemini 1.0 Pro Vision being the first widely available variant.^[33] Gemini was described as natively multimodal from the start.^[33] Gemini 1.5 Pro (February 2024) launched with a 1 million token context window, allowing it to process around an hour of video or 11 hours of audio in a single prompt.^[34] Gemini 2.0 Flash (December 2024) added native image generation as well as native image understanding. The Gemini 2.5 family (March to June 2025) added thinking modes and 2 million token contexts.

Anthropic's Claude 3 family (March 2024) added image input to the Opus, Sonnet, and Haiku tiers. Claude 3.5 Sonnet (June 2024), Claude 3.7 Sonnet (February 2025), the Claude 4 models (Opus and Sonnet, May 2025), and Claude Opus 4.1 (August 3, 2025) all retained vision input.[^claude41] xAI's Grok 1.5V (April 2024) and later Grok 2 with vision introduced the RealWorldQA benchmark;^[53] Grok 4 (July 2025) continued the multimodal line. By 2025 the leading proprietary VLMs (GPT-5, the Gemini 2.5 family, and Claude Opus 4.1) and the strongest open weight models (Qwen3-VL, InternVL3) reported broadly comparable scores on multimodal reasoning benchmarks such as MMMU.

Notable open vision language models

Model	Released	Org	Size	Vision encoder	Language backbone	Notes
BLIP	Jan 2022	Salesforce	224M to 446M	ViT-B/L	BERT-large	Bootstrapped caption filtering^[10]
Flamingo	Apr 2022	DeepMind	80B	NFNet-F6	Chinchilla 70B	Closed, but defined cross attention design^[12]
BLIP-2	Jan 2023	Salesforce	188M Q-Former	ViT-g/14	OPT, FLAN-T5	Q-Former bridge to frozen LLM^[11]
MiniGPT-4	Apr 2023	KAUST	13B	ViT-g/14	Vicuna	Open Q-Former plus Vicuna^[19]
LLaVA-1.0	Apr 2023	UW, Microsoft	7B, 13B	CLIP ViT-L/14	Vicuna	GPT-4 generated visual instructions^[15]
InstructBLIP	May 2023	Salesforce	4B to 13B	ViT-g/14	Vicuna, FLAN-T5	Instruction tuned BLIP-2
IDEFICS	Aug 2023	Hugging Face	9B, 80B	OpenCLIP H	LLaMA	Open reproduction of Flamingo^[20]
Kosmos-2	Jun 2023	Microsoft	1.6B	CLIP ViT-L/14	from scratch	Grounding via bounding box tokens
Qwen-VL	Aug 2023	Alibaba	9.6B	OpenCLIP-bigG	Qwen-7B	First strong bilingual open VLM^[23]
CogVLM	Sep 2023	Zhipu, Tsinghua	17B	EVA-2-CLIP-E	Vicuna	Visual expert attention modules
LLaVA-1.5	Oct 2023	UW, Microsoft	7B, 13B	CLIP ViT-L/14-336	Vicuna 1.5	MLP projector, academic VQA mix^[16]
Fuyu-8B	Oct 2023	Adept	8B	none, raw patches	Persimmon	No separate encoder
Yi-VL	Jan 2024	01.AI	6B, 34B	CLIP ViT-H	Yi	Bilingual
LLaVA-NeXT (1.6)	Jan 2024	UW, Microsoft	7B, 13B, 34B	CLIP ViT-L/14	Vicuna, Mistral, Nous-Hermes	Higher resolution via tiling^[17]
MoE-LLaVA	Jan 2024	Peking U	3B active	CLIP	Phi-2, StableLM	Sparse MoE
DeepSeek-VL	Mar 2024	DeepSeek	1.3B, 7B	SigLIP-L	DeepSeek LLM	Hybrid SAM and SigLIP encoder^[38]
Idefics2	Apr 2024	Hugging Face	8B	SigLIP-SO400M	Mistral 7B	Replaced cross attn with perceiver^[21]
InternVL 1.5	Apr 2024	Shanghai AI Lab	26B	InternViT-6B	InternLM2-20B	Dynamic resolution^[27]
Phi-3-Vision	May 2024	Microsoft	4.2B	CLIP ViT-L/14	Phi-3-mini	Small device focused^[29]
MiniCPM-Llama3-V 2.5	May 2024	OpenBMB	8B	SigLIP-400M	Llama 3 8B	Runs on phones^[37]
PaliGemma	May 2024	Google	3B	SigLIP-So400m	Gemma 2B	Versatile transfer base model
Chameleon	Jun 2024	Meta FAIR	7B, 34B	quantized image tokens	Llama-style	Early fusion, mixed modality output
Cambrian-1	Jun 2024	NYU, Meta	8B to 34B	SVA spatial vision aggregator	Llama 3, Hermes	Vision centric design study^[41]
InternVL 2	Jul 2024	Shanghai AI Lab	1B to 76B	InternViT-300M to 6B	Various	Progressive alignment
Pixtral 12B	Sep 2024	Mistral	12B	400M from scratch	Mistral Nemo	First Mistral multimodal^[35]
Llama 3.2 Vision	Sep 2024	Meta	11B, 90B	self trained vision adapter	Llama 3.1	Cross attention adapter^[30]
Molmo	Sep 2024	Allen AI	1B to 72B	OpenAI CLIP, SigLIP	OLMo, Qwen2	PixMo dataset, fully open^[40]
Pixtral Large	Nov 2024	Mistral	124B	1B Pixtral-L vision	Mistral Large 2	Frontier scale open weights^[36]
InternVL 2.5	Dec 2024	Shanghai AI Lab	1B to 78B	InternViT-300M to 6B	InternLM 2.5	First open MMMU 70^[28]
DeepSeek-VL2	Dec 2024	DeepSeek	4.5B active MoE	SigLIP	DeepSeek-MoE	MoE backbone^[39]
PaliGemma 2	Dec 2024	Google	3B, 10B, 28B	SigLIP-So400m	Gemma 2	Multi resolution, OCR and X-ray captions
Qwen2.5-VL	Jan 2025	Alibaba	3B, 7B, 32B, 72B	dynamic ViT	Qwen2.5	Strong document parsing^[25]
Gemma 3	Mar 2025	Google	4B, 12B, 27B	SigLIP	Gemma 3	Single GPU, 128K context
Mistral Small 3.1	Mar 2025	Mistral	24B	Pixtral encoder	Mistral Small 3	Vision added to small open model
Llama 4 Scout / Maverick	Apr 2025	Meta	17B active MoE	early fusion	Llama 4	Native multimodal, MoE
Kimi-VL	Apr 2025	Moonshot AI	2.8B active MoE	MoonViT	Moonlight MoE	Efficient MoE, long thinking variant
InternVL3	Apr 2025	Shanghai AI Lab	1B to 78B	InternViT	various	Native multimodal pretraining, MMMU 72.2
Qwen3-VL	Sep 2025	Alibaba	2B to 235B (MoE)	dynamic ViT	Qwen3	256K context, 32-language OCR

Notable closed vision language models

Model	Released	Org	Notes
GPT-4V	Sep 25, 2023	OpenAI	First closed VLM at frontier scale, rolled out via ChatGPT Plus^[31]
Gemini 1.0 Pro Vision	Dec 2023	Google	Marketed as natively multimodal from training^[33]
Claude 3 Opus, Sonnet, Haiku	Mar 2024	Anthropic	Image input added to entire Claude line
Gemini 1.5 Pro	Feb 2024	Google	1M token context with video and audio^[34]
Reka Core, Flash, Edge	Apr 2024	Reka AI	Native multimodal trained from scratch
GPT-4o	May 13, 2024	OpenAI	End to end text, image, and audio tokens^[32]
Claude 3.5 Sonnet	Jun 2024	Anthropic	Strong chart and document VQA
Grok 1.5V	Apr 2024	xAI	RealWorldQA benchmark released alongside^[53]
GPT-4o mini	Jul 2024	OpenAI	Cheap vision tier
o1	Dec 2024	OpenAI	Reasoning model with vision
Gemini 2.0 Flash	Dec 2024	Google	Native image generation plus understanding
Claude 3.7 Sonnet	Feb 2025	Anthropic	Extended thinking, image input
Gemini 2.5 Pro	Mar 2025	Google	Thinking model, up to 2M context
Claude Opus 4, Sonnet 4	May 2025	Anthropic	Vision retained in Claude 4 generation
Grok 4	Jul 2025	xAI	Multimodal reasoning model
GPT-5	Aug 7, 2025	OpenAI	Unified text, image, and file input
Claude Opus 4.1	Aug 3, 2025	Anthropic	Refined Claude 4 flagship with vision

OCR focused models

While general VLMs handle OCR as a side capability, several models target text recognition and document parsing as the primary task.

Model	Released	Org	Approach
LayoutLM	2019	Microsoft	Joint text plus layout pretraining for forms^[43]
LayoutLMv2	2020	Microsoft	Adds image features
LayoutLMv3	2022	Microsoft	Unified text and image masking^[44]
Donut	2022	NAVER Clova	OCR free document understanding via end to end transformer^[42]
Pix2Struct	2022	Google	Screenshot to structured text pretraining
Nougat	2023	Meta	Academic PDF to markdown, math friendly
GOT-OCR2.0	2024	Stepfun, USTC	Unified text, formula, chart, music score recognition^[45]
Florence-2	2024	Microsoft	Unified detection, segmentation, captioning, OCR
Mistral OCR 3	2024	Mistral	API based document OCR
olmOCR	Feb 2025	Allen AI	7B OCR model, fully open dataset and weights^[46]
DeepSeek-OCR	2025	DeepSeek	Open OCR with mixture of experts backbone

Donut, published by Geewook Kim and collaborators (ECCV 2022), introduced the OCR free paradigm: instead of running a separate text detector and recognizer, a single transformer reads pixel patches and emits structured output such as JSON.^[42] Most modern VLMs follow this end to end approach implicitly.

Architectures

Three dominant architectures account for most image-to-text models from 2022 onward.

Vision encoder plus projection plus LLM (the LLaVA pattern)

A pretrained vision transformer encodes the image into a sequence of patch embeddings. A small adapter (a linear layer, an MLP, or a Q-Former) projects those embeddings into the language model's hidden size. The projected tokens are concatenated with the text tokens and fed to a standard decoder LLM. This is the dominant open source recipe, used by LLaVA, MiniGPT-4, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and DeepSeek-VL. It is cheap to train because the vision encoder and often the LLM are frozen or only partially tuned.

Cross attention adapter (the Flamingo pattern)

The LLM is kept entirely frozen and unchanged. New gated cross attention layers are interleaved between the existing transformer blocks. These new layers attend from the text hidden states to a perceived sequence of image features. Because the new layers are gated and start at zero, the base LLM's text only behavior is preserved exactly at initialization. Flamingo, Idefics 1, and Llama 3.2 Vision use this design.^[12]^[20]^[30] It is more parameter efficient at inference but requires modifying the base LLM, which complicates deployment.

Native multimodal early fusion (the Chameleon and Gemini pattern)

Every modality is tokenized at the input. Text uses byte pair encoding. Images are tokenized into discrete codes (Chameleon uses a VQ-VAE tokenizer; Gemini uses a custom image tokenizer). Audio uses a similar discrete code. All token streams are interleaved into a single sequence that a standard transformer processes from scratch. The model can generate any modality as output if it has been trained to predict tokens of that modality. Chameleon (Meta, May 2024), Janus (DeepSeek, October 2024), and Gemini are the public examples. Native multimodal models tend to require much more compute to train but can generate as well as understand multiple modalities with one network.

Q-Former (BLIP-2)

The Q-Former is a small transformer with a fixed set of learnable query embeddings (32 in BLIP-2).^[11] It cross attends to the frozen image encoder output, extracting a fixed length summary that the frozen LLM can ingest. The Q-Former trades some expressiveness for very low cost: only the Q-Former parameters (around 188M) are trained.^[11] Its successors in InstructBLIP and X-LLM keep this idea.

Evaluation benchmarks

The benchmark landscape mirrors the task families described above.

Benchmark	Released	Focus	Metric
Flickr30k	2014	Captioning	BLEU, CIDEr
COCO Captions	2015	Captioning	BLEU, METEOR, CIDEr, SPICE
VQAv2	2017	General VQA	Accuracy^[5]
GQA	2019	Compositional VQA	Accuracy^[6]
OK-VQA	2019	VQA requiring external knowledge	Accuracy^[7]
TextVQA	2019	VQA over images with text	Accuracy
DocVQA	2020	Document VQA	ANLS^[51]
ChartQA	2022	Charts and plots VQA	Relaxed accuracy^[52]
InfographicVQA	2022	Infographics VQA	ANLS
ScienceQA	2022	K-12 science with images	Accuracy
POPE	2023	Object hallucination	F1^[50]
MM-Vet	2023	Integrated capabilities	Open ended GPT-4 grading^[49]
MMBench	2023	Multiple choice multimodal	Accuracy
SEED-Bench	2023	Multiple choice multimodal	Accuracy
LLaVA-Bench (In-the-Wild)	2023	Open ended VQA	GPT-4 grading
MMMU	Nov 2023	College multi discipline	Accuracy^[47]
MathVista	Oct 2023	Mathematical visual reasoning	Accuracy^[48]
RealWorldQA	Apr 2024	Spatial real world questions	Accuracy^[53]
VisualWebArena	2024	Web browsing agents	Task success
MMMU-Pro	Sep 2024	Harder MMMU	Accuracy
MathVerse	2024	Diagram heavy math	Accuracy

Key landmarks:

VQAv2 (Goyal et al. 2017) rebalanced the original VQA dataset to remove language priors and is the closest thing to a universal sanity check.^[5]
GQA (Hudson and Manning 2019) used scene graphs from Visual Genome to generate compositional questions that test reasoning over object attributes and relations.^[6]
OK-VQA (Marino et al. 2019) requires external world knowledge to answer.^[7]
MMMU (Yue et al., November 2023) introduced college level multi discipline questions covering art, business, science, health, humanities, and engineering, with around 11,500 questions and explicit image grounding.^[47] As of early 2025, GPT-4o, Gemini 2.0, Claude 3.5 Sonnet, and InternVL 2.5 78B all report scores in the high 60s to mid 70s on the validation set.
MathVista (Lu et al., October 2023) combines algebra, geometry, statistics, and scientific figure questions.^[48] It surfaced large gaps between models on visual mathematical reasoning.
MM-Vet (Yu et al. 2023) uses GPT-4 as a judge to grade integrated capabilities (recognition, OCR, knowledge, language generation, spatial reasoning, math).^[49]
RealWorldQA, released by xAI in April 2024 with Grok 1.5V, focuses on physical world spatial questions photographed from cars and rooms.^[53]

Leaderboards aggregated by OpenCompass, the Hugging Face Open VLM Leaderboard, and lmms-eval track most of these benchmarks across hundreds of open and closed models.^[55]^[56]

Open source ecosystem

A mature stack supports running and fine tuning open VLMs.

Hugging Face Transformers ships first class support for most open VLMs via AutoModelForVision2Seq, LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration, MllamaForConditionalGeneration (Llama 3.2 Vision), Idefics3ForConditionalGeneration, PixtralForConditionalGeneration, and similar classes. The Hugging Face Hub hosts public weights for thousands of fine tuned VLM variants.
vLLM added vision model support in 2024 and now serves LLaVA, Qwen-VL, Llama 3.2 Vision, Pixtral, InternVL, Phi-3-Vision, MiniCPM-V, and most other major open VLMs via OpenAI compatible APIs.
SGLang by the LMSYS group supports VLMs with structured generation primitives.
Ollama packages open VLMs (LLaVA, BakLLaVA, MiniCPM-V, Llama 3.2 Vision, Qwen2.5-VL, Granite Vision) as local quantized GGUF builds runnable on a laptop.
LM Studio offers a desktop GUI for the same family of models.
MLX-VLM runs VLMs natively on Apple Silicon via the MLX framework.
llama.cpp supports LLaVA style models via the mmproj adapter file pattern.
lmms-eval is the de facto open evaluation harness for VLMs.

Datasets used for instruction tuning of open VLMs include LLaVA-Instruct-150K, ShareGPT4V, the Cauldron (used for Idefics2), Cambrian-10M (used for Cambrian-1), PixMo (used for Molmo), the Idefics3 training mix, and various synthetic OCR and chart datasets generated by Qwen-VL and InternVL teams.

Applications

Accessibility. Generating alt text for images, describing scenes for blind and low vision users (Be My Eyes integrated GPT-4V in 2023 for this purpose).^[31]
Document workflows. Receipt and invoice extraction, contract analysis, scanned form digitization, slide and PDF parsing in tools such as Anthropic Claude's document features and Google's Document AI.
Coding from screenshots. Tools such as v0 by Vercel and Cursor accept screenshots and generate matching HTML, CSS, and React code.
Robotics and embodied AI. Vision language action models build on VLMs to produce robot motor commands. RT-2 (Google DeepMind 2023), PaLM-E, and OpenVLA all use a VLM backbone.
Web and GUI agents. Anthropic's Claude with computer use (October 2024), OpenAI's Operator (January 2025), and Google's Project Astra read screen pixels to control computers and phones.
Content moderation. Automated flagging of unsafe images using textual policy descriptions.
E-commerce. Product attribute extraction from listing photos, visual search, automatic product description writing.
Medical imaging. Med-PaLM M (Google 2023), LLaVA-Med (Microsoft 2023), and CheXagent (Stanford 2024) generate radiology reports from chest X-rays.
Education. Math homework helpers, science diagram tutors, and visual chain of thought tools.
Search. Google Lens, Bing Visual Search, and Perplexity's image search use VLM components.

Limitations

Hallucinated objects. Open VLMs often describe objects that are not in the image, especially when the prompt suggests certain objects. The POPE benchmark (Li et al. 2023) specifically measures this failure mode.^[50]
Counting. Even frontier models miscount objects past four or five, and struggle with crowd estimation.
Fine grained text in images. OCR quality drops sharply at small font sizes or rotated text; specialized OCR models still outperform general VLMs on dense documents.
Spatial reasoning. Locations, distances, and 3D relationships are hard. RealWorldQA highlighted this in 2024.^[53]
Charts with subtle numeric differences. Chart VQA degrades when models must read exact values rather than relative trends.
Bias from web pretraining. CLIP and successors inherited biases from web image text pairs, including gender and racial associations with occupations and demographic labels.^[8]
Resolution constraints. Most VLMs internally downsample images to a fixed grid (often 336 by 336 or 448 by 448 pixels per tile), which limits the readable detail. Dynamic resolution methods (Qwen2-VL, InternVL 1.5) partially address this.^[24]^[27]
Adversarial images. Imperceptible pixel perturbations can flip model outputs.
Visual prompt injection. Text instructions embedded inside an image can override the user's prompt, a security risk for any agent that reads screenshots.
Privacy. A photograph contains far more identifying detail than the user usually realizes (timestamps in EXIF data, license plates, signage, faces). VLMs can extract all of it.

References

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). "Show and Tell: A Neural Image Caption Generator." CVPR 2015. arXiv:1411.4555. ↩
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015. arXiv:1502.03044. ↩
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering." CVPR 2018. arXiv:1707.07998. ↩
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). "VQA: Visual Question Answering." ICCV 2015. arXiv:1505.00468. ↩
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). "Making the V in VQA Matter." CVPR 2017. arXiv:1612.00837. ↩
Hudson, D. A., & Manning, C. D. (2019). "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering." CVPR 2019. arXiv:1902.09506. ↩
Marino, K., Rastegari, M., Farhadi, A., & Mottaghi, R. (2019). "OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge." CVPR 2019. ↩
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). arXiv:2103.00020. ↩
Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" (ALIGN). ICML 2021. arXiv:2102.05918. ↩
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." arXiv:2201.12086. ↩
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." arXiv:2301.12597. ↩
Alayrac, J. B., Donahue, J., Luc, P., Miech, A., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv:2204.14198. ↩
Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., & Wang, L. (2022). "GIT: A Generative Image-to-text Transformer for Vision and Language." arXiv:2205.14100. ↩
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., et al. (2022). "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework." arXiv:2202.03052. ↩
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485. ↩
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). "Improved Baselines with Visual Instruction Tuning" (LLaVA-1.5). arXiv:2310.03744. ↩
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge." Blog post, January 30, 2024. ↩
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., et al. (2024). "LLaVA-OneVision: Easy Visual Task Transfer." arXiv:2408.03326. ↩
Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models." arXiv:2304.10592. ↩
Hugging Face team (2023). "Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model." Hugging Face blog, August 22, 2023. ↩
Laurençon, H., Tronchon, L., Cord, M., & Sanh, V. (2024). "What matters when building vision-language models?" (Idefics2). arXiv:2405.02246. ↩
Laurençon, H., Marafioti, A., Sanh, V., & Tronchon, L. (2024). "Building and better understanding vision-language models: insights and future directions" (Idefics3). arXiv:2408.12637. ↩
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966. ↩
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., et al. (2024). "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191. ↩
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., et al. (2025). "Qwen2.5-VL Technical Report." arXiv:2502.13923. ↩
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., et al. (2023). "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." arXiv:2312.14238. ↩
Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., et al. (2024). "How Far Are We to GPT-4V? Closing the Gap with Commercial Multimodal Models with Open-Source Suites" (InternVL 1.5). arXiv:2404.16821. ↩
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., et al. (2024). "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling" (InternVL 2.5). arXiv:2412.05271. ↩
Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., et al. (2024). "Phi-3 Technical Report." arXiv:2404.14219. ↩
Meta AI (2024). "Llama 3.2: Revolutionizing edge AI and vision with open, customizable models." Meta AI blog, September 25, 2024. ↩
OpenAI (2023). "GPT-4V(ision) System Card." OpenAI, September 25, 2023. ↩
OpenAI (2024). "Hello GPT-4o." OpenAI blog, May 13, 2024. ↩
Gemini Team, Google (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805. ↩
Gemini Team, Google (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530. ↩
Mistral AI (2024). "Pixtral 12B." Mistral AI blog, September 11, 2024. ↩
Mistral AI (2024). "Pixtral Large." Mistral AI blog, November 18, 2024. ↩
Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., et al. (2024). "MiniCPM-V: A GPT-4V Level MLLM on Your Phone." arXiv:2408.01800. ↩
Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., et al. (2024). "DeepSeek-VL: Towards Real-World Vision-Language Understanding." arXiv:2403.05525. ↩
Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., et al. (2024). "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding." arXiv:2412.10302. ↩
Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., et al. (2024). "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models." arXiv:2409.17146. ↩
Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S. C., et al. (2024). "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs." arXiv:2406.16860. ↩
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., et al. (2022). "OCR-free Document Understanding Transformer" (Donut). ECCV 2022. arXiv:2111.15664. ↩
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." KDD 2020. arXiv:1912.13318. ↩
Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking." arXiv:2204.08387. ↩
Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., et al. (2024). "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model" (GOT-OCR2.0). arXiv:2409.01704. ↩
Poznanski, J., Borchardt, J., Dunkelberger, J., Tran, R., Lochab, R., Hossain, B., et al. (2025). "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models." arXiv:2502.18443. ↩
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., et al. (2023). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." arXiv:2311.16502. ↩
Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., et al. (2023). "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts." arXiv:2310.02255. ↩
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., & Wang, L. (2023). "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities." arXiv:2308.02490. ↩
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., & Wen, J. R. (2023). "Evaluating Object Hallucination in Large Vision-Language Models" (POPE). arXiv:2305.10355. ↩
Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). "DocVQA: A Dataset for VQA on Document Images." WACV 2021. arXiv:2007.00398. ↩
Masry, A., Long, D. X., Tan, J. Q., Joty, S., & Hoque, E. (2022). "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning." ACL 2022. arXiv:2203.10244. ↩
xAI (2024). "Grok-1.5 Vision Preview." xAI blog, April 12, 2024 (includes RealWorldQA release). ↩
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312. ↩
Hugging Face. "Open VLM Leaderboard." Hugging Face Spaces. ↩
OpenCompass. "OpenCompass Multi-modal Leaderboard." ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

4 revisions by 1 contributors · full history

Suggest edit

What links here

Zero-Shot Image Classification Models