Image-to-Text Models
Last reviewed
May 31, 2026
Sources
56 citations
Review status
Source-backed
Revision
v3 · 7,183 words
See also: Multimodal Models and Tasks
Image-to-text models are machine learning systems that take an image as input and produce natural language text as output. The category covers image captioning (writing a sentence that describes what is in a picture), visual question answering (answering a question about an image), document understanding (reading scanned forms, receipts, or charts), OCR (transcribing printed or handwritten text), and the broader class of vision language models (VLMs) that combine an image encoder with a large language model to support open-ended visual reasoning. Image-to-text is one of the core task directions within the wider field of multimodal models. Modern systems such as GPT-4 with vision, Gemini, LLaVA, Qwen-VL, and Pixtral treat images as just another token stream that the language model can attend to alongside text. This article is a catalog of notable image-to-text models; for the architecture and theory of how image and text modalities are fused, see the dedicated vision language model page linked above.
An image-to-text model accepts pixel data (one image, several images, or a video frame sequence) and emits a text string. The task family includes image captioning, visual question answering, document understanding, and OCR.
A single modern VLM typically handles all of these tasks at once. The same checkpoint that captions a photo can also read a receipt, answer a question about a chart, and explain a meme. This consolidation is recent. Before 2022, image captioning models, VQA models, and OCR systems were usually separate research lines with their own architectures.
Early image captioning relied on hand crafted pipelines: detect objects, retrieve sentences from a fixed corpus, and stitch them together with templates. The Flickr8k and Flickr30k datasets, released by Hodosh, Young, and Hockenmaier in 2013 and 2014, were the standard benchmarks. The MS COCO Captions dataset (Chen et al. 2015) expanded the field with around 330,000 images, each paired with five human written captions, and remains the most cited captioning benchmark.
The first widely cited deep learning captioning model was "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google (2015). It paired a convolutional neural network image encoder (Inception) with a long short term memory (LSTM) decoder.
A few months later, "Show, Attend and Tell" by Kelvin Xu and collaborators (2015) added soft and hard visual attention so the decoder could look at different image regions for different output words. The bottom up top down attention model from Peter Anderson et al. (2018) replaced grid features with region features from Faster R-CNN and was the basis for several VQA Challenge winners.
VQA itself became a benchmark with the VQA dataset (Antol et al. 2015) and its rebalanced successor VQAv2 (Goyal et al. 2017). Models in this era used custom fusion modules such as MUTAN, BAN, and MCAN to combine image and question features.
CLIP (Contrastive Language Image Pretraining), released by OpenAI in January 2021 (Radford et al.), trained an image encoder and a text encoder jointly on 400 million image text pairs scraped from the web. The model learned to match images and captions in a shared embedding space. CLIP itself does not generate captions, but its image encoder became the standard vision backbone for most later generative VLMs (LLaVA, BLIP-2, MiniGPT-4, and others all use a CLIP ViT-L/14 or SigLIP variant). ALIGN, published by Chao Jia and collaborators at Google in 2021, scaled the same contrastive recipe to roughly 1.8 billion noisy pairs.
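The contrastive matching described above can be illustrated with a small sketch. It computes a symmetric cross entropy loss over a batch of paired image and text embeddings, where each image should score highest against its own caption and vice versa. This is a minimal pure Python illustration, not OpenAI's implementation; the function names and the fixed `temperature=0.07` are assumptions made for the example (in CLIP the temperature is a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should pick out its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2
```

When the paired embeddings point in the same direction the loss is close to zero, and it grows when captions are shuffled relative to their images.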
In early 2022 Salesforce Research released BLIP (Bootstrapping Language Image Pre-training) by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP introduced a unified vision language pretraining objective and a captioning filter that bootstrapped its own training data. The successor BLIP-2 (January 2023) was more influential. It froze both a vision encoder (typically a CLIP ViT) and an LLM (OPT or FLAN-T5), then trained a lightweight Querying Transformer (Q-Former) to bridge the two. With only the Q-Former trainable, BLIP-2 reached strong VQA and captioning numbers at a fraction of the training cost of end to end models.
DeepMind's Flamingo (Alayrac et al., April 2022) took a different route. A frozen Chinchilla LLM was extended with new gated cross attention layers that attended to features from a frozen NFNet image encoder. Flamingo was the first VLM to support interleaved image text prompts with strong few shot performance.
Other 2022 entries included Microsoft's GIT (Generative Image to Text) by Jianfeng Wang and collaborators (May 2022), Alibaba's OFA (Wang et al. 2022) which framed every vision language task as text to text, and Google's PaLI (Chen et al. 2022). Google extended the PaLI line with PaLI-X, a 55 billion parameter multilingual model with an improved OCR training recipe (Chen et al., May 2023), and PaLI-3 (Chen et al., October 2023), a much smaller 5 billion parameter model that swapped the classification pretrained vision backbone for a contrastively pretrained SigLIP encoder and matched models roughly ten times its size on many benchmarks.[^pali3]
The model that opened the floodgates was LLaVA (Large Language and Vision Assistant) by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee at the University of Wisconsin and Microsoft Research, first posted to arXiv in April 2023. LLaVA combined a CLIP ViT-L/14 image encoder with the Vicuna LLM via a simple linear projection layer. Crucially, the authors used GPT-4 to generate 158,000 multimodal instruction following examples from COCO captions and bounding box annotations, then fine tuned the model on this synthetic instruction data. The recipe was simple, the code was open, the results were good, and within a few months dozens of similar systems appeared.
LLaVA-1.5 (October 2023) replaced the linear projector with a two layer MLP, added academic VQA datasets to the training mix, and reached state of the art numbers among open 7B and 13B models. LLaVA-NeXT (also called LLaVA-1.6, January 2024) added higher input resolution via image tiling. LLaVA-OneVision (August 2024) extended the family to single image, multi image, and video inputs in one checkpoint.
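To make the cost of tiling concrete, the sketch below splits an image into a grid of encoder-sized tiles and counts the visual tokens that result when each tile, plus one downscaled overview of the whole image, is encoded separately. It is a simplified illustration under stated assumptions: a 336 pixel tile with 14 pixel patches (matching CLIP ViT-L/14-336), a plain ceiling-division grid, and hypothetical function names. The actual LLaVA-NeXT code selects from predefined grid shapes and resizes or pads the image, which this sketch does not reproduce.

```python
PATCH = 14   # patch size of a ViT-L/14 encoder
TILE = 336   # encoder input resolution assumed for this sketch

def tile_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) boxes that cover the image in a grid."""
    cols = -(-width // tile)   # ceiling division
    rows = -(-height // tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            # Edge tiles are clipped here; a real pipeline would pad or resize.
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

def visual_token_count(width, height, tile=TILE, patch=PATCH):
    """Tokens produced if every tile and one overview image are each encoded."""
    tokens_per_view = (tile // patch) ** 2          # 24 * 24 = 576
    views = len(tile_boxes(width, height, tile)) + 1  # +1 for the overview
    return views * tokens_per_view
```

Under these assumptions a 672 by 672 image yields four tiles plus the overview, so the language model sees five times as many visual tokens as it would for a single 336 pixel view.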
MiniGPT-4 by Deyao Zhu and collaborators (April 2023) followed a similar recipe, connecting a Q-Former to Vicuna. MiniGPT-v2 (October 2023) generalized it to multiple visual tasks.
Microsoft Research pursued a parallel "multimodal large language model" line under the Kosmos name. Kosmos-1 (Huang et al., February 2023), subtitled "Language Is Not All You Need," trained a language model from scratch on interleaved image and text data and could perform captioning, VQA, and even a nonverbal IQ style test (Raven's Progressive Matrices) in a zero shot setting. Kosmos-2 (Peng et al., June 2023) added grounding: it represented referring expressions as markdown style links carrying bounding box location tokens, trained on a large grounded image text dataset called GrIT, so the model could point to the specific image region it was describing.[^kosmos]
Hugging Face released Idefics (Image aware Decoder Enhanced a la Flamingo) in August 2023, an open reproduction of Flamingo in 9B and 80B sizes. Idefics2 (April 2024) replaced the cross attention design with a more modern "perceiver resampler plus LLM" stack and pushed open benchmark scores significantly higher. Idefics3 (August 2024) reached parity with much larger closed models on document and chart benchmarks.
Alibaba's Qwen-VL (Bai et al., August 2023) and its instruction tuned variant Qwen-VL-Chat were the first credibly bilingual (English and Chinese) open VLMs. Qwen2-VL (August 2024) introduced dynamic resolution, allowing the model to process images of varying sizes natively, and added long video support. Qwen2.5-VL (January 2025) extended the lineup further with stronger document parsing across 3B, 7B, 32B, and 72B sizes; a refreshed Qwen2.5-VL-32B followed on March 24, 2025 with outputs tuned more closely to human preferences.[^qwen25vl] In September 2025 Alibaba released Qwen3-VL, available in both dense (2B, 4B, 8B, 32B) and mixture of experts (30B-A3B, 235B-A22B) variants, with native context up to 256K tokens (extendable toward 1M), OCR across 32 languages, and long video understanding.[^qwen3vl]
InternVL from the Shanghai AI Lab and OpenGVLab (Chen et al., December 2023) scaled the vision encoder itself to 6 billion parameters, far larger than the ViT-L/14 used by most VLMs. InternVL 1.5 (April 2024) added native dynamic resolution and a 26B parameter combined model. InternVL 2 (July 2024) introduced progressive alignment training. InternVL 2.5 (December 2024) became one of the first open models to surpass 70% on the MMMU benchmark. InternVL3 (Zhu et al., April 2025) switched to a native multimodal pretraining recipe that jointly learns vision and language in a single stage rather than bolting vision onto a finished text model; the 78B variant reached 72.2 on MMMU, a state of the art among open weight models at release.[^internvl3]
Microsoft released Phi-3-Vision (4.2 billion parameters) in May 2024, the multimodal version of its Phi-3 small language model family. Phi-3.5-Vision (August 2024) added multi image support. The Phi-Vision models were aimed at on device and edge use cases.
OpenBMB and Tsinghua released the MiniCPM-V family. MiniCPM-Llama3-V 2.5 (May 2024) packed a Llama 3 8B LLM with a SigLIP vision encoder into a model that ran on a phone. MiniCPM-V 2.6 (August 2024) added single image, multi image, and video understanding in one 8B model. The line later extended to MiniCPM-o for on device multimodal interaction.
DeepSeek released DeepSeek-VL in March 2024 and DeepSeek-VL2 in December 2024. The VL2 release used a mixture of experts (MoE) language model backbone, with the largest configuration activating around 4.5B parameters.
Mistral AI's first multimodal release, Pixtral 12B, appeared on September 11, 2024. Pixtral introduced a 400M parameter vision encoder trained from scratch and used a Mistral Nemo 12B language backbone. Pixtral Large (November 18, 2024) scaled the same architecture with a Mistral Large 2 backbone to roughly 124B parameters.
Meta released Llama 3.2 11B Vision and Llama 3.2 90B Vision on September 25, 2024, alongside the smaller text only Llama 3.2 1B and 3B models. The vision variants used cross attention adapters in the style of Flamingo rather than a projector and concat design, and were the first Llama models with native image input.
The Allen Institute for AI released Molmo (Multimodal Open Language Model) by Matt Deitke, Christopher Clark, and collaborators in September 2024. Molmo was trained on a fully open new dataset called PixMo that included spoken caption transcripts collected from human annotators. The 72B model reached numbers competitive with proprietary systems while keeping the entire data, weights, and code open. NYU and Meta's Cambrian-1, led by Yann LeCun's group (June 2024), focused on vision centric design choices and benchmarked many vision backbones in one study.
Google also entered the open VLM space with PaliGemma (Beyer et al., released May 2024, paper July 2024), a versatile sub-3B base model that paired a 400M SigLIP-So400m vision encoder with a 2B Gemma language model and was meant to be fine tuned for transfer to specific tasks.[^paligemma] PaliGemma 2 (December 2024) kept the SigLIP encoder but upgraded to Gemma 2 backbones, offered 3B, 10B, and 28B sizes at 224, 448, and 896 pixel resolutions, and reported strong results on specialized tasks such as chemical formula recognition, music score recognition, and chest X-ray report generation.[^paligemma2]
Other notable 2024 open releases included 01.AI's Yi-VL (January 2024), Kuaishou's KwaiVGI, Shanghai AI Lab's InternLM-XComposer line, Tencent's HunyuanVL, and CogVLM and CogVLM2 from Zhipu AI and Tsinghua.
Open multimodal releases accelerated through 2025, with several major labs folding vision directly into their flagship open weight model families.
Google DeepMind's Gemma 3 (announced March 12, 2025) made the Gemma line multimodal for the first time. The 4B, 12B, and 27B variants share a single frozen SigLIP based vision encoder, handle 128K token contexts, and can answer questions about images and read text inside them, while the 1B variant stays text only.[^gemma3]
Meta's Llama 4 herd (April 5, 2025) was the first Llama generation built as natively multimodal, with an early fusion design that mixes text and vision tokens during joint pretraining, and a mixture of experts architecture. Llama 4 Scout used 17B active parameters with 16 experts and an unusually long context window, while Llama 4 Maverick used 17B active parameters with 128 experts.[^llama4]
Mistral AI folded vision into its small open model line with Mistral Small 3.1 (March 17, 2025), an Apache 2.0 licensed 24B model adding image understanding and a 128K token context window.[^mistralsmall31] Moonshot AI released Kimi-VL (April 2025), a mixture of experts VLM that activates only about 2.8B parameters in its language decoder (the Kimi-VL-A3B configuration) yet competes with larger efficient VLMs; a long thinking variant, Kimi-VL-A3B-Thinking, followed in June 2025.[^kimivl] These joined the continuing Qwen3-VL and InternVL3 families described above as the leading open weight options of 2025.
OpenAI added vision input to GPT-4 with the GPT-4V release on September 25, 2023, and rolled it out to ChatGPT Plus users a few weeks later. GPT-4o ("omni"), released May 13, 2024, was OpenAI's first natively multimodal model, trained end to end on text, image, and audio tokens in one network. GPT-4o was followed by GPT-4o-mini (July 2024) and o1 with vision (December 2024). OpenAI's later flagship GPT-5 (released August 7, 2025) continued to accept text, image, and file input as a unified multimodal model.[^gpt5]
Google introduced Gemini 1.0 Ultra, Pro, and Nano in December 2023, with Gemini 1.0 Pro Vision being the first widely available variant. Gemini was described as natively multimodal from the start. Gemini 1.5 Pro (February 2024) launched with a 1 million token context window, allowing it to process around an hour of video or 11 hours of audio in a single prompt. Gemini 2.0 Flash (December 2024) added native image generation as well as native image understanding. The Gemini 2.5 family (March to June 2025) added thinking modes and 2 million token contexts.
Anthropic's Claude 3 family (March 2024) added image input to the Opus, Sonnet, and Haiku tiers. Claude 3.5 Sonnet (June 2024), Claude 3.7 Sonnet (February 2025), the Claude 4 models (Opus and Sonnet, May 2025), and Claude Opus 4.1 (August 3, 2025) all retained vision input.[^claude41] xAI's Grok 1.5V (April 2024) was released alongside the RealWorldQA benchmark and was followed by Grok 2 with vision; Grok 4 (July 2025) continued the multimodal line. By 2025 the leading proprietary VLMs (GPT-5, the Gemini 2.5 family, and Claude Opus 4.1) and the strongest open weight models (Qwen3-VL, InternVL3) reported broadly comparable scores on multimodal reasoning benchmarks such as MMMU.
| Model | Released | Org | Size | Vision encoder | Language backbone | Notes |
|---|---|---|---|---|---|---|
| BLIP | Jan 2022 | Salesforce | 224M to 446M | ViT-B/L | BERT-large | Bootstrapped caption filtering |
| Flamingo | Apr 2022 | DeepMind | 80B | NFNet-F6 | Chinchilla 70B | Closed, but defined cross attention design |
| BLIP-2 | Jan 2023 | Salesforce | 188M Q-Former | ViT-g/14 | OPT, FLAN-T5 | Q-Former bridge to frozen LLM |
| MiniGPT-4 | Apr 2023 | KAUST | 13B | ViT-g/14 | Vicuna | Open Q-Former plus Vicuna |
| LLaVA-1.0 | Apr 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14 | Vicuna | GPT-4 generated visual instructions |
| InstructBLIP | May 2023 | Salesforce | 4B to 13B | ViT-g/14 | Vicuna, FLAN-T5 | Instruction tuned BLIP-2 |
| IDEFICS | Aug 2023 | Hugging Face | 9B, 80B | OpenCLIP H | LLaMA | Open reproduction of Flamingo |
| Kosmos-2 | Jun 2023 | Microsoft | 1.6B | CLIP ViT-L/14 | from scratch | Grounding via bounding box tokens |
| Qwen-VL | Aug 2023 | Alibaba | 9.6B | OpenCLIP-bigG | Qwen-7B | First strong bilingual open VLM |
| CogVLM | Sep 2023 | Zhipu, Tsinghua | 17B | EVA-2-CLIP-E | Vicuna | Visual expert attention modules |
| LLaVA-1.5 | Oct 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14-336 | Vicuna 1.5 | MLP projector, academic VQA mix |
| Fuyu-8B | Oct 2023 | Adept | 8B | none, raw patches | Persimmon | No separate encoder |
| Yi-VL | Jan 2024 | 01.AI | 6B, 34B | CLIP ViT-H | Yi | Bilingual |
| LLaVA-NeXT (1.6) | Jan 2024 | UW, Microsoft | 7B, 13B, 34B | CLIP ViT-L/14 | Vicuna, Mistral, Nous-Hermes | Higher resolution via tiling |
| MoE-LLaVA | Jan 2024 | Peking U | 3B active | CLIP | Phi-2, StableLM | Sparse MoE |
| DeepSeek-VL | Mar 2024 | DeepSeek | 1.3B, 7B | SigLIP-L | DeepSeek LLM | Hybrid SAM and SigLIP encoder |
| Idefics2 | Apr 2024 | Hugging Face | 8B | SigLIP-SO400M | Mistral 7B | Replaced cross attn with perceiver |
| InternVL 1.5 | Apr 2024 | Shanghai AI Lab | 26B | InternViT-6B | InternLM2-20B | Dynamic resolution |
| Phi-3-Vision | May 2024 | Microsoft | 4.2B | CLIP ViT-L/14 | Phi-3-mini | Small device focused |
| MiniCPM-Llama3-V 2.5 | May 2024 | OpenBMB | 8B | SigLIP-400M | Llama 3 8B | Runs on phones |
| PaliGemma | May 2024 | Google | 3B | SigLIP-So400m | Gemma 2B | Versatile transfer base model |
| Chameleon | Jun 2024 | Meta FAIR | 7B, 34B | quantized image tokens | Llama-style | Early fusion, mixed modality output |
| Cambrian-1 | Jun 2024 | NYU, Meta | 8B to 34B | SVA spatial vision aggregator | Llama 3, Hermes | Vision centric design study |
| InternVL 2 | Jul 2024 | Shanghai AI Lab | 1B to 76B | InternViT-300M to 6B | Various | Progressive alignment |
| Pixtral 12B | Sep 2024 | Mistral | 12B | 400M from scratch | Mistral Nemo | First Mistral multimodal |
| Llama 3.2 Vision | Sep 2024 | Meta | 11B, 90B | self trained vision adapter | Llama 3.1 | Cross attention adapter |
| Molmo | Sep 2024 | Allen AI | 1B to 72B | OpenAI CLIP, SigLIP | OLMo, Qwen2 | PixMo dataset, fully open |
| Pixtral Large | Nov 2024 | Mistral | 124B | 1B Pixtral-L vision | Mistral Large 2 | Frontier scale open weights |
| InternVL 2.5 | Dec 2024 | Shanghai AI Lab | 1B to 78B | InternViT-300M to 6B | InternLM 2.5 | Among the first open models above 70% on MMMU |
| DeepSeek-VL2 | Dec 2024 | DeepSeek | 4.5B active MoE | SigLIP | DeepSeek-MoE | MoE backbone |
| PaliGemma 2 | Dec 2024 | Google | 3B, 10B, 28B | SigLIP-So400m | Gemma 2 | Multi resolution, OCR and X-ray captions |
| Qwen2.5-VL | Jan 2025 | Alibaba | 3B, 7B, 32B, 72B | dynamic ViT | Qwen2.5 | Strong document parsing |
| Gemma 3 | Mar 2025 | Google | 4B, 12B, 27B | SigLIP | Gemma 3 | Single GPU, 128K context |
| Mistral Small 3.1 | Mar 2025 | Mistral | 24B | Pixtral encoder | Mistral Small 3 | Vision added to small open model |
| Llama 4 Scout / Maverick | Apr 2025 | Meta | 17B active MoE | early fusion | Llama 4 | Native multimodal, MoE |
| Kimi-VL | Apr 2025 | Moonshot AI | 2.8B active MoE | MoonViT | Moonlight MoE | Efficient MoE, long thinking variant |
| InternVL3 | Apr 2025 | Shanghai AI Lab | 1B to 78B | InternViT | various | Native multimodal pretraining, MMMU 72.2 |
| Qwen3-VL | Sep 2025 | Alibaba | 2B to 235B (MoE) | dynamic ViT | Qwen3 | 256K context, 32-language OCR |
| Model | Released | Org | Notes |
|---|---|---|---|
| GPT-4V | Sep 25, 2023 | OpenAI | First closed VLM at frontier scale, rolled out via ChatGPT Plus |
| Gemini 1.0 Pro Vision | Dec 2023 | Google | Marketed as natively multimodal from training |
| Claude 3 Opus, Sonnet, Haiku | Mar 2024 | Anthropic | Image input added to entire Claude line |
| Gemini 1.5 Pro | Feb 2024 | Google | 1M token context with video and audio |
| Reka Core, Flash, Edge | Apr 2024 | Reka AI | Native multimodal trained from scratch |
| GPT-4o | May 13, 2024 | OpenAI | End to end text, image, and audio tokens |
| Claude 3.5 Sonnet | Jun 2024 | Anthropic | Strong chart and document VQA |
| Grok 1.5V | Apr 2024 | xAI | RealWorldQA benchmark released alongside |
| GPT-4o mini | Jul 2024 | OpenAI | Cheap vision tier |
| o1 | Dec 2024 | OpenAI | Reasoning model with vision |
| Gemini 2.0 Flash | Dec 2024 | Google | Native image generation plus understanding |
| Claude 3.7 Sonnet | Feb 2025 | Anthropic | Extended thinking, image input |
| Gemini 2.5 Pro | Mar 2025 | Google | Thinking model, up to 2M context |
| Claude Opus 4, Sonnet 4 | May 2025 | Anthropic | Vision retained in Claude 4 generation |
| Grok 4 | Jul 2025 | xAI | Multimodal reasoning model |
| GPT-5 | Aug 7, 2025 | OpenAI | Unified text, image, and file input |
| Claude Opus 4.1 | Aug 3, 2025 | Anthropic | Refined Claude 4 flagship with vision |
While general VLMs handle OCR as a side capability, several models target text recognition and document parsing as the primary task.
| Model | Released | Org | Approach |
|---|---|---|---|
| LayoutLM | 2019 | Microsoft | Joint text plus layout pretraining for forms |
| LayoutLMv2 | 2020 | Microsoft | Adds image features |
| LayoutLMv3 | 2022 | Microsoft | Unified text and image masking |
| Donut | 2022 | NAVER Clova | OCR free document understanding via end to end transformer |
| Pix2Struct | 2022 | Google | Screenshot to structured text pretraining |
| Nougat | 2023 | Meta | Academic PDF to markdown, math friendly |
| GOT-OCR2.0 | 2024 | Stepfun, USTC | Unified text, formula, chart, music score recognition |
| Florence-2 | 2024 | Microsoft | Unified detection, segmentation, captioning, OCR |
| Mistral OCR 3 | 2024 | Mistral | API based document OCR |
| olmOCR | Feb 2025 | Allen AI | 7B OCR model, fully open dataset and weights |
| DeepSeek-OCR | 2025 | DeepSeek | Open OCR with mixture of experts backbone |
Donut, published by Geewook Kim and collaborators (ECCV 2022), introduced the OCR free paradigm: instead of running a separate text detector and recognizer, a single transformer reads pixel patches and emits structured output such as JSON. Most modern VLMs follow this end to end approach implicitly.
Three dominant architectures account for most image-to-text models from 2022 onward.
In the projector and concatenation design, a pretrained vision transformer encodes the image into a sequence of patch embeddings. A small adapter (a linear layer, an MLP, or a Q-Former) projects those embeddings into the language model's hidden size. The projected tokens are concatenated with the text tokens and fed to a standard decoder LLM. This is the dominant open source recipe, used by LLaVA, MiniGPT-4, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and DeepSeek-VL. It is cheap to train because the vision encoder and often the LLM are frozen or only partially tuned.
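The data flow of this recipe can be sketched in a few lines. The example below projects patch embeddings through a two layer MLP into the language model's hidden size and then concatenates them with text token embeddings to form the sequence the decoder would consume. It is a toy illustration with assumed names (`MLPProjector`, `build_llm_input`), random weights, and tiny dimensions; it does not reproduce any specific model's code, and real systems insert the image tokens at a placeholder position in the prompt rather than always at the front.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b, with W stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Exact GELU activation via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_linear(d_in, d_out, rng):
    """Small random weights and zero bias (illustrative initialization)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

class MLPProjector:
    """Two layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, d_vision, d_llm, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(d_vision, d_llm, rng)
        self.w2, self.b2 = init_linear(d_llm, d_llm, rng)

    def __call__(self, patch_embeddings):
        projected = []
        for patch in patch_embeddings:
            hidden = [gelu(v) for v in linear(patch, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected

def build_llm_input(image_tokens, text_embeddings):
    """Concatenate projected image tokens with text embeddings along the sequence axis."""
    return image_tokens + text_embeddings
```

Because the projector is the only new component, training can update just these two weight matrices while the encoder and the LLM stay frozen.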
In the cross attention design, the LLM's weights are kept frozen. New gated cross attention layers are interleaved between the existing transformer blocks. These layers attend from the text hidden states to a sequence of image features (in Flamingo, the output of a Perceiver Resampler). Because the new layers are gated and the gates start at zero, the base LLM's text only behavior is preserved exactly at initialization. Flamingo, Idefics 1, and Llama 3.2 Vision use this design. It is more parameter efficient at inference but requires inserting layers into the base LLM, which complicates deployment.
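The zero-initialized gate is what makes this design safe to bolt onto a frozen model, and a short sketch shows why. The example adds the output of a cross attention step to the text hidden states through a `tanh` gate, as Flamingo does; with the gate parameter at zero the layer is an exact identity. The attention here is a single head with no learned projections, and all function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_feats):
    """Each text state attends over all image features (single head, no projections)."""
    d = len(text_states[0])
    attended = []
    for query in text_states:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in image_feats]
        weights = softmax(scores)
        attended.append([sum(w * feat[j] for w, feat in zip(weights, image_feats))
                         for j in range(d)])
    return attended

def gated_cross_attention(text_states, image_feats, gate):
    """Residual update h + tanh(gate) * attn(h, image); gate=0.0 leaves h unchanged."""
    attended = cross_attention(text_states, image_feats)
    g = math.tanh(gate)
    return [[h + g * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(text_states, attended)]
```

As training moves the gate away from zero, visual information is blended into the text stream gradually rather than disrupting the pretrained language model from the first step.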
In the native multimodal (early fusion) design, every modality is tokenized at the input. Text uses byte pair encoding. Images are tokenized into discrete codes (Chameleon uses a VQ-VAE tokenizer; Gemini uses a custom image tokenizer). Audio uses a similar discrete code. All token streams are interleaved into a single sequence, which a standard transformer trained from scratch processes. The model can generate any modality as output if it has been trained to predict tokens of that modality. Chameleon (Meta, May 2024), Janus (DeepSeek, October 2024), and Gemini are the public examples. Native multimodal models tend to require much more compute to train but can generate as well as understand multiple modalities with one network.
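One way to picture the single interleaved sequence is to give image codes their own range of ids in a shared vocabulary. The sketch below offsets discrete image codes past the text vocabulary, wraps each image in begin and end markers, and flattens a mixed list of segments into one id sequence. The vocabulary sizes, the marker ids, and the function names are all hypothetical; real tokenizers for these models differ in their details.

```python
TEXT_VOCAB_SIZE = 1000        # hypothetical number of text (BPE) tokens
IMAGE_CODEBOOK_SIZE = 512     # hypothetical number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image marker id
EOI = BOI + 1                                # end-of-image marker id

def image_codes_to_token_ids(codes):
    """Shift image codes into their own id range and wrap them in markers."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {code}")
    return [BOI] + [TEXT_VOCAB_SIZE + code for code in codes] + [EOI]

def interleave(segments):
    """Flatten ('text', ids) and ('image', codes) segments into one token sequence."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(payload)
        elif kind == "image":
            sequence.extend(image_codes_to_token_ids(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence
```

Since image ids live in the same vocabulary as text ids, the same next-token prediction head can emit either kind, which is what allows such models to generate images as well as read them.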
The Q-Former is a small transformer with a fixed set of learnable query embeddings (32 in BLIP-2). It cross attends to the frozen image encoder output, extracting a fixed length summary that the frozen LLM can ingest. The Q-Former trades some expressiveness for very low cost: only the Q-Former parameters (around 188M) are trained. Its successors in InstructBLIP and X-LLM keep this idea.
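The fixed length property can be demonstrated with a stripped down sketch: a set of query vectors cross attends to however many image features the encoder emits and always returns one output per query. This is only the core idea. The real Q-Former is a multi layer transformer with self attention, learned projections, and trained queries, none of which appear here; the class name `QueryResampler` and the random query initialization are assumptions for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class QueryResampler:
    """Summarize a variable number of image features into num_queries vectors."""

    def __init__(self, num_queries=32, dim=8, seed=0):
        rng = random.Random(seed)
        # In a trained model these queries are learned; here they are random.
        self.queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(num_queries)]

    def __call__(self, image_feats):
        dim = len(self.queries[0])
        outputs = []
        for query in self.queries:
            scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
                      for feat in image_feats]
            weights = softmax(scores)
            # Each output is a weighted average of the image features.
            outputs.append([sum(w * feat[j] for w, feat in zip(weights, image_feats))
                            for j in range(dim)])
        return outputs
```

Whether the encoder produces 50 or 257 patch features, the language model receives the same 32 vectors, which keeps the visual prefix short and its cost predictable.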
The benchmark landscape mirrors the task families described above.
| Benchmark | Released | Focus | Metric |
|---|---|---|---|
| Flickr30k | 2014 | Captioning | BLEU, CIDEr |
| COCO Captions | 2015 | Captioning | BLEU, METEOR, CIDEr, SPICE |
| VQAv2 | 2017 | General VQA | Accuracy |
| GQA | 2019 | Compositional VQA | Accuracy |
| OK-VQA | 2019 | VQA requiring external knowledge | Accuracy |
| TextVQA | 2019 | VQA over images with text | Accuracy |
| DocVQA | 2020 | Document VQA | ANLS |
| ChartQA | 2022 | Charts and plots VQA | Relaxed accuracy |
| InfographicVQA | 2022 | Infographics VQA | ANLS |
| ScienceQA | 2022 | K-12 science with images | Accuracy |
| POPE | 2023 | Object hallucination | F1 |
| MM-Vet | 2023 | Integrated capabilities | Open ended GPT-4 grading |
| MMBench | 2023 | Multiple choice multimodal | Accuracy |
| SEED-Bench | 2023 | Multiple choice multimodal | Accuracy |
| LLaVA-Bench (In-the-Wild) | 2023 | Open ended VQA | GPT-4 grading |
| MMMU | Nov 2023 | College multi discipline | Accuracy |
| MathVista | Oct 2023 | Mathematical visual reasoning | Accuracy |
| RealWorldQA | Apr 2024 | Spatial real world questions | Accuracy |
| VisualWebArena | 2024 | Web browsing agents | Task success |
| MMMU-Pro | Sep 2024 | Harder MMMU | Accuracy |
| MathVerse | 2024 | Diagram heavy math | Accuracy |
Key landmarks:
Leaderboards aggregated by OpenCompass, the Hugging Face Open VLM Leaderboard, and lmms-eval track most of these benchmarks across hundreds of open and closed models.
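Two of the metrics in the benchmark table are less familiar than plain accuracy, so a sketch may help. ANLS (Average Normalized Levenshtein Similarity), used by DocVQA and InfographicVQA, gives partial credit for near-miss strings and zero once the normalized edit distance reaches a threshold of 0.5. Relaxed accuracy, used by ChartQA, accepts numeric answers within 5% of the reference. The code below follows those common definitions; the function names and the lowercase and strip normalization are assumptions, and official evaluation scripts may differ in such details.

```python
def levenshtein(a, b):
    """Edit distance between two strings using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best similarity against any reference answer."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Similarity counts only when the strings are close enough.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)

def relaxed_match(pred, ref, tolerance=0.05):
    """Numeric answers match within a relative tolerance; others need exact match."""
    try:
        p, r = float(pred), float(ref)
    except ValueError:
        return pred.strip().lower() == ref.strip().lower()
    if r == 0:
        return p == 0
    return abs(p - r) / abs(r) <= tolerance
```

For example, a prediction of "invoise" against the reference "invoice" is one edit away over seven characters, so it earns partial credit under ANLS, while an unrelated string earns none.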
A mature stack supports running and fine tuning open VLMs.
The Hugging Face Transformers library provides AutoModelForVision2Seq, LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration, MllamaForConditionalGeneration (Llama 3.2 Vision), Idefics3ForConditionalGeneration, PixtralForConditionalGeneration, and similar classes. The Hugging Face Hub hosts public weights for thousands of fine tuned VLM variants.
... mmproj adapter file pattern.
Datasets used for instruction tuning of open VLMs include LLaVA-Instruct-150K, ShareGPT4V, the Cauldron (used for Idefics2), Cambrian-10M (used for Cambrian-1), PixMo (used for Molmo), the Idefics3 training mix, and various synthetic OCR and chart datasets generated by Qwen-VL and InternVL teams.