# Image-to-Text Models

> Source: https://aiwiki.ai/wiki/image-to-text_models
> Updated: 2026-07-16
> Categories: Machine Learning, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Multimodal Models](/wiki/multimodal_models) and Tasks*

**Image-to-text models** are machine learning systems that take an image as input and produce natural language text as output. The category covers image captioning (writing a sentence that describes what is in a picture), visual question answering (answering a question about an image), document understanding (reading scanned forms, receipts, or charts), [OCR](/wiki/ocr_models) (transcribing printed or handwritten text), and the broader class of [vision language models](/wiki/vision_language_model) (VLMs) that combine an image encoder with a [large language model](/wiki/large_language_model) to support open-ended visual reasoning. Image-to-text is one of the core task directions within the wider field of [multimodal models](/wiki/multimodal_model). Modern systems such as [GPT-4](/wiki/gpt-4) with vision, [Gemini](/wiki/gemini), [LLaVA](/wiki/llava), [Qwen-VL](/wiki/qwen), and [Pixtral](/wiki/pixtral) treat images as just another token stream that the language model can attend to alongside text. This article is a catalog of notable image-to-text models; for the architecture and theory of how image and text modalities are fused, see the dedicated vision language model page linked above.

## Overview

An image-to-text model accepts pixel data (one image, several images, or a video frame sequence) and emits a text string. The task family includes:

- **Image captioning.** Produce a short description of an image, often a single sentence. Classic benchmarks use the [COCO dataset](/wiki/coco_dataset) and report BLEU, METEOR, CIDEr, and SPICE scores.[54]
- **Visual question answering (VQA).** Given an image and a natural language question, return an answer. Datasets include VQAv2, GQA, OK-VQA, and ScienceQA.[5][6][7]
- **Document understanding.** Parse receipts, invoices, scientific papers, slide decks, and screenshots. Benchmarks include DocVQA, ChartQA, InfographicVQA, and TextVQA.[51][52]
- **OCR with reasoning.** Transcribe text and also act on it (translate it, summarize it, fill out a form, or extract structured data). Systems such as GOT-OCR2.0 and olmOCR specialize here.[45][46]
- **Visual chain-of-thought.** Step by step reasoning grounded in an image, often for math problems (MathVista), scientific diagrams (ScienceQA), or multi step puzzles.[48]
- **Open ended visual dialogue.** Multi turn conversation about one or more images, the format pioneered by LLaVA and now standard in commercial chat assistants.[15]

A single modern VLM typically handles all of these tasks at once. The same checkpoint that captions a photo can also read a receipt, answer a question about a chart, and explain a meme. This consolidation is recent. Before 2022, image captioning models, VQA models, and OCR systems were usually separate research lines with their own architectures.

## History

### Pre deep learning era

Early image captioning relied on hand crafted pipelines: detect objects, retrieve sentences from a fixed corpus, and stitch them together with templates. The Flickr8k and Flickr30k datasets, released by Hodosh, Young, and Hockenmaier in 2013 and 2014, were the standard benchmarks. The MS COCO Captions dataset (Chen et al. 2015) expanded the field with around 330,000 images, each paired with five human written captions, and remains the most cited captioning benchmark.

### Show and Tell era (2015 to 2017)

The first widely cited deep learning captioning model was "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google (2015).[1] It paired a convolutional neural network image encoder (Inception) with a long short term memory (LSTM) decoder.[1]

A few months later, "Show, Attend and Tell" by Kelvin Xu and collaborators (2015) added soft and hard visual attention so the decoder could look at different image regions for different output words.[2] The bottom up top down attention model from Peter Anderson et al. (2018) replaced grid features with region features from Faster R-CNN and was the basis for several VQA Challenge winners.[3]

VQA itself became a benchmark with the VQA dataset (Antol et al. 2015) and its rebalanced successor VQAv2 (Goyal et al. 2017).[4][5] Models in this era used custom fusion modules such as MUTAN, BAN, and MCAN to combine image and question features.

### Contrastive era (2021)

[CLIP](/wiki/clip) (Contrastive Language Image Pretraining), released by OpenAI in January 2021 (Radford et al.), trained an image encoder and a text encoder jointly on 400 million image text pairs scraped from the web.[8] The model learned to match images and captions in a shared embedding space. CLIP itself does not generate captions, but its image encoder became the standard vision backbone for most later generative VLMs (LLaVA, BLIP-2, MiniGPT-4, and others all use a CLIP ViT-L/14 or SigLIP variant). ALIGN, published by Chao Jia and collaborators at Google in 2021, scaled the same contrastive recipe to roughly 1.8 billion noisy pairs.[9]

### Generative VLM era (2022)

In early 2022 Salesforce Research released **BLIP** (Bootstrapping Language Image Pre-training) by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP introduced a unified vision language pretraining objective and a captioning filter that bootstrapped its own training data.[10] The successor **BLIP-2** (January 2023) was more influential. It froze a vision encoder (typically a CLIP ViT) and froze an LLM (OPT or FLAN-T5), then trained a small lightweight Querying Transformer (Q-Former) to bridge the two.[11] With only the Q-Former being trainable, BLIP-2 reached strong VQA and captioning numbers at a fraction of the training cost of end to end models.[11]

DeepMind's **Flamingo** (Alayrac et al., April 2022) took a different route. A frozen Chinchilla LLM was extended with new gated cross attention layers that attended to features from a frozen NFNet image encoder.[12] Flamingo was the first VLM to support interleaved image text prompts with strong few shot performance.[12]

Other 2022 entries included Microsoft's **GIT** (Generative Image to Text) by Jianfeng Wang and collaborators (May 2022),[13] Alibaba's **OFA** (Wang et al. 2022) which framed every vision language task as text to text,[14] and Google's **PaLI** (Chen et al. 2022). Google extended the PaLI line with **PaLI-X**, a 55 billion parameter multilingual model with an improved OCR training recipe (Chen et al., May 2023), and **PaLI-3** (Chen et al., October 2023), a much smaller 5 billion parameter model that swapped the classification pretrained vision backbone for a contrastively pretrained SigLIP encoder and matched models roughly ten times its size on many benchmarks.[^pali3]

### LLaVA and the open VLM explosion (2023)

The model that opened the floodgates was **[LLaVA](/wiki/llava)** (Large Language and Vision Assistant) by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee at the University of Wisconsin and Microsoft Research, first posted to arXiv in April 2023.[15] LLaVA combined a CLIP ViT-L/14 image encoder with the Vicuna LLM via a simple linear projection layer.[15] Crucially, the authors used GPT-4 to generate 158,000 multimodal instruction following examples from COCO captions and bounding box annotations, then fine tuned the model on this synthetic instruction data.[15] The recipe was simple, the code was open, the results were good, and within a few months dozens of similar systems appeared.

**LLaVA-1.5** (October 2023) replaced the linear projector with a two layer MLP, added academic VQA datasets to the training mix, and reached state of the art numbers among open 7B and 13B models.[16] **LLaVA-NeXT** (also called LLaVA-1.6, January 2024) added higher input resolution via image tiling.[17] **LLaVA-OneVision** (August 2024) extended the family to single image, multi image, and video inputs in one checkpoint.[18]

MiniGPT-4 by Deyao Zhu and collaborators (April 2023) followed a similar recipe with a Q-Former plug to Vicuna.[19] MiniGPT-v2 (October 2023) generalized it to multiple visual tasks.

Microsoft Research pursued a parallel "multimodal large language model" line under the **Kosmos** name. **Kosmos-1** (Huang et al., February 2023), subtitled "Language Is Not All You Need," trained a language model from scratch on interleaved image and text data and could perform captioning, VQA, and even a nonverbal IQ style test (Raven's Progressive Matrices) in a zero shot setting. **Kosmos-2** (Peng et al., June 2023) added grounding: it represented referring expressions as markdown style links carrying bounding box location tokens, trained on a large grounded image text dataset called GrIT, so the model could point to the specific image region it was describing.[^kosmos]

[Hugging Face](/wiki/hugging_face) released **Idefics** (Image aware Decoder Enhanced a la Flamingo) in August 2023, an open reproduction of Flamingo in 9B and 80B sizes.[20] **Idefics2** (April 2024) replaced the cross attention design with a more modern "perceiver resampler plus LLM" stack and pushed open benchmark scores significantly higher.[21] **Idefics3** (August 2024) reached parity with much larger closed models on document and chart benchmarks.[22]

Alibaba's **Qwen-VL** (Bai et al., August 2023) and its instruction tuned variant Qwen-VL-Chat were the first credibly bilingual (English and Chinese) open VLMs.[23] **Qwen2-VL** (August 2024) introduced dynamic resolution, allowing the model to process images of varying sizes natively, and added long video support.[24] **Qwen2.5-VL** (January 2025) extended the lineup further with stronger document parsing across 3B, 7B, 32B, and 72B sizes;[25] a refreshed **Qwen2.5-VL-32B** followed on March 24, 2025 with outputs tuned more closely to human preferences.[^qwen25vl] In September 2025 Alibaba released **Qwen3-VL**, available in both dense (2B, 4B, 8B, 32B) and mixture of experts (30B-A3B, 235B-A22B) variants, with native context up to 256K tokens (extendable toward 1M), OCR across 32 languages, and long video understanding.[^qwen3vl]

**[InternVL](/wiki/internvl)** from the Shanghai AI Lab and OpenGVLab (Chen et al., December 2023) scaled the vision encoder itself to 6 billion parameters, far larger than the ViT-L/14 used by most VLMs.[26] InternVL 1.5 (April 2024) added native dynamic resolution and a 26B parameter combined model.[27] InternVL 2 (July 2024) introduced progressive alignment training. InternVL 2.5 (December 2024) became one of the first open models to surpass 70% on the MMMU benchmark.[28] **InternVL3** (Zhu et al., April 2025) switched to a native multimodal pretraining recipe that jointly learns vision and language in a single stage rather than bolting vision onto a finished text model; the 78B variant reached 72.2 on MMMU, a state of the art among open weight models at release.[^internvl3]

### 2024 wave of small open VLMs

Microsoft released **Phi-3-Vision** (4.2 billion parameters) in May 2024, the multimodal version of its Phi-3 small language model family.[29] **Phi-3.5-Vision** (August 2024) added multi image support. The Phi-Vision models were aimed at on device and edge use cases.

OpenBMB and Tsinghua released the **MiniCPM-V** family. **MiniCPM-Llama3-V 2.5** (May 2024) packed a Llama 3 8B LLM with a SigLIP vision encoder into a model that ran on a phone.[37] **MiniCPM-V 2.6** (August 2024) added single image, multi image, and video understanding in one 8B model.[37] The line later extended to MiniCPM-o for on device multimodal interaction.

DeepSeek released **DeepSeek-VL** in March 2024 and **DeepSeek-VL2** in December 2024.[38][39] The VL2 release used a mixture of experts (MoE) language model backbone, with the largest active expert configuration around 4.5B active parameters.[39]

Mistral AI's first multimodal release, **[Pixtral](/wiki/pixtral) 12B**, appeared on September 11, 2024.[35] Pixtral introduced a 400M parameter vision encoder trained from scratch and used a Mistral Nemo 12B language backbone.[35] **Pixtral Large** (November 18, 2024) scaled the same architecture with a Mistral Large 2 backbone to roughly 124B parameters.[36]

Meta released **[Llama 3.2](/wiki/llama_3_2) 11B Vision** and **Llama 3.2 90B Vision** on September 25, 2024, alongside the smaller text only Llama 3.2 1B and 3B models.[30] The vision variants used cross attention adapters in the style of Flamingo rather than a projector and concat design, and were the first Llama models with native image input.[30]

The Allen Institute for AI released **[Molmo](/wiki/molmo)** (Multimodal Open Language Model) by Matt Deitke, Christopher Clark, and collaborators in September 2024.[40] Molmo was trained on a fully open new dataset called PixMo that included spoken caption transcripts collected from human annotators.[40] The 72B model reached numbers competitive with proprietary systems while keeping the entire data, weights, and code open.[40] NYU and Meta's Cambrian-1, led by Yann LeCun's group (June 2024), focused on vision centric design choices and benchmarked many vision backbones in one study.[41]

Google also entered the open VLM space with **PaliGemma** (Beyer et al., released May 2024, paper July 2024), a versatile sub-3B base model that paired a 400M SigLIP-So400m vision encoder with a 2B Gemma language model and was meant to be fine tuned for transfer to specific tasks.[^paligemma] **PaliGemma 2** (December 2024) kept the SigLIP encoder but upgraded to Gemma 2 backbones, offered 3B, 10B, and 28B sizes at 224, 448, and 896 pixel resolutions, and reported strong results on specialized tasks such as chemical formula recognition, music score recognition, and chest X-ray report generation.[^paligemma2]

Other notable 2024 open releases included 01.AI's Yi-VL (January 2024), Kuaishou's KwaiVGI, Shanghai AI Lab's InternLM-XComposer line, Tencent's HunyuanVL, and ByteDance's CogVLM and CogVLM2.

### 2025 wave of open VLMs

Open multimodal releases accelerated through 2025, with several major labs folding vision directly into their flagship open weight model families.

Google DeepMind's **Gemma 3** (announced March 12, 2025) made the Gemma line multimodal for the first time. The 4B, 12B, and 27B variants share a single frozen SigLIP based vision encoder, handle 128K token contexts, and can answer questions about images and read text inside them, while the 1B variant stays text only.[^gemma3]

Meta's **[Llama 4](/wiki/llama_3_2)** herd (April 5, 2025) was the first Llama generation built as natively multimodal, with an early fusion design that mixes text and vision tokens during joint pretraining, and a mixture of experts architecture. Llama 4 Scout used 17B active parameters with 16 experts and an unusually long context window, while Llama 4 Maverick used 17B active parameters with 128 experts.[^llama4]

Mistral AI folded vision into its small open model line with **Mistral Small 3.1** (March 17, 2025), an Apache 2.0 licensed 24B model adding image understanding and a 128K token context window.[^mistralsmall31] Moonshot AI released **Kimi-VL** (April 2025), a mixture of experts VLM that activates only about 2.8B parameters in its language decoder (the Kimi-VL-A3B configuration) yet competes with larger efficient VLMs; a long thinking variant, Kimi-VL-A3B-Thinking, followed in June 2025.[^kimivl] These joined the continuing Qwen3-VL and InternVL3 families described above as the leading open weight options of 2025.

### Closed frontier vision models (2023 to 2025)

OpenAI added vision input to GPT-4 with the **GPT-4V** release on September 25, 2023, and rolled it out to ChatGPT Plus users a few weeks later.[31] **GPT-4o** ("omni"), released May 13, 2024, was OpenAI's first natively multimodal model, trained end to end on text, image, and audio tokens in one network.[32] GPT-4o was followed by GPT-4o-mini (July 2024) and o1 with vision (December 2024). OpenAI's later flagship **GPT-5** (released August 7, 2025) continued to accept text, image, and file input as a unified multimodal model.[^gpt5]

Google introduced **Gemini 1.0 Ultra, Pro, and Nano** in December 2023, with Gemini 1.0 Pro Vision being the first widely available variant.[33] Gemini was described as natively multimodal from the start.[33] **Gemini 1.5 Pro** (February 2024) launched with a 1 million token context window, allowing it to process around an hour of video or 11 hours of audio in a single prompt.[34] **Gemini 2.0 Flash** (December 2024) added native image generation as well as native image understanding. The **Gemini 2.5** family (March to June 2025) added thinking modes and 2 million token contexts.

Anthropic's [Claude 3 family](/wiki/anthropic) (March 2024) added image input to the Opus, Sonnet, and Haiku tiers. Claude 3.5 Sonnet (June 2024), Claude 3.7 Sonnet (February 2025), the Claude 4 models (Opus and Sonnet, May 2025), and Claude Opus 4.1 (August 3, 2025) all retained vision input.[^claude41] xAI's Grok 1.5V (April 2024) and later Grok 2 with vision introduced the RealWorldQA benchmark;[53] Grok 4 (July 2025) continued the multimodal line. By 2025 the leading proprietary VLMs (GPT-5, the Gemini 2.5 family, and Claude Opus 4.1) and the strongest open weight models (Qwen3-VL, InternVL3) reported broadly comparable scores on multimodal reasoning benchmarks such as MMMU.

## Notable open vision language models

| Model | Released | Org | Size | Vision encoder | Language backbone | Notes |
|---|---|---|---|---|---|---|
| BLIP | Jan 2022 | Salesforce | 224M to 446M | ViT-B/L | BERT-large | Bootstrapped caption filtering[10] |
| Flamingo | Apr 2022 | DeepMind | 80B | NFNet-F6 | Chinchilla 70B | Closed, but defined cross attention design[12] |
| BLIP-2 | Jan 2023 | Salesforce | 188M Q-Former | ViT-g/14 | OPT, FLAN-T5 | Q-Former bridge to frozen LLM[11] |
| MiniGPT-4 | Apr 2023 | KAUST | 13B | ViT-g/14 | Vicuna | Open Q-Former plus Vicuna[19] |
| LLaVA-1.0 | Apr 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14 | Vicuna | GPT-4 generated visual instructions[15] |
| InstructBLIP | May 2023 | Salesforce | 4B to 13B | ViT-g/14 | Vicuna, FLAN-T5 | Instruction tuned BLIP-2 |
| IDEFICS | Aug 2023 | Hugging Face | 9B, 80B | OpenCLIP H | LLaMA | Open reproduction of Flamingo[20] |
| Kosmos-2 | Jun 2023 | Microsoft | 1.6B | CLIP ViT-L/14 | from scratch | Grounding via bounding box tokens |
| Qwen-VL | Aug 2023 | Alibaba | 9.6B | OpenCLIP-bigG | Qwen-7B | First strong bilingual open VLM[23] |
| CogVLM | Sep 2023 | Zhipu, Tsinghua | 17B | EVA-2-CLIP-E | Vicuna | Visual expert attention modules |
| LLaVA-1.5 | Oct 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14-336 | Vicuna 1.5 | MLP projector, academic VQA mix[16] |
| Fuyu-8B | Oct 2023 | Adept | 8B | none, raw patches | Persimmon | No separate encoder |
| Yi-VL | Jan 2024 | 01.AI | 6B, 34B | CLIP ViT-H | Yi | Bilingual |
| LLaVA-NeXT (1.6) | Jan 2024 | UW, Microsoft | 7B, 13B, 34B | CLIP ViT-L/14 | Vicuna, Mistral, Nous-Hermes | Higher resolution via tiling[17] |
| MoE-LLaVA | Jan 2024 | Peking U | 3B active | CLIP | Phi-2, StableLM | Sparse MoE |
| DeepSeek-VL | Mar 2024 | DeepSeek | 1.3B, 7B | SigLIP-L | DeepSeek LLM | Hybrid SAM and SigLIP encoder[38] |
| Idefics2 | Apr 2024 | Hugging Face | 8B | SigLIP-SO400M | Mistral 7B | Replaced cross attn with perceiver[21] |
| InternVL 1.5 | Apr 2024 | Shanghai AI Lab | 26B | InternViT-6B | InternLM2-20B | Dynamic resolution[27] |
| Phi-3-Vision | May 2024 | Microsoft | 4.2B | CLIP ViT-L/14 | Phi-3-mini | Small device focused[29] |
| MiniCPM-Llama3-V 2.5 | May 2024 | OpenBMB | 8B | SigLIP-400M | Llama 3 8B | Runs on phones[37] |
| PaliGemma | May 2024 | Google | 3B | SigLIP-So400m | Gemma 2B | Versatile transfer base model |
| Chameleon | Jun 2024 | Meta FAIR | 7B, 34B | quantized image tokens | Llama-style | Early fusion, mixed modality output |
| Cambrian-1 | Jun 2024 | NYU, Meta | 8B to 34B | SVA spatial vision aggregator | Llama 3, Hermes | Vision centric design study[41] |
| InternVL 2 | Jul 2024 | Shanghai AI Lab | 1B to 76B | InternViT-300M to 6B | Various | Progressive alignment |
| Pixtral 12B | Sep 2024 | Mistral | 12B | 400M from scratch | Mistral Nemo | First Mistral multimodal[35] |
| Llama 3.2 Vision | Sep 2024 | Meta | 11B, 90B | self trained vision adapter | Llama 3.1 | Cross attention adapter[30] |
| Molmo | Sep 2024 | Allen AI | 1B to 72B | OpenAI CLIP, SigLIP | OLMo, Qwen2 | PixMo dataset, fully open[40] |
| Pixtral Large | Nov 2024 | Mistral | 124B | 1B Pixtral-L vision | Mistral Large 2 | Frontier scale open weights[36] |
| InternVL 2.5 | Dec 2024 | Shanghai AI Lab | 1B to 78B | InternViT-300M to 6B | InternLM 2.5 | First open MMMU 70[28] |
| DeepSeek-VL2 | Dec 2024 | DeepSeek | 4.5B active MoE | SigLIP | DeepSeek-MoE | MoE backbone[39] |
| PaliGemma 2 | Dec 2024 | Google | 3B, 10B, 28B | SigLIP-So400m | Gemma 2 | Multi resolution, OCR and X-ray captions |
| Qwen2.5-VL | Jan 2025 | Alibaba | 3B, 7B, 32B, 72B | dynamic ViT | Qwen2.5 | Strong document parsing[25] |
| Gemma 3 | Mar 2025 | Google | 4B, 12B, 27B | SigLIP | Gemma 3 | Single GPU, 128K context |
| Mistral Small 3.1 | Mar 2025 | Mistral | 24B | Pixtral encoder | Mistral Small 3 | Vision added to small open model |
| Llama 4 Scout / Maverick | Apr 2025 | Meta | 17B active MoE | early fusion | Llama 4 | Native multimodal, MoE |
| Kimi-VL | Apr 2025 | Moonshot AI | 2.8B active MoE | MoonViT | Moonlight MoE | Efficient MoE, long thinking variant |
| InternVL3 | Apr 2025 | Shanghai AI Lab | 1B to 78B | InternViT | various | Native multimodal pretraining, MMMU 72.2 |
| Qwen3-VL | Sep 2025 | Alibaba | 2B to 235B (MoE) | dynamic ViT | Qwen3 | 256K context, 32-language OCR |

## Notable closed vision language models

| Model | Released | Org | Notes |
|---|---|---|---|
| GPT-4V | Sep 25, 2023 | OpenAI | First closed VLM at frontier scale, rolled out via ChatGPT Plus[31] |
| Gemini 1.0 Pro Vision | Dec 2023 | Google | Marketed as natively multimodal from training[33] |
| Claude 3 Opus, Sonnet, Haiku | Mar 2024 | Anthropic | Image input added to entire Claude line |
| Gemini 1.5 Pro | Feb 2024 | Google | 1M token context with video and audio[34] |
| Reka Core, Flash, Edge | Apr 2024 | Reka AI | Native multimodal trained from scratch |
| GPT-4o | May 13, 2024 | OpenAI | End to end text, image, and audio tokens[32] |
| Claude 3.5 Sonnet | Jun 2024 | Anthropic | Strong chart and document VQA |
| Grok 1.5V | Apr 2024 | xAI | RealWorldQA benchmark released alongside[53] |
| GPT-4o mini | Jul 2024 | OpenAI | Cheap vision tier |
| o1 | Dec 2024 | OpenAI | Reasoning model with vision |
| Gemini 2.0 Flash | Dec 2024 | Google | Native image generation plus understanding |
| Claude 3.7 Sonnet | Feb 2025 | Anthropic | Extended thinking, image input |
| Gemini 2.5 Pro | Mar 2025 | Google | Thinking model, up to 2M context |
| Claude Opus 4, Sonnet 4 | May 2025 | Anthropic | Vision retained in Claude 4 generation |
| Grok 4 | Jul 2025 | xAI | Multimodal reasoning model |
| GPT-5 | Aug 7, 2025 | OpenAI | Unified text, image, and file input |
| Claude Opus 4.1 | Aug 3, 2025 | Anthropic | Refined Claude 4 flagship with vision |

## OCR focused models

While general VLMs handle OCR as a side capability, several models target text recognition and document parsing as the primary task.

| Model | Released | Org | Approach |
|---|---|---|---|
| [LayoutLM](/wiki/layoutlm) | 2019 | Microsoft | Joint text plus layout pretraining for forms[43] |
| LayoutLMv2 | 2020 | Microsoft | Adds image features |
| LayoutLMv3 | 2022 | Microsoft | Unified text and image masking[44] |
| Donut | 2022 | NAVER Clova | OCR free document understanding via end to end transformer[42] |
| Pix2Struct | 2022 | Google | Screenshot to structured text pretraining |
| Nougat | 2023 | Meta | Academic PDF to markdown, math friendly |
| GOT-OCR2.0 | 2024 | Stepfun, USTC | Unified text, formula, chart, music score recognition[45] |
| [Florence-2](/wiki/florence_2) | 2024 | Microsoft | Unified detection, segmentation, captioning, OCR |
| [Mistral OCR 3](/wiki/mistral_ocr_3) | 2024 | Mistral | API based document OCR |
| olmOCR | Feb 2025 | Allen AI | 7B OCR model, fully open dataset and weights[46] |
| [DeepSeek-OCR](/wiki/deepseek-ocr) | 2025 | DeepSeek | Open OCR with mixture of experts backbone |

Donut, published by Geewook Kim and collaborators (ECCV 2022), introduced the OCR free paradigm: instead of running a separate text detector and recognizer, a single transformer reads pixel patches and emits structured output such as JSON.[42] Most modern VLMs follow this end to end approach implicitly.

## Architectures

Three dominant architectures account for most image-to-text models from 2022 onward.

### Vision encoder plus projection plus LLM (the LLaVA pattern)

A pretrained vision transformer encodes the image into a sequence of patch embeddings. A small adapter (a linear layer, an MLP, or a Q-Former) projects those embeddings into the language model's hidden size. The projected tokens are concatenated with the text tokens and fed to a standard decoder LLM. This is the dominant open source recipe, used by LLaVA, MiniGPT-4, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and DeepSeek-VL. It is cheap to train because the vision encoder and often the LLM are frozen or only partially tuned.

### Cross attention adapter (the Flamingo pattern)

The LLM is kept entirely frozen and unchanged. New gated cross attention layers are interleaved between the existing transformer blocks. These new layers attend from the text hidden states to a perceived sequence of image features. Because the new layers are gated and start at zero, the base LLM's text only behavior is preserved exactly at initialization. Flamingo, Idefics 1, and Llama 3.2 Vision use this design.[12][20][30] It is more parameter efficient at inference but requires modifying the base LLM, which complicates deployment.

### Native multimodal early fusion (the Chameleon and Gemini pattern)

Every modality is tokenized at the input. Text uses byte pair encoding. Images are tokenized into discrete codes (Chameleon uses a VQ-VAE tokenizer; Gemini uses a custom image tokenizer). Audio uses a similar discrete code. All token streams are interleaved into a single sequence that a standard transformer processes from scratch. The model can generate any modality as output if it has been trained to predict tokens of that modality. Chameleon (Meta, May 2024), Janus (DeepSeek, October 2024), and Gemini are the public examples. Native multimodal models tend to require much more compute to train but can generate as well as understand multiple modalities with one network.

### Q-Former (BLIP-2)

The Q-Former is a small transformer with a fixed set of learnable query embeddings (32 in BLIP-2).[11] It cross attends to the frozen image encoder output, extracting a fixed length summary that the frozen LLM can ingest. The Q-Former trades some expressiveness for very low cost: only the Q-Former parameters (around 188M) are trained.[11] Its successors in InstructBLIP and X-LLM keep this idea.

## Evaluation benchmarks

The benchmark landscape mirrors the task families described above.

| Benchmark | Released | Focus | Metric |
|---|---|---|---|
| Flickr30k | 2014 | Captioning | BLEU, CIDEr |
| COCO Captions | 2015 | Captioning | BLEU, METEOR, CIDEr, SPICE |
| VQAv2 | 2017 | General VQA | Accuracy[5] |
| GQA | 2019 | Compositional VQA | Accuracy[6] |
| OK-VQA | 2019 | VQA requiring external knowledge | Accuracy[7] |
| TextVQA | 2019 | VQA over images with text | Accuracy |
| DocVQA | 2020 | Document VQA | ANLS[51] |
| ChartQA | 2022 | Charts and plots VQA | Relaxed accuracy[52] |
| InfographicVQA | 2022 | Infographics VQA | ANLS |
| ScienceQA | 2022 | K-12 science with images | Accuracy |
| POPE | 2023 | Object hallucination | F1[50] |
| MM-Vet | 2023 | Integrated capabilities | Open ended GPT-4 grading[49] |
| MMBench | 2023 | Multiple choice multimodal | Accuracy |
| SEED-Bench | 2023 | Multiple choice multimodal | Accuracy |
| LLaVA-Bench (In-the-Wild) | 2023 | Open ended VQA | GPT-4 grading |
| [MMMU](/wiki/mmmu) | Nov 2023 | College multi discipline | Accuracy[47] |
| [MathVista](/wiki/mathvista) | Oct 2023 | Mathematical visual reasoning | Accuracy[48] |
| RealWorldQA | Apr 2024 | Spatial real world questions | Accuracy[53] |
| VisualWebArena | 2024 | Web browsing agents | Task success |
| MMMU-Pro | Sep 2024 | Harder MMMU | Accuracy |
| MathVerse | 2024 | Diagram heavy math | Accuracy |

Key landmarks:

- **VQAv2** (Goyal et al. 2017) rebalanced the original VQA dataset to remove language priors and is the closest thing to a universal sanity check.[5]
- **GQA** (Hudson and Manning 2019) used scene graphs from Visual Genome to generate compositional questions that test reasoning over object attributes and relations.[6]
- **OK-VQA** (Marino et al. 2019) requires external world knowledge to answer.[7]
- **MMMU** (Yue et al., November 2023) introduced college level multi discipline questions covering art, business, science, health, humanities, and engineering, with around 11,500 questions and explicit image grounding.[47] As of early 2025, GPT-4o, Gemini 2.0, Claude 3.5 Sonnet, and InternVL 2.5 78B all report scores in the high 60s to mid 70s on the validation set.
- **MathVista** (Lu et al., October 2023) combines algebra, geometry, statistics, and scientific figure questions.[48] It surfaced large gaps between models on visual mathematical reasoning.
- **MM-Vet** (Yu et al. 2023) uses GPT-4 as a judge to grade integrated capabilities (recognition, OCR, knowledge, language generation, spatial reasoning, math).[49]
- **RealWorldQA**, released by xAI in April 2024 with Grok 1.5V, focuses on physical world spatial questions photographed from cars and rooms.[53]

Leaderboards aggregated by OpenCompass, the Hugging Face Open VLM Leaderboard, and lmms-eval track most of these benchmarks across hundreds of open and closed models.[55][56]

## Open source ecosystem

A mature stack supports running and fine tuning open VLMs.

- **[Hugging Face Transformers](/wiki/transformers_library)** ships first class support for most open VLMs via `AutoModelForVision2Seq`, `LlavaForConditionalGeneration`, `Qwen2VLForConditionalGeneration`, `MllamaForConditionalGeneration` (Llama 3.2 Vision), `Idefics3ForConditionalGeneration`, `PixtralForConditionalGeneration`, and similar classes. The Hugging Face Hub hosts public weights for thousands of fine tuned VLM variants.
- **[vLLM](/wiki/vllm)** added vision model support in 2024 and now serves LLaVA, Qwen-VL, Llama 3.2 Vision, Pixtral, InternVL, Phi-3-Vision, MiniCPM-V, and most other major open VLMs via OpenAI compatible APIs.
- **SGLang** by the LMSYS group supports VLMs with structured generation primitives.
- **[Ollama](/wiki/ollama)** packages open VLMs (LLaVA, BakLLaVA, MiniCPM-V, Llama 3.2 Vision, Qwen2.5-VL, Granite Vision) as local quantized GGUF builds runnable on a laptop.
- **LM Studio** offers a desktop GUI for the same family of models.
- **MLX-VLM** runs VLMs natively on Apple Silicon via the MLX framework.
- **llama.cpp** supports LLaVA style models via the `mmproj` adapter file pattern.
- **lmms-eval** is the de facto open evaluation harness for VLMs.

Datasets used for instruction tuning of open VLMs include LLaVA-Instruct-150K, ShareGPT4V, the Cauldron (used for Idefics2), Cambrian-10M (used for Cambrian-1), PixMo (used for Molmo), the Idefics3 training mix, and various synthetic OCR and chart datasets generated by Qwen-VL and InternVL teams.

## Applications

- **Accessibility.** Generating alt text for images, describing scenes for blind and low vision users (Be My Eyes integrated GPT-4V in 2023 for this purpose).[31]
- **Document workflows.** Receipt and invoice extraction, contract analysis, scanned form digitization, slide and PDF parsing in tools such as Anthropic Claude's document features and Google's Document AI.
- **Coding from screenshots.** Tools such as v0 by Vercel and Cursor accept screenshots and generate matching HTML, CSS, and React code.
- **Robotics and embodied AI.** [Vision language action models](/wiki/vision-language-action_model) build on VLMs to produce robot motor commands. RT-2 (Google DeepMind 2023), [PaLM-E](/wiki/palm-e_an_embodied_multimodal_language_model), and OpenVLA all use a VLM backbone.
- **Web and GUI agents.** Anthropic's Claude with computer use (October 2024), OpenAI's Operator (January 2025), and Google's Project Astra read screen pixels to control computers and phones.
- **Content moderation.** Automated flagging of unsafe images using textual policy descriptions.
- **E-commerce.** Product attribute extraction from listing photos, visual search, automatic product description writing.
- **Medical imaging.** Med-PaLM M (Google 2023), LLaVA-Med (Microsoft 2023), and CheXagent (Stanford 2024) generate radiology reports from chest X-rays.
- **Education.** Math homework helpers, science diagram tutors, and visual chain of thought tools.
- **Search.** Google Lens, Bing Visual Search, and Perplexity's image search use VLM components.

## Limitations

- **Hallucinated objects.** Open VLMs often describe objects that are not in the image, especially when the prompt suggests certain objects. The POPE benchmark (Li et al. 2023) specifically measures this failure mode.[50]
- **Counting.** Even frontier models miscount objects past four or five, and struggle with crowd estimation.
- **Fine grained text in images.** OCR quality drops sharply at small font sizes or rotated text; specialized OCR models still outperform general VLMs on dense documents.
- **Spatial reasoning.** Locations, distances, and 3D relationships are hard. RealWorldQA highlighted this in 2024.[53]
- **Charts with subtle numeric differences.** Chart VQA degrades when models must read exact values rather than relative trends.
- **Bias from web pretraining.** CLIP and successors inherited biases from web image text pairs, including gender and racial associations with occupations and demographic labels.[8]
- **Resolution constraints.** Most VLMs internally downsample images to a fixed grid (often 336 by 336 or 448 by 448 pixels per tile), which limits the readable detail. Dynamic resolution methods (Qwen2-VL, InternVL 1.5) partially address this.[24][27]
- **Adversarial images.** Imperceptible pixel perturbations can flip model outputs.
- **Visual prompt injection.** Text instructions embedded inside an image can override the user's prompt, a security risk for any agent that reads screenshots.
- **Privacy.** A photograph contains far more identifying detail than the user usually realizes (timestamps in EXIF data, license plates, signage, faces). VLMs can extract all of it.

## See also

- [Multimodal Model](/wiki/multimodal_model)
- [Vision language model](/wiki/vision_language_model)
- [Large Language Model](/wiki/large_language_model)
- [CLIP (Contrastive Language-Image Pre-training)](/wiki/clip)
- [LLaVA (Large Language and Vision Assistant)](/wiki/llava)
- [InternVL](/wiki/internvl)
- [Qwen](/wiki/qwen)
- [Molmo](/wiki/molmo)
- [Florence-2](/wiki/florence_2)
- [Pixtral](/wiki/pixtral)
- [Llama 3.2](/wiki/llama_3_2)
- [Gemini (language model)](/wiki/gemini)
- [GPT-4](/wiki/gpt-4)
- [Vision Transformer (ViT)](/wiki/vision_transformer_vit)
- [MMMU](/wiki/mmmu)
- [MathVista](/wiki/mathvista)
- [COCO dataset](/wiki/coco_dataset)
- [OCR Models](/wiki/ocr_models)
- [Computer Vision Models](/wiki/computer_vision_models)
- [Vision-language-action model](/wiki/vision-language-action_model)
- [PaLM-E](/wiki/palm-e_an_embodied_multimodal_language_model)
- [Mistral OCR 3](/wiki/mistral_ocr_3)
- [DeepSeek-OCR](/wiki/deepseek-ocr)
- [Hugging Face Transformers](/wiki/transformers_library)
- [vLLM](/wiki/vllm)
- [Ollama](/wiki/ollama)

## References

1. Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). "Show and Tell: A Neural Image Caption Generator." CVPR 2015. arXiv:1411.4555.
2. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015. arXiv:1502.03044.
3. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering." CVPR 2018. arXiv:1707.07998.
4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). "VQA: Visual Question Answering." ICCV 2015. arXiv:1505.00468.
5. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). "Making the V in VQA Matter." CVPR 2017. arXiv:1612.00837.
6. Hudson, D. A., & Manning, C. D. (2019). "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering." CVPR 2019. arXiv:1902.09506.
7. Marino, K., Rastegari, M., Farhadi, A., & Mottaghi, R. (2019). "OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge." CVPR 2019.
8. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). arXiv:2103.00020.
9. Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., et al. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" (ALIGN). ICML 2021. arXiv:2102.05918.
10. Li, J., Li, D., Xiong, C., & Hoi, S. (2022). "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." arXiv:2201.12086.
11. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." arXiv:2301.12597.
12. Alayrac, J. B., Donahue, J., Luc, P., Miech, A., et al. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." arXiv:2204.14198.
13. Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., & Wang, L. (2022). "GIT: A Generative Image-to-text Transformer for Vision and Language." arXiv:2205.14100.
14. Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., et al. (2022). "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework." arXiv:2202.03052.
15. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning" (LLaVA). NeurIPS 2023. arXiv:2304.08485.
16. Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). "Improved Baselines with Visual Instruction Tuning" (LLaVA-1.5). arXiv:2310.03744.
17. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge." Blog post, January 30, 2024.
18. Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., et al. (2024). "LLaVA-OneVision: Easy Visual Task Transfer." arXiv:2408.03326.
19. Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models." arXiv:2304.10592.
20. Hugging Face team (2023). "Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model." Hugging Face blog, August 22, 2023.
21. Laurençon, H., Tronchon, L., Cord, M., & Sanh, V. (2024). "What matters when building vision-language models?" (Idefics2). arXiv:2405.02246.
22. Laurençon, H., Marafioti, A., Sanh, V., & Tronchon, L. (2024). "Building and better understanding vision-language models: insights and future directions" (Idefics3). arXiv:2408.12637.
23. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966.
24. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., et al. (2024). "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191.
25. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., et al. (2025). "Qwen2.5-VL Technical Report." arXiv:2502.13923.
26. Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., et al. (2023). "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." arXiv:2312.14238.
27. Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., et al. (2024). "How Far Are We to GPT-4V? Closing the Gap with Commercial Multimodal Models with Open-Source Suites" (InternVL 1.5). arXiv:2404.16821.
28. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., et al. (2024). "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling" (InternVL 2.5). arXiv:2412.05271.
29. Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., et al. (2024). "Phi-3 Technical Report." arXiv:2404.14219.
30. Meta AI (2024). "Llama 3.2: Revolutionizing edge AI and vision with open, customizable models." Meta AI blog, September 25, 2024.
31. OpenAI (2023). "GPT-4V(ision) System Card." OpenAI, September 25, 2023.
32. OpenAI (2024). "Hello GPT-4o." OpenAI blog, May 13, 2024.
33. Gemini Team, Google (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805.
34. Gemini Team, Google (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530.
35. Mistral AI (2024). "Pixtral 12B." Mistral AI blog, September 11, 2024.
36. Mistral AI (2024). "Pixtral Large." Mistral AI blog, November 18, 2024.
37. Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., et al. (2024). "MiniCPM-V: A GPT-4V Level MLLM on Your Phone." arXiv:2408.01800.
38. Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., et al. (2024). "DeepSeek-VL: Towards Real-World Vision-Language Understanding." arXiv:2403.05525.
39. Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., et al. (2024). "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding." arXiv:2412.10302.
40. Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., et al. (2024). "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models." arXiv:2409.17146.
41. Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S. C., et al. (2024). "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs." arXiv:2406.16860.
42. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., et al. (2022). "OCR-free Document Understanding Transformer" (Donut). ECCV 2022. arXiv:2111.15664.
43. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." KDD 2020. arXiv:1912.13318.
44. Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking." arXiv:2204.08387.
45. Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., et al. (2024). "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model" (GOT-OCR2.0). arXiv:2409.01704.
46. Poznanski, J., Borchardt, J., Dunkelberger, J., Tran, R., Lochab, R., Hossain, B., et al. (2025). "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models." arXiv:2502.18443.
47. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., et al. (2023). "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI." arXiv:2311.16502.
48. Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., et al. (2023). "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts." arXiv:2310.02255.
49. Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., & Wang, L. (2023). "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities." arXiv:2308.02490.
50. Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., & Wen, J. R. (2023). "Evaluating Object Hallucination in Large Vision-Language Models" (POPE). arXiv:2305.10355.
51. Mathew, M., Karatzas, D., & Jawahar, C. V. (2021). "DocVQA: A Dataset for VQA on Document Images." WACV 2021. arXiv:2007.00398.
52. Masry, A., Long, D. X., Tan, J. Q., Joty, S., & Hoque, E. (2022). "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning." ACL 2022. arXiv:2203.10244.
53. xAI (2024). "Grok-1.5 Vision Preview." xAI blog, April 12, 2024 (includes RealWorldQA release).
54. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). "Microsoft COCO: Common Objects in Context." ECCV 2014. arXiv:1405.0312.
55. Hugging Face. "Open VLM Leaderboard." Hugging Face Spaces.
56. OpenCompass. "OpenCompass Multi-modal Leaderboard."

[^kosmos]: Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., & Wei, F. (2023). "Kosmos-2: Grounding Multimodal Large Language Models to the World." arXiv:2306.14824. https://arxiv.org/abs/2306.14824 (Kosmos-1: Huang et al., "Language Is Not All You Need," arXiv:2302.14045). Accessed 2026-05-31.
[^pali3]: Chen, X., Wang, X., Beyer, L., Kolesnikov, A., Wu, J., Voigtlaender, P., et al. (2023). "PaLI-3 Vision Language Models: Smaller, Faster, Stronger." arXiv:2310.09199. https://arxiv.org/abs/2310.09199 (PaLI-X: Chen et al., arXiv:2305.18565). Accessed 2026-05-31.
[^paligemma]: Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., et al. (2024). "PaliGemma: A versatile 3B VLM for transfer." arXiv:2407.07726. https://arxiv.org/abs/2407.07726 Accessed 2026-05-31.
[^paligemma2]: Steiner, A., Susano Pinto, A., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., et al. (2024). "PaliGemma 2: A Family of Versatile VLMs for Transfer." arXiv:2412.03555. https://arxiv.org/abs/2412.03555 Accessed 2026-05-31.
[^qwen25vl]: Alibaba Qwen Team (2025). "Qwen2.5-VL Technical Report." arXiv:2502.13923; "Qwen2.5-VL-32B-Instruct" released March 24, 2025. https://arxiv.org/abs/2502.13923 Accessed 2026-05-31.
[^qwen3vl]: Alibaba Qwen Team (2025). "Qwen3-VL Technical Report." arXiv:2511.21631; Qwen3-VL repository. https://github.com/QwenLM/Qwen3-VL Accessed 2026-05-31.
[^internvl3]: Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., et al. (2025). "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models." arXiv:2504.10479. https://arxiv.org/abs/2504.10479 Accessed 2026-05-31.
[^gemma3]: Gemma Team, Google DeepMind (2025). "Gemma 3 Technical Report." arXiv:2503.19786; announced March 12, 2025. https://arxiv.org/abs/2503.19786 Accessed 2026-05-31.
[^llama4]: Meta AI (2025). "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ Accessed 2026-05-31.
[^kimivl]: Kimi Team, Moonshot AI (2025). "Kimi-VL Technical Report." arXiv:2504.07491. https://arxiv.org/abs/2504.07491 Accessed 2026-05-31.
[^mistralsmall31]: Mistral AI (2025). "Mistral Small 3.1." Mistral AI news, March 17, 2025. https://mistral.ai/news/mistral-small-3-1/ Accessed 2026-05-31.
[^gpt5]: OpenAI (2025). "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/ Accessed 2026-05-31.
[^claude41]: Anthropic (2025). "Claude Opus 4.1." Anthropic news, August 3, 2025. https://www.anthropic.com/news/claude-opus-4-1 Accessed 2026-05-31.