Image-to-Text Models
Last reviewed
May 13, 2026
Sources
56 citations
Review status
Source-backed
Revision
v2 · 5,881 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 13, 2026
Sources
56 citations
Review status
Source-backed
Revision
v2 · 5,881 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Multimodal Models and Tasks
Image-to-text models are machine learning systems that take an image as input and produce natural language text as output. The category covers image captioning (writing a sentence that describes what is in a picture), visual question answering (answering a question about an image), document understanding (reading scanned forms, receipts, or charts), OCR (transcribing printed or handwritten text), and the broader class of vision language models (VLMs) that combine an image encoder with a large language model to support open-ended visual reasoning. Modern systems such as GPT-4 with vision, Gemini, LLaVA, Qwen-VL, and Pixtral treat images as just another token stream that the language model can attend to alongside text.
An image-to-text model accepts pixel data (one image, several images, or a video frame sequence) and emits a text string. The task family includes:
A single modern VLM typically handles all of these tasks at once. The same checkpoint that captions a photo can also read a receipt, answer a question about a chart, and explain a meme. This consolidation is recent. Before 2022, image captioning models, VQA models, and OCR systems were usually separate research lines with their own architectures.
Early image captioning relied on hand crafted pipelines: detect objects, retrieve sentences from a fixed corpus, and stitch them together with templates. The Flickr8k and Flickr30k datasets, released by Hodosh, Young, and Hockenmaier in 2013 and 2014, were the standard benchmarks. The MS COCO Captions dataset (Chen et al. 2015) expanded the field with around 330,000 images, each paired with five human written captions, and remains the most cited captioning benchmark.
The first widely cited deep learning captioning model was "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google (2015). It paired a convolutional neural network image encoder (Inception) with a long short term memory (LSTM) decoder.
A few months later, "Show, Attend and Tell" by Kelvin Xu and collaborators (2015) added soft and hard visual attention so the decoder could look at different image regions for different output words. The bottom up top down attention model from Peter Anderson et al. (2018) replaced grid features with region features from Faster R-CNN and was the basis for several VQA Challenge winners.
VQA itself became a benchmark with the VQA dataset (Antol et al. 2015) and its rebalanced successor VQAv2 (Goyal et al. 2017). Models in this era used custom fusion modules such as MUTAN, BAN, and MCAN to combine image and question features.
CLIP (Contrastive Language Image Pretraining), released by OpenAI in January 2021 (Radford et al.), trained an image encoder and a text encoder jointly on 400 million image text pairs scraped from the web. The model learned to match images and captions in a shared embedding space. CLIP itself does not generate captions, but its image encoder became the standard vision backbone for most later generative VLMs (LLaVA, BLIP-2, MiniGPT-4, and others all use a CLIP ViT-L/14 or SigLIP variant). ALIGN, published by Chao Jia and collaborators at Google in 2021, scaled the same contrastive recipe to roughly 1.8 billion noisy pairs.
In early 2022 Salesforce Research released BLIP (Bootstrapping Language Image Pre-training) by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP introduced a unified vision language pretraining objective and a captioning filter that bootstrapped its own training data. The successor BLIP-2 (January 2023) was more influential. It froze a vision encoder (typically a CLIP ViT) and froze an LLM (OPT or FLAN-T5), then trained a small lightweight Querying Transformer (Q-Former) to bridge the two. With only the Q-Former being trainable, BLIP-2 reached strong VQA and captioning numbers at a fraction of the training cost of end to end models.
DeepMind's Flamingo (Alayrac et al., April 2022) took a different route. A frozen Chinchilla LLM was extended with new gated cross attention layers that attended to features from a frozen NFNet image encoder. Flamingo was the first VLM to support interleaved image text prompts with strong few shot performance.
Other 2022 entries included Microsoft's GIT (Generative Image to Text) by Jianfeng Wang and collaborators (May 2022), Alibaba's OFA (Wang et al. 2022) which framed every vision language task as text to text, and Google's PaLI (Chen et al. 2022).
The model that opened the floodgates was LLaVA (Large Language and Vision Assistant) by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee at the University of Wisconsin and Microsoft Research, first posted to arXiv in April 2023. LLaVA combined a CLIP ViT-L/14 image encoder with the Vicuna LLM via a simple linear projection layer. Crucially, the authors used GPT-4 to generate 158,000 multimodal instruction following examples from COCO captions and bounding box annotations, then fine tuned the model on this synthetic instruction data. The recipe was simple, the code was open, the results were good, and within a few months dozens of similar systems appeared.
LLaVA-1.5 (October 2023) replaced the linear projector with a two layer MLP, added academic VQA datasets to the training mix, and reached state of the art numbers among open 7B and 13B models. LLaVA-NeXT (also called LLaVA-1.6, January 2024) added higher input resolution via image tiling. LLaVA-OneVision (August 2024) extended the family to single image, multi image, and video inputs in one checkpoint.
MinGPT-4 by Deyao Zhu and collaborators (April 2023) followed a similar recipe with a Q-Former plug to Vicuna. MiniGPT-v2 (October 2023) generalized it to multiple visual tasks.
Hugging Face released Idefics (Image aware Decoder Enhanced a la Flamingo) in August 2023, an open reproduction of Flamingo in 9B and 80B sizes. Idefics2 (April 2024) replaced the cross attention design with a more modern "perceiver resampler plus LLM" stack and pushed open benchmark scores significantly higher. Idefics3 (August 2024) reached parity with much larger closed models on document and chart benchmarks.
Alibaba's Qwen-VL (Bai et al., August 2023) and its instruction tuned variant Qwen-VL-Chat were the first credibly bilingual (English and Chinese) open VLMs. Qwen2-VL (August 2024) introduced dynamic resolution, allowing the model to process images of varying sizes natively, and added long video support. Qwen2.5-VL (January 2025) extended the lineup further with stronger document parsing and a 72B variant that competed with frontier closed models.
InternVL from the Shanghai AI Lab and OpenGVLab (Chen et al., December 2023) scaled the vision encoder itself to 6 billion parameters, far larger than the ViT-L/14 used by most VLMs. InternVL 1.5 (April 2024) added native dynamic resolution and a 26B parameter combined model. InternVL 2 (July 2024) introduced progressive alignment training. InternVL 2.5 (December 2024) became one of the first open models to surpass 70% on the MMMU benchmark.
Microsoft released Phi-3-Vision (4.2 billion parameters) in May 2024, the multimodal version of its Phi-3 small language model family. Phi-3.5-Vision (August 2024) added multi image support. The Phi-Vision models were aimed at on device and edge use cases.
OpenBMB and Tsinghua released the MiniCPM-V family. MiniCPM-Llama3-V 2.5 (May 2024) packed a Llama 3 8B LLM with a SigLIP vision encoder into a model that ran on a phone. MiniCPM-V 2.6 (August 2024) added single image, multi image, and video understanding in one 8B model. The line later extended to MiniCPM-o for on device multimodal interaction.
DeepSeek released DeepSeek-VL in March 2024 and DeepSeek-VL2 in December 2024. The VL2 release used a mixture of experts (MoE) language model backbone, with the largest active expert configuration around 4.5B active parameters.
Mistral AI's first multimodal release, Pixtral 12B, appeared on September 11, 2024. Pixtral introduced a 400M parameter vision encoder trained from scratch and used a Mistral Nemo 12B language backbone. Pixtral Large (November 18, 2024) scaled the same architecture with a Mistral Large 2 backbone to roughly 124B parameters.
Meta released Llama 3.2 11B Vision and Llama 3.2 90B Vision on September 25, 2024, alongside the smaller text only Llama 3.2 1B and 3B models. The vision variants used cross attention adapters in the style of Flamingo rather than a projector and concat design, and were the first Llama models with native image input.
The Allen Institute for AI released Molmo (Multimodal Open Language Model) by Matt Deitke, Christopher Clark, and collaborators in September 2024. Molmo was trained on a fully open new dataset called PixMo that included spoken caption transcripts collected from human annotators. The 72B model reached numbers competitive with proprietary systems while keeping the entire data, weights, and code open. NYU and Meta's Cambrian-1, led by Yann LeCun's group (June 2024), focused on vision centric design choices and benchmarked many vision backbones in one study.
Other notable 2024 open releases included 01.AI's Yi-VL (January 2024), Kuaishou's KwaiVGI, Shanghai AI Lab's InternLM-XComposer line, Tencent's HunyuanVL, and ByteDance's CogVLM and CogVLM2.
OpenAI added vision input to GPT-4 with the GPT-4V release on September 25, 2023, and rolled it out to ChatGPT Plus users a few weeks later. GPT-4o ("omni"), released May 13, 2024, was OpenAI's first natively multimodal model, trained end to end on text, image, and audio tokens in one network. GPT-4o was followed by GPT-4o-mini (July 2024) and o1 with vision (December 2024).
Google introduced Gemini 1.0 Ultra, Pro, and Nano in December 2023, with Gemini 1.0 Pro Vision being the first widely available variant. Gemini was described as natively multimodal from the start. Gemini 1.5 Pro (February 2024) launched with a 1 million token context window, allowing it to process around an hour of video or 11 hours of audio in a single prompt. Gemini 2.0 Flash (December 2024) added native image generation as well as native image understanding. The Gemini 2.5 family (March to June 2025) added thinking modes and 2 million token contexts.
Anthropic's Claude 3 family (March 2024) added image input to the Opus, Sonnet, and Haiku tiers. Claude 3.5 Sonnet (June 2024), Claude 3.7 Sonnet (February 2025), and Claude 4 (May 2025) all retained vision. xAI's Grok 1.5V (April 2024) and later Grok 2 with vision introduced the RealWorldQA benchmark.
| Model | Released | Org | Size | Vision encoder | Language backbone | Notes |
|---|---|---|---|---|---|---|
| BLIP | Jan 2022 | Salesforce | 224M to 446M | ViT-B/L | BERT-large | Bootstrapped caption filtering |
| Flamingo | Apr 2022 | DeepMind | 80B | NFNet-F6 | Chinchilla 70B | Closed, but defined cross attention design |
| BLIP-2 | Jan 2023 | Salesforce | 188M Q-Former | ViT-g/14 | OPT, FLAN-T5 | Q-Former bridge to frozen LLM |
| MiniGPT-4 | Apr 2023 | KAUST | 13B | ViT-g/14 | Vicuna | Open Q-Former plus Vicuna |
| LLaVA-1.0 | Apr 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14 | Vicuna | GPT-4 generated visual instructions |
| InstructBLIP | May 2023 | Salesforce | 4B to 13B | ViT-g/14 | Vicuna, FLAN-T5 | Instruction tuned BLIP-2 |
| IDEFICS | Aug 2023 | Hugging Face | 9B, 80B | OpenCLIP H | LLaMA | Open reproduction of Flamingo |
| Qwen-VL | Aug 2023 | Alibaba | 9.6B | OpenCLIP-bigG | Qwen-7B | First strong bilingual open VLM |
| CogVLM | Sep 2023 | Zhipu, Tsinghua | 17B | EVA-2-CLIP-E | Vicuna | Visual expert attention modules |
| LLaVA-1.5 | Oct 2023 | UW, Microsoft | 7B, 13B | CLIP ViT-L/14-336 | Vicuna 1.5 | MLP projector, academic VQA mix |
| Fuyu-8B | Oct 2023 | Adept | 8B | none, raw patches | Persimmon | No separate encoder |
| Yi-VL | Jan 2024 | 01.AI | 6B, 34B | CLIP ViT-H | Yi | Bilingual |
| LLaVA-NeXT (1.6) | Jan 2024 | UW, Microsoft | 7B, 13B, 34B | CLIP ViT-L/14 | Vicuna, Mistral, Nous-Hermes | Higher resolution via tiling |
| MoE-LLaVA | Jan 2024 | Peking U | 3B active | CLIP | Phi-2, StableLM | Sparse MoE |
| DeepSeek-VL | Mar 2024 | DeepSeek | 1.3B, 7B | SigLIP-L | DeepSeek LLM | Hybrid SAM and SigLIP encoder |
| Idefics2 | Apr 2024 | Hugging Face | 8B | SigLIP-SO400M | Mistral 7B | Replaced cross attn with perceiver |
| InternVL 1.5 | Apr 2024 | Shanghai AI Lab | 26B | InternViT-6B | InternLM2-20B | Dynamic resolution |
| Phi-3-Vision | May 2024 | Microsoft | 4.2B | CLIP ViT-L/14 | Phi-3-mini | Small device focused |
| MiniCPM-Llama3-V 2.5 | May 2024 | OpenBMB | 8B | SigLIP-400M | Llama 3 8B | Runs on phones |
| Chameleon | Jun 2024 | Meta FAIR | 7B, 34B | quantized image tokens | Llama-style | Early fusion, mixed modality output |
| Cambrian-1 | Jun 2024 | NYU, Meta | 8B to 34B | SVA spatial vision aggregator | Llama 3, Hermes | Vision centric design study |
| InternVL 2 | Jul 2024 | Shanghai AI Lab | 1B to 76B | InternViT-300M to 6B | Various | Progressive alignment |
| Pixtral 12B | Sep 2024 | Mistral | 12B | 400M from scratch | Mistral Nemo | First Mistral multimodal |
| Llama 3.2 Vision | Sep 2024 | Meta | 11B, 90B | self trained vision adapter | Llama 3.1 | Cross attention adapter |
| Molmo | Sep 2024 | Allen AI | 1B to 72B | OpenAI CLIP, SigLIP | OLMo, Qwen2 | PixMo dataset, fully open |
| Pixtral Large | Nov 2024 | Mistral | 124B | 1B Pixtral-L vision | Mistral Large 2 | Frontier scale open weights |
| InternVL 2.5 | Dec 2024 | Shanghai AI Lab | 1B to 78B | InternViT-300M to 6B | InternLM 2.5 | First open MMMU 70 |
| DeepSeek-VL2 | Dec 2024 | DeepSeek | 4.5B active MoE | SigLIP | DeepSeek-MoE | MoE backbone |
| Qwen2.5-VL | Jan 2025 | Alibaba | 3B, 7B, 72B | dynamic ViT | Qwen2.5 | Strong document parsing |
| Model | Released | Org | Notes |
|---|---|---|---|
| GPT-4V | Sep 25, 2023 | OpenAI | First closed VLM at frontier scale, rolled out via ChatGPT Plus |
| Gemini 1.0 Pro Vision | Dec 2023 | Marketed as natively multimodal from training | |
| Claude 3 Opus, Sonnet, Haiku | Mar 2024 | Anthropic | Image input added to entire Claude line |
| Gemini 1.5 Pro | Feb 2024 | 1M token context with video and audio | |
| Reka Core, Flash, Edge | Apr 2024 | Reka AI | Native multimodal trained from scratch |
| GPT-4o | May 13, 2024 | OpenAI | End to end text, image, and audio tokens |
| Claude 3.5 Sonnet | Jun 2024 | Anthropic | Strong chart and document VQA |
| Grok 1.5V | Apr 2024 | xAI | RealWorldQA benchmark released alongside |
| GPT-4o mini | Jul 2024 | OpenAI | Cheap vision tier |
| o1 | Dec 2024 | OpenAI | Reasoning model with vision |
| Gemini 2.0 Flash | Dec 2024 | Native image generation plus understanding | |
| Claude 3.7 Sonnet | Feb 2025 | Anthropic | Extended thinking, image input |
| Gemini 2.5 Pro | Mar 2025 | Thinking model, 2M context |
While general VLMs handle OCR as a side capability, several models target text recognition and document parsing as the primary task.
| Model | Released | Org | Approach |
|---|---|---|---|
| LayoutLM | 2019 | Microsoft | Joint text plus layout pretraining for forms |
| LayoutLMv2 | 2020 | Microsoft | Adds image features |
| LayoutLMv3 | 2022 | Microsoft | Unified text and image masking |
| Donut | 2022 | NAVER Clova | OCR free document understanding via end to end transformer |
| Pix2Struct | 2022 | Screenshot to structured text pretraining | |
| Nougat | 2023 | Meta | Academic PDF to markdown, math friendly |
| GOT-OCR2.0 | 2024 | Stepfun, USTC | Unified text, formula, chart, music score recognition |
| Florence-2 | 2024 | Microsoft | Unified detection, segmentation, captioning, OCR |
| Mistral OCR 3 | 2024 | Mistral | API based document OCR |
| olmOCR | Feb 2025 | Allen AI | 7B OCR model, fully open dataset and weights |
| DeepSeek-OCR | 2025 | DeepSeek | Open OCR with mixture of experts backbone |
Donut, published by Geewook Kim and collaborators (ECCV 2022), introduced the OCR free paradigm: instead of running a separate text detector and recognizer, a single transformer reads pixel patches and emits structured output such as JSON. Most modern VLMs follow this end to end approach implicitly.
Three dominant architectures account for most image-to-text models from 2022 onward.
A pretrained vision transformer encodes the image into a sequence of patch embeddings. A small adapter (a linear layer, an MLP, or a Q-Former) projects those embeddings into the language model's hidden size. The projected tokens are concatenated with the text tokens and fed to a standard decoder LLM. This is the dominant open source recipe, used by LLaVA, MiniGPT-4, Qwen-VL, InternVL, MiniCPM-V, Pixtral, and DeepSeek-VL. It is cheap to train because the vision encoder and often the LLM are frozen or only partially tuned.
The LLM is kept entirely frozen and unchanged. New gated cross attention layers are interleaved between the existing transformer blocks. These new layers attend from the text hidden states to a perceived sequence of image features. Because the new layers are gated and start at zero, the base LLM's text only behavior is preserved exactly at initialization. Flamingo, Idefics 1, and Llama 3.2 Vision use this design. It is more parameter efficient at inference but requires modifying the base LLM, which complicates deployment.
Every modality is tokenized at the input. Text uses byte pair encoding. Images are tokenized into discrete codes (Chameleon uses a VQ-VAE tokenizer; Gemini uses a custom image tokenizer). Audio uses a similar discrete code. All token streams are interleaved into a single sequence that a standard transformer processes from scratch. The model can generate any modality as output if it has been trained to predict tokens of that modality. Chameleon (Meta, May 2024), Janus (DeepSeek, October 2024), and Gemini are the public examples. Native multimodal models tend to require much more compute to train but can generate as well as understand multiple modalities with one network.
The Q-Former is a small transformer with a fixed set of learnable query embeddings (32 in BLIP-2). It cross attends to the frozen image encoder output, extracting a fixed length summary that the frozen LLM can ingest. The Q-Former trades some expressiveness for very low cost: only the Q-Former parameters (around 188M) are trained. Its successors in InstructBLIP and X-LLM keep this idea.
The benchmark landscape mirrors the task families described above.
| Benchmark | Released | Focus | Metric |
|---|---|---|---|
| Flickr30k | 2014 | Captioning | BLEU, CIDEr |
| COCO Captions | 2015 | Captioning | BLEU, METEOR, CIDEr, SPICE |
| VQAv2 | 2017 | General VQA | Accuracy |
| GQA | 2019 | Compositional VQA | Accuracy |
| OK-VQA | 2019 | VQA requiring external knowledge | Accuracy |
| TextVQA | 2019 | VQA over images with text | Accuracy |
| DocVQA | 2020 | Document VQA | ANLS |
| ChartQA | 2022 | Charts and plots VQA | Relaxed accuracy |
| InfographicVQA | 2022 | Infographics VQA | ANLS |
| ScienceQA | 2022 | K-12 science with images | Accuracy |
| POPE | 2023 | Object hallucination | F1 |
| MM-Vet | 2023 | Integrated capabilities | Open ended GPT-4 grading |
| MMBench | 2023 | Multiple choice multimodal | Accuracy |
| SEED-Bench | 2023 | Multiple choice multimodal | Accuracy |
| LLaVA-Bench (In-the-Wild) | 2023 | Open ended VQA | GPT-4 grading |
| MMMU | Nov 2023 | College multi discipline | Accuracy |
| MathVista | Oct 2023 | Mathematical visual reasoning | Accuracy |
| RealWorldQA | Apr 2024 | Spatial real world questions | Accuracy |
| VisualWebArena | 2024 | Web browsing agents | Task success |
| MMMU-Pro | Sep 2024 | Harder MMMU | Accuracy |
| MathVerse | 2024 | Diagram heavy math | Accuracy |
Key landmarks:
Leaderboards aggregated by OpenCompass, the Hugging Face Open VLM Leaderboard, and lmms-eval track most of these benchmarks across hundreds of open and closed models.
A mature stack supports running and fine tuning open VLMs.
AutoModelForVision2Seq, LlavaForConditionalGeneration, Qwen2VLForConditionalGeneration, MllamaForConditionalGeneration (Llama 3.2 Vision), Idefics3ForConditionalGeneration, PixtralForConditionalGeneration, and similar classes. The Hugging Face Hub hosts public weights for thousands of fine tuned VLM variants.mmproj adapter file pattern.Datasets used for instruction tuning of open VLMs include LLaVA-Instruct-150K, ShareGPT4V, the Cauldron (used for Idefics2), Cambrian-10M (used for Cambrian-1), PixMo (used for Molmo), the Idefics3 training mix, and various synthetic OCR and chart datasets generated by Qwen-VL and InternVL teams.