Visual Question Answering Models

Updated Jun 27, 2026

See also: Vision Language Model and Multimodal Model

Visual question answering models are AI systems that take an image and a natural language question about that image and return a natural language answer. They combine computer vision and natural language processing so a single model can both see and reason: identifying objects, reading text, counting, judging spatial relations, and applying outside world knowledge to produce an answer. The first generation paired a convolutional neural network image encoder with a recurrent text encoder and predicted from a fixed answer vocabulary; today's systems are general vision language models that connect a vision encoder to a large language model and answer open ended questions in free form text.

Visual Question Answering (VQA) is the task of producing a natural language answer to a natural language question asked about an image. Introduced by Antol et al. at ICCV 2015 ^[1], VQA sits at the intersection of computer vision and natural language processing. It has become one of the standard probes for measuring whether a model can both see and reason. Early VQA systems combined a convolutional neural network image encoder with an LSTM question encoder ^[1]. Modern systems are vision language models: a vision transformer connected to a large language model through a small trainable bridge such as an MLP projector or a Q-Former. The same multimodal architecture that answers questions about a photo also powers chart reading, document analysis, scientific figure interpretation, and general visual dialogue, so VQA has effectively merged into the broader study of multimodal models. The shift began when CLIP style image encoders made it cheap to feed natural language aligned visual features into a language model ^[31].

What is a visual question answering model?

The canonical VQA problem is simple to state. The input is a pair (I, Q) where I is an image and Q is a free form question in natural language. The output is an answer A, also in natural language. As the Antol et al. paper put it, "Given an image and a natural language question about the image, the task is to provide an accurate natural language answer" ^[1]. The paper framed this as an AI complete task because answering correctly can require object recognition, attribute recognition, counting, spatial reasoning, activity recognition, common sense, and outside world knowledge, often combined in a single question ^[1].

VQA has several variants in active use:

Open ended VQA. The model generates a free form answer. Evaluation typically uses the soft accuracy metric defined by Antol et al., which compares a candidate answer against ten human reference answers ^[1]^[63].
Multiple choice VQA. The model picks one of a fixed set of answer options. This format is used by Visual7W, AI2D, MMMU, ScienceQA, and MMBench, because it removes phrasing variability and makes scoring deterministic.
Knowledge based VQA. Questions cannot be answered from the pixels alone. The model must combine perception with stored or retrieved world knowledge. OK-VQA and A-OKVQA are the standard benchmarks for this variant ^[7]^[11].
Compositional VQA. Questions are produced by programs over a scene graph and test multi step reasoning. GQA is the largest example ^[8].
Text in image VQA. Questions require reading text inside the image. Examples include TextVQA, ST-VQA, OCR-VQA, and InfographicVQA ^[6].
Document VQA. Questions are about scanned documents and forms. DocVQA is the canonical benchmark ^[9].
Chart and diagram VQA. ChartQA and AI2D ask questions about plots, tables, and schematic figures ^[12]^[5].
Math in image VQA. MathVista evaluates mathematical reasoning over visual inputs ^[14].
Embodied and video VQA. The agent answers questions about a 3D environment or a video stream. These are usually treated as separate research lines, although models that score well on still image VQA increasingly handle them too.

A typical VQA pipeline has three components. A visual encoder maps the image to a sequence of feature vectors. A text encoder maps the question to a sequence of token embeddings. A fusion or cross modal module combines the two and produces an answer, either by classification over a fixed vocabulary or by autoregressive generation. The history of VQA models is, in large part, the history of how this fusion is performed.

How do VQA models work?

Every VQA model, from the 2015 baselines to the 2025 frontier systems, follows the same three stage template: encode the image, encode the question, and fuse the two to produce an answer. What changed over a decade is the implementation of each stage. The visual encoder moved from frozen VGG-Net features to Faster R-CNN region features and then to vision transformer patch tokens. The question encoder moved from an LSTM to a BERT style Transformer and then to a decoder only large language model. The fusion module moved from a simple element wise product, through attention and bilinear pooling, to heavy cross modal Transformers, and finally to a small adapter (an MLP or a Q-Former) that projects visual features into the embedding space of a frozen or lightly tuned LLM. The output moved from a softmax over a few thousand frequent answers to open ended autoregressive text generation, which is why a modern VQA model is also a chatbot, a captioner, and a document reader. The sections below trace this evolution and then catalogue the models that defined each stage.
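The final stage of that evolution, a small adapter that projects visual features into the LLM embedding space, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any specific model's code: the widths `D_VISION` and `D_LLM`, the random weights, and the helper names are all hypothetical, and a real system would feed `llm_input` to a language model rather than stop there.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def init_linear(rng, d_in, d_out):
    """Random weights and zero bias; stands in for trained parameters."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def mlp_adapter(visual_tokens, layer1, layer2):
    """Two layer MLP that maps each visual token to the LLM embedding width."""
    projected = []
    for token in visual_tokens:
        hidden = [gelu(h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
D_VISION, D_LLM = 8, 12  # toy widths (assumed); real models are far wider

# Stand-ins for the vision encoder output (4 patch tokens) and the
# embedded question (5 text tokens).
visual_tokens = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(4)]
question_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]

layer1 = init_linear(rng, D_VISION, D_LLM)
layer2 = init_linear(rng, D_LLM, D_LLM)
projected = mlp_adapter(visual_tokens, layer1, layer2)

# The LLM consumes one sequence: projected image tokens, then question tokens.
llm_input = projected + question_embeddings
```

Because the adapter only changes the width of each visual token, the vision encoder and the LLM can stay frozen or lightly tuned while the adapter learns the mapping.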

What datasets are used for visual question answering?

The field has accumulated dozens of benchmarks. The table below lists the most influential ones in roughly chronological order.

Dataset	Year	Approximate size	Focus
VQA v1 (Antol et al.)	2015	254K images, 760K questions, 10M answers	Open ended VQA over COCO and abstract scenes
Visual7W (Zhu et al.)	2016	47K images, 328K QA pairs	Grounded multiple choice QA, seven Ws
AI2D (Kembhavi et al.)	2016	4,903 diagrams, 15K questions	Primary school science diagrams
Visual Genome (Krishna et al.)	2017	108K images, 1.7M QA pairs	Dense scene graphs and region QA
VQAv2 (Goyal et al.)	2017	Roughly 2x VQAv1, 1.1M QA pairs	Balanced pairs that reduce language priors
TextVQA (Singh et al.)	2019	28,408 images, 45,336 questions	Questions that require reading scene text
OK-VQA (Marino et al.)	2019	14,055 questions	Open ended QA requiring outside knowledge
GQA (Hudson and Manning)	2019	113K images, 22M questions	Programmatic compositional reasoning
ST-VQA, OCR-VQA	2019	Tens of thousands of questions	Scene text and book cover OCR QA
DocVQA (Mathew et al.)	2021	12K+ document images, 50K questions	Industry documents and forms
ScienceQA (Lu et al.)	2022	21,208 questions	Multimodal multiple choice science with rationales
A-OKVQA (Schwenk et al.)	2022	About 25K questions	World knowledge plus commonsense reasoning
ChartQA (Masry et al.)	2022	9.6K human questions, 23.1K augmented	Charts with visual and logical reasoning
InfographicVQA	2022	5K infographics, 30K questions	Infographic layout and reading
MM-Vet (Yu et al.)	2023	218 questions, 16 capability subsets	Integrated capability evaluation, GPT judged
MMBench (Liu et al.)	2023	2,974 multiple choice questions	Twenty fine grained ability dimensions, bilingual
SEED-Bench (Li et al.)	2023	About 19K multiple choice questions	Twelve image and video understanding dimensions
POPE (Li et al.)	2023	Binary existence probes	Object hallucination
LLaVA-Bench (in the wild)	2023	24 images, 60 prompts	GPT 4 graded open ended chat
MMMU (Yue et al.)	2024	11.5K college level questions	Six disciplines, thirty subjects, expert reasoning
MathVista (Lu et al.)	2024	6,141 examples from 31 datasets	Math reasoning in visual contexts
HallusionBench (Guan et al.)	2024	346 images, 1,129 questions	Language hallucination and visual illusion
RealWorldQA (xAI)	2024	More than 700 images	Spatial reasoning from vehicle and real world scenes
CV-Bench (Cambrian)	2024	About 2,600 questions	Vision centric tasks repurposed for VLMs
MMMU-Pro	2024	Robust subset of MMMU	Shortcut resistant version of MMMU

In 2017, Goyal et al. showed that the original VQA dataset had strong language priors. A model could answer many questions correctly without even looking at the image ^[2]. VQAv2 paired each question with a second image where the answer is different. The balanced version cut the language only baseline accuracy from around 48 percent on the original VQA test set to about 26 percent on VQAv2, while a model that uses the image scored 54 percent ^[2]. Most papers since 2018 report VQAv2 numbers rather than VQAv1.

GQA, released by Drew Hudson and Christopher Manning in 2019, built questions programmatically from Visual Genome scene graphs ^[8]. Each question has a corresponding functional program that captures its semantics. The dataset includes new metrics for consistency, grounding, and plausibility. Human accuracy on GQA is 89.3 percent. Strong VQA models from 2019 reached 54.1 percent ^[8].

OK-VQA and A-OKVQA, both from the Allen Institute for AI, target knowledge that lives outside the image ^[7]^[11]. An OK-VQA question might show a photo of a horse and ask what farm animal eats hay. A-OKVQA adds rationales and pairs each question with multiple choice options and ten free form answer references, encouraging methods that use either retrieval or implicit knowledge from large pretrained models ^[11].

The document, chart, and diagram benchmarks fill a gap that the early COCO based datasets did not cover. DocVQA was constructed from the UCSF industry documents library and the Tobacco corpus ^[9]. ChartQA was scraped from Statista and OurWorldInData and includes both human written questions and questions generated from human written chart summaries ^[12]. AI2D contains 4,903 primary school science diagrams drawn from textbooks, with object segmentations and a Diagram Parse Graph that encodes element relationships ^[5].

MMMU, released by Yue and 21 co authors in late 2023 and presented at CVPR 2024, became the de facto reference benchmark for frontier VLMs ^[13]. It collects 11.5K questions from college exams across six disciplines, including chemical structures, sheet music, anatomy diagrams, and engineering schematics. When MMMU launched, GPT-4V scored 56 percent and Gemini Ultra scored 59 percent, well below human expert performance ^[13].

How did VQA models evolve over time?

Early VQA: CNN plus LSTM with attention

The Antol et al. ICCV 2015 paper that defined the modern task also released the first deep learning baselines ^[1]. The strongest of those used a VGG-Net image encoder, an LSTM question encoder, and an element wise product to fuse them before a softmax over the 1,000 most common answers. The paper showed that humans achieve 83 percent accuracy on the open ended evaluation, while the best baseline reached 58 percent ^[1].
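The fusion step of that baseline is simple enough to show directly. The sketch below assumes the image and question have already been encoded into vectors of equal length; the three entry answer vocabulary, the vectors, and the classifier weights are toy values invented for illustration (the paper used the 1,000 most common answers).

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_question(image_vec, question_vec, classifier_rows, answer_vocab):
    """Fuse by element wise product, then classify over a fixed answer list."""
    fused = [i * q for i, q in zip(image_vec, question_vec)]
    logits = [sum(w * f for w, f in zip(row, fused)) for row in classifier_rows]
    probs = softmax(logits)
    best = max(range(len(answer_vocab)), key=lambda k: probs[k])
    return answer_vocab[best], probs

# Toy stand-ins for the image encoder and LSTM outputs.
answer_vocab = ["yes", "no", "2"]
image_vec = [1.0, 0.0, 2.0]
question_vec = [0.5, 1.0, 1.0]
classifier_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

answer, probs = answer_question(image_vec, question_vec, classifier_rows, answer_vocab)
```

The model can only ever return an entry of `answer_vocab`, which is the structural limit that later generative systems removed.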

Three main lines of research followed. The first focused on attention. Yang et al. introduced Stacked Attention Networks at CVPR 2016 ^[21]. SANs query the image multiple times with the question embedding, producing iterative attention maps that progressively focus on the relevant region. The two layer SAN improved over a plain LSTM Q+I baseline by 4.8 absolute points on VQA ^[21]. A parallel line was Hierarchical Question Image Co-Attention (Lu et al. NeurIPS 2016), which attended jointly to words and image regions.
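The iterative querying idea behind stacked attention can be sketched as follows. This is a simplified illustration, not the published model: the learned projection matrices are replaced by the identity, so each hop scores a region with a single weight vector `w_p` applied to `tanh(region + query)`, and all numbers are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_hop(regions, query, w_p):
    """One attention layer: score regions, pool them, refine the query."""
    scores = [
        sum(w * math.tanh(r + q) for w, r, q in zip(w_p, region, query))
        for region in regions
    ]
    probs = softmax(scores)
    attended = [
        sum(p * region[k] for p, region in zip(probs, regions))
        for k in range(len(query))
    ]
    # The refined query is the attended image vector plus the previous query.
    refined = [a + q for a, q in zip(attended, query)]
    return refined, probs

def stacked_attention(regions, question_vec, hop_weights):
    """Query the image once per hop, feeding each refined query to the next."""
    query, attention_maps = question_vec, []
    for w_p in hop_weights:
        query, probs = attention_hop(regions, query, w_p)
        attention_maps.append(probs)
    return query, attention_maps

regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy region features
question_vec = [2.0, 0.0]
final_query, attention_maps = stacked_attention(
    regions, question_vec, hop_weights=[[1.0, 1.0], [1.0, 1.0]]
)
```

Each entry of `attention_maps` is a distribution over regions, and the second hop conditions on what the first hop attended to.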

The second line was bilinear pooling. Fukui et al. (EMNLP 2016) proposed Multimodal Compact Bilinear pooling ^[22]. MCB approximates the outer product of visual and textual feature vectors by projecting them with random Count-Sketch vectors and multiplying in the frequency domain, capturing richer interactions than concatenation or element wise product. Follow up methods, MUTAN, MFB, and MFH, made bilinear pooling more parameter efficient.
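The Count-Sketch trick can be checked on a toy example. Multiplying two sketches element wise in the frequency domain is equivalent to circularly convolving them, so the sketch below computes the circular convolution directly to stay dependency free; a practical implementation would use an FFT for speed. The dimensions, hash tables, and function names are illustrative assumptions, not the published code.

```python
import random

def make_sketch_params(rng, n, d):
    """Random bucket index and random sign for each of the n input dimensions."""
    return [rng.randrange(d) for _ in range(n)], [rng.choice((-1, 1)) for _ in range(n)]

def count_sketch(x, hashes, signs, d):
    """Project x to d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for value, h, s in zip(x, hashes, signs):
        out[h] += s * value
    return out

def circular_convolution(a, b):
    """Direct O(d^2) form of what the FFT route computes in O(d log d)."""
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]

def mcb(visual, text, params_v, params_t, d):
    """Compact bilinear feature: convolve the two count sketches."""
    return circular_convolution(
        count_sketch(visual, *params_v, d), count_sketch(text, *params_t, d)
    )

def sketch_of_outer_product(visual, text, params_v, params_t, d):
    """Reference: sketch every entry of the full outer product explicitly."""
    (hv, sv), (ht, st) = params_v, params_t
    out = [0.0] * d
    for i, v in enumerate(visual):
        for j, t in enumerate(text):
            out[(hv[i] + ht[j]) % d] += sv[i] * st[j] * v * t
    return out

rng = random.Random(0)
n_v, n_t, d = 6, 5, 4  # toy sizes (assumed)
visual = [rng.gauss(0, 1) for _ in range(n_v)]
text = [rng.gauss(0, 1) for _ in range(n_t)]
params_v = make_sketch_params(rng, n_v, d)
params_t = make_sketch_params(rng, n_t, d)

compact = mcb(visual, text, params_v, params_t, d)
explicit = sketch_of_outer_product(visual, text, params_v, params_t, d)
```

`compact` and `explicit` agree up to floating point error, which is why the outer product never has to be materialized.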

The third line was object centric features. Anderson et al. published Bottom-Up and Top-Down Attention in 2017, the model that won the 2017 VQA Challenge ^[23]. The bottom up component uses Faster R-CNN trained on Visual Genome to propose salient object regions. The top down component computes an attention distribution over those regions conditioned on the question. The same model achieved a then state of the art CIDEr of 117.9 on COCO captioning, demonstrating that region features were a strong general visual representation ^[23].

Transformer fusion era: ViLBERT and LXMERT

In 2019, the success of BERT in language tasks pushed the field toward transformer based cross modal encoders. ViLBERT (Lu, Batra, Parikh, Lee NeurIPS 2019) used a two stream architecture with co attention layers between vision and language ^[25]. LXMERT (Tan and Bansal EMNLP 2019) used three encoders, one for the question, one for image regions, and a cross modality encoder, pretrained on five tasks including masked language modeling, masked region prediction, and image question answering ^[24]. LXMERT raised the best NLVR2 accuracy from 54 percent to 76 percent, a gain of 22 absolute points ^[24].

VL-BERT (Su et al. ICLR 2020) merged vision and language into a single stream Transformer ^[28]. UNITER (Chen et al. ECCV 2020) added Word Region Alignment via optimal transport on top of standard MLM and image text matching ^[26]. OSCAR (Li et al. ECCV 2020) used object tags as anchor points to ease alignment, on the observation that salient objects appear in both modalities and are easy to detect ^[27]. VinVL (Microsoft 2021) showed that a stronger object detector trained on combined detection corpora mattered more than architectural changes ^[29]. ALBEF (Li et al. NeurIPS 2021) and METER refined the recipe, with ALBEF introducing an image text contrastive loss to align features before they are fused and momentum distillation to learn from noisy web pairs ^[30].

These models pushed VQAv2 test-std accuracy from the high 60s to about 76 percent over three years, but they shared a structural limit. They were classifiers that predicted one of a few thousand frequent answers. They could not write a sentence, follow an instruction, or reason in chains.

Contrastive and generative era: CLIP, BLIP, BLIP-2, and Flamingo

CLIP, released by OpenAI in February 2021 (Radford et al.), trained a separate image encoder and text encoder with a contrastive objective on 400 million web image text pairs ^[31]. CLIP itself is not a VQA model, but its image encoder, especially the ViT-L/14 and later ViT-H/14 variants, became the visual front end for almost every subsequent VLM. The reason is simple: CLIP encoders produce features that already align with natural language descriptions, which means downstream multimodal models do not need to learn that alignment from scratch.
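A contrastive objective of this kind can be sketched as a symmetric cross entropy over a similarity matrix. The version below is an illustration under assumptions: embeddings are toy two dimensional vectors, the temperature is fixed at 0.07 rather than learned, and `clip_loss` is a hypothetical helper name.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log softmax."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Pair k of images and texts is the positive; all other pairs are negatives."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs
    ]
    n = len(logits)
    # Image to text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text to image: the same check on the columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = clip_loss(image_embs, aligned_texts)
swapped_loss = clip_loss(image_embs, swapped_texts)
```

The loss is low when matching image and text embeddings point the same way and high when the pairing is swapped, which is the alignment downstream VLMs inherit.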

BLIP (Li et al. ICML 2022, Salesforce) combined a vision encoder with an encoder decoder language head in a Multimodal mixture of Encoder-Decoder design ^[32]. It introduced Captioning and Filtering (CapFilt), which uses a captioner to generate synthetic captions and a filter to remove noisy image text pairs, bootstrapping clean training data from the noisy web. BLIP set new state of the art numbers on image text retrieval, captioning, and VQA ^[32].

BLIP-2 (Li et al. ICML 2023) was the architectural template for almost every subsequent open VLM ^[33]. It froze both a pretrained image encoder (EVA-CLIP ViT-g) and a pretrained LLM (OPT or FlanT5) and inserted a small trainable Querying Transformer (Q-Former) between them. The Q-Former extracts a fixed number of visual query tokens that the LLM can consume directly. The paper reports that BLIP-2 "outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters" ^[33]. The pattern of a frozen encoder, a small adapter, and a frozen or lightly fine tuned LLM has been copied and refined ever since.
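The core property of the Q-Former, a fixed number of output tokens regardless of image size, comes from learned queries that cross attend to the image features. The sketch below shows only that mechanism, with a single attention head and no learned projections; the real module is a full Transformer, and the query values and feature sizes here are invented for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_image(queries, image_feats):
    """Each query attends over all image features and returns one pooled token."""
    scale = math.sqrt(len(queries[0]))
    tokens = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, feat)) / scale for feat in image_feats]
        )
        tokens.append(
            [sum(w * feat[k] for w, feat in zip(weights, image_feats))
             for k in range(len(q))]
        )
    return tokens

rng = random.Random(0)
DIM, NUM_QUERIES = 4, 2  # toy sizes (assumed)

# Stand-in for trained query embeddings.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

small_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]  # 5 patches
large_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]  # 9 patches

tokens_small = query_image(queries, small_image)
tokens_large = query_image(queries, large_image)
```

Both images yield `NUM_QUERIES` tokens, so the cost on the LLM side does not grow with the number of patches.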

Google DeepMind released Flamingo (Alayrac et al. April 2022, NeurIPS 2022) before BLIP-2 ^[34]. Flamingo introduced gated cross attention layers inserted into a frozen Chinchilla LLM and supported arbitrarily interleaved sequences of images and text, which enabled in context few shot prompting. The authors reported that "a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples" ^[34]. The 80B variant set few shot state of the art on most VQA benchmarks of the time. Flamingo was never open sourced, but OpenFlamingo and Idefics later reproduced the recipe.
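The gating idea can be sketched as a residual cross attention whose contribution is scaled by tanh of a learned scalar. The sketch assumes the scalar starts at zero, so the frozen language model's hidden states pass through unchanged at initialization; the single head attention without projections and all values are simplifications for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(text_states, visual_feats):
    """Each text position pools the visual features by dot product attention."""
    scale = math.sqrt(len(text_states[0]))
    pooled = []
    for state in text_states:
        weights = softmax(
            [sum(a * b for a, b in zip(state, feat)) / scale for feat in visual_feats]
        )
        pooled.append(
            [sum(w * feat[k] for w, feat in zip(weights, visual_feats))
             for k in range(len(state))]
        )
    return pooled

def gated_cross_attention(text_states, visual_feats, alpha):
    """Residual update scaled by tanh(alpha); alpha = 0 leaves the text untouched."""
    gate = math.tanh(alpha)
    attended = cross_attend(text_states, visual_feats)
    return [
        [t + gate * a for t, a in zip(state, att)]
        for state, att in zip(text_states, attended)
    ]

text_states = [[0.2, -0.1, 0.4], [1.0, 0.3, -0.5]]
visual_feats = [[0.5, 0.5, 0.5], [-1.0, 2.0, 0.0], [0.1, 0.0, 0.9]]

at_init = gated_cross_attention(text_states, visual_feats, alpha=0.0)
after_training = gated_cross_attention(text_states, visual_feats, alpha=1.0)
```

With `alpha=0.0` the layer is an identity on the text states, so inserting it does not disturb the pretrained language model until training opens the gate.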

LLM augmented VLM era: LLaVA, MiniGPT-4, and InstructBLIP

The modern era of VQA began in April 2023 with two papers a few weeks apart. MiniGPT-4 (Zhu et al. April 2023) aligned a frozen image encoder with a frozen Vicuna LLM through a single projection layer and trained on a small curated image text dataset ^[39]. LLaVA (Liu et al. NeurIPS 2023) followed the same architectural idea, with a CLIP ViT-L/14 vision encoder, a projection matrix, and a Vicuna LLM, but added a key training step ^[35]. The authors used text only GPT-4 to generate 158K multimodal instruction following examples from COCO captions and object bounding boxes. The resulting model reached an 85.1 percent relative score compared with GPT-4 on a synthetic multimodal instruction following benchmark and, after fine tuning on ScienceQA, 92.53 percent accuracy when its predictions were combined with those of GPT-4 ^[35].

LLaVA-1.5 (Liu et al. October 2023) replaced the projection matrix with a two layer MLP, switched to a 336 px CLIP encoder, and added academic VQA datasets to the instruction tuning mix ^[36]. The 13B model achieved state of the art numbers across eleven benchmarks while training in about one day on a single eight GPU node, using 1.2M publicly available examples ^[36]. LLaVA-NeXT, also called LLaVA-1.6, arrived in January 2024 with higher resolution support (up to 672x672, 336x1344, or 1344x336 pixels) and stronger world knowledge from a larger instruction mix ^[37]. LLaVA-OneVision (August 2024) consolidated the design choices from the LLaVA-NeXT blog series into a single recipe that handles images, multi image inputs, and video ^[38].

InstructBLIP (Dai et al. NeurIPS 2023, Salesforce) extended BLIP-2 with an instruction aware Q-Former that takes the user instruction as a side input ^[40]. Of 26 datasets converted to instruction format, 13 were used for training and 13 were held out; InstructBLIP set state of the art zero shot numbers on all thirteen held out benchmarks ^[40].

Alibaba released Qwen-VL in August 2023 with a custom visual receptor and a position aware vision language adapter ^[41]. Qwen2-VL followed on August 30, 2024, with three sizes (2B, 7B, 72B), Naive Dynamic Resolution that maps arbitrary input resolutions to a dynamic number of visual tokens, and Multimodal Rotary Position Embedding (M-RoPE) that decomposes positions into 1D text, 2D image, and 3D video components ^[42]. Qwen2.5-VL, released in January 2025, sharpened the same architecture and led the open weight charts on RealWorldQA and OCRBench through 2025 ^[43].
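The effect of dynamic resolution on sequence length can be estimated with simple arithmetic. The sketch below assumes a 14 pixel patch and a 2x2 merge of adjacent patches into one token, so each visual token covers a 28x28 pixel area; these values, the rounding rule, and the function name are assumptions for illustration, and the sketch ignores any pixel budget limits or special boundary tokens a real preprocessor may add.

```python
def visual_token_count(height, width, patch=14, merge=2):
    """Estimate visual tokens for an image under the assumed patch and merge sizes."""
    unit = patch * merge  # pixels per side covered by one output token
    rows = max(1, round(height / unit))
    cols = max(1, round(width / unit))
    return rows * cols

square = visual_token_count(224, 224)
tall = visual_token_count(448, 224)
banner = visual_token_count(336, 1344)
```

Doubling one side of the image doubles the token count, which is the trade between detail and context length that a dynamic scheme exposes to the user.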

InternVL (Chen et al. CVPR 2024 oral), from Shanghai AI Lab and OpenGVLab, scaled the vision encoder to 6 billion parameters and tied it to an 8B language middleware ^[44]. Subsequent versions InternVL 1.5, 2, and 2.5 closed the gap to closed source frontier models on MMMU. Idefics (August 2023), Idefics2 (April 2024, 8B parameters), and Idefics3 (August 2024) from Hugging Face replicated the Flamingo recipe in open weights and shipped The Cauldron, a consolidated multimodal instruction dataset ^[45].

Microsoft released Phi-3-Vision in May 2024 as a 4.2 billion parameter VLM in the Phi-3 family, optimized for OCR, charts, and diagrams on edge devices ^[46]. Phi-3.5-Vision arrived later in 2024 with multi image support. Meta released Llama 3.2 Vision in 11B and 90B variants on September 25, 2024, using a separately trained vision adapter that integrates with the Llama 3.1 language model ^[47]. DeepSeek released DeepSeek-VL2 on December 13, 2024, a Mixture of Experts VLM with 1.0B, 2.8B, and 4.5B activated parameter variants that scored 93.3 percent on DocVQA ^[48]. Mistral released Pixtral 12B (Pixtral-12B-2409) in September 2024 with a 400M parameter dedicated vision encoder, native resolution input, and a 128K token context window, hitting 52.5 percent on MMMU ^[49].

Google released PaliGemma on May 14, 2024, a deliberately small and versatile 3B model that pairs a SigLIP-So400m vision encoder with a Gemma-2B language model through a linear projection ^[65]. Inspired by PaLI-3, it was designed less as a chatbot than as a strong base model for transfer to VQA, captioning, referring expression segmentation, and document tasks. PaliGemma 2, released on December 5, 2024, swapped in Gemma 2 backbones to produce 3B, 10B, and 28B variants at 224, 448, and 896 pixel resolutions, improving OCR, table, and chart reading ^[66].

The Allen Institute for AI released Molmo on September 24, 2024 alongside the PixMo dataset ^[50]. PixMo includes detailed image captions collected through speech based descriptions and a 2D pointing dataset, which lets Molmo answer with pointing coordinates as well as text. DeepSeek released Janus in October 2024 and Janus-Pro on January 27, 2025. Both use a unified architecture that decouples visual encoding paths for understanding and generation, with SigLIP-L for understanding and a LlamaGen tokenizer for generation ^[53]. OpenBMB and the Tsinghua NLP Lab maintain the MiniCPM-V series, edge friendly VLMs that have repeatedly matched GPT-4o on single image OCR tasks despite using under 10B parameters ^[51].

A recurring theme through 2024 and 2025 was vision centric tuning. The Cambrian-1 paper (Tong et al. NeurIPS 2024 oral) from NYU studied more than 15 visual representations under a unified VLM recipe and introduced the Spatial Vision Aggregator, which integrates high resolution vision features without exploding the token count ^[52]. Cambrian-1 also released CV-Bench, a benchmark constructed from canonical computer vision datasets to test perceptual abilities the existing VQA benchmarks underweight ^[52].

By 2025 the strongest open weight families narrowed the distance to closed frontier models. InternVL3, released by OpenGVLab on April 15, 2025, replaced the earlier "adapt a text LLM" recipe with native multimodal pretraining, learning language and vision jointly in a single stage, and added Variable Visual Position Encoding (V2PE) for longer multimodal contexts ^[67]. The team reported that InternVL3-78B reached 72.2 percent on MMMU, which it described as a new state of the art among open source models at release ^[67]. Alibaba released Qwen3-VL on September 23, 2025 in dense (2B, 4B, 8B, 32B) and Mixture of Experts (30B-A3B, 235B-A22B) sizes under an Apache 2.0 license, each shipped in both an instruction tuned and a reasoning ("Thinking") variant ^[68]. Qwen3-VL natively handles interleaved text, image, and video over a 256K token context and expands OCR to 32 languages. Alibaba reported that the Qwen3-VL-235B Instruct model matches or exceeds Gemini 2.5 Pro on major visual perception benchmarks, while the Thinking variant targets multimodal reasoning tasks ^[68].

What are the main open source VQA models?

Model	Release	Vision encoder	Language model	Trainable bridge
LXMERT	2019	Faster R-CNN	Transformer (scratch)	Cross modality encoder
ViLBERT	2019	Faster R-CNN	BERT base	Co attention
UNITER	2020	Faster R-CNN	BERT base	Single stream Transformer
OSCAR	2020	Faster R-CNN + object tags	BERT	Single stream Transformer
VinVL	2021	Large detector	Various	Single stream Transformer
ALBEF	2021	ViT-B/16	BERT base	Image text contrastive plus matching
BLIP	2022	ViT	BERT decoder	Multimodal mixture of Encoder-Decoder
Flamingo	2022	NFNet	Chinchilla 70B	Gated cross attention, Perceiver Resampler
BLIP-2	2023	EVA-CLIP ViT-g	OPT or FlanT5	Q-Former
LLaVA	2023	CLIP ViT-L/14	Vicuna 7B/13B	Single linear projection
MiniGPT-4	2023	EVA-CLIP ViT-g	Vicuna	Single linear projection
InstructBLIP	2023	EVA-CLIP ViT-g	Vicuna or FlanT5	Instruction aware Q-Former
Qwen-VL	2023	ViT-bigG	Qwen 7B	Position aware adapter
LLaVA-1.5	2023	CLIP ViT-L/14 336px	Vicuna 7B/13B	Two layer MLP
InternVL	2023	InternViT-6B	Vicuna or LLaMA	Cross attention
LLaVA-NeXT (1.6)	2024	CLIP ViT-L/14	Multiple options	MLP, multi tile
Idefics2	2024	SigLIP	Mistral 7B	Perceiver pooling and MLP
Phi-3-Vision	2024	CLIP ViT-L/14	Phi-3 (4.2B)	MLP
Cambrian-1	2024	Multiple, ensembled	LLaMA or Vicuna	Spatial Vision Aggregator
MiniCPM-V 2.6	2024	SigLIP-400M	Qwen2-7B	Resampler
Llama 3.2 Vision	2024	Vision adapter	Llama 3.1	Cross attention
Pixtral 12B	2024	Custom 400M ViT	Mistral Nemo 12B	Native resolution patches
Molmo	2024	CLIP ViT-L/14	OLMo, Qwen2, Mistral, Gemma2, Phi-3	MLP, pointing head
PaliGemma	2024	SigLIP-So400m	Gemma 2B	Single linear projection
PaliGemma 2	2024	SigLIP-So400m	Gemma 2 (2B/9B/27B)	Single linear projection
Qwen2-VL	2024	Dynamic resolution ViT	Qwen2 (2B/7B/72B)	M-RoPE, Naive Dynamic Resolution
LLaVA-OneVision	2024	SigLIP	Qwen2	MLP
DeepSeek-VL2	2024	Dynamic tiling vision	DeepSeekMoE	MoE adapter
Janus-Pro	2025	SigLIP-L (understand), LlamaGen (gen)	Decoder LLM	Decoupled paths
Qwen2.5-VL	2025	Dynamic resolution ViT	Qwen2.5	M-RoPE
InternVL 2.5	2025	InternViT	InternLM 2.5	MLP
InternVL3	2025	InternViT (V2PE)	Native multimodal	MLP
Qwen3-VL	2025	Dynamic resolution ViT	Qwen3 dense and MoE	M-RoPE

The trend is clear. Vision encoders have moved from Faster R-CNN region features to ViT style patch tokens that scale with resolution. Language backbones have moved from BERT base to 7B+ instruction tuned decoder only LLMs. The bridge has gone from heavy multimodal Transformers to small MLPs or Q-Formers that map a fixed or dynamic number of visual tokens into the LLM's embedding space.

What are the leading closed source frontier VLMs?

OpenAI introduced GPT-4 with Vision (GPT-4V) to ChatGPT subscribers on September 25, 2023 ^[54]. The system card emphasized red team work on person identification, medical advice, and CAPTCHA solving rather than benchmark numbers, but third party evaluations placed GPT-4V at 56.8 percent on MMMU and 49.9 percent on MathVista at launch, both near the top of any model evaluated at the time ^[13]^[14]. GPT-4o (omni) followed on May 13, 2024, as a single natively multimodal model that takes text, audio, image, and video input and emits text, audio, and image output ^[55]. GPT-4o-mini and successive GPT-4o snapshots improved chart and document numbers through late 2024. GPT-5 launched on August 7, 2025 as a natively multimodal model with state of the art numbers on visual, video, spatial, and scientific reasoning benchmarks ^[56].

Google DeepMind released Gemini 1.0 Pro Vision in December 2023 as part of the original Gemini family. Gemini 1.5 Pro arrived on February 15, 2024 with a 1 million token context window that ingests hour long videos ^[57]. Gemini 2.0 Flash launched on December 11, 2024 as an agentic multimodal model with native image and audio output ^[58]. Gemini 2.5 Pro experimental shipped on March 25, 2025 with thinking traces, native multimodality, and a 1 million token context that Google announced would expand to 2 million ^[59].

Anthropic added vision to the Claude family on March 4, 2024 with Claude 3 Haiku, Sonnet, and Opus, which accept photos, charts, graphs, and technical diagrams ^[60]. Claude 3.5 Sonnet (June 2024) substantially improved chart and document reading ^[61], and Claude 4 Sonnet and Opus launched on May 22, 2025 with multimodal input and computer use ^[62]. xAI's Grok-1.5 Vision (preview) was announced in April 2024 alongside the RealWorldQA benchmark, which xAI built and released to evaluate basic spatial understanding in real world scenes ^[20].

These frontier models do not publish architectural details, but their behavior suggests the same broad recipe as the open source line: a large vision encoder, a very large LLM trunk, and extensive instruction and RLHF data including multimodal preferences. The key differences are scale, the amount of high quality OCR and chart data, and proprietary tooling around safety classifiers and image preprocessing.

How are VQA models evaluated?

The Antol et al. soft accuracy metric is the standard for short answer open ended VQA ^[1]^[63]. Given a candidate answer a and ten human reference answers h1 to h10, accuracy is the average, over the ten leave one out subsets of nine references, of min(number of matches / 3, 1). An answer that at least three humans gave earns full credit, while an answer that matches only one or two humans earns partial credit. Answers are lower cased and normalized before comparison, but matching is by exact string, so the metric suits short answers and tolerates only minor formatting differences.
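The soft accuracy rule can be sketched in a few lines. The normalization here is a simplified stand-in (lower casing, punctuation removal, and dropping articles) rather than the full official answer cleaning, and the function names are illustrative.

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize(answer):
    """Simplified cleaning: lower case, strip punctuation, drop articles."""
    text = re.sub(r"[^\w\s]", "", answer.lower())
    return " ".join(word for word in text.split() if word not in ARTICLES)

def vqa_accuracy(candidate, references):
    """Average min(matches / 3, 1) over every leave one out subset of references."""
    cand = normalize(candidate)
    refs = [normalize(r) for r in references]
    scores = []
    for left_out in range(len(refs)):
        others = refs[:left_out] + refs[left_out + 1:]
        matches = sum(1 for r in others if r == cand)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Ten human answers: three say "red", seven say "blue".
split_refs = ["red"] * 3 + ["blue"] * 7
majority_score = vqa_accuracy("blue", split_refs)
minority_score = vqa_accuracy("Red.", split_refs)

# Only one of ten humans gave the candidate answer.
lone_refs = ["cat"] + ["dog"] * 9
lone_score = vqa_accuracy("cat", lone_refs)
```

The majority answer scores full credit, the answer given by exactly three humans scores slightly less because some leave one out subsets contain only two matches, and an answer given by a single human still earns partial credit.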

Multiple choice benchmarks use exact match accuracy. MMBench introduces CircularEval, which presents the same question several times with the options circularly shifted and counts the model correct only if it picks the right answer in every pass, controlling for option order bias ^[15].
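A CircularEval style check can be sketched as follows, assuming one pass per circular shift of the options and requiring the correct choice in every pass. The `model` callable, which takes a question and an option list and returns the index it picks, is a hypothetical interface for illustration.

```python
def circular_eval(model, question, options, correct_index):
    """Return True only if the model is right under every circular shift."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # After the shift, the correct option sits at this position.
        target = (correct_index - shift) % n
        if model(question, rotated) != target:
            return False
    return True

def always_first(question, options):
    """A position biased model that ignores content and picks the first option."""
    return 0

def oracle(question, options):
    """A model that actually finds the right option wherever it appears."""
    return options.index("cat")

options = ["cat", "dog", "bird", "fish"]
single_pass_biased = always_first("Which animal is shown?", options) == 0
circular_biased = circular_eval(always_first, "Which animal is shown?", options, 0)
circular_oracle = circular_eval(oracle, "Which animal is shown?", options, 0)
```

The position biased model looks correct in a single pass because the answer happens to be listed first, but it fails once the options rotate, while the content aware model passes every shift.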

Generative benchmarks for long answers borrow from captioning: BLEU, METEOR, ROUGE-L, and CIDEr. For VQA, BLEU and ROUGE correlate weakly with human judgment because correct answers are short, so most modern open ended benchmarks use a GPT judge. LLaVA-Bench uses GPT-4 to score model answers on a 1 to 10 scale against a reference ^[35]. MM-Vet uses GPT-4 with a structured rubric across six capabilities (recognition, knowledge, spatial, language generation, OCR, math) ^[17]. The MathVista and MMMU leaderboards combine accuracy on multiple choice subsets with GPT graded numerical or short answer subsets ^[13]^[14].

Leaderboards aggregate these scores. OpenCompass and OpenVLM maintain rolling rankings on more than 40 benchmarks ^[64]. Hugging Face's Open Leaderboard for vision language models tracks open weight releases and exposes per benchmark scores. The standard reporting bundle for a new model in 2024 and 2025 is some subset of MMMU, MathVista, MMBench, AI2D, ChartQA, DocVQA, OCRBench, RealWorldQA, HallusionBench, MM-Vet, and SEED-Bench.

Which benchmarks matter for frontier VLMs?

MMMU stands at the center of frontier evaluation ^[13]. The benchmark spans Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering, covering 30 subjects, 183 subfields, and 30 heterogeneous image types from chemical structures to musical scores. The questions come from college level exams. When MMMU launched in late 2023, the best open model scored in the high 30s. By late 2025, frontier closed models had climbed into the mid 80s on the validation split, above the MMMU medium expert human baseline of 82.6 percent and approaching the best human score of 88.6 percent, while the strongest open weight models ranged from just above 70 percent into the high 70s and low 80s ^[13]^[67]. Because the headline accuracy gap to humans has narrowed, attention has shifted toward MMMU-Pro and other shortcut resistant variants. MMMU-Pro, released in September 2024, hardens the benchmark against shortcut behaviors by removing questions that text only models can solve and by adding a setting in which the question and its options are embedded in the image.

MathVista combines 28 existing math and visual datasets with three new ones (IQTest, FunctionQA, PaperQA), for a total of 6,141 examples ^[14]. At launch, GPT-4V led at 49.9 percent, well below the human reference of 60.3 percent ^[14]. Frontier models in 2025 routinely score above 70 percent on MathVista, although the gap to human accuracy on the function and geometry subsets remains.

MMBench probes 20 specific abilities, including object localization, attribute recognition, relation reasoning, future prediction, and structuralized image text understanding ^[15]. The CircularEval metric makes MMBench harder than naive multiple choice scoring because models that anchor on option A or B without reasoning fail under permutation. The bilingual English and Chinese splits also expose VLMs that have memorized English captions rather than truly learned cross modal alignment.

SEED-Bench provides 19K multiple choice questions across 12 dimensions, separated into spatial and temporal axes, and was the first large benchmark to push video VQA evaluation alongside image VQA ^[16]. SEED-Bench-2 (November 2023) and SEED-Bench-2-Plus (April 2024) extended the benchmark to interleaved generation and text rich images.

MM-Vet (Yu et al. ICML 2024) targets compositional capabilities ^[17]. Each of the 218 questions exercises a specific combination of recognition, knowledge, spatial awareness, language generation, OCR, and math, and a GPT-4 judge scores the full free form answer. The small size keeps GPT scoring cheap, and the rubric makes it harder to game with answer length.

HallusionBench (Guan et al. CVPR 2024) targets failure modes rather than success ^[19]. The benchmark uses paired Visual Dependent (VD) and Visual Supplement (VS) questions to separate language hallucination from visual illusion. At launch, GPT-4V scored 31.42 percent on question pair accuracy. Every other model scored under 16 percent ^[19]. Frontier models in 2025 have closed the gap somewhat, but the benchmark remains a useful stress test for the failure modes that aggregate accuracy metrics hide.
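Pair level scoring is stricter than per question accuracy, which helps explain the low launch numbers. The sketch below assumes a pair counts as correct only when every question in it is answered correctly; the record layout and function names are illustrative, not HallusionBench's own evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_question_accuracy(records: Iterable[Tuple[str, bool]]) -> float:
    """Plain accuracy over all questions, ignoring pairing."""
    results = [bool(ok) for _, ok in records]
    return sum(results) / len(results) if results else 0.0


def question_pair_accuracy(records: Iterable[Tuple[str, bool]]) -> float:
    """Fraction of pairs in which every question was answered correctly.

    Each record is (pair_id, is_correct); questions sharing a pair_id
    are scored together.
    """
    pairs: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, is_correct in records:
        pairs[pair_id].append(bool(is_correct))
    if not pairs:
        return 0.0
    return sum(all(v) for v in pairs.values()) / len(pairs)


if __name__ == "__main__":
    # Hypothetical results: pair q1 fully correct, pair q2 half correct.
    records = [("q1", True), ("q1", True), ("q2", True), ("q2", False)]
    print(per_question_accuracy(records))   # 0.75
    print(question_pair_accuracy(records))  # 0.5
```

A model that guesses consistently in one direction can look respectable per question while scoring poorly per pair.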

How do models handle document and chart VQA?

Document and chart VQA have become a distinct cluster of benchmarks because the visual statistics differ from natural images. Documents have very dense text, low semantic redundancy, and structure that matters (forms, tables, layouts). Charts encode numerical data that the model must decode and reason over.

DocVQA images come from the UCSF industry documents library, mostly tobacco litigation records ^[9]. Questions ask for entity values (names, dates, amounts) and table cells. State of the art systems combine a high resolution vision encoder, OCR, and a strong language decoder. Pixtral, DeepSeek-VL2, and Qwen2-VL all post DocVQA scores in the low to mid 90s ^[48]^[49].

ChartQA includes human written questions and a larger augmented split generated from human written chart summaries ^[12]. Questions involve arithmetic over chart values (which year had the largest increase, what is the difference between two bars). The benchmark exposes models that read text from a chart but fail to extract numerical values reliably.

InfographicVQA targets long format visual layouts that mix text, illustration, and data visualization. AI2D adds primary school diagrams with explicit Diagram Parse Graphs, useful for evaluating models that need to follow arrows, labels, and topological structure ^[5].

A practical observation from the 2024 to 2025 releases is that VLM document scores correlate more with image resolution and OCR data mix than with model size. Pixtral 12B, with its 400M parameter dedicated vision encoder and native resolution input, often beats much larger models on DocVQA ^[49]. Qwen2-VL's Naive Dynamic Resolution similarly lets it outperform larger fixed resolution models ^[42]. The lesson is that for text rich images, the bottleneck is visual detail, not language sophistication.

What are the limitations of VQA models?

Hallucination remains the most prominent failure. The POPE benchmark (Li et al. EMNLP 2023) probes whether a model says yes to objects that are not in the image ^[18]. Even strong models fail under adversarial settings where the prompt suggests objects that frequently co occur with the actual scene. HallusionBench shows that VLMs hallucinate both about content (saying there is a dog when there is not) and about absent context (assuming a sport context that does not exist in the photo) ^[19].
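Because POPE style probing asks yes or no questions about object presence, scoring reduces to binary classification. The sketch below computes accuracy, precision, recall, F1, and the share of yes answers from lists of predictions and ground truth labels. The function name and input format are assumptions made for illustration.

```python
from typing import Dict, Sequence


def pope_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score yes/no object probing answers, treating 'yes' as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    tp = fp = tn = fn = 0
    for pred, gold in zip(predictions, labels):
        said_yes = pred.strip().lower() == "yes"
        is_present = gold.strip().lower() == "yes"
        if said_yes and is_present:
            tp += 1
        elif said_yes and not is_present:
            fp += 1  # hallucinated object
        elif not said_yes and is_present:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,
    }


if __name__ == "__main__":
    # Hypothetical run: the model says yes to two objects that are absent.
    preds = ["yes", "yes", "yes", "no"]
    gold = ["yes", "no", "no", "no"]
    print(pope_metrics(preds, gold))
```

A yes ratio well above the true share of present objects is a quick signal that a model tends to agree with whatever the prompt suggests.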

Spatial reasoning is the second persistent weakness. xAI's RealWorldQA benchmark was specifically designed to expose this ^[20]. Questions ask basic things, like which lane the car is in or which way the door will swing, that humans solve easily but multimodal models often miss. At launch in April 2024, xAI reported Grok-1.5V at 68.7 percent and GPT-4V at 61.4 percent on RealWorldQA, while random guessing scored 37.7 percent ^[20]. The Cambrian-1 study found that performance on canonical computer vision tasks (depth, surface normal estimation, relative position) is uncorrelated with MMMU score, which means strong reasoning on MMMU is not sufficient for grounded spatial understanding ^[52].

Counting is the third recurring failure. VQAv2 has dedicated number questions (how many people are in the photo) and models from 2018 onward have struggled to exceed 50 percent on this subset ^[2]. The TallyQA benchmark and the counting split of MMBench show that even frontier 2025 models fail on counts above five, with accuracy collapsing on cluttered scenes.

OCR errors propagate. When the visual encoder misreads a digit or letter, the downstream answer is wrong. OCRBench and the text rich splits of DocVQA and ChartQA expose this. The shared root is resolution: models compress the image to a fixed token budget that loses fine grained text. Native resolution models (Pixtral, Qwen2-VL, MiniCPM-V) mitigate this but increase token cost ^[42]^[49].
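The resolution versus token cost tradeoff is easy to see with rough arithmetic. The sketch below estimates how many visual tokens a patch based encoder produces for an image of a given size. The patch size of 14 pixels and the optional 2 by 2 token merge are illustrative assumptions, not the configuration of any specific model named above.

```python
import math


def visual_tokens(width: int, height: int, patch: int = 14, merge: int = 1) -> int:
    """Estimate visual token count for a patch based vision encoder.

    patch: side length of each square patch in pixels (assumed value).
    merge: side length of the patch group merged into one token (1 = no merge).
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return math.ceil(cols / merge) * math.ceil(rows / merge)


if __name__ == "__main__":
    # Fixed low resolution input: small token budget, fine text is lost.
    print(visual_tokens(336, 336))              # 576

    # A full page scan at native resolution (about A4 at 200 dpi).
    print(visual_tokens(1654, 2339))            # 19992

    # The same page with 2x2 token merging to reduce cost.
    print(visual_tokens(1654, 2339, merge=2))   # 5040
```

Under these assumptions a native resolution page costs roughly 35 times as many tokens as a 336 pixel square input, which is the price paid for keeping small text legible.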

Multi image and long video reasoning is a further limitation. Most VQA benchmarks use a single image. Real world tasks (compare two charts, summarize a deck, answer questions about a phone screen recording) require interleaved multi image and temporal reasoning. MMMU multi image subsets and benchmarks like MileBench, MVBench, and Video-MME show that performance drops sharply as the number of images grows. Long context VLMs (Gemini 1.5, GPT-4o, Claude 4) handle multi image reasoning better, but evaluation is still maturing ^[57].

Finally, VLMs inherit the biases of their image data. Faces, medical content, and culturally specific scenes (food, clothing, religious imagery) are unevenly represented. GPT-4V's system card explicitly disclaims medical use and notes residual bias in person identification ^[54]. Independent audits have shown that many open VLMs over generate Western objects and underperform on long tail object recognition. As VQA models migrate into agentic systems (browsing the web, operating GUIs, driving robots), these failure modes have direct safety implications, which is why benchmarks like HallusionBench, RealWorldQA, and POPE matter more than aggregate MMMU scores alone ^[18]^[19]^[20].

ELI5: What is a visual question answering model?

Imagine you show a friend a photograph and ask, "How many dogs are in this picture?" Your friend looks at the photo, thinks about it, and tells you the answer. A visual question answering model does the same thing, except the friend is a computer program. You give it a picture and a question typed in plain English, and it gives you back an answer in plain English. The earliest versions could only pick from a short list of common answers, like a multiple choice test. Today's versions can write a full sentence, read the small print on a sign, explain a chart, or describe what is happening, because they are built from the same kind of AI that powers chatbots, with eyes bolted on.

References

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D. (2015). VQA: Visual Question Answering. ICCV 2015. arXiv:1505.00468.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR 2017. arXiv:1612.00837.
Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L. (2016). Visual7W: Grounded Question Answering in Images. CVPR 2016. arXiv:1511.03416.
Krishna, R. et al. (2017). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV 123(1).
Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A. (2016). A Diagram Is Worth A Dozen Images. ECCV 2016. (AI2D dataset)
Singh, A. et al. (2019). Towards VQA Models That Can Read. CVPR 2019. (TextVQA)
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R. (2019). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR 2019. arXiv:1906.00067.
Hudson, D.A., Manning, C.D. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. CVPR 2019. arXiv:1902.09506.
Mathew, M., Karatzas, D., Jawahar, C.V. (2021). DocVQA: A Dataset for VQA on Document Images. WACV 2021.
Lu, P. et al. (2022). Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. NeurIPS 2022. arXiv:2209.09513. (ScienceQA)
Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R. (2022). A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge. ECCV 2022. arXiv:2206.01718.
Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E. (2022). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. arXiv:2203.10244.
Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. CVPR 2024. arXiv:2311.16502.
Lu, P. et al. (2024). MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. ICLR 2024. arXiv:2310.02255.
Liu, Y. et al. (2024). MMBench: Is Your Multi-modal Model an All-around Player? ECCV 2024. arXiv:2307.06281.
Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y. (2024). SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. CVPR 2024. arXiv:2307.16125.
Yu, W. et al. (2024). MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. ICML 2024. arXiv:2308.02490.
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R. (2023). Evaluating Object Hallucination in Large Vision-Language Models. EMNLP 2023. (POPE)
Guan, T. et al. (2024). HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. CVPR 2024. arXiv:2310.14566.
xAI (April 2024). Grok-1.5 Vision Preview and RealWorldQA benchmark. x.ai/news/grok-1.5v.
Yang, Z., He, X., Gao, J., Deng, L., Smola, A. (2016). Stacked Attention Networks for Image Question Answering. CVPR 2016.
Fukui, A. et al. (2016). Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP 2016. arXiv:1606.01847.
Anderson, P. et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018. arXiv:1707.07998.
Tan, H., Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP-IJCNLP 2019. arXiv:1908.07490.
Lu, J., Batra, D., Parikh, D., Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019. arXiv:1908.02265.
Chen, Y.C. et al. (2020). UNITER: UNiversal Image-TExt Representation Learning. ECCV 2020. arXiv:1909.11740.
Li, X. et al. (2020). Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV 2020. arXiv:2004.06165.
Su, W. et al. (2020). VL-BERT: Pre-training of Generic Visual-Linguistic Representations. ICLR 2020.
Zhang, P. et al. (2021). VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR 2021.
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S. (2021). Align before Fuse (ALBEF): Vision and Language Representation Learning with Momentum Distillation. NeurIPS 2021. arXiv:2107.07651.
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020. (CLIP)
Li, J., Li, D., Xiong, C., Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. ICML 2022. arXiv:2201.12086.
Li, J., Li, D., Savarese, S., Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597.
Alayrac, J.B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022. arXiv:2204.14198.
Liu, H., Li, C., Wu, Q., Lee, Y.J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023 Oral. arXiv:2304.08485.
Liu, H., Li, C., Li, Y., Lee, Y.J. (2023). Improved Baselines with Visual Instruction Tuning (LLaVA-1.5). arXiv:2310.03744.
LLaVA Team (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. llava-vl.github.io/blog/2024-01-30-llava-next.
Li, B. et al. (2024). LLaVA-OneVision: Easy Visual Task Transfer. arXiv:2408.03326.
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592.
Dai, W. et al. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. NeurIPS 2023. arXiv:2305.06500.
Bai, J. et al. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966.
Wang, P. et al. (2024). Qwen2-VL: To See the World More Clearly. qwenlm.github.io/blog/qwen2-vl/, August 30, 2024.
Qwen Team (2025). Qwen2.5-VL Release. qwenlm.github.io/blog/qwen2.5-vl/.
Chen, Z. et al. (2024). InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. CVPR 2024 Oral. arXiv:2312.14238.
Laurençon, H. et al. (April 2024). Introducing Idefics2: A Powerful 8B Vision-Language Model for the community. Hugging Face blog.
Microsoft Research (May 2024). Phi-3 Technical Report and Phi-3-Vision announcement at Build 2024.
Meta AI (September 25, 2024). Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.
DeepSeek-AI (2024). DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. arXiv:2412.10302.
Mistral AI (September 2024). Pixtral 12B. mistral.ai/news/pixtral-12b.
Deitke, M. et al. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Allen Institute for AI. arXiv:2409.17146.
OpenBMB and Tsinghua NLP Lab (2024). MiniCPM-V series. github.com/OpenBMB/MiniCPM-V.
Tong, S. et al. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. NeurIPS 2024 Oral. arXiv:2406.16860.
Chen, X. et al. (January 2025). Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. arXiv:2501.17811.
OpenAI (September 25, 2023). GPT-4V(ision) System Card. openai.com/index/gpt-4v-system-card.
OpenAI (May 13, 2024). Hello GPT-4o. openai.com/index/hello-gpt-4o.
OpenAI (August 7, 2025). Introducing GPT-5. openai.com/index/introducing-gpt-5.
Google DeepMind (February 15, 2024). Introducing Gemini 1.5, Google's next-generation AI model.
Google DeepMind (December 11, 2024). Google introduces Gemini 2.0: A new AI model for the agentic era.
Google DeepMind (March 25, 2025). Gemini 2.5: Our newest Gemini model with thinking.
Anthropic (March 4, 2024). Introducing the next generation of Claude (Claude 3 family).
Anthropic (June 2024). Introducing Claude 3.5 Sonnet.
Anthropic (May 22, 2025). Introducing Claude 4. anthropic.com/news/claude-4.
visualqa.org/evaluation.html (Antol et al. soft accuracy metric).
OpenCompass and OpenVLM leaderboards. opencompass.org and rank.opencompass.org.
Beyer, L. et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726. https://arxiv.org/abs/2407.07726 Accessed 2026-05-31.
Google (December 5, 2024). Welcome PaliGemma 2: New vision language models by Google. Hugging Face blog. https://huggingface.co/blog/paligemma2 Accessed 2026-05-31.
Zhu, J. et al. (2025). InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv:2504.10479. https://arxiv.org/abs/2504.10479 Accessed 2026-05-31.
Qwen Team (2025). Qwen3-VL Technical Report. arXiv:2511.21631. https://arxiv.org/abs/2511.21631 Accessed 2026-05-31.
Yue, X. et al. MMMU leaderboard. https://mmmu-benchmark.github.io/ Accessed 2026-05-31.
