Visual Question Answering Models
Last reviewed
May 13, 2026
Sources
64 citations
Review status
Source-backed
Revision
v2 · 6,421 words
See also: Multimodal Models and Tasks
Visual Question Answering (VQA) is the task of producing a natural language answer to a natural language question asked about an image. Introduced by Antol et al. at ICCV 2015, VQA sits at the intersection of computer vision and natural language processing. It has become one of the standard probes for measuring whether a model can both see and reason. Early VQA systems combined a convolutional neural network image encoder with an LSTM question encoder. Modern systems use a vision transformer tied to a large language model, and the same architecture powers chart reading, document analysis, scientific figure interpretation, and general visual dialogue.
The canonical VQA problem is simple to state. The input is a pair (I, Q) where I is an image and Q is a free form question in natural language. The output is an answer A, also in natural language. The Antol et al. paper framed this as an AI complete task because answering correctly can require object recognition, attribute recognition, counting, spatial reasoning, activity recognition, common sense, and outside world knowledge, often combined in a single question.
VQA has several variants in active use:
A typical VQA pipeline has three components. A visual encoder maps the image to a sequence of feature vectors. A text encoder maps the question to a sequence of tokens. A fusion or cross modal module combines the two and produces an answer, either by classification over a fixed vocabulary or by autoregressive generation. The history of VQA models is, in large part, the history of how this fusion is performed.
The field has accumulated dozens of benchmarks. The table below lists the most influential ones in roughly chronological order.
| Dataset | Year | Approximate size | Focus |
|---|---|---|---|
| VQA v1 (Antol et al.) | 2015 | 254K images, 760K questions, 10M answers | Open ended VQA over COCO and abstract scenes |
| Visual7W (Zhu et al.) | 2016 | 47K images, 328K QA pairs | Grounded multiple choice QA, seven Ws |
| Visual Genome (Krishna et al.) | 2017 | 108K images, 1.7M QA pairs | Dense scene graphs and region QA |
| VQAv2 (Goyal et al.) | 2017 | Roughly 2x VQAv1, 1.1M QA pairs | Balanced pairs that reduce language priors |
| AI2D (Kembhavi et al.) | 2016 | 4,903 diagrams, 15K questions | Primary school science diagrams |
| TextVQA (Singh et al.) | 2019 | 28,408 images, 45,336 questions | Questions that require reading scene text |
| OK-VQA (Marino et al.) | 2019 | 14,055 questions | Open ended QA requiring outside knowledge |
| GQA (Hudson and Manning) | 2019 | 113K images, 22M questions | Programmatic compositional reasoning |
| ST-VQA, OCR-VQA | 2019 | Tens of thousands of questions | Scene text and book cover OCR QA |
| DocVQA (Mathew et al.) | 2021 | 12K+ document images, 50K questions | Industry documents and forms |
| ScienceQA (Lu et al.) | 2022 | 21,208 questions | Multimodal multiple choice science with rationales |
| A-OKVQA (Schwenk et al.) | 2022 | About 25K questions | World knowledge plus commonsense reasoning |
| ChartQA (Masry et al.) | 2022 | 9.6K human questions, 23.1K augmented | Charts with visual and logical reasoning |
| InfographicVQA | 2022 | 5K infographics, 30K questions | Infographic layout and reading |
| MM-Vet (Yu et al.) | 2023 | 218 questions, 16 capability subsets | Integrated capability evaluation, GPT judged |
| MMBench (Liu et al.) | 2023 | 2,974 multiple choice questions | Twenty fine grained ability dimensions, bilingual |
| SEED-Bench (Li et al.) | 2023 | About 19K multiple choice questions | Twelve image and video understanding dimensions |
| POPE (Li et al.) | 2023 | Binary existence probes | Object hallucination |
| LLaVA-Bench (in the wild) | 2023 | 24 images, 60 prompts | GPT 4 graded open ended chat |
| MMMU (Yue et al.) | 2024 | 11.5K college level questions | Six disciplines, thirty subjects, expert reasoning |
| MathVista (Lu et al.) | 2024 | 6,141 examples from 31 datasets | Math reasoning in visual contexts |
| HallusionBench (Guan et al.) | 2024 | 346 images, 1,129 questions | Language hallucination and visual illusion |
| RealWorldQA (xAI) | 2024 | More than 700 images | Spatial reasoning from vehicle and real world scenes |
| CV-Bench (Cambrian) | 2024 | About 2,600 questions | Vision centric tasks repurposed for VLMs |
| MMMU-Pro | 2024 | Robust subset of MMMU | Shortcut resistant version of MMMU |
In 2017, Goyal et al. showed that the original VQA dataset had strong language priors. A model could answer many questions correctly without even looking at the image. VQAv2 paired each question with a second image where the answer is different. The balanced version cut the language only baseline accuracy from around 48 percent on the original VQA test set to about 26 percent on VQAv2, while a model that uses the image scored 54 percent. Most papers since 2018 report VQAv2 numbers rather than VQAv1.
GQA, released by Drew Hudson and Christopher Manning in 2019, built questions programmatically from Visual Genome scene graphs. Each question has a corresponding functional program that captures its semantics. The dataset includes new metrics for consistency, grounding, and plausibility. Human accuracy on GQA is 89.3 percent. Strong VQA models from 2019 reached 54.1 percent.
OK-VQA and A-OKVQA, both from the Allen Institute for AI, target knowledge that lives outside the image. An OK-VQA question might show a photo of a horse and ask what farm animal eats hay. A-OKVQA adds rationales and pairs each question with multiple choice options and ten free form answer references, encouraging methods that use either retrieval or implicit knowledge from large pretrained models.
The document, chart, and diagram benchmarks fill a gap that the early COCO based datasets did not cover. DocVQA was constructed from the UCSF industry documents library and the Tobacco corpus. ChartQA was scraped from Statista and OurWorldInData and includes both human written questions and questions generated from human written chart summaries. AI2D contains 4,903 primary school science diagrams drawn from textbooks, with object segmentations and a Diagram Parse Graph that encodes element relationships.
MMMU, released by Yue and 21 co authors in late 2023 and presented at CVPR 2024, became the de facto reference benchmark for frontier VLMs. It collects 11.5K questions from college exams across six disciplines, with images that range from chemical structures and sheet music to anatomy diagrams and engineering schematics. When MMMU launched, GPT-4V scored 56 percent and Gemini Ultra scored 59 percent, well below human expert performance.
The Antol et al. ICCV 2015 paper that defined the modern task also released the first deep learning baselines. The strongest of those used a VGG-Net image encoder, an LSTM question encoder, and an element wise product to fuse them before a softmax over the 1,000 most common answers. The paper showed that humans achieve 83 percent accuracy on the open ended evaluation, while the best baseline reached 58 percent.
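The fusion step of that baseline can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the feature vectors, the weight matrix `W`, and the three answer vocabulary are made-up toy values, whereas the real model used learned, much higher dimensional embeddings and a 1,000 answer softmax.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(image_feat, question_feat, W, answers):
    """Element wise product fusion, then a linear layer and a softmax.

    image_feat, question_feat: equal length lists standing in for the
    projected CNN and LSTM embeddings (toy values, an assumption).
    W: one weight row per candidate answer (hypothetical values).
    """
    fused = [i * q for i, q in zip(image_feat, question_feat)]
    logits = [sum(w * f for w, f in zip(row, fused)) for row in W]
    probs = softmax(logits)
    best = max(range(len(answers)), key=lambda k: probs[k])
    return answers[best], probs

answers = ["yes", "no", "2"]          # toy answer vocabulary
image_feat = [0.5, -1.0, 2.0]         # stand-in for the image embedding
question_feat = [1.0, 0.5, 1.5]       # stand-in for the question embedding
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.5]]

prediction, probs = fuse_and_classify(image_feat, question_feat, W, answers)
```

Because the output is a distribution over a closed answer list, a model of this shape can never produce an answer outside its vocabulary.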
Three main lines of research followed. The first focused on attention. Yang et al. introduced Stacked Attention Networks at CVPR 2016. SANs query the image multiple times with the question embedding, producing iterative attention maps that progressively focus on the relevant region. The two layer SAN improved over a plain LSTM Q+I baseline by 4.8 absolute points on VQA. A parallel line was Hierarchical Question Image Co-Attention (Lu et al. NeurIPS 2016), which attended jointly to words and image regions.
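The idea of querying the image repeatedly can be illustrated with a simplified sketch. It assumes dot product scoring in place of the learned scoring network used in the paper, and the region vectors and question vector are toy values; only the loop structure (attend, add the attended feature to the query, attend again) reflects the description above.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_hop(regions, query):
    """One attention pass: score each region against the query,
    pool the regions with the resulting weights, and refine the query."""
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    attended = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(len(query))
    ]
    refined = [a + q for a, q in zip(attended, query)]
    return refined, weights

def stacked_attention(regions, question, hops=2):
    """Apply several hops, keeping the attention map from each one."""
    query = list(question)
    maps = []
    for _ in range(hops):
        query, weights = attention_hop(regions, query)
        maps.append(weights)
    return query, maps

regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy region features
question = [2.0, 0.0]                           # toy question embedding
final_query, maps = stacked_attention(regions, question)
```

In this toy input the second attention map places slightly more weight on the first region than the first map does, which is the progressive focusing behavior the paragraph describes.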
The second line was bilinear pooling. Fukui et al. (EMNLP 2016) proposed Multimodal Compact Bilinear pooling. MCB approximates the outer product of visual and textual feature vectors by projecting them with random Count-Sketch vectors and multiplying in the frequency domain, capturing richer interactions than concatenation or element wise product. Follow up methods, MUTAN, MFB, and MFH, made bilinear pooling more parameter efficient.
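The Count-Sketch trick can be checked numerically. The sketch below is an illustration in plain Python, not the MCB implementation: it uses a naive discrete Fourier transform instead of an FFT, tiny dimensions, and randomly drawn hash and sign tables (`hx`, `sx`, `hy`, `sy` are illustrative names). It shows that multiplying the two sketches in the frequency domain gives the same vector as sketching the full outer product directly.

```python
import cmath
import random

def count_sketch(x, h, s, d):
    """Project x into d dimensions: out[h[i]] += s[i] * x[i]."""
    out = [0.0] * d
    for i, value in enumerate(x):
        out[h[i]] += s[i] * value
    return out

def dft(x, inverse=False):
    """Naive O(d^2) discrete Fourier transform; real code would use an FFT."""
    d = len(x)
    sign = 1 if inverse else -1
    result = []
    for k in range(d):
        total = sum(
            x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / d) for n in range(d)
        )
        result.append(total / d if inverse else total)
    return result

def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Sketch both inputs, multiply element wise in the frequency domain,
    and transform back (a circular convolution of the two sketches)."""
    fx = dft(count_sketch(x, hx, sx, d))
    fy = dft(count_sketch(y, hy, sy, d))
    product = [a * b for a, b in zip(fx, fy)]
    return [c.real for c in dft(product, inverse=True)]

def sketch_of_outer_product(x, y, hx, sx, hy, sy, d):
    """Reference: Count-Sketch applied to every entry of the outer product."""
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * xi * yj
    return out

rng = random.Random(0)
n, d = 6, 8                                   # toy input and sketch sizes
hx = [rng.randrange(d) for _ in range(n)]
hy = [rng.randrange(d) for _ in range(n)]
sx = [rng.choice([-1, 1]) for _ in range(n)]
sy = [rng.choice([-1, 1]) for _ in range(n)]
x = [rng.uniform(-1, 1) for _ in range(n)]    # stand-in visual feature
y = [rng.uniform(-1, 1) for _ in range(n)]    # stand-in text feature

mcb = compact_bilinear(x, y, hx, sx, hy, sy, d)
reference = sketch_of_outer_product(x, y, hx, sx, hy, sy, d)
```

The reference function touches all n squared products, while the frequency domain route never materializes the outer product, which is where the savings come from at realistic feature sizes.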
The third line was object centric features. Anderson et al. published Bottom-Up and Top-Down Attention in 2017, the model that won the 2017 VQA Challenge. The bottom up component uses Faster R-CNN trained on Visual Genome to propose salient object regions. The top down component computes an attention distribution over those regions conditioned on the question. The same model achieved a then state of the art CIDEr of 117.9 on COCO captioning, demonstrating that region features were a strong general visual representation.
In 2019, the success of BERT in language tasks pushed the field toward transformer based cross modal encoders. ViLBERT (Lu, Batra, Parikh, Lee NeurIPS 2019) used a two stream architecture with co attention layers between vision and language. LXMERT (Tan and Bansal EMNLP 2019) used three encoders, one for the question, one for image regions, and a cross modality encoder, pretrained on five tasks including masked language modeling, masked region prediction, and image question answering. LXMERT raised the best NLVR2 accuracy from 54 percent to 76 percent, a gain of 22 absolute points.
VL-BERT (Su et al. ICLR 2020) merged vision and language into a single stream Transformer. UNITER (Chen et al. ECCV 2020) added Word Region Alignment via optimal transport on top of standard MLM and image text matching. OSCAR (Li et al. ECCV 2020) used object tags as anchor points to ease alignment, on the observation that salient objects appear in both modalities and are easy to detect. VinVL (Microsoft 2021) showed that a stronger object detector trained on combined detection corpora mattered more than architectural changes. ALBEF (Li et al. NeurIPS 2021) and METER (Dou et al. CVPR 2022) refined the recipe, with ALBEF introducing an image text contrastive loss that aligns features before they are fused and momentum distillation to learn from noisy web pairs.
These models pushed VQAv2 test-std accuracy from the high 60s to about 76 percent over three years, but they shared a structural limit. They were classifiers that predicted one of a few thousand frequent answers. They could not write a sentence, follow an instruction, or reason in chains.
CLIP, announced by OpenAI in January 2021 (Radford et al.), trained a separate image encoder and text encoder with a contrastive objective on 400 million web image text pairs. CLIP itself is not a VQA model, but its image encoder, especially the ViT-L/14 and later ViT-H/14 variants, became the visual front end for almost every subsequent VLM. The reason is simple: CLIP encoders produce features that already align with natural language descriptions, which means downstream multimodal models do not need to learn that alignment from scratch.
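A contrastive objective of this kind can be sketched as a symmetric cross entropy over an image by text similarity matrix. The code below is a simplified illustration in plain Python, not OpenAI's implementation: the embeddings are toy vectors, and the fixed `temperature` of 0.07 is an assumption made for the example (in CLIP the temperature is a learned parameter).

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log softmax probability of the target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: each image should pick its own caption out of the
    batch, and each caption should pick its own image."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch of two pairs: matched embeddings versus swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is closest to its own caption and large when the captions are swapped, which is the pressure that pulls the two encoders into a shared space.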
BLIP (Li et al. ICML 2022, Salesforce) combined a vision encoder with an encoder decoder language head in a Multimodal mixture of Encoder-Decoder design. It introduced Captioning and Filtering (CapFilt), which uses a captioner to generate synthetic captions and a filter to remove noisy image text pairs, bootstrapping clean training data from the noisy web. BLIP set new state of the art numbers on image text retrieval, captioning, and VQA.
BLIP-2 (Li et al. ICML 2023) was the architectural template for almost every subsequent open VLM. It froze both a pretrained image encoder (EVA-CLIP ViT-g) and a pretrained LLM (OPT or FlanT5) and inserted a small trainable Querying Transformer (Q-Former) between them. The Q-Former extracts a fixed number of visual query tokens that the LLM can consume directly. With this design, BLIP-2 outperformed Flamingo 80B on zero shot VQAv2 with 54x fewer trainable parameters. The pattern of a frozen encoder, a small adapter, and a frozen or lightly fine tuned LLM has been copied and refined ever since.
Google DeepMind released Flamingo (Alayrac et al. April 2022, NeurIPS 2022) before BLIP-2. Flamingo introduced gated cross attention layers inserted into a frozen Chinchilla LLM and supported arbitrarily interleaved sequences of images and text, which enabled in context few shot prompting. The 80B variant set few shot state of the art on most VQA benchmarks of the time. Flamingo was never open sourced, but OpenFlamingo and Idefics later reproduced the recipe.
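The gating idea can be sketched as a residual cross attention step scaled by the tanh of a learnable scalar. The code below is a simplified illustration, not the Flamingo implementation: it uses a single attention head with no learned projections, omits the feed forward sublayer, and the name `alpha` and the toy vectors are assumptions. With `alpha` at zero the gate is closed and the language model's hidden states pass through unchanged.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, visual_states):
    """Text positions act as queries; visual features act as keys and values."""
    dim = len(visual_states[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in text_states:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in visual_states]
        )
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visual_states)) for d in range(dim)]
        )
    return outputs

def gated_cross_attention(text_states, visual_states, alpha):
    """Residual update: text + tanh(alpha) * cross_attention(text, visual)."""
    gate = math.tanh(alpha)
    attended = cross_attention(text_states, visual_states)
    return [
        [t + gate * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_states, attended)
    ]

text_states = [[0.2, -0.1], [0.5, 0.4]]      # toy LLM hidden states
visual_states = [[1.0, 0.0], [0.0, 1.0]]     # toy visual features

closed = gated_cross_attention(text_states, visual_states, alpha=0.0)
opened = gated_cross_attention(text_states, visual_states, alpha=1.0)
```

Starting with a closed gate means the inserted layers initially leave the frozen language model's behavior intact, and visual information is mixed in only as training moves `alpha` away from zero.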
The modern era of VQA began in April 2023 with two papers a few weeks apart. MiniGPT-4 (Zhu et al. April 2023) aligned a frozen image encoder with a frozen Vicuna LLM through a single projection layer and trained on a small curated image text dataset. LLaVA (Liu et al. NeurIPS 2023) followed the same architectural idea, with a CLIP ViT-L/14 vision encoder, a projection matrix, and a Vicuna LLM, but added a key training step. The authors used text only GPT-4 to generate 158K multimodal instruction following examples from COCO captions and object bounding boxes. The resulting model reached an 85.1 percent relative score compared with GPT-4 on a synthetic multimodal instruction following benchmark and, after fine tuning on ScienceQA, 92.53 percent accuracy when its answers were ensembled with GPT-4.
LLaVA-1.5 (Liu et al. October 2023) replaced the projection matrix with a two layer MLP, switched to a 336 px CLIP encoder, and added academic VQA datasets to the instruction tuning mix. The 13B model achieved state of the art numbers across eleven benchmarks while training in about one day on a single eight GPU node, using 1.2M publicly available examples. LLaVA-NeXT, also called LLaVA-1.6, arrived in January 2024 with higher resolution support (up to 672x672, 336x1344, or 1344x336) and stronger world knowledge from a larger instruction mix. LLaVA-OneVision (August 2024) consolidated the design choices from the LLaVA-NeXT blog series into a single recipe that handles images, multi image inputs, and video.
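The cost of higher resolution is easiest to see as token arithmetic. The sketch below assumes a ViT with 14 pixel patches (as the ViT-L/14 name indicates), one visual token per patch, independent encoding of each 336 px tile, and one extra downscaled overview image per input. The tiling details are assumptions made for illustration, and real implementations may pool or drop tokens.

```python
def patch_tokens(image_size, patch_size=14):
    """Number of patch tokens for a square image_size x image_size input."""
    per_side = image_size // patch_size
    return per_side * per_side

def tiled_tokens(width, height, tile_size=336, patch_size=14, include_overview=True):
    """Token count when an image is cut into tile_size x tile_size tiles.

    include_overview adds one downscaled copy of the whole image
    (an assumed design detail for this illustration).
    """
    if width % tile_size or height % tile_size:
        raise ValueError("width and height must be multiples of tile_size")
    tiles = (width // tile_size) * (height // tile_size)
    if include_overview:
        tiles += 1
    return tiles * patch_tokens(tile_size, patch_size)

tokens_224 = patch_tokens(224)           # 16 x 16 patches
tokens_336 = patch_tokens(336)           # 24 x 24 patches
tokens_square = tiled_tokens(672, 672)   # 4 tiles plus overview
tokens_wide = tiled_tokens(1344, 336)    # 4 tiles plus overview
```

Under these assumptions a 336 px input costs 576 visual tokens and either four tile layout costs 2,880, so resolution gains are paid for directly in LLM context length.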
InstructBLIP (Dai et al. NeurIPS 2023, Salesforce) extended BLIP-2 with an instruction aware Q-Former that takes the user instruction as a side input. Trained on 26 datasets converted to instruction format, InstructBLIP set state of the art zero shot numbers on thirteen held out VQA and image captioning benchmarks.
Alibaba released Qwen-VL in August 2023 with a custom visual receptor and a position aware vision language adapter. Qwen2-VL followed on August 30, 2024, with three sizes (2B, 7B, 72B), Naive Dynamic Resolution that maps arbitrary input resolutions to a dynamic number of visual tokens, and Multimodal Rotary Position Embedding (M-RoPE) that decomposes rotary positions into temporal, height, and width components so that 1D text, 2D images, and 3D video share one position scheme. Qwen2.5-VL, released in January 2025, sharpened the same architecture and led the open weight charts on RealWorldQA and OCRBench through 2025.
InternVL (Chen et al. CVPR 2024 oral), from Shanghai AI Lab and OpenGVLab, scaled the vision encoder to 6 billion parameters and tied it to an 8B language middleware. Subsequent versions InternVL 1.5, 2, and 2.5 closed the gap to closed source frontier models on MMMU. Idefics (August 2023), Idefics2 (April 2024, 8B parameters), and Idefics3 (August 2024) from Hugging Face replicated the Flamingo recipe in open weights and shipped The Cauldron, a consolidated multimodal instruction dataset.
Microsoft released Phi-3-Vision in May 2024 as a 4.2 billion parameter VLM in the Phi-3 family, optimized for OCR, charts, and diagrams on edge devices. Phi-3.5-Vision arrived later in 2024 with multi image support. Meta released Llama 3.2 Vision in 11B and 90B variants on September 25, 2024, using a separately trained vision adapter that integrates with the Llama 3.1 language model. DeepSeek released DeepSeek-VL2 on December 13, 2024, a Mixture of Experts VLM with 1.0B, 2.8B, and 4.5B activated parameter variants that scored 93.3 percent on DocVQA. Mistral released Pixtral 12B (Pixtral-12B-2409) in September 2024 with a 400M parameter dedicated vision encoder, native resolution input, and a 128K token context window, hitting 52.5 percent on MMMU.
The Allen Institute for AI released Molmo on September 24, 2024 alongside the PixMo dataset. PixMo includes detailed image captions collected through speech based descriptions and a 2D pointing dataset, which lets Molmo answer with pointing coordinates as well as text. DeepSeek released Janus in October 2024 and Janus-Pro on January 27, 2025. Both use a unified architecture that decouples visual encoding paths for understanding and generation, with SigLIP-L for understanding and the LlamaGen tokenizer for generation. OpenBMB and the Tsinghua NLP Lab maintain the MiniCPM-V series, edge friendly VLMs that have repeatedly matched GPT-4o on single image OCR tasks despite using under 10B parameters.
A recurring theme through 2024 and 2025 was vision centric tuning. The Cambrian-1 paper (Tong et al. NeurIPS 2024 oral) from NYU studied more than 15 visual representations under a unified VLM recipe and introduced the Spatial Vision Aggregator, which integrates high resolution vision features without exploding the token count. Cambrian-1 also released CV-Bench, a benchmark constructed from canonical computer vision datasets to test perceptual abilities the existing VQA benchmarks underweight.
| Model | Release | Vision encoder | Language model | Trainable bridge |
|---|---|---|---|---|
| LXMERT | 2019 | Faster R-CNN | Transformer (scratch) | Cross modality encoder |
| ViLBERT | 2019 | Faster R-CNN | BERT base | Co attention |
| UNITER | 2020 | Faster R-CNN | BERT base | Single stream Transformer |
| OSCAR | 2020 | Faster R-CNN + object tags | BERT | Single stream Transformer |
| VinVL | 2021 | Large detector | Various | Single stream Transformer |
| ALBEF | 2021 | ViT-B/16 | BERT base | Image text contrastive plus matching |
| BLIP | 2022 | ViT | BERT decoder | Multimodal mixture of Encoder-Decoder |
| Flamingo | 2022 | NFNet | Chinchilla 70B | Gated cross attention, Perceiver Resampler |
| BLIP-2 | 2023 | EVA-CLIP ViT-g | OPT or FlanT5 | Q-Former |
| LLaVA | 2023 | CLIP ViT-L/14 | Vicuna 7B/13B | Single linear projection |
| MiniGPT-4 | 2023 | EVA-CLIP ViT-g | Vicuna | Single linear projection |
| InstructBLIP | 2023 | EVA-CLIP ViT-g | Vicuna or FlanT5 | Instruction aware Q-Former |
| Qwen-VL | 2023 | ViT-bigG | Qwen 7B | Position aware adapter |
| LLaVA-1.5 | 2023 | CLIP ViT-L/14 336px | Vicuna 7B/13B | Two layer MLP |
| InternVL | 2023 | InternViT-6B | Vicuna or LLaMA | Cross attention |
| LLaVA-NeXT (1.6) | 2024 | CLIP ViT-L/14 | Multiple options | MLP, multi tile |
| Idefics2 | 2024 | SigLIP | Mistral 7B | Perceiver pooling and MLP |
| Phi-3-Vision | 2024 | CLIP ViT-L/14 | Phi-3 (4.2B) | MLP |
| Cambrian-1 | 2024 | Multiple, ensembled | LLaMA or Vicuna | Spatial Vision Aggregator |
| MiniCPM-V 2.6 | 2024 | SigLIP-400M | Qwen2-7B | Resampler |
| Llama 3.2 Vision | 2024 | Vision adapter | Llama 3.1 | Cross attention |
| Pixtral 12B | 2024 | Custom 400M ViT | Mistral Nemo 12B | Native resolution patches |
| Molmo | 2024 | CLIP ViT-L/14 | OLMo, Qwen2, Mistral, Gemma2, Phi-3 | MLP, pointing head |
| Qwen2-VL | 2024 | Dynamic resolution ViT | Qwen2 (2B/7B/72B) | M-RoPE, Naive Dynamic Resolution |
| LLaVA-OneVision | 2024 | SigLIP | Qwen2 | MLP |
| DeepSeek-VL2 | 2024 | Dynamic tiling vision | DeepSeekMoE | MoE adapter |
| InternVL 2.5 | 2024 | InternViT | InternLM 2.5 | MLP |
| Janus-Pro | 2025 | SigLIP-L (understand), LlamaGen (gen) | Decoder LLM | Decoupled paths |
| Qwen2.5-VL | 2025 | Dynamic resolution ViT | Qwen2.5 | M-RoPE |
The trend is clear. Vision encoders have moved from Faster R-CNN region features to ViT style patch tokens that scale with resolution. Language backbones have moved from BERT base to 7B+ instruction tuned decoder only LLMs. The bridge has gone from heavy multimodal Transformers to small MLPs or Q-Formers that map a fixed or dynamic number of visual tokens into the LLM's embedding space.
OpenAI introduced GPT-4 with Vision (GPT-4V) to ChatGPT subscribers on September 25, 2023. The system card emphasized red team work on person identification, medical advice, and CAPTCHA solving rather than benchmark numbers, but third party evaluations placed GPT-4V at 56.8 percent on MMMU and 49.9 percent on MathVista at launch, both at the high end of any model at the time. GPT-4o (omni) followed on May 13, 2024, as a single natively multimodal model that takes text, audio, image, and video input and emits text, audio, and image output. GPT-4o-mini and successive GPT-4o snapshots improved chart and document numbers through late 2024. GPT-5 launched on August 7, 2025 as a natively multimodal model with state of the art numbers on visual, video, spatial, and scientific reasoning benchmarks.
Google DeepMind released Gemini 1.0 Pro Vision in December 2023 as part of the original Gemini family. Gemini 1.5 Pro arrived on February 15, 2024 with a 1 million token context window that ingests hour long videos. Gemini 2.0 Flash launched on December 11, 2024 as an agentic multimodal model with native image and audio output. Gemini 2.5 Pro experimental shipped on March 25, 2025 with thinking traces, native multimodality, and a 1 million token context that Google announced would expand to 2 million.
Anthropic added vision to the Claude family on March 4, 2024 with Claude 3 Haiku, Sonnet, and Opus, which accept photos, charts, graphs, and technical diagrams. Claude 3.5 Sonnet (June 2024) substantially improved chart and document reading, and Claude Opus 4 and Claude Sonnet 4 launched on May 22, 2025 with multimodal input and computer use. xAI's Grok-1.5 Vision (preview) was announced in April 2024 alongside the RealWorldQA benchmark, which xAI built and released to evaluate basic spatial understanding in real world scenes.
These frontier models do not publish architectural details, but their behavior suggests the same broad recipe as the open source line: a large vision encoder, a very large LLM trunk, and extensive instruction and RLHF data including multimodal preferences. The key differences are scale, the amount of high quality OCR and chart data, and proprietary tooling around safety classifiers and image preprocessing.
The Antol et al. soft accuracy metric is the standard for short answer open ended VQA. Given a candidate answer a and ten human reference answers h1 to h10, accuracy is the average of min(num matches / 3, 1) over the ten leave one out subsets of nine references. An answer earns full credit when at least three annotators in every subset gave it and partial credit when only one or two did, so plausible minority answers are not scored as flatly wrong. Matching is exact string comparison after lower casing and normalization, so the metric tolerates surface differences such as case and punctuation but not paraphrases.
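The leave one out averaging is easy to get wrong, so a small sketch may help. It follows the definition given above; the `normalize` helper is a deliberately minimal stand-in (lower casing and whitespace stripping only) and is an assumption, since a full evaluation script applies more normalization rules than shown here.

```python
from itertools import combinations

def normalize(answer):
    """Toy normalization: lower case and strip surrounding whitespace."""
    return answer.lower().strip()

def vqa_accuracy(candidate, human_answers):
    """Average of min(matches / 3, 1) over all leave one out subsets."""
    cand = normalize(candidate)
    humans = [normalize(h) for h in human_answers]
    scores = []
    for subset in combinations(humans, len(humans) - 1):
        matches = sum(1 for h in subset if h == cand)
        scores.append(min(matches / 3, 1.0))
    return sum(scores) / len(scores)

# Ten references: two annotators said "2", eight said "3".
references = ["2", "2"] + ["3"] * 8

score_minority = vqa_accuracy("2", references)   # matches two annotators
score_majority = vqa_accuracy("3", references)   # matches eight annotators
score_wrong = vqa_accuracy("5", references)      # matches nobody
```

Here the minority answer scores 0.6 and the majority answer scores 1.0, which shows how the metric grades agreement with the annotators on a sliding scale.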
Multiple choice benchmarks use exact match accuracy. MMBench introduces CircularEval, which asks the same question once per circular shift of the options and counts a model correct only if it picks the right answer in every pass, controlling for option order bias.
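A circular evaluation loop of this kind can be sketched as follows. The `ask_model` callback is hypothetical: it stands for any function that receives the list of options in the presented order and returns the index it chooses. The two toy models illustrate why a position biased model fails even when its first guess happens to be right.

```python
def circular_eval(options, correct_index, ask_model):
    """Return True only if the model is right under every circular shift.

    options: list of answer strings in their original order.
    correct_index: position of the right answer in that original order.
    ask_model: hypothetical callback mapping a list of options to an index.
    """
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # After rotating left by `shift`, the right answer moves here.
        target = (correct_index - shift) % n
        if ask_model(rotated) != target:
            return False
    return True

options = ["Paris", "Rome", "Madrid", "Berlin"]

def always_first(opts):
    """Position biased model: always picks the first option shown."""
    return 0

def knows_answer(opts):
    """Model that finds the right option wherever it appears."""
    return opts.index("Paris")

biased_passes = circular_eval(options, 0, always_first)
robust_passes = circular_eval(options, 0, knows_answer)
```

Under single pass scoring the biased model would be marked correct on this question, since the right answer is listed first. Under the circular protocol it fails as soon as the options are shifted.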
Generative benchmarks for long answers borrow from captioning: BLEU, METEOR, ROUGE-L, and CIDEr. For VQA, BLEU and ROUGE correlate weakly with human judgment because correct answers are short, so most modern open ended benchmarks use a GPT judge. LLaVA-Bench uses GPT-4 to score model answers on a 1 to 10 scale against a reference. MM-Vet uses GPT-4 with a structured rubric across six capabilities (recognition, knowledge, spatial, language generation, OCR, math). The MathVista and MMMU leaderboards combine accuracy on multiple choice subsets with GPT graded numerical or short answer subsets.
Leaderboards aggregate these scores. OpenCompass and OpenVLM maintain rolling rankings on more than 40 benchmarks. Hugging Face's Open Leaderboard for vision language models tracks open weight releases and exposes per benchmark scores. The standard reporting bundle for a new model in 2024 and 2025 is some subset of MMMU, MathVista, MMBench, AI2D, ChartQA, DocVQA, OCRBench, RealWorldQA, HallusionBench, MM-Vet, and SEED-Bench.
MMMU stands at the center of frontier evaluation. The benchmark spans Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, and Tech and Engineering, covering 30 subjects, 183 subfields, and 30 heterogeneous image types from chemical structures to musical scores. The questions come from college level exams. When MMMU launched in late 2023, the best open model scored in the high 30s. By mid 2025, frontier closed models had moved into the high 70s, and the strongest open weight models cleared 70 percent on the validation split. MMMU-Pro, released in September 2024, hardens the benchmark against shortcut behaviors by removing questions that are solvable from text alone, expanding the set of candidate options, and adding a vision only setting in which the question and its options are embedded in the image.
MathVista combines 28 existing math and visual datasets with three new ones (IQTest, FunctionQA, PaperQA), for a total of 6,141 examples. At launch, GPT-4V led at 49.9 percent, well below the human reference of 60.3 percent. Frontier models in 2025 routinely score above 70 percent on MathVista, although the gap to human accuracy on the function and geometry subsets remains.
MMBench probes 20 specific abilities, including object localization, attribute recognition, relation reasoning, future prediction, and structuralized image text understanding. The CircularEval metric makes MMBench harder than naive multiple choice scoring because models that anchor on option A or B without reasoning fail under permutation. The bilingual English and Chinese splits also expose VLMs that have memorized English captions rather than truly learned cross modal alignment.
SEED-Bench provides 19K multiple choice questions across 12 dimensions, separated into spatial and temporal axes, and was the first large benchmark to push video VQA evaluation alongside image VQA. SEED-Bench-2 (November 2023) and SEED-Bench-2-Plus (April 2024) extended the benchmark to interleaved generation and text rich images.
MM-Vet (Yu et al. ICML 2024) targets compositional capabilities. Each of the 218 questions exercises a specific combination of recognition, knowledge, spatial awareness, language generation, OCR, and math, and a GPT-4 judge scores the full free form answer. The small size keeps GPT scoring cheap, and the rubric makes it harder to game with answer length.
HallusionBench (Guan et al. CVPR 2024) targets failure modes rather than success. The benchmark uses paired Visual Dependent (VD) and Visual Supplement (VS) questions to separate language hallucination from visual illusion. At launch, GPT-4V scored 31.42 percent on question pair accuracy. Every other model scored under 16 percent. Frontier models in 2025 have closed the gap somewhat, but the benchmark remains a useful stress test for the failure modes that aggregate accuracy metrics hide.
Document and chart VQA have become a distinct cluster of benchmarks because the visual statistics differ from natural images. Documents have very dense text, low semantic redundancy, and structure that matters (forms, tables, layouts). Charts encode numerical data that the model must decode and reason over.
DocVQA images come from the UCSF industry documents library, mostly tobacco litigation records. Questions ask for entity values (names, dates, amounts) and table cells. State of the art systems combine a high resolution vision encoder, OCR, and a strong language decoder. Pixtral, DeepSeek-VL2, and Qwen2-VL all post DocVQA scores in the low to mid 90s.
ChartQA includes human written questions and a larger augmented split generated from human written chart summaries. Questions involve arithmetic over chart values (which year had the largest increase, what is the difference between two bars). The benchmark exposes models that read text from a chart but fail to extract numerical values reliably.
InfographicVQA targets long format visual layouts that mix text, illustration, and data visualization. AI2D adds primary school diagrams with explicit Diagram Parse Graphs, useful for evaluating models that need to follow arrows, labels, and topological structure.
A practical observation from the 2024 to 2025 releases is that VLM document scores correlate more with image resolution and OCR data mix than with model size. Pixtral 12B, with its 400M parameter dedicated vision encoder and native resolution input, often beats much larger models on DocVQA. Qwen2-VL's Naive Dynamic Resolution similarly outperforms fixed resolution larger models. The lesson is that for text rich images, the bottleneck is visual detail, not language sophistication.
Hallucination remains the most prominent failure. The POPE benchmark (Li et al. EMNLP 2023) probes whether a model says yes to objects that are not in the image. Even strong models fail under adversarial settings where the prompt suggests objects that frequently co occur with the actual scene. HallusionBench shows that VLMs hallucinate both about content (saying there is a dog when there is not) and about absent context (assuming a sport context that does not exist in the photo).
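Yes or no existence probes are usually summarized with standard binary classification metrics plus the fraction of yes answers, which exposes a model that agrees with everything. The sketch below is a generic illustration with toy data. The function name and the exact set of reported fields are assumptions, not a copy of any benchmark's scoring script.

```python
def existence_probe_metrics(predictions, labels):
    """Score yes/no answers against ground truth.

    predictions, labels: equal length lists of the strings "yes" or "no".
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    total = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,
    }

# Balanced toy probe set: two objects present, two absent.
labels = ["yes", "no", "yes", "no"]
always_yes = existence_probe_metrics(["yes"] * 4, labels)
perfect = existence_probe_metrics(labels, labels)
```

A model that always answers yes reaches perfect recall here but only 50 percent accuracy and a yes ratio of 1.0, the signature of object hallucination on a balanced probe set.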
Spatial reasoning is the second persistent weakness. xAI's RealWorldQA benchmark was specifically designed to expose this. Questions ask basic things, like which lane is the car in, or which way will the door swing, that humans solve easily but multimodal models often miss. At launch in April 2024, xAI reported 68.7 percent for Grok-1.5V and 61.4 percent for GPT-4V on RealWorldQA, while random guessing scored 37.7 percent. The Cambrian-1 study found that performance on canonical computer vision tasks (depth, surface normal estimation, relative position) is uncorrelated with MMMU score, which means strong reasoning on MMMU is not sufficient for grounded spatial understanding.
Counting is the third recurring failure. VQAv2 has dedicated number questions (how many people are in the photo) and models from 2018 onward have struggled to exceed 50 percent on this subset. The TallyQA benchmark and the counting split of MMBench show that even frontier 2025 models fail on counts above five, with accuracy collapsing on cluttered scenes.
OCR errors propagate. When the visual encoder misreads a digit or letter, the downstream answer is wrong. OCRBench and the text rich splits of DocVQA and ChartQA expose this. The shared root is resolution: models compress the image to a fixed token budget that loses fine grained text. Native resolution models (Pixtral, Qwen2-VL, MiniCPM-V) mitigate this but increase token cost.
Multi image and long video reasoning is the fifth limitation. Most VQA benchmarks use a single image. Real world tasks (compare two charts, summarize a deck, answer questions about a phone screen recording) require interleaved multi image and temporal reasoning. MMMU multi image subsets and benchmarks like MileBench, MVBench, and Video-MME show that performance drops sharply as the number of images grows. Long context VLMs (Gemini 1.5, GPT-4o, Claude 4) handle multi image reasoning better, but evaluation is still maturing.
Finally, VLMs inherit the biases of their image data. Faces, medical content, and culturally specific scenes (food, clothing, religious imagery) are unevenly represented. GPT-4V's system card explicitly disclaims medical use and notes residual bias in person identification. Independent audits have shown that many open VLMs over generate Western objects and underperform on long tail object recognition. As VQA models migrate into agentic systems (browsing the web, operating GUIs, driving robots), these failure modes have direct safety implications, which is why benchmarks like HallusionBench, RealWorldQA, and POPE matter more than aggregate MMMU scores alone.