# Vision language model

> Source: https://aiwiki.ai/wiki/vision_language_model
> Updated: 2026-06-20
> Categories: Computer Vision, Large Language Models, Multimodal AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **vision-language model** (**VLM**) is a class of [multimodal](/wiki/multimodal_model) artificial intelligence model that jointly processes visual inputs (typically still images, sometimes video frames or documents) and natural-language text, most often producing text as output. VLMs range from contrastive image-text encoders such as [CLIP](/wiki/clip), which learns a shared embedding space and can match the ImageNet accuracy of a ResNet-50 with zero labeled training examples,[^1] to generative systems such as [LLaVA](/wiki/llava), BLIP-2, Flamingo, GPT-4V, Claude 3, and Gemini that answer free-form questions about an image, and the closely related vision-language-action models used in robotics. Both the full phrase "vision-language model" and the acronym "VLM" appear interchangeably in the research literature, which is why the aiwiki maintains separate slugs that resolve to the same content. Broader coverage of the field, including audio and video modalities and the general theory of multimodal fusion, is at [multimodal model](/wiki/multimodal_model); the robotics extension is documented at [vision-language-action model](/wiki/vision_language_action_model) and [VLA](/wiki/vla).

## What is a vision-language model used for?

The label "vision-language model" covers any neural network whose forward pass conditions on at least one image and at least one text token sequence. In current practice researchers distinguish two broad families, plus a robotics-oriented third category built on the second.

The first family is **contrastive VLMs**, which learn a shared embedding space where matching image-text pairs have high similarity and non-matching pairs have low similarity. CLIP (Contrastive Language-Image Pre-training) from OpenAI established this template in 2021 by training on 400 million image-text pairs scraped from the public web (a corpus OpenAI called WebImageText).[^1] After this pretraining, in the authors' words, "natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks," and the resulting classifier could "match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples."[^1] ALIGN from Google scaled the same recipe to roughly 1.8 billion noisy alt-text pairs.[^2] SigLIP from Google DeepMind in 2023 replaced the standard softmax contrastive loss with a per-pair sigmoid loss, allowing larger batch sizes and stronger small-batch performance.[^3] Contrastive VLMs do not generate text; they score similarity, which makes them ideal as zero-shot classifiers, retrieval systems, and frozen vision encoders for the larger generative models described below.
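
Zero-shot classification with a contrastive VLM can be sketched in a few lines. The example below is illustrative only: the three-dimensional embeddings are made-up stand-ins for real encoder outputs, the function names are hypothetical, and the temperature of 0.01 is an assumed value rather than a setting taken from any cited paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """Pick the label whose text embedding is most similar to the image."""
    sims = [cosine(image_emb, t) for t in text_embs]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings (assumed values, not real model outputs).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(label)
```

No classifier head is trained here: the class names are written as captions, embedded as text, and compared with the image embedding.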

The second family is **generative VLMs**, sometimes called **large multimodal models (LMMs)** or multimodal large language models, which produce free-form text output conditioned on an image plus a prompt. DeepMind's Flamingo (2022) was an early example demonstrating few-shot in-context learning over interleaved image-text sequences; the paper reported that "Flamingo outperforms models fine-tuned on thousands of times more task-specific data" on numerous benchmarks.[^4] BLIP-2 (Salesforce, 2023) bootstrapped from frozen image encoders and frozen language models, connected by a lightweight "Querying Transformer."[^5] LLaVA (Visual Instruction Tuning, April 2023) popularized the simpler "vision encoder plus projector plus LLM" pattern that now dominates open-source releases.[^6] MiniGPT-4 appeared days later, in the same month, with a similar frozen-encoder-plus-projection design.[^7] Proprietary entries include GPT-4V from OpenAI (September 2023),[^8] Gemini from Google DeepMind (December 2023),[^10] and Claude 3 from Anthropic (March 2024).[^9]

A third category, **vision-language-action models (VLAs)**, extends the generative VLM stack with an action head that emits robot motor commands or trajectories. These systems are treated as a distinct topic at [vision-language-action model](/wiki/vision_language_action_model).

## Architecture

Most modern generative VLMs converge on a three-part pipeline: a vision encoder that turns an image into a sequence of visual features, a connector (also called a bridge, adapter, or projector) that maps those features into a form the language model can consume, and a [large language model](/wiki/large_language_model) backbone that autoregressively generates text conditioned on the combined visual and text tokens. The design choices at each stage define the major architectural families.

### Vision encoder

The vision encoder is almost always a [Vision Transformer](/wiki/vision_transformer) (ViT), and in most open systems it is a frozen, pretrained encoder reused from a contrastive model. The two most common choices are a CLIP-trained ViT-L/14 (often at 336 by 336 pixel resolution) and a SigLIP encoder.[^6][^3] LLaVA-1.5, for example, uses OpenAI's CLIP ViT-L/14 at 336 pixels and keeps it frozen.[^11] Some systems train a bespoke high-capacity encoder: InternVL pairs its language model with InternViT-6B, a roughly six-billion-parameter vision transformer.[^12]
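
The number of visual tokens a ViT produces follows directly from the input resolution and the patch size. The sketch below assumes square inputs and non-overlapping square patches, and it ignores any extra class token; `num_patch_tokens` is a hypothetical helper name.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

# ViT-L/14 splits the image into 14 by 14 pixel patches.
print(num_patch_tokens(224, 14))  # 256
print(num_patch_tokens(336, 14))  # 576
```

Moving from 224 to 336 pixels more than doubles the number of patch tokens, which is why resolution choices affect the length of the sequence passed downstream.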

Because a fixed 224 or 336 pixel input is too coarse for dense text and documents, recent systems support higher and variable resolution. InternVL-1.5 introduced a dynamic tiling scheme that splits an image into 448 by 448 pixel tiles (1 to 12 tiles at training time, scalable to 40 tiles, roughly 4K, at inference) and applies a pixel-shuffle operation to compress each tile to 256 visual tokens.[^13] Qwen2-VL and Qwen2.5-VL instead use a native dynamic-resolution ViT so the encoder can ingest images at their original size; Qwen2.5-VL trains this encoder from scratch and adds window attention so that most layers scale linearly with the number of patches.[^14][^15] Pixtral 12B from Mistral uses a custom 400-million-parameter encoder, Pixtral-ViT, trained from scratch with 2D rotary position embeddings (RoPE-2D) to handle variable image sizes and aspect ratios.[^16]
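
The token budget of a tiling scheme like the one described for InternVL-1.5 can be worked out with simple arithmetic. This is a rough sketch, assuming a patch size of 14 pixels and a pixel-shuffle that merges each 2 by 2 group of patches into one token (both assumptions chosen to be consistent with the 256-tokens-per-tile figure above). It ignores any additional thumbnail tile, and the function names are hypothetical.

```python
def tokens_per_tile(tile_size=448, patch_size=14, shuffle_factor=2):
    """Visual tokens left per tile after pixel-shuffle compression."""
    per_side = tile_size // patch_size        # 448 / 14 = 32 patches per side
    patches = per_side * per_side             # 32 * 32 = 1024 patches
    return patches // (shuffle_factor ** 2)   # 1024 / 4 = 256 tokens

def visual_tokens(num_tiles):
    """Total visual tokens for an image split into num_tiles tiles."""
    return num_tiles * tokens_per_tile()

print(tokens_per_tile())   # 256
print(visual_tokens(12))   # 3072
print(visual_tokens(40))   # 10240
```

Under these assumptions, a 40-tile image at inference costs about ten thousand visual tokens, which shows why the compression step matters for context length.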

### How does the connector pass images to the language model?

The connector determines how visual information reaches the language model. Three patterns dominate.

**Projection (MLP) connectors** flatten the encoder's patch features and pass them through a small linear layer or multilayer perceptron into the LLM's token embedding space, where they are concatenated with the text tokens. The original LLaVA used a single linear projection; LLaVA-1.5 upgraded it to a two-layer MLP with a GELU activation, which improved benchmark results.[^11] This approach is simple, cheap to train, and now the default for most open-weights models, including the InternVL family, which uses a randomly initialized MLP between InternViT and its language backbone.[^13]
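
A projection connector can be illustrated with a toy two-layer MLP in plain Python. This is a minimal sketch rather than any model's actual implementation: the dimensions are tiny, the weights are random, the GELU uses the common tanh approximation, and all names (`mlp_projector`, `init_linear`, and so on) are hypothetical.

```python
import math
import random

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    """Apply y = W x + b, where weight has one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(d_in, d_out, rng):
    """Random small weights and zero bias (stand-ins for trained parameters)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def mlp_projector(patch_features, layer1, layer2):
    """Map each patch feature into the LLM embedding dimension."""
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
d_vision, d_llm, num_patches, num_text = 8, 16, 6, 4  # toy sizes
layer1 = init_linear(d_vision, d_llm, rng)
layer2 = init_linear(d_llm, d_llm, rng)
patch_features = [[rng.gauss(0, 1) for _ in range(d_vision)] for _ in range(num_patches)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(num_text)]

visual_tokens = mlp_projector(patch_features, layer1, layer2)
llm_input = visual_tokens + text_embeddings  # concatenate along the sequence axis
print(len(llm_input), len(llm_input[0]))  # 10 16
```

Every patch becomes one token in the language model's input sequence, so the sequence length grows with the number of patches.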

**Query-based connectors** insert a small transformer between the frozen encoder and the LLM. BLIP-2's Querying Transformer (Q-Former) is the canonical example: it carries 32 learnable query embeddings of dimension 768 that attend to the frozen image features through cross-attention and to each other through self-attention, acting as an information bottleneck that extracts only the most text-relevant visual content.[^5] The BLIP-2 authors describe the module as a "lightweight Querying Transformer" that "bridges the modality gap" between a frozen encoder and a frozen LLM, and report that the resulting model "outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters."[^5] This compresses an image to a small fixed number of tokens regardless of resolution, which keeps the LLM context short.
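
The fixed-token-count property of a query-based connector can be shown with a stripped-down cross-attention step. The sketch below is not the Q-Former itself: it omits learned projections, self-attention among queries, and feed-forward layers, uses a toy dimension of 16 instead of 768 to keep pure-Python arithmetic fast, and assumes the image features share the query dimension. All names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_features):
    """Single-head dot-product cross-attention: each query pools the image features."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in image_features]
        weights = softmax(scores)
        pooled = [sum(w * f[j] for w, f in zip(weights, image_features)) for j in range(dim)]
        outputs.append(pooled)
    return outputs

rng = random.Random(0)
dim, num_queries = 16, 32  # toy dimension; 32 queries as in BLIP-2

def random_vectors(n):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

queries = random_vectors(num_queries)      # stand-ins for learned query embeddings
low_res_features = random_vectors(64)      # e.g. a small patch grid
high_res_features = random_vectors(256)    # e.g. a larger patch grid

print(len(cross_attend(queries, low_res_features)))
print(len(cross_attend(queries, high_res_features)))
```

Both calls return 32 output vectors, because the output length is set by the number of queries rather than by the number of image patches.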

**Cross-attention connectors** leave the visual tokens outside the LLM's main token stream and instead inject them through new attention layers added inside the frozen language model. Flamingo pioneered this with a Perceiver Resampler that maps variable visual features to a fixed set of tokens, followed by gated cross-attention dense layers interleaved between the frozen language-model blocks; a tanh gate initialized to zero ensures the model starts out identical to the original text-only LLM and learns to use vision gradually.[^4] Meta's Llama 3.2 Vision follows the same philosophy, training a vision adapter of cross-attention layers that feed image-encoder representations into a frozen Llama 3.1 text model (the 11B vision model is built on Llama 3.1 8B, and the 90B on Llama 3.1 70B).[^17]
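
The effect of a zero-initialized tanh gate can be seen in a small residual update. This sketch only covers the gating step: `visual_update` stands in for the output of a cross-attention layer over visual tokens, and the function and variable names are hypothetical.

```python
import math

def gated_residual(hidden, cross_attn_out, alpha):
    """Residual update with a tanh gate: hidden + tanh(alpha) * cross_attn_out."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]          # text-only hidden state (toy values)
visual_update = [3.0, 3.0, 3.0]    # stand-in for a cross-attention output

# With alpha = 0 the gate is tanh(0) = 0, so the hidden state is unchanged.
print(gated_residual(hidden, visual_update, alpha=0.0))  # [0.5, -1.0, 2.0]

# As alpha moves away from zero during training, visual information is mixed in.
print(gated_residual(hidden, visual_update, alpha=1.0))
```

At initialization the new layers are therefore a no-op, and the language model behaves exactly as it did before the adapter was added.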

| Connector | Representative model | Mechanism | Trade-off |
|---|---|---|---|
| Linear / MLP projection | LLaVA, LLaVA-1.5, InternVL | Project patch features into LLM token space, concatenate with text | Simple and cheap; token count grows with image resolution |
| Q-Former (query transformer) | BLIP-2 | 32 learnable queries cross-attend to frozen image features | Fixed small token count; extra module to pretrain |
| Perceiver resampler + gated cross-attention | Flamingo | Resample to fixed tokens, inject via gated cross-attention into frozen LM | Handles interleaved image-text; more architectural complexity |
| Cross-attention adapter | Llama 3.2 Vision | Cross-attention layers feed image features into frozen LLM | Keeps text weights intact; adapter must be trained on image-text pairs |

### Language model backbone

The backbone ranges from open models such as Vicuna, LLaMA, Qwen2, OLMo, and Mistral NeMo in research and open-weights systems to undisclosed proprietary frontier models in commercial products. Open systems frequently advertise their pairing explicitly: Molmo-7B-D is built on Qwen2 7B and Molmo-72B on Qwen2 72B, while MolmoE-1B uses the OLMoE mixture-of-experts backbone.[^18] In cross-attention designs the backbone weights are typically frozen and only the adapter is trained, whereas in projection designs the backbone is often unfrozen during instruction tuning.

## How are vision-language models trained?

Training a generative VLM typically proceeds in two or three stages, and contrastive encoders are trained separately beforehand.

**Contrastive image-text pretraining** produces the vision encoder. A dual-encoder model maximizes the cosine similarity of matching image-text pairs and minimizes it for non-matching pairs across a large web-scale corpus. CLIP used a symmetric softmax cross-entropy loss over each batch on 400 million pairs; SigLIP replaced this with a per-pair sigmoid loss that does not require a global normalization over the batch.[^1][^3] The resulting encoder is then reused, usually frozen, inside generative VLMs.
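
The difference between the two losses is easier to see side by side. The following is a minimal sketch over a precomputed similarity matrix, where `sim[i][j]` is the cosine similarity of image `i` and text `j`. The temperature, scale, and bias defaults are assumed illustrative values (in practice they are learned), and the function names are hypothetical.

```python
import math

def log_softmax_row(row):
    """Log-softmax of one row, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(sim, temperature=0.07):
    """Symmetric softmax cross-entropy: each row and column is normalized over the batch."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def siglip_loss(sim, scale=10.0, bias=-10.0):
    """Per-pair sigmoid loss: every (image, text) pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total -= log_sigmoid(label * (scale * sim[i][j] + bias))
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs are the most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matching pairs are the least similar
print(clip_loss(aligned), clip_loss(misaligned))
print(siglip_loss(aligned), siglip_loss(misaligned))
```

The softmax version needs a whole row or column before any single term can be computed, whereas the sigmoid version sums independent per-pair terms.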

**Alignment (feature pretraining)** is the first stage of generative training. Only the connector is trained, on image-caption pairs, while the encoder and the LLM stay frozen, so that the connector learns to map visual features into a space the language model can already interpret. The original LLaVA used about 595,000 filtered image-text pairs for this stage,[^6] and LLaVA-1.5 used about 558,000.[^11]

**Multimodal instruction tuning** is the second stage. The connector and often the LLM are fine-tuned on conversational, visual-question-answering, and document data so the model can follow instructions about images. LLaVA's influential demonstration in 2023 used GPT-4-generated visual [instruction-tuning](/wiki/instruction_tuning) data (about 158,000 multimodal instruction-following examples), later augmented in LLaVA-1.5 with academic VQA datasets, to produce a capable assistant with modest compute.[^6][^11] On the synthetic multimodal instruction-following set used for evaluation, the original LLaVA reported an "85.1% relative score compared with GPT-4," and when combined with GPT-4 it reached 92.53 percent accuracy on ScienceQA.[^6]

**Preference optimization (RLHF and DPO)**, where applied, is a further stage aimed mainly at reducing hallucination and improving helpfulness. LLaVA-RLHF collected roughly 10,000 human preference pairs and applied a factually augmented reward to curb hallucinated content.[^19] RLHF-V instead gathered fine-grained, segment-level human corrections on hallucinated spans and applied dense direct preference optimization (DPO); using about 1,400 annotated samples it reported a 34.8 percent reduction in the base model's hallucination rate.[^20] DPO is attractive in this setting because it learns directly from preference pairs without training a separate reward model.
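
The standard DPO objective for a single preference pair can be written in a few lines. This sketch shows plain DPO, not the dense, segment-weighted variant used by RLHF-V. The inputs are summed log-probabilities of the preferred and rejected responses under the trained policy and a frozen reference model, `beta=0.1` is an assumed illustrative value, and the function name is hypothetical.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected),
    i.e. how much more the policy favors the chosen response than the
    reference model does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable -log(sigmoid(x)).
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

# Policy identical to the reference: margin is zero, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The only learned quantity is the policy itself; the reference log-probabilities are fixed, so no separate reward model appears in the loss.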

## Core capabilities

Generative VLMs support a recognizable set of tasks that also serve as the field's evaluation targets.

- **Visual question answering (VQA):** answering natural-language questions about an image, from simple object and attribute queries to multi-step reasoning. The VQAv2 dataset, with over one million questions on real photos, has long been a primary benchmark.[^21]
- **Image captioning:** generating fluent natural-language descriptions of an image, evaluated on datasets such as COCO Captions with metrics like CIDEr.
- **OCR and document understanding:** reading and reasoning over text embedded in images, scanned documents, receipts, forms, and screenshots. This is the focus of TextVQA, DocVQA, and OCRBench, and it is the capability that drove the move to high-resolution encoders.[^22][^23]
- **Chart, diagram, and table reasoning:** extracting values and relationships from charts and figures, measured by ChartQA and by the heterogeneous figure types in MMMU.[^24][^25]
- **Visual grounding and referring:** localizing the region an expression refers to, usually by outputting bounding-box coordinates, and conversely describing a referred region. Qwen-VL was an early generalist that accepted and produced bounding boxes and supported open-vocabulary grounding in both English and Chinese.[^26]
- **Pointing and counting:** emitting 2D point coordinates for referenced objects. Molmo was trained on the PixMo-Points data (about 2.3 million question-point pairs over 223,000 images) and counts by a chain-of-thought "point-then-count" procedure, reporting 89.4 on CountBenchQA.[^18]
- **Multi-image and video reasoning:** comparing several images or reasoning over sampled video frames, supported by models such as the Qwen2-VL and Qwen2.5-VL series.[^14][^15]
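
For the grounding capability listed above, predicted boxes are commonly compared with reference boxes by intersection-over-union (IoU). The sketch below is a generic illustration and is not taken from the cited papers; boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x1 < x2` and `y1 < y2`, and `box_iou` is a hypothetical helper name.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0.0, 0.0, 2.0, 2.0)
reference = (1.0, 1.0, 3.0, 3.0)
print(box_iou(predicted, reference))   # overlap of 1 over a union of 7
print(box_iou(predicted, predicted))   # identical boxes give 1.0
```

A prediction is then typically counted as correct when its IoU with the reference exceeds a chosen threshold.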

## Notable vision-language models

The table below lists widely cited VLMs with details verifiable from primary sources. Release years refer to the first public release of the named model.

| Model | Organization | Year | Type | Verifiable details |
|---|---|---|---|---|
| CLIP | OpenAI | 2021 | Contrastive | Dual-encoder trained on 400M image-text pairs; matches ResNet-50 ImageNet accuracy zero-shot[^1] |
| ALIGN | Google | 2021 | Contrastive | Dual-encoder (EfficientNet + BERT) scaled to ~1.8B noisy pairs[^2] |
| Flamingo | DeepMind | 2022 | Generative | Frozen Chinchilla LM + gated cross-attention; sizes 3B, 9B, 80B; outperforms models fine-tuned on thousands of times more data[^4] |
| BLIP-2 | Salesforce | 2023 | Generative | Q-Former with 32 queries; beats Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters[^5] |
| LLaVA | U. Wisconsin-Madison et al. | 2023 | Generative | CLIP ViT + projection + Vicuna; visual instruction tuning; 92.53% on ScienceQA when combined with GPT-4[^6] |
| MiniGPT-4 | KAUST | 2023 | Generative | Frozen encoder + single projection layer to a frozen LLM[^7] |
| SigLIP | Google DeepMind | 2023 | Contrastive | Sigmoid loss for scalable contrastive pretraining[^3] |
| Qwen-VL | Alibaba | 2023 | Generative | ViT-bigG visual receptor + Qwen LLM; 448px input; grounding and OCR[^26] |
| GPT-4V | OpenAI | 2023 | Generative | Vision added to GPT-4; architecture undisclosed[^8] |
| Gemini | Google | 2023 | Generative | Natively multimodal; Ultra, Pro, Nano sizes[^10] |
| Claude 3 | Anthropic | 2024 | Generative | Vision across Haiku, Sonnet, Opus[^9] |
| GPT-4o | OpenAI | 2024 | Generative | "Omni" model over text, image, and audio; announced May 13, 2024[^27] |
| InternVL-1.5 | Shanghai AI Lab | 2024 | Generative | InternViT-6B + MLP + InternLM2; dynamic tiling to ~4K[^13] |
| Llama 3.2 Vision | Meta | 2024 | Generative | Cross-attention adapter on frozen Llama 3.1; 11B and 90B variants[^17] |
| Pixtral 12B | Mistral | 2024 | Generative | 400M Pixtral-ViT (RoPE-2D) + Mistral NeMo 12B decoder[^16] |
| Molmo | Allen Institute for AI | 2024 | Generative | CLIP ViT + MLP + OLMo/Qwen2 backbones; native pointing; PixMo data[^18] |
| Qwen2-VL / Qwen2.5-VL | Alibaba | 2024 / 2025 | Generative | Native dynamic-resolution ViT; long-video and grounding support[^14][^15] |
| InternVL2.5 / InternVL3 | Shanghai AI Lab | 2024 / 2025 | Generative | ViT-MLP-LLM; InternVL2.5-78B first open model above 70% on MMMU[^12] |

## How are vision-language models evaluated?

Evaluation of VLMs draws on a layered set of public benchmarks that probe perception, reading, reasoning, and robustness.

| Benchmark | Year | What it tests | Notable details |
|---|---|---|---|
| VQAv2 | 2017 | General visual question answering | Over 1M questions on real images; balanced to reduce language priors[^21] |
| TextVQA | 2019 | Reading text in natural images | 45,336 questions over 28,408 images requiring OCR[^22] |
| DocVQA | 2021 | Document understanding | ~50,000 questions over 12,000+ document images[^23] |
| ChartQA | 2022 | Chart question answering | Visual and logical reasoning over bar, line, and pie charts[^24] |
| MME | 2023 | Perception and cognition | 14 subtasks; max 2000 (perception) + 800 (cognition); yes/no format[^28] |
| MMBench | 2023 | Fine-grained ability suite | Multiple-choice across many abilities; CircularEval to reduce option bias[^29] |
| SEED-Bench | 2023 | Comprehensive MLLM evaluation | Multiple-choice across spatial and temporal dimensions[^30] |
| MMMU | 2023 | College-level multimodal reasoning | ~11,500 questions, 6 disciplines, 30 subjects, 183 subfields, 30 image types[^25] |
| MathVista | 2023 | Visual mathematical reasoning | 6,141 examples combining prior datasets with new ones[^31] |
| OCRBench | 2023 | OCR-centric capabilities | Aggregates text recognition and reading tasks across settings[^32] |
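
The MME maxima in the table follow from its scoring rule. As a sketch, assuming the usual MME setup in which each image has two yes/no questions and a subtask score is the sum of question-level accuracy and image-level "accuracy+" (both questions correct), each subtask tops out at 200 points. The 10 perception and 4 cognition subtask split is inferred from the totals above, and the function name is hypothetical.

```python
def mme_subtask_score(per_image_results):
    """Score one subtask from (q1_correct, q2_correct) pairs, one pair per image."""
    num_images = len(per_image_results)
    correct_questions = sum(int(a) + int(b) for a, b in per_image_results)
    accuracy = 100.0 * correct_questions / (2 * num_images)
    accuracy_plus = 100.0 * sum(1 for a, b in per_image_results if a and b) / num_images
    return accuracy + accuracy_plus

# Toy results for four images (made-up data).
results = [(True, True), (True, False), (True, True), (False, False)]
print(mme_subtask_score(results))  # 112.5

MAX_PER_SUBTASK = 200
print(10 * MAX_PER_SUBTASK)  # 2000 (perception)
print(4 * MAX_PER_SUBTASK)   # 800 (cognition)
```

Because accuracy+ only rewards images where both answers are right, a model that always answers "yes" scores no better than half of the question-level points and none of the image-level points.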

MMMU has become the headline benchmark for multimodal reasoning, and a harder, contamination-resistant follow-up, MMMU-Pro, was released afterward.[^25] By late 2024, InternVL2.5-78B became the first open-weights model to exceed 70 percent on MMMU, a level previously reached only by leading proprietary systems, and InternVL3-78B reported scores above 72 percent.[^12] These figures still sit below the expert human ceiling originally reported for the benchmark, so a gap to expert-level multimodal reasoning remains.

## What are the limitations of vision-language models?

Despite rapid progress, VLMs share several persistent weaknesses.

**Hallucination.** Generative VLMs frequently describe objects, attributes, or text that are not present in the image. This is measured by benchmarks such as CHAIR, which scores hallucinated objects in captions, and POPE (Polling-based Object Probing Evaluation), which shows that models tend to confirm the presence of objects that merely co-occur frequently with what is actually shown.[^33] Hallucination arises from cross-modal misalignment, over-reliance on language priors that override visual evidence, and spurious correlations in training data. Preference-optimization methods such as RLHF-V and LLaVA-RLHF are among the leading mitigations.[^19][^20]
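
CHAIR-style scoring can be illustrated once the objects mentioned in each caption have been extracted. This is a simplified sketch: it assumes object mentions are already mapped to a fixed vocabulary and represented as sets, it uses made-up data, and the function name is hypothetical. The per-instance score is the fraction of mentioned objects that are not in the image, and the per-sentence score is the fraction of captions containing at least one such object.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Return (CHAIR_i, CHAIR_s) for parallel lists of object sets."""
    total_mentioned = 0
    total_hallucinated = 0
    captions_with_hallucination = 0
    for said, truth in zip(mentioned_objects, ground_truth_objects):
        wrong = said - truth  # objects named in the caption but absent from the image
        total_mentioned += len(said)
        total_hallucinated += len(wrong)
        if wrong:
            captions_with_hallucination += 1
    chair_i = total_hallucinated / total_mentioned if total_mentioned else 0.0
    chair_s = captions_with_hallucination / len(mentioned_objects) if mentioned_objects else 0.0
    return chair_i, chair_s

# Toy example: the first caption invents a bench that is not in the image.
mentioned = [{"dog", "frisbee", "bench"}, {"cat", "sofa"}]
ground_truth = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
print(chair_scores(mentioned, ground_truth))  # (0.2, 0.5)
```

Objects that are present but unmentioned, such as the lamp, do not count against the caption; the metric penalizes only invented content.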

**Counting.** Counting objects accurately remains one of the weakest capabilities and is consistently among the lowest-accuracy tasks in VLM evaluations. Molmo's explicit point-then-count strategy was introduced partly to address this gap.[^18]

**Spatial reasoning.** Judging relative positions, orientation, and geometric relationships (left of, behind, above) is unreliable, in part because contrastive encoders are trained to match images to holistic captions rather than to represent precise spatial layout.

**Fine-grained perception.** Reading small text, distinguishing visually similar objects, and resolving fine detail historically suffered from low-resolution inputs; the move to dynamic and native high resolution in InternVL, Qwen2-VL, and Pixtral was a direct response.[^13][^14][^16]

**Modality gap.** Even after contrastive training, image and text embeddings tend to occupy separate regions of the shared space rather than fully intermixing, which can bias cross-modal retrieval and classification. The general theory of this effect is discussed at [multimodal model](/wiki/multimodal_model).
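
One simple way to quantify this effect is the distance between the centroids of the normalized image and text embeddings. The sketch below assumes that definition, which is one common choice rather than the only one, and uses made-up two-dimensional embeddings; the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the two normalized embedding sets."""
    c_img = centroid([normalize(v) for v in image_embs])
    c_txt = centroid([normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))

# Toy embeddings: images cluster in one region, texts in another.
image_embs = [[1.0, 0.1], [0.9, 0.2]]
text_embs = [[0.1, 1.0], [0.2, 0.9]]
print(modality_gap(image_embs, text_embs))
```

A gap near zero would mean the two modalities are intermixed in the shared space; a large gap means they sit in separate regions even when individual pairs are correctly ranked.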

## Applications

VLMs underpin a growing set of products and research tools: image and document question answering, automated captioning and alt-text generation for accessibility, OCR and intelligent document processing (invoices, forms, receipts), chart and screenshot understanding for analytics and UI agents, visual search and retrieval, content moderation, medical-image report drafting (under regulatory oversight), and the perception front end of robotic [vision-language-action models](/wiki/vision_language_action_model). Many of these reuse the same contrastive encoders and instruction-tuned backbones described above, so progress on core capabilities propagates quickly into applications.

## Why do two slugs exist?

The literature uses the full phrase "vision-language model," the unhyphenated "vision language model," and the acronym "VLM" interchangeably. Flamingo's 2022 title called itself "a Visual Language Model for Few-Shot Learning,"[^4] using yet another close variant. Subsequent papers, model cards, and benchmark write-ups settled on "VLM" as the dominant short form while keeping the spelled-out version in formal titles. The aiwiki maintains both slugs so that readers searching either form land on the same content, with broader multimodal coverage at [multimodal model](/wiki/multimodal_model).

## See also

- [NVLM](/wiki/nvlm)
- [Multimodal model](/wiki/multimodal_model)
- [Vision-language-action model](/wiki/vision_language_action_model)
- [VLA](/wiki/vla)
- [CLIP](/wiki/clip)
- [SigLIP](/wiki/siglip)
- [LLaVA](/wiki/llava)
- [Molmo](/wiki/molmo)
- [Pixtral](/wiki/pixtral)
- [Llama 3.2 Vision](/wiki/llama_3_2_vision)
- [Qwen2-VL](/wiki/qwen2_vl)
- [Qwen2.5-VL](/wiki/qwen2_5_vl)
- [InternVL](/wiki/internvl)
- [GPT-4o](/wiki/gpt_4o)
- [Claude 3 Opus](/wiki/claude_3_opus)
- [Gemini](/wiki/gemini)
- [Vision Transformer](/wiki/vision_transformer)
- [Transformer](/wiki/transformer)
- [Large language model](/wiki/large_language_model)
- [Instruction tuning](/wiki/instruction_tuning)
- [MMMU](/wiki/mmmu)
- [MMMU-Pro](/wiki/mmmu-pro)
- [MathVista](/wiki/mathvista)

## References

[^1]: Radford, Alec et al., "Learning Transferable Visual Models From Natural Language Supervision", arXiv, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-31.
[^2]: Jia, Chao et al., "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision", arXiv, 2021-02-11. https://arxiv.org/abs/2102.05918. Accessed 2026-05-31.
[^3]: Zhai, Xiaohua et al., "Sigmoid Loss for Language Image Pre-Training", arXiv, 2023-03-27. https://arxiv.org/abs/2303.15343. Accessed 2026-05-31.
[^4]: Alayrac, Jean-Baptiste et al., "Flamingo: a Visual Language Model for Few-Shot Learning", arXiv, 2022-04-29. https://arxiv.org/abs/2204.14198. Accessed 2026-05-31.
[^5]: Li, Junnan et al., "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", arXiv, 2023-01-30. https://arxiv.org/abs/2301.12597. Accessed 2026-05-31.
[^6]: Liu, Haotian et al., "Visual Instruction Tuning", arXiv, 2023-04-17. https://arxiv.org/abs/2304.08485. Accessed 2026-05-31.
[^7]: Zhu, Deyao et al., "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv, 2023-04-20. https://arxiv.org/abs/2304.10592. Accessed 2026-05-31.
[^8]: OpenAI, "GPT-4V(ision) system card", OpenAI, 2023-09-25. https://openai.com/index/gpt-4v-system-card/. Accessed 2026-05-31.
[^9]: Anthropic, "Introducing the next generation of Claude", Anthropic, 2024-03-04. https://www.anthropic.com/news/claude-3-family. Accessed 2026-05-31.
[^10]: Pichai, Sundar and Hassabis, Demis, "Introducing Gemini: our largest and most capable AI model", Google, 2023-12-06. https://blog.google/technology/ai/google-gemini-ai/. Accessed 2026-05-31.
[^11]: Liu, Haotian et al., "Improved Baselines with Visual Instruction Tuning", arXiv, 2023-10-05. https://arxiv.org/abs/2310.03744. Accessed 2026-05-31.
[^12]: Chen, Zhe et al., "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling", arXiv, 2024-12-06. https://arxiv.org/abs/2412.05271. Accessed 2026-05-31.
[^13]: Chen, Zhe et al., "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites", arXiv, 2024-04-25. https://arxiv.org/abs/2404.16821. Accessed 2026-05-31.
[^14]: Wang, Peng et al., "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution", arXiv, 2024-09-18. https://arxiv.org/abs/2409.12191. Accessed 2026-05-31.
[^15]: Bai, Shuai et al., "Qwen2.5-VL Technical Report", arXiv, 2025-02-19. https://arxiv.org/abs/2502.13923. Accessed 2026-05-31.
[^16]: Agrawal, Pravesh et al., "Pixtral 12B", arXiv, 2024-10-09. https://arxiv.org/abs/2410.07073. Accessed 2026-05-31.
[^17]: Meta AI, "Llama 3.2: Revolutionizing edge AI and vision with open, customizable models", Meta, 2024-09-25. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed 2026-05-31.
[^18]: Deitke, Matt et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models", arXiv, 2024-09-25. https://arxiv.org/abs/2409.17146. Accessed 2026-05-31.
[^19]: Sun, Zhiqing et al., "Aligning Large Multimodal Models with Factually Augmented RLHF", arXiv, 2023-09-25. https://arxiv.org/abs/2309.14525. Accessed 2026-05-31.
[^20]: Yu, Tianyu et al., "RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback", arXiv, 2023-12-01. https://arxiv.org/abs/2312.00849. Accessed 2026-05-31.
[^21]: Goyal, Yash et al., "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering", arXiv, 2016-12-02. https://arxiv.org/abs/1612.00837. Accessed 2026-05-31.
[^22]: Singh, Amanpreet et al., "Towards VQA Models That Can Read", arXiv, 2019-04-18. https://arxiv.org/abs/1904.08920. Accessed 2026-05-31.
[^23]: Mathew, Minesh et al., "DocVQA: A Dataset for VQA on Document Images", arXiv, 2020-07-01. https://arxiv.org/abs/2007.00398. Accessed 2026-05-31.
[^24]: Masry, Ahmed et al., "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning", ACL Anthology, 2022-05-01. https://aclanthology.org/2022.findings-acl.177/. Accessed 2026-05-31.
[^25]: Yue, Xiang et al., "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI", arXiv, 2023-11-27. https://arxiv.org/abs/2311.16502. Accessed 2026-05-31.
[^26]: Bai, Jinze et al., "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond", arXiv, 2023-08-24. https://arxiv.org/abs/2308.12966. Accessed 2026-05-31.
[^27]: OpenAI, "Hello GPT-4o", OpenAI, 2024-05-13. https://openai.com/index/hello-gpt-4o/. Accessed 2026-05-31.
[^28]: Fu, Chaoyou et al., "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models", arXiv, 2023-06-23. https://arxiv.org/abs/2306.13394. Accessed 2026-05-31.
[^29]: Liu, Yuan et al., "MMBench: Is Your Multi-modal Model an All-around Player?", arXiv, 2023-07-12. https://arxiv.org/abs/2307.06281. Accessed 2026-05-31.
[^30]: Li, Bohao et al., "SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension", arXiv, 2023-07-30. https://arxiv.org/abs/2307.16125. Accessed 2026-05-31.
[^31]: Lu, Pan et al., "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts", arXiv, 2023-10-03. https://arxiv.org/abs/2310.02255. Accessed 2026-05-31.
[^32]: Liu, Yuliang et al., "OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models", arXiv, 2023-05-13. https://arxiv.org/abs/2305.07895. Accessed 2026-05-31.
[^33]: Li, Yifan et al., "Evaluating Object Hallucination in Large Vision-Language Models", arXiv, 2023-05-17. https://arxiv.org/abs/2305.10355. Accessed 2026-05-31.

