InternVL

Updated Jun 21, 2026

InternVL is a family of open-source multimodal large language models developed by the OpenGVLab research group at the Shanghai Artificial Intelligence Laboratory in collaboration with academic partners including Nanjing University, the University of Hong Kong, the Chinese University of Hong Kong, Fudan University, SenseTime Research, and Tsinghua University.^[1] The original 2023 model scaled the vision encoder to 6 billion parameters and reached state-of-the-art results on 32 generic visual-linguistic benchmarks, and by December 2024 the InternVL 2.5 78B model became the first publicly released multimodal large language model to exceed 70 percent on the MMMU validation benchmark.^[1]^[4] The defining architectural choice is a roughly 6-billion-parameter vision transformer called InternViT, which is paired with an instruction-tuned language model and a lightweight projector (a multi-layer perceptron in later versions, or a learned-query "QLLaMA" middleware in the original release).^[1]^[3] Successive releases (InternVL 1.0 in December 2023, InternVL 1.5 in April 2024, InternVL 2 in July 2024, InternVL 2.5 in December 2024, InternVL3 in April 2025, and InternVL3.5 in August 2025) have iterated on data curation, dynamic high-resolution input, reinforcement learning, and native multimodal pre-training, with the InternVL3.5 mixture-of-experts flagship reaching 77.7 on MMMU and narrowing the gap with leading commercial models such as GPT-5.^[4]^[5]^[6]

Field	Value
Developer	OpenGVLab, Shanghai AI Laboratory and academic partners^[1]
Initial paper	arXiv:2312.14238, 21 December 2023^[1]
Conference	CVPR 2024 (Oral)^[7]
Latest major release	InternVL3.5 (25 August 2025, arXiv:2508.18265)^[6]
Code repository	github.com/OpenGVLab/InternVL^[7]
License	MIT for project code and InternViT weights; LLM weights inherit base-model licenses^[8]^[9]
Vision backbone	InternViT-6B (5.9B params, later trimmed to 5.54B)^[1]^[8]
Largest dense model	InternVL3-78B^[5]
Largest sparse model	InternVL3.5-241B-A28B (mixture-of-experts)^[6]^[7]

What is InternVL?

InternVL is an openly released multimodal system that takes images (and, in later versions, video and interface screenshots) plus text as input and produces text as output. The project's stated goal, in the words of the original 2023 paper, was to scale "the vision foundation model to 6 billion parameters and progressively [align] it with the LLM, using web-scale image-text data from various sources."^[1] That design philosophy, a large vision encoder rather than a small one bolted onto a giant language model, is what distinguishes InternVL from most earlier open vision-language systems and is the reason the project bills itself, on its GitHub repository, as "a pioneering open-source alternative to GPT-4o."^[7]

History and Background

Motivation: scaling vision to LLM size

When InternVL was first proposed in December 2023, the gap between language-model scale and vision-model scale was conspicuous. State-of-the-art language models had crossed the hundred-billion-parameter threshold, while open-source vision encoders used in multimodal systems were typically capped near 0.3 to 1 billion parameters (for example the CLIP ViT-L/14 at 0.3B or EVA-CLIP-g at 1B). Existing vision-language systems generally bolted such a comparatively small vision encoder onto a much larger language model via a thin projection layer.^[1] Zhe Chen and collaborators argued that this imbalance left visual representation as the bottleneck and proposed scaling a Vision Transformer to 6 billion parameters, comparable in spirit to Google's earlier ViT-22B but openly released, then aligning it with a supervised fine-tuning pipeline that used a frozen multilingual LLaMA-7B as both an encoder and a generation backbone.^[1]

When was InternVL released, and what versions exist?

The first InternVL paper, "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks", was submitted to arXiv on 21 December 2023 by Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai.^[1] It was accepted as an Oral presentation at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2024.^[7] Code and pre-trained weights were published on GitHub under the OpenGVLab organisation, where the repository has accumulated more than ten thousand stars as of 2025.^[7]

The series moved quickly after its initial release:

InternVL 1.0 (December 2023): InternViT-6B paired with QLLaMA-8B middleware and a downstream LLM such as Vicuna-13B; introduced the three-stage contrastive, generative, supervised training schedule.^[1]
InternVL 1.2 (February 2024) and InternVL 1.2 Plus: incremental data and SFT improvements before the major 1.5 jump.^[7]
InternVL 1.5 (18 April 2024): introduced dynamic high-resolution tiling (1 to 40 tiles of 448 by 448 pixels) and a curated bilingual instruction set; described in arXiv:2404.16821, "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites".^[2]^[10]
InternVL 2 (4 July 2024): seven public sizes from 1B to 76B, including InternVL2-Llama3-76B built on Hermes-2-Theta-Llama-3-70B.^[11]^[12]
InternVL 2.5 (5 December 2024, arXiv:2412.05271): same seven-tier family with substantially improved data and Chain-of-Thought test-time scaling; 78B variant first open MLLM to exceed 70 percent on MMMU validation.^[4]^[13]
InternVL3 (released 11 April 2025; paper arXiv:2504.10479 submitted 14 April 2025): native multimodal pre-training in a single stage; Variable Visual Position Encoding; 72.2 percent on MMMU at 78B.^[5]^[14]
InternVL3.5 (25 August 2025, arXiv:2508.18265): Cascade Reinforcement Learning, Visual Resolution Router, Decoupled Vision-Language Deployment, and an MoE flagship at 241B total parameters with 28B activated.^[6]

Throughout the series the authorship and institutional affiliations remained centred on the OpenGVLab team at Shanghai AI Laboratory, with Zhe Chen leading the first three papers and Jinguo Zhu leading InternVL3.^[1]^[4]^[5]

Technical Details

How does InternViT differ from other vision encoders?

The InternViT-6B encoder is a vanilla, decoder-free Vision Transformer (ViT) architecture, designed to scale CLIP-style training to language-model scale. According to the original paper, the authors performed a hyperparameter sweep over depth in {32, 48, 64, 80}, head dimension in {64, 128}, and MLP ratio in {4, 8}, evaluating each candidate via contrastive learning on a 100-million subset of LAION-en for accuracy, throughput, and training stability.^[1] The selected configuration uses a depth of 48 transformer blocks, a hidden width of 3,200, an MLP intermediate dimension of 12,800, 25 attention heads, and a 14 by 14 patch tokenizer, yielding 5.9 billion parameters in total.^[1] This made InternViT-6B the first openly released ViT to enter the multi-billion-parameter regime, where most rivals (the CLIP and EVA-CLIP backbones used by systems such as LLaVA) sat at 0.3 to 1 billion parameters.^[1]

In the InternVL 1.5 update, the last three transformer blocks were discarded from the encoder, producing the InternViT-6B-448px-V1-5 variant with 45 blocks and 5.54 billion parameters. The patch size is fixed at 14, but the input resolution was retrained for 448 by 448 pixels and combined with dynamic tiling for high-resolution images.^[8] Subsequent versions (V2.5 and V3) continued to refine the encoder weights through continued pre-training on broader OCR-heavy datasets that included LAION-en, LAION-zh, COYO, GRIT, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and PaddleOCR-processed corpora.^[8] The encoder is released independently of the chat models, with hubs at OpenGVLab/InternViT-6B-224px, OpenGVLab/InternViT-6B-448px-V1-2, OpenGVLab/InternViT-6B-448px-V1-5, and OpenGVLab/InternViT-6B-448px-V2_5, all under the MIT License.^[8]
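The parameter totals quoted above can be sanity-checked with a back-of-the-envelope count. The sketch below is a rough estimate only: it assumes a standard ViT block (four width-by-width attention projections plus a two-layer MLP) and ignores biases, normalisation layers, and embeddings. The function name `vit_block_weights` is illustrative and is not taken from the InternVL code base.

```python
def vit_block_weights(width: int, mlp_dim: int) -> int:
    """Weight-matrix parameters in one standard ViT block (assumed layout)."""
    attention = 4 * width * width   # query, key, value, and output projections
    mlp = 2 * width * mlp_dim       # up-projection and down-projection
    return attention + mlp


WIDTH, MLP_DIM, HEADS = 3200, 12800, 25   # figures quoted in the text

per_block = vit_block_weights(WIDTH, MLP_DIM)
full_encoder = 48 * per_block      # original InternViT-6B depth
trimmed_encoder = 45 * per_block   # after dropping the last three blocks
head_dim = WIDTH // HEADS

print(f"per block: {per_block / 1e6:.2f}M")
print(f"48 blocks: {full_encoder / 1e9:.2f}B")
print(f"45 blocks: {trimmed_encoder / 1e9:.2f}B")
print(f"head dim:  {head_dim}")
```

Under these assumptions the estimate lands close to the reported 5.9B and 5.54B totals, and the head dimension works out to 128, one of the two values in the sweep.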

QLLaMA middleware (InternVL 1.0)

In the original InternVL paper, the vision encoder and the downstream LLM were not connected with a thin projection layer. Instead, the authors inserted an 8-billion-parameter middleware module called QLLaMA. QLLaMA is built on multilingual LLaMA-7B weights, augmented with 96 learnable query tokens and a stack of cross-attention layers that fuse visual features into the language hidden state, adding roughly 1 billion newly trainable parameters.^[1] The motivation for QLLaMA is twofold: it acts as a powerful glue layer (analogous in spirit to the Q-Former used in BLIP-2 but, according to the paper, about 42 times larger), and it can serve as a standalone vision encoder for tasks that do not require a full chat model, such as image-text retrieval.^[1]
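The learned-query idea can be illustrated with a toy cross-attention step. This is a minimal sketch, not QLLaMA itself: it uses a single attention head, omits the learned projection matrices, and works on tiny hand-written vectors. The helper names `softmax` and `cross_attend` are hypothetical. The point it shows is that a fixed set of queries always yields a fixed number of output vectors, however many visual features come in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, visual_feats):
    """Single-head cross-attention without learned projections (a simplification).

    Each query attends over all visual features and returns their weighted mean.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, v)) / math.sqrt(dim)
                  for v in visual_feats]
        weights = softmax(scores)
        outputs.append([sum(w * v[k] for w, v in zip(weights, visual_feats))
                        for k in range(dim)])
    return outputs


# Two toy queries stand in for the 96 learnable queries described above.
queries = [[1.0, 0.0], [0.0, 1.0]]
few_feats = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8]]
many_feats = few_feats * 3   # three times as many visual tokens

print(len(cross_attend(queries, few_feats)))    # number of output vectors
print(len(cross_attend(queries, many_feats)))   # unchanged by input length
```

In the real module the queries are trained parameters and every layer has its own projections; the sketch keeps only the resampling behaviour.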

From InternVL 1.5 onward, the QLLaMA module was retired in most public chat models in favour of a much simpler two-layer MLP projector that maps InternViT features into the embedding space of the downstream LLM. The simpler connector reduced model size while leaving most performance gains attributable to the encoder, data, and language backbone.^[2]^[11]

Three-stage training (InternVL 1.0)

The original InternVL pipeline used three explicit stages:

Vision-language contrastive training. The 5.9-billion-parameter InternViT, initialised randomly, and a text encoder initialised from multilingual LLaMA-7B were trained jointly on 4.98 billion cleaned image-text pairs sourced from LAION-en, LAION-multi, LAION-COCO, COYO, Wukong, CC3M, CC12M, and SBU. Training used a symmetric cross-entropy loss in the style of CLIP for 175,000 iterations across 640 NVIDIA A100 GPUs, processing about 28.7 billion samples in total.^[1]
Vision-language generative training. The contrastively trained encoder was frozen and connected to QLLaMA with new cross-attention layers and learnable queries. Training was performed on 1.03 billion stringently filtered pairs for 80,000 steps on 160 A100 GPUs, with a combined image-text contrastive, image-text matching, and image-grounded text generation loss.^[1]
Supervised fine-tuning. Roughly 4 million instruction samples covering captioning, visual question answering, OCR question answering, visual grounding, and multimodal dialogue were used to align the system with instruction tuning formats and to attach a downstream LLM such as Vicuna-13B or InternLM.^[1]

The three-stage recipe enabled a coarse-to-fine alignment: contrastive learning produced general-purpose visual features, generative training learned to bind features to natural language, and SFT taught the system to follow user instructions.^[1]
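The symmetric cross-entropy objective used in the first stage can be sketched in a few lines. The example below is a generic CLIP-style loss written in plain Python, not the project's training code; the function names and the temperature of 0.07 are assumptions chosen for illustration. It takes a square matrix of image-text similarities in which entry (i, i) is the matching pair.

```python
import math


def log_softmax(row):
    """Log-softmax of a list of floats, computed stably."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    sim[i][j] is the similarity between image i and text j; the
    diagonal holds the matching pairs. The temperature is an assumed value.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = [[1.0, 0.1, 0.0],
           [0.2, 0.9, 0.1],
           [0.0, 0.1, 1.0]]
uninformative = [[0.0] * 3 for _ in range(3)]

print(symmetric_contrastive_loss(aligned))
print(symmetric_contrastive_loss(uninformative))   # equals log(3)
```

A well-aligned batch gives a loss near zero, while a batch with no signal gives the log of the batch size, which is the chance-level value.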

Dynamic high-resolution input (InternVL 1.5)

The most important architectural change introduced in InternVL 1.5 was dynamic high-resolution tiling. Rather than resizing every input image to a single fixed resolution, the system selects an aspect-ratio-aware grid of 448 by 448 tiles. During training the model uses between 1 and 12 tiles per image; during inference it can scale up to 40 tiles, which is enough to support inputs of approximately 4K resolution.^[2]^[10] A global "thumbnail" tile is appended so that the model retains an overview of the full image alongside the higher-resolution patches.^[2] This mechanism made InternVL 1.5 particularly strong on OCR, document analysis, and chart understanding, where small text and fine layout details would otherwise be lost.
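The grid selection described above can be sketched as follows. This is a simplified illustration loosely modelled on the idea of matching the image's aspect ratio to a candidate grid; the function names, the tie-breaking rule, and the half-area threshold are assumptions for the sketch rather than a copy of the released preprocessing code.

```python
def choose_tile_grid(width, height, max_tiles=12, tile=448):
    """Pick a (columns, rows) grid whose aspect ratio best matches the image.

    Ties are broken in favour of a larger grid only when the image has
    enough pixels to fill at least half of it (an assumed heuristic).
    """
    aspect = width / height
    grids = sorted(
        ((c, r) for c in range(1, max_tiles + 1)
         for r in range(1, max_tiles + 1) if c * r <= max_tiles),
        key=lambda g: g[0] * g[1],
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * tile * tile * cols * rows:
            best = (cols, rows)
    return best


def tile_count(grid):
    """Tiles fed to the encoder, including the thumbnail for multi-tile grids."""
    cols, rows = grid
    return cols * rows + (1 if cols * rows > 1 else 0)


for size in [(448, 448), (896, 448), (1792, 1792)]:
    grid = choose_tile_grid(*size)
    print(size, grid, tile_count(grid))
```

With the training-time cap of 12 tiles, a small square image stays a single tile, a 2:1 image becomes a 2 by 1 grid plus a thumbnail, and a large square image is split into a 3 by 3 grid plus a thumbnail.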

Data and language backbone scaling (InternVL 2)

InternVL 2 introduced what the team called "progressive alignment": the vision encoder is trained alongside small language models first, then progressively paired with larger LLMs without retraining the encoder from scratch.^[11] The seven public sizes drew on two vision encoders (a 304M-parameter ViT for the smaller variants and the 5.54B InternViT for the 26B, 40B, and 76B variants) and swapped in different LLM backbones, ranging from a 0.5B Qwen-derived model at the 1B tier up to Hermes-2-Theta-Llama-3-70B (a community fine-tune of Llama 3 70B) at the 76B tier.^[11]^[12] Training used an 8k context window and combined long texts, multi-image conversations, and video clips, in contrast to the largely single-image data used in InternVL 1.5.^[12]

Test-time and data scaling (InternVL 2.5)

The InternVL 2.5 paper, "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling", systematically dissected three orthogonal axes: vision-encoder scale, LLM scale, and inference-time strategies. The authors observed a 3.7-point gain on MMMU when applying a DPO-style refined SFT with explicit reasoning prompts at inference, an early demonstration of multimodal chain-of-thought as a tractable lever for open models.^[4]^[13] Training data was filtered down to roughly 16 million high-quality samples in a structured ChatML format that combined captioning, OCR, math, and chat data.^[13]

Native multimodal pre-training (InternVL3)

InternVL3, published as arXiv:2504.10479 on 14 April 2025 with Jinguo Zhu as lead author and 50 other contributors, departed from the staged contrastive-then-generative recipe in favour of a "native multimodal pre-training" paradigm. As the paper puts it, "rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage."^[5] The paper introduces Variable Visual Position Encoding (V2PE), which assigns smaller position increments to visual tokens than to text tokens, allowing image-heavy contexts to fit comfortably inside the model's position window. The training pipeline also adds a Mixed Preference Optimization stage that uses both online and offline rewards to mitigate the standard teacher-forcing distribution shift between training and inference.^[5]^[14]
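The effect of Variable Visual Position Encoding on position indices can be shown with a toy sequence. The sketch below is illustrative only: the function name `v2pe_positions` and the visual increment of 0.25 are assumptions, and the real model applies the idea inside its positional encoding rather than on a Python list.

```python
def v2pe_positions(token_kinds, visual_increment=0.25):
    """Assign a position to each token, advancing less after visual tokens.

    token_kinds is a sequence of "text" or "image" labels. Text tokens
    advance the position by 1; image tokens advance it by visual_increment
    (an assumed value smaller than 1).
    """
    positions, current = [], 0.0
    for kind in token_kinds:
        positions.append(current)
        current += visual_increment if kind == "image" else 1.0
    return positions


sequence = ["text"] + ["image"] * 4 + ["text"]
print(v2pe_positions(sequence))                          # compressed visual span
print(v2pe_positions(sequence, visual_increment=1.0))    # conventional indexing
```

With an increment of 0.25, four image tokens occupy the same span of the position window as one text token, which is how image-heavy contexts stay within the model's position range.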

The InternVL3 series spans seven sizes from 1B to 78B parameters, sharing weights with the InternViT-6B-448px-V2_5 encoder for the larger variants and with a smaller 304M ViT for the compact ones. InternVL3 expanded the scope of the model family beyond static images to include GUI agents, 3D scene understanding, tool use, and industrial inspection, partly through targeted data mixing rather than architectural changes.^[14]

Cascade RL, ViR, and Decoupled Deployment (InternVL3.5)

InternVL3.5, arXiv:2508.18265, submitted on 25 August 2025 with Weiyun Wang as lead author and 74 co-authors, introduced three additional system-level techniques.^[6] Cascade Reinforcement Learning splits post-training into an offline reinforcement learning warm-up that uses pre-collected preference pairs and a stable convergence objective, followed by an online stage that refines the model against a process reward model called VisualPRM. According to the paper, these contributions "collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05x inference speedup compared to its predecessor."^[6] The Visual Resolution Router (ViR) is a lightweight controller that selects per-tile resolutions at inference time, lowering compute when easy regions of an image do not need high resolution. Decoupled Vision-Language Deployment (DvD) physically separates the vision encoder and the language model across different GPU groups so that each can be parallelised at its own rate, which is responsible for the bulk of the 4.05 times inference speed-up at the 241B scale.^[6] The flagship InternVL3.5-241B-A28B is a mixture-of-experts model with 241 billion total parameters and 28 billion activated per token, the first MoE entry in the family, and the paper reports that it "narrows the performance gap with leading commercial models like GPT-5."^[6]^[7]

Model variants

The InternVL family has grown into a wide grid of sizes, vision encoders, and language backbones. The table below lists the most commonly cited dense chat models, plus the MoE flagship.

Model	Release	Vision encoder	Language model	Total params	Notable benchmark
InternVL-Chat-V1-2	Feb 2024	InternViT-6B-448px-V1-2	Nous-Hermes-2-Yi-34B	~40B	OCRBench >700^[11]
InternVL-Chat-V1-5	18 Apr 2024	InternViT-6B-448px-V1-5	InternLM2-Chat-20B	25.51B	DocVQA 90.9 percent^[2]^[10]
InternVL2-1B	8 Jul 2024	304M ViT	Qwen2-0.5B	0.94B	Mobile-friendly^[11]^[12]
InternVL2-2B	4 Jul 2024	304M ViT	InternLM2-Chat-1.8B	2.21B	MMMU 36.3^[12]
InternVL2-4B	4 Jul 2024	304M ViT	Phi-3-mini-4k-instruct	4.15B	MMMU 47.9^[12]
InternVL2-8B	4 Jul 2024	304M ViT	InternLM2_5-7B-chat	8.08B	MMMU 49.3^[12]
InternVL2-26B	4 Jul 2024	InternViT-6B-V1-5	InternLM2-Chat-20B	25.51B	MMMU 51.2^[12]
InternVL2-40B	8 Jul 2024	InternViT-6B-V1-5	Nous-Hermes-2-Yi-34B	40.07B	MMMU 55.2^[12]
InternVL2-Llama3-76B	15 Jul 2024	InternViT-6B-V1-5	Hermes-2-Theta-Llama-3-70B	76.3B	MMMU 58.2^[11]^[12]
InternVL2.5-78B	5 Dec 2024	InternViT-6B-V2_5	Qwen2.5-72B-Instruct	78.41B	MMMU 70.1^[4]^[13]
InternVL3-78B	11 Apr 2025	InternViT-6B-V2_5	Qwen2.5-72B-Instruct	~78B	MMMU 72.2^[5]^[14]
InternVL3.5-241B-A28B	25 Aug 2025	InternViT-6B-V2_5	MoE LLM (241B total, 28B active)	241B	MMMU 77.7^[6]

The smaller variants are intended for on-device or edge deployment; a dedicated "Mini-InternVL" line described in arXiv:2410.16261 reported that compact chat models of 1B to 4B parameters could retain about 90 percent of the benchmark performance of the largest model in the family with only about 5 percent of the parameters, indicating that the architecture distils efficiently.^[9]^[11] The MoE flagship in InternVL3.5 was the first model in the family to break from a single dense LLM backbone, using sparsity to keep activated compute close to that of a 28B model while total capacity scales to 241B.^[6]^[7]

Benchmarks

How does InternVL compare to GPT-4o and Claude 3.5 Sonnet?

Across the standard multimodal benchmark suite, InternVL has tracked or overtaken proprietary frontier models on text-heavy tasks while remaining a step behind them on the hardest reasoning tasks.

Benchmark	GPT-4o (20240513)	Claude 3.5 Sonnet	InternVL2-Llama3-76B	InternVL2.5-78B	InternVL3-78B
MMMU (val)	69.1	68.3	58.2	70.1	72.2
MathVista (testmini)	63.8	67.7	65.5	72.3	79.0
OCRBench	736	788	839	854	n.r.
DocVQA (test)	92.8	95.2	94.1	95.1	95.4
ChartQA (test)	85.7	90.8	88.4	n.r.	n.r.
TextVQA (val)	n.r.	n.r.	84.4	n.r.	n.r.
OpenCompass average	69.9	67.9	71.0	n.r.	n.r.
MMBench-EN (test)	83.4	79.7	86.5	n.r.	n.r.

GPT-4o, Claude 3.5 Sonnet, and InternVL2-Llama3-76B scores in the table come from the official InternVL2 model card, which cross-references public reports as of mid-2024.^[11] InternVL2.5 and InternVL3 numbers come from their respective papers and project blogs.^[4]^[5]^[13]^[14] On OCRBench the open InternVL line already outperformed both GPT-4o and Claude 3.5 Sonnet at the 76B scale in mid-2024, and the lead has widened in subsequent versions.^[11]^[13] MathVista was a notable acceleration point: the InternVL3-78B score of 79.0 on testmini placed it ahead of GPT-4o (60.0), Claude 3.5 Sonnet (66.8), and Gemini 2.0-Pro (71.3) in the project's own comparison table; those baseline figures come from a different source than the mid-2024 values tabulated above, so the two sets of GPT-4o and Claude 3.5 Sonnet numbers do not match.^[14]

Is InternVL competitive on MMMU?

On MMMU, InternVL 2.5-78B's 70.1 in December 2024 was the first publicly verified open MLLM score above the 70 threshold; InternVL3-78B subsequently raised it to 72.2 in April 2025, setting what its paper called "a new state-of-the-art among open-source MLLMs" while remaining "highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro."^[4]^[5] InternVL3.5-241B-A28B then reached 77.7, closing most of the gap with the strongest closed-source frontier models reported at the time.^[6]^[7] The 8B InternVL3.5 variant achieves 73.4 on MMMU, illustrating how rapidly small open models have caught up to the original 78B-class scores.^[6]

Training data

InternVL's training data has evolved over time, with the contrastive pre-training corpus dominated by web image-text pairs and the supervised fine-tuning corpus increasingly biased toward OCR, math, and chat data:

Contrastive stage (InternVL 1.0): 4.98 billion cleaned image-text pairs from LAION-en, LAION-multi, LAION-COCO, COYO-700M, Wukong, Conceptual Captions (CC3M and CC12M), and SBU Captions.^[1]
Generative stage (InternVL 1.0): 1.03 billion high-quality pairs after stringent filtering.^[1]
Supervised fine-tuning (InternVL 1.0): roughly 4 million instruction samples spanning captioning, VQA, OCR-VQA, visual grounding, and dialogue.^[1]
Bilingual SFT (InternVL 1.5): 5 million high-quality bilingual samples in English and Chinese, with explicit coverage of common scenes, document images, scene text, and Chinese cultural content.^[2]^[10]
Curated mixture (InternVL 2.5): about 16 million samples in a structured ChatML format covering general visual conversation, OCR, math, science diagrams, charts, multi-image dialogues, and video.^[13]
Native multimodal pre-training (InternVL3): jointly trained on a multimodal mixture (image-text pairs, video clips, interleaved web documents) and pure text from OmniCorpus, with curriculum-style ratio scheduling across the single pre-training stage.^[14]
OCR-heavy InternViT continued pre-training: throughout the series the encoder weights have been refreshed on LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and PaddleOCR-processed corpora.^[8]

LAION datasets and COYO-700M provide the bulk of the contrastive pre-training data; ImageNet and ADE20K are used for evaluating linear probing and segmentation transfer; OmniCorpus contributes the long-form text and interleaved document data used in InternVL2 and InternVL3.^[1]^[14]

What is InternVL used for?

Because every chat model in the series is published openly under MIT or LLaMA-compatible licenses (depending on the language backbone), InternVL has been widely adopted both as a baseline for academic research and as a practical drop-in for multimodal applications.^[7]^[8]^[11] The official documentation describes deployment paths through Hugging Face Transformers, vLLM, SGLang, and LMDeploy, and pre-built Docker images are provided for production use.^[11] The InternVL Space on Hugging Face hosts a public chat demo that has accumulated several hundred likes and a steady stream of usage since launch.^[7]

Application domains the project explicitly demonstrates include document and chart understanding (DocVQA, ChartQA, InfographicVQA), scene-text OCR (TextVQA, OCRBench), mathematical reasoning over images (MathVista, MathVerse, MathVision), visual grounding (RefCOCO, RefCOCO+, RefCOCOg), multimodal hallucination detection (POPE, HallBench), and multilingual evaluation (MMBench-CN, CCBench).^[11]^[13] Beyond benchmarks, the InternVL3 release added explicit support for GUI agents (interacting with software interfaces via screenshots), 3D scene understanding, industrial image analysis, and tool use, while InternVL3.5 extended these into embodied agents.^[5]^[6]^[14]

The model family has been used as a vision backbone for many downstream open-source systems, including VideoChat-Flash and several derivative video chat assistants released by OpenGVLab and the broader community.^[7] In addition, the InternViT-6B encoder is itself reused independently of the chat models, for instance as a feature extractor in retrieval pipelines or as a frozen backbone in research on multimodal alignment, taking advantage of the fact that it is released separately under the MIT License together with its weights and CLIP-style image processor.^[8]

Significance

The InternVL series sits at the intersection of three trends in mid-decade multimodal research.

First, it demonstrated that scaling vision encoders to language-model parity, rather than gluing a small encoder to a giant LLM, can yield large benchmark gains. The 5.9-billion-parameter InternViT-6B reached 88.2 percent linear-probing accuracy on ImageNet-1K and 47.2 mIoU on ADE20K semantic segmentation with a linear probe, both competitive with or better than other contemporary open encoders.^[1] This validated the premise that the vision side of multimodal models had been undersized.

Second, it served as a public counterpoint to closed-source MLLMs. The InternVL 1.5 paper, titled "How Far Are We to GPT-4V?", explicitly framed the open vs. closed gap as the central research question, and the team's later 2.5 and 3 papers reported the first MMMU scores above 70 and 72 percent respectively from any openly released MLLM.^[2]^[4]^[5] By making weights and training recipes available under permissive licenses, the project pulled the open-source ceiling up and gave smaller research groups access to a strong multimodal baseline for fine-tuning, distillation, and ablation studies.^[7]^[8]

Third, it has been an early adopter of techniques from the broader LLM stack, including DPO-style preference optimisation in 2.5, native multimodal pre-training in 3, and reinforcement learning with process reward models in 3.5.^[4]^[5]^[6] In doing so the project has provided a useful, well-documented testbed for transferring alignment ideas from text-only models into the multimodal setting.

Limitations and Criticisms

Despite its strong benchmark scores, the InternVL family inherits several limitations common to large open multimodal models.

The first is hallucination. The project's own POPE and HallBench evaluations show that even the largest variants of InternVL 2 (76B) and InternVL 2.5 (78B) hallucinate objects and attributes in a non-trivial fraction of images, particularly on adversarial prompts. InternVL2-Llama3-76B scored 55.2 on HallBench average, ahead of Claude 3.5 Sonnet (49.9) but below the InternVL2-40B model on some splits.^[11] Hallucination control is one of the explicit motivations for the Mixed Preference Optimization stage introduced in InternVL3 and the Cascade RL stage in InternVL3.5, but the problem has not been eliminated.^[5]^[6]

The second is the dependency on heterogeneous language backbones. Different InternVL 2 sizes use Qwen2, InternLM2, Phi-3, Nous-Hermes-2-Yi, and Hermes-2-Theta-Llama-3, each of which carries its own license, alignment quirks, and tokenizer. While InternVL's own project code is MIT-licensed, the chat models inherit the more restrictive terms of their LLM backbones, complicating commercial use.^[9]^[11] InternVL3.5's MoE flagship reduces but does not remove this concern.^[6]

The third is the cost and complexity of the dynamic high-resolution tiling. At inference time a 4K image can produce up to 40 tiles of 448 by 448 pixels each, each consuming hundreds of visual tokens after the MLP projection. This dramatically increases compute and memory, which the Visual Resolution Router in InternVL3.5 only partially addresses.^[2]^[6]
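A rough token budget makes the cost concrete. The sketch below assumes that each 448 by 448 tile is cut into 14 by 14 patches and that a pixel-shuffle step then reduces the token count to a quarter, giving 256 tokens per tile; that reduction is described for InternVL 1.5 but is not stated in this article, so treat it as an assumption. The function name `visual_tokens` is illustrative.

```python
def visual_tokens(num_tiles, tile=448, patch=14, shuffle_ratio=0.5, thumbnail=True):
    """Estimate visual tokens passed to the LLM for a tiled image.

    shuffle_ratio=0.5 models an assumed pixel-shuffle step that halves each
    spatial side, leaving a quarter of the patch tokens per tile.
    """
    per_side = tile // patch                              # patches along one side
    per_tile = int(per_side * per_side * shuffle_ratio ** 2)
    extra = 1 if thumbnail and num_tiles > 1 else 0       # global thumbnail tile
    return (num_tiles + extra) * per_tile


for tiles in (1, 12, 40):
    print(tiles, "tiles ->", visual_tokens(tiles), "visual tokens")
```

Under these assumptions a single tile costs 256 tokens, the 12-tile training cap costs a few thousand, and the 40-tile inference maximum pushes past ten thousand visual tokens before any text is added.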

Finally, while the project provides open weights and open inference code, the full training data pipelines (especially the curated SFT and preference data used in 2.5 and 3) are not all publicly released in identical form, making exact reproduction of the strongest benchmark numbers difficult. The team has stated an intent to release training data alongside InternVL3, but historically the released mixtures have lagged behind the model releases.^[5]^[14]

How does InternVL compare to related open models?

InternVL is part of a wider 2024 to 2025 wave of open multimodal large language models. Direct competitors include the LLaVA series from Microsoft and the University of Wisconsin (and its 1.5, 1.6, and Next iterations), MiniCPM-V from ModelBest and Tsinghua, the Qwen-VL family (Qwen-VL, Qwen2-VL, Qwen2.5-VL) from Alibaba, DeepSeek-VL and DeepSeek-VL2 from DeepSeek, and Pixtral from Mistral AI, among others.^[7]

Relative to LLaVA, InternVL pursues a heavier vision encoder (5.5B+ vs. roughly 0.3B for LLaVA's CLIP backbone) and a more elaborate pre-training schedule.^[1] Relative to Qwen-VL and Qwen2.5-VL, the two families have repeatedly traded leadership on MMMU and OCRBench, with InternVL holding the lead at the 70+ MMMU threshold in late 2024 and Qwen2.5-VL closing the gap in early 2025.^[4]^[11] Compared to proprietary models such as GPT-4o, Claude 3.5 Sonnet, and the Gemini family, the InternVL series has at various points exceeded them on text-heavy benchmarks (OCRBench, DocVQA, ChartQA) and on MathVista, while continuing to trail on the hardest multi-disciplinary reasoning benchmarks such as MMMU-Pro and MMVet (GPT-4-Turbo split).^[11]^[14]

Architecturally the closest historical comparison is Google's ViT-22B, an even larger vision transformer that was reported but not openly released; InternViT-6B, at roughly a quarter of that scale, was the first openly released ViT to enter the multi-billion-parameter regime and remained the default backbone for the OpenGVLab line for multiple generations.^[1]^[8] Compared to the BLIP-2 and Flamingo lineage, which used compact resamplers to feed queried features into a frozen LLM, InternVL's QLLaMA was an unusually heavy alignment module, although the team eventually moved to a simple MLP connector once the vision encoder itself was strong enough to carry most of the representation work.^[2]^[11]

References

[1] Chen, Zhe et al., "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks", arXiv, 2023-12-21. https://arxiv.org/abs/2312.14238. Accessed 2026-06-22.
[2] Chen, Zhe et al., "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites", arXiv, 2024-04-25. https://arxiv.org/abs/2404.16821. Accessed 2026-06-22.
[3] Chen, Zhe et al., "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks" (full text), ar5iv mirror, 2023-12-21. https://ar5iv.labs.arxiv.org/html/2312.14238. Accessed 2026-06-22.
[4] Chen, Zhe et al., "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling", arXiv, 2024-12-06. https://arxiv.org/abs/2412.05271. Accessed 2026-06-22.
[5] Zhu, Jinguo et al., "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models", arXiv, 2025-04-14. https://arxiv.org/abs/2504.10479. Accessed 2026-06-22.
[6] Wang, Weiyun et al., "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency", arXiv, 2025-08-25. https://arxiv.org/abs/2508.18265. Accessed 2026-06-22.
[7] OpenGVLab, "InternVL: A Pioneering Open-Source Alternative to GPT-4o", GitHub repository README, 2025. https://github.com/OpenGVLab/InternVL. Accessed 2026-06-22.
[8] OpenGVLab, "InternViT-6B-448px-V1-5 model card", Hugging Face, 2024. https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5. Accessed 2026-06-22.
[9] OpenGVLab, "OpenGVLab organisation page on Hugging Face", Hugging Face, 2025. https://huggingface.co/OpenGVLab. Accessed 2026-06-22.
[10] OpenGVLab, "InternVL 1.5: How Far Are We to GPT-4V?", InternVL project blog, 2024-04-30. https://internvl.github.io/blog/2024-04-30-InternVL-1.5/. Accessed 2026-06-22.
[11] OpenGVLab, "InternVL2-Llama3-76B model card", Hugging Face, 2024-07. https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B. Accessed 2026-06-22.
[12] OpenGVLab, "InternVL2: Better than the Best", InternVL project blog, 2024-07-02. https://internvl.github.io/blog/2024-07-02-InternVL-2.0/. Accessed 2026-06-22.
[13] OpenGVLab, "InternVL2.5: Pushing the Boundaries of Open-Source MLLMs", InternVL project blog, 2024-12-05. https://internvl.github.io/blog/2024-12-05-InternVL-2.5/. Accessed 2026-06-22.
[14] OpenGVLab, "InternVL3: Native Multimodal Pre-Training and Test-Time Recipes", InternVL project blog, 2025-04-11. https://internvl.github.io/blog/2025-04-11-InternVL-3.0/. Accessed 2026-06-22.
