InternVL
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,364 words
InternVL is a family of open-source multimodal large language models developed by the OpenGVLab research group at the Shanghai Artificial Intelligence Laboratory in collaboration with academic partners including Nanjing University, the University of Hong Kong, the Chinese University of Hong Kong, Fudan University, SenseTime Research, and Tsinghua University.[1] The series was introduced in late 2023 with the goal of scaling a vision foundation model to roughly the same parameter budget as a contemporary large language model, then aligning the two through staged training so that the resulting system could rival proprietary multimodal models such as GPT-4V, GPT-4o, Claude 3.5 Sonnet, and Gemini.[1][2] The defining architectural choice is a 5.9-billion-parameter vision transformer called InternViT, which is paired with an instruction-tuned language model and a connector module (a learned-query "QLLaMA" middleware in the original release, or a lightweight multi-layer perceptron projector in later versions).[1][3] Successive releases (InternVL 1.0 in December 2023, InternVL 1.5 in April 2024, InternVL 2 in July 2024, InternVL 2.5 in December 2024, InternVL3 in April 2025, and InternVL3.5 in August 2025) have iterated on data curation, dynamic high-resolution input, reinforcement learning, and native multimodal pre-training, with the InternVL 2.5 78B model becoming the first open-source multimodal large language model to exceed 70 percent on the MMMU validation benchmark.[4][5][6]
| Field | Value |
|---|---|
| Developer | OpenGVLab, Shanghai AI Laboratory and academic partners[1] |
| Initial paper | arXiv:2312.14238, 21 December 2023[1] |
| Conference | CVPR 2024 (Oral)[7] |
| Latest major release | InternVL3.5 (25 August 2025, arXiv:2508.18265)[6] |
| Code repository | github.com/OpenGVLab/InternVL[7] |
| License | MIT for project code and InternViT weights; LLM weights inherit base-model licenses[8][9] |
| Vision backbone | InternViT-6B (5.9B params, later trimmed to 5.54B)[1][8] |
| Largest dense model | InternVL3-78B[5] |
| Largest sparse model | InternVL3.5-241B-A28B (mixture-of-experts)[6][7] |
When InternVL was first proposed in December 2023, the gap between language-model scale and vision-model scale was conspicuous. State-of-the-art language models had crossed the hundred-billion-parameter threshold, while open-source vision encoders used in multimodal systems were typically capped near 0.3 to 1 billion parameters (for example the CLIP ViT-L/14 at 0.3B or EVA-CLIP-g at 1B). Existing vision-language systems generally bolted such a comparatively small vision encoder onto a much larger language model via a thin projection layer.[1] Zhe Chen and collaborators argued that this imbalance left visual representation as the bottleneck and proposed scaling a Vision Transformer to 6 billion parameters, comparable in spirit to Google's earlier ViT-22B but openly released, then aligning it with a language model through staged training in which a module initialised from multilingual LLaMA-7B served as the bridge between vision and language.[1]
The first InternVL paper, "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks", was submitted to arXiv on 21 December 2023 by Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai.[1] It was accepted as an Oral presentation at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2024.[7] Code and pre-trained weights were published on GitHub under the OpenGVLab organisation, where the repository has accumulated more than ten thousand stars as of 2025.[7]
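The series moved quickly after its initial release, with InternVL 1.5 following in April 2024, InternVL 2 in July 2024, InternVL 2.5 in December 2024, InternVL3 in April 2025, and InternVL3.5 in August 2025.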
Throughout the series the authorship and institutional affiliations remained centred on the OpenGVLab team at Shanghai AI Laboratory, with Zhe Chen leading the first three papers and Jinguo Zhu leading InternVL3.[1][4][5]
The InternViT-6B encoder is a vanilla, decoder-free Vision Transformer (ViT) architecture, designed to scale CLIP-style training to language-model scale. According to the original paper, the authors performed a hyperparameter sweep over depth in {32, 48, 64, 80}, head dimension in {64, 128}, and MLP ratio in {4, 8}, evaluating each candidate via contrastive learning on a 100-million subset of LAION-en for accuracy, throughput, and training stability.[1] The selected configuration uses a depth of 48 transformer blocks, a hidden width of 3,200, an MLP intermediate dimension of 12,800, 25 attention heads, and a 14 by 14 patch tokenizer, yielding 5.9 billion parameters in total.[1]
In the InternVL 1.5 update, the last three transformer blocks were discarded from the encoder, producing the InternViT-6B-448px-V1-5 variant with 45 blocks and 5.54 billion parameters. The patch size remained 14, while the encoder was retrained at an input resolution of 448 by 448 pixels and combined with dynamic tiling for high-resolution images.[8] Subsequent versions (V2.5 and V3) refined the encoder weights through continued pre-training on broader OCR-heavy datasets that included LAION-en, LAION-zh, COYO, GRIT, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and PaddleOCR-processed corpora.[8] The encoder is released independently of the chat models, with checkpoints published as OpenGVLab/InternViT-6B-224px, OpenGVLab/InternViT-6B-448px-V1-2, OpenGVLab/InternViT-6B-448px-V1-5, and OpenGVLab/InternViT-6B-448px-V2_5, all under the MIT License.[8]
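The quoted parameter counts can be sanity-checked from the architecture figures alone. The sketch below is a back-of-envelope estimate rather than the authors' own accounting: it counts only the attention and MLP weight matrices in each block and ignores biases, normalisation layers, and patch or position embeddings. The function names are illustrative.

```python
def vit_block_params(hidden: int, mlp_dim: int) -> int:
    """Weight-matrix parameters in one transformer block (biases and norms ignored)."""
    attention = 4 * hidden * hidden  # query, key, value, and output projections
    mlp = 2 * hidden * mlp_dim       # up-projection and down-projection
    return attention + mlp


def vit_params(depth: int, hidden: int = 3200, mlp_dim: int = 12800) -> int:
    """Approximate encoder size for a given number of blocks."""
    return depth * vit_block_params(hidden, mlp_dim)


if __name__ == "__main__":
    # 48 blocks: original InternViT-6B; 45 blocks: the trimmed V1-5 variant.
    for depth in (48, 45):
        print(depth, "blocks:", round(vit_params(depth) / 1e9, 2), "billion parameters")
```

Under these assumptions the 48-block configuration comes to about 5.90 billion parameters and the 45-block one to about 5.53 billion, close to the 5.9B and 5.54B figures quoted above. The hidden width of 3,200 split across 25 heads also gives a head dimension of 128, one of the two values in the reported sweep.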
In the original InternVL paper, the vision encoder and the downstream LLM were not connected with a thin projection layer. Instead, the authors inserted an 8-billion-parameter middleware module called QLLaMA. QLLaMA is built on multilingual LLaMA-7B weights, augmented with 96 learnable query tokens and a stack of cross-attention layers that fuse visual features into the language hidden state, adding roughly 1 billion newly trainable parameters.[1] The motivation for QLLaMA is twofold: it acts as a powerful glue layer (analogous in spirit to the Q-Former used in BLIP-2 but roughly 42 times larger by parameter count), and it can serve as a standalone vision encoder for tasks that do not require a full chat model, such as image-text retrieval.[1]
From InternVL 1.5 onward, the QLLaMA module was retired in most public chat models in favour of a much simpler two-layer MLP projector that maps InternViT features into the embedding space of the downstream LLM. The simpler connector reduced model size while leaving most performance gains attributable to the encoder, data, and language backbone.[2][11]
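The original InternVL pipeline used three explicit stages: vision-language contrastive pre-training, vision-language generative training, and supervised fine-tuning (SFT).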
The three-stage recipe enabled a coarse-to-fine alignment: contrastive learning produced general-purpose visual features, generative training learned to bind features to natural language, and SFT taught the system to follow user instructions.[1]
The most important architectural change introduced in InternVL 1.5 was dynamic high-resolution tiling. Rather than resizing every input image to a single fixed resolution, the system selects an aspect-ratio-aware grid of 448 by 448 tiles. During training the model uses between 1 and 12 tiles per image; during inference it can scale up to 40 tiles, which is enough to support inputs of approximately 4K resolution.[2][10] A global "thumbnail" tile is appended so that the model retains an overview of the full image alongside the higher-resolution patches.[2] This mechanism made InternVL 1.5 particularly strong on OCR, document analysis, and chart understanding, where small text and fine layout details would otherwise be lost.
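The grid selection described above can be sketched as a small search over tile layouts. This is a simplified illustration, loosely modelled on the project's public preprocessing code and not a copy of it. The tie-breaking rule (prefer a larger grid only when the image covers at least half of its pixel area), the choice to add the thumbnail only when the image is actually split, and all function names are assumptions of this sketch.

```python
from math import inf


def candidate_grids(min_tiles: int, max_tiles: int) -> list[tuple[int, int]]:
    """All (cols, rows) layouts whose tile count lies in [min_tiles, max_tiles]."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]
    # Visit smaller grids first so that ties start from the cheapest layout.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def choose_grid(width: int, height: int, min_tiles: int = 1,
                max_tiles: int = 12, tile: int = 448) -> tuple[int, int]:
    """Pick the layout whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), inf
    for cols, rows in candidate_grids(min_tiles, max_tiles):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * tile * tile * cols * rows:
            # Same aspect-ratio error: move to the larger grid only if the
            # image has enough pixels to fill at least half of it.
            best = (cols, rows)
    return best


def tile_count(width: int, height: int, **kwargs) -> int:
    """Number of 448x448 inputs, including the global thumbnail if the image is split."""
    cols, rows = choose_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + 1 if tiles > 1 else tiles


if __name__ == "__main__":
    print(choose_grid(1792, 448))   # wide banner-like image
    print(choose_grid(896, 896))    # square image
    print(tile_count(896, 896))
```

Raising `max_tiles` from 12 to 40 at inference time only enlarges the set of candidate layouts; the selection rule itself is unchanged, which is why a model trained with at most 12 tiles can still be run at the larger setting.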
InternVL 2 introduced what the team called "progressive alignment": the vision encoder is trained alongside small language models first, then progressively paired with larger LLMs without retraining the encoder from scratch.[11] Each of the seven public sizes used one of two vision encoders (a 304M-parameter ViT for the smaller variants, the 5.54B InternViT for the 26B, 40B, and 76B variants) and paired it with a different LLM backbone, ranging from a 0.5B Qwen-derived model at the 1B tier up to Hermes-2-Theta-Llama-3-70B (a community fine-tune of Llama 3 70B) at the 76B tier.[11][12] Training used an 8k context window and combined long texts, multi-image conversations, and video clips, in contrast to the largely single-image data used in InternVL 1.5.[12]
The InternVL 2.5 paper, "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling", systematically dissected three orthogonal axes: vision-encoder scale, LLM scale, and inference-time strategies. The authors observed a 3.7-point gain on MMMU when applying a DPO-style refined SFT with explicit reasoning prompts at inference, an early demonstration of multimodal chain-of-thought as a tractable lever for open models.[4][13] Training data was filtered down to roughly 16 million high-quality samples in a structured ChatML format that combined captioning, OCR, math, and chat data.[13]
InternVL3, published as arXiv:2504.10479 on 14 April 2025 with Jinguo Zhu as lead author and 50 other contributors, departed from the staged contrastive-then-generative recipe in favour of a "native multimodal pre-training" paradigm. Instead of adapting a text-only LLM after the fact, the team interleaved multimodal corpora (image-text pairs and video-text pairs) with pure-text corpora during a single joint pre-training stage, so the model acquires linguistic and visual competence at the same time.[5][14] The paper introduces Variable Visual Position Encoding (V2PE), which assigns smaller position increments to visual tokens than to text tokens, allowing image-heavy contexts to fit comfortably inside the model's position window. The training pipeline also adds a Mixed Preference Optimization stage that uses both online and offline rewards to mitigate the standard teacher-forcing distribution shift between training and inference.[5][14]
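The effect of V2PE on the position budget can be illustrated with a few lines of code. This is a minimal sketch of the idea as described above, not the released implementation: text tokens advance the position counter by 1, visual tokens advance it by a smaller step `delta`, and the first token sits at position 0. The function name and the example value `delta = 0.25` are assumptions chosen for illustration.

```python
def v2pe_positions(is_visual: list[bool], delta: float = 0.25) -> list[float]:
    """Position index for each token: +1 per text token, +delta per visual token."""
    positions = []
    current = 0.0
    for index, visual in enumerate(is_visual):
        if index > 0:
            current += delta if visual else 1.0
        positions.append(current)
    return positions


if __name__ == "__main__":
    # One text token, four visual tokens, one text token.
    sequence = [False, True, True, True, True, False]
    print(v2pe_positions(sequence, delta=0.25))
    print(v2pe_positions(sequence, delta=1.0))  # conventional uniform positions
```

With `delta = 0.25` the six-token sequence ends at position 2.0 instead of 5.0, so the four visual tokens together occupy the position range that a single text token would. Scaled up to thousands of visual tokens per image, this is how image-heavy contexts stay within the model's position window.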
The InternVL3 series spans seven sizes from 1B to 78B parameters, sharing weights with the InternViT-6B-448px-V2_5 encoder for the larger variants and with a smaller 304M ViT for the compact ones. InternVL3 expanded the scope of the model family beyond static images to include GUI agents, 3D scene understanding, tool use, and industrial inspection, partly through targeted data mixing rather than architectural changes.[14]
InternVL3.5, arXiv:2508.18265, submitted on 25 August 2025 with Weiyun Wang as lead author and 74 co-authors, introduced three additional system-level techniques.[6] Cascade Reinforcement Learning splits post-training into an offline reinforcement learning warm-up that uses pre-collected preference pairs and a stable convergence objective, followed by an online stage that refines the model against a process reward model called VisualPRM. The combination yields up to a 16.0 percent gain in overall reasoning performance over InternVL3 on multimodal benchmarks.[6] The Visual Resolution Router (ViR) is a lightweight controller that selects per-tile resolutions at inference time, lowering compute when easy regions of an image do not need high resolution. Decoupled Vision-Language Deployment (DvD) physically separates the vision encoder and the language model across different GPU groups so that each can be parallelised at its own rate, yielding a 4.05 times inference speed-up at the 241B scale.[6] The flagship InternVL3.5-241B-A28B is a mixture-of-experts model with 241 billion total parameters and 28 billion activated per token, the first MoE entry in the family.[6][7]
The InternVL family has grown into a wide grid of sizes, vision encoders, and language backbones. The table below lists the most commonly cited dense chat models, plus the MoE flagship.
| Model | Release | Vision encoder | Language model | Total params | Notable benchmark |
|---|---|---|---|---|---|
| InternVL-Chat-V1-2 | Feb 2024 | InternViT-6B-448px-V1-2 | Nous-Hermes-2-Yi-34B | ~40B | OCRBench >700[11] |
| InternVL-Chat-V1-5 | 18 Apr 2024 | InternViT-6B-448px-V1-5 | InternLM2-Chat-20B | 25.51B | DocVQA 90.9 percent[2][10] |
| InternVL2-1B | 8 Jul 2024 | 304M ViT | Qwen2-0.5B | 0.94B | Mobile-friendly[11][12] |
| InternVL2-2B | 4 Jul 2024 | 304M ViT | InternLM2-Chat-1.8B | 2.21B | MMMU 36.3[12] |
| InternVL2-4B | 4 Jul 2024 | 304M ViT | Phi-3-mini-4k-instruct | 4.15B | MMMU 47.9[12] |
| InternVL2-8B | 4 Jul 2024 | 304M ViT | InternLM2_5-7B-chat | 8.08B | MMMU 49.3[12] |
| InternVL2-26B | 4 Jul 2024 | InternViT-6B-V1-5 | InternLM2-Chat-20B | 25.51B | MMMU 51.2[12] |
| InternVL2-40B | 8 Jul 2024 | InternViT-6B-V1-5 | Nous-Hermes-2-Yi-34B | 40.07B | MMMU 55.2[12] |
| InternVL2-Llama3-76B | 15 Jul 2024 | InternViT-6B-V1-5 | Hermes-2-Theta-Llama-3-70B | 76.3B | MMMU 58.2[11][12] |
| InternVL2.5-78B | 5 Dec 2024 | InternViT-6B-V2_5 | Qwen2.5-72B-Instruct | 78.41B | MMMU 70.1[4][13] |
| InternVL3-78B | 11 Apr 2025 | InternViT-6B-V2_5 | Qwen2.5-72B-Instruct | ~78B | MMMU 72.2[5][14] |
| InternVL3.5-241B-A28B | 25 Aug 2025 | InternViT-6B-V2_5 | MoE LLM (241B total, 28B active) | 241B | MMMU 77.7[6] |
The smaller variants are intended for on-device or edge deployment; an explicit "Mini-InternVL" line described in arXiv:2410.16261 demonstrated that chat models of up to 4B parameters could retain about 90 percent of the benchmark performance of the largest model in the family while using only about 5 percent of its parameters, indicating that the architecture distils efficiently.[9][11] The MoE flagship in InternVL3.5 was the first model in the family to break from a single dense LLM backbone, using sparsity to keep activated compute close to that of a 28B model while total capacity scales to 241B.[6][7]
Across the standard multimodal benchmark suite, InternVL has tracked or overtaken proprietary frontier models on text-heavy tasks while remaining a step behind them on the hardest reasoning tasks.
| Benchmark | GPT-4o (20240513) | Claude 3.5 Sonnet | InternVL2-Llama3-76B | InternVL2.5-78B | InternVL3-78B |
|---|---|---|---|---|---|
| MMMU (val) | 69.1 | 68.3 | 58.2 | 70.1 | 72.2 |
| MathVista (testmini) | 63.8 | 67.7 | 65.5 | 72.3 | 79.0 |
| OCRBench | 736 | 788 | 839 | 854 | n.r. |
| DocVQA (test) | 92.8 | 95.2 | 94.1 | 95.1 | 95.4 |
| ChartQA (test) | 85.7 | 90.8 | 88.4 | n.r. | n.r. |
| TextVQA (val) | n.r. | n.r. | 84.4 | n.r. | n.r. |
| OpenCompass average | 69.9 | 67.9 | 71.0 | n.r. | n.r. |
| MMBench-EN (test) | 83.4 | 79.7 | 86.5 | n.r. | n.r. |
In the table, "n.r." marks scores that were not reported. GPT-4o, Claude 3.5 Sonnet, and InternVL2-Llama3-76B scores come from the official InternVL2 model card, which cross-references public reports as of mid-2024.[11] InternVL2.5 and InternVL3 numbers come from their respective papers and project blogs.[4][5][13][14] On OCRBench the open InternVL line already outperformed both GPT-4o and Claude 3.5 Sonnet at the 76B scale in mid-2024, and the lead has widened in subsequent versions.[11][13] MathVista was a notable acceleration point: the InternVL3-78B score of 79.0 on testmini placed it ahead of GPT-4o (60.0), Claude 3.5 Sonnet (66.8), and Gemini 2.0-Pro (71.3) in the project's own comparison table, whose baseline figures differ from the mid-2024 values shown above.[14]
On MMMU, InternVL 2.5-78B's 70.1 in December 2024 was the first publicly verified open MLLM score above the 70 threshold; InternVL3-78B subsequently raised it to 72.2 in April 2025, and InternVL3.5-241B-A28B reached 77.7, closing most of the gap with the strongest closed-source frontier models reported at the time.[4][5][6][7] The 8B InternVL3.5 variant achieves 73.4 on MMMU, illustrating how rapidly small open models have caught up to the original 78B-class scores.[6]
InternVL's training data has evolved over time, with the contrastive pre-training corpus dominated by web image-text pairs and the supervised fine-tuning corpus increasingly weighted toward OCR, math, and chat data.
LAION datasets and COYO-700M provide the bulk of the contrastive pre-training data; ImageNet and ADE20K are used for evaluating linear probing and segmentation transfer; OmniCorpus contributes the long-form text and interleaved document data used in InternVL2 and InternVL3.[1][14]
Because every chat model in the series is published openly under MIT or LLaMA-compatible licenses (depending on the language backbone), InternVL has been widely adopted both as a baseline for academic research and as a practical drop-in for multimodal applications.[7][8][11] The official documentation describes deployment paths through Hugging Face Transformers, vLLM, SGLang, and LMDeploy, and pre-built Docker images are provided for production use.[11] The InternVL Space on Hugging Face hosts a public chat demo that has accumulated several hundred likes and a steady stream of usage since launch.[7]
Application domains the project explicitly demonstrates include document and chart understanding (DocVQA, ChartQA, InfographicVQA), scene-text OCR (TextVQA, OCRBench), mathematical reasoning over images (MathVista, MathVerse, MathVision), visual grounding (RefCOCO, RefCOCO+, RefCOCOg), multimodal hallucination detection (POPE, HallBench), and multilingual evaluation (MMBench-CN, CCBench).[11][13] Beyond benchmarks, the InternVL3 release added explicit support for GUI agents (interacting with software interfaces via screenshots), 3D scene understanding, industrial image analysis, and tool use, while InternVL3.5 extended these into embodied agents.[5][14][6]
The model family has been used as a vision backbone for many downstream open-source systems, including VideoChat-Flash and several derivative video chat assistants released by OpenGVLab and the broader community.[7] In addition, the InternViT-6B encoder is itself reused independently of the chat models, for instance as a feature extractor in retrieval pipelines or as a frozen backbone in research on multimodal alignment, taking advantage of the fact that it is released separately under the MIT License together with its weights and CLIP-style image processor.[8]
The InternVL series sits at the intersection of three trends in mid-decade multimodal research.
First, it demonstrated that scaling vision encoders to language-model parity, rather than gluing a small encoder to a giant LLM, can yield large benchmark gains. The 5.9-billion-parameter InternViT-6B reached 88.2 percent linear-probing accuracy on ImageNet-1K and 47.2 mIoU on ADE20K semantic segmentation with a linear probe, both competitive with or better than other contemporary open encoders.[1] This validated the premise that the vision side of multimodal models had been undersized.
Second, it served as a public counterpoint to closed-source MLLMs. The InternVL 1.5 paper, titled "How Far Are We to GPT-4V?", explicitly framed the open vs. closed gap as the central research question, and the team's later 2.5 and 3 papers reported the first MMMU scores above 70 and 72 percent respectively from any openly released MLLM.[2][4][5] By making weights and training recipes available under permissive licenses, the project pulled the open-source ceiling up and gave smaller research groups access to a strong multimodal baseline for fine-tuning, distillation, and ablation studies.[7][8]
Third, it has been an early adopter of techniques from the broader LLM stack, including DPO-style preference optimisation in 2.5, native multimodal pre-training in 3, and reinforcement learning with process reward models in 3.5.[4][5][6] In doing so the project has provided a useful, well-documented testbed for transferring alignment ideas from text-only models into the multimodal setting.
Despite its strong benchmark scores, the InternVL family inherits several limitations common to large open multimodal models.
The first is hallucination. The project's own POPE and HallBench evaluations show that even the largest variants of InternVL 2 (76B) and InternVL 2.5 (78B) hallucinate objects and attributes in a non-trivial fraction of images, particularly on adversarial prompts. InternVL2-Llama3-76B scored 55.2 on HallBench average, above Claude 3.5 Sonnet (49.9) but below the InternVL2-40B model on some splits.[11] Hallucination control is one of the explicit motivations for the Mixed Preference Optimization stage introduced in InternVL3 and the Cascade RL stage in InternVL3.5, but the problem has not been eliminated.[5][6]
The second is the dependency on heterogeneous language backbones. Different InternVL 2 sizes use Qwen2, InternLM2, Phi-3, Nous-Hermes-2-Yi, and Hermes-2-Theta-Llama-3, each of which carries its own license, alignment quirks, and tokenizer. While InternVL's own project code is MIT-licensed, the chat models inherit the more restrictive terms of their LLM backbones, complicating commercial use.[9][11] InternVL3.5's MoE flagship reduces but does not remove this concern.[6]
The third is the cost and complexity of the dynamic high-resolution tiling. At inference time a 4K image can produce up to 40 tiles of 448 by 448 pixels each, each consuming hundreds of visual tokens after the MLP projection. This dramatically increases compute and memory, which the Visual Resolution Router in InternVL3.5 only partially addresses.[2][6]
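The scale of this cost follows from simple arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement: the 448-pixel tile and 14-pixel patch come from the encoder description, while the 4x reduction in token count before the language model is an assumption of this sketch (the text above says only that each tile costs hundreds of tokens). Function names are illustrative.

```python
def patches_per_tile(tile: int = 448, patch: int = 14) -> int:
    """Number of ViT patches produced by one square tile."""
    side = tile // patch
    return side * side


def visual_tokens(num_tiles: int, reduction: int = 4, thumbnail: bool = True) -> int:
    """Visual tokens passed to the language model for one image.

    `reduction` is the assumed factor by which patch tokens are merged
    before projection; `thumbnail` adds the global overview tile.
    """
    tiles = num_tiles + (1 if thumbnail else 0)
    return tiles * patches_per_tile() // reduction


if __name__ == "__main__":
    print(patches_per_tile())   # patches per 448x448 tile
    print(visual_tokens(12))    # training-time upper bound on tiles
    print(visual_tokens(40))    # inference-time upper bound on tiles
```

Under these assumptions a 40-tile image plus thumbnail occupies more than ten thousand positions in the language model's context before any text is added, roughly three times the cost of the 12-tile training maximum.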
Finally, while the project provides open weights and open inference code, the full training data pipelines (especially the curated SFT and preference data used in 2.5 and 3) are not all publicly released in identical form, making exact reproduction of the strongest benchmark numbers difficult. The team has stated an intent to release training data alongside InternVL3, but historically the released mixtures have lagged behind the model releases.[5][14]
InternVL is part of a wider 2024 to 2025 wave of open multimodal large language models. Direct competitors include the LLaVA series from Microsoft and the University of Wisconsin-Madison (including LLaVA-1.5 and LLaVA-NeXT, also known as LLaVA-1.6), MiniCPM-V from ModelBest and Tsinghua, the Qwen-VL family (Qwen-VL, Qwen2-VL, Qwen2.5-VL) from Alibaba, DeepSeek-VL and DeepSeek-VL2 from DeepSeek, and Pixtral from Mistral AI, among others.[7]
Relative to LLaVA, InternVL pursues a heavier vision encoder (5.5B+ vs. roughly 0.3B for LLaVA's CLIP backbone) and a more elaborate pre-training schedule.[1] Relative to Qwen-VL and Qwen2.5-VL, the two families have repeatedly traded leadership on MMMU and OCRBench, with InternVL holding the lead at the 70+ MMMU threshold in late 2024 and Qwen2.5-VL closing the gap in early 2025.[4][11] Compared to proprietary models such as GPT-4o, Claude 3.5 Sonnet, and the Gemini family, the InternVL series has at various points exceeded them on text-heavy benchmarks (OCRBench, DocVQA, ChartQA) and on MathVista, while continuing to trail on the hardest multi-disciplinary reasoning benchmarks such as MMMU-Pro and MMVet (GPT-4-Turbo split).[11][14]
Architecturally the closest historical comparison is Google's ViT-22B, an even larger vision transformer that was reported but not openly released; InternViT-6B, at roughly a quarter of that scale, was among the first openly released ViTs to enter the multi-billion-parameter regime and remained the default backbone for the OpenGVLab line for multiple generations.[1][8] Compared to the BLIP-2 and Flamingo lineage, which used compact resamplers to feed queried features into a frozen LLM, InternVL's QLLaMA was an unusually heavy alignment module, although the team eventually moved to a simple MLP connector once the vision encoder itself was strong enough to carry most of the representation work.[2][11]