InternVL3
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,440 words
InternVL3 is an open-weights family of multimodal large language models released on April 11, 2025 by OpenGVLab, the general vision team associated with Shanghai AI Laboratory. It is the third major generation of the InternVL series of vision-language models, which couple an InternViT vision encoder with a text language model so that a single model can process images, video, documents, and text. The family spans seven sizes, from InternVL3-1B to InternVL3-78B, and its headline technical contribution is "native multimodal pre-training," in which language and vision are learned jointly in one pre-training stage rather than by bolting a vision encoder onto a finished text LLM.[1][2] At launch the largest model, InternVL3-78B, scored 72.2 on the MMMU multimodal reasoning benchmark, which the authors reported as a new state of the art among open-source multimodal LLMs.[1][3]
InternVL3 is a series of general-purpose multimodal models built for tasks that mix images and language, including visual question answering, document and chart understanding, optical character recognition (OCR), multi-image and video reasoning, visual grounding, multilingual understanding, and graphical user interface (GUI) agent tasks.[2][3] Each model follows the "ViT-MLP-LLM" design shared across the InternVL line: an InternViT vision transformer encodes image tiles into visual tokens, a small multilayer perceptron (MLP) projector maps those tokens into the language model's embedding space, and a pretrained language model consumes the combined visual and text tokens.[2] Images are handled with dynamic resolution: each input is split into tiles of 448 by 448 pixels, up to 36 tiles during training and up to 128 tiles at inference, which lets the models read high-resolution documents and complex scenes.[2]
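The dynamic-resolution tiling described above can be illustrated with a minimal sketch. The 448-pixel tile size and the tile caps come from the text; the selection rule used here (pick the grid whose aspect ratio is closest to the image's, breaking ties by closest pixel area) is an assumption made for illustration and is not confirmed by the source as the preprocessing InternVL3 actually uses. The names `choose_tile_grid` and `tile_boxes` are hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple

TILE = 448  # tile side in pixels, as stated in the text


def choose_tile_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Return a (cols, rows) grid with cols * rows <= max_tiles.

    Assumed rule (illustrative only): minimise the aspect-ratio difference
    to the image, then prefer the grid whose pixel area is closest to the
    image's own area.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    target = Fraction(width, height)  # exact ratio, avoids float ties

    # Every grid that respects the tile budget.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles // cols + 1)
    ]

    def score(grid: Tuple[int, int]) -> Tuple[Fraction, int]:
        cols, rows = grid
        ratio_diff = abs(target - Fraction(cols, rows))
        area_diff = abs(cols * rows * TILE * TILE - width * height)
        return (ratio_diff, area_diff)

    return min(candidates, key=score)


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Pixel boxes (left, top, right, bottom) of each tile, row by row,
    in an image already resized to (cols * TILE) x (rows * TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    grid = choose_tile_grid(1600, 800, max_tiles=12)
    print(grid, len(tile_boxes(*grid)))
```

Under this rule a wide image is cut into more columns than rows, and raising `max_tiles` (for example from a training cap to a larger inference cap) only widens the set of grids the function may choose from.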
The release accompanied a technical report, "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models," posted to arXiv (arXiv:2504.10479) by a large OpenGVLab-led author group with Jinguo Zhu as lead author.[1] The report frames InternVL3 as an upgrade to the prior InternVL 2.5 generation that improves multimodal perception and reasoning while extending the models into newer capabilities such as tool use, GUI agents, industrial image analysis, and 3D vision.[2] The authors stated a commitment to open science, including a plan to publicly release training data alongside model weights.[1]
InternVL is developed by OpenGVLab, a vision-focused research effort connected to Shanghai AI Laboratory that builds large open vision and vision-language foundation models. The first InternVL paper was an oral presentation at the CVPR 2024 conference and positioned the series as an open-source alternative aiming to approach the capability of proprietary systems like GPT-4o.[2] The line progressed through InternVL 1.x and the InternVL 2.x family (including InternVL 2.5), steadily scaling the InternViT vision encoder and the paired language models and improving training recipes.[2]
A central building block is InternViT, OpenGVLab's family of vision transformer encoders, which InternVL3 reuses in two sizes: a roughly 300-million-parameter InternViT-300M-448px-V2_5 for the smaller models and a roughly 6-billion-parameter InternViT-6B-448px-V2_5 for the two largest.[2][4] On the language side, InternVL3 mostly initializes from Alibaba's Qwen2.5 language models, with one variant instead built on Shanghai AI Laboratory's InternLM language model.[2][4] This places InternVL3 alongside other prominent open multimodal models such as Qwen2-VL, Qwen2.5-VL, and Meta's Llama 3.2 Vision, as well as closed systems including GPT-4o and Google's Gemini family.
OpenGVLab continued the series after InternVL3. In August 2025 it released InternVL3.5, which added reasoning and efficiency upgrades through a "Cascade Reinforcement Learning" recipe, a dynamic visual-resolution router, and a decoupled deployment framework, and broadened the lineup to scales from 1B up to a 241-billion-parameter mixture-of-experts model (InternVL3_5-241B-A28B), including a variant built on an open GPT-OSS language backbone.[5]
The defining change in InternVL3 is native multimodal pre-training. Earlier multimodal LLMs, including prior InternVL generations, typically started from a text-only LLM and then attached a vision encoder, aligning the two in a separate adaptation stage. InternVL3 instead consolidates language pre-training and multimodal alignment into a single stage, interleaving multimodal data (image-text, video-text, and interleaved image-text sequences) with large-scale pure-text corpora so the model acquires linguistic and visual abilities together.[1][2] The authors argue this avoids the complexity and potential misalignment of post-hoc adaptation while preserving strong text-only performance.[1]
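As a rough illustration of single-stage mixing, the sketch below draws training samples from a multimodal pool and a pure-text pool within one stream, rather than training on them in separate stages. It is a toy sampler under stated assumptions, not the authors' data pipeline: the function name `mixed_stream`, the `text_fraction` parameter, and the sample values are all hypothetical, and the source does not give the mixing ratio.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    multimodal: Sequence[str],
    text_only: Sequence[str],
    text_fraction: float,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, sample) pairs from one interleaved stream.

    text_fraction is a hypothetical knob: the probability that the next
    sample comes from the pure-text pool instead of the multimodal pool.
    """
    rng = random.Random(seed)  # seeded so the stream is reproducible
    while True:
        if rng.random() < text_fraction:
            yield ("text", rng.choice(text_only))
        else:
            yield ("multimodal", rng.choice(multimodal))


if __name__ == "__main__":
    # Placeholder sample identifiers, invented for this example.
    mm_pool = ["image-caption", "video-caption", "interleaved-doc"]
    text_pool = ["web-text", "code", "math"]
    for kind, sample in islice(mixed_stream(mm_pool, text_pool, 0.25), 5):
        print(kind, sample)
```

Because every batch drawn from such a stream can contain both kinds of data, a single training stage sees vision and language together, which is the property the paragraph above attributes to native multimodal pre-training.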
InternVL3 combines native pre-training with several other techniques:
InternVL3 was released as seven dense models sharing the same architecture and training recipe but differing in scale of the language model and, for the two largest, the vision encoder. Parameter figures are taken from the official model documentation.[4]
| Model | Total parameters | Vision encoder (InternViT) | MLP projector | Language model |
|---|---|---|---|---|
| InternVL3-1B | 938M | 300M-448px-V2_5 | 4.48M | Qwen2.5-0.5B |
| InternVL3-2B | 2.09B | 300M-448px-V2_5 | 8.66M | Qwen2.5-1.5B |
| InternVL3-8B | 7.94B | 300M-448px-V2_5 | 27.5M | Qwen2.5-7B |
| InternVL3-9B | 9.14B | 300M-448px-V2_5 | 33.6M | InternLM3-8B-Instruct |
| InternVL3-14B | 15.12B | 300M-448px-V2_5 | 47.2M | Qwen2.5-14B |
| InternVL3-38B | 38.39B | 6B-448px-V2_5 | 91.8M | Qwen2.5-32B |
| InternVL3-78B | 78.41B | 6B-448px-V2_5 | 172M | Qwen2.5-72B |
The five smaller models (1B through 14B) pair the lighter InternViT-300M encoder with progressively larger language backbones, while the 38B and 78B models use the 6-billion-parameter InternViT-6B encoder.[4] All variants except InternVL3-9B initialize their language component from Qwen2.5; the 9B model is built on Shanghai AI Laboratory's InternLM3-8B-Instruct.[2][4] This spread lets practitioners pick a size for on-device or budget-constrained deployment at the low end and for maximum capability at the high end.
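To make the size trade-off concrete, the sketch below estimates the memory needed just to hold each model's weights and picks the largest variant that fits a given budget. The parameter totals are copied from the table above; the 2 bytes per parameter figure assumes 16-bit weights (such as bfloat16), and the estimate deliberately ignores activations, the KV cache, and framework overhead, so real requirements will be higher. The helper names `weight_gib` and `largest_fitting` are hypothetical.

```python
from typing import Optional

# Total parameter counts from the table above.
TOTAL_PARAMS = {
    "InternVL3-1B": 0.938e9,
    "InternVL3-2B": 2.09e9,
    "InternVL3-8B": 7.94e9,
    "InternVL3-9B": 9.14e9,
    "InternVL3-14B": 15.12e9,
    "InternVL3-38B": 38.39e9,
    "InternVL3-78B": 78.41e9,
}


def weight_gib(params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GiB (2 bytes per parameter = 16-bit)."""
    return params * bytes_per_param / 2**30


def largest_fitting(budget_gib: float, bytes_per_param: int = 2) -> Optional[str]:
    """Name of the largest model whose weights fit in budget_gib, or None."""
    fitting = [
        (params, name)
        for name, params in TOTAL_PARAMS.items()
        if weight_gib(params, bytes_per_param) <= budget_gib
    ]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, params in TOTAL_PARAMS.items():
        print(f"{name}: {weight_gib(params):.1f} GiB of 16-bit weights")
    print(largest_fitting(24.0))
```

This is only a first-pass filter: a model that passes the weights-only check may still not run comfortably once image tiles and long contexts are added.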
OpenGVLab reported that InternVL3 improves on InternVL 2.5 across multimodal perception and reasoning. The headline result was InternVL3-78B's score of 72.2 on MMMU (Massive Multi-discipline Multimodal Understanding), which the technical report described as a new state of the art among open-source multimodal LLMs at the time of release.[1][3] The authors further stated that the model remained highly competitive with leading proprietary systems, naming ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro as comparison points, while retaining strong pure-language ability.[1] These figures are the developers' own reported results; as with most vendor-published model evaluations, exact numbers depend on prompting and the evaluation harness, so they are best read as indicative rather than independently verified.
Additional results published for InternVL3-78B in the model documentation include MathVista 79.0, DocVQA 95.4, and an OCRBench score of 906, with the documentation also reporting that InternVL3-78B outperforms Alibaba's Qwen2.5-VL-72B on most reasoning tasks and is broadly comparable to GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Pro on a range of multimodal tasks.[2] Across the family, OpenGVLab reported that each InternVL3 size tended to lead similarly sized open competitors on multimodal reasoning benchmarks.[2][3]
InternVL3 weights were published openly on Hugging Face under the OpenGVLab organization, where the larger models accumulated tens of thousands of downloads in the months after release.[4] The model cards list an Apache 2.0 license. However, most variants are derived from Qwen2.5, whose own license terms attach to derivatives, so the effective licensing for those Qwen-based variants is constrained by the Qwen license. Users have raised this discrepancy with OpenGVLab, and the applicable terms therefore depend on the underlying language backbone of a given size.[6] OpenGVLab also released code and, per the technical report, committed to releasing the underlying training data.[1]
InternVL3 is significant as one of the strongest open-weights multimodal LLM families of 2025 and as an early, widely cited demonstration that native multimodal pre-training (learning vision and language jointly from the start) can match or exceed the prevailing approach of adapting a finished text LLM. Its breadth of sizes, openly published weights, and competitive benchmark results made it a common research baseline and a frequent starting point for fine-tuning and downstream multimodal systems, a role continued by the later InternVL3.5 release.[2][5]