InternVL3
Last reviewed
Sources
6 citations
Review status
Source-backed
Revision
v2 · 1,672 words
InternVL3 is an open-weights family of multimodal large language models (MLLMs) released on April 11, 2025 by OpenGVLab, the general vision team associated with Shanghai AI Laboratory. Spanning seven sizes from InternVL3-1B to InternVL3-78B, it is the third major generation of the InternVL series of vision-language models. It introduced "native multimodal pre-training," in which language and vision are learned jointly in a single pre-training stage rather than by attaching a vision encoder to an already trained text-only LLM.[1][2] At launch the largest model, InternVL3-78B, scored 72.2 on the MMMU multimodal reasoning benchmark, which the authors reported as a new state of the art among open-source multimodal LLMs.[1][3]
Each model couples an InternViT vision encoder with a text language model (Qwen2.5 for six of the seven sizes), so a single model can process images, video, documents, and text. The technical report describes the core idea succinctly: "Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage."[1]
| Attribute | Detail |
|---|---|
| Developer | OpenGVLab (Shanghai AI Laboratory) |
| Release date | April 11, 2025 |
| Technical report | arXiv:2504.10479 (submitted April 14, 2025) |
| Model sizes | 7 dense models, InternVL3-1B to InternVL3-78B |
| Architecture | ViT-MLP-LLM (InternViT encoder + MLP projector + LLM) |
| Vision encoders | InternViT-300M-448px-V2_5 and InternViT-6B-448px-V2_5 |
| Language backbones | Qwen2.5 (six sizes), InternLM3-8B-Instruct (the 9B) |
| Headline result | InternVL3-78B: 72.2 on MMMU |
| License | Listed Apache 2.0/MIT, with Qwen license constraints on Qwen-based variants |
InternVL3 is a series of general-purpose multimodal models built for tasks that mix images and language, including visual question answering, document and chart understanding, optical character recognition (OCR), multi-image and video reasoning, visual grounding, multilingual understanding, and graphical user interface (GUI) agent tasks.[2][3] Each model follows the "ViT-MLP-LLM" design shared across the InternVL line: an InternViT vision transformer encodes image tiles into visual tokens, a small multilayer perceptron (MLP) projector maps those tokens into the language model's embedding space, and a pretrained language model consumes the combined visual and text tokens.[2] Images are handled with dynamic resolution: each input is split into tiles of 448 by 448 pixels, with up to 36 tiles during training and up to 128 tiles at inference, which lets the models read high-resolution documents and complex scenes.[2]
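The dynamic-resolution tiling described above can be illustrated with a minimal sketch. It is not OpenGVLab's preprocessing code: the function name `choose_tile_grid` and the tie-breaking rule are assumptions made for illustration. Only the 448-pixel tile size and the idea of a capped tile budget come from the text.

```python
from typing import Tuple

TILE = 448  # tile edge in pixels, as stated in the article


def choose_tile_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Pick a (cols, rows) grid with cols * rows <= max_tiles.

    Hypothetical rule: prefer the grid whose aspect ratio is closest to
    the image's; among equally close grids, prefer the one whose pixel
    area is closest to the image's own area.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    target_ratio = width / height
    image_area = width * height
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        # rows is bounded so the grid never exceeds the tile budget
        for rows in range(1, max_tiles // cols + 1):
            ratio_gap = abs(cols / rows - target_ratio)
            area_gap = abs(cols * rows * TILE * TILE - image_area)
            key = (ratio_gap, area_gap)
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best


if __name__ == "__main__":
    cols, rows = choose_tile_grid(1344, 896, max_tiles=12)
    print(cols, rows, cols * rows)  # 3 2 6
```

In this sketch the image would then be resized to `cols * 448` by `rows * 448` pixels and cut into `cols * rows` tiles, each encoded separately by the vision encoder. Raising `max_tiles` allows finer grids for large or elongated inputs at the cost of more visual tokens.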
The release accompanied a technical report, "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models," posted to arXiv (arXiv:2504.10479) by a large OpenGVLab-led author group with Jinguo Zhu as lead author.[1] The report frames InternVL3 as an upgrade to the prior InternVL 2.5 generation that improves multimodal perception and reasoning while extending the models into newer capabilities such as tool use, GUI agents, industrial image analysis, and 3D vision.[2] The authors stated a commitment to open science, writing that "in pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs."[1]
InternVL is developed by OpenGVLab, a vision-focused research effort connected to Shanghai AI Laboratory that builds large open vision and vision-language foundation models. The first InternVL paper was an oral presentation at the CVPR 2024 conference, and the project positions the series as an open-source alternative that aims to approach the capability of proprietary systems such as GPT-4o.[2] The line progressed through InternVL 1.x and the InternVL 2.x family (including InternVL 2.5), steadily scaling the InternViT vision encoder and the paired language models and improving training recipes.[2]
A central building block is InternViT, OpenGVLab's family of vision transformer encoders, which InternVL3 reuses in two sizes: a roughly 300-million-parameter InternViT-300M-448px-V2_5 for the five smaller models and a roughly 6-billion-parameter InternViT-6B-448px-V2_5 for the two largest.[2][4] On the language side, InternVL3 mostly initializes from Alibaba's Qwen2.5 models, with one variant instead built on OpenGVLab's own InternLM language model.[2][4] This places InternVL3 alongside other prominent open multimodal models such as Qwen2-VL, Qwen2.5-VL, and Meta's Llama 3.2 Vision, and in competition with closed systems including GPT-4o and Google's Gemini family.
OpenGVLab continued the series after InternVL3. In August 2025 it released InternVL3.5, which added reasoning and efficiency upgrades through a "Cascade Reinforcement Learning" recipe, a dynamic visual-resolution router, and a decoupled deployment framework, and broadened the lineup to scales from 1B up to a 241-billion-parameter mixture-of-experts model (InternVL3_5-241B-A28B), including a variant built on an open GPT-OSS language backbone.[5]
The defining change in InternVL3 is native multimodal pre-training. Earlier multimodal LLMs, including prior InternVL generations, typically started from a text-only LLM and then attached a vision encoder, aligning the two in a separate adaptation stage. InternVL3 instead consolidates language pre-training and multimodal alignment into a single stage, interleaving multimodal data (image-text, video-text, and interleaved image-text sequences) with large-scale pure-text corpora so the model acquires linguistic and visual abilities together.[1][2] The authors argue this avoids the complexity and potential misalignment of post-hoc adaptation while preserving strong text-only performance. OpenGVLab reports that, thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the corresponding Qwen2.5 chat models.[1][2]
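InternVL3 combines native pre-training with several other techniques described in the technical report: variable visual position encoding (V2PE) to support extended multimodal contexts, post-training with supervised fine-tuning (SFT) and mixed preference optimization (MPO), and test-time scaling strategies.[1]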
InternVL3 was released as seven dense models that share the same architecture and training recipe but differ in the scale of the language model and, for the two largest, of the vision encoder. Parameter figures are taken from the official model documentation.[4]
| Model | Total parameters | Vision encoder (InternViT) | MLP projector | Language model |
|---|---|---|---|---|
| InternVL3-1B | 938M | 300M-448px-V2_5 | 4.48M | Qwen2.5-0.5B |
| InternVL3-2B | 2.09B | 300M-448px-V2_5 | 8.66M | Qwen2.5-1.5B |
| InternVL3-8B | 7.94B | 300M-448px-V2_5 | 27.5M | Qwen2.5-7B |
| InternVL3-9B | 9.14B | 300M-448px-V2_5 | 33.6M | InternLM3-8B-Instruct |
| InternVL3-14B | 15.12B | 300M-448px-V2_5 | 47.2M | Qwen2.5-14B |
| InternVL3-38B | 38.39B | 6B-448px-V2_5 | 91.8M | Qwen2.5-32B |
| InternVL3-78B | 78.41B | 6B-448px-V2_5 | 172M | Qwen2.5-72B |
The five smaller models (1B through 14B) pair the lighter InternViT-300M encoder with progressively larger language backbones, while the 38B and 78B models use the 6-billion-parameter InternViT-6B encoder.[4] All variants except InternVL3-9B initialize their language component from Qwen2.5; the 9B model is built on OpenGVLab's InternLM3-8B-Instruct.[2][4] This spread lets practitioners choose a small model for on-device or budget-constrained deployment and a large one for maximum capability.
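As a rough illustration of how the size spread maps to deployment budgets, the sketch below estimates weights-only memory from the total parameter counts in the table above and picks the largest model that fits a given budget. The helper names and the 2-bytes-per-parameter default (16-bit weights) are assumptions for illustration; real memory use is higher once activations, the key-value cache, and framework overhead are included.

```python
from typing import Dict, Optional

# Total parameters in billions, copied from the table above.
TOTAL_PARAMS_B: Dict[str, float] = {
    "InternVL3-1B": 0.938,
    "InternVL3-2B": 2.09,
    "InternVL3-8B": 7.94,
    "InternVL3-9B": 9.14,
    "InternVL3-14B": 15.12,
    "InternVL3-38B": 38.39,
    "InternVL3-78B": 78.41,
}


def weight_memory_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    The default of 2 bytes assumes 16-bit weights (a hypothetical choice).
    """
    return params_b * bytes_per_param


def largest_fit(budget_gb: float, bytes_per_param: float = 2.0) -> Optional[str]:
    """Return the largest model whose weights fit in budget_gb, or None."""
    fitting = {
        name: params
        for name, params in TOTAL_PARAMS_B.items()
        if weight_memory_gb(params, bytes_per_param) <= budget_gb
    }
    return max(fitting, key=fitting.get) if fitting else None


if __name__ == "__main__":
    for budget in (8, 24, 80, 200):
        print(budget, "GB ->", largest_fit(budget))
```

Under these assumptions a 24 GB budget selects InternVL3-9B, because the 14B model's 16-bit weights alone come to about 30 GB. Passing a smaller `bytes_per_param` models quantized weights, though the article does not report quantized results.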
OpenGVLab reported that InternVL3 improves on InternVL 2.5 across multimodal perception and reasoning, and the headline result was InternVL3-78B's score of 72.2 on MMMU (Massive Multi-discipline Multimodal Understanding), which the technical report described as a new state of the art among open-source multimodal LLMs at the time of release.[1][3] The authors further stated the model "remains highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency."[1]
Selected InternVL3-78B results reported in the model documentation:
| Benchmark | InternVL3-78B score |
|---|---|
| MMMU (multimodal reasoning) | 72.2 |
| MathVista (visual math) | 79.0 |
| DocVQA (document QA) | 95.4 |
| OCRBench | 906 |
The documentation also reports that InternVL3-78B outperforms Alibaba's Qwen2.5-VL-72B on most reasoning tasks and is broadly comparable to GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Pro on a range of multimodal tasks.[2] Across the family, OpenGVLab reported that each InternVL3 size tended to lead similarly sized open competitors on multimodal reasoning benchmarks.[2][3] These figures are the developers' own reported results. As with most vendor-published evaluations, exact numbers depend on prompting and the evaluation harness, so they are best read as indicative rather than as directly comparable to scores from other reports.
InternVL3 weights were published openly on Hugging Face under the OpenGVLab organization, where the larger models accumulated tens of thousands of downloads in the months after release.[4] The model cards list an Apache 2.0 (and MIT) license. However, most variants are derived from Qwen2.5, whose own license terms attach to derivatives, so the effective licensing of those Qwen-based variants is constrained by the Qwen license. Users have raised this discrepancy with OpenGVLab, and the applicable terms therefore depend on the language backbone of a given size.[2][6] OpenGVLab also released code and, per the technical report, committed to releasing the underlying training data.[1]
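A minimal sketch of loading one of the published checkpoints with the Hugging Face `transformers` library follows. The repository naming pattern `OpenGVLab/InternVL3-<size>` and the need for `trust_remote_code=True` are assumptions based on how the checkpoints are commonly distributed, and `repo_id` and `load_internvl3` are hypothetical helper names; the model card for a given size is the authoritative source for exact usage.

```python
KNOWN_SIZES = ("1B", "2B", "8B", "9B", "14B", "38B", "78B")


def repo_id(size: str) -> str:
    """Build the assumed Hugging Face repository ID for a model size."""
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown InternVL3 size: {size!r}")
    return f"OpenGVLab/InternVL3-{size}"


def load_internvl3(size: str = "8B"):
    """Download and load a checkpoint (requires torch and transformers).

    Imports are kept inside the function so the helper above can be used
    without the heavy dependencies installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = repo_id(size)
    # trust_remote_code lets transformers run the model class shipped in
    # the repository; review that code before enabling it.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,  # assumed 16-bit loading to save memory
        trust_remote_code=True,
    ).eval()
    return model, tokenizer


if __name__ == "__main__":
    print(repo_id("8B"))
```

Calling `load_internvl3("8B")` downloads the weights on first use, so it is worth checking the license terms of the chosen size beforehand, given the backbone-dependent licensing described above.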
InternVL3 is significant as one of the strongest open-weights multimodal LLM families of 2025 and as an early, widely cited demonstration that native multimodal pre-training (learning vision and language jointly from the start) can match or exceed the prevailing approach of adapting a finished text LLM. Its breadth of sizes, openly published weights, and competitive benchmark results made it a common research baseline and a frequent starting point for fine-tuning and downstream multimodal systems, a role continued by the later InternVL3.5 release.[2][5]