InternVL3

Updated Jun 28, 2026

InternVL3 is an open-weights family of multimodal large language models released on April 11, 2025 by OpenGVLab, the general vision team associated with Shanghai AI Laboratory. Spanning seven sizes from InternVL3-1B to InternVL3-78B, it is the third major generation of the InternVL series of vision-language models. It introduced "native multimodal pre-training," in which language and vision are learned jointly in a single pre-training stage rather than by attaching a vision encoder to a finished text LLM.^[1]^[2] At launch the largest model, InternVL3-78B, scored 72.2 on the MMMU multimodal reasoning benchmark, which the authors reported as a new state of the art among open-source multimodal LLMs.^[1]^[3]

Each model couples an InternViT vision encoder with a text language model, mostly Qwen2.5, so the system can process images, video, documents, and text within one model. The technical report describes the core idea succinctly: "Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage."^[1]

Quick facts

Attribute	Detail
Developer	OpenGVLab (Shanghai AI Laboratory)
Release date	April 11, 2025
Technical report	arXiv:2504.10479 (submitted April 14, 2025)
Model sizes	7 dense models, InternVL3-1B to InternVL3-78B
Architecture	ViT-MLP-LLM (InternViT encoder + MLP projector + LLM)
Vision encoders	InternViT-300M-448px-V2_5 and InternViT-6B-448px-V2_5
Language backbones	Qwen2.5 (six sizes), InternLM3-8B-Instruct (the 9B)
Headline result	InternVL3-78B: 72.2 on MMMU
License	Listed Apache 2.0/MIT, with Qwen license constraints on Qwen-based variants

What is InternVL3?

InternVL3 is a series of general-purpose multimodal models built for tasks that mix images and language, including visual question answering, document and chart understanding, optical character recognition (OCR), multi-image and video reasoning, visual grounding, multilingual understanding, and graphical user interface (GUI) agent tasks.^[2]^[3] Each model follows the "ViT-MLP-LLM" design shared across the InternVL line: an InternViT vision transformer encodes image tiles into visual tokens, a small multilayer perceptron (MLP) projector maps those tokens into the language model's embedding space, and a pretrained language model consumes the combined visual and text tokens.^[2] Images are handled with dynamic resolution: each input is split into up to 36 tiles of 448 by 448 pixels during training and up to 128 tiles at inference, which lets the models read high-resolution documents and complex scenes.^[2]
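
The dynamic-resolution tiling described above can be illustrated with a small sketch. This is a simplified illustration, not OpenGVLab's preprocessing code: the function names `choose_grid` and `resized_shape`, the tie-breaking rule, and the tile budget of 12 in the usage note are assumptions made for the example. Only the 448-pixel tile size comes from the text.

```python
from typing import List, Tuple

TILE = 448  # tile side in pixels, as stated in the text


def choose_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Pick a (cols, rows) tile grid whose aspect ratio best matches the image.

    Hypothetical helper: only grids with cols * rows <= max_tiles are considered.
    """
    if width <= 0 or height <= 0 or max_tiles < 1:
        raise ValueError("width, height and max_tiles must be positive")
    target = width / height
    grids: List[Tuple[int, int]] = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles // cols + 1)
    ]
    grids.sort(key=lambda g: g[0] * g[1])  # try smaller grids first
    best = grids[0]
    best_diff = float("inf")
    for cols, rows in grids:
        diff = abs(target - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * TILE * TILE * cols * rows:
            # Assumed tie-break: use a larger grid only when the image has
            # enough pixels to fill at least half of it.
            best = (cols, rows)
    return best


def resized_shape(grid: Tuple[int, int]) -> Tuple[int, int]:
    """Pixel size the image is resized to before being cut into tiles."""
    return grid[0] * TILE, grid[1] * TILE
```

With an illustrative budget of 12 tiles, `choose_grid(896, 448, 12)` returns `(2, 1)`, so a 2:1 image is cut into two 448-pixel tiles side by side, while a 448 by 448 image stays a single tile.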

The release accompanied a technical report, "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models," posted to arXiv (arXiv:2504.10479) by a large OpenGVLab-led author group with Jinguo Zhu as lead author.^[1] The report frames InternVL3 as an upgrade to the prior InternVL 2.5 generation that improves multimodal perception and reasoning while extending the models into newer capabilities such as tool use, GUI agents, industrial image analysis, and 3D vision.^[2] The authors stated a commitment to open science, writing that "in pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs."^[1]

Who builds InternVL, and how does the series fit together?

InternVL is developed by OpenGVLab, a vision-focused research effort connected to Shanghai AI Laboratory that builds large open vision and vision-language foundation models. The first InternVL paper was an oral presentation at the CVPR 2024 conference and positioned the series as an open-source alternative aiming to approach the capability of proprietary systems like GPT-4o.^[2] The line progressed through InternVL 1.x and the InternVL 2.x family (including InternVL 2.5), steadily scaling the InternViT vision encoder and the paired language models and improving training recipes.^[2]

A central building block is InternViT, OpenGVLab's family of vision transformer encoders, which InternVL3 reuses in two sizes: a roughly 300-million-parameter InternViT-300M-448px-V2_5 for the smaller models and a roughly 6-billion-parameter InternViT-6B-448px-V2_5 for the two largest.^[2]^[4] On the language side, InternVL3 mostly initializes from Qwen2.5 chat models from Alibaba, with one variant instead built on OpenGVLab's own InternLM language model.^[2]^[4] This places InternVL3 alongside other prominent open multimodal models such as Qwen2-VL and Qwen2.5-VL, and Meta's Llama 3.2 Vision, as well as closed systems including GPT-4o and Google's Gemini family.

OpenGVLab continued the series after InternVL3. In August 2025 it released InternVL3.5, which added reasoning and efficiency upgrades through a "Cascade Reinforcement Learning" recipe, a dynamic visual-resolution router, and a decoupled deployment framework, and broadened the lineup to scales from 1B up to a 241-billion-parameter mixture-of-experts model (InternVL3_5-241B-A28B), including a variant built on an open GPT-OSS language backbone.^[5]

What is native multimodal pre-training?

The defining change in InternVL3 is native multimodal pre-training. Earlier multimodal LLMs, including prior InternVL generations, typically started from a text-only LLM and then attached a vision encoder, aligning the two in a separate adaptation stage. InternVL3 instead consolidates language pre-training and multimodal alignment into a single stage, interleaving multimodal data (image-text, video-text, and interleaved image-text sequences) with large-scale pure-text corpora so the model acquires linguistic and visual abilities together.^[1]^[2] The authors argue this avoids the complexity and potential misalignment of post-hoc adaptation while preserving strong text-only performance; OpenGVLab reports that, thanks to native multimodal pre-training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 chat models used to initialize it.^[1]^[2]
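
As a rough illustration of single-stage training on mixed data, the sketch below draws each training sample from either a multimodal pool or a pure-text pool. The function name `mixed_stream`, the sampling scheme, and the 0.75 multimodal fraction in the usage note are all hypothetical. The source does not state the mixing ratio or the sampler, so this is only one way such interleaving could be arranged.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    multimodal: Sequence[str],
    text_only: Sequence[str],
    multimodal_fraction: float,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, sample) pairs drawn from both pools in one stream.

    Hypothetical sampler: each draw is multimodal with probability
    `multimodal_fraction`, otherwise pure text.
    """
    if not 0.0 <= multimodal_fraction <= 1.0:
        raise ValueError("multimodal_fraction must be in [0, 1]")
    rng = random.Random(seed)
    while True:
        if rng.random() < multimodal_fraction:
            yield "multimodal", rng.choice(multimodal)
        else:
            yield "text", rng.choice(text_only)


# Usage: the first 1000 samples of a stream with an assumed 3:1 mix.
batch = list(islice(mixed_stream(["img+caption"], ["plain text"], 0.75), 1000))
```

The point of the sketch is that both kinds of data pass through the same training loop in the same stage, rather than being split into a text-only phase followed by a separate alignment phase.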

InternVL3 combines native pre-training with several other techniques:

Variable Visual Position Encoding (V2PE): uses smaller, more flexible position increments for visual tokens so that long sequences of image tokens consume less of the model's position budget, supporting longer multimodal contexts.^[1]^[2]
Mixed Preference Optimization (MPO): a post-training alignment method that learns from both positive and negative samples to reduce the distribution shift between teacher-forced training and free-running inference, improving reasoning quality.^[1]^[2]
Multimodal test-time scaling: at evaluation the system can generate multiple candidate answers and use a separate critic model, VisualPRM-8B, in a Best-of-N selection to pick the strongest response on reasoning and mathematics tasks.^[1]^[2]
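
Two of the techniques above lend themselves to a short sketch. The first function shows the idea behind V2PE: text tokens advance the position index by 1, while visual tokens advance it by a smaller step. The second shows Best-of-N selection with a scoring function standing in for a critic such as VisualPRM-8B. The function names, the step size of 0.25, and the toy scorer are assumptions for illustration and are not taken from the InternVL3 code.

```python
from typing import Callable, List, Sequence


def v2pe_positions(is_visual: Sequence[bool], delta: float) -> List[float]:
    """Assign position indices, using a step of `delta` for visual tokens.

    Hypothetical helper: the first token sits at position 0, each later text
    token adds 1, and each later visual token adds `delta` (0 < delta <= 1).
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be in (0, 1]")
    positions: List[float] = []
    pos = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            pos += delta if visual else 1.0
        positions.append(pos)
    return positions


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate answer that the critic scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# One text token, four visual tokens, one text token, with an assumed delta.
print(v2pe_positions([False, True, True, True, True, False], delta=0.25))
# -> [0.0, 0.25, 0.5, 0.75, 1.0, 2.0]
```

In this example the four visual tokens consume one unit of position in total, where standard indexing would consume four. In `best_of_n`, the `score` callable is a placeholder for a real critic model, which would be given the image and question as well as the candidate answer.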

What sizes does InternVL3 come in?

InternVL3 was released as seven dense models sharing the same architecture and training recipe but differing in scale of the language model and, for the two largest, the vision encoder. Parameter figures are taken from the official model documentation.^[4]

Model	Total parameters	Vision encoder (InternViT)	MLP projector	Language model
InternVL3-1B	938M	300M-448px-V2_5	4.48M	Qwen2.5-0.5B
InternVL3-2B	2.09B	300M-448px-V2_5	8.66M	Qwen2.5-1.5B
InternVL3-8B	7.94B	300M-448px-V2_5	27.5M	Qwen2.5-7B
InternVL3-9B	9.14B	300M-448px-V2_5	33.6M	InternLM3-8B-Instruct
InternVL3-14B	15.12B	300M-448px-V2_5	47.2M	Qwen2.5-14B
InternVL3-38B	38.39B	6B-448px-V2_5	91.8M	Qwen2.5-32B
InternVL3-78B	78.41B	6B-448px-V2_5	172M	Qwen2.5-72B

The five smaller models (1B through 14B) pair the lighter InternViT-300M encoder with progressively larger language backbones, while the 38B and 78B models use the 6-billion-parameter InternViT-6B encoder.^[4] All variants except InternVL3-9B initialize their language component from Qwen2.5; the 9B model is built on OpenGVLab's InternLM3-8B-Instruct.^[2]^[4] This spread lets practitioners pick a size for on-device or budget-constrained deployment at the low end and for maximum capability at the high end.
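
To make the deployment trade-off concrete, the sketch below estimates the memory needed to hold each model's weights from the total parameter counts in the table. It assumes 2 bytes per parameter (16-bit weights) and counts weights only, so it excludes activations, the KV cache, and framework overhead. The helper name `weight_gb` is hypothetical, and the results are back-of-the-envelope figures rather than measured requirements.

```python
# Total parameters in billions, copied from the table above.
TOTAL_PARAMS_B = {
    "InternVL3-1B": 0.938,
    "InternVL3-2B": 2.09,
    "InternVL3-8B": 7.94,
    "InternVL3-9B": 9.14,
    "InternVL3-14B": 15.12,
    "InternVL3-38B": 38.39,
    "InternVL3-78B": 78.41,
}


def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (10^9 bytes).

    Assumes every parameter is stored at `bytes_per_param` bytes; 2.0 is an
    assumption corresponding to 16-bit weights.
    """
    return params_billion * bytes_per_param


for name, params in TOTAL_PARAMS_B.items():
    print(f"{name}: ~{weight_gb(params):.1f} GB of weights")
```

Under these assumptions the weights range from roughly 1.9 GB for InternVL3-1B to roughly 157 GB for InternVL3-78B.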

How does InternVL3 perform on benchmarks?

OpenGVLab reported that InternVL3 improves on InternVL 2.5 across multimodal perception and reasoning, and the headline result was InternVL3-78B's score of 72.2 on MMMU (Massive Multi-discipline Multimodal Understanding), which the technical report described as a new state of the art among open-source multimodal LLMs at the time of release.^[1]^[3] The authors further stated the model "remains highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency."^[1]

Selected InternVL3-78B results reported in the model documentation:

Benchmark	InternVL3-78B score
MMMU (multimodal reasoning)	72.2
MathVista (visual math)	79.0
DocVQA (document QA)	95.4
OCRBench	906

The documentation also reports that InternVL3-78B outperforms Alibaba's Qwen2.5-VL-72B on most reasoning tasks and is broadly comparable to GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Pro on a range of multimodal tasks.^[2] Across the family, OpenGVLab reported that each InternVL3 size tended to lead similarly sized open competitors on multimodal reasoning benchmarks.^[2]^[3] These figures are the developers' own reported results. As with most vendor-published model evaluations, exact numbers depend on prompting and the evaluation harness, so they are best read as indicative rather than independently verified.

Is InternVL3 open source, and where can you get it?

InternVL3 weights were published openly on Hugging Face under the OpenGVLab organization, where the larger models accumulated tens of thousands of downloads in the months after release.^[4] The model cards list Apache 2.0 and MIT licenses. Most variants, however, are derived from Qwen2.5, whose license terms attach to derivatives, so the Qwen license also constrains those variants. Users have raised this discrepancy with OpenGVLab; in practice, the terms that apply to a given size depend on its language backbone.^[2]^[6] OpenGVLab also released code and, per the technical report, committed to releasing the underlying training data.^[1]
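
A minimal sketch of fetching the weights with the Hugging Face `transformers` library follows. The repository names for the 78B and 38B models match the model-card URLs cited in this article; the same naming pattern for the other sizes, the use of `trust_remote_code=True`, and the bfloat16 dtype are assumptions that should be checked against the individual model card. The helpers `repo_id` and `load_model` are hypothetical names.

```python
SIZES = ("1B", "2B", "8B", "9B", "14B", "38B", "78B")


def repo_id(size: str) -> str:
    """Build the Hugging Face repository name for a given InternVL3 size.

    The "OpenGVLab/InternVL3-<size>" pattern is confirmed here only for the
    38B and 78B models; other sizes are assumed to follow it.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return f"OpenGVLab/InternVL3-{size}"


def load_model(size: str):
    """Download and load a model and tokenizer (requires torch and transformers).

    Imports are deferred so that building a repository name does not need the
    heavy dependencies. Calling this function triggers a large download.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    repo = repo_id(size)
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,  # assumed dtype; see the model card
        trust_remote_code=True,      # assumes the repo ships custom model code
    ).eval()
    return model, tokenizer
```

Because license terms differ by language backbone, it is worth reading the license section of the specific model card before calling `load_model` for a given size.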

Why does InternVL3 matter?

InternVL3 is significant as one of the strongest fully open multimodal LLM families of 2025 and as an early, widely cited demonstration that native multimodal pre-training, learning vision and language jointly from the start, can match or exceed the prevailing approach of adapting a finished text LLM. Its breadth of sizes, permissively published weights, and competitive benchmark results made it a common research baseline and a frequent starting point for fine-tuning and downstream multimodal systems, a role continued by the later InternVL3.5 release.^[2]^[5]

References

1. Zhu, Jinguo, et al. "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models." arXiv:2504.10479, April 2025. https://arxiv.org/abs/2504.10479
2. OpenGVLab. "InternVL3" (blog announcement), April 11, 2025. https://internvl.github.io/blog/2025-04-11-InternVL-3.0/
3. OpenGVLab. "Introduction of InternVL3.0 Series." InternVL documentation. https://internvl.readthedocs.io/en/latest/internvl3.0/introduction.html
4. "OpenGVLab/InternVL3-78B." Hugging Face model card. https://huggingface.co/OpenGVLab/InternVL3-78B
5. OpenGVLab. "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency." arXiv:2508.18265, August 2025. https://arxiv.org/abs/2508.18265
6. "OpenGVLab/InternVL3-38B: License" (discussion). Hugging Face. https://huggingface.co/OpenGVLab/InternVL3-38B/discussions/1
