InternVideo
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,641 words
InternVideo is a family of general-purpose video foundation models developed by OpenGVLab at the Shanghai Artificial Intelligence Laboratory in collaboration with Nanjing University and the Shenzhen Institutes of Advanced Technology of the Chinese Academy of Sciences. The series, which began with the original InternVideo technical report in December 2022 and progressed through InternVideo2 (March 2024) and InternVideo2.5 (January 2025), aims to provide a single pretrained video backbone that can be adapted to action recognition, video-text retrieval, temporal grounding, video question answering, and video-centric dialogue.[1][2][3] The lineage is characterized by a recurring design pattern of combining masked video modeling with multimodal contrastive learning and, in later versions, next-token prediction, all paired with progressively larger encoders and ever-larger video-text corpora.[1][2] Weights for most variants are released openly under the OpenGVLab namespace on Hugging Face and source code lives in a single GitHub monorepo, making the family one of the most widely used open video foundation model lineages.[4][5]
| Item | Value |
|---|---|
| Developer | OpenGVLab, Shanghai AI Laboratory; Nanjing University; SIAT, CAS[1][2][3] |
| First release | InternVideo technical report, 6 December 2022[1] |
| Latest major version | InternVideo2.5, 21 January 2025[3] |
| Paradigm | Video foundation model combining masked modeling and contrastive learning[1][2] |
| Largest encoder | InternVideo2-6B (6 billion parameters)[2] |
| Code | github.com/OpenGVLab/InternVideo[4] |
| Weights | huggingface.co/OpenGVLab[5] |
| Conference venue | InternVideo2 accepted to ECCV 2024[6] |
OpenGVLab is the general-vision research group of the Shanghai AI Laboratory, formed in partnership with SenseTime and a consortium of Chinese universities, and oriented around building open vision foundation models intended to transfer broadly across visual tasks.[7] Within that program, the image-side line of work eventually produced InternVL, while the video-side line produced the InternVideo series and the associated InternVid dataset.[4] The first paper, "InternVideo: General Video Foundation Models via Generative and Discriminative Learning," was posted to arXiv on 6 December 2022, with a revision the following day, and listed seventeen authors including Yi Wang, Kunchang Li, Yali Wang, Limin Wang, and Yu Qiao.[1]
In July 2023 the group released a companion paper introducing InternVid, a large-scale video-text dataset built by an automated pipeline that uses large language models to caption clips harvested from YouTube. The full release contains more than seven million source videos totaling roughly 760,000 hours, which are cut into about 234 million clips paired with detailed natural-language descriptions amounting to about 4.1 billion words.[8] InternVid was accepted as a spotlight at the International Conference on Learning Representations 2024, and the full annotation set (230 million video-text pairs) was published in June 2024.[4][8] The paper also introduced ViCLIP, a video-text contrastive model trained on the dataset that served as a building block for later InternVideo work.[8]
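The headline totals above imply some useful per-video and per-clip averages. The following is a minimal arithmetic sketch using only the rounded figures quoted in this paragraph; it assumes the clips together account for all of the quoted footage, which the text does not state explicitly, so the results are rough estimates rather than published statistics.

```python
# Back-of-the-envelope averages derived from the rounded InternVid totals above.
# Assumption: the 234M clips together cover the full 760,000 hours of footage.

SOURCE_VIDEOS = 7_000_000        # "more than seven million source videos"
TOTAL_HOURS = 760_000            # "roughly 760,000 hours"
CLIPS = 234_000_000              # "about 234 million clips"
CAPTION_WORDS = 4_100_000_000    # "about 4.1 billion words"


def internvid_averages() -> dict:
    """Return per-video and per-clip averages implied by the headline totals."""
    total_seconds = TOTAL_HOURS * 3600
    return {
        "minutes_per_video": total_seconds / SOURCE_VIDEOS / 60,
        "clips_per_video": CLIPS / SOURCE_VIDEOS,
        "seconds_per_clip": total_seconds / CLIPS,
        "words_per_caption": CAPTION_WORDS / CLIPS,
    }


if __name__ == "__main__":
    for name, value in internvid_averages().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the average clip is on the order of ten seconds long and its caption is a sentence or two, which is the granularity a clip-level contrastive model such as ViCLIP would see during training.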
In March 2024 the group released a substantially redesigned successor, "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding" (arXiv:2403.15377), with twenty authors led by Yi Wang and Kunchang Li.[2] The paper was accepted to the European Conference on Computer Vision (ECCV) 2024 as a poster.[6] The work increased the video encoder to six billion parameters, introduced a three-stage progressive training recipe, and reported state-of-the-art results on more than sixty video and audio tasks.[2][6] During the same year, derivative systems based on InternVideo2 won seven championships in the egocentric vision (EgoVis) challenges held at CVPR 2024, where the team entered five tracks of the Ego4D challenge and three tracks of the EPIC-Kitchens challenge.[9]
A third arXiv report, "InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling" (arXiv:2501.12386), was posted on 21 January 2025.[3] InternVideo2.5 is not a new video encoder but rather a recipe for adapting an InternViT visual backbone with an InternLM-2.5-7B language model so that the resulting multimodal large language model can ingest much longer videos (between 64 and 512 frames, with each eight-frame clip compressed to 128 tokens) and answer fine-grained spatiotemporal questions about them.[3] An associated press release from the Nanjing University centre dated 11 February 2025 framed the system as one that can find a "needle in a haystack" within tens of thousands of frames, extending usable input length from roughly 3,000 frames to around 10,000.[10] The OpenGVLab monorepo also lists a December 2025 technical report for an "InternVideo-Next" branch, described as "general video foundation models for genuine world understanding."[4]
The first InternVideo is constructed by training two complementary self-supervised branches and then fusing their representations.[1] One branch performs masked video modeling on a ViT-Huge (Vision Transformer) backbone, reconstructing patches from heavily masked tubelets, and the other branch performs video-language contrastive learning on a UniFormerV2-Large backbone using paired text. The two backbones are then connected through cross-model attention modules that select which features to share between them, yielding a unified 1.3-billion-parameter representation.[11] Pretraining uses about 12 million video clips drawn from Kinetics-400, WebVid2M and WebVid10M, HowTo100M, AVA, Something-Something V2 and additional self-collected videos, supplemented with roughly 100 million image-text pairs from LAION-400M.[11] When fine-tuned, InternVideo reports 91.1 percent top-1 accuracy on Kinetics-400 and 77.2 percent on Something-Something V2, and the paper evaluates the model across thirty-nine downstream datasets spanning action recognition and detection, video-language alignment, and open-world video tasks.[11]
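To make the phrase "heavily masked tubelets" concrete, the sketch below generates a tube mask of the kind commonly used in masked video modeling: one spatial pattern is drawn and repeated at every temporal position. This is an illustration only. The function name, the grid size, and the 90 percent ratio are assumptions chosen for the example, not values taken from the InternVideo paper.

```python
import random


def tube_mask(frames: int, height: int, width: int,
              mask_ratio: float, seed: int = 0) -> list:
    """Return a [frames][height * width] mask where True marks a masked token.

    Tube masking draws ONE spatial pattern and repeats it at every temporal
    position, so a masked patch cannot simply be copied from a nearby frame.
    """
    tokens_per_frame = height * width
    num_masked = int(round(mask_ratio * tokens_per_frame))
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(tokens_per_frame), num_masked))
    spatial = [pos in masked_positions for pos in range(tokens_per_frame)]
    return [list(spatial) for _ in range(frames)]


if __name__ == "__main__":
    # Hypothetical grid: 8 temporal positions of 14 x 14 patches, 90% masked.
    mask = tube_mask(frames=8, height=14, width=14, mask_ratio=0.9)
    visible = sum(not m for m in mask[0])
    print(f"visible tokens per temporal position: {visible} of {14 * 14}")
```

Repeating the same spatial pattern through time is what makes the reconstruction task hard: the encoder has to infer the hidden content from context rather than from the same patch in an adjacent frame.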
InternVideo2 reorganises the recipe into three sequential stages.[2]
| Stage | Objective | Inputs and outputs |
|---|---|---|
| 1 | Unmasked token reconstruction. The video encoder is trained to predict features of unmasked tokens supplied by two teachers (an InternVL-6B image branch and a VideoMAEv2-g video branch) while around 80 percent of input tokens are masked. | Video only.[12] |
| 2 | Cross-modal alignment. Audio, speech, and text encoders are attached, and the system is trained with contrastive losses and masked language modeling across the resulting modalities. | Video plus audio plus text plus speech.[12] |
| 3 | Video-centric dialogue. The video encoder is connected to a large language model through a Q-Former interface, then instruction-tuned for chat. | Video plus text dialogue.[12] |
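Stage 1 in the table can be sketched as a feature-alignment loss computed only on the tokens that survive masking. The code below is a toy, pure-Python illustration under stated assumptions: the mean-squared-error form, the L2 normalisation, and all function names are choices made for this example, since the table does not specify the exact loss.

```python
import math
import random


def l2_normalise(vec: list) -> list:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def keep_indices(num_tokens: int, mask_ratio: float, rng: random.Random) -> list:
    """Pick the tokens that stay visible; the rest are dropped before the student runs."""
    num_keep = num_tokens - int(round(mask_ratio * num_tokens))
    return sorted(rng.sample(range(num_tokens), num_keep))


def alignment_loss(student: list, teacher: list, kept: list) -> float:
    """Mean squared error between normalised student and teacher features.

    `student` holds one feature vector per kept token, in the order of `kept`;
    `teacher` holds one vector per token of the full, unmasked clip.
    """
    total = 0.0
    for s_vec, idx in zip(student, kept):
        s, t = l2_normalise(s_vec), l2_normalise(teacher[idx])
        total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
    return total / len(kept)


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, dim = 20, 4
    kept = keep_indices(num_tokens, mask_ratio=0.8, rng=rng)  # about 80% masked
    teacher = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    student = [[rng.gauss(0, 1) for _ in range(dim)] for _ in kept]
    print(f"kept {len(kept)} of {num_tokens} tokens, "
          f"loss = {alignment_loss(student, teacher, kept):.4f}")
```

With two teachers, as described in the table, one plausible arrangement is to evaluate such a loss once per teacher and sum the results; that combination rule is likewise an assumption of this sketch.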
The architecture is scaled in two sizes: InternVideo2-1B (one billion parameters) and InternVideo2-6B (six billion).[2] At the data level the paper introduces a pipeline that segments long videos into semantically coherent shots and generates joint video, audio, and speech captions to improve cross-modal alignment, and the released paper reports state-of-the-art results on more than sixty out of seventy-four evaluated video and audio tasks.[2][6] In end-to-end finetuning, InternVideo2-6B reports 92.1 percent top-1 on Kinetics-400, 91.9 percent on Kinetics-600, and 85.9 percent on Kinetics-700, using only sixteen frames at 224 by 224 resolution.[12] For video question answering and dialogue, the InternVideo2-Stage3 variants paired with a VideoChat2 front end report scores on the order of 67 percent on MVBench and around 70 percent recall at IoU 0.5 on the Charades-STA temporal grounding benchmark.[12]
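The "recall at IoU 0.5" figure quoted for Charades-STA counts a query as correct when the top-ranked predicted segment overlaps the annotated segment with a temporal intersection-over-union of at least 0.5. A minimal sketch of that metric follows; the function names and the toy segments are illustrative and are not taken from any InternVideo evaluation script.

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: list, gts: list, threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Toy data: one top-1 prediction and one ground-truth segment per query.
    predictions = [(2.0, 8.0), (0.0, 3.0), (10.0, 20.0)]
    ground_truth = [(3.0, 9.0), (5.0, 9.0), (12.0, 18.0)]
    print(f"R1@0.5 = {recall_at_1(predictions, ground_truth):.3f}")
```

In the toy data the first and third predictions overlap their targets enough to count as hits, while the second does not overlap at all, so two of the three queries are recalled.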
InternVideo2.5 keeps an InternViT vision encoder and an InternLM-2.5-7B language backbone, and concentrates new work in two modules.[3] The first, Hierarchical Context Compression (HiCO), exploits the redundancy of natural video by compressing each eight-frame clip down to a fixed-length set of about 128 tokens, allowing the multimodal large language model to ingest between sixty-four and five hundred and twelve frames within a tractable context budget.[3] The second, Task Preference Optimization (TPO), distils annotations from specialist vision tasks (such as detection, tracking, and segmentation) into a Direct Preference Optimization (DPO)-style training signal so that the dialogue model inherits fine spatial and temporal grounding behaviour from those specialists.[3] On standard video MLLM benchmarks the resulting 7B-class system reports about 75.7 on MVBench (an improvement of roughly 3.7 points over the InternVL2.5 base), about 65.1 on Video-MME, and about 74.9 on the Perception Test.[13]
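The HiCO figures above translate directly into a visual-token budget for the language model. The sketch below works through that arithmetic using only the numbers quoted in the text (eight frames per clip, about 128 tokens per clip); the function name and the requirement that the frame count be a multiple of eight are simplifying assumptions for this example.

```python
FRAMES_PER_CLIP = 8     # from the text: each eight-frame clip ...
TOKENS_PER_CLIP = 128   # ... is compressed to about 128 tokens


def visual_tokens(num_frames: int) -> int:
    """Visual tokens passed to the language model for `num_frames` frames."""
    if num_frames % FRAMES_PER_CLIP:
        raise ValueError("this sketch assumes a multiple of the clip length")
    return (num_frames // FRAMES_PER_CLIP) * TOKENS_PER_CLIP


if __name__ == "__main__":
    for frames in (64, 128, 256, 512):
        print(f"{frames} frames -> {visual_tokens(frames)} visual tokens")
```

At these settings each frame costs 16 tokens on average, so the quoted range of 64 to 512 frames corresponds to roughly 1,024 to 8,192 visual tokens.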
The OpenGVLab GitHub repository organises the lineage into four subdirectories: InternVideo, InternVideo2, InternVideo2.5, and InternVideo-Next, plus a Data folder for InternVid.[4] Released model weights on Hugging Face include the following representative checkpoints, each downloadable under the Apache 2.0 license unless otherwise noted.
| Model | Parameters | Stage | Notable evaluation |
|---|---|---|---|
| InternVideo2-Stage1-1B-224p-f8-K710 | 1B video encoder | Stage 1 finetuned on Kinetics-710 | Kinetics action recognition.[14] |
| InternVideo2-Stage1-1B-224p-K400 | 1B | Stage 1 finetuned on Kinetics-400 | K400 action classification.[4] |
| InternVideo2-Stage1-1B-224p-K700 | 1B | Stage 1 finetuned on Kinetics-700 | K700 action classification.[4] |
| InternVideo2-Stage3-8B (HD/F16 variants) | 8B (encoder plus LM) | Stage 3 dialogue | MVBench, Video-MME.[4][12] |
| InternVideo2_Chat_8B_InternLM2_5 | 8B chat | Stage 3 with InternLM2.5 LM | Video dialogue.[15] |
| InternVideo2_5_Chat_8B | 8B chat (HiCO+TPO) | InternVideo2.5 long-context | MVBench, Video-MME, Perception Test.[10][13] |
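The Stage 1 checkpoint names in the table encode the stage, encoder size, input resolution, optional frame count, and fine-tuning dataset. The helper below is a hypothetical parser whose pattern is inferred solely from the three Stage 1 rows above; it is not an official naming specification, and other checkpoints in the table (for example the chat models) intentionally do not match it.

```python
import re
from typing import Optional

# Pattern inferred only from the Stage 1 names in the table above; it is an
# assumption for illustration, not a documented naming scheme.
STAGE1_NAME = re.compile(
    r"^InternVideo2-Stage(?P<stage>\d+)-(?P<size>\d+B)-(?P<resolution>\d+)p"
    r"(?:-f(?P<frames>\d+))?-(?P<dataset>K\d+)$"
)


def parse_checkpoint_name(name: str) -> dict:
    """Split a Stage 1 checkpoint name into its labelled components."""
    match = STAGE1_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name}")
    fields: dict = match.groupdict()
    frames: Optional[str] = fields["frames"]  # None when the name has no -fN part
    fields["frames"] = frames
    return fields


if __name__ == "__main__":
    for name in (
        "InternVideo2-Stage1-1B-224p-f8-K710",
        "InternVideo2-Stage1-1B-224p-K400",
    ):
        print(name, "->", parse_checkpoint_name(name))
```

A parser like this can help when selecting a checkpoint programmatically, for instance to confirm that the resolution and frame count match the preprocessing used downstream.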
OpenGVLab also released smaller distilled variants of InternVideo2 (S, B, and L sizes) and a VideoCLIP combination with MobileCLIP for efficient deployment, both announced in August 2024.[4]
Across the published Kinetics-400, Kinetics-600, Kinetics-700, and Something-Something V2 leaderboards, InternVideo (2022) and InternVideo2 (2024) variants posted leading or near-leading numbers at the time of release, including the 91.1 percent K400 accuracy reported for the original InternVideo and the 92.1 percent K400 / 91.9 percent K600 / 85.9 percent K700 numbers reported for InternVideo2-6B.[11][12] The line is widely used as a backbone for downstream action detection and temporal localization, and OpenGVLab's EgoVideo system, built on InternVideo2 and adapted to egocentric data, won seven championships across the five Ego4D and three EPIC-Kitchens tracks it entered at CVPR 2024.[9]
InternVideo2 reports state-of-the-art zero-shot and finetuned numbers on MSR-VTT, ActivityNet-Captions, DiDeMo, LSMDC, and VATEX text-to-video and video-to-text retrieval splits, and around 70 percent R1@0.5 on Charades-STA temporal grounding when used as the visual backbone of a small adapter network.[12] The ViCLIP variant introduced with InternVid is the public CLIP-style video-text head most commonly paired with the InternVideo encoders.[8]
The Stage 3 InternVideo2-Chat checkpoints, and the later InternVideo2.5-Chat checkpoints, are intended as drop-in vision backbones for video multimodal models in the style of LLaVA and Qwen2-VL. When paired with VideoChat2's Q-Former front end and a Vicuna or Mistral 7B language model, the system reports leading results on MVBench (a comprehensive multimodal video benchmark introduced by the same group at CVPR 2024) and was competitive with GPT-4o-class systems on perception subtasks at the time of release.[16][17] InternVideo2.5 further raised the MVBench score to roughly 75.7 and extended long-context video question answering to inputs of around 10,000 frames, addressing one of the principal limitations of earlier video MLLMs.[3][13]
InternVid, the dataset that accompanies the model line, was built by running large language models on transcribed video clips to produce dense captions; the resulting 230 million video-text pairs have become a common training corpus for downstream video models, video contrastive learning, and video generation research.[8]
The InternVideo line follows OpenGVLab's pattern of releasing both code and weights early in the publication cycle. The main GitHub repository hosts all paper code, training scripts, and evaluation harnesses under one tree, while Hugging Face hosts the checkpoints, each tagged with the stage and resolution it was trained at.[4][5] The InternVid dataset is also distributed publicly with both full and subset annotations to support reproducible training.[4][8] As of the InternVideo2.5 release, the most-used downstream checkpoint, InternVideo2_5_Chat_8B, is freely available with example code for video chat, object tracking, and segmentation.[10] Several derivative repositories under OpenGVLab (Ask-Anything for VideoChat / VideoChat2 and VideoChat-Flash for long-context modelling) demonstrate how to plug the video encoder into different language models such as Vicuna, Mistral, and Phi-3.[16][18]
The InternVideo papers themselves note several limits that carry over across the family. First, the early InternVideo backbone is engineered as a fusion of two pretrained branches, ViT-Huge and UniFormerV2-Large, which complicates the training pipeline and inflates the effective parameter count relative to single-tower designs.[11] Second, InternVideo2's six-billion-parameter encoder is expensive to deploy: end-to-end action recognition results require sixteen 224 by 224 frames per clip and the dialogue stage attaches an additional multi-billion parameter language model, putting the largest configurations beyond commodity inference hardware.[2][12] Third, although the dataset pipeline behind InternVid is described in detail, the training corpora include large web-scraped sources (HowTo100M, WebVid, YouTube-derived InternVid) whose copyright status and content distribution are not exhaustively audited in the public release.[1][8] Fourth, the InternVideo2.5 paper reports a Video-MME improvement of less than one point over a strong InternVL2.5 baseline, suggesting that on some general video question answering benchmarks the long-context machinery yields diminishing returns; the larger gains appear concentrated in perception and grounding subtasks.[3][13] Finally, like other large vision foundation models, InternVideo lacks formal guarantees about hallucination on out-of-distribution video, and the model cards explicitly require users to agree not to deploy the systems for experiments that could harm human subjects.[14]
InternVideo sits within a broader family of video foundation models that includes VideoMAE-style masked encoders, CLIP-derived video-text models, and large video-language assistants. The most directly comparable open systems from 2024 to 2026 are video adapters built on Qwen2-VL, Qwen2.5-VL, MiniCPM-V, and Llama 3.2 Vision, all of which adopt a similar Q-Former-style or perceiver-style projection between a vision encoder and a language model.[17] Among image-side foundation models in the same OpenGVLab program, InternVL supplies the InternViT backbone reused inside InternVideo2.5.[3] The InternVideo line is distinguished from these neighbours by the explicit progression through three pretraining stages (masked reconstruction, multimodal contrastive learning, then video-centric next-token prediction), and by the very large open video-text dataset (InternVid) released alongside the models.[2][8]