InternVideo

Updated Jul 16, 2026

InternVideo is a family of general-purpose video foundation models developed by OpenGVLab at the Shanghai Artificial Intelligence Laboratory in collaboration with Nanjing University and the Shenzhen Institutes of Advanced Technology of the Chinese Academy of Sciences. The series, which began with the original InternVideo technical report in December 2022 and progressed through InternVideo2 (March 2024) and InternVideo2.5 (January 2025), aims to provide a single pretrained video backbone that can be adapted to action recognition, video-text retrieval, temporal grounding, video question answering, and video-centric dialogue.^[1]^[2]^[3] The lineage is characterized by a recurring design pattern of combining masked video modeling with multimodal contrastive learning and, in later versions, next-token prediction, all paired with progressively larger encoders and ever-larger video-text corpora.^[1]^[2] Weights for most variants are released openly under the OpenGVLab namespace on Hugging Face and source code lives in a single GitHub monorepo, making the family one of the most widely used open video foundation model lineages.^[4]^[5]

Infobox

Item	Value
Developer	OpenGVLab, Shanghai AI Laboratory; Nanjing University; SIAT, CAS^[1]^[2]^[3]
First release	InternVideo technical report, 6 December 2022^[1]
Latest major version	InternVideo2.5, 21 January 2025^[3]
Paradigm	Video foundation model combining masked modeling and contrastive learning^[1]^[2]
Largest encoder	InternVideo2-6B (6 billion parameters)^[2]
Code	github.com/OpenGVLab/InternVideo^[4]
Weights	huggingface.co/OpenGVLab^[5]
Conference venue	InternVideo2 accepted to ECCV 2024^[6]

History

Origins at OpenGVLab

OpenGVLab is the general-vision research group of the Shanghai AI Laboratory, formed in partnership with SenseTime and a consortium of Chinese universities, and oriented around building open vision foundation models intended to transfer broadly across visual tasks.^[7] Within that program, the image-side line of work eventually produced InternVL, while the video-side line produced the InternVideo series and the associated InternVid dataset.^[4] The first paper, "InternVideo: General Video Foundation Models via Generative and Discriminative Learning," was posted to arXiv on 6 December 2022, with a revision the following day, and listed seventeen authors including Yi Wang, Kunchang Li, Yali Wang, Limin Wang, and Yu Qiao.^[1]

InternVid dataset

In July 2023 the group released a companion paper introducing InternVid, a large-scale video-text dataset built by an automated pipeline that uses large language models to caption clips harvested from YouTube. The full release contains more than seven million source videos totaling roughly 760,000 hours, which are cut into about 234 million clips paired with detailed natural-language descriptions amounting to about 4.1 billion words.^[8] InternVid was accepted as a spotlight at the International Conference on Learning Representations 2024, and the full annotation set (230 million video-text pairs) was published in June 2024.^[4]^[8] The paper also introduced ViCLIP, a video-text contrastive model trained on the dataset that served as a building block for later InternVideo work.^[8]
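The headline figures above imply some per-clip averages, which the short sketch below works out. It assumes, purely for illustration, that the clips jointly cover all of the source hours and that the rounded figures are exact, so the results are approximate. The function name `internvid_averages` is hypothetical and not part of any released tooling.

```python
def internvid_averages(num_videos=7_000_000,
                       total_hours=760_000,
                       num_clips=234_000_000,
                       total_words=4_100_000_000):
    """Derive rough per-clip averages from the rounded headline figures.

    Assumption (not stated in the article): the clips cover every source hour.
    """
    return {
        "seconds_per_clip": total_hours * 3600 / num_clips,
        "words_per_caption": total_words / num_clips,
        # "More than seven million" videos, so this is an upper estimate.
        "clips_per_video": num_clips / num_videos,
    }


stats = internvid_averages()
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the averages come out to about 11.7 seconds per clip, about 17.5 words per caption, and roughly 33 clips per source video.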

InternVideo2 and ECCV 2024

In March 2024 the group released a substantially redesigned successor, "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding" (arXiv:2403.15377), with twenty authors led by Yi Wang and Kunchang Li.^[2] The paper was accepted to the European Conference on Computer Vision (ECCV) 2024 as a poster.^[6] The work increased the video encoder to six billion parameters, introduced a three-stage progressive training recipe, and reported state-of-the-art results on more than sixty video and audio tasks.^[2]^[6] During the same year, derivative systems based on InternVideo2 won seven championships in the egocentric vision (EgoVis) challenges held at CVPR 2024, where the team's entries spanned five tracks of the Ego4D challenge and three tracks of the EPIC-Kitchens challenge.^[9]

InternVideo2.5 and beyond

A third arXiv report, "InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling" (arXiv:2501.12386), was posted on 21 January 2025.^[3] InternVideo2.5 is not a new video encoder but rather a recipe for pairing an InternViT visual backbone with an InternLM-2.5-7B language model so that the resulting multimodal large language model can ingest much longer videos (between 64 and 512 frames, with each eight-frame clip compressed to 128 tokens) and answer fine-grained spatiotemporal questions about them.^[3] An associated press release from the Nanjing University centre dated 11 February 2025 framed the system as one that can find a "needle in a haystack" within tens of thousands of frames, extending usable input length from roughly 3,000 frames to around 10,000.^[10] The OpenGVLab monorepo also lists a December 2025 technical report for an "InternVideo-Next" branch, described as "general video foundation models for genuine world understanding."^[4]

Technical details

InternVideo (2022)

The first InternVideo is constructed by jointly training two complementary self-supervised branches and then fusing their representations.^[1] One branch performs masked video modeling on a ViT-Huge (Vision Transformer) backbone, reconstructing patches from heavily masked tubelets, and the other branch performs video-language contrastive learning on a UniFormerV2-Large backbone using paired text. The two backbones are then connected through cross-model attention modules that select which features to share between them, yielding a unified 1.3-billion-parameter representation.^[11] Pretraining uses about 12 million video clips drawn from Kinetics-400, WebVid2M and WebVid10M, HowTo100M, AVA, Something-Something V2 and additional self-collected videos, supplemented with roughly 100 million image-text pairs from LAION-400M.^[11] When fine-tuned, InternVideo reports 91.1 percent top-1 accuracy on Kinetics-400 and 77.2 percent on Something-Something V2, and the paper evaluates the model across thirty-nine downstream datasets spanning action recognition and detection, video-language alignment, and open-world video tasks.^[11]
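The video-language contrastive branch can be illustrated with a generic symmetric InfoNCE loss of the kind used by CLIP-style models. The following is a minimal sketch, not the authors' implementation: it assumes unit-normalised embeddings held in plain Python lists, treats the i-th video and i-th text as the only positive pair, and uses a temperature of 0.07 chosen for illustration. The function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy.

    Assumes both inputs are equal-length lists of unit-norm vectors and that
    pair i (video_emb[i], text_emb[i]) is the matching pair.
    """
    n = len(video_emb)
    # logits[i][j] = cosine similarity of video i and text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
        for v in video_emb
    ]
    video_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_video = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_info_nce(videos, matched_texts))  # close to zero
print(symmetric_info_nce(videos, swapped_texts))  # large
```

The loss is near zero when each video is most similar to its own caption and grows when the pairing is wrong, which is the signal that pulls matched video and text embeddings together during pretraining.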

InternVideo2 (2024)

InternVideo2 reorganises the recipe into three sequential stages.^[2]

Stage	Objective	Inputs and outputs
1	Unmasked token reconstruction. The video encoder is trained to predict features of unmasked tokens supplied by two teachers (an InternVL-6B image branch and a VideoMAEv2-g video branch) while around 80 percent of input tokens are masked.	Video only.^[12]
2	Cross-modal alignment. Audio, speech, and text encoders are attached, and the system is trained with contrastive losses and masked language modeling across the resulting modalities.	Video plus audio plus text plus speech.^[12]
3	Video-centric dialogue. The video encoder is connected to a large language model through a Q-Former interface, then instruction-tuned for chat.	Video plus text dialogue.^[12]
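Stage 1 in the table above can be sketched as two steps: pick the small set of tokens that stay visible, then pull the student's features for those tokens toward a teacher's. The sketch below is an illustration under stated assumptions rather than the paper's code: it selects the visible tokens at random, represents features as plain lists, and uses mean squared error between L2-normalised vectors as the alignment loss. All function names are hypothetical.

```python
import math
import random


def select_unmasked(num_tokens, mask_ratio=0.8, seed=0):
    """Return sorted indices of tokens left visible after masking.

    Random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    num_keep = num_tokens - int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_loss(student_feats, teacher_feats):
    """Mean squared distance between normalised student and teacher tokens."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(s_n, t_n))
    return total / len(student_feats)


visible = select_unmasked(num_tokens=100, mask_ratio=0.8)
print(len(visible))                              # 20 tokens stay visible
print(alignment_loss([[3.0, 0.0]], [[1.0, 0.0]]))  # 0.0, same direction
```

With two teachers, as in the table, one such loss per teacher would be computed on the same visible tokens and the two combined, for example by a weighted sum; the weighting is not specified here.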

The architecture is scaled in two sizes: InternVideo2-1B (one billion parameters) and InternVideo2-6B (six billion).^[2] At the data level the paper introduces a pipeline that segments long videos into semantically coherent shots and generates joint video, audio, and speech captions to improve cross-modal alignment, and the released paper reports state-of-the-art results on more than sixty out of seventy-four evaluated video and audio tasks.^[2]^[6] In end-to-end finetuning, InternVideo2-6B reports 92.1 percent top-1 on Kinetics-400, 91.9 percent on Kinetics-600, and 85.9 percent on Kinetics-700, using only sixteen frames at 224 by 224 resolution.^[12] For video question answering and dialogue, the InternVideo2-Stage3 variants paired with a VideoChat2 front end report scores on the order of 67 percent on MVBench and around 70 percent recall at IoU 0.5 on the Charades-STA temporal grounding benchmark.^[12]

InternVideo2.5 (2025)

InternVideo2.5 keeps an InternViT vision encoder and an InternLM-2.5-7B language backbone, and concentrates new work in two modules.^[3] The first, Hierarchical Context Compression (HiCO), exploits the redundancy of natural video by compressing each eight-frame clip down to a fixed-length set of about 128 tokens, allowing the multimodal large language model to ingest between 64 and 512 frames within a tractable context budget.^[3] The second, Task Preference Optimization (TPO), distils annotations from specialist vision tasks (such as detection, tracking, and segmentation) into a Direct Preference Optimization (DPO)-style training signal so that the dialogue model inherits fine spatial and temporal grounding behaviour from those specialists.^[3] On standard video MLLM benchmarks the resulting 7B-class system reports about 75.7 on MVBench (an improvement of roughly 3.7 points over the InternVL2.5 base), about 65.1 on Video-MME, and about 74.9 on the Perception Test.^[13]
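The context budget implied by the HiCO figures can be worked out directly. The sketch below takes the eight-frame clip size and 128 tokens per clip from the description above and assumes, for simplicity, that the frame count is an exact multiple of the clip size and that only visual tokens are counted (prompt and answer tokens are ignored). The function name `hico_visual_tokens` is hypothetical.

```python
def hico_visual_tokens(num_frames, frames_per_clip=8, tokens_per_clip=128):
    """Visual tokens passed to the language model for a given frame count.

    Assumes num_frames is a positive multiple of frames_per_clip.
    """
    if num_frames <= 0 or num_frames % frames_per_clip != 0:
        raise ValueError("num_frames must be a positive multiple of the clip size")
    num_clips = num_frames // frames_per_clip
    return num_clips * tokens_per_clip


for frames in (64, 512):
    tokens = hico_visual_tokens(frames)
    print(f"{frames} frames -> {tokens} visual tokens ({tokens // frames} per frame)")
```

Under these assumptions 64 frames map to 1,024 visual tokens and 512 frames to 8,192, which is 16 tokens per frame in both cases.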

Variants and releases

The OpenGVLab GitHub repository organises the lineage into four subdirectories: InternVideo, InternVideo2, InternVideo2.5, and InternVideo-Next, plus a Data folder for InternVid.^[4] Released model weights on Hugging Face include the following representative checkpoints, each downloadable under the Apache 2.0 license unless otherwise noted.

Model	Parameters	Stage	Notable evaluation
InternVideo2-Stage1-1B-224p-f8-K710	1B video encoder	Stage 1 finetuned on Kinetics-710	Kinetics action recognition.^[14]
InternVideo2-Stage1-1B-224p-K400	1B	Stage 1 finetuned on Kinetics-400	K400 action classification.^[4]
InternVideo2-Stage1-1B-224p-K700	1B	Stage 1 finetuned on Kinetics-700	K700 action classification.^[4]
InternVideo2-Stage3-8B (HD/F16 variants)	8B (encoder plus LM)	Stage 3 dialogue	MVBench, Video-MME.^[4]^[12]
InternVideo2_Chat_8B_InternLM2_5	8B chat	Stage 3 with InternLM2.5 LM	Video dialogue.^[15]
InternVideo2_5_Chat_8B	8B chat (HiCO+TPO)	InternVideo2.5 long-context	MVBench, Video-MME, Perception Test.^[10]^[13]

OpenGVLab also released smaller distilled variants of InternVideo2 (S, B, and L sizes) and a VideoCLIP combination with MobileCLIP for efficient deployment, both announced in August 2024.^[4]

Applications

Action recognition and detection

Across the published Kinetics-400, Kinetics-600, Kinetics-700, and Something-Something V2 leaderboards, InternVideo (2022) and InternVideo2 (2024) variants posted leading or near-leading numbers at the time of release, including the 91.1 percent K400 accuracy reported for the original InternVideo and the 92.1 percent K400 / 91.9 percent K600 / 85.9 percent K700 numbers reported for InternVideo2-6B.^[11]^[12] The line is widely used as a backbone for downstream action detection and temporal localization, and OpenGVLab's EgoVideo system, built on InternVideo2 and adapted to egocentric data, won seven championships across the five Ego4D and three EPIC-Kitchens tracks it entered at CVPR 2024.^[9]

Video-language retrieval and grounding

InternVideo2 reports state-of-the-art zero-shot and finetuned numbers on MSR-VTT, ActivityNet-Captions, DiDeMo, LSMDC, and VATEX text-to-video and video-to-text retrieval splits, and around 70 percent R1@0.5 on Charades-STA temporal grounding when used as the visual backbone of a small adapter network.^[12] The ViCLIP variant introduced with InternVid is the public CLIP-style video-text head most commonly paired with the InternVideo encoders.^[8]
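The two metrics named above, retrieval recall and R1@0.5 for temporal grounding, are simple to compute once a model has produced similarity scores or predicted time spans. The sketch below shows generic versions of both. It assumes that in the similarity matrix the correct video for query i sits at column i, and that each grounding query has exactly one top-ranked predicted span. The function names are hypothetical and the inputs are toy values, not benchmark data.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score of text i against video j; the ground
    truth for text i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        better = sum(1 for score in row if score > row[i])
        if better < k:
            hits += 1
    return hits / len(similarity)


def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) spans in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1_iou(top_predictions, truths, threshold=0.5):
    """Share of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(
        1 for pred, truth in zip(top_predictions, truths)
        if temporal_iou(pred, truth) >= threshold
    )
    return hits / len(truths)


toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],  # the wrong video scores highest for this query
    [0.2, 0.1, 0.7],
]
print(recall_at_k(toy_similarity, k=1))
print(recall_at_1_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 5.0), (8.0, 20.0)]))
```

In the toy retrieval matrix two of three queries rank their own video first, and in the grounding example only the first prediction reaches an IoU of 0.5, so half of the queries count as hits.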

Video question answering and dialogue

The Stage 3 InternVideo2-Chat checkpoints, and the later InternVideo2.5-Chat checkpoints, are intended as drop-in vision backbones for video multimodal models in the style of LLaVA and Qwen2-VL. When paired with VideoChat2's Q-Former front end and a Vicuna or Mistral 7B language model, the system reports leading results on MVBench (a comprehensive multimodal video benchmark introduced by the same group at CVPR 2024) and was competitive with GPT-4o-class systems on perception subtasks at the time of release.^[16]^[17] InternVideo2.5 further raised the MVBench score to roughly 75.7 and pushed long-context video question answering to tens of thousands of frames, addressing one of the principal limitations of earlier video MLLMs.^[3]^[13]

Video captioning and dataset construction

InternVid, the dataset that accompanies the model line, was built by running large language models on transcribed video clips to produce dense captions; the resulting 230 million video-text pairs have become a common training corpus for downstream video models, video contrastive learning, and video generation research.^[8]

Open-source release strategy

The InternVideo line follows OpenGVLab's pattern of releasing both code and weights early in the publication cycle. The main GitHub repository hosts all paper code, training scripts, and evaluation harnesses under one tree, while Hugging Face hosts the checkpoints, each tagged with the stage and resolution it was trained at.^[4]^[5] The InternVid dataset is also distributed publicly with both full and subset annotations to support reproducible training.^[4]^[8] As of the InternVideo2.5 release, the most-used downstream checkpoint, InternVideo2_5_Chat_8B, is freely available with example code for video chat, object tracking, and segmentation.^[10] Several derivative repositories under OpenGVLab (Ask-Anything for VideoChat / VideoChat2 and VideoChat-Flash for long-context modelling) demonstrate how to plug the video encoder into different language models such as Vicuna, Mistral, and Phi-3.^[16]^[18]

Limitations and criticisms

The InternVideo papers themselves note several limits that carry over across the family. First, the early InternVideo backbone is engineered as a fusion of two pretrained branches, ViT-Huge and UniFormerV2-Large, which complicates the training pipeline and inflates the effective parameter count relative to single-tower designs.^[11] Second, InternVideo2's six-billion-parameter encoder is expensive to deploy: end-to-end action recognition results require sixteen 224 by 224 frames per clip and the dialogue stage attaches an additional multi-billion parameter language model, putting the largest configurations beyond commodity inference hardware.^[2]^[12] Third, although the dataset pipeline behind InternVid is described in detail, the training corpora include large web-scraped sources (HowTo100M, WebVid, YouTube-derived InternVid) whose copyright status and content distribution are not exhaustively audited in the public release.^[1]^[8] Fourth, the InternVideo2.5 paper reports a Video-MME improvement of less than one point over a strong InternVL2.5 baseline, suggesting that on some general video question answering benchmarks the long-context machinery yields diminishing returns; the larger gains appear concentrated in perception and grounding subtasks.^[3]^[13] Finally, like other large vision foundation models, InternVideo lacks formal guarantees about hallucination on out-of-distribution video, and the model cards explicitly require users to agree not to deploy the systems for experiments that could harm human subjects.^[14]
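The deployment cost raised in the second point can be made concrete with a back-of-the-envelope estimate of weight memory. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, the key-value cache, and framework overhead, so real requirements are higher. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param=2 assumes fp16 or bf16 storage (an assumption of this
    sketch, not a documented deployment setting).
    """
    return num_params * bytes_per_param / 2**30


for label, params in (("6B video encoder", 6_000_000_000),
                      ("8B chat model", 8_000_000_000)):
    print(f"{label}: about {weight_memory_gib(params):.1f} GiB of weights")
```

Under these assumptions the six-billion-parameter encoder needs roughly 11 GiB for its weights alone, and an 8B chat checkpoint roughly 15 GiB, before any memory is spent on video frames or generated tokens.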

Related work and comparison

InternVideo sits within a broader family of video foundation models that includes VideoMAE-style masked encoders, CLIP-derived video-text models, and large video-language assistants. The most directly comparable open systems in 2024 to 2026 are video adapters built on Qwen2-VL, Qwen2.5-VL, MiniCPM-V, and Llama 3.2 Vision, all of which likewise connect a vision encoder to a language model through a learned projection or cross-attention module.^[17] Among image-side foundation models in the same OpenGVLab program, InternVL supplies the InternViT backbone reused inside InternVideo2.5.^[3] The InternVideo line is distinguished from these neighbours by the explicit progression through three pretraining stages (masked reconstruction, multimodal contrastive learning, then video-centric next-token prediction), and by the very large open video-text dataset (InternVid) released alongside the models.^[2]^[8]

References

1. Wang, Y. et al., "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv (cs.CV) 2212.03191, 2022-12-06. https://arxiv.org/abs/2212.03191. Accessed 2026-05-21.
2. Wang, Y. et al., "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding", arXiv (cs.CV) 2403.15377, 2024-03-22. https://arxiv.org/abs/2403.15377. Accessed 2026-05-21.
3. Wang, Y. et al., "InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling", arXiv (cs.CV) 2501.12386, 2025-01-21. https://arxiv.org/abs/2501.12386. Accessed 2026-05-21.
4. OpenGVLab, "InternVideo: Video Foundation Models & Data for Multimodal Understanding (GitHub repository)", GitHub, 2024-08-01. https://github.com/OpenGVLab/InternVideo. Accessed 2026-05-21.
5. OpenGVLab, "OpenGVLab organization page", Hugging Face, 2024-08-01. https://huggingface.co/OpenGVLab. Accessed 2026-05-21.
6. ECCV 2024, "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (Poster 1476)", European Conference on Computer Vision, 2024-10-01. https://eccv.ecva.net/virtual/2024/poster/1476. Accessed 2026-05-21.
7. SenseTime, "SenseTime and Shanghai AI Lab Jointly Unveil OpenGVLab", SenseTime News, 2022-02-25. https://www.sensetime.com/en/news-detail/41164735?categoryId=1072. Accessed 2026-05-21.
8. Wang, Y. et al., "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation", arXiv (cs.CV) 2307.06942, 2023-07-13. https://arxiv.org/abs/2307.06942. Accessed 2026-05-21.
9. OpenGVLab, "EgoVideo: Solutions for EgoVis Challenges in CVPR 2024 (GitHub repository)", GitHub, 2024-06-26. https://github.com/OpenGVLab/EgoVideo. Accessed 2026-05-21.
10. Nanjing University Large Model Research Center, "Shusheng InternVideo2.5 Open-Sourced, Precisely Finding the 'Needle in a Haystack' in Tens of Thousands of Frames, with Fine-Grained Spatiotemporal Perception", Nanjing University, 2025-02-11. https://cs.nju.edu.cn/lm/en/post/2025-02-11-internvideo-25-release/index.html. Accessed 2026-05-21.
11. Wang, Y. et al., "InternVideo: General Video Foundation Models via Generative and Discriminative Learning (HTML)", ar5iv (arXiv 2212.03191), 2022-12-06. https://ar5iv.labs.arxiv.org/html/2212.03191. Accessed 2026-05-21.
12. Wang, Y. et al., "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (HTML v4)", arXiv 2403.15377v4, 2024-08-14. https://arxiv.org/html/2403.15377v4. Accessed 2026-05-21.
13. Wang, Y. et al., "InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling (HTML v1)", arXiv 2501.12386v1, 2025-01-21. https://arxiv.org/html/2501.12386v1. Accessed 2026-05-21.
14. OpenGVLab, "InternVideo2-Stage1-1B-224p-f8-k710 model card", Hugging Face, 2024-04-01. https://huggingface.co/OpenGVLab/InternVideo2-Stage1-1B-224p-f8-k710. Accessed 2026-05-21.
15. OpenGVLab, "InternVideo2_Chat_8B_InternLM2_5 model card", Hugging Face, 2024-08-01. https://huggingface.co/OpenGVLab/InternVideo2_Chat_8B_InternLM2_5. Accessed 2026-05-21.
16. OpenGVLab, "Ask-Anything VideoChat2 README", GitHub, 2024-06-01. https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/README.md. Accessed 2026-05-21.
17. Li, K. et al., "VideoChat: Chat-Centric Video Understanding", arXiv (cs.CV) 2305.06355, 2023-05-10. https://arxiv.org/abs/2305.06355. Accessed 2026-05-21.
18. OpenGVLab, "VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling (GitHub repository)", GitHub, 2024-12-01. https://github.com/OpenGVLab/VideoChat-Flash. Accessed 2026-05-21.
