OpenGVLab
Last reviewed
Jun 3, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 1,445 words
OpenGVLab (short for General Vision Lab) is an open-source organization and project hub for general-vision and multimodal foundation models associated with Shanghai AI Laboratory. The "GV" stands for general vision, the lab's goal of building visual models that generalize across tasks so that little extra effort is needed to adapt them to a new problem. OpenGVLab distributes its code on GitHub and its model weights and datasets on Hugging Face, and it is best known for three model families: the InternImage vision backbone, the InternVideo video foundation models, and the InternVL multimodal large language models. [1][2]
OpenGVLab was launched on 25 February 2022 as a joint effort by SenseTime and Shanghai AI Laboratory, with the Chinese University of Hong Kong and Shanghai Jiao Tong University also credited as contributors. It was presented as an open-source platform for general vision, built around an earlier general-vision model called INTERN that the same institutions had developed in 2021. At launch the platform released pre-trained models covering four core visual tasks (classification, detection, segmentation, and depth estimation), published an evaluation benchmark for general-vision models, and made large annotated image datasets available to outside developers. [3][4][5]
Since then OpenGVLab has grown from that initial platform into the umbrella under which Shanghai AI Laboratory's general-vision group releases its research. The "Intern" prefix shared by InternImage, InternVideo, InternVL, and the language-model series InternLM reflects the lab's broader Shusheng (书生, "scholar") family of foundation models, of which the OpenGVLab projects form the vision and multimodal branch. The group describes its remit as vision-centric artificial intelligence. Its work spans computer vision backbones, video understanding, image and video datasets, and vision-language models positioned as open alternatives to closed systems such as GPT-4V and GPT-4o. [1][6]
The table below lists the most prominent projects hosted under OpenGVLab.
| Project | Type | First release | Venue |
|---|---|---|---|
| InternImage | Image backbone (CNN with DCNv3) | Nov 2022 | CVPR 2023 (Highlight) |
| InternVideo | Video foundation model | Dec 2022 | N/A (technical report) |
| InternVid | Video-text dataset | Jul 2023 | ICLR 2024 (spotlight) |
| InternVideo2 | Video foundation model | Mar 2024 | ECCV 2024 |
| InternVL | Multimodal LLM (vision-language) | Dec 2023 | CVPR 2024 (Oral) |
| VideoChat / Ask-Anything | Video dialogue system | May 2023 | N/A |
InternImage is a large-scale vision backbone and the project that gave OpenGVLab its early visibility. The paper, "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions" (arXiv:2211.05778), was first posted on 10 November 2022 and was accepted to CVPR 2023 as a Highlight. Its lead authors are Wenhai Wang and Jifeng Dai, with co-authors including Zhe Chen, Xizhou Zhu, Hongsheng Li, Xiaogang Wang, and Yu Qiao, working with Shanghai AI Laboratory, Tsinghua University, and other institutions. [7][8]
The model's distinguishing choice is that it is a convolutional neural network rather than a vision transformer. Its core operator is DCNv3, a third-generation deformable convolution that gives the network dynamic, input-dependent receptive fields and adaptive spatial aggregation. The InternImage authors argue this lets a CNN reach the large effective receptive field and weak inductive bias that made transformers strong on large-scale data, while keeping the efficiency of convolutions. The largest variant, InternImage-H, has about 1.08 billion parameters, an unusual scale for a CNN-based vision model. [7][8]
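The idea of an input-dependent receptive field can be illustrated with a minimal sketch of deformable sampling at a single output location. This is not the DCNv3 implementation: it assumes one channel and one group, omits the learned projection weights, and takes the offsets and modulation logits as plain arguments, whereas in the real operator they are predicted from the input features. The names `bilinear`, `deformable_sample`, and `GRID` are hypothetical.

```python
import math

def bilinear(img, y, x):
    """Sample a 2-D list at fractional (y, x); out-of-bounds pixels read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0

    def px(r, c):
        return img[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return (px(y0, x0) * (1 - dy) * (1 - dx)
            + px(y0, x0 + 1) * (1 - dy) * dx
            + px(y0 + 1, x0) * dy * (1 - dx)
            + px(y0 + 1, x0 + 1) * dy * dx)

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Fixed 3x3 sampling grid of an ordinary convolution, as (dy, dx) pairs.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deformable_sample(img, center, offsets, mask_logits):
    """Aggregate 9 samples taken at grid position + learned offset.

    offsets:     9 (dy, dx) shifts, one per grid point (assumed given here).
    mask_logits: 9 scalars, softmax-normalized into aggregation weights.
    """
    cy, cx = center
    weights = softmax(mask_logits)
    return sum(
        w * bilinear(img, cy + gy + oy, cx + gx + ox)
        for (gy, gx), (oy, ox), w in zip(GRID, offsets, weights)
    )

# Usage: a horizontal ramp image where every pixel equals its column index.
img = [[float(c) for c in range(5)] for _ in range(5)]
no_shift = [(0.0, 0.0)] * 9
half_right = [(0.0, 0.5)] * 9
print(deformable_sample(img, (2, 2), no_shift, [0.0] * 9))    # approximately 2.0
print(deformable_sample(img, (2, 2), half_right, [0.0] * 9))  # approximately 2.5
```

With zero offsets and uniform weights the sketch reduces to an ordinary 3x3 average; non-zero offsets move the sampling points, which is what lets the receptive field adapt to the input.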
InternImage reported strong results across classification, detection, and segmentation. It reached 89.6% top-1 accuracy on ImageNet, 65.5 mAP on object detection for COCO test-dev (using a composite-backbone variant with the DINO detector and test-time augmentation), and 62.9 mIoU on ADE20K semantic segmentation. The DCNv3 operator and its CUDA kernels are released in the same repository. [7][8]
InternVideo is OpenGVLab's family of video foundation models. The first paper, "InternVideo: General Video Foundation Models via Generative and Discriminative Learning" (arXiv:2212.03191), was posted in December 2022 by Yi Wang, Kunchang Li, Yu Qiao, Limin Wang, and colleagues. It trains a general video model with two complementary self-supervised objectives, masked video modeling (the generative side) and video-language contrastive learning (the discriminative side), and then coordinates the two resulting video representations in a learnable manner. The authors reported state-of-the-art results on 39 video datasets, including 91.1% top-1 accuracy on Kinetics-400 and 77.2% on Something-Something V2. [9]
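The discriminative side can be illustrated with a minimal sketch of a contrastive objective over paired video and text embeddings. It assumes a CLIP-style symmetric InfoNCE loss with a fixed temperature of 0.07, which is a common formulation and not necessarily InternVideo's exact recipe. The masked-modeling objective is not shown, and the function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]

def log_softmax(row):
    m = max(row)
    log_sum = m + math.log(sum(math.exp(a - m) for a in row))
    return [a - log_sum for a in row]

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th video should match the i-th text and vice versa."""
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    n = len(v)
    # sim[i][j] = scaled cosine similarity between video i and text j.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]
    video_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_video = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return (video_to_text + text_to_video) / 2

# Usage: correctly paired embeddings give a much lower loss than shuffled pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(videos, texts_aligned))
print(contrastive_loss(videos, texts_shuffled))
```

Minimizing this loss pulls each video embedding toward its own caption and pushes it away from the other captions in the batch.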
The successor, InternVideo2, was described in "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding" (arXiv:2403.15377, March 2024) and accepted to ECCV 2024. It scales the approach and is released in several sizes, including a 1-billion-parameter vision model (InternVideo2-1B) that is paired with a 7B language model for video question answering. The repository has continued with later releases such as InternVideo2.5 (January 2025), which focuses on long and rich context for video multimodal models. Because InternVideo has its own article, this page only summarizes it. [10][11]
InternVL is OpenGVLab's line of multimodal large language models, positioned as an open alternative to proprietary vision-language systems. The first paper, "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks" (arXiv:2312.14238), was posted on 21 December 2023 by Zhe Chen, Jifeng Dai, Wenhai Wang, Yu Qiao, and colleagues, and was accepted to CVPR 2024 as an Oral. Its design scales the vision encoder to 6 billion parameters (InternViT-6B) and aligns it with a language model, reporting state-of-the-art results on 32 visual-linguistic benchmarks spanning perception, retrieval, and multimodal dialogue. [12][13]
InternVL became one of the more actively iterated open multimodal projects. The repository tracks a lineage from InternVL 1.0 through 1.5, 2.0, 2.5, 3.0, and 3.5, with the InternViT visual encoder released in both 6B and lighter 300M variants. As with InternVideo, the model has its own dedicated article and is summarized here only as one of the three flagship OpenGVLab families. [12][13]
Beyond the three headline model families, OpenGVLab maintains a number of related projects and large datasets. InternVid, described in "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation" (arXiv:2307.06942) and accepted as a spotlight at ICLR 2024, is a video-text dataset built with an automatic, large-language-model-assisted captioning pipeline. It comprises more than 7 million videos totaling about 760,000 hours, yielding roughly 234 million video clips with descriptions amounting to about 4.1 billion words, and it underpins the InternVideo training data. [14][15]
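The rounded totals above imply some per-clip averages, which the short calculation below derives. The results are back-of-the-envelope estimates computed from the approximate figures quoted in this article, not values taken from the dataset itself.

```python
# Rounded totals as quoted above (approximate by construction).
total_hours = 760_000
num_videos = 7_000_000          # "more than 7 million", so treated as a lower bound
num_clips = 234_000_000
total_words = 4_100_000_000

avg_clip_seconds = total_hours * 3600 / num_clips
avg_words_per_caption = total_words / num_clips
max_clips_per_video = num_clips / num_videos  # upper bound, since num_videos is a lower bound

print(f"average clip length:  {avg_clip_seconds:.1f} s")         # 11.7 s
print(f"average caption size: {avg_words_per_caption:.1f} words")  # 17.5 words
print(f"clips per video (at most): {max_clips_per_video:.1f}")    # 33.4
```

On these figures, InternVid consists of short clips of roughly a dozen seconds, each paired with a sentence-length description.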
On the application side, VideoChat (also released under the name Ask-Anything) is an end-to-end video-centric multimodal dialogue system that lets users chat about the contents of a video. Other repositories under the organization include VideoMamba, a state-space-model architecture for efficient video understanding, and the All-Seeing project on open-world panoptic visual recognition. The organization states that it has open-sourced more than 70 works in total, and its Hugging Face page hosts several hundred model checkpoints along with datasets such as InternVid and ShareGPT4o. [1][2]
OpenGVLab is the public, open-source face of Shanghai AI Laboratory's general-vision research rather than a separate company. It originated from a 2022 partnership between Shanghai AI Laboratory and SenseTime. The lab and the company share founding ties and personnel, and Yu Qiao, a leading figure at Shanghai AI Laboratory, appears as an author across the major OpenGVLab papers. The projects sit alongside the lab's other open releases, including the InternLM language models, the XTuner and LMDeploy toolchains, and shared evaluation efforts, and they frequently reuse each other's components. InternVL, for example, combines an OpenGVLab vision encoder with language backbones that include InternLM. Because the code and pre-trained weights are distributed openly on GitHub and Hugging Face, InternImage, InternVideo, and InternVL have become widely used reference implementations in academic and industrial computer-vision work. [1][6][16]