OpenGVLab

Updated Jun 28, 2026

OpenGVLab is the open-source organization and project hub for general-vision and multimodal foundation models run by the General Vision group at Shanghai AI Laboratory. It develops and freely releases three flagship model families: the InternImage vision backbone, the InternVideo video foundation models, and the InternVL multimodal large language models. Code is distributed on GitHub, and weights and datasets on Hugging Face. The "GV" in the name stands for general vision: the goal of building visual models that generalize across tasks, so that little extra effort is needed to adapt them to a new problem. ^[1]^[2]

What is OpenGVLab?

OpenGVLab is an open-source platform for general-vision and multimodal AI associated with Shanghai AI Laboratory. The organization describes itself on its GitHub profile as "a research group from Shanghai AI Lab focused on Vision-Centric AI research," and states that "The GV in our name, OpenGVLab, means general vision, a general understanding of vision, so little effort is needed to adapt to new vision-based tasks." As of 2026 the organization has open-sourced more than 70 works and hosts 94 public repositories on GitHub, with several hundred model checkpoints and datasets on Hugging Face. ^[1]^[2]

OpenGVLab was launched on 25 February 2022 as a joint effort by SenseTime and Shanghai AI Laboratory, with the Chinese University of Hong Kong and Shanghai Jiao Tong University also credited as contributors. It was presented as an open-source platform for general vision, built around an earlier general-vision model called INTERN that the same institutions had developed in 2021. At launch the platform released pre-trained models covering four core visual tasks (classification, detection, segmentation, and depth estimation), published what it called the industry's first benchmark for evaluating general-vision models, and made large annotated image datasets, with a labelling system on the order of 100,000 labels, available to outside developers. ^[3]^[4]^[5]

Since then OpenGVLab has grown from that initial platform into the umbrella under which Shanghai AI Laboratory's general-vision group releases its research. The "Intern" prefix shared by InternImage, InternVideo, InternVL, and the language-model series InternLM reflects the lab's broader Shusheng (书生, "scholar") family of foundation models, of which the OpenGVLab projects form the vision and multimodal branch. The group describes its remit as vision-centric artificial intelligence, and its work spans computer vision backbones, video understanding, image and video datasets, and vision-language models that compete with closed systems such as GPT-4V and GPT-4o. ^[1]^[6]

The table below lists the most prominent projects hosted under OpenGVLab.

Project	Type	First release	Venue
InternImage	Image backbone (CNN with DCNv3)	Nov 2022	CVPR 2023 (Highlight)
InternVideo	Video foundation model	Dec 2022	N/A (technical report)
InternVid	Video-text dataset	Jul 2023	ICLR 2024 (Spotlight)
InternVideo2	Video foundation model	Mar 2024	ECCV 2024
InternVL	Multimodal LLM (vision-language)	Dec 2023	CVPR 2024 (Oral)
VideoChat / Ask-Anything	Video dialogue system	May 2023	N/A

What is InternImage?

InternImage is a large-scale vision backbone and the project that gave OpenGVLab its early visibility. The paper, "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions" (arXiv:2211.05778), was first posted on 10 November 2022 and was accepted to CVPR 2023 as a Highlight. Its lead authors are Wenhai Wang and Jifeng Dai, with co-authors including Zhe Chen, Xizhou Zhu, Hongsheng Li, Xiaogang Wang, and Yu Qiao, affiliated with Shanghai AI Laboratory, Tsinghua University, and other institutions. ^[7]^[8]

The model's distinguishing choice is that it is a convolutional neural network rather than a vision transformer. Its core operator is DCNv3, a third-generation deformable convolution that gives the network dynamic, input-dependent receptive fields and adaptive spatial aggregation. The InternImage authors argue this lets a CNN reach the large effective receptive field and weak inductive bias that made transformers strong on large-scale data, while keeping the efficiency of convolutions. The largest variant, InternImage-H, has about 1.08 billion parameters, an unusual scale for a CNN-based vision model. ^[7]^[8]
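The idea of input-dependent sampling can be illustrated with a minimal one-dimensional sketch. This is not the DCNv3 operator from the InternImage repository: the function names, the 1-D setting, and the hand-picked offsets are assumptions made for illustration. The sketch also leaves out the multi-group structure, the channel projections, and the layers that would predict offsets and modulation weights from the input. It shows only the core step, which is sampling at shifted fractional positions and combining the samples with softmax-normalized weights.

```python
import math


def linear_sample(signal, pos):
    """Linearly interpolate a 1-D signal at a fractional position.

    Positions outside the signal contribute zero (zero padding).
    """
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(signal, center, grid, offsets, mask_logits):
    """Aggregate K samples taken around `center`.

    grid:        fixed positions of a regular kernel, e.g. [-1, 0, 1]
    offsets:     per-sample shifts; a real network would predict these
                 from the input (hypothetical hand-picked values here)
    mask_logits: unnormalized modulation scalars, softmax-normalized
                 across the K sampling points
    """
    weights = softmax(mask_logits)
    return sum(
        w * linear_sample(signal, center + p + d)
        for w, p, d in zip(weights, grid, offsets)
    )


if __name__ == "__main__":
    ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
    # Zero offsets reduce to an ordinary (averaging) 3-tap kernel.
    print(deformable_aggregate(ramp, 2, [-1, 0, 1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
    # Non-zero offsets move the sampling points, shifting the receptive field.
    print(deformable_aggregate(ramp, 2, [-1, 0, 1], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0]))
```

With all offsets at zero the function behaves like a fixed kernel. Changing the offsets per position is what makes the receptive field dynamic.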

InternImage reported strong results across classification, detection, and segmentation. It reached 89.6% top-1 accuracy on ImageNet, 65.5 mAP on object detection for COCO test-dev (using a composite-backbone variant with the DINO detector and test-time augmentation), and 62.9 mIoU on ADE20K semantic segmentation. The project repository states that InternImage was "the only model that surpassed 65 mAP in the world" on COCO at the time, and a later, larger 3-billion-parameter variant (InternImage-G) pushed ImageNet top-1 accuracy past 90%. The DCNv3 operator and its CUDA kernels are released in the same repository. ^[7]^[8]

What is InternVideo?

InternVideo is OpenGVLab's family of video foundation models. The first paper, "InternVideo: General Video Foundation Models via Generative and Discriminative Learning" (arXiv:2212.03191), was posted in December 2022 by Yi Wang, Kunchang Li, Yu Qiao, Limin Wang, and colleagues. It trains a general video model by combining two complementary self-supervised objectives, masked video modeling (the generative side) and video-language contrastive learning (the discriminative side), and coordinating the two representations. The authors reported state-of-the-art results on 39 video datasets, and the paper states that "our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively." ^[9]
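The discriminative objective mentioned above, video-language contrastive learning, can be sketched as a generic symmetric contrastive (InfoNCE-style) loss. This is an illustration of the general technique rather than InternVideo's training code: the function names are hypothetical, the embeddings are assumed to be L2-normalized, and the temperature of 0.07 is an assumed value, not one taken from the paper.

```python
import math


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to describe the same clip;
    every other pairing in the batch is treated as a negative.
    """
    n = len(video_embs)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity matrix: rows index videos, columns index captions.
    sims = [[dot(v, t) / temperature for t in text_embs] for v in video_embs]

    # Video-to-text: each video should pick out its own caption.
    v2t = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # Text-to-video: each caption should pick out its own video.
    t2v = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", contrastive_loss(videos, matched))
    print("swapped pairs:", contrastive_loss(videos, swapped))
```

The loss is small when each video embedding lies closest to its own caption and large when the pairing is wrong, which is the signal that pulls matching video and text representations together.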

The successor, InternVideo2, was described in "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding" (arXiv:2403.15377, March 2024) and accepted to ECCV 2024. It scales the approach and is released in several sizes, including a 1-billion-parameter vision model (InternVideo2-1B) that is paired with a 7B language model for video question answering. The repository has continued with later releases such as InternVideo2.5 (January 2025), which focuses on long and rich context for video multimodal models. Because InternVideo has its own article, this page only summarizes it. ^[10]^[11]

What is InternVL?

InternVL is OpenGVLab's line of multimodal large language models, positioned as an open alternative to proprietary vision-language systems. The first paper, "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks" (arXiv:2312.14238), was posted on 21 December 2023 by Zhe Chen, Jifeng Dai, Wenhai Wang, Yu Qiao, and colleagues, and was accepted to CVPR 2024 as an Oral. Its design scales the vision encoder to 6 billion parameters (InternViT-6B) and aligns it with a language model, reporting state-of-the-art results on 32 visual-linguistic benchmarks spanning perception, retrieval, and multimodal dialogue, with performance approaching GPT-4V and Gemini Pro on tests such as MMMU, DocVQA, ChartQA, and MathVista. ^[12]^[13]

InternVL became one of the more actively iterated open multimodal projects. The repository tracks a lineage from InternVL 1.0 through 1.5, 2.0, 2.5, 3.0, and 3.5, with the InternViT visual encoder released in both 6B and lighter 300M variants. As with InternVideo, the model has its own dedicated article and is summarized here only as one of the three flagship OpenGVLab families. ^[12]^[13]

What other projects and datasets does OpenGVLab maintain?

Beyond the three headline model families, OpenGVLab maintains a number of related projects and large datasets. InternVid, described in "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation" (arXiv:2307.06942) and accepted as a spotlight at ICLR 2024, is a video-text dataset built with an automatic, large-language-model-assisted captioning pipeline. The paper states the dataset "contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words," and it underpins the InternVideo training data. ^[14]^[15]
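The quoted totals imply some useful per-item averages. The short calculation below derives them from the rounded figures in the quotation, so the results are approximate: roughly 6.5 minutes per source video, about 12 seconds per clip, and about 17.5 words per description. The constant and function names are chosen here for illustration.

```python
# Rounded totals as quoted from the InternVid paper
# ("over 7 million videos", "nearly 760K hours").
NUM_VIDEOS = 7_000_000
TOTAL_HOURS = 760_000
NUM_CLIPS = 234_000_000
TOTAL_WORDS = 4_100_000_000


def dataset_averages(num_videos, total_hours, num_clips, total_words):
    """Derive per-video and per-clip averages from the dataset totals."""
    return {
        "minutes_per_video": total_hours * 60 / num_videos,
        "seconds_per_clip": total_hours * 3600 / num_clips,
        "words_per_description": total_words / num_clips,
        "clips_per_video": num_clips / num_videos,
    }


if __name__ == "__main__":
    averages = dataset_averages(NUM_VIDEOS, TOTAL_HOURS, NUM_CLIPS, TOTAL_WORDS)
    for name, value in averages.items():
        print(f"{name}: {value:.1f}")
```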

On the application side, VideoChat (also released under the name Ask-Anything) is an end-to-end video-centric multimodal dialogue system that lets users chat about the contents of a video. Other repositories under the organization include VideoMamba, a state-space-model architecture for efficient video understanding, and the All-Seeing project on open-world panoptic visual recognition. Alongside its model checkpoints, the organization's Hugging Face page hosts datasets such as InternVid and ShareGPT4o. ^[1]^[2]

How does OpenGVLab relate to Shanghai AI Laboratory?

OpenGVLab is the public, open-source face of Shanghai AI Laboratory's general-vision research rather than a separate company; its own GitHub profile describes it as "a research group from Shanghai AI Lab." It originated from a 2022 partnership between Shanghai AI Laboratory and SenseTime. The lab and the company share founding ties and personnel, and Yu Qiao, a leading figure at Shanghai AI Laboratory, appears as an author across the major OpenGVLab papers. The projects sit alongside the lab's other open releases, including the InternLM language models, the XTuner and LMDeploy toolchains, and shared evaluation efforts, and they frequently reuse each other's components. InternVL, for example, combines an OpenGVLab vision encoder with language backbones that include InternLM. Because the work is distributed openly on GitHub and Hugging Face, the InternImage, InternVideo, and InternVL code and pre-trained weights have become widely used reference points in academic and industrial computer-vision work. ^[1]^[6]^[16]

References

1. OpenGVLab on GitHub. https://github.com/OpenGVLab
2. OpenGVLab on Hugging Face. https://huggingface.co/OpenGVLab
3. "Shanghai AI Lab, SenseTime Launch Open-Source General Vision AI Platform," Yicai Global, 2022. https://www.yicaiglobal.com/news/shanghai-ai-lab-sensetime-launch-open-source-general-vision-ai-platform
4. "SenseTime and Shanghai AI Lab Jointly Unveil OpenGVLab," SenseTime News. https://www.sensetime.com/en/news-detail/41164735?categoryId=1072
5. "SenseTime unveils open source computer vision platform," Biometric Update, March 2022. https://www.biometricupdate.com/202203/sensetime-unveils-open-source-computer-vision-platform
6. InternLM organization and documentation, Shanghai AI Laboratory. https://github.com/InternLM/InternLM
7. Wang, W., Dai, J., Chen, Z., et al. "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions," arXiv:2211.05778. https://arxiv.org/abs/2211.05778
8. InternImage repository (CVPR 2023 Highlight). https://github.com/OpenGVLab/InternImage
9. Wang, Y., Li, K., Qiao, Y., et al. "InternVideo: General Video Foundation Models via Generative and Discriminative Learning," arXiv:2212.03191. https://arxiv.org/abs/2212.03191
10. Wang, Y., et al. "InternVideo2: Scaling Foundation Models for Multimodal Video Understanding," arXiv:2403.15377. https://arxiv.org/abs/2403.15377
11. InternVideo repository (ECCV 2024). https://github.com/OpenGVLab/InternVideo
12. Chen, Z., Wu, J., Wang, W., et al. "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks," arXiv:2312.14238. https://arxiv.org/abs/2312.14238
13. InternVL repository (CVPR 2024 Oral). https://github.com/OpenGVLab/InternVL
14. Wang, Y., et al. "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation," arXiv:2307.06942. https://arxiv.org/abs/2307.06942
15. InternVid dataset on Hugging Face. https://huggingface.co/datasets/OpenGVLab/InternVid
16. Shanghai AI Laboratory. https://www.shlab.org.cn/
