Video Classification Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,499 words
Video classification models are machine learning systems that assign one or more category labels to a video clip, typically describing the human action depicted. The task sits at the intersection of computer vision and temporal sequence modeling and is foundational to video understanding. Modern systems use 3D convolutional neural networks, vision transformers, and large-scale self-supervised pre-training to recognize actions ranging from coarse sport categories on Kinetics to fine-grained manipulation interactions on Something-Something.
Video classification is related to but distinct from several neighboring tasks. Action detection (temporal action localization) predicts both the label and the temporal boundaries of an action inside an untrimmed video, while classification assumes the clip is already trimmed. Action segmentation densely labels every frame, and video captioning produces a free-form text description. The core challenge that separates video classification from image classification is temporal modeling: a single frame may identify a sport, but disambiguating actions such as picking something up versus putting something down requires reasoning about pixel change over time.
Before deep learning became dominant, video classification relied on hand-crafted spatiotemporal features. Improved Dense Trajectories (iDT), introduced by Heng Wang and Cordelia Schmid in 2013, tracked points across frames using optical flow and described their neighborhoods with HOG, HOF, and MBH descriptors. These local descriptors were aggregated into Fisher vectors and fed into linear support vector machines. iDT held state-of-the-art results on UCF-101 and HMDB-51 for years, and combining deep features with iDT often produced small but consistent gains.
The first deep model to clearly outperform iDT was the two-stream convolutional network of Karen Simonyan and Andrew Zisserman at Oxford in 2014. It used two convolutional neural networks: a spatial stream that consumed RGB frames and a temporal stream that consumed stacks of pre-computed optical flow. The streams were trained independently and their predictions averaged at test time. The split reflected the idea that ventral and dorsal visual pathways process appearance and motion separately, and it achieved strong results on UCF-101 and HMDB-51 with far less data than direct 3D approaches required.
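The test-time averaging of the two streams can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the three-class logits are hypothetical, and it assumes that each stream's logits are converted to probabilities with a softmax before averaging.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def late_fusion(spatial_logits, temporal_logits):
    """Average the class probabilities of two independently trained streams."""
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))

# Hypothetical logits for three classes (illustrative values only).
spatial_logits = np.array([2.0, 1.0, 0.1])   # RGB stream favours class 0
temporal_logits = np.array([0.1, 3.0, 0.2])  # optical-flow stream favours class 1

fused = late_fusion(spatial_logits, temporal_logits)
predicted_class = int(np.argmax(fused))
print(predicted_class)
```

In this toy case the flow stream is more confident than the RGB stream, so the fused prediction follows the motion cue.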
In December 2014, Du Tran and colleagues at Facebook AI Research and Dartmouth introduced C3D, replacing 2D kernels with homogeneous 3x3x3 spatiotemporal kernels learned end-to-end on Sports-1M. C3D made 3D convolution practical at scale and produced features that transferred well across action recognition and video retrieval.
In May 2017, Joao Carreira and Andrew Zisserman at DeepMind released I3D (Inflated 3D ConvNet) together with the Kinetics dataset. I3D took a 2D backbone (Inception-v1) and inflated each NxN filter into NxNxN by repeating its weights N times along the temporal axis and rescaling them by 1/N, letting the network inherit ImageNet priors and reach 80.2 percent on HMDB-51 and 97.9 percent on UCF-101 after Kinetics pre-training. Several variants followed: R(2+1)D (Tran et al. 2017) factorized each 3D kernel into a 2D spatial convolution followed by a 1D temporal convolution; S3D (Xie et al. 2018) and P3D (Qiu et al. 2017) explored similar factorizations; X3D (Feichtenhofer 2020) progressively expanded a small 2D network along width, depth, time, and resolution; and CSN (Tran et al. 2019) added depthwise separability.
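A small sketch of filter inflation may help make the idea concrete. It assumes a single-channel kernel and a hypothetical helper name (`inflate_kernel`); the division by N follows the rescaling described in the I3D paper, which keeps the response on a video made of one repeated frame equal to the 2D response on that frame.

```python
import numpy as np

def inflate_kernel(kernel_2d, time_size):
    """Repeat a (k, k) kernel along a new time axis and rescale by 1 / time_size."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, :, :], time_size, axis=0)
    return kernel_3d / time_size

rng = np.random.default_rng(0)
kernel_2d = rng.standard_normal((3, 3))          # stand-in for a pretrained 2D filter
kernel_3d = inflate_kernel(kernel_2d, time_size=3)

# A static clip: the same 3x3 patch repeated for three frames.
patch = rng.standard_normal((3, 3))
static_clip = np.stack([patch] * 3)

response_2d = float((kernel_2d * patch).sum())
response_3d = float((kernel_3d * static_clip).sum())
print(np.isclose(response_2d, response_3d))
```

Because the two responses agree on static input, the inflated network starts from the behaviour of the 2D network and only has to learn how to use motion.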
The SlowFast architecture from Christoph Feichtenhofer and FAIR colleagues in late 2018 introduced two parallel pathways that sample the same clip at different temporal rates. The Slow pathway processed a few frames with many channels to capture spatial semantics; the Fast pathway processed a dense frame sequence with few channels to capture motion. Lateral connections fused features between them, and SlowFast set a new state of the art on Kinetics and AVA.
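The difference in temporal sampling between the two pathways can be illustrated with plain index arithmetic. The sketch below is an assumption-laden example: the function name is hypothetical, and the stride of 16 and speed ratio of 8 are illustrative defaults that are not stated in this article.

```python
def sample_pathway_indices(num_frames, slow_stride=16, speed_ratio=8):
    """Return frame indices for the Slow and Fast pathways of one clip."""
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be a multiple of speed_ratio")
    fast_stride = slow_stride // speed_ratio
    slow = list(range(0, num_frames, slow_stride))  # sparse frames, many channels
    fast = list(range(0, num_frames, fast_stride))  # dense frames, few channels
    return slow, fast

slow_idx, fast_idx = sample_pathway_indices(num_frames=64)
print(len(slow_idx), len(fast_idx))
```

With these settings a 64-frame clip yields 4 frames for the Slow pathway and 32 for the Fast pathway, and every Slow frame is also seen by the Fast pathway.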
A parallel line of work pursued efficiency directly. The Temporal Shift Module (TSM), proposed by Ji Lin, Chuang Gan, and Song Han at MIT-IBM in late 2018, shifted a small fraction of feature channels along the temporal axis inside a 2D ResNet. The shift required zero parameters and almost no FLOPs yet matched 3D CNN accuracy at 2D CNN cost. TSM became popular for edge deployment and motivated related designs such as TANet and TPN.
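The shift operation itself is simple enough to sketch directly. The following is a minimal NumPy illustration rather than the released implementation: the `(T, C, H, W)` layout, the function name, and the choice of shifting one eighth of the channels in each direction are assumptions made for the example.

```python
import numpy as np

def temporal_shift(features, shift_fraction=0.125):
    """Shift a fraction of channels by one step along time.

    features has shape (T, C, H, W). Vacated positions are zero-filled.
    """
    num_channels = features.shape[1]
    fold = int(num_channels * shift_fraction)
    out = np.zeros_like(features)
    # First block of channels: each frame receives the next frame's features.
    out[:-1, :fold] = features[1:, :fold]
    # Second block: each frame receives the previous frame's features.
    out[1:, fold:2 * fold] = features[:-1, fold:2 * fold]
    # Remaining channels are left in place.
    out[:, 2 * fold:] = features[:, 2 * fold:]
    return out

# Toy input: 4 frames, 8 channels, 1x1 spatial size; every value equals its frame index.
features = np.arange(4, dtype=float).reshape(4, 1, 1, 1) * np.ones((1, 8, 1, 1))
shifted = temporal_shift(features)
print(shifted[:, 0, 0, 0], shifted[:, 1, 0, 0], shifted[:, 2, 0, 0])
```

The operation only moves data, so it adds no learnable parameters; the 2D convolution that follows then mixes information from neighbouring frames.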
With the success of the vision transformer on still images, researchers extended attention-based designs to video in 2021. TimeSformer from Gedas Bertasius and FAIR colleagues showed that pure self-attention over spatial and temporal patches could match 3D CNN accuracy, with divided space-time attention giving the best speed-accuracy trade-off. Google Research released ViViT, which proposed factorization variants of spatiotemporal attention. Video Swin Transformer from Microsoft Research Asia adapted the shifted-window mechanism of Swin Transformer to spatiotemporal windows, reaching 84.9 percent top-1 on Kinetics-400. The MViT family (Fan et al. 2021) and improved MViTv2 (Li et al. 2021-2022) built hierarchical attention that pooled tokens across stages, reporting 86.1 percent on Kinetics-400.
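A rough count of query-key comparisons shows why divided space-time attention is cheaper than joint attention. The sketch ignores the classification token and constant factors, and the clip size (8 frames of 14x14 patches) is a hypothetical example rather than a configuration taken from this article.

```python
def joint_attention_pairs(num_frames, patches_per_frame):
    """Every token attends to every other token in the clip."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens

def divided_attention_pairs(num_frames, patches_per_frame):
    """Each token attends over time (same location), then over space (same frame)."""
    tokens = num_frames * patches_per_frame
    return tokens * (num_frames + patches_per_frame)

frames, patches = 8, 14 * 14  # hypothetical clip: 8 frames, 196 patches each
joint = joint_attention_pairs(frames, patches)
divided = divided_attention_pairs(frames, patches)
print(joint, divided, round(joint / divided, 2))
```

For this clip the joint scheme needs 2,458,624 comparisons and the divided scheme 319,872, and the gap widens as more frames are added.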
Masked image modeling migrated to video. VideoMAE (Tong et al. 2022) and VideoMAEv2 (Wang et al. 2023) masked tube-shaped regions of clips at very high ratios (up to 95 percent) and trained a vision transformer to reconstruct them. MaskFeat (Wei et al. 2021) reconstructed HOG features rather than pixels; VideoMoCo used contrastive learning inspired by MoCo; BEVT combined image and video tokens. In 2024, Meta AI released V-JEPA, which follows Yann LeCun's Joint Embedding Predictive Architecture: V-JEPA predicts the latent embeddings of masked regions, capturing semantic structure rather than texture. V-JEPA 2 (2025) scaled this approach and demonstrated zero-shot transfer to robotic planning.
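Tube masking can be sketched as sampling one spatial mask and repeating it at every time step, so that a masked patch cannot be recovered by copying it from a neighbouring frame. The token grid (8 temporal by 196 spatial tokens), the function name, and the 0.9 ratio are illustrative assumptions for this example.

```python
import numpy as np

def tube_mask(num_time_tokens, num_space_tokens, mask_ratio, rng):
    """Return a boolean (T, S) mask; True marks a masked token."""
    num_masked = int(round(mask_ratio * num_space_tokens))
    masked_positions = rng.choice(num_space_tokens, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_space_tokens, dtype=bool)
    spatial_mask[masked_positions] = True
    # Repeat the same spatial pattern at every time step to form tubes.
    return np.tile(spatial_mask, (num_time_tokens, 1))

rng = np.random.default_rng(0)
mask = tube_mask(num_time_tokens=8, num_space_tokens=196, mask_ratio=0.9, rng=rng)
visible_tokens = int((~mask).sum())
print(mask.shape, visible_tokens)
```

Only the visible tokens (here 20 per time step) would be passed to the encoder, which is where most of the pre-training speed-up of a masked autoencoder comes from.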
The next shift was toward unified video and language models. VideoBERT (Sun et al. 2019) tokenized clips and trained a BERT-style model jointly with text from cooking-instruction videos. CLIP4Clip and VideoCLIP brought CLIP-style contrastive learning to video-text retrieval, and X-CLIP (Ni et al. 2022) added cross-frame attention. InternVideo (Wang et al. 2022) combined masked video modeling with video-language contrastive learning, and InternVideo2 (March 2024) scaled this to a 6B-parameter video encoder that reached 92.1 percent top-1 on Kinetics-400. Video-LLaMA (Zhang et al. June 2023), Video-LLaVA (Lin et al. November 2023), and VideoLLaMA 2/3 (2024-2025) wired video encoders into open-source large language models. Multimodal systems such as the proprietary Gemini models and the open-weight Qwen2-VL extended classification to general video reasoning.
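Zero-shot classification with a CLIP-style video-text model reduces to comparing one video embedding against one text embedding per class. The sketch below assumes the simplest aggregation, mean pooling of per-frame embeddings, and uses hand-written three-dimensional vectors in place of real encoder outputs; all names and values are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(frame_embeddings, text_embeddings):
    """Mean-pool frame embeddings, then rank class prompts by cosine similarity."""
    video_embedding = l2_normalize(frame_embeddings.mean(axis=0))
    similarities = l2_normalize(text_embeddings) @ video_embedding
    return int(np.argmax(similarities)), similarities

# Stand-ins for text embeddings of three class prompts (one row per class).
text_embeddings = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
# Stand-ins for the image-encoder outputs of three sampled frames.
frame_embeddings = np.array([[0.1, 0.9, 0.0],
                             [0.2, 0.8, 0.1],
                             [0.0, 1.0, 0.1]])

label, sims = zero_shot_classify(frame_embeddings, text_embeddings)
print(label)
```

No classifier head is trained here: adding a class only requires embedding a new text prompt.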
| Family | Representative models | Key idea | Typical use |
|---|---|---|---|
| Two-stream | Simonyan & Zisserman 2014, TSN | Separate RGB and optical flow CNNs | Strong on small datasets |
| 3D CNN | C3D, I3D, R(2+1)D, X3D, CSN | Spatiotemporal convolution | General-purpose recognition |
| Slow-fast | SlowFast | Two pathways at different frame rates | Action recognition and detection |
| Efficient 2D plus shift | TSM, TANet | Channel shifts inside 2D backbone | Mobile and real-time |
| Video transformer | TimeSformer, ViViT, Video Swin, MViTv2 | Self-attention over space-time tokens | Modern high-accuracy systems |
| Hybrid CNN-transformer | Uniformer | Local convolution at low levels, global attention at high | Balanced compute |
| Self-supervised | VideoMAE, VideoMAEv2, V-JEPA | Masked prediction in pixel or latent space | Pre-training without labels |
| Video-language | InternVideo, Video-LLaMA, Video-LLaVA | Joint video and text representation | Zero-shot classification, captioning, VQA |

| Model | Date | Organization | Notes |
|---|---|---|---|
| Two-stream CNN | Jun 2014 | University of Oxford | RGB plus optical flow branches |
| C3D | Dec 2014 | Facebook AI, Dartmouth, NYU | First widely used 3D CNN for video |
| I3D | May 2017 | DeepMind | Inflated 2D weights, introduced Kinetics |
| R(2+1)D | Nov 2017 | Facebook AI | Factorized spatial and temporal convolution |
| TSM | Nov 2018 | MIT-IBM Watson AI Lab | Zero-cost temporal shift inside 2D ResNet |
| SlowFast | Dec 2018 | Facebook AI Research | Dual-pathway design with fast and slow streams |
| X3D | Apr 2020 | Facebook AI Research | Family of efficient progressively expanded networks |
| TimeSformer | Feb 2021 | Facebook AI Research | Among the first convolution-free transformers for video classification |
| ViViT | Mar 2021 | Google Research | Factorized space-time attention |
| Video Swin | Jun 2021 | Microsoft Research Asia | Hierarchical shifted spatiotemporal windows |
| MViTv2 | Dec 2021 | Facebook AI Research | Improved multiscale vision transformer |
| Uniformer | Jan 2022 | SenseTime, CUHK | Hybrid CNN and transformer at different stages |
| VideoMAE | Mar 2022 | Nanjing University, Tencent | Masked autoencoder with 90 to 95 percent masking |
| InternVideo | Dec 2022 | OpenGVLab, Shanghai AI Lab | Combines masked modeling and video-language contrast |
| VideoMAEv2 | Mar 2023 | Tencent, Nanjing University | Billion-parameter masked video pre-training |
| Video-LLaMA | Jun 2023 | Alibaba DAMO | Audio-visual LLM for video understanding |
| Video-LLaVA | Nov 2023 | Peking University, Tencent | Unified visual-language model for image and video |
| V-JEPA | Feb 2024 | Meta AI | Joint embedding predictive pre-training on video |
| InternVideo2 | Mar 2024 | OpenGVLab | 6B-parameter video encoder, 92.1 percent on Kinetics-400 |
| V-JEPA 2 | 2025 | Meta AI | Scaled JEPA enabling robotic planning |

| Dataset | Year | Size | Description |
|---|---|---|---|
| HMDB-51 | 2011 | ~7,000 clips, 51 actions | Brown University; mix of movies and YouTube |
| UCF-101 | 2012 | 13,320 clips, 101 actions | UCF; YouTube clips, long-standing benchmark |
| Sports-1M | 2014 | 1.1M videos, 487 sports | Google; weakly labeled by YouTube tags |
| ActivityNet | 2015 | 20K untrimmed videos, 200 classes | Trimmed and untrimmed splits |
| YouTube-8M | 2016 | 8M videos, 4,800+ entities | Google; multi-label, video-level |
| Kinetics-400 | 2017 | ~306K clips, 400 actions | DeepMind; primary modern benchmark |
| Kinetics-600 / 700 | 2018 / 2019 | ~500K / ~650K clips | Expanded Kinetics releases with 600 and 700 action classes |
| Something-Something v1/v2 | 2017 / 2018 | ~108K (v1) / ~220K (v2) clips, 174 templates | Twenty Billion Neurons; fine-grained hand-object actions |
| AVA | 2017 | 1.6M annotations | Google; atomic visual actions, spatiotemporal localization |
| Moments in Time | 2018 | 1M clips, 339 classes | MIT-IBM; 3-second clips, broad action vocabulary |
| Charades | 2016 | 9,848 clips, 157 classes | AI2; daily indoor activities |
| EPIC-KITCHENS-100 | 2020 | 100 hours egocentric | University of Bristol; cooking with object interactions |
| HowTo100M | 2019 | 136M clips with narration | Inria; instructional videos for video-language pre-training |
The core task has spawned related problems. Temporal action localization predicts start and end times of each action in an untrimmed video (ActivityNet, THUMOS). Action segmentation densely labels every frame. Spatiotemporal action detection (AVA) requires bounding boxes around the actor. Video question answering (TGIF-QA, MSR-VTT-QA, NExT-QA) measures higher-level reasoning. Video captioning (MSR-VTT, MSVD, VATEX) and video-text retrieval evaluate alignment with natural-language descriptions. Foundation models are typically evaluated across this entire suite.
Classification accuracy is reported as top-1 and top-5 accuracy on held-out clips (Kinetics, UCF-101, HMDB-51, Something-Something). Many systems use multi-clip and multi-crop evaluation, sampling several temporal clips and spatial crops from each video and averaging their softmax outputs. Detection benchmarks use mean average precision (mAP): AVA at a spatial IoU threshold of 0.5, and ActivityNet averaged over temporal IoU thresholds from 0.5 to 0.95. Captioning uses BLEU, METEOR, ROUGE-L, and CIDEr; retrieval uses recall at K (R@1, R@5, R@10). For video-language foundation models, suites such as Perception Test, MVBench, and VideoMME aggregate dozens of subtasks.
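Multi-clip evaluation and top-k accuracy can be sketched together. The example assumes that per-clip logits are already available as arrays; the function names, the two toy videos, and their logits are hypothetical and chosen so that one video is only correct within the top two predictions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def video_scores(clip_logits):
    """Average per-clip softmax outputs; clip_logits is (num_clips, num_classes)."""
    return softmax(clip_logits).mean(axis=0)

def top_k_accuracy(scores, labels, k):
    """Fraction of videos whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Two toy videos, three clips each, four classes (illustrative logits).
video_a = np.array([[2.0, 1.0, 0.0, 0.0],
                    [0.0, 4.0, 0.0, 0.0],
                    [2.0, 1.0, 0.0, 0.0]])  # true label 0, one clip strongly votes class 1
video_b = np.array([[0.0, 0.0, 3.0, 0.0],
                    [0.0, 0.0, 2.0, 1.0],
                    [0.0, 1.0, 2.0, 0.0]])  # true label 2
scores = np.stack([video_scores(video_a), video_scores(video_b)])
labels = np.array([0, 2])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Averaging probabilities rather than taking a majority vote lets one very confident clip outweigh two mildly confident ones, which is what happens to the first video here.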
Video classification underpins many products. Content moderation pipelines at YouTube, TikTok, and other platforms flag violence, sexual content, or copyrighted material. Sports analytics companies tag plays to compile statistics and highlight reels. Surveillance systems detect falls, fights, or abandoned objects. In automotive systems, in-cabin classifiers spot driver drowsiness, distraction, or gestures. Home robotics and warehouse automation use action labels for imitation learning. Sign-language translation systems classify continuous gestures into text. Video search engines index clips for natural-language retrieval, and traffic systems classify vehicle movements at intersections.
Several challenges remain open. Long-video understanding beyond about one minute is still hard: self-attention cost grows quadratically with the number of space-time tokens, and even convolutional cost grows linearly with clip length. MovieChat, LongVLM, and LongVU tackle parts of this with memory banks or hierarchical processing, but full-movie reasoning is far from solved. Computational cost is a persistent burden because videos are orders of magnitude larger than images. Spurious correlations are pervasive: a model may label a clip as swimming because of the blue pool rather than the motion, which is why benchmarks like Something-Something are designed so that static appearance alone does not reveal the label. Label noise in web-scraped datasets reduces sample efficiency. Generalization to unseen camera angles and first-person video is a weak point of models trained on third-person YouTube data, motivating egocentric datasets such as Ego4D. Finally, public benchmarks often differ from real deployment conditions, leaving engineers without good public proxies for production metrics.