Video Classification Models
Last reviewed
May 31, 2026
Sources
22 citations
Review status
Source-backed
Revision
v4 · 5,417 words
Video classification models are machine learning systems that assign one or more category labels to a video clip, typically describing the human action depicted. The task sits at the intersection of computer vision and temporal sequence modeling and is foundational to video understanding. Modern systems use 3D convolutional neural networks, vision transformers, and large-scale self-supervised pre-training to recognize actions ranging from coarse sport categories on Kinetics to fine-grained manipulation interactions on Something-Something.
Video classification is related to but distinct from several neighboring tasks. Action detection (temporal action localization) predicts both the label and the temporal boundaries of an action inside an untrimmed video, whereas classification assumes the clip is already trimmed. Action segmentation densely labels every frame, and video captioning produces a free-form text description. Action recognition in the broadest sense encompasses all of these tasks, but the classification subproblem specifically involves labeling a pre-trimmed clip, usually with a single class. The core challenge that separates video classification from image classification is temporal modeling: a single frame may identify a sport, but disambiguating actions such as picking something up versus putting something down requires reasoning about how pixels change over time.
Before deep learning became dominant, video classification relied on hand-crafted spatiotemporal features. Improved Dense Trajectories (iDT), introduced by Heng Wang and Cordelia Schmid in 2013, tracked points across frames using optical flow and described their neighborhoods with HOG, HOF, and MBH descriptors. These local descriptors were aggregated into Fisher vectors and fed into linear support vector machines. iDT held state-of-the-art results on UCF-101 and HMDB-51 for years, and combining deep features with iDT often produced small but consistent gains.
The first deep model to clearly outperform iDT was the two-stream convolutional network of Karen Simonyan and Andrew Zisserman at Oxford in 2014. It used two convolutional neural networks: a spatial stream that consumed RGB frames and a temporal stream that consumed stacks of pre-computed optical flow. The streams were trained independently and their predictions averaged at test time. The split reflected the idea that ventral and dorsal visual pathways process appearance and motion separately, and it achieved strong results on UCF-101 and HMDB-51 with far less data than direct 3D approaches required.
The two-stream design has an inherent bottleneck: optical flow must be pre-computed outside the network using algorithms like TV-L1 or Farneback, which is slow even with GPU acceleration and adds significant preprocessing cost. Temporal Segment Networks (TSN, Wang et al. 2016) extended two-stream to longer videos by dividing each video into segments, sampling one snippet per segment, and averaging the snippet-level predictions. The two-stream paradigm remained competitive through 2017 before being eclipsed by 3D CNN approaches trained on large-scale datasets.
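The segment-based sampling that TSN relies on can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `sample_segment_indices` and the choice of the segment center at test time are assumptions made for the example.

```python
import random

def sample_segment_indices(num_frames, num_segments, training=True, rng=None):
    """Pick one frame index per equal-length segment (TSN-style sparse sampling)."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers frames [start, end).
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        if training:
            indices.append(rng.randrange(start, end))  # random snippet in the segment
        else:
            indices.append((start + end - 1) // 2)     # deterministic segment center
    return indices

# A 300-frame video split into 3 segments of 100 frames each.
print(sample_segment_indices(300, 3, training=False))  # [49, 149, 249]
```

Because the number of sampled snippets is fixed, the cost per video does not grow with video length, which is what makes the scheme practical for long videos.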
In December 2014, Du Tran and colleagues at Facebook AI Research and Dartmouth introduced C3D, replacing 2D kernels with homogeneous 3x3x3 spatiotemporal kernels learned end-to-end on Sports-1M. C3D made 3D convolution practical at scale and produced features that transferred well across action recognition and video retrieval. A key limitation was that C3D's 16-frame temporal window was insufficient to capture longer-range temporal patterns.
In May 2017, Joao Carreira and Andrew Zisserman at DeepMind released I3D (Inflated 3D ConvNet) together with the Kinetics dataset. I3D took a 2D backbone (Inception-v1) and inflated each NxN filter into NxNxN by repeating its weights N times along the temporal axis and rescaling them by 1/N, letting the network inherit ImageNet priors. After Kinetics pre-training, I3D reached 80.2 percent on HMDB-51 and 97.9 percent on UCF-101. The availability of Kinetics-400 was critical: models pretrained on Kinetics generalized far better to UCF-101 and HMDB-51 than those trained on Sports-1M alone.
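The inflation step can be illustrated with NumPy. The sketch below assumes a filter bank laid out as (out, in, kh, kw) and uses random numbers in place of pretrained weights; the 1/N rescaling follows the I3D paper and ensures that a video made of identical frames produces the same response as the original 2D filter on one frame.

```python
import numpy as np

def inflate_2d_kernel(w2d, time_dim):
    """Inflate a 2D kernel (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    Weights are repeated along the new temporal axis and divided by
    time_dim so that a static video gives the same response as the 2D filter.
    """
    w3d = np.repeat(w2d[:, :, None, :, :], time_dim, axis=2)
    return w3d / time_dim

rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))   # stand-in for a pretrained 2D filter bank
w3d = inflate_2d_kernel(w2d, time_dim=3)

# A static video patch: the same 3x3 image patch repeated over 3 frames.
patch = rng.standard_normal((3, 3, 3))                      # (in, kh, kw)
video_patch = np.repeat(patch[:, None, :, :], 3, axis=1)    # (in, t, kh, kw)

resp_2d = np.einsum("oihw,ihw->o", w2d, patch)
resp_3d = np.einsum("oithw,ithw->o", w3d, video_patch)
print(w3d.shape, np.allclose(resp_2d, resp_3d))
```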
Variants followed. R(2+1)D (Tran et al. 2017) factorized each 3D kernel into a 2D spatial convolution followed by a 1D temporal convolution, which doubles the number of nonlinearities per block and eases optimization, achieving results comparable or superior to I3D on Kinetics, UCF-101, and HMDB-51. S3D (Xie et al. 2018) and P3D (Qiu et al. 2017) explored similar factorizations from different directions. X3D (Feichtenhofer, CVPR 2020) progressively expanded a small 2D network along several axes (temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth) to find an efficient frontier; X3D models match or exceed I3D accuracy at a fraction of the computation. CSN (Tran et al. 2019) decomposed 3D convolutions using channel-separated (depthwise) operations for further efficiency gains.
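The bookkeeping behind the (2+1)D factorization can be made concrete. In the R(2+1)D paper the intermediate channel count between the spatial and temporal convolutions is chosen so that the factorized block has about as many parameters as the full 3D convolution it replaces; the sketch below reproduces that calculation (bias terms ignored, function names hypothetical).

```python
def r2plus1d_mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate channels M that match the parameter count of a t x d x d 3D conv."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)

def param_counts(c_in, c_out, t=3, d=3):
    """Return (full 3D params, factorized (2+1)D params, intermediate channels)."""
    m = r2plus1d_mid_channels(c_in, c_out, t, d)
    full_3d = t * d * d * c_in * c_out       # one t x d x d convolution
    spatial = d * d * c_in * m               # 1 x d x d convolution into M channels
    temporal = t * m * c_out                 # t x 1 x 1 convolution out of M channels
    return full_3d, spatial + temporal, m

print(param_counts(64, 64))  # (110592, 110592, 144)
```

The extra nonlinearity sits between the spatial and temporal convolutions, so the block gains depth without gaining parameters.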
The SlowFast architecture from Christoph Feichtenhofer and FAIR colleagues in late 2018 introduced two parallel pathways at different temporal sampling rates. The Slow pathway processed a few frames (e.g., 8 frames) with many channels to capture spatial semantics. The Fast pathway processed a dense sequence (e.g., 32 frames) with far fewer channels to capture motion. Lateral connections fused features between pathways at multiple resolution levels, and SlowFast set a new state of the art on Kinetics and on the AVA spatiotemporal action detection benchmark. Most 3D CNN architectures, including SlowFast, are compute-heavy, often requiring tens to hundreds of billions of FLOPs per clip, which remains a challenge for real-time deployment.
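The two sampling rates can be sketched as index selection over one raw clip. The 64-frame raw clip and the speed ratio `alpha=4` are assumptions chosen so that the result matches the 8-frame and 32-frame example above; they are not the only configuration used in practice.

```python
def slowfast_indices(num_frames=64, fast_frames=32, alpha=4):
    """Return (slow, fast) frame indices sampled from one raw clip.

    The Fast pathway takes fast_frames evenly strided frames; the Slow
    pathway keeps every alpha-th frame of the Fast pathway.
    """
    stride = num_frames // fast_frames
    fast = list(range(0, num_frames, stride))[:fast_frames]
    slow = fast[::alpha]
    return slow, fast

slow, fast = slowfast_indices()
print(len(slow), len(fast))  # 8 32
print(slow)                  # [0, 8, 16, 24, 32, 40, 48, 56]
```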
A parallel line of work pursued efficiency directly. The Temporal Shift Module (TSM), proposed by Ji Lin, Chuang Gan, and Song Han at MIT-IBM in late 2018, shifted a small fraction of feature channels along the temporal axis inside a 2D ResNet. The shift adds no parameters and almost no FLOPs, yet TSM matched or exceeded 3D CNN accuracy at roughly 2D CNN inference cost. This made TSM popular for edge deployment and motivated related efficient temporal designs such as TANet and TPN.
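The shift operation itself is only a few array assignments. The sketch below assumes a (T, C, H, W) feature layout and shifts one eighth of the channels in each temporal direction (`shift_div=8`), with zero padding at the clip boundaries; the fraction is an assumption for the example.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along time. x has shape (T, C, H, W)."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                    # first group: moved one step later
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]    # second group: moved one step earlier
    out[:, 2 * fold:] = x[:, 2 * fold:]               # remaining channels untouched
    return out

# 4 frames, 8 channels, 1x1 spatial size; value at (t, c) is t * 8 + c.
x = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
y = temporal_shift(x)
print(y[:, 0, 0, 0])  # channel 0 now holds the previous frame's values
print(y[:, 1, 0, 0])  # channel 1 now holds the next frame's values
```

After the shift, an ordinary 2D convolution applied to each frame sees features from neighboring frames, which is how temporal information is mixed without 3D kernels.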
With the success of the vision transformer on still images, researchers extended attention-based designs to video in 2021. TimeSformer (Bertasius et al., ICML 2021) from FAIR demonstrated that a convolution-free approach built on pure self-attention over spatial and temporal patches could match and exceed 3D CNN accuracy. Its key innovation was divided space-time attention. Full joint spatiotemporal attention compares every patch with all N x T patches in the clip (N spatial patches per frame, T frames). TimeSformer instead applies temporal attention first (comparing each patch to the patches at the same location in the other frames) and then spatial attention within each frame, reducing the number of comparisons per patch from roughly N x T to N + T. According to its authors, TimeSformer is roughly three times faster to train and needs less than one-tenth the inference compute of comparable 3D CNNs, and it can process clips of up to 96 frames (about 102 seconds), which would be intractable for 3D CNNs.
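The saving from divided attention is easy to quantify. The sketch below counts how many keys each patch attends to under the two schemes, assuming 224x224 frames, 16x16 patches, and ignoring the classification token.

```python
def attention_comparisons(num_frames, image_size=224, patch_size=16):
    """Return (patches per frame, joint comparisons, divided comparisons) per query patch."""
    n = (image_size // patch_size) ** 2   # spatial patches per frame
    t = num_frames
    joint = n * t      # joint space-time attention: every patch in the clip
    divided = n + t    # temporal pass over t frames, then spatial pass over n patches
    return n, joint, divided

print(attention_comparisons(8))   # (196, 1568, 204)
print(attention_comparisons(96))  # (196, 18816, 292)
```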
Google Research released ViViT (Arnab et al., ICCV 2021), which proposed multiple factorization variants of spatiotemporal attention and demonstrated state-of-the-art results on Kinetics-400/600, Epic-Kitchens, Something-Something v2, and Moments in Time. ViViT also showed how to leverage pretrained image models to train effectively on comparatively small video datasets.
Video Swin Transformer (Liu et al., 2021) from Microsoft Research Asia adapted the shifted-window mechanism of the Swin Transformer to 3D spatiotemporal windows. Rather than computing global self-attention over all video tokens, Video Swin restricts attention to local 3D windows and uses a shifted-window scheme to enable cross-window connections. This inductive bias of locality yielded excellent efficiency: Video Swin reached 84.9 percent top-1 on Kinetics-400 and 86.1 percent on Kinetics-600 using approximately 20x less pre-training data and a 3x smaller model than competing approaches at the time, as well as 69.6 percent on Something-Something v2.
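Partitioning tokens into local 3D windows, and the cyclic shift that lets neighboring windows exchange information, can be sketched with NumPy reshapes. The token grid of 16x56x56 and the window size of (8, 7, 7) are assumed values for illustration, and the attention masking that normally accompanies the cyclic shift is omitted.

```python
import numpy as np

def window_partition_3d(x, window):
    """x: (T, H, W, C); window: (wt, wh, ww). Returns (num_windows, wt*wh*ww, C)."""
    t, h, w, c = x.shape
    wt, wh, ww = window
    x = x.reshape(t // wt, wt, h // wh, wh, w // ww, ww, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three window-grid axes first
    return x.reshape(-1, wt * wh * ww, c)

def shift_windows(x, window):
    """Cyclically shift tokens by half a window along time, height, and width."""
    wt, wh, ww = window
    return np.roll(x, shift=(-(wt // 2), -(wh // 2), -(ww // 2)), axis=(0, 1, 2))

window = (8, 7, 7)
tokens = np.arange(16 * 56 * 56 * 4, dtype=np.float32).reshape(16, 56, 56, 4)

windows = window_partition_3d(tokens, window)
shifted_windows = window_partition_3d(shift_windows(tokens, window), window)
print(windows.shape)  # (128, 392, 4): attention runs inside each 392-token window
```

Attention cost then scales with the window size rather than with the full token count, which is the source of the efficiency described above.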
The MViT family (Fan et al., 2021) introduced a hierarchical multiscale vision transformer for video: early stages process tokens at high spatial and temporal resolution with few channels, and later stages progressively reduce resolution while expanding channels, using pooling attention to manage computation. The improved MViTv2 (Li et al., CVPR 2022) added decomposed relative positional embeddings and residual pooling connections, reaching 86.1 percent top-1 on Kinetics-400 and serving as a unified architecture for video classification, image classification, and object detection.
UniFormer (Li et al., ICLR 2022) from SenseTime and CUHK observed that pure convolutions handle local redundancy well while pure attention captures global dependencies, but neither does both efficiently. UniFormer integrates convolution and self-attention within a single unified transformer format: shallow layers use local Multi-Head Relation Aggregation (MHRA), which operates like a 3D convolution to reduce spatiotemporal redundancy, while deep layers switch to global MHRA, which uses full attention to capture long-range temporal dependencies. With only ImageNet-1K pre-training, UniFormer achieves 82.9 percent / 84.8 percent top-1 on Kinetics-400/600 while requiring roughly 10x fewer GFLOPs than competing methods. UniFormerV2 (2022) extended the approach by arming pretrained image ViTs with video UniFormer modules, reaching state-of-the-art on eight benchmarks including Kinetics-400/600/700, Something-Something v1/v2, ActivityNet, and HACS.
Masked autoencoding migrated from images to video. VideoMAE (Tong et al., NeurIPS 2022 Spotlight) applied tube masking (consistent masks across frames) at extremely high masking ratios of 90 to 95 percent. Such ratios are feasible because video is temporally redundant, and sharing the mask across frames prevents the model from copying a masked patch from a neighboring frame, which keeps reconstruction challenging. VideoMAE trained a vanilla ViT backbone and achieved 87.4 percent top-1 on Kinetics-400 without extra training data, suggesting that data quality matters more than data quantity for self-supervised video pre-training. The authors also report strong representations learned from datasets of only about 3,000 to 4,000 videos, a dramatic improvement in data efficiency over supervised approaches.
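Tube masking amounts to drawing one spatial mask and repeating it over time. The sketch below assumes 8 temporal token slices and a 14x14 spatial grid (196 positions); these sizes and the function name `tube_mask` are illustrative assumptions.

```python
import numpy as np

def tube_mask(num_time_slices, num_spatial_tokens, mask_ratio=0.9, seed=0):
    """Return a boolean mask (time, space); True marks a masked token."""
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_spatial_tokens))
    spatial = np.zeros(num_spatial_tokens, dtype=bool)
    spatial[rng.choice(num_spatial_tokens, size=num_masked, replace=False)] = True
    # The same spatial pattern is reused for every time slice, forming tubes.
    return np.tile(spatial, (num_time_slices, 1))

mask = tube_mask(8, 196, mask_ratio=0.9)
visible = int((~mask).sum())
print(mask.shape, visible)  # (8, 196) 160: the encoder sees 160 of 1568 tokens
```

Because the encoder processes only the visible tokens, a 90 percent ratio also cuts pre-training compute substantially.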
VideoMAE V2 (Wang et al., CVPR 2023) scaled VideoMAE with a dual-masking strategy: the encoder operates on one subset of video tokens and the decoder processes a different subset, enabling efficient pre-training of billion-parameter models. VideoMAE V2 used a progressive training paradigm, first pre-training on a diverse unlabeled multi-source dataset and then performing post-pre-training on labeled data, to train a billion-parameter video ViT that reached 90.0 percent top-1 on Kinetics-400, 89.9 percent on Kinetics-600, 68.7 percent on Something-Something v1, and 77.0 percent on Something-Something v2.[^vmae2]
MaskFeat (Wei et al. 2021) reconstructed HOG features rather than pixels; VideoMoCo applied contrastive learning inspired by MoCo; BEVT combined image and video masked tokens. These approaches established self-supervised video pre-training as a productive alternative to supervised training on labeled video datasets.
In February 2024, Meta AI released V-JEPA (Video Joint Embedding Predictive Architecture), following Yann LeCun's JEPA framework. Unlike masked autoencoders that reconstruct pixel values, V-JEPA predicts the latent embeddings of masked spatiotemporal regions, encouraging the model to capture semantic structure rather than texture or low-level appearance.
V-JEPA 2, released in June 2025, scaled this approach substantially: the model uses a ViT-g backbone with one billion parameters, trained on a curated dataset (VideoMix22M) comprising more than one million hours of internet video plus one million images. Key architectural improvements include 3D rotary position embeddings (RoPE) and support for 64-frame clips at 384x384 resolution. V-JEPA 2 achieved 77.3 percent top-1 on Something-Something v2, substantially outperforming InternVideo2's 67.7 percent on that benchmark, and set state-of-the-art on action anticipation on Epic-Kitchens-100 (39.7 recall-at-5). A separate action-conditioned variant called V-JEPA 2-AC, trained on only 62 hours of unlabeled robot videos, enables zero-shot robotic planning: the system performs model-predictive control on Franka arms for pick-and-place tasks without collecting any environment-specific data, running up to 30x faster than competing systems like Nvidia Cosmos.[^vjepa2]
The next shift was toward unified video and language models, often described as video foundation models. VideoBERT (Sun et al. 2019) tokenized clips and trained a BERT-style model jointly with cooking-instruction text. CLIP4Clip and VideoCLIP adapted CLIP for video-text retrieval by transferring CLIP's image-text alignment to the video domain, and X-CLIP (Ni et al. 2022) added cross-frame attention to better aggregate temporal context when matching video to text.
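Once video and text share an embedding space, classification reduces to nearest-neighbor search over class-name embeddings. The sketch below is a simplified illustration, not the procedure of any specific model named above: it mean-pools per-frame embeddings into one video vector and picks the class whose text embedding has the highest cosine similarity. The toy 3-dimensional embeddings stand in for encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(frame_embeddings, text_embeddings):
    """frame_embeddings: (T, D); text_embeddings: (K, D). Returns (class index, similarities)."""
    video = l2_normalize(frame_embeddings.mean(axis=0))  # mean-pool frames into one vector
    text = l2_normalize(text_embeddings)
    sims = text @ video                                  # cosine similarity per class
    return int(np.argmax(sims)), sims

# Toy embeddings: three class prompts and two frames that lean toward class 1.
text_embeddings = np.eye(3)
frame_embeddings = np.array([[0.1, 0.9, 0.0],
                             [0.0, 1.0, 0.2]])
best, sims = zero_shot_classify(frame_embeddings, text_embeddings)
print(best)
```

Mean pooling discards frame order, which is one motivation for the cross-frame attention mechanisms mentioned above.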
InternVideo (Wang et al. 2022) from OpenGVLab and Shanghai AI Lab combined masked video modeling with video-language contrastive learning in a unified training framework. InternVideo2 (Wang et al., ECCV 2024) scaled the video encoder to 6 billion parameters and introduced a progressive training strategy that unifies masked video token reconstruction, crossmodal contrastive learning, and next-token prediction across over 400 million training samples. InternVideo2-6B achieved 92.1 percent top-1 on Kinetics-400 using only 16 input frames, surpassing previous approaches that relied on larger resolution or model ensembles, and validated performance on more than 60 video and audio understanding tasks.[^iv2]
VideoPrism (Zhao et al., ICML 2024) from Google DeepMind took a different approach to building a foundational video encoder. Pre-trained on 36 million high-quality video-caption pairs plus 582 million clips with noisy text such as ASR transcripts, VideoPrism uses a global-local distillation of semantic video embeddings combined with a token shuffling scheme to improve over standard masked autoencoding. As a single frozen model without task-specific fine-tuning, VideoPrism reached state-of-the-art results on 31 of 33 video understanding benchmarks evaluated, spanning classification, video-text retrieval, video captioning, and video question answering.[^videoprism]
Video-LLaMA (Zhang et al., June 2023), Video-LLaVA (Lin et al., October 2023), and VideoLLaMA 2/3 (2024-2025) wired video encoders into open-source large language models, enabling instruction-following dialogue about video content. Proprietary systems such as Gemini 1.5 Pro (with a 1M token context window capable of processing about one hour of video) and open-weight models such as Qwen2-VL extended classification to general video reasoning and long-context video understanding.
| Family | Representative models | Key idea | Typical use |
|---|---|---|---|
| Two-stream | Simonyan & Zisserman 2014, TSN | Separate RGB and optical flow CNNs | Strong on small datasets |
| 3D CNN | C3D, I3D, R(2+1)D, X3D, CSN | Spatiotemporal convolution | General-purpose recognition |
| Slow-fast | SlowFast | Two pathways at different frame rates | Action recognition and detection |
| Efficient 2D plus shift | TSM, TANet | Channel shifts inside 2D backbone | Mobile and real-time |
| Video transformer | TimeSformer, ViViT, Video Swin, MViTv2 | Self-attention over space-time tokens | Modern high-accuracy systems |
| Hybrid CNN-transformer | UniFormer, UniFormerV2 | Local convolution at low levels, global attention at high | Balanced compute |
| Self-supervised | VideoMAE, VideoMAE V2, V-JEPA, V-JEPA 2 | Masked prediction in pixel or latent space | Pre-training without labels |
| Video-language / foundation | InternVideo2, VideoPrism, Video-LLaMA, Video-LLaVA | Joint video and text representation | Zero-shot classification, captioning, VQA |
Two-stream CNNs work by processing two complementary input modalities in parallel. The spatial stream ingests single RGB frames and learns to recognize scene context and object appearance. The temporal stream ingests a stack of dense optical flow fields (10 consecutive flow fields in the original paper, each with a horizontal and a vertical displacement channel, giving 20 input channels) that encode the direction and magnitude of motion. Each stream is a standard image CNN, often based on VGG or ResNet. Predictions are fused via late averaging or learned weighting. The main advantage is that RGB and optical flow capture complementary information; the main disadvantage is that optical flow computation (using methods like TV-L1) is slow and cannot be optimized end-to-end within the CNN pipeline. TSN extended two-stream to long videos through sparse temporal sampling across segments.
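Late fusion by averaging can be sketched as follows. The logits are made-up numbers standing in for the outputs of the two streams, and `temporal_weight` is a hypothetical parameter that covers both plain averaging (0.5) and weighted fusion.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the class probabilities predicted by the two streams."""
    p_spatial = softmax(np.asarray(spatial_logits, dtype=float))
    p_temporal = softmax(np.asarray(temporal_logits, dtype=float))
    return (1.0 - temporal_weight) * p_spatial + temporal_weight * p_temporal

# The spatial stream slightly prefers class 0; the temporal stream strongly prefers class 1.
fused = fuse_two_stream([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
print(int(np.argmax(fused)))
```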
3D CNNs extend standard 2D image convolutions to learn spatiotemporal features jointly. A 3D kernel of size TxHxW slides across both the spatial and temporal dimensions simultaneously, enabling the network to detect motion patterns directly. C3D used uniform 3x3x3 kernels throughout. I3D inflated a 2D backbone by repeating 2D weights along the temporal axis, bootstrapping temporal feature learning from strong spatial priors. R(2+1)D showed that factorizing into a spatial 2D convolution followed by a temporal 1D convolution doubles the nonlinearity depth, improves optimization stability, and achieves better accuracy than monolithic 3D kernels. X3D demonstrated that starting from a minimal model and progressively expanding one axis at a time (temporal duration, frame rate, spatial resolution, width, bottleneck width, depth) yields a family of models that Pareto-dominate most earlier 3D CNNs on the accuracy-FLOPs trade-off.
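A naive single-channel 3D convolution shows how a kernel with temporal extent responds to change over time. The loop-based implementation and the hand-built temporal-difference kernel below are illustrative only; real networks learn their kernels and use optimized library routines.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive single-channel 3D cross-correlation without padding.

    video: (T, H, W); kernel: (kt, kh, kw).
    """
    kt, kh, kw = kernel.shape
    t, h, w = video.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out

# Temporal-difference kernel: mean of the last frame minus mean of the first frame.
kernel = np.zeros((3, 3, 3))
kernel[0] = -1.0 / 9.0
kernel[2] = 1.0 / 9.0

static = np.ones((5, 6, 6))                                   # nothing changes over time
ramp = np.arange(5, dtype=float)[:, None, None] * np.ones((5, 6, 6))  # brightness rises by 1 per frame

print(conv3d_valid(static, kernel).max())  # no response to a static clip
print(conv3d_valid(ramp, kernel)[0, 0, 0])  # responds to the change across 2 frames
```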
Video transformers apply the self-attention mechanism from NLP transformers to video tokens. A video is divided into non-overlapping spatiotemporal patches (tubes), each linearly projected into an embedding. The key design challenge is that the number of tokens grows as the product of spatial patches and temporal frames, making full joint attention computationally prohibitive for long videos.
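The token count, and hence the size of the attention matrix, follows directly from the patch geometry. The sketch below assumes a 16-frame 224x224 RGB clip, tubelets of 2x16x16, and a hypothetical embedding width of 64; the projection matrix is random and stands in for a learned linear layer.

```python
import numpy as np

def tubelet_tokens(video, tubelet=(2, 16, 16)):
    """video: (T, H, W, C) -> (num_tokens, tt*th*tw*C) flattened non-overlapping tubelets."""
    t, h, w, c = video.shape
    tt, th, tw = tubelet
    x = video.reshape(t // tt, tt, h // th, th, w // tw, tw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, tt * th * tw * c)

video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubelet_tokens(video)

rng = np.random.default_rng(0)
projection = rng.standard_normal((tokens.shape[1], 64)).astype(np.float32)
embeddings = tokens @ projection          # linear projection of each tubelet

num_tokens = tokens.shape[0]
print(tokens.shape, embeddings.shape)     # (1568, 1536) (1568, 64)
print(num_tokens ** 2)                    # 2458624 entries in one joint attention matrix
```

Doubling the number of frames doubles the token count and quadruples the joint attention matrix, which is why the strategies that follow restrict or factorize attention.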
The main strategies to manage this are: divided or factorized attention (TimeSformer, ViViT), which alternates between temporal and spatial attention; local windowed attention (Video Swin), which restricts attention to a 3D neighborhood; hierarchical pooling attention (MViT/MViTv2), which progressively reduces the number of tokens; and hybrid approaches (UniFormer) that use convolution-like local aggregation in shallow stages and global attention in deep stages. Most video transformers benefit substantially from pre-training on large image datasets (ImageNet) or video datasets (Kinetics, HowTo100M) before task-specific fine-tuning.
| Model | Date | Organization | K-400 top-1 | Notes |
|---|---|---|---|---|
| Two-stream CNN | Jun 2014 | University of Oxford | n/a | RGB plus optical flow branches |
| C3D | Dec 2014 | Facebook AI, Dartmouth, NYU | n/a (Sports-1M era) | First widely used 3D CNN for video |
| I3D | May 2017 | DeepMind | ~72% | Inflated 2D weights; introduced Kinetics-400 |
| R(2+1)D | Nov 2017 | Facebook AI | ~73% | Factorized spatial and temporal convolution |
| TSM | Nov 2018 | MIT-IBM Watson AI Lab | ~74% | Zero-cost temporal shift inside 2D ResNet |
| SlowFast | Dec 2018 | Facebook AI Research | ~79% | Dual-pathway design with fast and slow streams |
| X3D | Apr 2020 | Facebook AI Research | ~79-80% | Family of efficient progressively expanded networks |
| TimeSformer | Feb 2021 | Facebook AI Research | 80.7% | First pure transformer for video classification (TimeSformer-L; 82.2% on Kinetics-600) |
| ViViT | Mar 2021 | Google Research | ~80% | Factorized space-time attention; ICCV 2021 |
| Video Swin | Jun 2021 | Microsoft Research Asia | 84.9% | Hierarchical shifted spatiotemporal windows |
| MViTv2 | Dec 2021 | Facebook AI Research | 86.1% | Improved multiscale vision transformer; CVPR 2022 |
| UniFormer | Jan 2022 | SenseTime, CUHK | 82.9% | Hybrid CNN and transformer; ICLR 2022 |
| VideoMAE | Mar 2022 | Nanjing University, Tencent | 87.4% | Masked autoencoder with 90 to 95 percent masking; NeurIPS 2022 |
| InternVideo | Dec 2022 | OpenGVLab, Shanghai AI Lab | ~88% | Combines masked modeling and video-language contrast |
| VideoMAE V2 | Mar 2023 | Tencent, Nanjing University | 90.0% | Dual masking; billion-param ViT; CVPR 2023 |
| Video-LLaMA | Jun 2023 | Alibaba DAMO | n/a | Audio-visual LLM for video understanding |
| Video-LLaVA | Oct 2023 | Peking University, Tencent | n/a | Unified visual-language model for image and video |
| VideoPrism | Feb 2024 | Google DeepMind | n/a | Frozen encoder; SOTA on 31 of 33 video benchmarks |
| V-JEPA | Feb 2024 | Meta AI | n/a | Joint embedding predictive pre-training on video |
| InternVideo2 | Mar 2024 | OpenGVLab | 92.1% | 6B encoder; ECCV 2024 |
| V-JEPA 2 | Jun 2025 | Meta AI | n/a | Scaled JEPA; 77.3% SSv2; zero-shot robotic planning |
Top-1 accuracy on Kinetics-400 has served as the primary measure of progress in video classification since 2017; most recent papers report it on the validation split. The table below shows the historical progression.
| Year | Model | Top-1 (%) | Key advance |
|---|---|---|---|
| 2017 | I3D (RGB only) | ~71 | Inflated 3D from ImageNet |
| 2017 | I3D (two-stream) | ~74 | Added optical flow |
| 2018 | SlowFast | ~79 | Dual temporal pathway |
| 2021 | TimeSformer-L | 80.7 | First pure video transformer |
| 2021 | Video Swin-L | 84.9 | Shifted-window spatiotemporal attention |
| 2022 | MViTv2-L | 86.1 | Pooling attention + relative positional bias |
| 2022 | VideoMAE ViT-H | 87.4 | Self-supervised tube masking |
| 2023 | VideoMAE V2 ViT-g | 90.0 | Dual masking; billion-param ViT |
| 2024 | InternVideo2-6B | 92.1 | 6B encoder; multi-objective pre-training |
Each major jump corresponds to a new architectural or training paradigm: 3D CNNs establishing the baseline, SlowFast exploiting dual temporal resolution, video transformers unlocking larger scale, and masked pre-training enabling efficient training of billion-parameter models.
| Dataset | Year | Size | Description |
|---|---|---|---|
| HMDB-51 | 2011 | ~7,000 clips, 51 actions | Brown University; mix of movies and YouTube |
| UCF-101 | 2012 | 13,320 clips, 101 actions | UCF; YouTube clips, long-standing benchmark |
| Sports-1M | 2014 | 1.1M clips, 487 sports | Google; weakly labeled by YouTube tags |
| ActivityNet | 2015 | 20K untrimmed videos, 200 classes | Trimmed and untrimmed splits |
| YouTube-8M | 2016 | 8M videos, 4,800+ entities | Google; multi-label, video-level |
| Kinetics-400 | 2017 | ~306K clips, 400 actions | DeepMind; primary modern benchmark; ~10-second clips, 400+ per class |
| Kinetics-600 / 700 | 2018 / 2019 | ~500K / ~650K clips | Expanded Kinetics splits |
| Something-Something v1/v2 | 2017 | 220K clips, 174 templates | Twenty Billion Neurons; fine-grained hand-object actions requiring temporal reasoning |
| AVA | 2017 | 1.6M annotations | Google; atomic visual actions, spatiotemporal localization |
| Moments in Time | 2018 | 1M clips, 339 classes | MIT-IBM; 3-second clips, broad action vocabulary |
| Charades | 2016 | 9,848 clips, 157 classes | AI2; daily indoor activities |
| EPIC-KITCHENS-100 | 2020 | 100 hours egocentric | University of Bristol; cooking with object interactions |
| HowTo100M | 2019 | 136M clips with narration | Inria; instructional videos for video-language pre-training |
| Ego4D | 2021 | 3,670 hours egocentric | Meta AI / multiple universities; daily life from first-person perspective |
The Kinetics family of datasets from DeepMind is the central modern benchmark for video classification. Kinetics-400 contains approximately 306K video clips across 400 human action classes, each clip approximately 10 seconds long and sourced from YouTube. At least 400 clips are provided per class, spanning both human-object interactions (playing musical instruments, using tools) and human-human interactions (handshakes, fighting). Four versions have been released: Kinetics-400 (2017), Kinetics-600 (2018, ~500K clips, 600 classes), Kinetics-700 (2019, ~650K clips, 700 classes), and Kinetics-700-2020. Pre-training on Kinetics provides strong initialization for transfer to smaller benchmarks like UCF-101 and HMDB-51.
Something-Something (v1: 2017, v2: 2018) was created by Twenty Billion Neurons (later acquired by Qualcomm) and contains over 220K short video clips of people performing hand-object interactions described by 174 caption templates such as "Pushing something from left to right" or "Pretending to pour something out of something." The dataset was specifically designed to require temporal reasoning, not just static appearance recognition. Actions like "Moving something closer to something" and "Moving something away from something" can look nearly identical in any single frame and can only be reliably distinguished by understanding the direction of motion over time. This makes Something-Something a harder test of temporal modeling than Kinetics, and models that rely heavily on scene statistics (background shortcuts) perform poorly on it. Something-Something v2 top-1 accuracy is a standard secondary benchmark; as of 2025, VideoMAE V2 reports 77.0 percent and V-JEPA 2 reports 77.3 percent.
The Atomic Visual Actions (AVA) dataset from Google annotates 15-minute clips from 430 movies with 1.6 million spatiotemporal action annotations. Each annotation consists of a bounding box around an actor at a specific second within the video and one or more of 80 atomic action labels such as "sit", "walk", or "talk to". AVA tests spatiotemporal action detection rather than clip-level classification, requiring models to both localize actors and classify their actions simultaneously. Performance is measured by mean average precision (mAP) at IoU 0.5 on the detected bounding boxes.
The core task has spawned related problems. Temporal action localization predicts start and end times of each action in an untrimmed video (ActivityNet, THUMOS14). Methods fall into three paradigms: anchor-based approaches propose candidate temporal windows and classify them; boundary-based approaches detect action start and end boundaries separately; and query-based approaches use learned queries to directly regress temporal extents, similar to DETR for object detection. A key challenge is that classifiers trained for clip-level recognition often fail on temporal localization because they are not sensitive to action boundaries, an issue known as the training task discrepancy problem.
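Localization quality is usually scored with temporal intersection over union (tIoU) between a predicted segment and a ground-truth segment. A minimal sketch, with segments given as (start, end) times in seconds:

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
print(temporal_iou((0.0, 1.0), (5.0, 6.0)))   # 0.0
```

A prediction counts as correct at a given threshold only if its label matches and its tIoU with a ground-truth segment meets that threshold.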
Action segmentation densely labels every frame of an untrimmed video, used in cooking and procedure understanding. Spatiotemporal action detection (AVA, JHMDB) requires bounding boxes around the actor. Video question answering (TGIF-QA, MSR-VTT-QA, NExT-QA, PerceptionTest) measures higher-level reasoning about video content. Video captioning (MSR-VTT, MSVD, VATEX) and video-text retrieval evaluate alignment with natural-language descriptions. Action anticipation predicts future actions before they occur (Epic-Kitchens-100, forecast split). Foundation models are typically evaluated across this entire suite rather than on classification alone.
Classification accuracy is reported as top-1 and top-5 on held-out clips (Kinetics, UCF-101, HMDB-51, Something-Something). Many systems use multi-crop or multi-clip evaluation, dividing a video into overlapping segments and averaging softmax outputs; a common protocol is 3 spatial crops times 10 temporal clips, or 30 views per video. Detection benchmarks use mean average precision (mAP) at an IoU threshold of 0.5 for AVA or averaged over temporal IoU thresholds from 0.5 to 0.95 for ActivityNet. Captioning uses BLEU, METEOR, ROUGE-L, and CIDEr; retrieval uses recall at K (R@1, R@5, R@10). Action anticipation uses recall-at-5 on Epic-Kitchens-100. For video-language foundation models, benchmark suites such as Perception Test, MVBench, and VideoMME combine many subtasks into an overall score.
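Multi-view evaluation and top-k scoring can be sketched as follows. The logits are synthetic and the helper names are hypothetical; each video contributes an array of shape (views, classes), for example 30 views from 3 crops times 10 clips.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def video_topk(view_logits, k=5):
    """Average softmax scores over all views, then return the k best class indices."""
    mean_probs = softmax(view_logits).mean(axis=0)
    return np.argsort(-mean_probs)[:k].tolist()

def topk_accuracy(all_view_logits, labels, k=1):
    hits = [label in video_topk(v, k) for v, label in zip(all_view_logits, labels)]
    return sum(hits) / len(hits)

# Video A: 20 of 30 views vote for class 2, the rest for class 4 (true label 2).
video_a = np.zeros((30, 5)); video_a[:20, 2] = 5.0; video_a[20:, 4] = 5.0
# Video B: 20 of 30 views vote for class 1, the rest for class 0 (true label 0).
video_b = np.zeros((30, 5)); video_b[:20, 1] = 1.0; video_b[20:, 0] = 1.0

videos, labels = [video_a, video_b], [2, 0]
print(topk_accuracy(videos, labels, k=1))  # 0.5
print(topk_accuracy(videos, labels, k=2))  # 1.0
```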
Several toolkits and libraries support research and deployment of video classification models.
MMAction2 from OpenMMLab is the primary open-source toolkit for video understanding research. It implements most major architectures including TSN, TSM, I3D, SlowFast, R(2+1)D, X3D, TimeSformer, Video Swin, MViTv2, VideoMAE, and UniFormer, with a unified configuration-driven interface and support for Kinetics, Something-Something, and AVA training.
PyTorchVideo from Facebook AI Research provides video datasets, transforms, and model implementations including SlowFast and X3D optimized for production use.
Hugging Face Transformers includes VideoMAE, TimeSformer, and ViViT via the AutoModel interface, and the Hugging Face Hub hosts pretrained checkpoints for most major video transformer models.
Decord is a fast video reader optimized for deep learning pipelines, supporting random temporal access and on-the-fly decoding without reading full video files into memory.
FFmpeg and OpenCV handle video decoding in most research pipelines, and torchvision includes basic video reading and clip sampling utilities.
Video classification underpins many products. Content moderation pipelines at YouTube, TikTok, and other platforms flag violence, sexual content, or copyrighted material. Sports analytics companies tag plays to compile statistics and highlight reels. Surveillance systems detect falls, fights, or abandoned objects. In autonomous driving, classifiers spot drowsiness, distraction, or gestures inside the vehicle cabin. Home robotics and warehouse automation use action labels for imitation learning, with V-JEPA 2 demonstrating that video representation learning can directly enable robotic manipulation planning. Sign-language translation systems classify continuous gestures into text. Video search engines index clips for natural-language retrieval, and traffic systems classify vehicle movements at intersections. Medical monitoring systems detect patient falls, medication administration, or abnormal behaviors in hospital settings. Fitness applications analyze exercise form and count repetitions using action recognition models running on device cameras.
Several challenges remain open.
Long-video understanding beyond about one minute is still hard because the cost of attention or convolution grows with temporal extent, and most architectures process clips of 8 to 64 frames. MovieChat, LongVLM, and LongVU tackle parts of this with memory banks or hierarchical processing, but full-movie reasoning is far from solved.
Computational cost is a persistent burden because videos are orders of magnitude larger than images. A 10-second clip at 30 fps produces 300 frames. 3D CNNs and video transformers processing many frames require tens to hundreds of billions of FLOPs per clip, making deployment on edge devices difficult. Efficient approaches like TSM and X3D partially address this trade-off.
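The size gap between a clip and a single image is simple arithmetic. The sketch below assumes frames resized to 224x224 RGB, a common but not universal input size.

```python
def raw_clip_size(seconds, fps, height=224, width=224, channels=3):
    """Return (number of frames, number of raw pixel values) for an uncompressed clip."""
    frames = int(round(seconds * fps))
    return frames, frames * height * width * channels

frames, values = raw_clip_size(10, 30)
print(frames, values)                 # 300 45158400
print(values // (224 * 224 * 3))      # 300 times the values of one image
```

This is why most models subsample to a few dozen frames per clip rather than processing every frame.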
Spurious correlations and shortcut learning are pervasive: a model may label a clip as swimming because of the blue pool background rather than because of the motion, or classify tennis because of the court. Benchmarks like Something-Something deliberately remove static cues to test genuine temporal reasoning. Models trained primarily on Kinetics often exploit background context rather than motion.
Label noise in web-scraped datasets like Kinetics reduces sample efficiency and requires careful data curation.
Generalization to unseen camera angles and first-person (egocentric) video is a weak point of models trained on third-person YouTube data, motivating datasets such as Ego4D and EPIC-KITCHENS-100 that capture first-person perspectives in uncontrolled environments.
Optical flow dependency in two-stream networks requires slow preprocessing that cannot be fully parallelized on GPU, and dense optical flow extraction introduces latency that makes real-time two-stream deployment difficult. TSM and 3D CNN approaches avoid this bottleneck entirely.
Evaluation gaps between public benchmarks and real deployments often leave engineers without good proxies for production metrics, particularly for domain-specific actions that are not well represented in Kinetics or Something-Something.
Adversarial brittleness: like other neural networks, video classifiers are susceptible to imperceptible perturbations that can dramatically change predicted labels, an important concern for safety-critical applications.