Video Classification Models

Updated May 31, 2026

Video classification models are machine learning systems that assign one or more category labels to a video clip, typically describing the human action depicted. The task sits at the intersection of computer vision and temporal sequence modeling and is foundational to video understanding. Modern systems use 3D convolutional neural networks, vision transformers, and large-scale self-supervised pre-training to recognize actions ranging from coarse sport categories on Kinetics to fine-grained manipulation interactions on Something-Something.

Video classification is related to but distinct from several neighboring tasks, all of which fall under action recognition in the broadest sense. The classification subproblem labels a pre-trimmed clip, usually with a single class. Action detection (temporal action localization) predicts both the label and the temporal boundaries of an action inside an untrimmed video. Action segmentation densely labels every frame, and video captioning produces a free-form text description. The core challenge that separates video classification from image classification is temporal modeling: a single frame may identify a sport, but disambiguating actions such as picking something up versus putting something down requires reasoning about how pixels change over time.

History

Pre-deep-learning era

Before deep learning became dominant, video classification relied on hand-crafted spatiotemporal features. Improved Dense Trajectories (iDT), introduced by Heng Wang and Cordelia Schmid in 2013, tracked points across frames using optical flow and described their neighborhoods with HOG, HOF, and MBH descriptors. These local descriptors were aggregated into Fisher vectors and fed into linear support vector machines. iDT held state-of-the-art results on UCF-101 and HMDB-51 for years, and combining deep features with iDT often produced small but consistent gains.

Two-stream networks

The first deep model to clearly outperform iDT was the two-stream convolutional network of Karen Simonyan and Andrew Zisserman at Oxford in 2014. It used two convolutional neural networks: a spatial stream that consumed RGB frames and a temporal stream that consumed stacks of pre-computed optical flow. The streams were trained independently and their predictions averaged at test time. The split reflected the idea that ventral and dorsal visual pathways process appearance and motion separately, and it achieved strong results on UCF-101 and HMDB-51 with far less data than direct 3D approaches required.

The two-stream design has an inherent bottleneck: optical flow must be pre-computed outside the network using algorithms like TV-L1 or Farneback, which is slow and adds significant preprocessing cost. Temporal Segment Networks (TSN, Wang et al. 2016) extended two-stream to longer videos by dividing each video into equal-length segments, sampling one snippet per segment, and averaging the snippet predictions. The two-stream paradigm remained competitive through 2017 before being eclipsed by 3D CNN approaches trained on large-scale datasets.
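
The sparse sampling idea behind TSN can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_segment_indices`, the use of the segment centre in the deterministic case, and the 300-frame example are assumptions made here.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index per equal-length segment (TSN-style sparse sampling).

    With rng=None the centre of each segment is used (a deterministic choice
    assumed here for evaluation); otherwise the index is drawn uniformly
    inside each segment.
    """
    if num_segments < 1 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for k in range(num_segments):
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments  # exclusive bound
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(rng.randrange(start, end))
    return indices


print(sample_segment_indices(300, 3))                    # [49, 149, 249]
print(sample_segment_indices(300, 3, random.Random(0)))  # one random index per third
```

Because the number of sampled snippets is fixed, the cost per video does not grow with video length, which is what lets the approach cover long videos.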

3D convolutional networks

In December 2014, Du Tran and colleagues at Facebook AI Research and Dartmouth introduced C3D, replacing 2D kernels with homogeneous 3x3x3 spatiotemporal kernels learned end-to-end on Sports-1M. C3D made 3D convolution practical at scale and produced features that transferred well across action recognition and video retrieval. A key limitation was that C3D's 16-frame temporal window was insufficient to capture longer-range temporal patterns.

In May 2017, Joao Carreira and Andrew Zisserman at DeepMind released I3D (Inflated 3D ConvNet) together with the Kinetics dataset. I3D took a 2D backbone (Inception-v1) and inflated each NxN filter into NxNxN by repeating its weights N times along the temporal axis and rescaling them by 1/N, letting the network inherit ImageNet priors. After Kinetics pre-training it reached 80.2 percent on HMDB-51 and 97.9 percent on UCF-101. The availability of Kinetics-400 was critical: models pretrained on Kinetics generalized far better to UCF-101 and HMDB-51 than those trained on Sports-1M alone.
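
The inflation step can be illustrated with a small sketch. It assumes the rescaling by 1/N used in the I3D paper, so that a clip made of one repeated frame produces the same response as the original 2D filter on that frame. The helper names `inflate_kernel` and `conv_response`, and the example filter and patch values, are hypothetical.

```python
def inflate_kernel(kernel_2d, t):
    """Inflate a 2D kernel (H x W nested lists) into a t x H x W kernel.

    Each temporal slice is a copy of the 2D kernel scaled by 1/t.
    """
    if t < 1:
        raise ValueError("temporal size must be >= 1")
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def conv_response(kernel_3d, clip):
    """Response of one kernel at one position: sum of elementwise products."""
    return sum(
        k * x
        for k_slice, frame in zip(kernel_3d, clip)
        for k_row, x_row in zip(k_slice, frame)
        for k, x in zip(k_row, x_row)
    )


kernel_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 2.0], [5.0, 9.0, 2.0]]
static_clip = [patch] * 3  # the same frame repeated three times

response_2d = conv_response([kernel_2d], [patch])
response_3d = conv_response(inflate_kernel(kernel_2d, 3), static_clip)
print(response_2d, response_3d)  # equal up to floating-point rounding
```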

Variants followed. R(2+1)D (Tran et al. 2017) factorized each 3D kernel into a 2D spatial convolution followed by a 1D temporal convolution, which doubles the number of nonlinearities per block and eases optimization; it achieved results comparable or superior to I3D on Kinetics, UCF-101, and HMDB-51. S3D (Xie et al. 2018) and P3D (Qiu et al. 2017) explored similar factorizations from different directions. X3D (Feichtenhofer, CVPR 2020) progressively expanded a small 2D network along six axes (temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth) to find an efficient frontier; X3D models match or exceed I3D accuracy at a fraction of the computation. CSN (Tran et al. 2019) decomposed 3D convolutions using channel-separated (depthwise) operations for further efficiency gains.

SlowFast and efficient designs

The SlowFast architecture from Christoph Feichtenhofer and FAIR colleagues in late 2018 introduced two parallel pathways at different temporal sampling rates. The Slow pathway processed a few frames (e.g., 8 frames) with many channels to capture spatial semantics. The Fast pathway processed a dense sequence (e.g., 32 frames) with far fewer channels to capture motion. Lateral connections fused Fast-pathway features into the Slow pathway at multiple stages, and SlowFast set a new state of the art on Kinetics and the AVA spatiotemporal action detection benchmark. Nearly all 3D CNN architectures, including SlowFast, are compute-heavy, often requiring tens to hundreds of billions of FLOPs per clip, which remains a challenge for real-time deployment.
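
How the two pathways see the same clip can be sketched with plain index arithmetic. This is an illustration only: `slowfast_indices` is a hypothetical helper, `alpha` denotes the frame-rate ratio between the pathways (4 for the 8-frame and 32-frame example above), and the 64-frame raw clip length is an assumption.

```python
def slowfast_indices(clip_len, slow_frames, alpha):
    """Frame indices for the Slow and Fast pathways over one raw clip.

    slow_frames: number of frames seen by the Slow pathway
    alpha: frame-rate ratio, so the Fast pathway sees alpha * slow_frames frames
    """
    fast_frames = slow_frames * alpha
    if clip_len % fast_frames != 0:
        raise ValueError("clip_len must be a multiple of slow_frames * alpha")
    fast_stride = clip_len // fast_frames
    fast = list(range(0, clip_len, fast_stride))
    slow = fast[::alpha]  # every alpha-th Fast frame
    return slow, fast


slow, fast = slowfast_indices(clip_len=64, slow_frames=8, alpha=4)
print(len(slow), slow)      # 8 [0, 8, 16, 24, 32, 40, 48, 56]
print(len(fast), fast[:4])  # 32 [0, 2, 4, 6]
```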

A parallel line of work pursued efficiency directly. The Temporal Shift Module (TSM), proposed by Ji Lin, Chuang Gan, and Song Han at MIT-IBM in late 2018, shifted a small fraction of feature channels forward and backward along the temporal axis inside a 2D ResNet. The shift adds no parameters and almost no FLOPs, yet TSM matched or exceeded 3D CNN accuracy at 2D CNN inference cost. This made TSM popular for edge deployment and motivated related efficient temporal designs such as TANet and TPN.
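
The shift operation itself is simple enough to sketch on nested lists. The version below is a minimal illustration for a single spatial location, not the reference implementation: the function name `temporal_shift`, the `fold_div=8` default (one eighth of the channels shifted in each direction), and zero-filling at the clip boundaries are assumptions chosen for the example.

```python
def temporal_shift(features, fold_div=8):
    """Shift a fraction of channels along time for one spatial location.

    features: list of T frames, each a list of C channel values.
    The first C // fold_div channels take their value from the next frame,
    the following C // fold_div channels from the previous frame, and the
    remaining channels stay in place. Positions without a source frame
    are zero-filled.
    """
    t = len(features)
    c = len(features[0])
    fold = c // fold_div
    out = [[0.0] * c for _ in range(t)]
    for i in range(t):
        for ch in range(c):
            if ch < fold:          # pull from the future frame
                src = i + 1
            elif ch < 2 * fold:    # pull from the past frame
                src = i - 1
            else:                  # unshifted channels
                src = i
            if 0 <= src < t:
                out[i][ch] = features[src][ch]
    return out


# 3 frames, 8 channels; the value 10 * frame + channel makes the source visible.
feats = [[float(10 * f + ch) for ch in range(8)] for f in range(3)]
shifted = temporal_shift(feats)
print(shifted[1])  # [20.0, 1.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
```

After the shift, an ordinary 2D convolution applied to each frame mixes information from neighboring frames, which is how temporal modeling is obtained without 3D kernels.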

Video transformers

Following the success of the vision transformer on still images, researchers extended attention-based designs to video in 2021. TimeSformer (Bertasius et al., ICML 2021) from FAIR demonstrated that a convolution-free model built purely on self-attention over spatial and temporal patches could match or exceed 3D CNN accuracy. Its key innovation was divided space-time attention. Full joint spatiotemporal attention compares every patch with every other patch in the clip, so each patch takes part in T x N comparisons for T frames of N patches each. TimeSformer instead applies temporal attention first (comparing each patch to the patches at the same location in the other frames) and then spatial attention within each frame, reducing the per-patch comparisons from T x N to roughly T + N. TimeSformer is roughly three times faster to train and needs less than one-tenth the inference compute of comparable 3D CNNs, and it can process clips of up to 96 frames (spanning about 102 seconds) that would be intractable for 3D CNNs.
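
The saving from divided attention is easy to check with a short calculation. The sketch below counts query-key comparisons per layer and ignores the classification token; the helper name `attention_comparisons` and the example configuration (8 frames of 224x224 pixels with 16x16 patches) are assumptions for illustration.

```python
def attention_comparisons(num_frames, patches_per_frame):
    """Query-key comparisons per layer, ignoring the classification token."""
    tokens = num_frames * patches_per_frame
    joint_per_token = tokens                            # every token in the clip
    divided_per_token = num_frames + patches_per_frame  # time pass + space pass
    return {
        "tokens": tokens,
        "joint_per_token": joint_per_token,
        "divided_per_token": divided_per_token,
        "joint_total": tokens * joint_per_token,
        "divided_total": tokens * divided_per_token,
    }


stats = attention_comparisons(num_frames=8, patches_per_frame=(224 // 16) ** 2)
print(stats["joint_per_token"], stats["divided_per_token"])  # 1568 204
print(stats["joint_total"], stats["divided_total"])          # 2458624 319872
```

The gap widens as clips get longer, because the joint count grows with the product of frames and patches while the divided count grows with their sum.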

Google Research released ViViT (Arnab et al., ICCV 2021), which proposed multiple factorization variants of spatiotemporal attention and demonstrated state-of-the-art results on Kinetics-400/600, Epic-Kitchens, Something-Something v2, and Moments in Time. ViViT also showed how to leverage pretrained image models to train effectively on comparatively small video datasets.

Video Swin Transformer (Liu et al., 2021) from Microsoft Research Asia adapted the shifted-window mechanism of the Swin Transformer to 3D spatiotemporal windows. Rather than computing global self-attention over all video tokens, Video Swin restricts attention to local 3D windows and uses a shifted-window scheme to enable cross-window connections. This inductive bias of locality yielded excellent efficiency: Video Swin reached 84.9 percent top-1 on Kinetics-400 and 86.1 percent on Kinetics-600 using approximately 20x less pre-training data and a 3x smaller model than competing approaches at the time, as well as 69.6 percent on Something-Something v2.

The MViT family (Fan et al., 2021) introduced a hierarchical multiscale vision transformer for video: early stages process tokens at high spatial and temporal resolution with few channels, and later stages progressively reduce resolution while expanding channels, using pooling attention to manage computation. The improved MViTv2 (Li et al., CVPR 2022) added decomposed relative positional embeddings and residual pooling connections, reaching 86.1 percent top-1 on Kinetics-400 and serving as a unified architecture for video classification, image classification, and object detection.

UniFormer (Li et al., ICLR 2022) from SenseTime and CUHK observed that pure convolutions handle local redundancy well while pure attention captures global dependencies, but neither does both efficiently. UniFormer integrates convolution and self-attention within a single unified transformer format: shallow layers use local Multi-Head Relation Aggregation (MHRA), which operates like a 3D convolution to reduce spatiotemporal redundancy, while deep layers switch to global MHRA, which uses full attention to capture long-range temporal dependencies. With only ImageNet-1K pre-training, UniFormer achieves 82.9 percent / 84.8 percent top-1 on Kinetics-400/600 while requiring roughly 10x fewer GFLOPs than competing methods. UniFormerV2 (2022) extended the approach by arming pretrained image ViTs with video UniFormer modules, reaching state-of-the-art on eight benchmarks including Kinetics-400/600/700, Something-Something v1/v2, ActivityNet, and HACS.

Self-supervised pre-training

Masked autoencoding migrated from images to video. VideoMAE (Tong et al., NeurIPS 2022 Spotlight) applied tube masking (the same spatial mask repeated across frames) at extremely high masking ratios of 90 to 95 percent. Temporal redundancy in video makes such high ratios workable, and tube masking prevents the model from simply copying a masked patch from a neighboring frame, which keeps reconstruction challenging. VideoMAE trained a vanilla ViT backbone and achieved 87.4 percent top-1 on Kinetics-400 without extra training data, and its authors argue that data quality matters more than data quantity for self-supervised video pre-training. VideoMAE also learned strong representations from datasets of only about 3,000 to 4,000 videos, a dramatic improvement in data efficiency over earlier approaches.
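
Tube masking can be sketched in a few lines. This is an illustration under stated assumptions rather than the VideoMAE code: the function name `tube_mask`, the seed handling, and the grid of 8 time slots by 196 patch positions are choices made for the example.

```python
import random


def tube_mask(num_time_slots, patches_per_slot, mask_ratio, seed=0):
    """Return a [time][patch] grid of booleans, True where a token is masked.

    One spatial mask is drawn and repeated for every time slot, so a masked
    patch location stays hidden for the full length of the clip.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * patches_per_slot))
    masked_positions = set(rng.sample(range(patches_per_slot), num_masked))
    spatial_mask = [p in masked_positions for p in range(patches_per_slot)]
    return [list(spatial_mask) for _ in range(num_time_slots)]


mask = tube_mask(num_time_slots=8, patches_per_slot=196, mask_ratio=0.9)
visible = sum(not m for row in mask for m in row)
print(visible)  # 160 visible tokens out of 1568 reach the encoder
```

Only the visible tokens are passed to the encoder, which is why such high masking ratios also make pre-training cheaper.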

VideoMAE V2 (Wang et al., CVPR 2023) scaled VideoMAE with a dual-masking strategy: the encoder operates on one subset of video tokens and the decoder processes a different subset, enabling efficient pre-training of billion-parameter models. VideoMAE V2 used a progressive training paradigm, first pre-training on a diverse unlabeled multi-source dataset and then performing post-pre-training on labeled data, to train a billion-parameter video ViT that reached 90.0 percent top-1 on Kinetics-400, 89.9 percent on Kinetics-600, 68.7 percent on Something-Something v1, and 77.0 percent on Something-Something v2.[^vmae2]

MaskFeat (Wei et al. 2021) reconstructed HOG features rather than pixels; VideoMoCo applied contrastive learning inspired by MoCo; BEVT combined image and video masked tokens. These approaches established self-supervised video pre-training as a productive alternative to supervised training on labeled video datasets.

In February 2024, Meta AI released V-JEPA (Video Joint Embedding Predictive Architecture), following Yann LeCun's JEPA framework. Unlike masked autoencoders that reconstruct pixel values, V-JEPA predicts the latent embeddings of masked spatiotemporal regions, encouraging the model to capture semantic structure rather than texture or low-level appearance.

V-JEPA 2, released in June 2025, scaled this approach substantially: the model uses a ViT-g backbone with one billion parameters, trained on a curated dataset (VideoMix22M) comprising more than one million hours of internet video plus one million images. Key architectural improvements include 3D rotary position embeddings (RoPE) and support for 64-frame clips at 384x384 resolution. V-JEPA 2 achieved 77.3 percent top-1 on Something-Something v2, substantially outperforming InternVideo2's 67.7 percent on that benchmark, and set state-of-the-art on action anticipation on Epic-Kitchens-100 (39.7 recall-at-5). A separate action-conditioned variant called V-JEPA 2-AC, trained on only 62 hours of unlabeled robot videos, enables zero-shot robotic planning: the system performs model-predictive control on Franka arms for pick-and-place tasks without collecting any environment-specific data, running up to 30x faster than competing systems like Nvidia Cosmos.[^vjepa2]

Video-language foundation models

The next shift was toward unified video and language models, often described as video foundation models. VideoBERT (Sun et al. 2019) tokenized clips and trained a BERT-style model jointly with cooking-instruction text. CLIP4Clip and VideoCLIP adapted CLIP for video-text retrieval by transferring CLIP's image-text alignment to the video domain, and X-CLIP (Ni et al. 2022) added cross-frame attention to better aggregate temporal context when matching video to text.

InternVideo (Wang et al. 2022) from OpenGVLab and Shanghai AI Lab combined masked video modeling with video-language contrastive learning in a unified training framework. InternVideo2 (Wang et al., ECCV 2024) scaled the video encoder to 6 billion parameters and introduced a progressive training strategy that unifies masked video token reconstruction, crossmodal contrastive learning, and next-token prediction across over 400 million training samples. InternVideo2-6B achieved 92.1 percent top-1 on Kinetics-400 using only 16 input frames, surpassing previous approaches that relied on larger resolution or model ensembles, and validated performance on more than 60 video and audio understanding tasks.[^iv2]

VideoPrism (Zhao et al., ICML 2024) from Google DeepMind took a different approach to building a foundational video encoder. Pre-trained on 36 million high-quality video-caption pairs plus 582 million clips with noisy text such as ASR transcripts, VideoPrism uses a global-local distillation of semantic video embeddings combined with a token shuffling scheme to improve over standard masked autoencoding. As a single frozen model without task-specific fine-tuning, VideoPrism reached state-of-the-art results on 31 of 33 video understanding benchmarks evaluated, spanning classification, video-text retrieval, video captioning, and video question answering.[^videoprism]

Video-LLaMA (Zhang et al., June 2023), Video-LLaVA (Lin et al., November 2023), and VideoLLaMA 2/3 (2024-2025) wired video encoders into open-source large language models, enabling instruction-following dialogue about video content. Proprietary systems such as Gemini 1.5 Pro (with a 1M token context window capable of processing about one hour of video) and open-weight models such as Qwen2-VL extended classification to general video reasoning and long-context video understanding.

Architectures

Family	Representative models	Key idea	Typical use
Two-stream	Simonyan & Zisserman 2014, TSN	Separate RGB and optical flow CNNs	Strong on small datasets
3D CNN	C3D, I3D, R(2+1)D, X3D, CSN	Spatiotemporal convolution	General-purpose recognition
Slow-fast	SlowFast	Two pathways at different frame rates	Action recognition and detection
Efficient 2D plus shift	TSM, TANet	Channel shifts inside 2D backbone	Mobile and real-time
Video transformer	TimeSformer, ViViT, Video Swin, MViTv2	Self-attention over space-time tokens	Modern high-accuracy systems
Hybrid CNN-transformer	UniFormer, UniFormerV2	Local convolution at low levels, global attention at high	Balanced compute
Self-supervised	VideoMAE, VideoMAE V2, V-JEPA, V-JEPA 2	Masked prediction in pixel or latent space	Pre-training without labels
Video-language / foundation	InternVideo2, VideoPrism, Video-LLaMA, Video-LLaVA	Joint video and text representation	Zero-shot classification, captioning, VQA

Two-stream networks

Two-stream CNNs process two complementary input modalities in parallel. The spatial stream ingests single RGB frames and learns to recognize scene context and object appearance. The temporal stream ingests a stack of dense optical flow fields (typically 10 consecutive fields, each with a horizontal and a vertical displacement channel, for 20 input channels) that encode the direction and magnitude of motion. Each stream is a standard image CNN, often based on VGG or ResNet. Predictions are fused via late averaging or learned weighting. The main advantage is that RGB and optical flow capture complementary information; the main disadvantage is that optical flow computation (using methods like TV-L1) is slow and is not optimized end-to-end with the rest of the pipeline. TSN extended two-stream to long videos through sparse temporal sampling across segments.
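
Late fusion of the two streams can be sketched as a weighted average of class probabilities. The logits below are made-up values, and `late_fusion` with its `temporal_weight` parameter is a hypothetical helper; equal weighting corresponds to plain averaging.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of per-class probabilities from the two streams."""
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1 - w) * a + w * b for a, b in zip(p_spatial, p_temporal)]


# Made-up example: appearance weakly favours class 0, motion strongly favours class 1.
spatial = [2.0, 1.5, 0.0]
temporal = [0.0, 3.0, 0.0]
fused = late_fusion(spatial, temporal)
print(fused.index(max(fused)))  # 1
```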

3D convolutional networks

3D CNNs extend standard 2D image convolutions to learn spatiotemporal features jointly. A 3D kernel of size TxHxW slides across both the spatial and temporal dimensions simultaneously, enabling the network to detect motion patterns directly. C3D used uniform 3x3x3 kernels throughout. I3D inflated a 2D backbone by repeating 2D weights along the temporal axis, bootstrapping temporal feature learning from strong spatial priors. R(2+1)D showed that factorizing each 3D convolution into a spatial 2D convolution followed by a temporal 1D convolution doubles the number of nonlinearities, eases optimization, and achieves better accuracy than monolithic 3D kernels at a matched parameter count. X3D demonstrated that starting from a minimal model and progressively expanding one axis at a time (temporal duration, frame rate, spatial resolution, width, bottleneck width, depth) yields a family of models that Pareto-dominate most earlier 3D CNNs on the accuracy-FLOPs trade-off.
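
The parameter accounting behind the (2+1)D factorization can be worked through numerically. The sketch assumes bias-free convolutions and the intermediate channel rule reported in the R(2+1)D paper, which picks the number of intermediate channels so that the factorized block has about as many parameters as the full 3D convolution; the function names are hypothetical.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (no bias)."""
    return n_in * n_out * t * d * d


def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channels chosen to roughly match the full 3D parameter count."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def r2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = r2plus1d_mid_channels(n_in, n_out, t, d)
    return n_in * m * d * d + m * n_out * t


print(conv3d_params(64, 64))          # 110592
print(r2plus1d_mid_channels(64, 64))  # 144
print(r2plus1d_params(64, 64))        # 110592
```

With the budget held equal, the factorized block still gains an extra nonlinearity between the spatial and temporal convolutions.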

Video transformers

Video transformers apply the self-attention mechanism from NLP transformers to video tokens. A video is divided into non-overlapping patches, either 2D patches taken frame by frame or 3D spatiotemporal tubelets, and each patch is linearly projected into an embedding. The key design challenge is that the number of tokens grows as the product of spatial patches and temporal positions, and the cost of full joint attention grows with the square of that count, making it computationally prohibitive for long videos.
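
A quick calculation shows how fast the token count grows. The sketch assumes tubelets of 2 frames by 16x16 pixels and 224x224 input, which are common but not universal settings; `num_tokens` is a hypothetical helper.

```python
def num_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Number of non-overlapping tubelet tokens in one clip."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // t) * (height // h) * (width // w)


short_clip = num_tokens(32, 224, 224)
long_clip = num_tokens(64, 224, 224)
print(short_clip, long_clip)            # 3136 6272
print(long_clip ** 2 // short_clip ** 2)  # 4: doubling the frames quadruples attention pairs
```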

The main strategies to manage this are: divided/factorized attention (TimeSformer, ViViT), which alternates between temporal and spatial attention; local windowed attention (Video Swin, UniFormer), which restricts attention to a 3D neighborhood; hierarchical pooling attention (MViT/MViTv2), which progressively reduces the number of tokens; and hybrid approaches (UniFormer) that use convolution for local and attention for global stages. All video transformers benefit substantially from pre-training on large image datasets (ImageNet) or video datasets (Kinetics, HowTo100M) before task-specific fine-tuning.

Notable models

Model	Date	Organization	K-400 top-1	Notes
Two-stream CNN	Jun 2014	University of Oxford	n/a	RGB plus optical flow branches
C3D	Dec 2014	Facebook AI, Dartmouth, NYU	n/a (Sports-1M era)	First widely used 3D CNN for video
I3D	May 2017	DeepMind	~72%	Inflated 2D weights; introduced Kinetics-400
R(2+1)D	Nov 2017	Facebook AI	~73%	Factorized spatial and temporal convolution
TSM	Nov 2018	MIT-IBM Watson AI Lab	~74%	Zero-cost temporal shift inside 2D ResNet
SlowFast	Dec 2018	Facebook AI Research	~79%	Dual-pathway design with fast and slow streams
X3D	Apr 2020	Facebook AI Research	~79-80%	Family of efficient progressively expanded networks
TimeSformer	Feb 2021	Facebook AI Research	80.7%	First pure transformer for video classification
ViViT	Mar 2021	Google Research	~80%	Factorized space-time attention; ICCV 2021
Video Swin	Jun 2021	Microsoft Research Asia	84.9%	Hierarchical shifted spatiotemporal windows
MViTv2	Dec 2021	Facebook AI Research	86.1%	Improved multiscale vision transformer; CVPR 2022
UniFormer	Jan 2022	SenseTime, CUHK	82.9%	Hybrid CNN and transformer; ICLR 2022
VideoMAE	Mar 2022	Nanjing University, Tencent	87.4%	Masked autoencoder with 90 to 95 percent masking; NeurIPS 2022
InternVideo	Dec 2022	OpenGVLab, Shanghai AI Lab	91.1%	Combines masked modeling and video-language contrast
VideoMAE V2	Mar 2023	Tencent, Nanjing University	90.0%	Dual masking; billion-param ViT; CVPR 2023
Video-LLaMA	Jun 2023	Alibaba DAMO	n/a	Audio-visual LLM for video understanding
Video-LLaVA	Nov 2023	Peking University, Tencent	n/a	Unified visual-language model for image and video
VideoPrism	Feb 2024	Google DeepMind	n/a	Frozen encoder; SOTA on 31 of 33 video benchmarks
V-JEPA	Feb 2024	Meta AI	n/a	Joint embedding predictive pre-training on video
InternVideo2	Mar 2024	OpenGVLab	92.1%	6B encoder; ECCV 2024
V-JEPA 2	Jun 2025	Meta AI	n/a	Scaled JEPA; 77.3% SSv2; zero-shot robotic planning

Kinetics-400 benchmark progression

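Top-1 accuracy on Kinetics-400 has served as the primary measure of progress in video classification since 2017. Early papers reported results on the test split, while most later papers report the validation split. The table below shows the historical progression of selected results.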

Year	Model	Top-1 (%)	Key advance
2017	I3D (RGB only)	~71	Inflated 3D from ImageNet
2017	I3D (two-stream)	~74	Added optical flow
2018	SlowFast	~79	Dual temporal pathway
2021	TimeSformer-L	80.7	First pure video transformer
2021	Video Swin-L	84.9	Shifted-window spatiotemporal attention
2022	MViTv2-L	86.1	Pooling attention + relative positional bias
2022	VideoMAE ViT-H	87.4	Self-supervised tube masking
2023	VideoMAE V2 ViT-g	90.0	Dual masking; billion-param ViT
2024	InternVideo2-6B	92.1	6B encoder; multi-objective pre-training

Each major jump corresponds to a new architecture or training paradigm: 3D CNNs establishing the baseline, SlowFast exploiting dual temporal resolution, video transformers unlocking larger scale, and masked pre-training enabling efficient training of billion-parameter models.

Datasets

Dataset	Year	Size	Description
HMDB-51	2011	~7,000 clips, 51 actions	Brown University; mix of movies and YouTube
UCF-101	2012	13,320 clips, 101 actions	UCF; YouTube clips, long-standing benchmark
Sports-1M	2014	1.1M clips, 487 sports	Google; weakly labeled by YouTube tags
ActivityNet	2015	20K untrimmed videos, 200 classes	Trimmed and untrimmed splits
YouTube-8M	2016	8M videos, 4,800+ entities	Google; multi-label, video-level
Kinetics-400	2017	~306K clips, 400 actions	DeepMind; primary modern benchmark; ~10-second clips, 400+ per class
Kinetics-600 / 700	2018 / 2019	~500K / ~650K clips	Expanded Kinetics splits
Something-Something v1/v2	2017	220K clips, 174 templates	Twenty Billion Neurons; fine-grained hand-object actions requiring temporal reasoning
AVA	2017	1.6M annotations	Google; atomic visual actions, spatiotemporal localization
Moments in Time	2018	1M clips, 339 classes	MIT-IBM; 3-second clips, broad action vocabulary
Charades	2016	9,848 clips, 157 classes	AI2; daily indoor activities
EPIC-KITCHENS-100	2020	100 hours egocentric	University of Bristol; cooking with object interactions
HowTo100M	2019	136M clips with narration	Inria; instructional videos for video-language pre-training
Ego4D	2021	3,670 hours egocentric	Meta AI / multiple universities; daily life from first-person perspective

Kinetics

The Kinetics family of datasets from DeepMind is the central modern benchmark for video classification. Kinetics-400 contains approximately 306K video clips across 400 human action classes, each clip approximately 10 seconds long and sourced from YouTube. At least 400 clips are provided per class, spanning both human-object interactions (playing musical instruments, using tools) and human-human interactions (handshakes, fighting). Four versions have been released: Kinetics-400 (2017), Kinetics-600 (2018, ~500K clips, 600 classes), Kinetics-700 (2019, ~650K clips, 700 classes), and Kinetics-700-2020. Pre-training on Kinetics provides strong initialization for transfer to smaller benchmarks like UCF-101 and HMDB-51.

Something-Something

Something-Something (v1: 2017, v2: 2018) was created by Twenty Billion Neurons (later acquired by Qualcomm) and contains over 220K short video clips of people performing hand-object interactions described by 174 caption templates such as "Pushing something from left to right" or "Pretending to pour something out of something." The dataset was specifically designed to require temporal reasoning, not just static appearance recognition. Actions like "Moving something closer to something" and "Moving something away from something" can look nearly identical in a single frame and can only be distinguished by understanding the direction of motion over time. This makes Something-Something a harder test of temporal modeling than Kinetics, and models that rely heavily on scene statistics (background shortcuts) perform poorly on it. Something-Something v2 top-1 accuracy is a standard secondary benchmark; as of 2025, VideoMAE V2 reports 77.0 percent and V-JEPA 2 reports 77.3 percent.

AVA

The Atomic Visual Actions (AVA) dataset from Google annotates 15-minute clips from 430 movies with 1.6 million spatiotemporal action annotations. Each annotation consists of a bounding box around an actor on a keyframe (keyframes are sampled once per second) and one or more of 80 atomic action labels such as "sit", "walk", or "talk to". AVA tests spatiotemporal action detection rather than clip-level classification, requiring models to both localize actors and classify their actions. Performance is measured by mean average precision (mAP), with a detection counted as correct when its box overlaps a ground-truth box at an IoU of at least 0.5.
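
The IoU 0.5 matching criterion can be illustrated with a short sketch. The box format (x1, y1, x2, y2), the function names, and the example boxes are assumptions; a full mAP computation also ranks detections by confidence score and averages precision over classes, which is not shown here.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_labels, threshold=0.5):
    """A detection counts if its label is annotated and its box overlaps enough."""
    return pred_label in gt_labels and box_iou(pred_box, gt_box) >= threshold


gt_box, gt_labels = (2, 0, 12, 10), {"sit", "talk to"}
print(box_iou((0, 0, 10, 10), gt_box))                            # 0.666...
print(is_true_positive((0, 0, 10, 10), "sit", gt_box, gt_labels))  # True
print(is_true_positive((0, 0, 10, 10), "walk", gt_box, gt_labels)) # False
```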

Tasks beyond classification

The core task has spawned related problems. Temporal action localization predicts start and end times of each action in an untrimmed video (ActivityNet, THUMOS14). Methods fall into three paradigms: anchor-based approaches propose candidate temporal windows and classify them; boundary-based approaches detect action start and end boundaries separately; and query-based approaches use learned queries to directly regress temporal extents, similar to DETR for object detection. A key challenge is that classifiers trained for clip-level recognition often fail on temporal localization because they are not sensitive to action boundaries, an issue known as the training task discrepancy problem.

Action segmentation densely labels every frame of an untrimmed video, used in cooking and procedure understanding. Spatiotemporal action detection (AVA, JHMDB) requires bounding boxes around the actor. Video question answering (TGIF-QA, MSR-VTT-QA, NExT-QA, PerceptionTest) measures higher-level reasoning about video content. Video captioning (MSR-VTT, MSVD, VATEX) and video-text retrieval evaluate alignment with natural-language descriptions. Action anticipation predicts future actions before they occur (Epic-Kitchens-100, forecast split). Foundation models are typically evaluated across this entire suite rather than on classification alone.

Evaluation metrics

Classification accuracy is reported as top-1 and top-5 on held-out clips (Kinetics, UCF-101, HMDB-51, Something-Something). Many systems use multi-view evaluation, sampling several temporal clips and spatial crops from each video and averaging their softmax outputs; a common protocol is 3 spatial crops times 10 temporal clips. Detection benchmarks use mean average precision (mAP) at IoU thresholds of 0.5 for AVA or 0.5 to 0.95 for ActivityNet. Captioning uses BLEU, METEOR, ROUGE-L, and CIDEr; retrieval uses recall at K (R@1, R@5, R@10). Action anticipation uses recall-at-5 on Epic-Kitchens-100. For video-language foundation models, suites such as Perception Test, MVBench, and VideoMME combine many subtasks into a single score.
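
Multi-view averaging and top-k accuracy can be sketched as follows. The scores are made-up values and the helper names are hypothetical; with the 3 crops times 10 clips protocol, `view_scores` would hold 30 rows per video instead of the 3 used here.

```python
def average_views(view_scores):
    """Average per-class scores over all views of one video."""
    n = len(view_scores)
    return [sum(col) / n for col in zip(*view_scores)]


def top_k_correct(scores, label, k):
    """True if the label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def top_k_accuracy(video_scores, labels, k=1):
    """Fraction of videos whose label is in the top k predictions."""
    hits = sum(top_k_correct(s, y, k) for s, y in zip(video_scores, labels))
    return hits / len(labels)


views = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]  # 3 views, 3 classes
video = average_views(views)              # class 1 wins after averaging
print(top_k_accuracy([video, video], [1, 0], k=1))  # 0.5
print(top_k_accuracy([video, video], [1, 0], k=2))  # 1.0
```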

Open-source ecosystem

Several toolkits and libraries support research and deployment of video classification models.

MMAction2 from OpenMMLab is the primary open-source toolkit for video understanding research. It implements most major architectures including TSN, TSM, I3D, SlowFast, R(2+1)D, X3D, TimeSformer, Video Swin, MViTv2, VideoMAE, and UniFormer, with a unified configuration-driven interface and support for Kinetics, Something-Something, and AVA training.

PyTorchVideo from Facebook AI Research provides video datasets, transforms, and model implementations including SlowFast and X3D optimized for production use.

Hugging Face Transformers includes VideoMAE, TimeSformer, and ViViT via the AutoModel interface, and the Hugging Face Hub hosts pretrained checkpoints for most major video transformer models.

Decord is a fast video reader optimized for deep learning pipelines, supporting random temporal access and on-the-fly decoding without reading full video files into memory.

FFmpeg and OpenCV handle video decoding in most research pipelines, and torchvision includes basic video reading and clip sampling utilities.

Applications

Video classification underpins many products. Content moderation pipelines at YouTube, TikTok, and other platforms flag violence, sexual content, or copyrighted material. Sports analytics companies tag plays to compile statistics and highlight reels. Surveillance systems detect falls, fights, or abandoned objects. In autonomous driving, classifiers spot drowsiness, distraction, or gestures inside the vehicle cabin. Home robotics and warehouse automation use action labels for imitation learning, with V-JEPA 2 demonstrating that video representation learning can directly enable robotic manipulation planning. Sign-language translation systems classify continuous gestures into text. Video search engines index clips for natural-language retrieval, and traffic systems classify vehicle movements at intersections. Medical monitoring systems detect patient falls, medication administration, or abnormal behaviors in hospital settings. Fitness applications analyze exercise form and count repetitions using action recognition models running on device cameras.

Limitations

Several challenges remain open.

Long-video understanding beyond about one minute is still hard because the cost of attention or convolution grows with temporal extent, and most architectures process clips of 8 to 64 frames. MovieChat, LongVLM, and LongVU tackle parts of this with memory banks or hierarchical processing, but full-movie reasoning is far from solved.

Computational cost is a persistent burden because videos are orders of magnitude larger than images. A 10-second clip at 30 fps produces 300 frames. 3D CNNs and video transformers processing many frames require tens to hundreds of billions of FLOPs per clip, making deployment on edge devices difficult. Efficient approaches like TSM and X3D partially address this trade-off.
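
The raw data volume is easy to estimate. The sketch below assumes uncompressed 8-bit RGB frames at 224x224, a typical network input size rather than a property of any particular dataset; `raw_clip_bytes` is a hypothetical helper.

```python
def raw_clip_bytes(seconds, fps, height, width, channels=3, bytes_per_value=1):
    """Frame count and uncompressed size in bytes of one decoded clip."""
    frames = int(seconds * fps)
    return frames, frames * height * width * channels * bytes_per_value


frames, size = raw_clip_bytes(seconds=10, fps=30, height=224, width=224)
print(frames)  # 300
print(size)    # 45158400 bytes, versus 150528 for a single 224x224 image
```

This is why most models subsample a clip to a few dozen frames before any computation takes place.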

Spurious correlations and shortcut learning are pervasive: a model may label a clip as swimming because of the blue pool background rather than because of the motion, or classify tennis because of the court. Benchmarks like Something-Something deliberately remove static cues to test genuine temporal reasoning. Models trained primarily on Kinetics often exploit background context rather than motion.

Label noise in web-scraped datasets like Kinetics reduces sample efficiency and requires careful data curation.

Generalization to unseen camera angles and first-person (egocentric) video is a weak point of models trained on third-person YouTube data, motivating datasets such as Ego4D and EPIC-KITCHENS-100 that capture first-person perspectives in uncontrolled environments.

Optical flow dependency in two-stream networks requires slow preprocessing that cannot be fully parallelized on GPU, and dense optical flow extraction introduces latency that makes real-time two-stream deployment difficult. TSM and 3D CNN approaches avoid this bottleneck entirely.

Benchmark accuracy often fails to predict performance in deployment, leaving engineers without good public proxies for production metrics, particularly for domain-specific actions that are poorly represented in Kinetics or Something-Something.

Adversarial brittleness is a further concern: like other neural networks, video classifiers are susceptible to small, often imperceptible perturbations that can change the predicted label, which matters for safety-critical applications.
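The mechanism can be illustrated with the fast gradient sign method (FGSM), a standard attack that is used here purely as an example and is not attributed to any work cited in this article. The sketch assumes a toy linear binary classifier whose decision is the sign of a weighted sum; the weights, input, and epsilon are made-up values.

```python
from typing import List


def score(weights: List[float], x: List[float]) -> float:
    """Linear decision score; positive means class A, negative means class B."""
    return sum(w * xi for w, xi in zip(weights, x))


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_linear(weights: List[float], x: List[float],
                epsilon: float) -> List[float]:
    """Shift every input value by at most epsilon toward the opposite class.

    For a linear score the gradient with respect to x is just the weights,
    so the sign of each weight gives the most damaging direction.
    """
    direction = -1.0 if score(weights, x) > 0 else 1.0
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [0.5, -0.25, 1.0, -0.75]  # illustrative values
x = [0.2, 0.1, 0.1, 0.2]
epsilon = 0.05
x_adv = fgsm_linear(weights, x, epsilon)
```

Here no input value moves by more than 0.05, yet the score changes sign. In a real video model the same idea is applied to pixel values using gradients obtained by backpropagation, and the high dimensionality of video inputs leaves many coordinates to perturb.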

References

Simonyan, K. and Zisserman, A. (2014). "Two-Stream Convolutional Networks for Action Recognition in Videos." arXiv:1406.2199. https://arxiv.org/abs/1406.2199
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2014). "Learning Spatiotemporal Features with 3D Convolutional Networks" (C3D). arXiv:1412.0767. https://arxiv.org/abs/1412.0767
Carreira, J. and Zisserman, A. (2017). "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" (I3D). arXiv:1705.07750. https://arxiv.org/abs/1705.07750
Kay, W. et al. (2017). "The Kinetics Human Action Video Dataset." arXiv:1705.06950. https://arxiv.org/abs/1705.06950
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2017). "A Closer Look at Spatiotemporal Convolutions for Action Recognition" (R(2+1)D). arXiv:1711.11248. https://arxiv.org/abs/1711.11248
Lin, J., Gan, C., and Han, S. (2018). "TSM: Temporal Shift Module for Efficient Video Understanding." arXiv:1811.08383. https://arxiv.org/abs/1811.08383
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2018). "SlowFast Networks for Video Recognition." arXiv:1812.03982. https://arxiv.org/abs/1812.03982
Feichtenhofer, C. (2020). "X3D: Expanding Architectures for Efficient Video Recognition." CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Feichtenhofer_X3D_Expanding_Architectures_for_Efficient_Video_Recognition_CVPR_2020_paper.pdf
Bertasius, G., Wang, H., and Torresani, L. (2021). "Is Space-Time Attention All You Need for Video Understanding?" (TimeSformer). ICML 2021. arXiv:2102.05095. https://arxiv.org/abs/2102.05095
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., and Schmid, C. (2021). "ViViT: A Video Vision Transformer." ICCV 2021. arXiv:2103.15691. https://arxiv.org/abs/2103.15691
Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2021). "Video Swin Transformer." arXiv:2106.13230. https://arxiv.org/abs/2106.13230
Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., and Feichtenhofer, C. (2021). "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection." CVPR 2022. arXiv:2112.01526. https://arxiv.org/abs/2112.01526
Tong, Z., Song, Y., Wang, J., and Wang, L. (2022). "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS 2022. arXiv:2203.12602. https://arxiv.org/abs/2203.12602
Soomro, K., Zamir, A. R., and Shah, M. (2012). "UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild." arXiv:1212.0402. https://arxiv.org/abs/1212.0402
Goyal, R. et al. (2017). "The 'Something Something' Video Database for Learning and Evaluating Visual Common Sense." arXiv:1706.04261. https://arxiv.org/abs/1706.04261
Zhang, H., Li, X., and Bing, L. (2023). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding." arXiv:2306.02858. https://arxiv.org/abs/2306.02858
Bardes, A. et al. (2024). "Revisiting Feature Prediction for Learning Visual Representations from Video" (V-JEPA). Meta AI. https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., and Qiao, Y. (2022). "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning." ICLR 2022. arXiv:2201.04676. https://arxiv.org/abs/2201.04676
Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., and Qiao, Y. (2022). "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer." arXiv:2211.09552. https://arxiv.org/abs/2211.09552
Wang, L. et al. (2016). "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition." ECCV 2016. arXiv:1608.00859. https://arxiv.org/abs/1608.00859
Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., Schmid, C., and Malik, J. (2018). "AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions." CVPR 2018. arXiv:1705.08421. https://arxiv.org/abs/1705.08421
Grauman, K. et al. (2022). "Ego4D: Around the World in 3,000 Hours of Egocentric Video." CVPR 2022. arXiv:2110.07058. https://arxiv.org/abs/2110.07058
