V-JEPA
Last reviewed
May 16, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,906 words
V-JEPA (Video Joint Embedding Predictive Architecture) is a self-supervised video representation model released by Meta AI on February 15, 2024. It was the first published video instance of the Joint Embedding Predictive Architecture paradigm proposed by Yann LeCun in 2022 and the immediate precursor to V-JEPA 2, which extended the approach to world modeling and zero-shot robot control in June 2025. V-JEPA is trained on two million publicly available videos using a feature prediction objective in latent space, without negative samples, text supervision, pretrained image encoders, or pixel-level reconstruction. The model produces general purpose vision transformer backbones that, when frozen, transfer to motion oriented and appearance oriented downstream tasks through lightweight attentive probes.
The model was introduced through the arXiv preprint Revisiting Feature Prediction for Learning Visual Representations from Video (arXiv:2404.08471) by Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Meta released three pretrained checkpoints, the training and evaluation code, and the masked latent prediction recipe through the facebookresearch/jepa GitHub repository under a Creative Commons Attribution NonCommercial 4.0 license. The model was framed by Meta and by LeCun in particular as a concrete step toward learning world models from passive observation of video, an objective LeCun has argued is a prerequisite for autonomous machine intelligence.
V-JEPA is important for three reasons. First, it provided the strongest public evidence at the time that feature prediction alone, with no pixel decoder and no contrastive auxiliary loss, could match or exceed pixel reconstruction based video pretraining at scale. Second, it demonstrated that a single frozen visual backbone could perform competitively across both motion heavy benchmarks such as Something-Something v2 and appearance heavy benchmarks such as Kinetics-400 and ImageNet, suggesting that latent prediction captures more than just shortcut features. Third, by providing a working video implementation of LeCun's JEPA recipe, it set the architectural template that V-JEPA 2 later scaled to over one million hours of video and used as the foundation for downstream robot planning.
The Joint Embedding Predictive Architecture was introduced by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. In that paper LeCun argued that current generative approaches to self-supervised learning, which reconstruct masked pixels or tokens, expend too much capacity modeling fine perceptual detail that is irrelevant for downstream reasoning. He proposed instead that prediction should occur in an abstract embedding space produced by a learned encoder, so that the system can discard unpredictable or perceptually unimportant content. A JEPA model contains three pieces. An encoder maps an input into a representation. A target encoder, typically an exponential moving average of the online encoder, produces the prediction targets. A predictor maps from one representation to another, optionally conditioned on context tokens that describe what is being predicted from what.
The first concrete instance of the JEPA paradigm was I-JEPA, an image only model published by Assran and collaborators at CVPR 2023. I-JEPA masks large contiguous blocks of an image and trains a predictor to recover their representations from a context view. V-JEPA generalizes the same idea to video by treating short clips as three dimensional tensors of patches and masking large regions that extend across both space and time. The choice to make V-JEPA non generative is deliberate. The authors view it as a test of LeCun's hypothesis that latent prediction can be a self-sufficient pretraining signal for perception, without the regularization that pixel reconstruction implicitly provides.
V-JEPA inherits the encoder-target encoder-predictor triplet from I-JEPA and adapts it to video. The encoder is a standard vision transformer with three dimensional patches of size 2x16x16, meaning each patch spans two frames and a sixteen by sixteen pixel region. A short video clip of sixteen frames at 224 by 224 resolution is therefore tokenized into a sequence of approximately one thousand five hundred patches, which the encoder processes with standard self attention. The predictor is a narrower transformer with twelve blocks and an embedding dimension of 384. It receives the context tokens emitted by the encoder together with learnable mask tokens that identify the spatiotemporal positions being predicted, and outputs predicted representations at those positions.
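The token count quoted above follows directly from the patch size. The short sketch below works through the arithmetic; the helper name `num_tokens` is an illustrative assumption, not part of the released code.

```python
def num_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Count the 3D patches (tubelets) that a clip is split into."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the patch size")
    return (frames // t) * (height // h) * (width // w)


tokens_224 = num_tokens(16, 224, 224)  # 8 * 14 * 14
tokens_384 = num_tokens(16, 384, 384)  # 8 * 24 * 24
print(tokens_224, tokens_384)          # 1568 4608
```

A sixteen frame clip at 224 by 224 therefore yields 1,568 tokens, and the same clip at 384 by 384 yields 4,608.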
The pretraining objective is an L1 loss between the predicted representations and the representations produced by a target encoder applied to the full clip. The target encoder shares architecture with the online encoder and is updated as an exponential moving average of its weights, preventing representational collapse without contrastive negatives. No reconstruction decoder is attached. No labels are used. No pretrained image features are imported.
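The objective and the target encoder update described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Meta's implementation: it uses a plain L1 distance (a smooth L1 variant would change only the per element penalty), and the function names `l1_loss` and `ema_update`, the toy values, and the momentum of 0.5 are assumptions chosen to keep the arithmetic easy to follow.

```python
def l1_loss(pred, target):
    """Mean absolute error between predicted and target feature vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_params, online_params, momentum):
    """Move each target weight a small step toward the online weight."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


# Two masked positions, each with a 2-dimensional feature (toy values).
predicted = [[0.5, -1.0], [2.0, 0.0]]
targets = [[0.0, -1.0], [1.0, 1.0]]   # treated as constants: no gradient flows here
loss = l1_loss(predicted, targets)    # (0.5 + 0.0 + 1.0 + 1.0) / 4 = 0.625

# After the online encoder takes an optimizer step, the target encoder follows it.
target_weights = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.5)  # [0.5, 3.0]
```

In practice the momentum is close to one, so the target encoder changes much more slowly than the online encoder.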
Two masking strategies are applied simultaneously during pretraining. A short range mask hides a contiguous spatial region across all frames, forcing the predictor to infer occluded content from temporal context. A long range mask hides a larger fraction of patches over the entire clip, forcing the encoder to retain enough global information to be predicted from sparse evidence. Roughly ninety percent of patches are typically masked across the two strategies combined, an aggressive ratio that the authors found necessary to prevent shortcut learning where the predictor could simply interpolate from nearby visible patches.
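One way to picture the two masks is as unions of rectangular spatial blocks that are repeated at every time step of the token grid. The sketch below is an illustration under stated assumptions, not the released sampler: the function names, the number of blocks, and the area fractions (8 blocks at 15 percent for the short range mask, 2 blocks at 70 percent for the long range mask) are hypothetical values picked for the example.

```python
import random


def sample_block(grid_h, grid_w, area_fraction, rng):
    """Pick one rectangular block covering about area_fraction of a frame."""
    side_h = max(1, round(grid_h * area_fraction ** 0.5))
    side_w = max(1, round(grid_w * area_fraction ** 0.5))
    top = rng.randint(0, grid_h - side_h)
    left = rng.randint(0, grid_w - side_w)
    return {(y, x) for y in range(top, top + side_h)
            for x in range(left, left + side_w)}


def spatiotemporal_mask(grid_t, grid_h, grid_w, n_blocks, area_fraction, rng):
    """Union of spatial blocks, repeated at every time step of the clip."""
    spatial = set()
    for _ in range(n_blocks):
        spatial |= sample_block(grid_h, grid_w, area_fraction, rng)
    return {(t, y, x) for t in range(grid_t) for (y, x) in spatial}


rng = random.Random(0)
# Token grid for a 16 x 224 x 224 clip with 2 x 16 x 16 patches.
T, H, W = 8, 14, 14
short_range = spatiotemporal_mask(T, H, W, n_blocks=8, area_fraction=0.15, rng=rng)
long_range = spatiotemporal_mask(T, H, W, n_blocks=2, area_fraction=0.70, rng=rng)
masked = short_range | long_range
ratio = len(masked) / (T * H * W)
print(f"masked fraction: {ratio:.2f}")
```

Because every block spans all time steps, a hidden patch cannot be recovered by copying the same location from a neighboring frame, which is the shortcut the aggressive masking ratio is meant to close off.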
V-JEPA models were pretrained on a curated mixture the authors named VideoMix2M, which combines approximately two million public videos. The mixture includes HowTo100M, the Kinetics-400 and Kinetics-700 training splits, and the Something-Something v2 training split, with no labels used at any point. Clips are sampled at 5.33 frames per second to produce three second windows of sixteen frames. The dataset is therefore roughly two orders of magnitude smaller than the corpus later assembled for V-JEPA 2 and was small enough to fit within Meta's standard research compute budgets in 2023 and early 2024.
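The relationship between sampling rate, frame count, and clip length is simple arithmetic, sketched below. The helper names and the native frame rate of 16 frames per second are assumptions made only for the example; real source videos vary.

```python
def clip_duration_seconds(num_frames, sample_fps):
    """Length of the time window covered by a sampled clip."""
    return num_frames / sample_fps


def frame_indices(start, num_frames, native_fps, sample_fps):
    """Indices of the native frames to keep when subsampling a video."""
    stride = max(1, round(native_fps / sample_fps))
    return [start + i * stride for i in range(num_frames)]


duration = clip_duration_seconds(16, 5.33)   # about 3.0 seconds
# Hypothetical 16 fps source video: keep every third frame.
indices = frame_indices(0, 16, native_fps=16, sample_fps=5.33)
print(round(duration, 2), indices[:4])       # 3.0 [0, 3, 6, 9]
```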
All three released models were trained for 90,000 iterations. The ViT-L/16 and ViT-H/16 variants used a batch size of 3,072 clips, while the higher resolution ViT-H/16-384 variant used a batch size of 2,400 to fit in memory. Training uses the AdamW optimizer with cosine learning rate decay and weight decay scheduling, mixed precision arithmetic, and the standard exponential moving average teacher schedule that ramps from 0.998 to one over the course of training. No data augmentation beyond standard random cropping and horizontal flipping is applied.
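The two schedules mentioned above can be written as small functions of the training step. This is a sketch under assumptions: the text states only the endpoints of the momentum ramp, so the linear shape is assumed here, and the peak learning rate of 1e-3, the absence of warmup, and the function names are hypothetical.

```python
import math


def ema_momentum(step, total_steps, start=0.998, end=1.0):
    """Target-encoder momentum, ramped linearly (assumed shape) over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def cosine_lr(step, total_steps, peak_lr, final_lr=0.0):
    """Cosine decay from peak_lr to final_lr (warmup omitted for brevity)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * frac))


TOTAL = 90_000  # iterations used for all three released models
for step in (0, TOTAL // 2, TOTAL):
    print(step, round(ema_momentum(step, TOTAL), 4), cosine_lr(step, TOTAL, peak_lr=1e-3))
```

As the momentum approaches one, the target encoder effectively freezes, so the prediction targets stabilize toward the end of training.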
Meta released three V-JEPA checkpoints. All share the same patch size and pretraining recipe and differ only in encoder size and input resolution.
| Variant | Encoder | Patch size | Resolution | Frames | Iterations | Batch size |
|---|---|---|---|---|---|---|
| V-JEPA ViT-L/16 | ViT-Large | 2x16x16 | 224x224 | 16 | 90,000 | 3,072 |
| V-JEPA ViT-H/16 | ViT-Huge | 2x16x16 | 224x224 | 16 | 90,000 | 3,072 |
| V-JEPA ViT-H/16-384 | ViT-Huge | 2x16x16 | 384x384 | 16 | 90,000 | 2,400 |
The ViT-L/16 backbone has 24 transformer blocks, a 1,024 dimensional embedding, and 16 attention heads. The ViT-H/16 backbone has 32 blocks, a 1,280 dimensional embedding, and 16 heads. Parameter counts for the encoders are roughly 300 million and 630 million respectively. The predictor is the same compact transformer in all three configurations. The 384 resolution variant takes the same ViT-H/16 encoder but processes larger input clips, which roughly triples the patch count (from 1,568 to 4,608 tokens per clip) and raises the per step compute accordingly. Community ports of the checkpoints have appeared on Hugging Face since release, with the official artifacts distributed through the facebookresearch/jepa repository.
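The quoted parameter counts can be sanity checked with a common back-of-the-envelope estimate for transformer encoders. The sketch assumes a standard block with an MLP expansion ratio of 4 and ignores biases, layer norms, and the patch and position embeddings; the function name is illustrative.

```python
def approx_vit_params(depth, width, mlp_ratio=4):
    """Rough weight count for a stack of standard transformer blocks."""
    attention = 4 * width * width        # query, key, value, and output projections
    mlp = 2 * mlp_ratio * width * width  # two linear layers of the feed-forward block
    return depth * (attention + mlp)


vit_l = approx_vit_params(depth=24, width=1024)
vit_h = approx_vit_params(depth=32, width=1280)
print(f"{vit_l:,} {vit_h:,}")  # 301,989,888 629,145,600
```

Both estimates land close to the rounded figures of 300 million and 630 million given above.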
The V-JEPA paper emphasizes frozen feature evaluation, sometimes called probing. In this protocol the pretrained encoder weights are not updated. Instead, a small attentive probe consisting of a cross attention layer over patch features followed by a linear classifier is trained for each downstream task. This protocol is more demanding than full finetuning because the encoder cannot adapt to task specific statistics, and it is widely regarded as a cleaner measure of the quality of the underlying representation. The authors argue that frozen evaluation is also more practical for foundation models, since a single backbone can be deployed across many tasks without copying or finetuning the weights.
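The idea behind an attentive probe can be illustrated with a deliberately simplified version: a single learned query attends over the frozen patch features, and a linear layer classifies the pooled vector. The sketch below omits the key and value projections, multiple heads, and normalization that a real probe would likely include, and all names, dimensions, and random values are assumptions for the example.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attentive_probe(patch_features, query, class_weights):
    """Pool frozen patch features with one learned query, then classify."""
    dim = len(query)
    # Cross attention: the query scores every patch feature.
    scores = [dot(query, feat) / math.sqrt(dim) for feat in patch_features]
    weights = softmax(scores)
    pooled = [sum(w * feat[i] for w, feat in zip(weights, patch_features))
              for i in range(dim)]
    # Linear classifier on the pooled vector.
    logits = [dot(row, pooled) for row in class_weights]
    return logits, weights


rng = random.Random(0)
DIM, TOKENS, CLASSES = 8, 1568, 3  # toy feature width, tokens per clip, toy class count
features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS)]  # frozen encoder output
query = [rng.gauss(0, 1) for _ in range(DIM)]                              # trainable
classifier = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CLASSES)]  # trainable
logits, attention = attentive_probe(features, query, classifier)
```

Only the query and the classifier would receive gradient updates; the patch features come from the frozen encoder and stay fixed.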
The benchmark suite covers three video tasks and three image classification tasks. The video tasks are Kinetics-400 (appearance heavy) and Something-Something v2 (motion heavy) for action classification, plus AVA for spatiotemporal action detection. The image tasks are ImageNet-1K, Places205, and iNaturalist 2021, and are evaluated by repeating a single image along the temporal axis to form a static clip. The same probe architecture is used across all tasks so that differences in score reflect differences in the representation rather than the probe.
The headline results reported by Bardes and collaborators are summarized in the table below. All numbers are top-1 accuracy under the frozen attentive probe protocol.
| Benchmark | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| Kinetics-400 | 80.8 | 82.0 | 81.9 |
| Something-Something v2 | 69.5 | 71.4 | 72.2 |
| ImageNet-1K | 74.8 | 75.9 | 77.4 |
| Places205 | 60.3 | 61.7 | 62.8 |
| iNaturalist 2021 | 67.8 | 67.9 | 72.6 |
On Kinetics-400, V-JEPA matched or beat the best published pixel reconstruction models at the time, including VideoMAE and OmniMAE, while using a frozen rather than finetuned backbone. The gap widened on Something-Something v2, a benchmark designed to test fine grained motion understanding that does not reward appearance shortcuts. V-JEPA ViT-H/16-384 reached 72.2 percent there, several points above the strongest pixel reconstruction baselines under matched evaluation protocols. The authors interpret this as direct support for LeCun's argument that latent prediction encourages the encoder to retain motion semantics that pixel decoders tend to discard or absorb into the decoder.
On image classification the picture was more nuanced. V-JEPA outperformed comparably sized video pretraining baselines on ImageNet, Places205, and iNaturalist, but trailed image only foundation models such as DINOv2 that were trained on much larger curated image corpora. The authors did not present V-JEPA as a replacement for image foundation models. They presented it as evidence that a single video pretrained backbone could be competitive across image and video tasks simultaneously, which is a property that no previous video model had demonstrated with frozen weights.
The paper also reports a sample efficiency comparison. Compared against VideoMAE, OmniMAE, and Hiera, V-JEPA reached comparable downstream accuracy with between 1.5 and 6 times fewer training iterations, depending on the baseline. This is consistent with the broader pattern observed across JEPA style methods, in which avoiding pixel reconstruction lets the model spend more of its capacity on representation quality rather than texture modeling.
V-JEPA occupies a specific position in the landscape of self-supervised video pretraining. The table below contrasts it with its closest siblings.
| Property | I-JEPA | V-JEPA | V-JEPA 2 |
|---|---|---|---|
| Released | June 2023 | February 2024 | June 2025 |
| Input modality | Single images | Sixteen frame clips | Sixty four frame clips |
| Training data | ImageNet-1K and -22K | VideoMix2M (~2M videos) | ~1M hours of internet video plus ~1M images |
| Largest encoder | ViT-H/14 (~630M params) | ViT-H/16 (~630M params) | ViT-g (~1B params) |
| Prediction target | Latent representation | Latent representation | Latent representation |
| Downstream evaluation | ImageNet probing | Frozen probes on K400, SSv2, ImageNet, AVA | Frozen probes plus action conditioned planning |
| Robot control | None | None | V-JEPA 2-AC variant supports zero-shot pick-and-place |
| License | CC BY-NC 4.0 | CC BY-NC 4.0 | MIT |
Compared to I-JEPA, V-JEPA adds the temporal dimension. The patches span two frames each, the masking strategy is spatiotemporal rather than purely spatial, and the predictor must reason about motion as well as occlusion. Compared to pixel reconstruction approaches such as VideoMAE, OmniMAE, and Hiera, V-JEPA replaces the pixel decoder with latent prediction and discards reconstruction loss entirely. Compared to contrastive methods that learn from pairs of augmented views, V-JEPA does not require negatives and instead relies on the asymmetric encoder-target encoder dynamic and aggressive masking to avoid representational collapse.
Compared to V-JEPA 2, the original V-JEPA is roughly two orders of magnitude smaller in both data and compute. V-JEPA 2 retains the same latent prediction objective and the same vision transformer encoder family, but scales pretraining to a corpus of approximately one million hours of internet video plus around one million images. It also introduces an action conditioned variant called V-JEPA 2-AC, which is post trained on under sixty two hours of teleoperated robot footage from the public DROID dataset, and demonstrates zero-shot pick-and-place on Franka Panda arms. V-JEPA 2 also introduces three new physical reasoning benchmarks (IntPhys 2, Minimal Video Pairs, and Causal VQA) that V-JEPA itself was not designed to address. The continuity between the two models is deliberate. V-JEPA 2 reuses the architectural recipe of V-JEPA largely without modification, and many of the same authors, including Adrien Bardes, Mahmoud Assran, Yann LeCun, Michael Rabbat, and Nicolas Ballas, appear on both papers.
The release of V-JEPA was covered widely in technical media in February and March 2024, and was framed by Meta as the next step on Yann LeCun's roadmap toward advanced machine intelligence after I-JEPA. LeCun used the launch to reiterate his broader argument that generative pixel prediction is the wrong objective for representation learning and that latent prediction in an abstract space is more aligned with how biological systems learn from observation. Within the self-supervised learning research community, V-JEPA was received as the strongest piece of evidence to date for the JEPA hypothesis in video and as a useful frozen backbone for downstream evaluation. The choice to release only frozen probe results rather than fully finetuned numbers drew some criticism from practitioners who wanted to compare against finetuned VideoMAE and similar baselines, but most acknowledged that the frozen protocol was a tighter test of representation quality.
On the practical side, the released checkpoints were adopted as feature extractors in academic work on action recognition, anticipation, and video question answering. Several derivative projects on Hugging Face host community ports of the original ViT-L and ViT-H checkpoints with various probe heads attached. Because the model was released under a non commercial license, deployment in commercial products has been limited. The companion V-JEPA 2 release in June 2025 was issued under the MIT license, removing this restriction for downstream developers.
Within Meta the V-JEPA codebase served as the engineering foundation for the V-JEPA 2 project, which built directly on the same encoder, predictor, and masking implementations in the facebookresearch/jepa repository. The pretraining recipe also influenced subsequent work on physical reasoning and intuitive physics in video, including the Intuitive Physics from Self-Supervised Video paper by Garrido and collaborators that used a V-JEPA style encoder to study violation of expectation behavior on synthetic stimuli.
V-JEPA has several limitations that its successor was explicitly designed to address. The training corpus of two million videos is small by foundation model standards, and the longest clip the model sees during pretraining is three seconds. This limits the temporal extent of motion patterns the encoder can capture and constrains downstream applicability to tasks that fit within a short temporal window. The frozen probe protocol, while useful as a benchmark, does not produce a model that can plan or act. V-JEPA emits representations, not policies, and the original release contains no action conditioning, no goal conditioning, and no facility for closed loop control. The non commercial license also restricts who can build on the model in production settings.
Additionally, although latent prediction avoids the wasted capacity of pixel decoders, it also lacks a natural diagnostic for what the predictor has learned. Researchers cannot inspect a predicted latent the way they can inspect a reconstructed frame. This makes failure analysis harder and was one of the practical reasons V-JEPA 2 introduced new benchmarks focused on physical plausibility and causal reasoning rather than reconstruction quality.