V-JEPA
Last reviewed
Sources
9 citations
Review status
Source-backed
Revision
v2 · 3,642 words
V-JEPA (Video Joint Embedding Predictive Architecture) is a self-supervised video model from Meta AI that learns by predicting masked regions of a video in an abstract latent representation space rather than by reconstructing pixels. The first V-JEPA was released on February 15, 2024, and was the first published video instance of the Joint Embedding Predictive Architecture (JEPA) paradigm proposed by Yann LeCun in 2022. Its successor, V-JEPA 2, released June 11, 2025, scaled the same recipe to roughly 1.2 billion parameters and more than 1 million hours of video, and used it as a world model for zero-shot robot planning. [1][2][6]
In short: V-JEPA is trained on two million publicly available videos using a feature prediction objective in latent space, with no negative samples, no text supervision, no pretrained image encoders, and no pixel-level reconstruction. [1] The model produces general purpose vision transformer backbones that, when frozen, transfer to both motion oriented and appearance oriented downstream tasks through lightweight attentive probes. Meta framed V-JEPA as a concrete step toward learning world models from passive observation of video, an objective LeCun has argued is a prerequisite for autonomous machine intelligence. [2][4]
The model was introduced through the arXiv preprint Revisiting Feature Prediction for Learning Visual Representations from Video (arXiv:2404.08471) by Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. [1] Meta released three pretrained checkpoints, the training and evaluation code, and the masked latent prediction recipe through the facebookresearch/jepa GitHub repository under a Creative Commons Attribution NonCommercial 4.0 license. [3]
V-JEPA is important for three reasons. First, it provided the strongest public evidence at the time that feature prediction alone, with no pixel decoder and no contrastive auxiliary loss, could match or exceed pixel-reconstruction-based video pretraining at scale. [1] Second, it demonstrated that a single frozen visual backbone could perform competitively across both motion heavy benchmarks such as Something-Something v2 and appearance heavy benchmarks such as Kinetics-400 and ImageNet, suggesting that latent prediction captures more than just shortcut features. [1] Third, by providing a working video implementation of LeCun's JEPA recipe, it set the architectural template that V-JEPA 2 later scaled to over one million hours of video and used as the foundation for downstream robot planning. [6]
V-JEPA is a non-generative, self-supervised video representation model that trains an encoder and a predictor to forecast the latent features of masked spatiotemporal regions of a video, producing a frozen backbone that transfers to action recognition, video understanding, and image classification. In Meta's own description, "V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space," and "unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard unpredictable information." [2]
| Attribute | V-JEPA (2024) | V-JEPA 2 (2025) |
|---|---|---|
| Developer | Meta AI (FAIR) | Meta AI (FAIR) |
| Release date | February 15, 2024 | June 11, 2025 |
| Prediction target | Latent representation (not pixels) | Latent representation (not pixels) |
| Training data | ~2 million public videos (VideoMix2M) | >1 million hours of video plus ~1 million images |
| Largest encoder | ViT-H/16 (~630M params) | ViT-g (~1B params), ~1.2B params with predictor |
| Robot control | None | Zero-shot pick-and-place (V-JEPA 2-AC) |
| License | CC BY-NC 4.0 | MIT |
The Joint Embedding Predictive Architecture was introduced by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. [4] In that paper LeCun argued that current generative approaches to self-supervised learning, which reconstruct masked pixels or tokens, expend too much capacity modeling fine perceptual detail that is irrelevant for downstream reasoning. He proposed instead that prediction should occur in an abstract embedding space produced by a learned encoder, so that the system can discard unpredictable or perceptually unimportant content. A JEPA model contains three pieces. An encoder maps an input into a representation. A target encoder, typically an exponential moving average of the online encoder, produces the prediction targets. A predictor maps from one representation to another, optionally conditioned on context tokens that describe what is being predicted from what.
Meta has tied this directly to its long-term research goal. As the company put it at launch, "Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently." [2]
The first concrete instance of the JEPA paradigm was I-JEPA, an image only model published by Assran and collaborators at CVPR 2023. [5] I-JEPA masks large contiguous blocks of an image and trains a predictor to recover their representations from a context view. V-JEPA generalizes the same idea to video by treating short clips as three dimensional tensors of patches and masking large spatiotemporal regions that span both space and time. The choice to make V-JEPA non-generative is deliberate. The authors view it as a test of LeCun's hypothesis that latent prediction can be a self-sufficient pretraining signal for perception, without the regularization that pixel reconstruction implicitly provides. [1]
V-JEPA inherits the triplet of encoder, target encoder, and predictor from I-JEPA and adapts it to video. The encoder is a standard vision transformer with three dimensional patches of size 2x16x16, meaning each patch spans two frames and a sixteen by sixteen pixel region. A short video clip of sixteen frames at 224 by 224 resolution is therefore tokenized into a sequence of 8 x 14 x 14 = 1,568 patches, which the encoder processes with standard self attention. The predictor is a narrower transformer with twelve blocks and an embedding dimension of 384. It receives the context tokens emitted by the encoder together with learnable mask tokens that identify the spatiotemporal positions being predicted, and outputs predicted representations at those positions. [1]
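The token count follows directly from the clip and patch dimensions. The short Python sketch below reproduces that arithmetic; the function name `num_tokens` is illustrative and is not taken from the released code.

```python
def num_tokens(frames, height, width, patch_t=2, patch_h=16, patch_w=16):
    """Count the 3D patches (tokens) in a clip, assuming exact divisibility."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("clip dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# Sixteen frames at 224 x 224 with 2x16x16 patches: 8 * 14 * 14 tokens.
print(num_tokens(16, 224, 224))  # 1568
```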
The pretraining objective is an L1 regression loss between the predicted representations and the representations produced by a target encoder applied to the full clip. The target encoder shares architecture with the online encoder, receives no gradient, and is updated as an exponential moving average of the online weights, which prevents representational collapse without contrastive negatives. No reconstruction decoder is attached. No labels are used. No pretrained image features are imported.
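As a rough illustration of regressing latents rather than pixels, the sketch below computes a plain L1 (mean absolute error) loss between predicted and target token features. It is a simplified stand-in under stated assumptions, not the released training code: features are plain Python lists, the name `l1_latent_loss` is hypothetical, and the target features are simply treated as constants, which is the role the stop-gradient plays in a real framework.

```python
def l1_latent_loss(predicted, target):
    """Mean absolute difference between predicted and target token features.

    predicted, target: lists of token vectors (lists of floats), one entry per
    masked position. The targets are treated as fixed values here.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same tokens")
    total, count = 0.0, 0
    for pred_token, target_token in zip(predicted, target):
        for p, t in zip(pred_token, target_token):
            total += abs(p - t)
            count += 1
    return total / count


# Two masked tokens with two feature dimensions each (toy numbers).
predicted = [[0.5, -1.0], [2.0, 0.0]]
target = [[0.0, -1.0], [1.0, 1.0]]
loss = l1_latent_loss(predicted, target)
print(loss)  # 0.625
```

The loss is computed only at masked positions, so the predictor is never rewarded for copying features the encoder already saw.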
Two masking strategies are applied simultaneously during pretraining. A short range mask hides a contiguous spatial region across all frames, forcing the predictor to infer occluded content from temporal context. A long range mask hides a larger fraction of patches over the entire clip, forcing the encoder to retain enough global information to be predicted from sparse evidence. Roughly ninety percent of patches are typically masked across the two strategies combined, an aggressive ratio that the authors found necessary to prevent shortcut learning where the predictor could simply interpolate from nearby visible patches. [1]
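The idea of hiding the same spatial region in every frame can be sketched as follows. This is a minimal illustration, not the released masking code: the block counts, block scales, and aspect-ratio range are assumptions chosen for the example, and the helper names are hypothetical. The grid is 8 x 14 x 14, matching a sixteen frame 224 by 224 clip with 2x16x16 patches.

```python
import random


def sample_block_mask(grid_h, grid_w, num_blocks, block_scale, rng):
    """Return a grid_h x grid_w mask (True = hidden) as a union of rectangles."""
    mask = [[False] * grid_w for _ in range(grid_h)]
    for _ in range(num_blocks):
        aspect = rng.uniform(0.75, 1.5)  # assumed aspect-ratio range
        area = block_scale * grid_h * grid_w
        h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
        w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
        top = rng.randint(0, grid_h - h)
        left = rng.randint(0, grid_w - w)
        for r in range(top, top + h):
            for c in range(left, left + w):
                mask[r][c] = True
    return mask


def tube_mask(spatial_mask, grid_t):
    """Repeat one spatial mask at every temporal index of the token grid."""
    return [[row[:] for row in spatial_mask] for _ in range(grid_t)]


def masked_fraction(mask3d):
    """Fraction of tokens hidden by a T x H x W mask."""
    cells = [cell for frame in mask3d for row in frame for cell in row]
    return sum(cells) / len(cells)


rng = random.Random(0)
# Assumed settings: many small blocks versus a few large ones.
short_mask = tube_mask(sample_block_mask(14, 14, 8, 0.15, rng), grid_t=8)
long_mask = tube_mask(sample_block_mask(14, 14, 2, 0.70, rng), grid_t=8)
print(f"short-range hidden fraction: {masked_fraction(short_mask):.2f}")
print(f"long-range hidden fraction:  {masked_fraction(long_mask):.2f}")
```

Because the mask is identical in every frame, a hidden patch cannot be recovered by copying the same location from a neighbouring frame, which is the shortcut the aggressive masking is meant to close.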
V-JEPA models were pretrained on a curated mixture the authors named VideoMix2M, which combines approximately two million public videos. [1] The mixture includes HowTo100M, the Kinetics-400 and Kinetics-700 training splits, and the Something-Something v2 training split, with no labels used at any point. Clips are sampled at 5.33 frames per second to produce three second windows of sixteen frames. The dataset is roughly two orders of magnitude smaller than the corpus later assembled for V-JEPA 2 and was small enough to fit within Meta's standard research compute budgets in 2023 and early 2024.
All three released models were trained for 90,000 iterations. The ViT-L/16 and ViT-H/16 variants used a batch size of 3,072 clips, while the higher resolution ViT-H/16-384 variant used a batch size of 2,400 to fit in memory. Training uses the AdamW optimizer with cosine learning rate decay and weight decay scheduling, mixed precision arithmetic, and the standard exponential moving average teacher schedule that ramps from 0.998 to one over the course of training. No data augmentation beyond standard random cropping and horizontal flipping is applied. [1]
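The teacher update can be sketched in a few lines. The text above gives only the endpoints of the momentum schedule (0.998 and one); the linear ramp used here is an assumption, and the function names are hypothetical rather than taken from the released code.

```python
def ema_momentum(step, total_steps, start=0.998, end=1.0):
    """Momentum at a given step, assuming a linear ramp from start to end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def ema_update(target_weights, online_weights, momentum):
    """Move each target weight a small step toward the online weight."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_weights, online_weights)
    ]


total_steps = 90_000
target = [0.0, 1.0]   # toy target-encoder weights
online = [1.0, 1.0]   # toy online-encoder weights
target = ema_update(target, online, ema_momentum(0, total_steps))
print(target)  # first weight moves by 0.2% of the gap; second is unchanged
```

As the momentum approaches one the target encoder changes more and more slowly, so the prediction targets become increasingly stable late in training.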
Meta released three V-JEPA checkpoints. All share the same patch size and pretraining recipe and differ only in encoder size and input resolution. [3]
| Variant | Encoder | Patch size | Resolution | Frames | Iterations | Batch size |
|---|---|---|---|---|---|---|
| V-JEPA ViT-L/16 | ViT-Large | 2x16x16 | 224x224 | 16 | 90,000 | 3,072 |
| V-JEPA ViT-H/16 | ViT-Huge | 2x16x16 | 224x224 | 16 | 90,000 | 3,072 |
| V-JEPA ViT-H/16-384 | ViT-Huge | 2x16x16 | 384x384 | 16 | 90,000 | 2,400 |
The ViT-L/16 backbone has 24 transformer blocks, a 1,024 dimensional embedding, and 16 attention heads. The ViT-H/16 backbone has 32 blocks, a 1,280 dimensional embedding, and 16 heads. Parameter counts for the encoders are roughly 300 million and 630 million respectively. The predictor is the same compact transformer in all three configurations. The 384 resolution variant takes the same ViT-H/16 encoder but processes larger input clips, which roughly triples the patch count (from 1,568 to 4,608 tokens per clip) and increases the per step compute at least proportionally. Community ports of the checkpoints have appeared on Hugging Face since release, with the official artifacts distributed through the facebookresearch/jepa repository. [3]
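The quoted parameter counts can be sanity-checked with the common rule of thumb that one transformer block holds about 12 x d x d weights: 4 x d x d in the attention projections plus 8 x d x d in an MLP with a 4x hidden width. The sketch below applies that estimate. It ignores biases, layer norms, and patch and position embeddings, so it is an approximation and not a count read from the checkpoints; the function name is illustrative.

```python
def approx_vit_params(num_blocks, embed_dim):
    """Rough encoder size: 12 * d^2 weights per transformer block."""
    return 12 * num_blocks * embed_dim ** 2


vit_l = approx_vit_params(num_blocks=24, embed_dim=1024)
vit_h = approx_vit_params(num_blocks=32, embed_dim=1280)
print(f"ViT-L: {vit_l / 1e6:.0f}M")  # ViT-L: 302M
print(f"ViT-H: {vit_h / 1e6:.0f}M")  # ViT-H: 629M
```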
The V-JEPA paper emphasizes frozen feature evaluation, sometimes called probing. In this protocol the pretrained encoder weights are not updated. Instead, a small attentive probe consisting of a cross attention layer over patch features followed by a linear classifier is trained for each downstream task. This protocol is more demanding than full finetuning because the encoder cannot adapt to task specific statistics, and it is widely regarded as a cleaner measure of the quality of the underlying representation. The authors argue that frozen evaluation is also more practical for foundation models, since a single backbone can be deployed across many tasks without copying or finetuning the weights. [1]
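A minimal sketch of such a probe is shown below, assuming a single attention head, one learnable query vector, and no residual connections or normalization. These simplifications, the toy dimensions, and all names are assumptions for illustration and do not describe the exact probe in the released code. Only the query, the two projection matrices, and the classifier would be trained; the token features stand in for the output of the frozen encoder.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_probe(features, query, w_key, w_value, w_cls):
    """Pool token features with one cross-attention query, then classify."""
    dim = len(query)
    keys = [matvec(w_key, f) for f in features]
    values = [matvec(w_value, f) for f in features]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)  # one attention weight per token
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return matvec(w_cls, pooled), weights


rng = random.Random(0)
dim, token_count, num_classes = 8, 5, 3


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


features = random_matrix(token_count, dim)  # stand-in for frozen encoder output
query = [rng.gauss(0.0, 0.1) for _ in range(dim)]
logits, weights = attentive_probe(
    features,
    query,
    random_matrix(dim, dim),
    random_matrix(dim, dim),
    random_matrix(num_classes, dim),
)
print(len(logits), round(sum(weights), 6))  # 3 1.0
```

Compared with average pooling followed by a linear layer, the learned query lets the probe weight informative patches more heavily while still leaving the backbone untouched.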
The benchmark suite covers three video tasks and three image classification tasks. The video tasks are Kinetics-400 (appearance oriented) and Something-Something v2 (motion oriented) for action classification, plus AVA for spatiotemporal action detection. The image tasks are ImageNet-1K, Places205, and iNaturalist 2021, and are evaluated by feeding each still image to the video encoder as a static clip. The same probe architecture is used across all tasks so that differences in score reflect differences in the representation rather than the probe.
The headline frozen-evaluation result, reported in the paper abstract, is that the largest model, a ViT-H/16 trained only on videos, reaches 81.9 percent on Kinetics-400, 72.2 percent on Something-Something v2, and 77.9 percent on ImageNet-1K. [1] The per-variant numbers reported by Bardes and collaborators are summarized in the table below. All numbers are top-1 accuracy under the frozen attentive probe protocol. The table lists 77.4 percent for ImageNet-1K at 384 resolution, slightly below the 77.9 percent quoted in the abstract.
| Benchmark | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| Kinetics-400 | 80.8 | 82.0 | 81.9 |
| Something-Something v2 | 69.5 | 71.4 | 72.2 |
| ImageNet-1K | 74.8 | 75.9 | 77.4 |
| Places205 | 60.3 | 61.7 | 62.8 |
| iNaturalist 2021 | 67.8 | 67.9 | 72.6 |
On Kinetics-400, V-JEPA matched or beat the best published pixel reconstruction models at the time, including VideoMAE and OmniMAE, while using a frozen rather than finetuned backbone. [1][7][8] The gap widened on Something-Something v2, a benchmark designed to test fine grained motion understanding that does not reward appearance shortcuts. V-JEPA ViT-H/16-384 reached 72.2 percent there, several points above the strongest pixel reconstruction baselines under matched evaluation protocols. The authors interpret this as direct support for LeCun's argument that latent prediction encourages the encoder to retain motion semantics that pixel decoders tend to discard or absorb into the decoder.
On image classification the picture was more nuanced. V-JEPA outperformed comparably sized video pretraining baselines on ImageNet, Places205, and iNaturalist, but trailed image only foundation models such as DINOv2 that were trained on much larger curated image corpora. The authors did not present V-JEPA as a replacement for image foundation models. They presented it as evidence that a single video pretrained backbone could be competitive across image and video tasks simultaneously, which is a property that no previous video model had demonstrated with frozen weights.
The paper also reports a sample efficiency comparison. When trained for a fixed compute budget against VideoMAE, OmniMAE, and Hiera, V-JEPA reached comparable downstream accuracy with between 1.5 and 6 times fewer training iterations depending on the baseline. [1] Meta described this publicly as an improvement in "training and sample efficiency by a factor between 1.5x and 6x" over prior video pretraining methods. [2] This is consistent with the broader pattern observed across JEPA style methods, in which avoiding pixel reconstruction lets the model spend more of its capacity on representation quality rather than texture modeling.
V-JEPA occupies a specific position in the landscape of self-supervised video pretraining. The table below contrasts it with its closest siblings.
| Property | I-JEPA | V-JEPA | V-JEPA 2 |
|---|---|---|---|
| Released | June 2023 | February 2024 | June 2025 |
| Input modality | Single images | Sixteen frame clips | Sixty four frame clips |
| Training data | ImageNet-1K and -22K | VideoMix2M (~2M videos) | ~1M hours of internet video plus ~1M images |
| Largest encoder | ViT-H/14 (~630M params) | ViT-H/16 (~630M params) | ViT-g (~1B params) |
| Prediction target | Latent representation | Latent representation | Latent representation |
| Downstream evaluation | ImageNet probing | Frozen probes on K400, SSv2, ImageNet, AVA | Frozen probes plus action conditioned planning |
| Robot control | None | None | V-JEPA 2-AC variant supports zero-shot pick-and-place |
| License | CC BY-NC 4.0 | CC BY-NC 4.0 | MIT |
Compared to I-JEPA, V-JEPA adds the temporal dimension. The patches span two frames each, the masking strategy is spatiotemporal rather than purely spatial, and the predictor must reason about motion as well as occlusion. Compared to pixel reconstruction approaches such as VideoMAE, OmniMAE, and Hiera, V-JEPA replaces the pixel decoder with latent prediction and discards reconstruction loss entirely. [1][7][8] Compared to contrastive methods that learn from pairs of augmented views, V-JEPA does not require negatives and instead relies on the asymmetry between the online encoder and its moving-average target encoder, together with aggressive masking, to avoid representational collapse.
V-JEPA 2 is the second generation model, released June 11, 2025, and is roughly two orders of magnitude larger than the original V-JEPA in both data and compute. [6] It is a 1.2 billion parameter world model that retains the same latent prediction objective and the same vision transformer encoder family, with the encoder scaled to a ViT-g of about 1 billion parameters. [6] V-JEPA 2 scales pretraining to a corpus of more than 1 million hours of internet video plus around 1 million images, and is then post-trained into an action conditioned variant called V-JEPA 2-AC using only 62 hours of teleoperated robot footage from the public DROID dataset. [6][9]
The action conditioned model demonstrates zero-shot pick-and-place on Franka Panda arms in labs where no robot data was collected, achieving success rates of roughly 65 to 80 percent for picking and placing new objects in new and unseen environments. [9] On understanding and prediction benchmarks, V-JEPA 2 reaches state of the art human action anticipation of 39.7 recall-at-5 on Epic-Kitchens-100 and 77.3 percent top-1 on Something-Something v2 for motion understanding. [6][9] V-JEPA 2 also introduces three new physical reasoning benchmarks, IntPhys 2, Minimal Video Pairs (MVPBench), and CausalVQA, that the original V-JEPA was not designed to address, and it is released under the MIT license, removing the non commercial restriction. [6][9]
The continuity between the two models is deliberate. V-JEPA 2 reuses the architectural recipe of V-JEPA largely without modification, and many of the same authors, including Adrien Bardes, Mahmoud Assran, Yann LeCun, Michael Rabbat, and Nicolas Ballas, appear on both papers. [1][6]
The release of V-JEPA was covered widely in technical media in February and March 2024, and was framed by Meta as the next step on Yann LeCun's roadmap toward advanced machine intelligence after I-JEPA. [2] LeCun used the launch to reiterate his broader argument that generative pixel prediction is the wrong objective for representation learning and that latent prediction in an abstract space is more aligned with how biological systems learn from observation. Within the self-supervised learning research community, V-JEPA was received as the strongest piece of evidence to date for the JEPA hypothesis in video and as a useful frozen backbone for downstream evaluation. The choice to release only frozen probe results rather than fully finetuned numbers drew some criticism from practitioners who wanted to compare against finetuned VideoMAE and similar baselines, but most acknowledged that the frozen protocol was a tighter test of representation quality.
On the practical side, the released checkpoints were adopted as feature extractors in academic work on action recognition, anticipation, and video question answering. Several derivative projects on Hugging Face host community ports of the original ViT-L and ViT-H checkpoints with various probe heads attached. Because the model was released under a non commercial license, deployment in commercial products has been limited. The companion V-JEPA 2 release in June 2025 was issued under the MIT license, removing this restriction for downstream developers. [6]
Within Meta the V-JEPA codebase served as the engineering foundation for the V-JEPA 2 project, which built directly on the same encoder, predictor, and masking implementations in the facebookresearch/jepa repository. [3][6] The pretraining recipe also influenced subsequent work on physical reasoning and intuitive physics in video, including the Intuitive Physics from Self-Supervised Video paper by Garrido and collaborators that used a V-JEPA style encoder to study violation of expectation behavior on synthetic stimuli.
V-JEPA has several limitations that its successor was explicitly designed to address. The training corpus of two million clips is small by foundation model standards, and the longest clip the model sees during pretraining is three seconds. [1] This limits the temporal extent of motion patterns the encoder can capture and constrains downstream applicability to tasks that fit within a short temporal window. The frozen probe protocol, while useful as a benchmark, does not produce a model that can plan or act. V-JEPA emits representations, not policies, and the original release contains no action conditioning, no goal conditioning, and no facility for closed loop control. The non commercial license also restricts who can build on the model in production settings.
Additionally, although latent prediction avoids the wasted capacity of pixel decoders, it also lacks a natural diagnostic for what the predictor has learned. Researchers cannot inspect a predicted latent the way they can inspect a reconstructed frame. This makes failure analysis harder and was one of the practical reasons V-JEPA 2 introduced new benchmarks focused on physical plausibility and causal reasoning rather than reconstruction quality. [6]
Imagine covering up most of a short video clip and asking a model to guess what is behind the cover. A pixel based model would try to paint in every hidden detail, wasting effort on things like exact textures or the color of a passing car that nobody could reliably predict. V-JEPA instead works in a kind of mental shorthand. It does not redraw the hidden pixels. It predicts a compressed description of what is hidden, the same way you might guess "a hand is reaching for the cup" without knowing the precise shape of every finger. By practicing this guessing game on two million videos with no labels, V-JEPA learns a general sense of how objects and motion behave, which can then be reused for tasks like recognizing actions. Its bigger successor, V-JEPA 2, uses the same trick at much larger scale to help robots imagine what will happen if they move, so they can plan actions without being explicitly trained on each new object.