V-JEPA

Updated Jun 27, 2026

V-JEPA (Video Joint Embedding Predictive Architecture) is a self-supervised video model from Meta AI that learns by predicting masked regions of a video in an abstract latent representation space rather than by reconstructing pixels. The first V-JEPA was released on February 15, 2024, and was the first published video instance of the Joint Embedding Predictive Architecture (JEPA) paradigm proposed by Yann LeCun in 2022. Its successor, V-JEPA 2, released June 11, 2025, scaled the same recipe to roughly 1.2 billion parameters and more than 1 million hours of video, and used it as a world model for zero-shot robot planning. ^[1]^[2]^[6]

In short: V-JEPA is trained on two million publicly available videos using a feature prediction objective in latent space, with no negative samples, no text supervision, no pretrained image encoders, and no pixel-level reconstruction. ^[1] The model produces general purpose vision transformer backbones that, when frozen, transfer to both motion oriented and appearance oriented downstream tasks through lightweight attentive probes. Meta framed V-JEPA as a concrete step toward learning world models from passive observation of video, an objective LeCun has argued is a prerequisite for autonomous machine intelligence. ^[2]^[4]

The model was introduced through the arXiv preprint Revisiting Feature Prediction for Learning Visual Representations from Video (arXiv:2404.08471) by Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. ^[1] Meta released three pretrained checkpoints, the training and evaluation code, and the masked latent prediction recipe through the facebookresearch/jepa GitHub repository under a Creative Commons Attribution NonCommercial 4.0 license. ^[3]

V-JEPA is important for three reasons. First, it provided the strongest public evidence at the time that feature prediction alone, with no pixel decoder and no contrastive auxiliary loss, could match or exceed pixel reconstruction based video pretraining at scale. ^[1] Second, it demonstrated that a single frozen visual backbone could perform competitively across both motion heavy benchmarks such as Something-Something v2 and appearance heavy benchmarks such as Kinetics-400 and ImageNet, suggesting that latent prediction captures more than just shortcut features. ^[1] Third, by providing a working video implementation of LeCun's JEPA recipe, it set the architectural template that V-JEPA 2 later scaled to over one million hours of video and used as the foundation for downstream robot planning. ^[6]

What is V-JEPA in one sentence?

V-JEPA is a non-generative, self-supervised video representation model that trains an encoder and a predictor to forecast the latent features of masked spatiotemporal regions of a video, producing a frozen backbone that transfers to action recognition, video understanding, and image classification. In Meta's own description, "V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space," and "unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard unpredictable information." ^[2]

Key facts at a glance

Attribute	V-JEPA (2024)	V-JEPA 2 (2025)
Developer	Meta AI (FAIR)	Meta AI (FAIR)
Release date	February 15, 2024	June 11, 2025
Prediction target	Latent representation (not pixels)	Latent representation (not pixels)
Training data	~2 million public videos (VideoMix2M)	>1 million hours of video plus ~1 million images
Largest encoder	ViT-H/16 (~630M params)	ViT-g (~1B params), ~1.2B params with predictor
Robot control	None	Zero-shot pick-and-place (V-JEPA 2-AC)
License	CC BY-NC 4.0	MIT

What problem does V-JEPA solve, and where does JEPA come from?

The Joint Embedding Predictive Architecture was introduced by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. ^[4] In that paper LeCun argued that current generative approaches to self-supervised learning, which reconstruct masked pixels or tokens, expend too much capacity modeling fine perceptual detail that is irrelevant for downstream reasoning. He proposed instead that prediction should occur in an abstract embedding space produced by a learned encoder, so that the system can discard unpredictable or perceptually unimportant content. A JEPA model contains three pieces. An encoder maps an input into a representation. A target encoder, typically an exponential moving average of the online encoder, produces the prediction targets. A predictor maps from one representation to another, optionally conditioned on context tokens that describe what is being predicted from what.

Meta has tied this directly to its long-term research goal. As the company put it at launch, "Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently." ^[2]

The first concrete instance of the JEPA paradigm was I-JEPA, an image only model published by Assran and collaborators at CVPR 2023. ^[5] I-JEPA masks large contiguous blocks of an image and trains a predictor to recover their representations from a context view. V-JEPA generalizes the same idea to video by treating short clips as three dimensional tensors of patches and masking large spatiotemporal regions that span both space and time. The choice to make V-JEPA non-generative is deliberate. The authors view it as a test of LeCun's hypothesis that latent prediction can be a self-sufficient pretraining signal for perception, without the regularization that pixel reconstruction implicitly provides. ^[1]

How does V-JEPA work?

V-JEPA inherits the encoder, target encoder, and predictor triplet from I-JEPA and adapts it to video. The encoder is a standard vision transformer with three dimensional patches of size 2x16x16, meaning each patch spans two frames and a sixteen by sixteen pixel region. A short video clip of sixteen frames at 224 by 224 resolution is therefore tokenized into 8 x 14 x 14 = 1,568 patches, and the online encoder processes the visible subset of them with standard self attention. The predictor is a narrower transformer with twelve blocks and an embedding dimension of 384. It receives the context tokens emitted by the encoder together with learnable mask tokens that identify the spatiotemporal positions being predicted, and outputs predicted representations at those positions. ^[1]
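The token count follows directly from the clip and patch dimensions. The short sketch below reproduces the arithmetic for the 224 resolution configuration; the function name and its argument names are illustrative and do not come from the released code, and it assumes non-overlapping patches.

```python
def count_video_tokens(frames, height, width, tubelet=2, patch=16):
    """Count the 3D patches (tokens) in a clip, assuming non-overlapping patches."""
    if frames % tubelet or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the patch size")
    temporal = frames // tubelet                     # patches along the time axis
    spatial = (height // patch) * (width // patch)   # patches per time step
    return temporal * spatial


# Sixteen frames at 224 x 224 with 2x16x16 patches: 8 * 14 * 14 tokens.
print(count_video_tokens(16, 224, 224))  # prints 1568
```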

The pretraining objective is an L1 loss between the predicted representations and the representations produced by a target encoder applied to the full clip. The target encoder shares its architecture with the online encoder, receives no gradients, and is updated as an exponential moving average of the online weights, which prevents representational collapse without contrastive negatives. No reconstruction decoder is attached. No labels are used. No pretrained image features are imported.
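The two ingredients of this objective, a regression loss in feature space and a slowly moving target encoder, can be illustrated without any deep learning framework. The sketch below is a toy under stated assumptions: features and weights are plain Python lists, a plain mean absolute error stands in for the regression loss, and the helper names are hypothetical rather than taken from the facebookresearch/jepa code.

```python
def l1_latent_loss(predicted, target):
    """Mean absolute difference between predicted and target token features.

    Both arguments are lists of equal-length feature vectors, one per masked
    token. Targets are treated as constants, mirroring the stop-gradient.
    """
    total = 0.0
    count = 0
    for pred_vec, tgt_vec in zip(predicted, target):
        for p, t in zip(pred_vec, tgt_vec):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_weights, online_weights, momentum):
    """Move each target weight a small step toward the matching online weight."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


pred = [[0.5, 1.0], [2.0, -1.0]]
tgt = [[0.0, 1.0], [1.0, 1.0]]
print(l1_latent_loss(pred, tgt))                          # (0.5 + 0 + 1 + 2) / 4 = 0.875
print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5))   # [0.5, 0.5]
```

With a momentum close to one, the target weights change very slowly, which is what keeps the prediction targets stable from step to step.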

Two masking strategies are applied simultaneously during pretraining. A short range mask hides a contiguous spatial region across all frames, forcing the predictor to infer occluded content from temporal context. A long range mask hides a larger fraction of patches over the entire clip, forcing the encoder to retain enough global information to be predicted from sparse evidence. Roughly ninety percent of patches are typically masked across the two strategies combined, an aggressive ratio that the authors found necessary to prevent shortcut learning where the predictor could simply interpolate from nearby visible patches. ^[1]
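A mask that hides the same spatial region in every frame can be pictured as a tube through the clip. The sketch below builds such a mask on the 14 x 14 x 8 patch grid of a 224 resolution clip and reports how much of the clip is hidden. The block count, the block scale, and the square block shape are illustrative assumptions chosen for the example, not the paper's exact sampling parameters.

```python
import random


def sample_block_mask(grid_h, grid_w, num_blocks, block_scale, rng):
    """Return a grid_h x grid_w grid of booleans; True marks a masked cell.

    Each block is a rectangle covering roughly `block_scale` of the grid.
    Blocks may overlap, so the union can cover less than their summed area.
    """
    masked = [[False] * grid_w for _ in range(grid_h)]
    side_h = max(1, round(grid_h * block_scale ** 0.5))
    side_w = max(1, round(grid_w * block_scale ** 0.5))
    for _ in range(num_blocks):
        top = rng.randint(0, grid_h - side_h)    # randint is inclusive on both ends
        left = rng.randint(0, grid_w - side_w)
        for r in range(top, top + side_h):
            for c in range(left, left + side_w):
                masked[r][c] = True
    return masked


def extend_over_time(spatial_mask, time_steps):
    """Repeat one spatial mask at every time step, forming a tube mask."""
    return [[row[:] for row in spatial_mask] for _ in range(time_steps)]


def masked_fraction(mask_3d):
    """Fraction of all patches in the clip that are hidden."""
    cells = [cell for frame in mask_3d for row in frame for cell in row]
    return sum(cells) / len(cells)


rng = random.Random(0)
spatial = sample_block_mask(14, 14, num_blocks=2, block_scale=0.7, rng=rng)
tube = extend_over_time(spatial, time_steps=8)
print(f"masked fraction: {masked_fraction(tube):.2f}")
```

Because the hidden region is identical in every frame, the predictor cannot recover it by copying from a neighboring frame, which is the shortcut the aggressive masking is meant to block.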

What data and compute was V-JEPA trained on?

V-JEPA models were pretrained on a curated mixture the authors named VideoMix2M, which combines approximately two million public videos. ^[1] The mixture includes HowTo100M, the Kinetics-400, Kinetics-600, and Kinetics-700 training splits, and the Something-Something v2 training split, with no labels used at any point. Clips are sampled at 5.33 frames per second to produce three second windows of sixteen frames. The dataset is roughly two orders of magnitude smaller than the corpus later assembled for V-JEPA 2 and was small enough to fit within Meta's standard research compute budgets in 2023 and early 2024.

All three released models were trained for 90,000 iterations. The ViT-L/16 and ViT-H/16 variants used a batch size of 3,072 clips, while the higher resolution ViT-H/16-384 variant used a batch size of 2,400 to fit in memory. Training uses the AdamW optimizer with cosine learning rate decay and weight decay scheduling, mixed precision arithmetic, and the standard exponential moving average teacher schedule that ramps from 0.998 to one over the course of training. No data augmentation beyond standard random cropping and horizontal flipping is applied. ^[1]
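The teacher momentum schedule can be sketched in a few lines. The text above gives only the endpoints, 0.998 and one, so the linear shape of the ramp is an assumption made for this example, and the function name is hypothetical.

```python
def ema_momentum(step, total_steps, start=0.998, end=1.0):
    """Linearly interpolate the teacher momentum from `start` to `end`.

    The linear shape is an assumption; only the endpoints come from the text.
    """
    if total_steps <= 1:
        return end
    clamped = min(max(step, 0), total_steps - 1)
    frac = clamped / (total_steps - 1)
    return start + frac * (end - start)


total = 90_000
for step in (0, total // 2, total - 1):
    print(step, round(ema_momentum(step, total), 6))
```

As the momentum approaches one, the target encoder effectively freezes, so late in training the online encoder is regressing onto nearly fixed targets.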

What are the V-JEPA model variants?

Meta released three V-JEPA checkpoints. All share the same patch size and pretraining recipe and differ only in encoder size and input resolution. ^[3]

Variant	Encoder	Patch size	Resolution	Frames	Iterations	Batch size
V-JEPA ViT-L/16	ViT-Large	2x16x16	224x224	16	90,000	3,072
V-JEPA ViT-H/16	ViT-Huge	2x16x16	224x224	16	90,000	3,072
V-JEPA ViT-H/16-384	ViT-Huge	2x16x16	384x384	16	90,000	2,400

The ViT-L/16 backbone has 24 transformer blocks, a 1,024 dimensional embedding, and 16 attention heads. The ViT-H/16 backbone has 32 blocks, a 1,280 dimensional embedding, and 16 heads. Parameter counts for the encoders are roughly 300 million and 630 million respectively. The predictor is the same compact transformer in all three configurations. The 384 resolution variant takes the same ViT-H/16 encoder but processes larger input clips, which roughly triples the patch count (24 x 24 = 576 spatial positions per time step instead of 14 x 14 = 196) and raises the per step compute by at least as much. Community ports of the checkpoints have appeared on Hugging Face since release, with the official artifacts distributed through the facebookresearch/jepa repository. ^[3]
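The quoted parameter counts can be sanity checked from depth and width alone. The estimate below counts only the attention and MLP weight matrices in each block and assumes the conventional MLP expansion ratio of 4, which is not stated in the text above; biases, layer norms, and patch and position embeddings are ignored, so the figures are approximations.

```python
def estimate_vit_params(depth, width, mlp_ratio=4):
    """Approximate transformer-block parameters, ignoring biases, norms, embeddings."""
    attention = 4 * width * width             # query, key, value, output projections
    mlp = 2 * mlp_ratio * width * width       # two linear layers of the MLP
    return depth * (attention + mlp)


for name, depth, width in [("ViT-L/16", 24, 1024), ("ViT-H/16", 32, 1280)]:
    millions = round(estimate_vit_params(depth, width) / 1e6)
    print(name, millions, "M")
# prints:
# ViT-L/16 302 M
# ViT-H/16 629 M
```

Both estimates land close to the roughly 300 million and 630 million figures quoted for the two encoders.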

How is V-JEPA evaluated?

The V-JEPA paper emphasizes frozen feature evaluation, sometimes called probing. In this protocol the pretrained encoder weights are not updated. Instead, a small attentive probe consisting of a cross attention layer over patch features followed by a linear classifier is trained for each downstream task. This protocol is more demanding than full finetuning because the encoder cannot adapt to task specific statistics, and it is widely regarded as a cleaner measure of the quality of the underlying representation. The authors argue that frozen evaluation is also more practical for foundation models, since a single backbone can be deployed across many tasks without copying or finetuning the weights. ^[1]
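The probe described above can be reduced to two steps: pool the frozen patch features with attention weights from a learnable query, then apply a linear classifier. The sketch below is a deliberately simplified, single-head version in plain Python with no key or value projections; the function names and toy numbers are illustrative and not taken from the released evaluation code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(patch_features, query):
    """Pool patch feature vectors into one vector using a single query."""
    dim = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
              for feat in patch_features]
    weights = softmax(scores)
    return [sum(w * feat[i] for w, feat in zip(weights, patch_features))
            for i in range(dim)]


def linear_classifier(pooled, class_weights, class_biases):
    """Return one logit per class from the pooled feature vector."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(class_weights, class_biases)]


# Frozen encoder outputs for three patches (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query gives uniform attention, so pooling reduces to the mean.
pooled = attentive_pool(features, query=[0.0, 0.0])
logits = linear_classifier(pooled, [[1.0, 0.0], [0.0, 2.0]], [0.5, 0.0])
print(pooled, logits)
```

Only the query and the classifier weights would be trained in this setup; the patch features come from the frozen encoder and are never updated.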

The benchmark suite covers three video tasks and three image classification tasks. The video tasks are Kinetics-400 (appearance oriented) and Something-Something v2 (motion oriented) for action classification, plus AVA for spatiotemporal action detection. The image tasks are ImageNet-1K, Places205, and iNaturalist 2021, and are evaluated by treating a single image as a degenerate one frame clip. The same probe architecture is used across all tasks so that differences in score reflect differences in the representation rather than the probe.

What benchmark results did V-JEPA achieve?

The headline frozen-evaluation result, reported in the paper abstract, is that the largest model, a ViT-H/16 trained only on videos, reaches 81.9 percent on Kinetics-400, 72.2 percent on Something-Something v2, and 77.9 percent on ImageNet-1K, all under the frozen attentive probe protocol. ^[1] The per-variant numbers reported by Bardes and collaborators are summarized in the table below; its ImageNet-1K entry for the 384 resolution model is 77.4, slightly below the 77.9 quoted in the abstract. All numbers are top-1 accuracy in percent under the frozen attentive probe protocol.

Benchmark	ViT-L/16 224	ViT-H/16 224	ViT-H/16 384
Kinetics-400	80.8	82.0	81.9
Something-Something v2	69.5	71.4	72.2
ImageNet-1K	74.8	75.9	77.4
Places205	60.3	61.7	62.8
iNaturalist 2021	67.8	67.9	72.6

On Kinetics-400, V-JEPA matched or beat the best published pixel reconstruction models at the time, including VideoMAE and OmniMAE, while using a frozen rather than finetuned backbone. ^[1]^[7]^[8] The gap widened on Something-Something v2, a benchmark designed to test fine grained motion understanding that does not reward appearance shortcuts. V-JEPA ViT-H/16-384 reached 72.2 percent there, several points above the strongest pixel reconstruction baselines under matched evaluation protocols. The authors interpret this as direct support for LeCun's argument that latent prediction encourages the encoder to retain motion semantics that pixel decoders tend to discard or absorb into the decoder.

On image classification the picture was more nuanced. V-JEPA outperformed comparably sized video pretraining baselines on ImageNet, Places205, and iNaturalist, but trailed image only foundation models such as DINOv2 that were trained on much larger curated image corpora. The authors did not present V-JEPA as a replacement for image foundation models. They presented it as evidence that a single video pretrained backbone could be competitive across image and video tasks simultaneously, which is a property that no previous video model had demonstrated with frozen weights.

The paper also reports a sample efficiency comparison. When trained for a fixed compute budget against VideoMAE, OmniMAE, and Hiera, V-JEPA reached comparable downstream accuracy with between 1.5 and 6 times fewer training iterations depending on the baseline. ^[1] Meta described this publicly as an improvement in "training and sample efficiency by a factor between 1.5x and 6x" over prior video pretraining methods. ^[2] This is consistent with the broader pattern observed across JEPA style methods, in which avoiding pixel reconstruction lets the model spend more of its capacity on representation quality rather than texture modeling.

How does V-JEPA differ from I-JEPA, pixel reconstruction methods, and V-JEPA 2?

V-JEPA occupies a specific position in the landscape of self-supervised video pretraining. The table below contrasts it with its closest siblings.

Property	I-JEPA	V-JEPA	V-JEPA 2
Released	June 2023	February 2024	June 2025
Input modality	Single images	Sixteen frame clips	Sixty four frame clips
Training data	ImageNet-1K and -22K	VideoMix2M (~2M videos)	~1M hours of internet video plus ~1M images
Largest encoder	ViT-H/14 (~630M params)	ViT-H/16 (~630M params)	ViT-g (~1B params)
Prediction target	Latent representation	Latent representation	Latent representation
Downstream evaluation	ImageNet probing	Frozen probes on K400, SSv2, ImageNet, AVA	Frozen probes plus action conditioned planning
Robot control	None	None	V-JEPA 2-AC variant supports zero-shot pick-and-place
License	CC BY-NC 4.0	CC BY-NC 4.0	MIT

Compared to I-JEPA, V-JEPA adds the temporal dimension. The patches span two frames each, the masking strategy is spatiotemporal rather than purely spatial, and the predictor must reason about motion as well as occlusion. Compared to pixel reconstruction approaches such as VideoMAE, OmniMAE, and Hiera, V-JEPA replaces the pixel decoder with latent prediction and discards reconstruction loss entirely. ^[1]^[7]^[8] Compared to contrastive methods that learn from pairs of augmented views, V-JEPA does not require negatives and instead relies on the asymmetric encoder-target encoder dynamic and aggressive masking to avoid representational collapse.

What is V-JEPA 2?

V-JEPA 2 is the second generation model, released June 11, 2025, and is roughly two orders of magnitude larger than the original V-JEPA in both data and compute. ^[6] It is a 1.2 billion parameter world model that retains the same latent prediction objective and the same vision transformer encoder family, with the encoder scaled to a ViT-g of about 1 billion parameters. ^[6] V-JEPA 2 scales pretraining to a corpus of more than 1 million hours of internet video plus around 1 million images. It is then post-trained into an action conditioned variant called V-JEPA 2-AC using only 62 hours of teleoperated robot footage from the public DROID dataset. ^[6]^[9]

The action conditioned model demonstrates zero-shot pick-and-place on Franka Panda arms in labs where no robot data was collected, achieving success rates of roughly 65 to 80 percent for picking and placing new objects in new and unseen environments. ^[9] On understanding and prediction benchmarks, V-JEPA 2 reaches state of the art human action anticipation of 39.7 recall-at-5 on Epic-Kitchens-100 and 77.3 percent top-1 on Something-Something v2 for motion understanding. ^[6]^[9] The release also introduces three new physical reasoning benchmarks, IntPhys 2, Minimal Video Pairs (MVPBench), and CausalVQA, that the original V-JEPA was not designed to address. V-JEPA 2 is released under the MIT license, which removes the non-commercial restriction. ^[6]^[9]

The continuity between the two models is deliberate. V-JEPA 2 reuses the architectural recipe of V-JEPA largely without modification, and many of the same authors, including Adrien Bardes, Mahmoud Assran, Yann LeCun, Michael Rabbat, and Nicolas Ballas, appear on both papers. ^[1]^[6]

How was V-JEPA received and used?

The release of V-JEPA was covered widely in technical media in February and March 2024, and was framed by Meta as the next step on Yann LeCun's roadmap toward advanced machine intelligence after I-JEPA. ^[2] LeCun used the launch to reiterate his broader argument that generative pixel prediction is the wrong objective for representation learning and that latent prediction in an abstract space is more aligned with how biological systems learn from observation. Within the self-supervised learning research community, V-JEPA was received as the strongest piece of evidence to date for the JEPA hypothesis in video and as a useful frozen backbone for downstream evaluation. The choice to release only frozen probe results rather than fully finetuned numbers drew some criticism from practitioners who wanted to compare against finetuned VideoMAE and similar baselines, but most acknowledged that the frozen protocol was a tighter test of representation quality.

On the practical side, the released checkpoints were adopted as feature extractors in academic work on action recognition, anticipation, and video question answering. Several derivative projects on Hugging Face host community ports of the original ViT-L and ViT-H checkpoints with various probe heads attached. Because the model was released under a non-commercial license, deployment in commercial products has been limited. Its successor, V-JEPA 2, was issued in June 2025 under the MIT license, removing this restriction for downstream developers. ^[6]

Within Meta the V-JEPA codebase served as the engineering foundation for the V-JEPA 2 project, which built directly on the same encoder, predictor, and masking implementations in the facebookresearch/jepa repository. ^[3]^[6] The pretraining recipe also influenced subsequent work on physical reasoning and intuitive physics in video, including the Intuitive Physics from Self-Supervised Video paper by Garrido and collaborators that used a V-JEPA style encoder to study violation of expectation behavior on synthetic stimuli.

What are V-JEPA's limitations?

V-JEPA has several limitations that its successor was explicitly designed to address. The training corpus of two million videos is small by foundation model standards, and the longest clip the model sees during pretraining is three seconds. ^[1] This limits the temporal extent of motion patterns the encoder can capture and constrains downstream applicability to tasks that fit within a short temporal window. The frozen probe protocol, while useful as a benchmark, does not produce a model that can plan or act. V-JEPA emits representations, not policies, and the original release contains no action conditioning, no goal conditioning, and no facility for closed loop control. The non-commercial license also restricts who can build on the model in production settings.

Additionally, although latent prediction avoids the wasted capacity of pixel decoders, it also lacks a natural diagnostic for what the predictor has learned. Researchers cannot inspect a predicted latent the way they can inspect a reconstructed frame. This makes failure analysis harder and was one of the practical reasons V-JEPA 2 introduced new benchmarks focused on physical plausibility and causal reasoning rather than reconstruction quality. ^[6]

ELI5: What does V-JEPA actually do?

Imagine covering up most of a short video clip and asking a model to guess what is behind the cover. A pixel based model would try to paint in every hidden detail, wasting effort on things like exact textures or the color of a passing car that nobody could reliably predict. V-JEPA instead works in a kind of mental shorthand. It does not redraw the hidden pixels. It predicts a compressed description of what is hidden, the same way you might guess "a hand is reaching for the cup" without knowing the precise shape of every finger. By practicing this guessing game on two million videos with no labels, V-JEPA learns a general sense of how objects and motion behave, which can then be reused for tasks like recognizing actions. Its bigger successor, V-JEPA 2, uses the same trick at much larger scale to help robots imagine what will happen if they move, so they can plan actions without being explicitly trained on each new object.

References

1. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., and Ballas, N. (2024). *Revisiting Feature Prediction for Learning Visual Representations from Video*. arXiv:2404.08471. https://arxiv.org/abs/2404.08471
2. Meta AI Research (2024, February 15). *V-JEPA: The next step toward advanced machine intelligence*. Meta AI Blog. https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
3. Facebook AI Research (2024). *facebookresearch/jepa: PyTorch code and models for V-JEPA self-supervised learning from video*. GitHub repository. https://github.com/facebookresearch/jepa
4. LeCun, Y. (2022). *A Path Towards Autonomous Machine Intelligence*. OpenReview. https://openreview.net/forum?id=BZ5a1r-kVsf
5. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. (2023). *Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture*. Proceedings of CVPR 2023.
6. Assran, M., Bardes, A., Fan, D., et al. (2025). *V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning*. arXiv:2506.09985. https://arxiv.org/abs/2506.09985
7. Tong, Z., Song, Y., Wang, J., and Wang, L. (2022). *VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training*. NeurIPS 2022.
8. Girdhar, R., El-Nouby, A., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. (2023). *OmniMAE: Single Model Masked Pretraining on Images and Videos*. CVPR 2023.
9. Meta AI Research (2025, June 11). *Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning*. Meta AI Blog. https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
