V-JEPA 2
Last reviewed
May 16, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,941 words
V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is an open-source video world model developed by Meta AI and released on June 11, 2025. The model is the second iteration of the V-JEPA line and the most prominent published embodiment of Yann LeCun's Joint Embedding Predictive Architecture (JEPA) paradigm for autonomous machine intelligence. V-JEPA 2 is pretrained via self-supervised learning on more than one million hours of internet video together with roughly one million images, and is designed to support three capabilities at once: visual understanding, future prediction, and action planning. A companion variant, V-JEPA 2-AC, is post-trained on under sixty-two hours of unlabeled teleoperated robot footage from the public DROID dataset and enables zero-shot robot control on Franka Panda arms in laboratories the model has never seen.
The model was published in the arXiv preprint V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (arXiv:2506.09985) by a team led by Mido Assran with twenty-nine co-authors, including Yann LeCun, Michael Rabbat, Nicolas Ballas, Adrien Bardes, and David Fan. Meta released code, multiple checkpoints, and evaluation probes under an MIT license through the facebookresearch/vjepa2 GitHub repository and a companion Hugging Face collection.
V-JEPA 2 is notable for three reasons. First, it predicts future states in a learned latent representation space rather than in pixel space, in keeping with LeCun's long-argued thesis that generative pixel prediction wastes capacity on perceptually irrelevant detail. Second, it produces a single visual backbone that achieves competitive or state-of-the-art results across motion understanding, action anticipation, and video question answering when paired with a frozen probe or a language model. Third, by demonstrating zero-shot pick-and-place on a real robot trained only on internet video plus a small amount of unlabeled teleoperation data, it provides the first large-scale public evidence that a non-generative world model can support real robot planning without environment-specific data collection.
The JEPA framework was introduced by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. In that paper LeCun proposed that systems aiming for human-like reasoning should learn predictive world models in an abstract embedding space rather than by reconstructing raw inputs. A JEPA model contains an encoder that maps an input into a representation, a predictor that maps from one such representation to another conditioned on context or actions, and a training objective that aligns the predicted representation with a target representation produced by a target encoder. Because the prediction is in latent space, the system is free to discard unpredictable low-level detail such as exact pixel values, background noise, and viewing angle, and instead capture the structural regularities that govern how a scene evolves.
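As a compact restatement of the three components described above, the training objective can be written in one line. The notation here is an illustrative assumption rather than the notation of LeCun's paper: $x$ is the context input, $y$ the target input, $z$ the conditioning information (context or actions), $E_\theta$ the encoder, $E_{\bar\theta}$ the target encoder, $P_\phi$ the predictor, and $d$ an unspecified distance in embedding space.

```latex
\mathcal{L}(\theta, \phi) \;=\; d\Big( P_\phi\big(E_\theta(x),\, z\big),\; E_{\bar\theta}(y) \Big)
```

Both arguments of $d$ are embeddings, so no term in the objective asks the model to reproduce pixel values.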
Meta's research group released several JEPA implementations before V-JEPA 2. I-JEPA, presented in 2023, applied the framework to still images and showed that masked-region prediction in latent space could produce strong visual representations without contrastive learning, data augmentation, or pixel reconstruction. V-JEPA, released in early 2024, extended the same recipe to short video clips, training a vision transformer to predict the embeddings of masked spatiotemporal tubes given the embeddings of the visible portion. V-JEPA 2 scales this approach by orders of magnitude in both data and parameter count, adds a second post-training stage focused on action-conditioned prediction, and introduces three new benchmarks aimed at separating genuine physical understanding from textual or statistical shortcuts.
| Model | Year | Modality | Pretraining data | Parameter count | Primary contribution |
|---|---|---|---|---|---|
| I-JEPA | 2023 | Still images | ImageNet (1.3M images) | ViT-H up to 632M | First JEPA-based vision model; latent-space masked prediction |
| V-JEPA | 2024 | Short videos | ~2M public videos | ViT-L and ViT-H | Extends latent masked prediction to spatiotemporal tubes |
| V-JEPA 2 | June 2025 | Long-form internet video plus images | >1M hours of video plus ~1M images | ViT-L 300M, ViT-H 600M, ViT-g 1B, plus 1.2B world-model configuration | Internet-scale self-supervised video pretraining for understanding, prediction, and planning |
| V-JEPA 2-AC | June 2025 | Robot teleoperation | <62 hours of unlabeled DROID footage | 300M action-conditioned predictor over frozen ViT-g encoder | Latent action-conditioned world model enabling zero-shot robot planning |
V-JEPA 2 follows the canonical JEPA template with two transformer modules. The encoder is a vision transformer applied to a video clip that has been tokenized into spatiotemporal patches. The predictor is a separate transformer that takes the encoder's output for a context region and a set of position queries and outputs predicted embeddings for the masked region. The training objective is a regression loss between the predicted embeddings and target embeddings produced by an exponential-moving-average copy of the encoder. To prevent representational collapse, V-JEPA 2 inherits the variance, invariance, and covariance style of design choices that have proved stable across the JEPA family, including stop-gradient through the target encoder and careful regularization of the predictor's output distribution.
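The two training ingredients named above, an exponential-moving-average target encoder and a regression loss in embedding space, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Meta's implementation: the function names, the momentum value, and the choice of an L1 (mean absolute error) distance are all hypothetical, and weights are represented as flat lists of floats.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move each target-encoder weight a small step toward the online encoder.

    The momentum value is an assumed placeholder, not a published setting.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


def latent_regression_loss(predicted, target):
    """Mean absolute error between predicted and target embeddings.

    `target` plays the role of the target encoder's output and is treated as a
    constant, which is what stop-gradient achieves in an autograd framework.
    The L1 distance is an assumption made for this sketch.
    """
    total = 0.0
    count = 0
    for pred_vec, tgt_vec in zip(predicted, target):
        for p, t in zip(pred_vec, tgt_vec):
            total += abs(p - t)
            count += 1
    return total / count


# Toy usage: two masked tokens with two-dimensional embeddings.
loss = latent_regression_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
new_target = ema_update([0.0, 0.0], [1.0, 2.0], momentum=0.9)
print(loss, new_target)
```

Because the target weights change only through the slow average, the regression targets drift gradually, which is one reason this family of methods can avoid the trivial solution in which every input maps to the same embedding.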
The released V-JEPA 2 checkpoints span four configurations: ViT-L/16 with 300 million parameters at 256-pixel input resolution, ViT-H/16 with 600 million parameters at 256 resolution, ViT-g/16 with one billion parameters at 256 resolution, and a ViT-g/16 trained at 384 resolution. The largest configuration used in the world-model evaluations is described in the paper as a 1.2-billion-parameter model when combined with the predictor that is used for downstream video understanding tasks. The model ingests sixty-four-frame clips during training and supports variable sampling rates at inference.
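To give a sense of the sequence lengths these settings imply, the following sketch counts spatiotemporal tokens per clip. The "/16" suffix denotes a 16-pixel patch side; the temporal depth of each patch (`tubelet_depth=2`) is an assumption for illustration and is not stated in this article, as is the helper name `num_tokens`.

```python
def num_tokens(frames, resolution, patch=16, tubelet_depth=2):
    """Count spatiotemporal tokens for a square clip.

    tubelet_depth (frames grouped into one token) is an assumed value.
    """
    if frames % tubelet_depth or resolution % patch:
        raise ValueError("frames and resolution must divide evenly into patches")
    patches_per_side = resolution // patch
    return (frames // tubelet_depth) * patches_per_side * patches_per_side


print(num_tokens(64, 256))  # 8192 tokens under the assumed tubelet depth
print(num_tokens(64, 384))  # 18432 tokens under the assumed tubelet depth
```

Under these assumptions, moving from 256 to 384 resolution multiplies the token count by 2.25, which is the main cost of the higher-resolution checkpoint.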
For V-JEPA 2-AC, the action-conditioned variant, Meta keeps the V-JEPA 2 encoder frozen and trains a new 300 million parameter transformer predictor with block-causal attention. This predictor takes a sequence of past encoded frames plus a sequence of low-dimensional action vectors (joint commands for the robot arm) and autoregressively predicts the encoder's representation of the next frame. Because both the past states and the predicted states live in the same latent space, planning at test time can be performed entirely in that space without ever generating pixels. Image-goal planning is implemented using model-predictive control: candidate action sequences are rolled out through the action-conditioned predictor, the resulting predicted latent is compared with the encoded goal image, and the cross-entropy method is used to refine the action distribution.
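The planning loop described above can be illustrated with a toy cross-entropy-method planner. Everything here is a simplified sketch under stated assumptions: `toy_predictor` is a hypothetical stand-in for the learned action-conditioned predictor (it simply adds the action to the latent), latents are two-dimensional lists, the cost is an L1 distance to the goal latent, and the sample counts and horizon are arbitrary.

```python
import random
import statistics


def toy_predictor(latent, action):
    """Hypothetical stand-in for the learned predictor: next latent = latent + action."""
    return [z + a for z, a in zip(latent, action)]


def rollout_cost(start, actions, goal, predictor):
    """Roll an action sequence through the predictor; return L1 distance to the goal latent."""
    latent = start
    for action in actions:
        latent = predictor(latent, action)
    return sum(abs(z - g) for z, g in zip(latent, goal))


def cem_plan(start, goal, predictor, horizon=3, action_dim=2,
             samples=200, elites=20, iterations=10, seed=0):
    """Refine a Gaussian over action sequences toward low-cost rollouts."""
    rng = random.Random(seed)
    mean = [[0.0] * action_dim for _ in range(horizon)]
    std = [[1.0] * action_dim for _ in range(horizon)]
    for _ in range(iterations):
        # Sample candidate action sequences from the current distribution.
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(action_dim)]
             for t in range(horizon)]
            for _ in range(samples)
        ]
        # Keep the sequences whose predicted final latent is closest to the goal.
        candidates.sort(key=lambda seq: rollout_cost(start, seq, goal, predictor))
        best = candidates[:elites]
        # Refit the mean and standard deviation to the elite set.
        for t in range(horizon):
            for d in range(action_dim):
                values = [seq[t][d] for seq in best]
                mean[t][d] = statistics.fmean(values)
                std[t][d] = statistics.pstdev(values) + 1e-6
    return mean


start, goal = [0.0, 0.0], [1.5, -0.5]
plan = cem_plan(start, goal, toy_predictor)
print(rollout_cost(start, plan, goal, toy_predictor))
```

In a model-predictive control loop, only the first action of the returned plan would typically be executed before the robot re-encodes its camera view and plans again. No pixels are generated at any point; the goal image enters only through its encoded latent.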
V-JEPA 2 pretraining proceeds in a single self-supervised stage on the combined video-and-image corpus. Meta describes the video portion as drawing on more than one million hours of internet-scale footage, with the precise composition not fully disclosed but spanning diverse domains and camera viewpoints. During pretraining the model is given a tokenized video clip from which a substantial fraction of patches are masked, and it is trained to predict the embedding of those masked tubes. Crucially, the loss is computed in feature space against the target encoder rather than against the raw pixel values, which the team has repeatedly argued is essential for learning representations that are robust to nuisance variation.
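The idea of masking "tubes" can be made concrete with a small sketch: the same spatial positions are hidden in every time step, so the model cannot recover a masked patch simply by copying it from a neighbouring frame. This is a simplified illustration, not the masking strategy used in the paper; the uniform random choice of positions, the `mask_ratio` value, and the function name are assumptions.

```python
import random


def sample_tube_mask(grid_h, grid_w, time_steps, mask_ratio=0.75, seed=0):
    """Return a [time][row][col] grid of booleans; True marks a masked token.

    The same spatial positions are hidden at every time step, forming tubes.
    mask_ratio is an assumed placeholder for "a substantial fraction".
    """
    rng = random.Random(seed)
    positions = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    hidden = set(rng.sample(positions, round(mask_ratio * len(positions))))
    frame_mask = [[(r, c) in hidden for c in range(grid_w)] for r in range(grid_h)]
    # Copy the per-frame mask along the time axis.
    return [[row[:] for row in frame_mask] for _ in range(time_steps)]


mask = sample_tube_mask(grid_h=4, grid_w=4, time_steps=3)
for row in mask[0]:
    print("".join("#" if hidden else "." for hidden in row))
```

The encoder would see only the tokens marked `False`, and the predictor would be asked for the target encoder's embeddings at the positions marked `True`.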
The action-conditioned post-training stage for V-JEPA 2-AC uses less than sixty-two hours of unlabeled video drawn from the DROID dataset, the open-source teleoperation corpus collected from a fleet of Franka Panda arms across many institutions. "Unlabeled" here means that the post-training does not depend on natural language task descriptions or reward labels, only on the synchronized pairing of robot proprioception and video. The total compute and wall-clock cost of the action-conditioned post-training is small compared to internet-scale pretraining, reflecting the JEPA philosophy that most knowledge about the physical world should already be present in the encoder after large-scale passive observation.
For downstream evaluations the team uses two main protocols. The first freezes the encoder and trains a lightweight attentive read-out probe for classification or anticipation tasks. The second aligns the frozen encoder with a large language model through a projection layer and a short instruction-tuning phase to support video question answering.
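The first protocol can be sketched as attention pooling followed by a linear classifier. The version below is deliberately minimal and is an assumption about the general shape of such a probe, not the released probe code: it uses a single learned query vector, omits key and value projections and multiple heads, and all names are hypothetical. Only `query`, `class_weights`, and `class_bias` would be trained; the encoder tokens are treated as fixed inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(tokens, query):
    """Pool frozen encoder tokens into one vector using a learned query."""
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def probe_logits(tokens, query, class_weights, class_bias):
    """Linear classifier applied to the attention-pooled representation."""
    pooled = attentive_pool(tokens, query)
    return [sum(w * p for w, p in zip(row, pooled)) + b
            for row, b in zip(class_weights, class_bias)]


# Toy usage: two tokens, two-dimensional embeddings, two classes.
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(probe_logits(tokens, query=[0.0, 0.0],
                   class_weights=[[1.0, 0.0], [0.0, 2.0]], class_bias=[0.0, 0.0]))
```

Compared with simple mean pooling, a learned query lets the probe weight the tokens that matter for the task while leaving the encoder itself untouched.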
V-JEPA 2 was evaluated across a wide suite of established video understanding benchmarks. Two of the headline results sit on Something-Something v2, which tests fine-grained motion understanding, and Epic-Kitchens-100, which tests action anticipation one second into the future. The model also sets new state-of-the-art numbers on two video question-answering benchmarks that are sensitive to temporal and causal reasoning when paired with a language model.
| Benchmark | Capability tested | V-JEPA 2 score | Notes |
|---|---|---|---|
| Something-Something v2 | Fine-grained motion classification | 77.3 top-1 accuracy | Frozen encoder with attentive read-out; outperforms InternVideo and VideoMAEv2 |
| Epic-Kitchens-100 | Action anticipation at one second horizon | 39.7 recall at 5 | New state of the art; previous best methods were specifically designed for anticipation |
| Perception Test | Multimodal video question answering | 84.0 | Frozen encoder aligned with a large language model |
| TempCompass | Temporal reasoning in video QA | 76.9 | State of the art with language model alignment |
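For readers unfamiliar with the "recall at 5" figure in the table, the sketch below shows the plain per-sample version of the metric: a prediction counts as a hit when the true label is among the five highest-scoring classes. This is an illustrative assumption about the metric's general form; the benchmark's official protocol may differ, for example by averaging recall per class, and the function name is hypothetical.

```python
def recall_at_k(scores_per_sample, true_labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(scores_per_sample, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)


# Toy usage: six classes, two samples with identical scores but different labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(recall_at_k([scores, scores], true_labels=[0, 3]))  # 0.5
```

Anticipation is usually scored with a top-k metric of this kind because several future actions can be plausible from the same observed context.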
Meta also introduced three new benchmarks alongside V-JEPA 2 to probe physical understanding more rigorously. IntPhys 2 evaluates the ability of a model to detect physically implausible events, with human raters reaching approximately eighty-five to ninety-five percent accuracy. MVPBench, or Minimal Video Pairs Bench, presents the same video question with two nearly identical clips that differ in the answer, removing many of the surface-level shortcuts that video question-answering models can otherwise exploit. CausalVQA targets cause-and-effect reasoning, counterfactual questions, and short-horizon anticipation. On these new benchmarks V-JEPA 2 generally leads other open video foundation models but still trails human performance by a wide margin, which the authors argue is exactly the headroom that motivates further research on world modeling.
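One way to see why minimal pairs remove shortcuts is to compare two scoring rules on the same answers. The sketch below is an assumption made for illustration, since this article does not state how MVPBench is scored: `paired_accuracy` credits a pair only when both clips are answered correctly, while `single_accuracy` scores every clip independently. Both function names are hypothetical.

```python
def paired_accuracy(results):
    """Share of minimal pairs in which both clips were answered correctly.

    results: list of (first_correct, second_correct) booleans, one tuple per pair.
    """
    return sum(a and b for a, b in results) / len(results)


def single_accuracy(results):
    """Share of individual clips answered correctly, ignoring the pairing."""
    return sum(a + b for a, b in results) / (2 * len(results))


# A model that always gives the same answer gets exactly one clip per pair right.
constant_guesser = [(True, False)] * 4
print(single_accuracy(constant_guesser))  # 0.5
print(paired_accuracy(constant_guesser))  # 0.0
```

Under the paired rule, a model that ignores the video and always returns the statistically likelier answer scores zero, because the two clips in a pair have different correct answers.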
The action-conditioned variant was deployed zero-shot on Franka Panda arms in two different laboratories, neither of which contributed data to the encoder or predictor training. Tasks were specified as a single goal image showing the desired final scene; the controller used model-predictive control through the latent predictor to pick action sequences. Reported success rates are 100 percent on simple reach tasks, an average of 65 percent on grasp tasks, an average of 75 percent on reach-with-object tasks, and an average of 65 percent on full pick-and-place tasks across the two labs. The results are notable because the policy is goal-conditioned rather than language-conditioned, has never seen the specific robots or objects in question, and never receives a reward signal.
V-JEPA 2 arrived in the middle of a broader shift among large research labs toward what are loosely called world foundation models. These systems share the goal of letting agents predict and plan in a learned model of the physical world, but they differ sharply in what they actually predict and how they are intended to be used. The table below summarizes how V-JEPA 2 compares to two prominent contemporaries, Google DeepMind's Genie line and NVIDIA's Cosmos suite.
| System | Developer | Architecture style | What it predicts | Primary use case | Approximate scale | Openness |
|---|---|---|---|---|---|---|
| V-JEPA 2 | Meta AI | Joint embedding predictive (non-generative) | Future latent representations of video | Understanding, prediction, robot planning | 1.2B parameters; >1M hours video pretraining | Open weights and code under MIT |
| Genie 3 | Google DeepMind | Autoregressive generative simulator | Pixel-level next frames conditioned on actions | Interactive, navigable simulated worlds | Reported in the eleven billion parameter range | Closed; access via research preview |
| NVIDIA Cosmos | NVIDIA | Generative world foundation model family (Predict, Transfer, Reason) | Pixel-level video and physics-aware synthetic data | Simulation, synthetic data for robotics and AV | 7B and 14B parameter variants | Open weights with permissive license |
Meta's blog post on V-JEPA 2 reports that the system is approximately thirty times faster than NVIDIA Cosmos on physical reasoning tasks, which is consistent with the architectural difference: V-JEPA 2 predicts a small set of latent features per frame, whereas pixel-generative systems must produce a full image. The trade-off is that V-JEPA 2 cannot be used to render a video for a human viewer; its outputs are only meaningful when fed back into the planner or into a downstream probe. World Labs, founded in 2024 to build 3D world models, occupies yet another point in the design space, focusing on generating navigable 3D scenes from images rather than predicting video futures.
The broader debate that V-JEPA 2 is intended to advance is whether the next generation of foundation models should generate pixels or predict in latent space, a question on which LeCun has publicly championed the latent-space side for years. V-JEPA 2's results are widely cited as the strongest existing evidence that the latent predictive route can scale.
Meta released V-JEPA 2 through the facebookresearch/vjepa2 repository on GitHub. The majority of the code is licensed under the MIT License, with a small number of utility files under Apache 2.0. Pretrained encoder weights for all four model sizes are available, along with the action-conditioned predictor checkpoint trained from the ViT-g encoder. The release also includes evaluation probes for Something-Something v2, Diving48, and Epic-Kitchens-100, plus integration with PyTorch Hub and a companion Hugging Face collection.
In a follow-up release Meta also published V-JEPA 2.1, a refinement of the same pretraining recipe that broadens the lineup of available encoder sizes to include a ViT-B/16 with eighty million parameters at 384 resolution as well as an even larger ViT-G/16 with two billion parameters at 384 resolution. V-JEPA 2.1 is described as unlocking denser features for downstream tasks while keeping the same overall self-supervised objective.
Meta also published a public Hugging Face leaderboard for the newly introduced IntPhys 2, MVPBench, and CausalVQA benchmarks. The intent, as expressed in the original blog post, is to encourage the community to develop and openly evaluate world models on tasks that go beyond conventional video classification.
For the embodied AI community, the most significant aspect of V-JEPA 2 is the V-JEPA 2-AC demonstration. Prior work on learned robot policies typically required tens of thousands of robot trajectories collected on the specific embodiment and in similar environments, or relied on heavily engineered simulation pipelines. V-JEPA 2-AC reuses representations learned from internet video, post-trains for a short time on a publicly available teleoperation corpus, and is then deployed on robots in unfamiliar laboratories without any additional data collection. This pattern, sometimes referred to as zero-shot embodied transfer, is widely seen as a prerequisite for scaling general-purpose robot intelligence in the same way that internet-scale pretraining scaled language models.
The results also reframe the role of simulation in robotics. Generative world models such as Genie and Cosmos are positioned as ways to synthesize training data for downstream controllers, effectively replacing or augmenting classical simulators. V-JEPA 2 instead skips the generation step and uses the learned predictor directly inside the control loop. Both approaches are likely to coexist, and several research groups have begun exploring hybrid systems that use pixel-generative world models for data augmentation and JEPA-style latent predictors for closed-loop control.
Open questions remain. V-JEPA 2-AC is currently evaluated on tabletop manipulation with a relatively short planning horizon and a single arm; it has not yet been demonstrated on dexterous bimanual tasks, mobile manipulation, or long-horizon plans that require explicit task decomposition. The benchmarks introduced alongside the model also reveal substantial gaps between V-JEPA 2 and human performance on physical plausibility detection and causal reasoning, suggesting that the JEPA recipe will need to scale further or be combined with more structured objectives before it can match human-level physical intuition.
Reception in the AI research and trade press emphasized three points. First, the open-source release with permissive licensing was contrasted favorably with the closed nature of contemporaneous world models from Google DeepMind and Wayve. Second, the zero-shot robot demonstrations were treated as the most impressive evidence to date for LeCun's thesis that latent-space prediction is sufficient for real-world control. Third, observers noted that V-JEPA 2 is one of the first foundation models from a major industrial lab whose primary downstream target is embodied AI rather than chatbots or content generation, and that its release accelerates a shift in industrial AI investment toward physical world modeling.
The paper has been cited as the canonical reference for scaling JEPA-style training to internet-scale video, and the Hugging Face leaderboard for IntPhys 2, MVPBench, and CausalVQA has attracted submissions from groups working on competing architectures, including pixel-generative systems and language-model-centric video agents.