V-JEPA 2

Updated Jun 24, 2026

V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is an open-source video world model released by Meta AI on June 11, 2025 that learns to understand, predict, and plan in the physical world by watching internet video. It is a 1.2 billion parameter model pretrained via self-supervised learning on more than one million hours of internet video together with roughly one million images, and it is the most prominent published embodiment of Yann LeCun's Joint Embedding Predictive Architecture (JEPA) paradigm for autonomous machine intelligence. ^[1]^[2] Unlike generative world models that synthesize pixels, V-JEPA 2 predicts future states in a learned representation space, which Meta says lets it run roughly 30 times faster than NVIDIA's Cosmos on physical reasoning tasks. ^[2] A companion variant, V-JEPA 2-AC, is post-trained on under 62 hours of unlabeled teleoperated robot footage from the public Droid dataset and enables zero-shot robot control on Franka Panda arms in laboratories the model has never seen. ^[1]^[2]

The model was published in the arXiv preprint V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (arXiv:2506.09985) by a team led by Mido Assran, including Yann LeCun, Michael Rabbat, Nicolas Ballas, Adrien Bardes, and David Fan. ^[1] Meta released code, multiple checkpoints, and evaluation probes under an MIT license through the facebookresearch/vjepa2 GitHub repository and a companion Hugging Face collection. ^[4] Meta describes the work as "meaningful progress toward our ultimate goal of developing advanced machine intelligence (AMI)," arguing that world models are "essential to building AI agents that can think before they act." ^[3]

V-JEPA 2 is notable for three reasons. First, it predicts future states in a learned latent representation space rather than in pixel space, in keeping with LeCun's long-argued thesis that generative pixel prediction wastes capacity on perceptually irrelevant detail. ^[6] Second, it produces a single visual backbone that achieves competitive or state-of-the-art results across motion understanding, action anticipation, and video question answering when paired with a frozen probe or a language model. ^[1] Third, by demonstrating zero-shot pick-and-place on a real robot trained only on internet video plus a small amount of unlabeled teleoperation data, it provides the first large-scale public evidence that a non-generative world model can support real robot planning without environment-specific data collection. ^[1]^[2]

What is the JEPA paradigm behind V-JEPA 2?

The JEPA framework was introduced by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. ^[6] In that paper LeCun proposed that systems aiming for human-like reasoning should learn predictive world models in an abstract embedding space rather than by reconstructing raw inputs. A JEPA model contains an encoder that maps an input into a representation, a predictor that maps from one such representation to another conditioned on context or actions, and a training objective that aligns the predicted representation with a target representation produced by a target encoder. Because the prediction is in latent space, the system is free to discard unpredictable low-level detail such as exact pixel values, background noise, and viewing angle, and instead capture the structural regularities that govern how a scene evolves.

Meta's research group released several JEPA implementations before V-JEPA 2. I-JEPA, presented in 2023, applied the framework to still images and showed that masked-region prediction in latent space could produce strong visual representations without contrastive learning, data augmentation, or pixel reconstruction. ^[7] V-JEPA, released in early 2024, extended the same recipe to short video clips, training a vision transformer to predict the embeddings of masked spatiotemporal tubes given the embeddings of the visible portion. ^[8] V-JEPA 2 scales this approach by orders of magnitude in both data and parameter count, adds a second post-training stage focused on action-conditioned prediction, and introduces three new benchmarks aimed at separating genuine physical understanding from textual or statistical shortcuts. ^[2]

JEPA lineage

Model	Year	Modality	Pretraining data	Parameter count	Primary contribution
I-JEPA	2023	Still images	ImageNet (1.3M images)	ViT-H up to 632M	First JEPA-based vision model; latent-space masked prediction
V-JEPA	2024	Short videos	~2M public videos	ViT-L and ViT-H	Extends latent masked prediction to spatiotemporal tubes
V-JEPA 2	June 2025	Long-form internet video plus images	>1M hours of video plus ~1M images	ViT-g encoder over 1B params; 1.2B world-model configuration	Internet-scale self-supervised video pretraining for understanding, prediction, and planning
V-JEPA 2-AC	June 2025	Robot teleoperation	<62 hours of unlabeled Droid footage	300M action-conditioned predictor over frozen ViT-g encoder	Latent action-conditioned world model enabling zero-shot robot planning

How is V-JEPA 2 built?

V-JEPA 2 follows the canonical JEPA template with two transformer modules. As Meta describes it, the architecture has "two main components: an encoder, which takes in raw video and outputs embeddings that capture useful semantic information about the state of the observed world, and a predictor, which takes in a video embedding and additional context about what to predict and outputs predicted embeddings." ^[2] The encoder is a vision transformer applied to a video clip that has been tokenized into spatiotemporal patches; the predictor is a separate transformer that takes the encoder's output for a context region and a set of position queries and outputs predicted embeddings for the masked region. The training objective is a regression loss between the predicted embeddings and target embeddings produced by an exponential-moving-average copy of the encoder. To prevent representational collapse, V-JEPA 2 inherits the design choices that have proved stable across the JEPA family, including stop-gradient through the target encoder and careful regularization of the predictor's output distribution.
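The two training ingredients named above, a regression loss in embedding space and an exponential-moving-average target encoder, can be sketched with toy numbers. This is a minimal illustration and not Meta's code: the function names, the momentum values, and the choice of an L1 loss are assumptions (the text only says "regression loss"), and short lists of floats stand in for real network weights and embeddings.

```python
def ema_update(target, online, momentum=0.99):
    """Move each target-encoder weight a small step toward the online encoder."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


def l1_latent_loss(predicted, target):
    """Mean absolute error between predicted and target embeddings.

    The target is treated as a constant (stop-gradient): in a real training
    loop no gradient would flow back into the target encoder.
    """
    assert len(predicted) == len(target)
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Toy "weights" standing in for encoder parameters (hypothetical values).
online_weights = [1.0, -2.0, 0.5]
target_weights = [0.0, 0.0, 0.0]

# Three EMA steps with an exaggerated momentum so the drift is visible.
for _ in range(3):
    target_weights = ema_update(target_weights, online_weights, momentum=0.9)

# Loss between a predicted embedding and a target embedding (toy values).
loss = l1_latent_loss([0.2, 0.4, 0.6], [0.0, 0.5, 1.0])
```

Because the target weights only trail the online weights slowly, the regression targets change gradually, which is one reason this family of methods can avoid collapsing to a constant embedding.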

The released V-JEPA 2 encoder uses a ViT-g backbone with over one billion parameters, and the full world-model configuration combining the encoder with its predictor is described in the paper as a 1.2 billion parameter model. ^[1]^[3] The video-and-image pretraining corpus, named VideoMix22M, comprises roughly 22 million samples drawn from public sources. ^[2] The model ingests 64-frame clips during training and supports variable sampling rates at inference.

For V-JEPA 2-AC, the action-conditioned variant, Meta keeps the V-JEPA 2 encoder frozen and trains a new 300 million parameter transformer predictor with block-causal attention. ^[1] This predictor takes a sequence of past encoded frames plus a sequence of low-dimensional action vectors (described in the paper as changes in the arm's end-effector state) and autoregressively predicts the encoder's representation of the next frame. Because both the past states and the predicted states live in the same latent space, planning at test time can be performed entirely in that space without ever generating pixels. Image-goal planning is implemented using model-predictive control: candidate action sequences are rolled out through the action-conditioned predictor, the distance between each resulting predicted latent and the encoded goal image is used as the planning cost, and the cross-entropy method refines the action distribution toward low-cost sequences. ^[1]
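The planning loop described above can be illustrated with a toy version of the cross-entropy method. Everything in this sketch is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for the learned action-conditioned predictor (it simply adds the action to the latent), latents are two-dimensional lists, the cost is an L1 distance to the goal latent, and the sample counts are arbitrary.

```python
import random


def rollout(z0, actions, predictor):
    """Apply the predictor step by step and return the final predicted latent."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z


def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def cem_plan(z0, z_goal, predictor, horizon=3, action_dim=2,
             samples=200, elites=20, iterations=10, seed=0):
    """Cross-entropy method: sample action sequences, keep the best, refit."""
    rng = random.Random(seed)
    mean = [[0.0] * action_dim for _ in range(horizon)]
    std = [[1.0] * action_dim for _ in range(horizon)]
    for _ in range(iterations):
        candidates = []
        for _ in range(samples):
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(action_dim)]
                   for t in range(horizon)]
            cost = l1(rollout(z0, seq, predictor), z_goal)
            candidates.append((cost, seq))
        candidates.sort(key=lambda c: c[0])
        best = [seq for _, seq in candidates[:elites]]
        # Refit an independent Gaussian per time step and action dimension.
        for t in range(horizon):
            for d in range(action_dim):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    return mean


def toy_predictor(z, a):
    """Hypothetical latent dynamics: the next latent is the latent plus the action."""
    return [zi + ai for zi, ai in zip(z, a)]


start, goal = [0.0, 0.0], [1.0, -2.0]
plan = cem_plan(start, goal, toy_predictor)
final = rollout(start, plan, toy_predictor)
```

In a receding-horizon controller of this kind, typically only the first action of the returned plan would be executed before the scene is re-encoded and planning is repeated.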

How was V-JEPA 2 trained?

V-JEPA 2 pretraining proceeds in a single self-supervised stage on the combined video-and-image corpus. The paper states the model is "pre-trained on a video and image dataset comprising over 1 million hours of internet video," with the precise composition not fully disclosed but spanning diverse domains and camera viewpoints. ^[1] During pretraining the model is given a tokenized video clip in which a substantial fraction of the patches has been masked, and it is trained to predict the embeddings of those masked tubes. Crucially, the loss is computed in feature space against the target encoder rather than against the raw pixel values, which the team has repeatedly argued is essential for learning representations that are robust to nuisance variation. ^[6]
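To make the idea of a masked spatiotemporal tube concrete, the sketch below masks one spatial block and extends it across every time step of a small token grid. It is a simplified assumption rather than the published masking scheme: the function name, grid sizes, and single-block policy are all hypothetical, and the actual recipe may combine several blocks with different ratios.

```python
import random


def tube_mask(t_tokens, h_tokens, w_tokens, block_h, block_w, seed=0):
    """Return (visible, masked) flat token indices for one tube mask.

    A tube is a spatial block repeated at the same location in every
    time step, so the model cannot recover it by copying from a
    neighbouring frame.
    """
    rng = random.Random(seed)
    top = rng.randrange(h_tokens - block_h + 1)
    left = rng.randrange(w_tokens - block_w + 1)
    visible, masked = [], []
    for t in range(t_tokens):
        for y in range(h_tokens):
            for x in range(w_tokens):
                index = (t * h_tokens + y) * w_tokens + x
                inside = top <= y < top + block_h and left <= x < left + block_w
                (masked if inside else visible).append(index)
    return visible, masked


# A 4 x 6 x 6 grid of patch tokens with a 4 x 4 spatial block masked.
visible, masked = tube_mask(t_tokens=4, h_tokens=6, w_tokens=6, block_h=4, block_w=4)
```

The encoder would see only the `visible` tokens, while the predictor would be asked for embeddings at the `masked` positions.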

The action-conditioned post-training stage for V-JEPA 2-AC uses less than 62 hours of unlabeled video drawn from the Droid dataset, the open-source teleoperation corpus collected from a fleet of Franka Panda arms across many institutions. ^[1] "Unlabeled" here means that the post-training does not depend on natural language task descriptions or reward labels, only on the synchronized pairing of robot proprioception and video. The total compute and wall-clock cost of the action-conditioned post-training is small compared to internet-scale pretraining, reflecting the JEPA philosophy that most knowledge about the physical world should already be present in the encoder after large-scale passive observation.

For downstream evaluations the team uses two main protocols. The first freezes the encoder and trains a lightweight attentive read-out probe for classification or anticipation tasks. The second aligns the frozen encoder with a large language model through a projection layer and a short instruction-tuning phase to support video question answering. ^[1]
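The first protocol, a frozen encoder with an attentive read-out, can be sketched as a single learned query that attends over the encoder's output tokens, followed by a linear classifier. This is a minimal single-head illustration under stated assumptions: the names `attentive_pool` and `linear_head`, the toy token values, and the one-layer design are hypothetical, and the probe in the paper may be deeper.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(tokens, query):
    """Pool a set of token embeddings with one learned query (single head)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def linear_head(pooled, class_weights):
    """Map the pooled embedding to one logit per class."""
    return [sum(w * p for w, p in zip(row, pooled)) for row in class_weights]


# Toy frozen-encoder outputs: three tokens with two features each.
frozen_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]                        # trainable in a real probe
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # trainable in a real probe

pooled, weights = attentive_pool(frozen_tokens, query)
logits = linear_head(pooled, class_weights)
```

Only `query` and `class_weights` would receive gradients during probe training; the token embeddings come from the frozen encoder and are never updated.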

How well does V-JEPA 2 perform on benchmarks?

V-JEPA 2 was evaluated across a wide suite of established video understanding benchmarks. Two of the headline results sit on Something-Something v2, which tests fine-grained motion understanding, and Epic-Kitchens-100, which tests action anticipation one second into the future. The paper reports that V-JEPA 2 "achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models." ^[1] Aligned with a large language model, it also sets state-of-the-art video question-answering numbers at the 8 billion parameter scale. ^[1]

Headline V-JEPA 2 benchmark scores

Benchmark	Capability tested	V-JEPA 2 score	Notes
Something-Something v2	Fine-grained motion classification	77.3 top-1 accuracy	Frozen encoder with attentive read-out
Epic-Kitchens-100	Action anticipation at one second horizon	39.7 recall at 5	New state of the art; surpasses previous task-specific models
Perception Test	Multimodal video question answering	84.0	Frozen encoder aligned with a large language model (8B scale)
TempCompass	Temporal reasoning in video QA	76.9	Frozen encoder aligned with a large language model (8B scale)
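For readers unfamiliar with the recall-at-5 metric in the table, the sketch below computes recall at k in two ways: averaged over samples and averaged over classes. Epic-Kitchens-style anticipation results are commonly reported as a class-mean recall, but whether the 39.7 figure uses exactly this averaging should be checked against the paper. The function names and toy scores are illustrative assumptions, and k=2 is used only to keep the example small.

```python
from collections import defaultdict


def top_k(scores, k):
    """Indices of the k highest scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(score_rows, labels, k=5):
    """Fraction of samples whose true label is among the top-k predictions."""
    hits = sum(1 for scores, label in zip(score_rows, labels)
               if label in top_k(scores, k))
    return hits / len(labels)


def class_mean_recall_at_k(score_rows, labels, k=5):
    """Recall at k computed per class, then averaged over classes."""
    per_class = defaultdict(lambda: [0, 0])  # label -> [hits, count]
    for scores, label in zip(score_rows, labels):
        per_class[label][1] += 1
        if label in top_k(scores, k):
            per_class[label][0] += 1
    return sum(h / n for h, n in per_class.values()) / len(per_class)


# Three samples scored over four classes (toy numbers).
rows = [
    [0.1, 0.5, 0.3, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
labels = [2, 3, 2]

sample_recall = recall_at_k(rows, labels, k=2)
class_recall = class_mean_recall_at_k(rows, labels, k=2)
```

In this toy case the sample-averaged value is 2/3 and the class-averaged value is 0.5, because the rare class 3 is missed and counts as heavily as the frequent class 2.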

Meta also introduced three new benchmarks alongside V-JEPA 2 to probe physical understanding more rigorously, describing them as "three new benchmarks to help the research community evaluate how well their existing models learn and reason about the world using video." ^[2] IntPhys 2 evaluates the ability of a model to detect physically implausible events, a task on which human raters reach approximately 85 to 95 percent accuracy. ^[2] MVPBench, or Minimal Video Pairs Bench, pairs each question with two nearly identical clips for which the correct answers differ, removing many of the surface-level shortcuts that video question-answering models can otherwise exploit. ^[2] CausalVQA targets cause-and-effect reasoning, counterfactual questions, and short-horizon anticipation. ^[2] On these new benchmarks V-JEPA 2 generally leads other open video foundation models but still trails human performance by a wide margin, which the authors argue is exactly the headroom that motivates further research on world modeling.
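The reason minimal pairs remove shortcuts can be shown with a small scoring sketch. It assumes a paired-accuracy rule, under which a pair counts only if both of its clips are answered correctly; this rule is a common convention for minimal-pair benchmarks and is an assumption here, as are the function names and toy answers.

```python
def single_accuracy(pairs):
    """Ordinary accuracy over every (prediction, gold) item."""
    items = [item for pair in pairs for item in pair]
    return sum(1 for pred, gold in items if pred == gold) / len(items)


def paired_accuracy(pairs):
    """A pair is correct only if both of its items are answered correctly."""
    correct = sum(1 for (pa, ga), (pb, gb) in pairs if pa == ga and pb == gb)
    return correct / len(pairs)


# A hypothetical model that ignores the video and always answers "yes".
# Each pair shares a question, but the two clips have opposite gold answers.
always_yes = [
    (("yes", "yes"), ("yes", "no")),
    (("yes", "no"), ("yes", "yes")),
    (("yes", "yes"), ("yes", "no")),
]

plain = single_accuracy(always_yes)
paired = paired_accuracy(always_yes)
```

The video-blind model reaches 50 percent ordinary accuracy but 0 percent paired accuracy, which is the kind of shortcut the paired design is meant to expose.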

How well does V-JEPA 2-AC control real robots?

The action-conditioned variant was deployed zero-shot on Franka Panda arms in two different laboratories, neither of which contributed data to the encoder or predictor training. ^[1] Tasks were specified as a single goal image showing the desired final scene; the controller used model-predictive control through the latent predictor to pick action sequences, with planning taking roughly 16 seconds per action step. ^[2] Reported success rates run from 100 percent on simple reach tasks down to the 65 to 80 percent range on grasp and pick-and-place tasks across the two labs. ^[1]^[2] The results are notable because the policy is goal-conditioned rather than language-conditioned, has never seen the specific robots or objects in question, and never receives a reward signal. The paper underscores that this "is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward." ^[1]

How does V-JEPA 2 differ from other world models?

V-JEPA 2 arrived in the middle of a broader shift among large research labs toward what are loosely called world foundation models. These systems share the goal of letting agents predict and plan in a learned model of the physical world, but they differ sharply in what they actually predict and how they are intended to be used. The table below summarizes how V-JEPA 2 compares to two prominent contemporaries, Google DeepMind's Genie line and NVIDIA's Cosmos suite.

Comparison with other 2025 world models

System	Developer	Architecture style	What it predicts	Primary use case	Approximate scale	Openness
V-JEPA 2	Meta AI	Joint embedding predictive (non-generative)	Future latent representations of video	Understanding, prediction, robot planning	1.2B parameters; >1M hours video pretraining	Open weights and code under MIT
Genie 3	Google DeepMind	Autoregressive generative simulator	Pixel-level next frames conditioned on actions	Interactive, navigable simulated worlds	Reported in the eleven billion parameter range	Closed; access via research preview
NVIDIA Cosmos	NVIDIA	Generative world foundation model family (Predict, Transfer, Reason)	Pixel-level video and physics-aware synthetic data	Simulation, synthetic data for robotics and AV	7B and 14B parameter variants	Open weights with permissive license

Meta's blog post on V-JEPA 2 reports that the system is roughly 30 times faster than NVIDIA Cosmos on physical reasoning tasks, which is consistent with the architectural difference: V-JEPA 2 predicts a small set of latent features per frame, whereas pixel-generative systems must produce a full image. ^[2] The trade-off is that V-JEPA 2 cannot be used to render a video for a human viewer; its outputs are only meaningful when fed back into the planner or into a downstream probe. World Labs, founded in 2024 to build 3D world models, occupies yet another point in the design space, focusing on generating navigable 3D scenes from images rather than predicting video futures.

The broader debate that V-JEPA 2 is intended to advance, and that LeCun has championed publicly for years, is whether the next generation of foundation models should be generative or predictive in latent space. V-JEPA 2's results are widely cited as the strongest existing evidence that the latent predictive route can scale.

Is V-JEPA 2 open source?

Meta released V-JEPA 2 through the facebookresearch/vjepa2 repository on GitHub. ^[4] The majority of the code is licensed under the MIT License, with a small number of utility files under Apache 2.0. Pretrained encoder weights for all model sizes are available, along with the action-conditioned predictor checkpoint trained from the ViT-g encoder. The release also includes evaluation probes for Something-Something v2, Diving48, and Epic-Kitchens-100, plus integration with PyTorch Hub and a companion Hugging Face collection. ^[4]

In a follow-up release Meta also published V-JEPA 2.1, a refinement of the same pretraining recipe that broadens the lineup of available encoder sizes to include a ViT-B/16 with 80 million parameters at 384 resolution as well as an even larger ViT-G/16 with two billion parameters at 384 resolution. V-JEPA 2.1 is described as unlocking denser features for downstream tasks while keeping the same overall self-supervised objective. ^[4]

Meta also published a public Hugging Face leaderboard for the newly introduced IntPhys 2, MVPBench, and CausalVQA benchmarks. The intent, as expressed in the original blog post, is to encourage the community to develop and openly evaluate world models on tasks that go beyond conventional video classification. ^[2]

Why does V-JEPA 2 matter for embodied AI?

For the embodied AI community, the most significant aspect of V-JEPA 2 is the V-JEPA 2-AC demonstration. Prior work on learned robot policies typically required tens of thousands of robot trajectories collected on the specific embodiment and in similar environments, or relied on heavily engineered simulation pipelines. V-JEPA 2-AC reuses representations learned from internet video, post-trains for a short time on a publicly available teleoperation corpus, and is then deployed on robots in unfamiliar laboratories without any additional data collection. ^[1] This pattern, sometimes referred to as zero-shot embodied transfer, is widely seen as a prerequisite for scaling general-purpose robot intelligence in the same way that internet-scale pretraining scaled language models.

The results also reframe the role of simulation in robotics. Generative world models such as Genie and Cosmos are positioned as ways to synthesize training data for downstream controllers, effectively replacing or augmenting classical simulators. V-JEPA 2 instead skips the generation step and uses the learned predictor directly inside the control loop. Both approaches are likely to coexist, and several research groups have begun exploring hybrid systems that use pixel-generative world models for data augmentation and JEPA-style latent predictors for closed-loop control.

Open questions remain. V-JEPA 2-AC is currently evaluated on tabletop manipulation with a relatively short planning horizon and a single arm; it has not yet been demonstrated on dexterous bimanual tasks, mobile manipulation, or long-horizon plans that require explicit task decomposition. The benchmarks introduced alongside the model also reveal substantial gaps between V-JEPA 2 and human performance on physical plausibility detection and causal reasoning, suggesting that the JEPA recipe will need to scale further or be combined with more structured objectives before it can match human-level physical intuition. ^[2]

Reception

Reception in the AI research and trade press emphasized three points. First, the open-source release with permissive licensing was contrasted favorably with the closed nature of contemporaneous world models from Google DeepMind and Wayve. ^[9] Second, the zero-shot robot demonstrations were treated as the most impressive evidence to date for LeCun's thesis that latent-space prediction is sufficient for real-world control. ^[11] Third, observers noted that V-JEPA 2 is one of the first foundation models from a major industrial lab whose primary downstream target is embodied AI rather than chatbots or content generation, and that its release accelerates a shift in industrial AI investment toward physical world modeling. ^[10]

The paper has been cited as the canonical reference for scaling JEPA-style training to internet-scale video, and the Hugging Face leaderboard for IntPhys 2, MVPBench, and CausalVQA has attracted submissions from groups working on competing architectures, including pixel-generative systems and language-model-centric video agents.

References

1. Assran, M. et al. (2025). *V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning*. arXiv:2506.09985. https://arxiv.org/abs/2506.09985
2. Meta AI (June 11, 2025). "Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning." https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
3. Meta Newsroom (June 11, 2025). "Our new model helps AI think before it acts." https://about.fb.com/news/2025/06/our-new-model-helps-ai-think-before-it-acts/
4. facebookresearch (2025). vjepa2 GitHub repository. https://github.com/facebookresearch/vjepa2
5. Meta AI Research. "Introducing V-JEPA 2." https://ai.meta.com/research/vjepa/
6. LeCun, Y. (2022). *A Path Towards Autonomous Machine Intelligence*. OpenReview. https://openreview.net/forum?id=BZ5a1r-kVsf
7. Meta AI (June 2023). "I-JEPA: The first AI model based on Yann LeCun's vision for more human-like AI." https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
8. Meta AI (February 2024). "V-JEPA: The next step toward advanced machine intelligence." https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
9. TechCrunch (June 11, 2025). "Meta's V-JEPA 2 model teaches AI to understand its surroundings." https://techcrunch.com/2025/06/11/metas-v-jepa-2-model-teaches-ai-to-understand-its-surroundings/
10. MarkTechPost (June 12, 2025). "Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning." https://www.marktechpost.com/2025/06/12/meta-ai-releases-v-jepa-2-open-source-self-supervised-world-models-for-understanding-prediction-and-planning/
11. The Robot Report (June 2025). "Meta V-JEPA 2 world model uses raw video to train robots." https://www.therobotreport.com/meta-v-jepa-2-world-model-uses-raw-video-train-robots/
