Joint Embedding Predictive Architecture
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,876 words
Joint Embedding Predictive Architecture (JEPA) is a family of self-supervised, non-generative neural network architectures proposed by Yann LeCun in his June 2022 position paper A Path Towards Autonomous Machine Intelligence. Unlike generative models that reconstruct raw pixels, tokens, or audio samples, a JEPA learns by predicting the abstract latent representation of one part of an input from the representation of another part, with prediction performed in an embedding space produced by a learned encoder rather than in the original signal space.[1][2] LeCun positioned JEPA as the centerpiece of a longer-term roadmap toward systems capable of physical reasoning, planning, and human-like learning efficiency, explicitly as an alternative to the autoregressive large language model paradigm.[1][3]
The architecture's core hypothesis is that pixel- or token-level reconstruction wastes capacity on unpredictable details (the precise texture of grass, the exact phrasing of a paraphrase) and that a model forced to predict in a learned latent space will discard such irrelevant information and retain semantic structure.[2][4] Since the original 2022 proposal, Meta AI researchers have released a succession of concrete JEPA instantiations: I-JEPA for images (Assran et al., CVPR 2023), V-JEPA for video (Bardes et al., February 2024), V-JEPA 2 as a 1.2-billion-parameter video world model with a robotics-capable variant (Assran et al., June 2025), and LeJEPA as a provably collapse-free training recipe (Balestriero and LeCun, November 2025).[5][6][7][8] After leaving Meta in November 2025, LeCun founded the Paris-based startup AMI Labs (Advanced Machine Intelligence) to commercialize world-model approaches built around JEPA principles; the company raised roughly USD 1 billion at a USD 3.5 billion valuation in March 2026.[9][10]
The intellectual context for JEPA is LeCun's 2022 position paper A Path Towards Autonomous Machine Intelligence (version 0.9.2, dated 27 June 2022), posted to OpenReview rather than submitted to a peer-reviewed venue.[1] The paper argues that contemporary AI systems lack three capabilities exhibited by humans and animals: learning efficient world models from observation, reasoning and planning over hierarchical time horizons, and acquiring representations that support generalization across tasks.[1] LeCun proposes a modular cognitive architecture consisting of six interconnected components: a configurator that orchestrates the other modules based on task context; a perception module that produces a low-dimensional estimate of the current world state from sensors; a world model that predicts plausible future states given imagined action sequences; a cost module that scores states with respect to intrinsic and task-specific objectives; an actor that proposes action sequences; and a short-term memory that retains world states and associated costs.[11][12]
The world-model module is the architectural locus for JEPA. LeCun argues that the world model must be trained largely by self-supervised observation and must produce predictions at multiple levels of temporal and spatial abstraction so that a planning module can reason about both millisecond-scale motor control and minute-scale or hour-scale task structure.[11] This requirement leads to the proposal of a hierarchical JEPA (H-JEPA), in which JEPA modules are stacked so that representations at higher levels are coarser, more invariant, and more useful for long-horizon planning.[12][1] The paper grounds JEPA in the broader formalism of energy-based models (EBMs), in which a scalar energy function $F(x,y)$ assigns low values to compatible pairs $(x,y)$ and higher values to incompatible ones, with inference performed by minimizing energy with respect to the unobserved variable.[13][4]
A defining feature of the proposal, and a central source of subsequent debate, is its explicit rejection of generative architectures for world modeling. LeCun argues that requiring a model to predict every pixel or token of a high-dimensional future signal is intractable and counterproductive, because much of that signal is genuinely unpredictable and forcing the model to attempt prediction wastes capacity on noise.[2][14] Joint-embedding architectures sidestep this by allowing the predictor to operate in a representation space chosen by the encoder, which can itself learn to discard unpredictable nuisance information.[4]
At its most general, a JEPA consists of three trainable components.[2][4] A context encoder $E_\theta$ maps an observed signal $x$ to a latent representation $s_x = E_\theta(x)$. A target encoder $\bar{E}_\theta$, typically a weight-tied or exponential-moving-average (EMA) copy of the context encoder, maps a target signal $y$ to a representation $s_y = \bar{E}_\theta(y)$. A predictor $P_\phi$, conditioned on $s_x$ and additional information $\Delta_y$ (such as the position of $y$ relative to $x$), produces a predicted latent $\hat{s}_y = P_\phi(s_x, \Delta_y)$. Training minimizes a distance between $\hat{s}_y$ and $\mathrm{sg}(s_y)$, where $\mathrm{sg}$ denotes a stop-gradient operator that blocks gradients from flowing back through the target branch, so that the predictor must learn to forecast the target encoder's output without the trivial collapse of both encoders to a constant.[7][15]
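The three components and the latent-space loss can be sketched in a few lines. This is a minimal illustration, not any released implementation: the linear maps stand in for the transformer encoders and predictor, the sizes `D_IN`, `D_LATENT`, and `D_POS` are hypothetical, and the L1 distance is one possible choice of distance. NumPy has no automatic differentiation, so the stop-gradient appears only as a comment marking where the target would be treated as a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
D_IN, D_LATENT, D_POS = 8, 4, 2

# Toy linear stand-ins for the three components.
W_context = rng.normal(size=(D_IN, D_LATENT))           # context encoder E_theta
W_target = W_context.copy()                             # target encoder, starts as a copy
W_pred = rng.normal(size=(D_LATENT + D_POS, D_LATENT))  # predictor P_phi


def context_encoder(x):
    return x @ W_context


def target_encoder(y):
    return y @ W_target


def predictor(s_x, delta_y):
    # The predictor sees the context latent plus side information about the target.
    return np.concatenate([s_x, delta_y], axis=-1) @ W_pred


def jepa_loss(x, y, delta_y):
    s_x = context_encoder(x)
    s_y = target_encoder(y)  # sg(s_y): in a real framework no gradient flows through this
    s_y_hat = predictor(s_x, delta_y)
    # The loss compares latents only; the raw signal y is never reconstructed.
    return float(np.mean(np.abs(s_y_hat - s_y)))


x = rng.normal(size=(16, D_IN))
y = rng.normal(size=(16, D_IN))
delta_y = rng.normal(size=(16, D_POS))
loss = jepa_loss(x, y, delta_y)
```

In an autodiff framework the target branch would be wrapped in that framework's stop-gradient primitive, so that only the context encoder and predictor receive gradient updates.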
A common option, present in LeCun's original formulation, is to introduce a latent variable $z$ that captures aspects of $y$ that are not determined by $x$ alone, producing the latent-variable JEPA (LV-JEPA). The energy is then defined as $F_w(x,y) = \min_z E_w(x,y,z)$, with $z$ regularized to carry as little information as possible so that the predictor cannot use it as a shortcut and so that varying $z$ enumerates multiple plausible futures.[13][4] This construction enables non-deterministic prediction in the latent space, useful for video futures or for stochastic environments where many continuations of $x$ are equally consistent with the data.[13]
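The minimization over $z$ can be made concrete with a toy scalar example. The additive predictor `s_x + z`, the L1 penalty on `z`, and the grid of candidate values are all assumptions made for illustration; the source specifies only that $z$ is regularized to carry little information.

```python
def energy(s_x, s_y, z, reg=0.5):
    """E(x, y, z): prediction error plus a penalty that limits how much z can explain."""
    prediction = s_x + z  # toy predictor: the latent variable shifts the prediction
    return abs(s_y - prediction) + reg * abs(z)


def free_energy(s_x, s_y, z_candidates, reg=0.5):
    """F(x, y) = min over z of E(x, y, z), evaluated on a finite candidate grid."""
    return min(energy(s_x, s_y, z, reg) for z in z_candidates)


z_grid = [i / 10 for i in range(-20, 21)]  # z in [-2, 2], step 0.1

compatible = free_energy(1.0, 1.2, z_grid)    # s_y is close to what s_x predicts
incompatible = free_energy(1.0, 4.0, z_grid)  # s_y is far from anything z can reach cheaply
unregularized = free_energy(1.0, 2.5, z_grid, reg=0.0)
```

With the penalty in place the compatible pair receives a lower energy than the incompatible one. With `reg=0.0` any target within reach of the grid gets zero energy, which is the shortcut the regularization on $z$ is meant to prevent.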
JEPA is fundamentally different from a generative or reconstructive architecture along two axes. First, prediction is in latent space, not signal space: the loss is a function of $s_y$, never of $y$ itself, so the model never reconstructs raw pixels, tokens, or audio samples.[2][4] Second, the target itself is computed by a learned network: $s_y$ is the output of an encoder whose representation is shaped by the same training process, rather than a hand-specified reconstruction target. This means JEPA training can in principle ignore arbitrary nuisance information by simply absorbing it into the encoder's null space, an option not available to a generative model that must reconstruct every pixel.[15][4]
The most acute design problem in any JEPA is representation collapse: if both encoders simply output a constant, the loss is trivially zero and the model is useless.[13][15] LeCun's 2022 paper outlines two broad families of collapse-prevention strategies. Contrastive methods explicitly raise the energy of negative pairs $(x, y')$ with $y'$ sampled from unrelated data, but suffer from poor scaling with embedding dimension.[13] Regularized methods constrain the encoders or latent variable directly: maximizing the information content of $s_x$ and $s_y$ (so that constant outputs are penalized), minimizing the information content of $z$ (so that the predictor cannot ignore $x$), and decorrelating embedding dimensions to prevent partial collapse.[13][16]
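The regularized family can be illustrated with two simple penalties on a batch of embeddings: one that penalizes low per-dimension variance (full collapse) and one that penalizes correlated dimensions (partial collapse). This is a sketch in the style of variance and covariance regularization, not the exact objective of any paper cited here; the function names, the target standard deviation of 1, and the epsilon are assumptions.

```python
import numpy as np


def variance_penalty(emb, target_std=1.0, eps=1e-4):
    """Hinge penalty that grows when a dimension's standard deviation falls below target."""
    std = np.sqrt(emb.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


def covariance_penalty(emb):
    """Sum of squared off-diagonal covariances, scaled by the embedding dimension."""
    centered = emb - emb.mean(axis=0)
    cov = centered.T @ centered / (emb.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / emb.shape[1])


rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 8))                        # spread out, decorrelated
collapsed = np.ones((512, 8))                              # every input maps to one point
partial = np.repeat(rng.normal(size=(512, 1)), 8, axis=1)  # all dimensions identical

# A collapsed encoder makes the prediction loss trivially zero...
trivial_prediction_loss = float(np.abs(collapsed - collapsed).mean())
# ...but the regularizers flag it.
collapsed_var = variance_penalty(collapsed)
healthy_var = variance_penalty(healthy)
partial_cov = covariance_penalty(partial)
healthy_cov = covariance_penalty(healthy)
```

The constant embedding scores a perfect prediction loss yet a large variance penalty, while the rank-one embedding passes the variance check and is caught only by the covariance term.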
In practice, every JEPA that Meta published before LeJEPA prevented collapse with a combination of (i) stop-gradient on the target branch, (ii) EMA updates of the target encoder's weights as a slowly moving copy of the context encoder, and (iii) careful design of the prediction task such that the predictor must use information from $x$ to perform well. This recipe is mechanically similar to the BYOL and SimSiam family of self-supervised learning methods, as well as to Meta's earlier data2vec; commentators have observed that I-JEPA's update rule is, at the level of the optimization, an L1-regression analogue of BYOL on EMA latents.[4][16] The November 2025 LeJEPA paper later replaces this stack of heuristics with an analytically derived regularizer (see below).[8]
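The EMA update in item (ii) is a one-line rule. The sketch below uses plain lists as stand-ins for weight tensors, and the momentum value 0.996 is an assumed illustrative setting rather than a figure taken from the text.

```python
def ema_update(target_weights, context_weights, momentum=0.996):
    """Move each target weight a small step toward the matching context weight."""
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_weights, context_weights)
    ]


single_step = ema_update([0.0], [1.0], momentum=0.9)  # one step covers 10% of the gap

target = [0.0, 0.0]
context = [1.0, -2.0]  # held fixed here; in training these change every step
for _ in range(1000):
    target = ema_update(target, context)
```

With a fixed context encoder the target closes the gap geometrically, so after `n` steps it has covered a fraction `1 - momentum**n` of the distance. A momentum close to 1 is what makes the target a slowly moving copy.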
The first concrete JEPA implementation was I-JEPA, introduced in the paper Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas, posted to arXiv as 2301.08243 on 19 January 2023 and published at CVPR 2023.[5][17] Meta AI announced the work in a blog post on 13 June 2023, framing it as "the first AI model based on Yann LeCun's vision for more human-like AI".[18]
I-JEPA operates on images split into a grid of non-overlapping patches in the standard Vision Transformer (ViT) style. A context encoder processes a single contiguous block of visible patches; a target encoder (updated by EMA) processes the entire image and produces patch-level latent representations; and a small predictor transformer is conditioned on the context-encoder output plus learnable position-encoded mask tokens for the target locations.[5][17] The model is trained to predict the target encoder's latent representations of several spatially distinct target blocks, each block being a sufficiently large, semantically coherent region of the image (typically 15–20% of patches per block, with about four target blocks per image).[5]
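The block sampling described above can be sketched on a 14×14 patch grid (a 224-pixel image with 16-pixel patches). The 15–20% target scale and the four target blocks follow the text; the aspect-ratio range, the context-block scale of 85–100%, and the step that removes target patches from the context are assumptions made for this illustration.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patch coordinates on a grid x grid layout."""
    scale = rng.uniform(*scale_range)    # fraction of all patches the block should cover
    aspect = rng.uniform(*aspect_range)  # height / width ratio
    area = scale * grid * grid
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)       # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_ijepa_masks(grid=14, num_targets=4, seed=0):
    rng = random.Random(seed)
    targets = [
        sample_block(grid, (0.15, 0.20), (0.75, 1.5), rng) for _ in range(num_targets)
    ]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for block in targets:
        context -= block  # the context encoder must not see any target patch
    return context, targets


context, targets = sample_ijepa_masks()
```

The context encoder would receive only the patches in `context`, while the predictor is asked for the latents at each set of coordinates in `targets`.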
A central empirical finding of the paper is that the masking strategy is the dominant design lever for representation quality. Targets must be large and spatially coherent rather than scattered single patches, so that predicting them requires high-level semantic information rather than local interpolation; and the context block must be spatially distributed and informative, with random patches surrounding the targets removed so that the model cannot exploit short-range continuity to cheat.[5][19] The loss is an L1 (smooth L1) regression between predicted and target latents on the masked positions; no negative samples, contrastive losses, or hand-crafted data augmentations (such as color jitter or random crops used in SimCLR and DINO) are required.[5][18]
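A smooth L1 regression restricted to the target positions can be written as follows. The dictionary layout (patch position mapped to a latent vector) and the threshold `beta=1.0` are assumptions for illustration; the piecewise form is the standard smooth L1 definition.

```python
def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_latent_loss(pred_latents, target_latents, target_positions):
    """Average smooth L1 over the latent dimensions of the target positions only."""
    total, count = 0.0, 0
    for pos in target_positions:
        for p, t in zip(pred_latents[pos], target_latents[pos]):
            total += smooth_l1(p, t)
            count += 1
    return total / count


predicted = {0: [0.0, 0.0], 1: [1.0, 3.0]}
encoded = {0: [5.0, 5.0], 1: [1.5, 1.0]}  # outputs of the target encoder
loss = masked_latent_loss(predicted, encoded, target_positions=[1])
```

Position 0 is outside the target set, so its large error does not contribute. The two errors at position 1 (0.5 and 2.0) fall on the quadratic and linear branches respectively.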
The largest I-JEPA model in the paper is a ViT-Huge/14 (632M parameters) pretrained on ImageNet-1K for 300 epochs. Meta AI reported training this model on 16 NVIDIA A100 GPUs in under 72 hours, requiring fewer than 1,200 GPU-hours of pretraining, more than 2.5× faster than a ViT-S/16 pretrained with iBOT and more than 10× more efficient than a ViT-H/14 pretrained with a masked autoencoder (MAE) baseline.[5][18] On the standard low-shot ImageNet evaluation with only 1% of labels (roughly 12 labeled images per class), I-JEPA outperformed pixel-reconstruction methods including MAE and joint-embedding methods such as DINO, iBOT, and MSN that rely on hand-crafted augmentations; with higher input resolution, I-JEPA also led on full linear-probe and semi-supervised ImageNet evaluations.[5][18][19] Downstream evaluations on object counting (Clevr/Count) and depth prediction (Clevr/Dist) showed particularly large gains over invariance-based methods such as DINO, which the authors attributed to I-JEPA preserving local structure that invariance objectives discard.[5]
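The compute and label-budget figures above are mutually consistent, as a short arithmetic check shows. The ImageNet-1K training-set size of 1,281,167 images is not stated in the text and is supplied here as an assumption.

```python
NUM_GPUS = 16
MAX_WALL_CLOCK_HOURS = 72

# Upper bound on pretraining cost implied by "16 GPUs in under 72 hours".
gpu_hours_upper_bound = NUM_GPUS * MAX_WALL_CLOCK_HOURS

IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # assumed size of the ImageNet-1K training set
NUM_CLASSES = 1000
LABEL_FRACTION = 0.01

# Average number of labeled images per class in the 1% low-shot setting.
labels_per_class = LABEL_FRACTION * IMAGENET_1K_TRAIN_IMAGES / NUM_CLASSES
```

The bound works out to 1,152 GPU-hours, which sits under the quoted 1,200, and the low-shot setting averages a little under 13 labeled images per class.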
The official I-JEPA codebase and four pretrained checkpoints, ViT-H/14 and ViT-H/16 trained on ImageNet-1K, plus ViT-H/14 and ViT-g/16 trained on ImageNet-22K, were released on GitHub under Meta's standard research license. The repository was archived in read-only mode on 1 August 2024.[20]
On 15 February 2024, Meta AI announced V-JEPA alongside a paper titled Revisiting Feature Prediction for Learning Visual Representations from Video by Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas (arXiv 2404.08471).[6][21] V-JEPA extended the I-JEPA recipe to spatiotemporal video patches and is pretrained as a feature-prediction model: it uses no pretrained image encoders, no text supervision, no negative pairs, no augmentations, and no pixel reconstruction; the loss is purely the prediction of one set of latent video features from another.[6]
V-JEPA tokenizes video into 16×16×2 spatiotemporal patches (16-pixel spatial extent, 2-frame temporal extent) processed by a ViT context encoder; the predictor is a narrow 12-block transformer with embedding dimension 384, conditioned on the context features and on learnable mask tokens with positional embeddings indicating the spatiotemporal locations of targets.[6][21] The masking strategy is multi-block in space and tube-shaped in time: every frame within a clip uses the same spatial mask, producing a contiguous spatiotemporal "tube" of masked features. The paper uses two variants, short-range masks covering roughly 15% of each frame across 8 blocks, and long-range masks covering roughly 70% per frame across 2 blocks, yielding an effective mask of approximately 90% of the video.[21] The authors ablate against random-tube and causal masking strategies and find multi-block superior.[21]
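The tokenization and tube masking can be sketched with integer arithmetic. The clip size (16 frames at 224×224) and the particular spatial block are hypothetical values chosen for illustration; only the 16×16×2 patch shape comes from the text.

```python
def num_tokens(frames, height, width, patch=16, tubelet=2):
    """Number of spatiotemporal tokens produced by patch x patch x tubelet tokenization."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def tube_mask(spatial_mask, temporal_tokens):
    """Repeat one spatial mask at every temporal index, forming a tube through time."""
    return {(t, r, c) for t in range(temporal_tokens) for (r, c) in spatial_mask}


total = num_tokens(frames=16, height=224, width=224)  # 8 time steps x 14 x 14 patches
temporal_tokens = 16 // 2

# Hypothetical 8 x 8 spatial block on the 14 x 14 patch grid.
spatial_block = {(r, c) for r in range(2, 10) for c in range(3, 11)}
masked = tube_mask(spatial_block, temporal_tokens)
mask_ratio = len(masked) / total
```

Because the same spatial block is hidden at every time step, the model cannot recover a masked region by copying it from a neighbouring frame.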
V-JEPA models are pretrained on VideoMix2M, a dataset of approximately 2 million videos assembled from three public corpora: Kinetics-710 (action recognition), Something-Something v2 (fine-grained temporal interactions), and HowTo100M (instructional clips). Overlaps with the validation sets of Kinetics-400 and Something-Something v2 were removed before pretraining.[21] The loss is an L1 latent-regression loss against EMA-encoded targets, mirroring I-JEPA.[21]
The largest reported V-JEPA model is a ViT-H/16 trained on video only. Evaluated with a frozen backbone and a lightweight attentive probe (that is, without fine-tuning any of the pretrained parameters), the model achieved 81.9% top-1 accuracy on Kinetics-400 action recognition, 72.2% on Something-Something v2, and 77.9% on ImageNet-1K (despite never having been trained on still images).[6][21] On a Something-Something v2 frozen evaluation, V-JEPA reached 69.5% versus 65.5% for VideoMAE, and on a K400 frozen evaluation it scored 80.8% versus 65.6% for OmniMAE, while consuming substantially fewer pretraining samples.[21] Meta's announcement summarized the practical advantage of feature prediction as a 1.5× to 6× improvement in training and sample efficiency over comparable pixel-reconstruction or contrastive baselines.[6]
Like I-JEPA, V-JEPA was released as open code and weights, although the original release was distributed under a Creative Commons CC BY-NC license that restricted commercial use, with later releases (notably V-JEPA 2) relicensed under MIT.[22]
On 11 June 2025, Meta AI released V-JEPA 2, described as the company's first video-trained world model, alongside an arXiv paper (2506.09985) titled V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning led by Mido (Mahmoud) Assran with twenty-nine co-authors including Yann LeCun, Nicolas Ballas, and Michael Rabbat.[7][23] V-JEPA 2 is a 1.2-billion-parameter model that extends the V-JEPA recipe to a far larger pretraining corpus and adds an action-conditioned post-training phase intended for robotic control.[23][7]
The first stage of V-JEPA 2 is purely action-free self-supervised pretraining on a video and image dataset containing more than 1 million hours of internet video and 1 million images, drawn from publicly available sources.[7][23] The model uses the same masked latent-prediction objective and ViT-based encoder/predictor as V-JEPA, scaled up substantially in parameter count and data. After pretraining, the model reached 77.3% top-1 accuracy on Something-Something v2 (a motion-centric understanding benchmark) with frozen evaluation, and 39.7 recall-at-5 on Epic-Kitchens-100 egocentric action anticipation, surpassing prior task-specific models.[23][24] When the frozen V-JEPA 2 backbone was aligned with a large language model, the resulting 8B-parameter system reached 84.0 on PerceptionTest and 76.9 on TempCompass, then-state-of-the-art video question-answering scores at that scale.[23] On ImageNet, V-JEPA 2 reached 84.6%, reported as a 4.6-point improvement over the original V-JEPA; that margin implies an 80.0% baseline for V-JEPA in this comparison, rather than the 77.9% frozen-probe figure from the original V-JEPA paper.[25]
The second stage post-trains the model into V-JEPA 2-AC (action-conditioned), a latent world model that predicts future latent states given a candidate action. The post-training uses less than 62 hours of unlabeled robot videos drawn from the publicly available DROID dataset.[23][7] At inference, the model can be combined with model-predictive control: given a current observation and a goal image, V-JEPA 2-AC searches over action sequences in the latent space to minimize the predicted distance to the goal latent, then executes the first action and replans.[7][26]
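The plan, execute, replan loop can be sketched with a toy world model. Everything specific here is an assumption for illustration: the additive `world_model`, the random-shooting search (the text does not say which optimizer is used), the L1 distance to the goal latent, the horizon, and the candidate count.

```python
import numpy as np


def world_model(latent, action):
    """Hypothetical action-conditioned predictor: next latent from (latent, action)."""
    return latent + 0.5 * action


def plan(latent, goal_latent, rng, horizon=2, num_candidates=256):
    """Random-shooting search; returns the best action sequence and its predicted cost."""
    dim = latent.shape[0]
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, dim))
    candidates[0] = 0.0  # always include "do nothing" as a candidate
    best_seq, best_cost = None, np.inf
    for seq in candidates:
        state = latent
        for action in seq:  # roll the world model forward in latent space
            state = world_model(state, action)
        cost = float(np.abs(state - goal_latent).sum())  # distance to the goal latent
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


rng = np.random.default_rng(0)
goal = np.array([2.0, -1.0])
state = np.zeros(2)
initial_distance = float(np.abs(state - goal).sum())

first_plan, first_cost = plan(state, goal, rng)
for _ in range(20):
    seq, predicted_cost = plan(state, goal, rng)
    # Execute only the first action, then replan from the new state.
    # In this toy the environment is identical to the world model.
    state = world_model(state, seq[0])
final_distance = float(np.abs(state - goal).sum())
```

In a real system `latent` and `goal_latent` would come from encoding the current observation and the goal image, and the executed action would be sent to the robot rather than fed back through the model.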
Meta evaluated V-JEPA 2-AC on Franka Emika robot arms in two physical lab environments that were not present in the DROID training data, and reported success rates of approximately 65% to 80% for pick-and-place tasks involving previously unseen objects, all without any task-specific reward engineering, demonstration collection, or fine-tuning on the deployment robots.[7][26] Meta described this as the first time a JEPA-style world model had achieved zero-shot transfer to new robotic embodiments.[7]
Alongside V-JEPA 2, Meta released three new public benchmarks aimed at measuring physical-world reasoning in video models, accompanied by a Hugging Face leaderboard.[7][26] IntPhys 2 generates pairs of synthetic videos that are identical up to a critical frame, after which one video contains a physics-violating event (object pass-through, gravity inversion, etc.) and the other does not; models must identify the violation, mirroring the "violation-of-expectation" paradigm used in developmental cognitive science to study how infants acquire intuitive physics.[7] Minimal Video Pairs (MVPBench) poses multiple-choice physical-understanding questions in matched-pair form designed to defeat the textual and visual shortcuts on which prior video-language models had been shown to rely.[7] CausalVQA evaluates causal counterfactuals ("what would have happened if…"), anticipation ("what might happen next"), and planning ("what action would achieve this goal").[7] Meta reported that top contemporary models, including V-JEPA 2, performed substantially below humans (who scored 85–95%) on all three benchmarks, framing the gap as an open research direction rather than a solved problem.[7]
V-JEPA 2 code and weights were published on GitHub at facebookresearch/vjepa2 under the MIT license (with limited components under CC BY-NC), making it freely usable for commercial applications, a notable contrast to the non-commercial license of the original V-JEPA.[22]
In November 2025, Randall Balestriero and Yann LeCun posted LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics (arXiv 2511.08544, v1 dated 11 November 2025).[8][27] The paper is widely described as LeCun's final research output during his tenure at Meta, posted shortly before his confirmed departure in mid-November 2025.[27][9]
LeJEPA presents a theoretical analysis arguing that the optimal latent distribution for a JEPA, that is, the distribution of embeddings that minimizes downstream prediction risk under broad assumptions, is the isotropic Gaussian. The authors then introduce Sketched Isotropic Gaussian Regularization (SIGReg), a regularizer that explicitly drives the encoder's output distribution toward an isotropic Gaussian via random one-dimensional projections, and combine it with a simple predictive loss.[8][27] The resulting training recipe replaces the stack of heuristics that prior JEPAs relied on: SIGReg eliminates the need for stop-gradient, teacher–student EMA, predictor-only updates, and elaborate learning-rate schedules, while provably preventing collapse by construction. The full method is reported to require approximately 50 lines of code, has linear time and memory complexity in the batch dimension, and produces stable training across architectures (ResNets, ViTs, ConvNets) and modalities.[8][28]
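The idea of testing Gaussianity along random one-dimensional projections can be sketched as below. This is not the paper's SIGReg statistic: the characteristic-function discrepancy, the frequency grid, and the number of directions are choices made here for illustration, and the function name is hypothetical.

```python
import numpy as np


def sketched_gaussian_penalty(emb, num_directions=64, rng=None):
    """Average mismatch between projected embeddings and N(0, 1) along random directions."""
    rng = np.random.default_rng(0) if rng is None else rng
    dim = emb.shape[1]
    directions = rng.normal(size=(dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)  # unit vectors
    projected = emb @ directions                      # shape (batch, num_directions)
    t = np.linspace(-3.0, 3.0, 17)                    # frequencies at which to compare
    phase = projected[:, :, None] * t[None, None, :]  # shape (batch, directions, freqs)
    # Empirical characteristic function: the batch mean of exp(i * t * u).
    ecf_real = np.cos(phase).mean(axis=0)
    ecf_imag = np.sin(phase).mean(axis=0)
    gaussian_cf = np.exp(-0.5 * t ** 2)               # characteristic function of N(0, 1)
    discrepancy = (ecf_real - gaussian_cf) ** 2 + ecf_imag ** 2
    return float(discrepancy.mean())


rng = np.random.default_rng(1)
isotropic = rng.normal(size=(2048, 16))
collapsed = np.zeros((2048, 16))
stretched = isotropic * np.array([5.0] + [1.0] * 15)  # one dimension dominates

penalty_isotropic = sketched_gaussian_penalty(isotropic)
penalty_collapsed = sketched_gaussian_penalty(collapsed)
penalty_stretched = sketched_gaussian_penalty(stretched)
```

Each projection costs one matrix product and a batch mean, so the work grows linearly with batch size. A collapsed or anisotropic embedding scores a visibly larger penalty than an isotropic Gaussian one, which is the property a regularizer of this kind relies on.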
Empirically, LeJEPA achieved 79% top-1 ImageNet-1K linear-probe accuracy with a ViT-H/14, and the authors reported stable training of ViT-g models with 1.8 billion parameters without any of the conventional self-supervised heuristics.[8] Commentators interpreted the paper as both a mathematical capstone to LeCun's roughly twelve-year tenure at Meta and a blueprint for what his subsequent startup would build.[27][28]
The hierarchical JEPA (H-JEPA) is central to LeCun's 2022 program but, as of mid-2026, remains mostly aspirational. The proposal is to stack JEPA modules so that each higher level takes the lower level's representations as input and learns to predict at progressively coarser temporal and spatial scales: a low-level JEPA might predict the next 200 milliseconds of visual input in feature space, while a higher level might predict the next minute of action-space outcomes.[1][11] In LeCun's framing, this hierarchy is what enables a planning module to reason about action sequences of arbitrary horizon by performing gradient-based optimization at an appropriate level of the hierarchy.[1] V-JEPA 2's two-stage training (action-free representation learning followed by action-conditioned planning) is the closest concrete realization to date, but it is a two-level system rather than a deep hierarchy, and Meta has not publicly demonstrated the long-horizon hierarchical planning that the 2022 paper sketched.[7][23]
LeCun's broader vision also positions JEPA as the perceptual and predictive substrate beneath a richer agent architecture that includes the configurator, cost, actor, and short-term memory modules described above. The 2022 paper proposes that this combined system would be trained by self-supervised observation rather than reinforcement learning or imitation, with intrinsic motivation supplied by the cost module so that the agent autonomously seeks informative experiences. As of 2026, no public system implements the full six-module architecture; JEPA is the only piece for which there are large-scale empirical demonstrations.[11][14]
The JEPA literature explicitly positions itself against three other families of self-supervised learning.[29][30]
Versus pixel-reconstruction (MAE and variants). A masked autoencoder masks a fraction of input patches and trains a decoder to reconstruct the raw pixels of the masked regions. JEPA instead predicts the latent representations of the masked regions, never seeing or reconstructing pixels.[29] Empirically, this lets I-JEPA outperform MAE on linear-probe and low-shot ImageNet evaluations while using roughly 10× fewer GPU-hours of pretraining, which the authors attribute to capacity being spent on semantic rather than pixel-level prediction.[5][18]
Versus invariance-based contrastive methods (SimCLR, DINO, MoCo). Contrastive methods take two augmented views of the same image and pull their embeddings together while pushing embeddings of other images apart; DINO uses self-distillation with multi-crop augmentation. These methods produce strong global representations but rely heavily on hand-crafted augmentations (color jitter, cropping) that bake in assumptions about what should be invariant.[30] JEPA dispenses with augmentations entirely: the only "view" transformation is the choice of which patches go to the context encoder versus the target encoder. As a result, JEPA tends to preserve local structure that DINO and SimCLR collapse, leading to gains on object-counting and depth-estimation tasks.[5]
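For contrast with JEPA's latent regression, the pull-together, push-apart objective of contrastive methods can be sketched as an InfoNCE-style loss. This is a generic illustration rather than the exact loss of SimCLR, MoCo, or DINO; the temperature, batch size, and synthetic "views" are assumptions.

```python
import numpy as np


def info_nce(view_a, view_b, temperature=0.1):
    """Row i of view_a should match row i of view_b and no other row in the batch."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # cosine similarity of every pair
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives sit on the diagonal


rng = np.random.default_rng(0)
anchors = rng.normal(size=(32, 16))
positives = anchors + 0.05 * rng.normal(size=(32, 16))  # stand-in for an augmented view
unrelated = rng.normal(size=(32, 16))                   # embeddings of other images

aligned_loss = info_nce(anchors, positives)
unrelated_loss = info_nce(anchors, unrelated)
```

The loss depends on every other item in the batch acting as a negative, whereas the JEPA objective compares only a predicted latent with its own target and needs no negatives.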
Versus generative autoregressive models (GPT-style LLMs, video diffusion). Autoregressive models such as large language models predict the next discrete token; video-diffusion models denoise pixel sequences. Both operate in the signal space. LeCun has argued repeatedly, most prominently in a September 2024 X (Twitter) post stating that "pure auto-regressive LLMs are a dead end on the way towards human-level AI", that token-level prediction cannot recover a faithful world model because most real-world signals contain too much unpredictable detail for next-token loss to be a useful gradient signal.[14][3] JEPA's latent prediction is the proposed alternative.[3][14] However, LeCun has also publicly qualified the "dead end" claim as a long-run statement, acknowledging that LLMs remain "very useful in the short term".[14]
The JEPA program has occupied a distinctive position in the AI research discourse: simultaneously well-cited as a self-supervised learning technique and the subject of pointed disagreement about its sufficiency as a path toward general intelligence.
The technical I-JEPA and V-JEPA papers have been widely adopted as baselines in self-supervised computer-vision benchmarks, and the JEPA pattern has been extended by external groups to other modalities, including Point-JEPA for 3D point clouds at WACV 2025, Audio-JEPA for audio representation learning (arXiv 2507.02915), Signal-JEPA for EEG/ECG, and VL-JEPA for vision–language tasks.[31][32][33] Meta itself has published or co-released related variants including 3D-JEPA inside the Locate-3D system and EB-JEPA, an Apache-licensed library for experimentation with energy-based and action-conditioned JEPA variants.[22][34]
The broader claim that JEPA-style world models supersede LLMs has been more contested. Demis Hassabis of Google DeepMind has publicly debated LeCun on the role of language models in achieving general intelligence, with the debate amplified at Davos in January 2026 when Anthropic CEO Dario Amodei argued that current-architecture LLMs would substantially automate software development within a year, a position incompatible with LeCun's "dead-end" framing.[3] Gary Marcus, a long-standing critic of pure-neural approaches, has argued that LeCun's framing of "world models" understates the need for explicit symbolic structure and that JEPA-trained encoders, however clever, remain "opaque, uninterpretable neural networks" that do not solve the grounding problem.[35] Melanie Mitchell, a Santa Fe Institute computer scientist, has expressed parallel skepticism about both LLM and embedding-based approaches' ability to acquire genuine world models from passive observation alone.[36]
In a November 2025 public lecture, LeCun characterized the LLM-only path to superintelligence as "complete bullshit", a rhetorically sharper version of his long-standing argument.[3] Shortly thereafter, in December 2025, he confirmed the founding of AMI Labs (Advanced Machine Intelligence) as a Paris-headquartered startup to commercialize world-model AI; Alex LeBrun (formerly CEO of medical-AI startup Nabla) serves as CEO and LeCun as executive chairman.[9][10] In March 2026, AMI Labs closed a seed round reported at approximately USD 1.03 billion at a USD 3.5 billion valuation, described as the largest seed round in European history, co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions.[9][10] The company is targeting world-model systems for industrial process monitoring, smart glasses, robotics, and autonomous vehicles, with JEPA-style architectures as the core technical bet.[10]
JEPA architectures have moved beyond pure benchmarks into several application areas, though most remain at the research-prototype stage.
Computer-vision representation learning. I-JEPA and its successors function as drop-in self-supervised pretraining for downstream image-understanding tasks. The released ViT-H and ViT-g checkpoints have been used by external groups for image classification, depth estimation, object counting, and dense prediction tasks, with the benefit of not requiring labeled pretraining data or augmentation pipelines.[5][19]
Video understanding and action recognition. V-JEPA is a strong frozen-backbone feature extractor for video classification, action anticipation, and video question answering, with V-JEPA 2 substantially extending these capabilities at the 1.2B parameter scale.[6][23]
Robotics and embodied AI. V-JEPA 2-AC demonstrated zero-shot robotic pick-and-place on Franka arms in unseen labs, using fewer than 62 hours of unlabeled robot video as post-training data, a striking sample-efficiency result relative to traditional behavior-cloning or reinforcement-learning baselines.[7][26] This is the most concrete realization to date of LeCun's claim that latent world models can support embodied AI without massive on-robot data collection.[7]
Domain-specific extensions. Variants such as Point-JEPA, Audio-JEPA, Signal-JEPA, and 3D-JEPA have applied the latent-prediction recipe to point clouds, audio, biomedical signals, and 3D scene understanding, generally showing favorable sample efficiency relative to reconstruction baselines.[31][32][34]
Important limitations remain, however, and are openly acknowledged in the V-JEPA 2 paper and Meta's blog post.[7][23] First, the physical-reasoning gap between current JEPA models and humans is large: V-JEPA 2 trails human accuracy by 20–30 absolute percentage points on IntPhys 2, MVPBench, and CausalVQA, indicating that latent prediction over internet video does not by itself produce robust causal understanding.[7] Second, JEPAs as currently trained learn largely as passive observers: they have not been shown to plan over long horizons, and the hierarchical multi-level JEPA that LeCun's 2022 paper envisaged has not been publicly demonstrated at scale.[11][7] Third, the practical training of JEPAs has historically depended on the EMA/stop-gradient heuristic stack that LeJEPA explicitly identifies as theoretically unsatisfying; whether the LeJEPA recipe will hold up at the scales now being attempted by AMI Labs is an open question as of mid-2026.[8][27] Finally, JEPA's claim to be a path to general intelligence remains contested by major figures in the field, and there is no published evidence, within or outside Meta, that a JEPA-trained system can match LLM-class reasoning, dialogue, or knowledge-intensive performance, or that one will do so on the timelines implied by AMI Labs' billion-dollar seed round.[3][35][36]