I-JEPA
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,546 words
I-JEPA (Image-based Joint-Embedding Predictive Architecture) is a self-supervised learning method for computer vision developed by Meta AI. It was introduced in the paper "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas, first posted to arXiv in January 2023 and published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2023.[1][2] I-JEPA is the first concrete image instantiation of the Joint-Embedding Predictive Architecture (JEPA), a design that LeCun proposed in 2022 as a path toward machines that build internal models of the world.[3]
The central idea is to learn by prediction in representation space rather than in pixel space: from a single context block of an image, the model predicts the learned representations of several other target blocks in the same image. This lets I-JEPA avoid both the pixel-level reconstruction used by masked autoencoders (MAE) and the hand-crafted data augmentations used by invariance-based methods such as DINO.[1][3]
Self-supervised learning for images had, before I-JEPA, largely split into two families. Invariance-based (joint-embedding) methods such as DINO, SimCLR, and iBOT train a network to produce similar embeddings for two augmented views of the same image. These methods produce strong representations, but they depend on a curated set of image transformations (random cropping, color jitter, blurring) that bake in biases and can discard information that matters for some downstream tasks.[1] Generative methods such as MAE instead mask out parts of an image and train the model to reconstruct the missing pixels. They need no augmentations, but reconstructing pixels forces the network to model low-level detail that is often irrelevant to semantic understanding, and the resulting off-the-shelf representations tend to lag behind invariance-based ones on tasks like linear classification.[1][3]
I-JEPA targets the gap between these families. It keeps the augmentation-free property of generative methods while predicting in an abstract representation space, so it does not have to spend capacity filling in pixel-level texture. Meta's announcement framed this as a deliberate move toward LeCun's view that intelligent systems should learn by predicting high-level outcomes rather than reconstructing every detail of their input, and noted that pixel-reconstruction objectives waste effort on details a model cannot reliably produce (the blog gives the example of generative models struggling to render human hands).[3]
I-JEPA uses three components, all built on the vision transformer (ViT). An image is first split into non-overlapping patches.[1]
| Component | Role |
|---|---|
| Context encoder | A ViT that encodes the visible patches of a single context block into a sequence of representations. |
| Target encoder | A ViT that encodes the full image into patch-level representations; the target blocks to be predicted are taken from this output. |
| Predictor | A narrow ViT that takes the context representations plus positional tokens for the masked target locations, and predicts the target-encoder representations at those locations. |
Training minimizes the L2 distance between the predictor's outputs and the corresponding target-encoder representations, averaged over the target blocks. Crucially, the loss is computed in representation space, not pixel space.[1]
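The objective can be illustrated with a small sketch in plain Python. It assumes a squared L2 distance summed over the patches of each target block and averaged over blocks, which is one reading of the paper's description; the function names and toy vectors are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two patch representations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def block_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Sum of per-patch squared distances inside one target block."""
    return sum(squared_l2(p, t) for p, t in zip(predicted, target))


def ijepa_style_loss(pred_blocks: List[List[Vector]],
                     target_blocks: List[List[Vector]]) -> float:
    """Average the block losses over all target blocks of one image."""
    per_block = [block_loss(p, t) for p, t in zip(pred_blocks, target_blocks)]
    return sum(per_block) / len(per_block)


# Toy example: 2 target blocks, 2 patches each, 2-dimensional representations.
preds = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 0.0]]]
targets = [[[1.0, 0.0], [0.0, 0.0]], [[0.5, 0.5], [1.0, 0.0]]]
loss = ijepa_style_loss(preds, targets)
print(loss)  # 1.0: each block contributes one mismatched patch at distance 1
```

Nothing in this computation touches pixel values: both arguments are representation vectors, the first produced by the predictor and the second by the target encoder.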
The target encoder is not trained by gradient descent on this loss. Instead, its weights are an exponential moving average (EMA) of the context-encoder weights; the paper's main configuration starts the momentum at 0.996 and increases it linearly to 1.0 over the course of pretraining.[1] This stop-gradient plus EMA arrangement gives the predictor stable targets and helps prevent representation collapse, a known failure mode in which a network maps every input to the same constant embedding. The same EMA-target idea appears in earlier joint-embedding methods such as BYOL and DINO.[1]
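A minimal sketch of the EMA update follows, assuming parameters are stored as named lists of floats. The `ema_update` helper and the toy parameter dictionaries are hypothetical; a real implementation would apply the same rule to every tensor of the encoder.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(target: Params, context: Params, momentum: float) -> None:
    """In-place update: target <- momentum * target + (1 - momentum) * context."""
    for name, ctx_values in context.items():
        tgt_values = target[name]
        for i, c in enumerate(ctx_values):
            tgt_values[i] = momentum * tgt_values[i] + (1.0 - momentum) * c


# Hypothetical toy parameters; real encoders hold hundreds of millions of weights.
context_params = {"w": [1.0, 2.0]}
target_params = {"w": [0.0, 0.0]}

# No gradient flows into target_params; they only move through this update.
ema_update(target_params, context_params, momentum=0.996)
print(target_params["w"])  # approximately [0.004, 0.008]
```

With a momentum this close to 1, the target encoder changes slowly, which is what makes its outputs usable as stable regression targets.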
The masking strategy is what steers I-JEPA toward semantic features. The paper introduces a multi-block masking scheme with two requirements: target blocks must be sampled at a sufficiently large scale to be semantically meaningful, and the context block must be sufficiently informative and spatially distributed.[1][2] In the main configuration the model samples 4 target blocks per image, each covering a random fraction in the range (0.15, 0.2) of the image area with an aspect ratio in (0.75, 1.5). The context block covers a larger fraction in the range (0.85, 1.0), and any patches that overlap with the target blocks are removed from the context so the model cannot simply copy them.[1] Because the targets are sizable blocks rather than scattered individual patches, predicting them requires the model to capture object-level structure rather than fill in local texture.
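The sampling procedure can be sketched as follows, using the scale and aspect-ratio ranges quoted above on a 14 x 14 patch grid. This is an illustration under simplifying assumptions (blocks are sampled independently per image, the context block is square, and the helper names `sample_block` and `sample_masks` are hypothetical); it is not the released implementation.

```python
import math
import random
from typing import List, Set, Tuple


def sample_block(grid: int, scale: Tuple[float, float],
                 aspect: Tuple[float, float], rng: random.Random) -> Set[int]:
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    area = rng.uniform(*scale) * grid * grid   # block area, in patches
    ratio = rng.uniform(*aspect)               # height / width
    h = min(grid, max(1, round(math.sqrt(area * ratio))))
    w = min(grid, max(1, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)             # randint bounds are inclusive
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(grid: int = 14, num_targets: int = 4,
                 seed: int = 0) -> Tuple[Set[int], List[Set[int]]]:
    """Return (context patch indices, list of target blocks) for one image."""
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for block in targets:
        context -= block   # drop overlapping patches so targets cannot be copied
    return context, targets


context, targets = sample_masks(seed=0)
print(len(context), [len(t) for t in targets])
```

Because the four target blocks may overlap one another and are all subtracted from the context, the number of patches the context encoder actually sees varies from image to image.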
The released code provides several pretrained backbones at different scales and resolutions.[4] The ViT-Huge model has roughly 632 million parameters.[3]
| Backbone | Patch size | Resolution | Epochs | Pretraining data |
|---|---|---|---|---|
| ViT-H | 14x14 | 224x224 | 300 | ImageNet-1K |
| ViT-H | 16x16 | 448x448 | 300 | ImageNet-1K |
| ViT-H | 14x14 | 224x224 | 66 | ImageNet-22K |
| ViT-g | 16x16 | 224x224 | 44 | ImageNet-22K |
The paper also reports results for smaller ViT-B/16 and ViT-L/16 backbones trained on ImageNet-1K.[1]
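The patch size and resolution columns determine how many patch tokens each backbone processes for a full image. The short calculation below assumes square images and square, non-overlapping patches; the `num_patches` helper and the configuration labels are hypothetical.

```python
def num_patches(resolution: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


configs = {
    "ViT-H/14 @ 224": (224, 14),
    "ViT-H/16 @ 448": (448, 16),
    "ViT-g/16 @ 224": (224, 16),
}
patch_counts = {name: num_patches(*cfg) for name, cfg in configs.items()}
print(patch_counts)
# {'ViT-H/14 @ 224': 256, 'ViT-H/16 @ 448': 784, 'ViT-g/16 @ 224': 196}
```

The 448 x 448 model therefore handles roughly three times as many tokens per image as the 224 x 224 ViT-H/14.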
I-JEPA's headline evaluation is linear probing on ImageNet-1K, where a linear classifier is trained on top of frozen features. The table below lists I-JEPA's top-1 accuracies alongside two augmentation-free baselines from the paper's comparison.[1]
| Method | Backbone | Epochs | Linear-probe top-1 |
|---|---|---|---|
| I-JEPA | ViT-B/16 | 600 | 72.9% |
| I-JEPA | ViT-L/16 | 600 | 77.5% |
| I-JEPA | ViT-H/14 | 300 | 79.3% |
| I-JEPA | ViT-H/16 (448px) | 300 | 81.1% |
| MAE | ViT-H/14 | 1600 | 77.2% |
| data2vec | ViT-L/16 | 1600 | 77.3% |
At ViT-H/14, I-JEPA reaches 79.3% top-1, ahead of MAE's ViT-H/14 result of 77.2% while training for far fewer epochs (300 versus 1600). At higher resolution, the ViT-H/16 model trained on 448 x 448 images reaches 81.1%, which is competitive with strong invariance-based methods that do use view augmentations, such as iBOT ViT-L/16 (81.0%) and DINO ViT-B/8 (80.1%).[1]
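Linear probing, the protocol behind these numbers, can be illustrated with a toy example. The sketch below is a two-class stand-in in plain Python: `frozen_encoder` is a hypothetical fixed mapping that plays the role of the pretrained backbone, and only the linear classifier on top of it is trained. The real evaluation uses a 1000-class classifier on ViT features.

```python
import math
from typing import List, Tuple


def frozen_encoder(x: float) -> List[float]:
    """Stand-in for a pretrained, frozen backbone (hypothetical toy mapping)."""
    return [math.tanh(x), math.tanh(2.0 * x)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features: List[List[float]], labels: List[int],
                       lr: float = 0.5, steps: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression on fixed features; the encoder is never updated."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


inputs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [frozen_encoder(x) for x in inputs]  # computed once, then held fixed
w, b = train_linear_probe(features, labels)

predictions = [int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5)
               for f in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 1.0 on this linearly separable toy set
```

Since the encoder's output never changes during probe training, the resulting accuracy measures how linearly separable the pretrained features already are.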
I-JEPA is also strong in the low-shot (semi-supervised) regime, where only a small fraction of ImageNet labels are available. With 1% of ImageNet-1K labels (roughly 12 or 13 labeled images per class), it reaches 73.3% top-1 with ViT-H/14 and 77.3% with ViT-H/16 at 448 pixels.[1][3] Meta described this result as state-of-the-art for low-shot classification at the time.[3] Beyond classification, the paper reports that I-JEPA features transfer to tasks including object counting and depth prediction, supporting the claim that representation-space prediction yields broadly useful, off-the-shelf features.[2]
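The "12 or 13 images per class" figure follows from the size of the ImageNet-1K training set, as the quick check below shows. The variable names are illustrative, and the calculation assumes labels are spread evenly across classes.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size
NUM_CLASSES = 1000
LABEL_FRACTION = 0.01                 # the 1% low-shot setting

labeled_images = IMAGENET_1K_TRAIN_IMAGES * LABEL_FRACTION
per_class = labeled_images / NUM_CLASSES
print(f"{per_class:.1f} labeled images per class")  # 12.8 labeled images per class
```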
Computational efficiency is one of the paper's central claims. The authors trained a ViT-H/14 on ImageNet using 16 NVIDIA A100 GPUs in under 72 hours.[1][2] In GPU-hour terms, pretraining the ViT-H/14 took fewer than 1,200 GPU hours, which the paper reports as more than 2.5 times faster than a ViT-S/16 pretrained with iBOT and more than 10 times more efficient than a ViT-H/14 pretrained with MAE.[1] Meta's blog summarized this by saying that competing methods typically use two to ten times more GPU hours while achieving worse error rates.[3] The efficiency comes partly from predicting in representation space (no pixel decoder to run) and partly from the lightweight predictor and the masking scheme, which lets the context encoder process only a subset of patches.
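The two ways of stating the training cost are consistent with each other, as a short calculation shows. The variable names are illustrative, and 72 hours is treated as an upper bound on wall-clock time.

```python
gpus = 16
wall_clock_hours_upper_bound = 72

# Total GPU hours is the number of GPUs times the wall-clock duration.
gpu_hours_upper_bound = gpus * wall_clock_hours_upper_bound
print(gpu_hours_upper_bound)          # 1152
print(gpu_hours_upper_bound < 1200)   # True
```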
I-JEPA is one member of a broader family of architectures that LeCun outlined in his 2022 position paper on autonomous machine intelligence. A JEPA does not predict the raw future or the raw missing input; it predicts an abstract representation of it, which lets the model discard unpredictable, low-level detail and focus on structure that can actually be anticipated.[3] I-JEPA is the image instantiation of this principle.
Meta later extended the same idea to video with V-JEPA, released in 2024, which predicts representations of masked spatio-temporal regions in video clips. Put simply, I-JEPA is the image version and V-JEPA is the video version of the joint-embedding predictive architecture.[3] A follow-up, V-JEPA 2, was released in 2025 and scaled the approach to over one million hours of video, targeting understanding, prediction, and planning for physical-world tasks. The JEPA line of work is related to, but distinct from, Meta's DINOv2, which produces general-purpose visual features through a different (distillation-based, invariance-style) self-supervised recipe.