I-JEPA

Updated Jul 16, 2026

I-JEPA (Image-based Joint-Embedding Predictive Architecture) is a self-supervised learning method for computer vision developed by Meta AI. It was introduced in the paper "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas, first posted to arXiv in January 2023 and published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2023.^[1]^[2] I-JEPA is the first concrete image instantiation of the Joint-Embedding Predictive Architecture (JEPA), a design that LeCun proposed in 2022 as a path toward machines that build internal models of the world.^[3]

The central idea is to learn by prediction in representation space rather than in pixel space: from a single context block of an image, the model predicts the learned representations of several other target blocks in the same image. This lets I-JEPA avoid both the pixel-level reconstruction used by masked autoencoders (MAE) and the hand-crafted data augmentations used by invariance-based methods such as DINO.^[1]^[3]

Background and motivation

Before I-JEPA, self-supervised learning for images had largely split into two families. Invariance-based (joint-embedding) methods such as DINO, SimCLR, and iBOT train a network to produce similar embeddings for two augmented views of the same image. These methods produce strong representations, but they depend on a curated set of image transformations (random cropping, color jitter, blurring) that bake in biases and can discard information that matters for some downstream tasks.^[1] Generative methods such as MAE instead mask out parts of an image and train the model to reconstruct the missing pixels. They need no augmentations, but reconstructing pixels forces the network to model low-level detail that is often irrelevant to semantic understanding, and the resulting off-the-shelf representations tend to lag behind invariance-based ones on tasks like linear classification.^[1]^[3]

I-JEPA targets the gap between these families. It keeps the augmentation-free property of generative methods while predicting in an abstract representation space, so it does not have to spend capacity filling in pixel-level texture. Meta's announcement framed this as a deliberate move toward LeCun's view that intelligent systems should learn by predicting high-level outcomes rather than reconstructing every detail of their input, and noted that pixel-reconstruction objectives waste effort on details a model cannot reliably produce (the blog gives the example of generative models struggling to render human hands).^[3]

How it works

I-JEPA uses three components, all built on the vision transformer (ViT). An image is first split into non-overlapping patches.^[1]

Component	Role
Context encoder	A ViT that encodes the visible patches of a single context block into a sequence of representations.
Target encoder	A ViT that encodes the full image into patch-level representations; the target blocks to be predicted are taken from this output.
Predictor	A narrow ViT that takes the context representations plus positional tokens for the masked target locations, and predicts the target-encoder representations at those locations.

Training minimizes the average L2 distance between the predictor's outputs and the corresponding target-encoder representations, computed per patch and averaged over the target blocks. Crucially, the loss is computed in representation space, not pixel space.^[1]
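The objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: it assumes squared L2 as the distance, represents each patch embedding as a list of floats, and uses hypothetical function names (`squared_l2`, `block_loss`, `representation_loss`). The official code may use a different distance function and operates on batched tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]  # one patch-level embedding


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def block_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Sum of per-patch squared distances inside one target block."""
    return sum(squared_l2(p, t) for p, t in zip(predicted, target))


def representation_loss(pred_blocks: List[List[Vector]],
                        target_blocks: List[List[Vector]]) -> float:
    """Average the per-block loss over all target blocks.

    Both arguments hold embeddings, never pixels: `pred_blocks` would come
    from the predictor and `target_blocks` from the target encoder.
    """
    per_block = [block_loss(p, t) for p, t in zip(pred_blocks, target_blocks)]
    return sum(per_block) / len(per_block)


# Toy example: two target blocks, two patches each, embedding dimension 2.
pred = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
tgt = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
print(representation_loss(pred, tgt))  # 4.5
```

In the toy example the first block contributes 1.0 and the second 8.0, so the average over the two blocks is 4.5.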

The target encoder is not trained by gradient descent on this loss. Instead its weights are an exponential moving average (EMA) of the context-encoder weights; in the paper's main configuration the momentum starts at 0.996 and is increased linearly toward 1.0 over the course of pretraining.^[1] This stop-gradient plus EMA arrangement gives the predictor slowly changing, stable targets and helps prevent representation collapse, a known failure mode in which a network maps every input to the same constant embedding. The same EMA-target idea appears in earlier joint-embedding methods such as BYOL and DINO.^[1]
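As a rough sketch of the EMA rule, the snippet below updates a target parameter set from a context parameter set without any gradient step. Parameters are modeled as a dict of float lists purely for illustration; the names `ema_update` and `momentum_at` are hypothetical, and the linear ramp helper is an assumption about how a momentum schedule might be expressed. The momentum is an argument, so either a constant such as the 0.996 quoted above or a schedule can be supplied.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flat list of weights


def momentum_at(step: int, total_steps: int,
                start: float = 0.996, end: float = 1.0) -> float:
    """Linearly interpolate the EMA momentum from `start` to `end`."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac


def ema_update(target: Params, context: Params, momentum: float) -> None:
    """In-place update: target <- m * target + (1 - m) * context.

    No gradient flows into `target`; it only tracks `context` over time.
    """
    for name, t_vals in target.items():
        c_vals = context[name]
        for i, c in enumerate(c_vals):
            t_vals[i] = momentum * t_vals[i] + (1.0 - momentum) * c


# Toy example with an exaggerated momentum of 0.5 so the effect is visible.
target_params = {"w": [1.0, 2.0]}
context_params = {"w": [0.0, 4.0]}
ema_update(target_params, context_params, momentum=0.5)
print(target_params["w"])  # [0.5, 3.0]
```

With a momentum close to 1, each update moves the target encoder only slightly toward the context encoder, which is what keeps the prediction targets stable from step to step.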

Multi-block masking

The masking strategy is what steers I-JEPA toward semantic features. The paper introduces a multi-block masking scheme with two requirements: target blocks must be sampled at a sufficiently large scale to be semantically meaningful, and the context block must be sufficiently informative and spatially distributed.^[1]^[2] In the main configuration the model samples 4 target blocks per image, each covering a random fraction in the range (0.15, 0.2) of the image area with an aspect ratio in (0.75, 1.5). The context block covers a larger fraction in the range (0.85, 1.0), and any patches that overlap with the target blocks are removed from the context so the model cannot simply copy them.^[1] Because the targets are sizable blocks rather than scattered individual patches, predicting them requires the model to capture object-level structure rather than fill in local texture.
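The sampling procedure above can be approximated as follows. This is a simplified sketch rather than the released mask collator: it assumes a 14x14 patch grid (a 224-pixel image with 16-pixel patches), a unit aspect ratio for the context block, independent sampling per image, and the hypothetical function names `sample_block` and `multiblock_mask`.

```python
import math
import random
from typing import List, Set, Tuple

Patch = Tuple[int, int]  # (row, col) index on the patch grid


def sample_block(grid: int, scale: Tuple[float, float],
                 aspect: Tuple[float, float],
                 rng: random.Random) -> Set[Patch]:
    """Sample one rectangular block of patches on a grid x grid layout."""
    area = rng.uniform(*scale) * grid * grid   # block area in patches
    ratio = rng.uniform(*aspect)               # height / width
    h = min(grid, max(1, round(math.sqrt(area * ratio))))
    w = min(grid, max(1, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def multiblock_mask(grid: int = 14, num_targets: int = 4,
                    seed: int = 0) -> Tuple[Set[Patch], List[Set[Patch]]]:
    """Return (context patches, list of target blocks) for one image."""
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # Assumption: the context block is square (aspect ratio fixed at 1.0).
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for block in targets:
        context -= block  # drop overlap so targets cannot be copied
    return context, targets


context, targets = multiblock_mask()
print(len(context), [len(t) for t in targets])
```

Target blocks are allowed to overlap one another in this sketch; only the overlap between the context and the targets is removed, which is the property the text relies on.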

Model sizes

The released code provides several pretrained backbones at different scales and resolutions.^[4] The ViT-Huge model has roughly 632 million parameters.^[3]

Backbone	Patch size	Resolution	Epochs	Pretraining data
ViT-H	14x14	224x224	300	ImageNet-1K
ViT-H	16x16	448x448	300	ImageNet-1K
ViT-H	14x14	224x224	66	ImageNet-22K
ViT-g	16x16	224x224	44	ImageNet-22K

The paper also reports results for smaller ViT-B/16 and ViT-L/16 backbones trained on ImageNet-1K.^[1]
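The patch size and resolution columns above determine how many patch tokens a backbone processes per image. The helper below is a small illustrative calculation (the name `patch_grid` is hypothetical), assuming square images and square, non-overlapping patches.

```python
from typing import Tuple


def patch_grid(resolution: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch size")
    side = resolution // patch_size
    return side, side * side


print(patch_grid(224, 14))  # (16, 256)
print(patch_grid(448, 16))  # (28, 784)
print(patch_grid(224, 16))  # (14, 196)
```

The 448-pixel ViT-H/16 configuration therefore handles roughly three times as many patches per image as the 224-pixel ViT-H/14 configuration.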

Results

I-JEPA's headline evaluation is linear probing on ImageNet-1K, where a linear classifier is trained on top of frozen features. The table below lists I-JEPA's top-1 accuracies alongside two augmentation-free baselines from the paper's comparison.^[1]

Method	Backbone	Epochs	Linear-probe top-1
I-JEPA	ViT-B/16	600	72.9%
I-JEPA	ViT-L/16	600	77.5%
I-JEPA	ViT-H/14	300	79.3%
I-JEPA	ViT-H/16 (448px)	300	81.1%
MAE	ViT-H/14	1600	77.2%
data2vec	ViT-L/16	1600	77.3%

At ViT-H/14, I-JEPA reaches 79.3% top-1, ahead of MAE's ViT-H/14 result of 77.2% while training for far fewer epochs (300 versus 1600). At higher resolution the ViT-H/16 model evaluated at 448 pixels reaches 81.1%, which is competitive with strong invariance-based methods that do use view augmentations, such as iBOT ViT-L/16 (81.0%) and DINO ViT-B/8 (80.1%).^[1]

I-JEPA is also strong in the low-shot (semi-supervised) regime, where only a small fraction of ImageNet labels are available. With 1% of ImageNet-1K labels (roughly 12 or 13 labeled images per class), it reaches 73.3% top-1 with ViT-H/14 and 77.3% with ViT-H/16 at 448 pixels.^[1]^[3] Meta described the 1% setting as state-of-the-art for low-shot classification at the time.^[3] Beyond classification, the paper reports that I-JEPA features transfer to tasks including object counting and depth prediction, supporting the claim that representation-space prediction yields broadly useful, off-the-shelf features.^[2]
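The "12 or 13 labeled images per class" figure follows from simple arithmetic, sketched below. The training-set size of 1,281,167 images and the count of 1,000 classes are the standard ImageNet-1K figures; the function name is hypothetical, and classes are treated as equally sized for simplicity.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
IMAGENET_1K_CLASSES = 1_000


def labeled_images_per_class(label_fraction: float) -> float:
    """Average number of labeled images per class for a label fraction."""
    return IMAGENET_1K_TRAIN_IMAGES * label_fraction / IMAGENET_1K_CLASSES


print(round(labeled_images_per_class(0.01), 1))  # 12.8
```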

Training efficiency

Computational efficiency is one of the paper's central claims. The authors trained a ViT-H/14 on ImageNet using 16 NVIDIA A100 GPUs in under 72 hours.^[1]^[2] In GPU-hour terms, pretraining the ViT-H/14 took fewer than 1,200 GPU hours, which the paper reports as more than 2.5 times faster than a ViT-S/16 pretrained with iBOT and more than 10 times more efficient than a ViT-H/14 pretrained with MAE.^[1] Meta's blog summarized this by saying that competing methods typically use two to ten times more GPU-hours while achieving worse error rates.^[3] The efficiency comes partly from predicting in representation space (no pixel decoder to run) and partly from the lightweight predictor and the masking scheme, which lets the context encoder process only a subset of patches.
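The two efficiency figures quoted above are consistent with each other, as the short check below shows. It is only a back-of-the-envelope calculation with a hypothetical helper name; it treats 72 hours as an upper bound on wall-clock time.

```python
def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Total GPU-hours consumed by a run on `num_gpus` devices."""
    return num_gpus * wall_clock_hours


upper_bound = gpu_hours(16, 72)
print(upper_bound)          # 1152
print(upper_bound < 1200)   # True
```

Sixteen GPUs for at most 72 hours gives at most 1,152 GPU-hours, which sits under the 1,200 GPU-hour figure.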

Relationship to JEPA and V-JEPA

I-JEPA is one member of a broader family of architectures that LeCun outlined in his 2022 position paper on autonomous machine intelligence.^[5] A JEPA does not predict the raw future or the raw missing input; it predicts an abstract representation of it, which lets the model discard unpredictable, low-level detail and focus on structure that can actually be anticipated.^[3] I-JEPA is the image instantiation of this principle.

Meta later extended the same idea to video with V-JEPA, released in 2024, which predicts representations of masked spatio-temporal regions in video clips.^[6] In short, I-JEPA is the image version and V-JEPA is the video version of the joint-embedding predictive architecture.^[3] A follow-up, V-JEPA 2, was released in 2025 and scaled the approach to over one million hours of video, targeting understanding, prediction, and planning for physical-world tasks. The JEPA line of work is related to, but distinct from, Meta's DINOv2, which produces general-purpose visual features through a different (distillation-based, invariance-style) self-supervised recipe.

References

1. Assran, Mahmoud; Duval, Quentin; Misra, Ishan; Bojanowski, Piotr; Vincent, Pascal; Rabbat, Michael; LeCun, Yann; Ballas, Nicolas. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." arXiv:2301.08243, January 2023. https://arxiv.org/abs/2301.08243
2. Assran, Mahmoud et al. "Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. https://openaccess.thecvf.com/content/CVPR2023/html/Assran_Self-Supervised_Learning_From_Images_With_a_Joint-Embedding_Predictive_Architecture_CVPR_2023_paper.html
3. Meta AI. "I-JEPA: The first AI model based on Yann LeCun's vision for more human-like AI." Meta AI Blog, June 13, 2023. https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
4. facebookresearch/ijepa. "Official codebase for I-JEPA." GitHub. https://github.com/facebookresearch/ijepa
5. LeCun, Yann. "A Path Towards Autonomous Machine Intelligence." OpenReview, 2022. https://openreview.net/forum?id=BZ5a1r-kVsf
6. Bardes, Adrien et al. "Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)." Meta AI, 2024. https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
