See also: Machine learning terms
Self-supervised learning (SSL) is a subfield of machine learning that focuses on learning representations of data without human-provided labels by exploiting the structure and inherent properties of the data itself. Rather than requiring manually annotated labels, SSL algorithms generate supervisory signals directly from the input data by defining pretext tasks that force models to learn meaningful internal representations. This approach has gained significant traction since the mid-2010s because it enables algorithms to learn useful features from large volumes of unlabeled data, thereby reducing the reliance on expensive labeled datasets. The learned representations can then be fine-tuned for a wide range of downstream tasks, such as image classification, natural language processing, speech recognition, and reinforcement learning.
Self-supervised learning has become the dominant pretraining paradigm behind modern foundation models. Nearly all large-scale AI systems introduced since 2018, including BERT, GPT, T5, CLIP, DINO, and Wav2Vec 2.0, rely on self-supervised pretraining as the first stage of their training pipeline. By learning from raw, unlabeled corpora of text, images, audio, or video, these models acquire general-purpose representations that transfer effectively to hundreds of specialized tasks with minimal labeled data.
Imagine you are given a picture book, but someone has covered up parts of every picture with sticky notes. Your job is to guess what is hidden under each sticky note. Nobody tells you the answers, but by looking at thousands of pictures and figuring out what fits, you start to understand what dogs, trees, and cars look like. That is essentially what self-supervised learning does: the computer hides parts of its own data (words in a sentence, patches of an image, segments of audio) and then tries to guess what was hidden. By solving these "fill in the blank" puzzles millions of times, the computer builds a deep understanding of language, images, or sound, all without a human teacher labeling every example.
Another analogy: think of a jigsaw puzzle. Nobody tells you what the finished picture should look like, but by fitting the pieces together you learn a lot about shapes, colors, and scenes. Self-supervised learning creates its own jigsaw puzzles from raw data, and the process of solving them teaches the model useful patterns it can later apply to real tasks like translating languages or identifying objects in photos.
The conventional supervised machine learning paradigm requires large amounts of labeled data to train accurate models. However, obtaining such labeled data can be expensive, time-consuming, and infeasible for some domains. Medical imaging, for instance, requires expert radiologists to annotate each scan, while speech transcription in low-resource languages may lack native transcribers entirely. Moreover, supervised learning models may not generalize well to new, unseen data, as they are often biased towards the specific distribution of the training set. In contrast, self-supervised learning aims to leverage the abundance of unlabeled data available in the wild, allowing models to learn meaningful representations without explicit supervision.
Unsupervised machine learning methods, such as clustering and dimensionality reduction, have long been used to analyze and discover structures in data without relying on labels. Self-supervised learning builds upon these foundations by focusing on learning rich, high-level representations of data that can be used as a starting point for various downstream tasks. By doing so, SSL bridges the gap between unsupervised learning and supervised learning, exploiting the benefits of both paradigms.
The term "self-supervised learning" was popularized by Yann LeCun, who argued that it more accurately describes the mechanism at work compared to the broader label of "unsupervised learning." In self-supervised learning, the supervision signal is not absent; it is derived automatically from the data. For example, predicting a masked word in a sentence or predicting the next frame in a video provides a concrete training objective, even though no human annotator supplied the target.
In a landmark 2021 blog post co-authored by Yann LeCun, Meta AI described self-supervised learning as the "dark matter of intelligence." The analogy draws on cosmology: just as dark matter constitutes the vast majority of the universe's mass yet remains invisible, the vast majority of learning that biological organisms perform is self-supervised rather than supervised. Humans and animals learn to understand the world largely through observation, not through labeled examples. A child does not need someone to label every object in a room to learn what a chair looks like; instead, the child builds mental models by predicting what will happen next and filling in gaps in perception.
LeCun argued that self-supervised learning is one of the most promising paths toward building AI systems that approach human-level common sense. In his well-known cake analogy, supervised learning accounts for only the thin "icing" on the cake of intelligence and reinforcement learning for the cherry on top, while self-supervised learning provides the bulk of the "cake" itself. This framing has been influential in motivating research into methods like JEPA that seek to learn world models through prediction in abstract representation spaces.
Self-supervised learning occupies a distinct position among the major machine learning paradigms. The following table summarizes the key differences.
| Aspect | Supervised learning | Unsupervised learning | Semi-supervised learning | Self-supervised learning |
|---|---|---|---|---|
| Labels required | Full labeled dataset | No labels | Small labeled set + large unlabeled set | No labels (labels derived from data) |
| Training signal | Human-provided labels | Data structure (clusters, distributions) | Combination of labels and consistency regularization | Pretext task generated from the data itself |
| Typical goal | Predict target variable | Discover hidden patterns | Improve supervised model with unlabeled data | Learn general-purpose representations |
| Common workflow | Train directly on labeled data | Clustering, density estimation, dimensionality reduction | Joint training on labeled and unlabeled data | Pre-train on pretext task, then fine-tune on downstream task |
| Data efficiency | Requires large labeled datasets | Works on unlabeled data | Reduces labeling cost | Leverages massive unlabeled corpora |
| Examples | Image classification with ImageNet labels, sentiment analysis | K-means clustering, PCA, autoencoders | FixMatch, MixMatch, pseudo-labeling | BERT (masked language modeling), SimCLR (contrastive learning), MAE (masked image modeling) |
A key distinction is that semi-supervised learning and self-supervised learning both use unlabeled data, but they do so differently. Semi-supervised methods typically train a single model jointly on a small labeled set and a large unlabeled set, using techniques such as consistency regularization or pseudo-labeling. Self-supervised methods, by contrast, define an explicit pretext task that requires no labels at all. The resulting pretrained model is then adapted to downstream tasks through transfer learning, usually via fine-tuning or linear probing.
The roots of self-supervised learning trace back to early work on distributed word representations and autoencoders. Autoencoders, which learn to compress and reconstruct their input through a bottleneck layer, represent one of the earliest forms of learning useful representations without labels. Denoising autoencoders (Vincent et al., 2008) took this further by corrupting the input and training the network to recover the original, an approach that foreshadowed modern masked prediction methods.
In 2013, Tomas Mikolov and colleagues at Google published Word2Vec, which learned word embeddings by predicting context words given a target word (skip-gram) or predicting a target word given its context (continuous bag of words, CBOW). Although the term "self-supervised" was not commonly used at the time, Word2Vec exemplified the core principle: constructing supervision from the data itself. Later methods such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) extended this idea to capture subword information and global co-occurrence statistics.
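To make the skip-gram construction concrete, the toy sketch below (a hypothetical helper in plain Python, not the Word2Vec implementation) generates (input word, context word) training pairs from raw text with a context window of two words; every pair is a supervisory signal extracted from the sentence itself, with no human labeling.

```python
# Toy sketch of skip-gram pair construction: each word is trained to
# predict its neighbors within a fixed context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, context word)
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
```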
The release of ELMo in 2018 by Peters et al. demonstrated that deep, context-dependent word representations trained with language modeling objectives could substantially improve downstream NLP tasks. This set the stage for the transformer-based self-supervised revolution that followed with BERT and GPT.
Self-supervised methods can be organized by the type of pretext task they use to extract supervision from raw data. The three broad families are predictive methods, contrastive methods, and generative (reconstructive) methods.
Predictive pretext tasks require the model to predict some part of the input from the remaining parts. Examples include predicting a masked word from its surrounding context, predicting the next token in a sequence, predicting the relative spatial position of two image patches, predicting the rotation applied to an image, and predicting future frames in a video.
Contrastive methods learn representations by pulling together embeddings of semantically similar ("positive") pairs and pushing apart embeddings of dissimilar ("negative") pairs. The model does not reconstruct the input; instead, it learns an embedding space where similarity reflects semantic relatedness. SimCLR, MoCo, and CLIP are well-known contrastive approaches.
Generative pretext tasks require the model to reconstruct the original input from a corrupted or partial version. Variational autoencoders (VAEs), denoising autoencoders, and masked autoencoders (MAE) fall into this category. While generative models like GANs can also be used for representation learning, their primary training signal comes from an adversarial game rather than direct reconstruction.
Natural language processing has been one of the most successful application domains for self-supervised learning. The key insight is that raw text contains rich structure that can be exploited as a training signal. Three principal pretext tasks have dominated NLP pretraining: masked language modeling, next-token prediction, and span corruption.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google in October 2018, pioneered the masked language modeling (MLM) approach. During pretraining, 15% of the input tokens are selected at random. Of these selected tokens, 80% are replaced with a special [MASK] token, 10% are replaced with a random word, and 10% are left unchanged. The model must predict the original identity of each selected token using the bidirectional context provided by the surrounding words.
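The following sketch illustrates the 80/10/10 corruption rule in PyTorch. It is a minimal illustration rather than BERT's actual preprocessing code; the MASK_ID and VOCAB_SIZE values correspond to the original uncased BERT vocabulary, and special tokens such as [CLS] and [SEP] are ignored for brevity.

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30522  # values from the original uncased BERT vocabulary

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Return (corrupted_ids, labels); labels are -100 at unselected positions."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100  # -100 is ignored by cross-entropy loss

    rand = torch.rand_like(input_ids, dtype=torch.float)
    corrupted = input_ids.clone()
    corrupted[selected & (rand < 0.8)] = MASK_ID              # 80% -> [MASK]
    random_pos = selected & (rand >= 0.8) & (rand < 0.9)      # 10% -> random token
    corrupted[random_pos] = torch.randint(
        VOCAB_SIZE, (int(random_pos.sum()),), device=input_ids.device
    )
    return corrupted, labels  # remaining 10% of selected tokens stay unchanged
```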
BERT also employed a secondary pretext task called next sentence prediction (NSP), in which the model received two sentences and predicted whether the second sentence followed the first in the original document. BERT was pretrained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words).
The BERT-Base configuration uses 12 transformer encoder layers, 12 attention heads, and a hidden size of 768, totaling 110 million parameters. BERT-Large scales to 24 layers, 16 heads, and a hidden size of 1,024, for a total of 340 million parameters. After pretraining, BERT set new state-of-the-art results on 11 NLP benchmarks, including SQuAD and GLUE.
Several variants refined the MLM objective:
- RoBERTa (2019) removed next sentence prediction, introduced dynamic masking, and trained longer on more data.
- ALBERT (2019) shared parameters across layers and replaced NSP with a sentence-order prediction task.
- SpanBERT (2020) masked contiguous spans of tokens rather than individual tokens.
- ELECTRA (2020) replaced masked prediction with replaced token detection, in which a discriminator learns to identify tokens substituted by a small generator network, so every input position contributes to the loss.
The GPT (Generative Pre-trained Transformer) series, developed by OpenAI, uses autoregressive language modeling as its self-supervised pretext task. The model processes a sequence of tokens from left to right and predicts the next token at each position. The training objective is to maximize the likelihood of the next token given all preceding tokens, using causal (left-to-right) attention masking so that each position can only attend to earlier positions.
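A minimal sketch of this objective is shown below; `model` stands for any causal transformer that applies left-to-right attention masking internally and returns one logit per vocabulary item at every position.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    logits = model(tokens[:, :-1])            # predict from all but the last token
    targets = tokens[:, 1:]                   # each position's label is the next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [B*(T-1), V]
        targets.reshape(-1),                  # [B*(T-1)]
    )
```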
GPT-1 (Radford et al., 2018) demonstrated that generative pretraining on a large text corpus followed by discriminative fine-tuning could achieve strong performance across diverse NLP tasks. GPT-2 (2019) scaled the approach to 1.5 billion parameters and showed that language models could perform tasks in a zero-shot setting without any fine-tuning. GPT-3 (2020) further scaled to 175 billion parameters and introduced few-shot in-context learning, where the model could perform new tasks by conditioning on a handful of examples in the prompt. GPT-4 (2023) and GPT-5 (2025) continued this trajectory, combining autoregressive pretraining with reinforcement learning from human feedback (RLHF) and other alignment techniques.
The autoregressive pretraining approach has also been adopted by many other large language models, including LLaMA (Meta), Mistral, Falcon, Qwen (Alibaba), and DeepSeek.
T5 (Text-to-Text Transfer Transformer), introduced by Raffel et al. at Google in 2019, unified all NLP tasks into a text-to-text format, where both the input and output are text strings. Its self-supervised pretraining objective is span corruption, a denoising task that randomly selects and drops out 15% of the tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single unique sentinel token (e.g., <X>, <Y>). The model is then trained to generate the missing spans, delimited by the corresponding sentinel tokens.
For example, given the input sentence "The quick brown fox jumps over the lazy dog," the words "brown fox" and "lazy" might be dropped out. The corrupted input would become "The quick <X> jumps over the <Y> dog," and the target output would be "<X> brown fox <Y> lazy <Z>." Because the model must reconstruct multiple consecutive tokens at once, span corruption encourages the learning of richer contextual representations than single-token masking.
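The toy sketch below reproduces this example at the word level. Real T5 operates on SentencePiece token IDs rather than words, and names its sentinels <extra_id_0>, <extra_id_1>, and so on, corresponding to the <X>, <Y>, <Z> placeholders above.

```python
# Toy, string-level sketch of span corruption with T5-style sentinel tokens.
def corrupt(words, spans):  # spans: list of (start, end) word indices to drop
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inp, tgt, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        inp += words[prev:s] + [sentinels[i]]   # replace span with a sentinel
        tgt += [sentinels[i]] + words[s:e]      # target: sentinel + dropped words
        prev = e
    inp += words[prev:]
    tgt += [sentinels[len(spans)]]              # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

words = "The quick brown fox jumps over the lazy dog".split()
print(corrupt(words, [(2, 4), (7, 8)]))
# ('The quick <extra_id_0> jumps over the <extra_id_1> dog',
#  '<extra_id_0> brown fox <extra_id_1> lazy <extra_id_2>')
```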
T5 uses an encoder-decoder transformer architecture, unlike the encoder-only BERT or decoder-only GPT. The original T5 paper systematically evaluated a wide range of pretraining objectives, architectures, and dataset sizes using the Colossal Clean Crawled Corpus (C4), a 750 GB dataset derived from Common Crawl. T5 models range from T5-Small (60 million parameters) to T5-11B (11 billion parameters). Subsequent versions include mT5 (multilingual T5), Flan-T5 (instruction-tuned), and UL2, which combines multiple pretraining objectives.
| Method | Pretext task | Architecture | Directionality | Key innovation | Notable models |
|---|---|---|---|---|---|
| Masked language modeling | Predict masked tokens from context | Encoder-only | Bidirectional | Learns from both left and right context simultaneously | BERT, RoBERTa, ALBERT |
| Next-token prediction | Predict next token autoregressively | Decoder-only | Left-to-right (causal) | Scales naturally to generation tasks; enables in-context learning | GPT, LLaMA, Mistral |
| Span corruption | Reconstruct corrupted spans delimited by sentinels | Encoder-decoder | Bidirectional encoder, autoregressive decoder | Predicts multi-token spans; shorter target sequences reduce training cost | T5, mT5, UL2 |
| Replaced token detection | Detect which tokens were replaced by a generator | Encoder-only | Bidirectional | All tokens contribute to the loss, not just 15% | ELECTRA |
| Permutation language modeling | Predict tokens in random order to capture bidirectional context | Autoregressive (Transformer-XL based) | Permuted | Combines benefits of autoregressive and bidirectional models | XLNet |
Applying self-supervised learning to images presents a different set of challenges compared to text. Language has a natural sequential structure and discrete tokens, whereas images are high-dimensional, continuous signals without an obvious ordering. Early SSL methods in vision used hand-designed pretext tasks such as predicting image rotations, solving jigsaw puzzles, or colorizing grayscale images. While these methods yielded useful representations, they were eventually surpassed by contrastive learning and masked image modeling approaches that learn more general features.
Before the contrastive learning era, researchers devised several creative pretext tasks for visual SSL:
- Context prediction (Doersch et al., 2015): predict the relative spatial position of one image patch with respect to another.
- Inpainting (Pathak et al., 2016): fill in a missing region of an image conditioned on its surroundings.
- Colorization (Zhang et al., 2016): predict the color channels of an image from its grayscale version.
- Jigsaw puzzles (Noroozi and Favaro, 2016): recover the correct arrangement of shuffled image tiles.
- Rotation prediction (Gidaris et al., 2018): classify which of four rotations (0°, 90°, 180°, 270°) was applied to an image.
While these methods produced representations that outperformed random initialization, they often encoded task-specific biases. For example, a rotation predictor might focus on texture cues rather than semantic content. The move to contrastive and masked modeling methods addressed this limitation by learning more general-purpose features.
SimCLR (A Simple Framework for Contrastive Learning of Visual Representations), introduced by Chen et al. at Google Research in February 2020, demonstrated that a straightforward contrastive framework could match or exceed earlier, more complex methods. The SimCLR pipeline has four components:
- A stochastic data augmentation module that produces two correlated views of each image through random cropping, color distortion, and Gaussian blur.
- A base encoder network (a ResNet) that extracts representation vectors from the augmented views.
- A small MLP projection head that maps representations into the space where the contrastive loss is applied.
- The normalized temperature-scaled cross-entropy (NT-Xent) loss, which requires each view to identify its counterpart among all other examples in the batch.
SimCLR requires very large batch sizes (4,096 in the original paper) to provide a sufficient number of negative examples. After pretraining, the projection head is discarded, and the encoder is used for downstream tasks. SimCLR achieved 69.3% top-1 accuracy on ImageNet under linear evaluation with a standard ResNet-50 backbone, rising to 76.5% with a 4x-wider ResNet-50 (4×). SimCLR v2 (2020) scaled the framework to larger ResNet models and introduced a semi-supervised variant that leveraged a small amount of labeled data.
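A compact sketch of the NT-Xent loss is given below, assuming z1 and z2 are the projection-head outputs for two augmented views of the same batch; for each embedding, the matching view must be identified among the 2N - 1 other embeddings.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # [2N, D], unit norm
    sim = z @ z.t() / tau                                # temperature-scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # positive = the other view
```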
MoCo (Momentum Contrast), introduced by He et al. at Meta AI Research in November 2019, tackled the large-batch requirement of contrastive learning by maintaining a dynamic queue of negative representations. The key components are:
- A query encoder, updated by backpropagation, that encodes the current batch.
- A key encoder, updated as an exponential moving average (momentum update) of the query encoder, which keeps the encoded keys consistent over time.
- A fixed-size queue that stores keys from recent batches and serves as a large pool of negatives, decoupling the number of negatives from the batch size.
MoCo v1 achieved competitive results with SimCLR while using a batch size of only 256. MoCo v2 (2020) incorporated improvements inspired by SimCLR, including an MLP projection head and stronger augmentations. MoCo v3 (Chen et al., 2021) adapted the framework for Vision Transformers (ViT) and addressed training instability issues that arise when using transformers with contrastive objectives.
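The two core mechanisms can be sketched as follows; the encoder modules and queue tensor are assumptions, and details from the original implementation (such as shuffling batch-norm statistics across GPUs) are omitted.

```python
import torch

@torch.no_grad()
def momentum_update(query_enc, key_enc, m: float = 0.999):
    # EMA update of the key encoder: k = m * k + (1 - m) * q
    for q, k in zip(query_enc.parameters(), key_enc.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # Drop the oldest keys and append the newest batch (queue shape: [K, D]).
    return torch.cat([queue[keys.size(0):], keys], dim=0)
```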
BYOL (Bootstrap Your Own Latent), published by Grill et al. at DeepMind in June 2020, challenged the assumption that negative pairs are essential for contrastive learning. BYOL uses two networks:
- An online network, consisting of an encoder, a projector, and a predictor, trained by gradient descent.
- A target network with the same encoder and projector architecture, whose weights are an exponential moving average of the online network's weights and which receives no gradient updates.
The online network is trained to predict the target network's representation of a differently augmented view of the same image. Because the target network updates slowly via the momentum mechanism, it provides a stable regression target. BYOL avoids the need for negative examples entirely, which makes it more robust to batch size variations.
BYOL achieved 74.3% top-1 accuracy on ImageNet using linear evaluation with a ResNet-50 encoder and 79.6% with a larger ResNet. It demonstrated stable performance across batch sizes ranging from 256 to 4,096, whereas SimCLR's performance degrades significantly with smaller batches.
SimSiam (Exploring Simple Siamese Representation Learning), published by Chen and He at Meta AI in 2021, further simplified non-contrastive learning by removing the momentum encoder entirely. SimSiam uses a simple Siamese network with a shared encoder and a prediction MLP applied to one branch. The critical innovation is a stop-gradient operation: one branch's output is treated as a fixed target (receives no gradient), while the other branch is trained to predict it.
This design means SimSiam requires neither negative pairs (like SimCLR), nor a momentum encoder (like BYOL and MoCo), nor large batches. Despite its simplicity, SimSiam achieves competitive performance on ImageNet, with 71.3% top-1 accuracy under linear evaluation using ResNet-50. The paper provided theoretical analysis suggesting that SimSiam implicitly performs a form of Expectation-Maximization (EM) optimization, alternating between clustering assignments and representation updates.
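The SimSiam update fits in a few lines; the sketch below mirrors the pseudocode style of the original paper, with `f` denoting the shared encoder plus projection MLP and `h` the prediction MLP (both assumed to be supplied by the caller).

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)          # projections of two augmented views
    p1, p2 = h(z1), h(z2)          # predictions from the prediction MLP
    def d(p, z):
        # Negative cosine similarity; .detach() is the stop-gradient.
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)  # symmetrized loss
```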
DINO (Self-Distillation with No Labels), introduced by Caron et al. at Meta AI Research in April 2021, is a self-supervised method based on self-distillation using Vision Transformers. DINO uses a student-teacher framework where both networks share the same architecture:
- The student network processes both global and small local crops of an image and is trained by gradient descent.
- The teacher network processes only the global crops, and its weights are an exponential moving average of the student's weights.
- The student is trained to match the teacher's output distribution via a cross-entropy loss, with centering and sharpening of the teacher outputs preventing representational collapse.
A notable discovery from DINO was that self-supervised Vision Transformers learn to segment objects without any explicit segmentation supervision. The self-attention maps of the final layer's [CLS] token clearly delineate object boundaries, a property that does not emerge as clearly in supervised ViTs or in convolutional networks.
DINO achieved 80.1% top-1 accuracy on ImageNet with linear evaluation using ViT-Base. DINOv2 (Oquab et al., 2023) scaled the method to 142 million curated images and a ViT-Giant model with over 1 billion parameters, using a combination of self-distillation and masked image modeling. DINOv2 achieved state-of-the-art results across classification, segmentation, and depth estimation without any fine-tuning, producing visual features that work out-of-the-box with simple linear classifiers. DINOv3 (2025) further scaled to 1.7 billion images and a 7-billion-parameter ViT teacher, narrowing the gap with fully supervised models across vision benchmarks.
BEiT (BERT Pre-Training of Image Transformers), introduced by Bao et al. at Microsoft Research in 2021, was the first method to make self-supervised pre-training of Vision Transformers outperform supervised pre-training. BEiT adapts the masked language modeling concept from BERT to images:
- Each image is represented in two ways: as a grid of patches that form the model's input, and as a sequence of discrete visual tokens produced by the pre-trained discrete VAE tokenizer from DALL-E.
- Roughly 40% of the image patches are masked using a blockwise masking strategy.
- The transformer is trained to predict the visual tokens corresponding to the masked patches.
By predicting discrete visual tokens rather than raw pixels, BEiT avoids the pixel-level regression problem and instead frames pre-training as a classification task over visual vocabulary. BEiT-Base achieved 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming the DeiT supervised baseline (81.8%). BEiT-Large reached 86.3% using only ImageNet-1K data.
MAE (Masked Autoencoders Are Scalable Vision Learners), introduced by He et al. at Meta AI Research in November 2021, adapted the masked prediction idea from NLP to computer vision with a simpler approach than BEiT. MAE applies the following procedure:
- The image is divided into non-overlapping patches, and a large random subset of them (75%) is masked out.
- The encoder, a Vision Transformer, processes only the visible patches.
- A lightweight decoder receives the encoded visible patches together with learnable mask tokens and reconstructs the pixel values of the masked patches.
- The training loss is the mean squared error between reconstructed and original pixels, computed only on the masked patches.
The asymmetric encoder-decoder design is critical: because the encoder processes only 25% of the patches, pretraining is highly efficient in both computation and memory. The high masking ratio forces the model to develop a holistic understanding of the image rather than relying on local interpolation from nearby visible patches. Unlike BEiT, MAE reconstructs raw pixel values directly using mean squared error loss, making it simpler (no need for a pre-trained tokenizer) and 3.5 times faster per epoch.
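The masking step at the heart of this efficiency can be sketched as below, under the assumption that `patches` is a batch of already-embedded patch tokens; only the surviving 25% ever enters the encoder.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)  # random score per patch
    shuffle = noise.argsort(dim=1)                   # random permutation of patches
    keep = shuffle[:, :n_keep]                       # indices of visible patches
    visible = torch.gather(
        patches, 1, keep.unsqueeze(-1).expand(B, n_keep, D)
    )
    return visible, keep  # encoder sees `visible`; decoder reconstructs the rest
```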
MAE with a ViT-Huge encoder achieved 87.8% top-1 accuracy on ImageNet after fine-tuning, establishing a new state of the art for self-supervised methods at the time. The approach also scales well to video (VideoMAE) and audio (Audio-MAE).
| Method | Year | Authors / Lab | Approach | Requires negatives | Architecture | ImageNet top-1 (linear eval unless noted) |
|---|---|---|---|---|---|---|
| SimCLR | 2020 | Chen et al. / Google | Contrastive (NT-Xent loss, large batches) | Yes | ResNet-50 (4×) | 76.5% |
| MoCo v2 | 2020 | He et al. / Meta AI | Contrastive (momentum queue) | Yes | ResNet-50 | 71.1% |
| BYOL | 2020 | Grill et al. / DeepMind | Non-contrastive (online-target prediction) | No | ResNet-50 | 74.3% |
| SimSiam | 2021 | Chen & He / Meta AI | Non-contrastive (stop-gradient Siamese) | No | ResNet-50 | 71.3% |
| Barlow Twins | 2021 | Zbontar et al. / Meta AI | Non-contrastive (redundancy reduction) | No | ResNet-50 | 73.2% |
| BEiT | 2021 | Bao et al. / Microsoft | Masked image modeling (token prediction) | No | ViT-Base (fine-tuned) | 83.2% |
| DINO | 2021 | Caron et al. / Meta AI | Self-distillation (student-teacher) | No | ViT-Base | 80.1% |
| MAE | 2022 | He et al. / Meta AI | Masked image modeling (pixel reconstruction) | No | ViT-Huge (fine-tuned) | 87.8% |
| DINOv2 | 2023 | Oquab et al. / Meta AI | Self-distillation at scale (142M images) | No | ViT-Giant | 86.5% |
Speech and audio present unique challenges for self-supervised learning. Audio signals are continuous waveforms with temporal structure, and the relationship between acoustic features and linguistic content is complex and variable across speakers, accents, and recording conditions. Self-supervised methods for speech typically operate on raw waveform inputs or spectral features and learn representations that encode phonetic, speaker, and prosodic information.
Wav2Vec 2.0, introduced by Baevski et al. at Meta AI Research in June 2020, combines contrastive learning with quantization to learn speech representations from raw audio. The architecture has three components:
- A convolutional feature encoder that maps the raw waveform to a sequence of latent speech representations.
- A transformer context network that builds contextualized representations from the latent sequence, spans of which are masked during pretraining.
- A quantization module that discretizes the latent representations into a finite vocabulary of learned speech units; the contrastive objective requires the model to identify the true quantized latent for each masked position among a set of distractors.
Wav2Vec 2.0 demonstrated dramatic improvements in low-resource speech recognition. When pre-trained on 53,000 hours of unlabeled audio from LibriVox and fine-tuned on just 10 minutes of labeled data, it achieved a word error rate of 4.8/8.2 on the LibriSpeech test-clean/test-other benchmarks. Using only one hour of labeled data, it outperformed the previous state of the art trained on 100 times more labeled data.
HuBERT (Hidden-Unit BERT), introduced by Hsu et al. at Meta AI Research in June 2021, takes a different approach to self-supervised speech representation learning. Instead of contrastive learning with online quantization, HuBERT uses an offline clustering step to generate pseudo-labels:
- Acoustic features (initially MFCCs) are clustered with k-means, and the cluster assignments serve as discrete pseudo-labels for every frame.
- The model is trained with a BERT-style masked prediction objective: spans of the input are masked, and the model predicts the cluster label of each masked frame.
- The procedure is iterated: after a round of training, features from an intermediate transformer layer replace the MFCCs for re-clustering, yielding progressively better pseudo-labels.
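The first clustering iteration can be sketched with scikit-learn as below; `mfcc_frames` is an assumed array of frame-level MFCC features collected from the unlabeled corpus, and 100 clusters matches the first HuBERT iteration.

```python
from sklearn.cluster import KMeans

# Cluster frame-level acoustic features into discrete pseudo-units.
kmeans = KMeans(n_clusters=100, n_init=10).fit(mfcc_frames)
pseudo_labels = kmeans.labels_  # one discrete unit ID per frame
# These cluster IDs become the prediction targets of the masked prediction task.
```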
HuBERT matches or exceeds Wav2Vec 2.0 performance across all fine-tuning subsets of LibriSpeech. When pretrained on the Libri-Light 60,000-hour dataset, HuBERT achieved state-of-the-art results on several speech processing benchmarks. The approach has also been extended to multilingual settings and speech generation tasks.
Beyond Wav2Vec 2.0 and HuBERT, several other self-supervised methods have advanced speech representation learning:
- WavLM (Microsoft, 2021) extends HuBERT's masked prediction with a denoising objective in which overlapped and noisy speech is simulated during pretraining, improving performance on speaker-related tasks.
- data2vec (Meta AI, 2022) predicts contextualized latent targets produced by a teacher network and applies the same framework to speech, vision, and text.
- w2v-BERT (Google, 2021) combines contrastive learning with masked prediction in a single end-to-end model.
A fundamental distinction in self-supervised learning, particularly in computer vision, is between contrastive and non-contrastive methods. Both approaches learn by comparing different views of the same input, but they differ in how they prevent the model from learning trivial solutions (representational collapse).
Contrastive methods explicitly use negative examples to prevent collapse. The loss function pulls together representations of positive pairs (different views of the same input) while pushing apart representations of negative pairs (views of different inputs). The InfoNCE loss, used in many contrastive frameworks, formalizes this as a softmax classification problem over one positive and many negative examples.
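For a query embedding q with positive key k+ and negatives k_1, ..., k_K, the InfoNCE loss can be written as:

$$
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\operatorname{sim}(q, k^{+}) / \tau\right)}{\sum_{i=0}^{K} \exp\left(\operatorname{sim}(q, k_i) / \tau\right)}
$$

where sim denotes a similarity function (typically cosine similarity or dot product), τ is a temperature hyperparameter, and by convention k_0 = k+, so the denominator ranges over the positive and all K negatives.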
The effectiveness of contrastive learning depends on the quality and quantity of negative examples. SimCLR addresses this by using very large batch sizes (4,096), while MoCo maintains a separate queue of negatives. A key challenge is that contrastive methods can become inefficient in high-dimensional representation spaces, where the number of negatives required for effective training grows substantially.
Non-contrastive methods avoid the need for negative examples entirely. Instead, they prevent collapse through architectural asymmetries, regularization techniques, or information-theoretic objectives. The major non-contrastive approaches include:
- Architectural asymmetry: BYOL and SimSiam break the symmetry between branches with a predictor network, combined with a momentum encoder (BYOL) or a stop-gradient operation (SimSiam).
- Self-distillation: DINO matches student and teacher output distributions, using centering and sharpening of the teacher outputs to avoid collapse.
- Redundancy reduction: Barlow Twins drives the cross-correlation matrix between the embeddings of two views toward the identity matrix.
- Variance-covariance regularization: VICReg explicitly maintains the variance of each embedding dimension and decorrelates pairs of dimensions.
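As one concrete example of a regularization-based objective, the sketch below implements the Barlow Twins loss; z1 and z2 are assumed to be projection outputs for two views of the same batch, and λ = 5e-3 is the weighting reported in the paper.

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5e-3):
    N, D = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)        # standardize along the batch dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / N                     # [D, D] cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # push diagonal to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push rest to 0
    return on_diag + lam * off_diag
```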
| Aspect | Contrastive methods | Non-contrastive methods |
|---|---|---|
| Negative examples | Required; more negatives generally improve performance | Not required |
| Batch size sensitivity | Performance often depends on large batch sizes or external memory | Generally more robust to batch size |
| Loss function | InfoNCE, NT-Xent | MSE, cross-correlation, variance/covariance regularization |
| Collapse prevention | Explicit repulsion of negatives | Architectural asymmetry, regularization, or information-theoretic constraints |
| Computational cost | Can be expensive due to large batches or memory banks | Typically lower, but requires careful design to avoid collapse |
| Examples | SimCLR, MoCo, CLIP | BYOL, DINO, SimSiam, Barlow Twins, VICReg |
At a higher level of abstraction, self-supervised learning methods can be divided into two broad paradigms: joint-embedding methods and generative methods. This distinction, highlighted in the "Cookbook of Self-Supervised Learning" survey (Balestriero et al., 2023), captures a fundamental design choice about how the learning signal is constructed.
Joint-embedding methods (also called embedding-based or energy-based methods) map two views of the same input into a shared representation space and train the model to make the two embeddings similar. The model never reconstructs the raw input. SimCLR, MoCo, BYOL, DINO, Barlow Twins, and VICReg all fall into this category. The advantage is that the model is free to discard low-level details (exact pixel values, noise) and focus on high-level semantic content.
Generative methods reconstruct some form of the original input from a corrupted or partial version. Masked language modeling (BERT), next-token prediction (GPT), and masked image modeling (MAE, BEiT) are generative in nature. These methods provide a dense training signal (every masked position contributes to the loss), but they require the model to allocate capacity to low-level reconstruction, which may not always be useful for downstream tasks.
In practice, the most successful recent systems combine elements of both. DINOv2, for example, uses a joint-embedding self-distillation objective alongside a masked image modeling objective. The I-JEPA and V-JEPA frameworks represent a hybrid approach that performs prediction in representation space rather than input space.
The Joint-Embedding Predictive Architecture (JEPA) is a framework for self-supervised learning proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a departure from both contrastive methods and pixel-level generative methods, instead learning to predict in a learned abstract representation space.
In a JEPA, two encoder networks map inputs x and y into embedding spaces, producing representations sx and sy. A predictor network takes sx (and optionally a latent variable z) and predicts sy. The key principles are:
- Prediction happens in representation space rather than input space, so the encoders are free to discard unpredictable or semantically irrelevant detail.
- The latent variable z captures information needed for the prediction that is not available in x, allowing the architecture to handle uncertainty about y.
- Collapse, in which the encoders map every input to the same representation, is prevented with non-contrastive techniques such as regularizing the information content of the embeddings.
By predicting in representation space rather than in input space, JEPA avoids the need to model irrelevant low-level details (exact pixel values, background textures) and instead focuses on capturing high-level semantic content. This is a central motivation: LeCun argues that predicting every pixel in an image or every sample in an audio waveform wastes model capacity on perceptually irrelevant variation.
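A schematic JEPA training step might look like the sketch below. The module names are illustrative assumptions; in practice `enc_y` is typically an exponential moving average of `enc_x`, and the optional latent variable z is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def jepa_step(enc_x, enc_y, pred, x, y):
    s_x = enc_x(x)                    # context representation
    with torch.no_grad():
        s_y = enc_y(y)                # target representation (no gradient)
    s_y_hat = pred(s_x)               # predict the target in latent space
    return F.mse_loss(s_y_hat, s_y)   # loss measured in representation space
```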
I-JEPA (Image-based JEPA), introduced by Assran et al. at Meta AI in 2023, applies the JEPA framework to images. The method works as follows:
- A single large context block is sampled from the image, and several target blocks are sampled from the remaining regions.
- A context encoder (a Vision Transformer) processes the visible context block.
- A target encoder, maintained as an exponential moving average of the context encoder, produces representations of the target blocks.
- A predictor, conditioned on positional information about each target block, predicts the target representations from the context representation, with the loss computed in representation space rather than pixel space.
I-JEPA differs from MAE in that it predicts in representation space, not in pixel space. This design learns representations that emphasize semantic content over low-level texture and color information.
V-JEPA (Video JEPA), introduced by Bardes et al. at Meta AI in 2024, extends the JEPA framework to video. The model predicts masked spatio-temporal regions in a learned latent space, learning from the temporal structure of video without any text supervision, negative examples, or pixel-level reconstruction. V-JEPA pretraining is based solely on an unsupervised feature prediction objective.
V-JEPA 2 (2025) scaled the approach to over one million hours of internet video data and combined it with a small amount of robot interaction data. It achieved 77.3% top-1 accuracy on Something-Something v2 for motion understanding and state-of-the-art performance on human action anticipation on Epic-Kitchens-100.
LeJEPA, introduced by LeCun and Balestriero at Meta in late 2025, simplified the JEPA framework by combining the JEPA predictive loss with SIGReg (Sketched Isotropic Gaussian Regularization). LeJEPA removes the need for many of the engineering heuristics that earlier self-supervised methods relied on, such as momentum encoders, stop-gradients, and asymmetric architectures. The method can be implemented in approximately 50 lines of code, making it one of the most accessible self-supervised learning algorithms to date.
| Feature | Contrastive SSL | Generative SSL (MAE) | JEPA |
|---|---|---|---|
| Prediction space | Embedding similarity | Input (pixel) space | Learned representation space |
| Negative examples | Required | Not applicable | Not required |
| What is predicted | Whether two views match | Missing pixels or tokens | Abstract representations of missing regions |
| Low-level detail modeling | Avoided via embedding space | Required (pixel reconstruction) | Avoided by design |
| Flexibility | Primarily two views of the same input | Masked input reconstruction | Spatial, temporal, and cross-modal prediction |
| Examples | SimCLR, MoCo, CLIP | MAE, BEiT | I-JEPA, V-JEPA, LeJEPA |
Self-supervised learning has extended beyond single modalities to learn joint representations across vision, language, and audio.
CLIP (Contrastive Language-Image Pre-training), introduced by Radford et al. at OpenAI in January 2021, learns visual representations from natural language supervision. CLIP jointly trains an image encoder (a Vision Transformer or ResNet) and a text encoder (a transformer-based language model) on 400 million image-text pairs collected from the internet. The contrastive objective maximizes the cosine similarity between matching image-text pairs and minimizes it for non-matching pairs within each minibatch of 32,768 examples.
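The objective can be sketched as a symmetric cross-entropy over the pairwise similarity matrix, as below; the two encoders are assumed to be provided, and the temperature (a learned parameter in the actual model) is fixed here for simplicity.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / tau                          # [N, N] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # pair i matches pair i
    return 0.5 * (F.cross_entropy(logits, targets)        # image -> text direction
                  + F.cross_entropy(logits.t(), targets)) # text -> image direction
```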
CLIP enables zero-shot image classification: given an image and a set of textual class descriptions, the model selects the description whose embedding is most similar to the image embedding. Without any fine-tuning on ImageNet, CLIP matched the accuracy of a fully supervised ResNet-50. CLIP representations have become widely used as conditioning signals in text-to-image generation models such as Stable Diffusion and DALL-E.
Self-supervised learning is most commonly used as the first stage of a two-stage pipeline: pre-training followed by adaptation. This pipeline has become the dominant approach for building modern AI systems and is central to the concept of pre-trained models.
A large neural network (typically a transformer) is trained on a pretext task using a large corpus of unlabeled data. The goal is to learn general-purpose representations that capture the structure of the data domain. This stage is computationally expensive (often requiring thousands of GPU-hours, and millions for the largest foundation models) but needs to be performed only once.
The pretrained model is adapted to a specific downstream task using one of several strategies:
- Fine-tuning: all model parameters are updated on the downstream labeled data.
- Linear probing: the pretrained encoder is frozen and only a lightweight classifier head is trained.
- Parameter-efficient fine-tuning: a small number of additional parameters (adapters, low-rank updates such as LoRA, or soft prompts) are trained while the backbone stays frozen.
- Prompting and in-context learning: for large language models, the task is specified through instructions or examples in the input, with no weight updates at all.
This two-stage pipeline is the foundation of the foundation model paradigm, in which a single large pre-trained model serves as a starting point for many different tasks and applications.
Evaluating the quality of self-supervised representations is a critical and nuanced problem. Because SSL methods do not optimize for any specific downstream task, researchers use several standardized evaluation protocols to assess how useful the learned representations are.
Linear probing (also called linear evaluation) is the most widely used evaluation protocol for SSL. A linear classifier (single fully-connected layer) is trained on top of the frozen pretrained encoder using a labeled dataset such as ImageNet. The pretrained encoder's weights are not updated during this process. High linear probing accuracy indicates that the pretrained features are linearly separable with respect to the downstream task, meaning they already encode semantically meaningful information.
Linear probing is favored because it isolates the quality of the representations from the capacity of the downstream model. If a complex neural network is used for evaluation, it might compensate for poor representations through its own learning.
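A minimal linear-probing loop might look like the sketch below; the encoder, data loader, and dimensions are assumptions, and standard details such as learning-rate schedules and data augmentation are omitted.

```python
import torch
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim: int, num_classes: int, epochs: int = 10):
    encoder.eval()                          # freeze batch-norm/dropout behavior
    for p in encoder.parameters():
        p.requires_grad_(False)             # encoder weights are never updated
    probe = torch.nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = encoder(x)          # frozen pretrained features
            loss = F.cross_entropy(probe(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```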
In fine-tuning evaluation, the entire pretrained model is trained end-to-end on the downstream task. All parameters are updated, allowing the representations to adapt to the specific task and dataset. Fine-tuning generally produces higher accuracy than linear probing because the model can adjust its features. However, it provides less insight into the intrinsic quality of the pretrained representations, since a powerful model architecture can partially compensate for weaker pre-training.
k-NN evaluation extracts features from the frozen pretrained encoder for both training and test images, then classifies each test image by majority vote among its k nearest neighbors in the training set (measured by Euclidean distance or cosine similarity in the feature space). This protocol requires no training at all, making it fast and computationally lightweight. k-NN accuracy is highly correlated with linear probing accuracy when embedding normalization is applied, and the two metrics can often be used interchangeably.
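With features already extracted from the frozen encoder (the arrays below are assumptions), the entire protocol reduces to a few lines of scikit-learn; k = 20 with cosine distance is a common choice, used for example in the DINO evaluations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Classify each test feature by majority vote among its 20 nearest training features.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(train_feats, train_labels)      # "training" just stores the feature bank
accuracy = knn.score(test_feats, test_labels)
```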
| Protocol | Pretrained weights | Evaluation model | Computational cost | What it measures |
|---|---|---|---|---|
| Linear probing | Frozen | Single linear layer | Low | Quality of frozen representations |
| Fine-tuning | Updated | Full pretrained model | High | Upper bound on task performance with pretrained initialization |
| k-NN | Frozen | None (nearest neighbor lookup) | Very low | Cluster structure of the representation space |
| Few-shot | Frozen or minimal adaptation | Linear or lightweight head | Low | Generalization from very few labeled examples |
Self-supervised learning is the engine behind the foundation model paradigm, in which a single large model is pretrained on broad data and then adapted to many downstream tasks.
The effectiveness of self-supervised pretraining scales with both model size and data volume. In NLP, scaling from GPT-1 (117M parameters, approximately 5 GB text) to GPT-3 (175B parameters, approximately 570 GB text) led to dramatic improvements in few-shot and zero-shot capabilities. In vision, DINOv2 demonstrated that self-supervised models trained on 142 million curated images produce features that rival or exceed supervised pretraining for classification, segmentation, and depth estimation. DINOv3 (2025) pushed this further with 1.7 billion images and a 7-billion-parameter ViT.
Self-supervised pretrained models require far less labeled data for downstream tasks compared to training from scratch. Wav2Vec 2.0 demonstrated that just 10 minutes of labeled speech data, combined with self-supervised pretraining, can achieve competitive speech recognition. In NLP, BERT fine-tuned on a few thousand labeled examples routinely outperforms models trained from scratch on much larger labeled datasets.
Large self-supervised models exhibit capabilities that are not explicitly trained for:
- In-context learning: GPT-3 can perform new tasks from a few examples in the prompt, despite being trained only on next-token prediction.
- Emergent segmentation: DINO's self-attention maps delineate object boundaries without any segmentation labels.
- Zero-shot transfer: CLIP classifies images into categories it never saw as explicit training labels.
- Cross-lingual transfer: multilingual models pretrained with masked language modeling can transfer task knowledge across languages with little or no target-language supervision.
| Modality | Method | Pretext task | Key result |
|---|---|---|---|
| Text | BERT | Masked language modeling | SOTA on 11 NLP benchmarks at release (2018) |
| Text | GPT-3 | Next-token prediction | Few-shot learning without fine-tuning (2020) |
| Text | T5 | Span corruption | Unified text-to-text framework across NLP tasks (2019) |
| Images | SimCLR | Contrastive (augmented views) | 76.5% ImageNet linear eval with ResNet-50 (4×) (2020) |
| Images | MoCo | Contrastive (momentum queue) | Decoupled batch size from number of negatives (2019) |
| Images | BYOL | Non-contrastive (self-prediction) | 74.3% ImageNet without negative examples (2020) |
| Images | SimSiam | Non-contrastive (stop-gradient) | Simplified non-contrastive SSL without momentum (2021) |
| Images | BEiT | Masked image modeling (token prediction) | First SSL to outperform supervised ViT pre-training (2021) |
| Images | DINO | Self-distillation | Emergent object segmentation in ViT attention maps (2021) |
| Images | MAE | Masked image modeling | 87.8% ImageNet fine-tuned with ViT-Huge (2022) |
| Images | I-JEPA | JEPA (representation prediction) | Semantic features without pixel reconstruction (2023) |
| Speech | Wav2Vec 2.0 | Contrastive with quantization | Competitive ASR with 10 min labeled data (2020) |
| Speech | HuBERT | Masked prediction with offline clustering | Matched/exceeded Wav2Vec 2.0 across benchmarks (2021) |
| Multimodal | CLIP | Image-text contrastive | Zero-shot ImageNet classification matching supervised ResNet-50 (2021) |
| Video | V-JEPA | Spatio-temporal representation prediction | Action understanding without text or pixel reconstruction (2024) |
Self-supervised learning has shown transformative results in a variety of domains, including:
Computer vision: SSL techniques have been used to learn powerful representations from large-scale image datasets, which can then be fine-tuned for tasks like object detection, image segmentation, and classification. DINOv2 features serve as general-purpose visual features for medical imaging, autonomous driving, and robotics applications.
Natural language processing: Language models like BERT and GPT have achieved state-of-the-art results on numerous NLP benchmarks by leveraging self-supervised pre-training on large text corpora. The GPT series demonstrated that scaling autoregressive pretraining leads to emergent few-shot and zero-shot capabilities.
Speech recognition: Wav2Vec 2.0 and HuBERT have dramatically reduced the amount of labeled data required for speech recognition systems, enabling competitive performance in low-resource languages where labeled transcriptions are scarce.
Reinforcement learning: SSL has been used to learn useful features from raw sensory data in reinforcement learning settings, enabling agents to learn more efficiently and generalize better across tasks. V-JEPA 2 has demonstrated the potential for self-supervised video models to support planning in physical environments.
Medical imaging: Self-supervised pretraining on unlabeled medical scans, followed by fine-tuning on small labeled datasets, has improved diagnostic accuracy for radiology, pathology, and dermatology applications. Studies have shown that SSL-pretrained models can match fully supervised models while requiring 5 to 10 times fewer labeled examples.
Robotics: Self-supervised visual and multimodal representations are used for manipulation tasks, navigation, and sim-to-real transfer, where labeled data is particularly difficult to collect.
Despite its successes, self-supervised learning faces several ongoing challenges:
- Computational cost: pretraining large models on massive unlabeled corpora requires substantial compute, energy, and engineering resources.
- Pretext-task design: the choice of pretext task and data augmentations strongly shapes which invariances are learned, and a mismatch with the downstream task can hurt transfer.
- Representational collapse: joint-embedding methods require careful mechanisms (negatives, stop-gradients, regularization) to avoid trivial constant solutions.
- Evaluation: no single protocol fully captures representation quality, and conclusions can differ across linear probing, fine-tuning, and k-NN evaluation.
- Data quality and bias: models trained on uncurated web-scale data can absorb and amplify biases present in that data.
- Theory: a complete theoretical account of why self-supervised objectives produce transferable representations remains an open problem.