# Self-Supervised Learning

> Source: https://aiwiki.ai/wiki/self-supervised_learning
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Self-supervised learning (SSL) is a machine learning approach in which a model learns representations from unlabeled data by generating its own supervisory signal from the structure of the data itself, rather than from human-provided labels. It works by defining a pretext task, such as predicting a masked word in a sentence or a hidden patch of an image, so that the target for each training example is derived automatically from the input. Self-supervised pretraining is the dominant first stage behind nearly every modern [foundation model](/wiki/foundation_models), including [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/generative_pre-trained_transformer), [CLIP](/wiki/clip), and [DINO](/wiki/dino). [Yann LeCun](/wiki/yann_lecun) has called it "the dark matter of intelligence."[18]

## What is self-supervised learning?

Self-supervised learning (SSL) is a subfield of [machine learning](/wiki/machine_learning) that focuses on learning representations of data without human-provided labels by exploiting the structure and inherent properties of the data itself. Rather than requiring manually annotated labels, SSL algorithms generate supervisory signals directly from the input data by defining pretext tasks that force models to learn meaningful internal representations. This approach has gained significant traction since the mid-2010s because it enables algorithms to learn useful features from large volumes of unlabeled data, thereby reducing the reliance on expensive labeled datasets. The learned representations can then be [fine-tuned](/wiki/fine_tuning) for a wide range of downstream tasks, such as [image classification](/wiki/image_classification_models), [natural language processing](/wiki/natural_language_processing), [speech recognition](/wiki/speech_recognition), and [reinforcement learning](/wiki/reinforcement_learning).

Self-supervised learning has become the dominant pretraining paradigm behind modern [foundation models](/wiki/foundation_models). Nearly all large-scale AI systems introduced since 2018, including [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/generative_pre-trained_transformer), [T5](/wiki/t5), [CLIP](/wiki/clip), [DINO](/wiki/dino), and Wav2Vec 2.0, rely on self-supervised pretraining as the first stage of their training pipeline. By learning from raw, unlabeled corpora of text, images, audio, or video, these models acquire general-purpose representations that transfer effectively to hundreds of specialized tasks with minimal labeled data.

## ELI5 (Explain like I'm 5)

Imagine you are given a picture book, but someone has covered up parts of every picture with sticky notes. Your job is to guess what is hidden under each sticky note. Nobody tells you the answers, but by looking at thousands of pictures and figuring out what fits, you start to understand what dogs, trees, and cars look like. That is essentially what self-supervised learning does: the computer hides parts of its own data (words in a sentence, patches of an image, segments of audio) and then tries to guess what was hidden. By solving these "fill in the blank" puzzles millions of times, the computer builds a deep understanding of language, images, or sound, all without a human teacher labeling every example.

Another analogy: think of a jigsaw puzzle. Nobody tells you what the finished picture should look like, but by fitting the pieces together you learn a lot about shapes, colors, and scenes. Self-supervised learning creates its own jigsaw puzzles from raw data, and the process of solving them teaches the model useful patterns it can later apply to real tasks like translating languages or identifying objects in photos.

## Motivation and background

### Why is self-supervised learning needed?

The conventional [supervised machine learning](/wiki/supervised_machine_learning) paradigm requires large amounts of labeled data to train accurate models. However, obtaining such labeled data can be expensive, time-consuming, and infeasible for some domains. Medical imaging, for instance, requires expert radiologists to annotate each scan, while speech transcription in low-resource languages may lack native transcribers entirely. Moreover, supervised learning models may not generalize well to new, unseen data, as they are often biased towards the specific distribution of the training set. In contrast, self-supervised learning aims to leverage the abundance of unlabeled data available in the wild, allowing models to learn meaningful representations without explicit supervision.

### Unsupervised learning and representation learning

[Unsupervised machine learning](/wiki/unsupervised_machine_learning) methods, such as clustering and [dimensionality reduction](/wiki/dimensionality_reduction), have long been used to analyze and discover structures in data without relying on labels. Self-supervised learning builds upon these foundations by focusing on learning rich, high-level representations of data that can be used as a starting point for various downstream tasks. By doing so, SSL bridges the gap between unsupervised learning and supervised learning, exploiting the benefits of both paradigms.

The term "self-supervised learning" was popularized by [Yann LeCun](/wiki/yann_lecun), who argued that it more accurately describes the mechanism at work compared to the broader label of "unsupervised learning."[18] In self-supervised learning, the supervision signal is not absent; it is derived automatically from the data. For example, predicting a masked word in a sentence or predicting the next frame in a video provides a concrete training objective, even though no human annotator supplied the target.

### The "dark matter of intelligence"

In a March 2021 Meta AI blog post, Yann LeCun and Ishan Misra described self-supervised learning as the "dark matter of intelligence."[18] The analogy draws on cosmology: just as dark matter constitutes the vast majority of the universe's mass yet remains invisible, the vast majority of learning that biological organisms perform is self-supervised rather than supervised. As the post puts it, "common sense is the dark matter of artificial intelligence," and "as babies, we learn how the world works largely by observation."[18] Humans and animals learn to understand the world largely through observation, not through labeled examples. A child does not need someone to label every object in a room to learn what a chair looks like; instead, the child builds mental models by predicting what will happen next and filling in gaps in perception.

LeCun argued that self-supervised learning is one of the most promising paths toward building AI systems that approach human-level common sense. The Meta AI post states directly: "We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems."[18] This framing builds on LeCun's earlier and widely cited "cake" analogy, first presented at NIPS 2016 and updated in 2019: "If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning."[24][25] The argument has been influential in motivating research into methods like JEPA that seek to learn world models through prediction in abstract representation spaces.

### How does self-supervised learning compare to other learning paradigms?

Self-supervised learning occupies a distinct position among the major machine learning paradigms. The following table summarizes the key differences.

| Aspect | [Supervised learning](/wiki/supervised_machine_learning) | [Unsupervised learning](/wiki/unsupervised_machine_learning) | [Semi-supervised learning](/wiki/semi-supervised_learning) | Self-supervised learning |
|---|---|---|---|---|
| Labels required | Full labeled dataset | No labels | Small labeled set + large unlabeled set | No labels (labels derived from data) |
| Training signal | Human-provided labels | Data structure (clusters, distributions) | Combination of labels and consistency regularization | Pretext task generated from the data itself |
| Typical goal | Predict target variable | Discover hidden patterns | Improve supervised model with unlabeled data | Learn general-purpose representations |
| Common workflow | Train directly on labeled data | Clustering, density estimation, dimensionality reduction | Joint training on labeled and unlabeled data | Pre-train on pretext task, then fine-tune on downstream task |
| Data efficiency | Requires large labeled datasets | Works on unlabeled data | Reduces labeling cost | Leverages massive unlabeled corpora |
| Examples | Image classification with ImageNet labels, sentiment analysis | K-means clustering, PCA, autoencoders | FixMatch, MixMatch, pseudo-labeling | BERT (masked language modeling), SimCLR (contrastive learning), MAE (masked image modeling) |

A key distinction is that [semi-supervised learning](/wiki/semi-supervised_learning) and self-supervised learning both use unlabeled data, but they do so differently. Semi-supervised methods typically train a single model jointly on a small labeled set and a large unlabeled set, using techniques such as consistency regularization or pseudo-labeling. Self-supervised methods, by contrast, define an explicit pretext task that requires no labels at all. The resulting pretrained model is then adapted to downstream tasks through [transfer learning](/wiki/transfer_learning), usually via fine-tuning or linear probing.

### Historical context

The roots of self-supervised learning trace back to early work on distributed word representations and autoencoders. Autoencoders, which learn to compress and reconstruct their input through a bottleneck layer, represent one of the earliest forms of learning useful representations without labels. Denoising autoencoders (Vincent et al., 2008) took this further by corrupting the input and training the network to recover the original, an approach that foreshadowed modern masked prediction methods.

In 2013, Tomas Mikolov and colleagues at Google published [Word2Vec](/wiki/word2vec), which learned word [embeddings](/wiki/embeddings) by predicting context words given a target word (skip-gram) or predicting a target word given its context (continuous bag of words, CBOW).[20] Although the term "self-supervised" was not commonly used at the time, Word2Vec exemplified the core principle: constructing supervision from the data itself. Later methods such as GloVe (Pennington et al., 2014) and [fastText](/wiki/fasttext) (Bojanowski et al., 2016) extended this idea to capture global co-occurrence statistics and subword information, respectively.

The release of [ELMo](/wiki/elmo) in 2018 by Peters et al. demonstrated that deep, context-dependent word representations trained with language modeling objectives could substantially improve downstream NLP tasks. This set the stage for the [transformer](/wiki/transformer)-based self-supervised revolution that followed with BERT and GPT.

## Taxonomy of pretext tasks

Self-supervised methods can be organized by the type of pretext task they use to extract supervision from raw data. The three broad families are predictive methods, contrastive methods, and generative (reconstructive) methods.

### Predictive methods

Predictive pretext tasks require the model to predict some part of the input from the remaining parts. Examples include:

- **Masked prediction:** Hide a portion of the input (tokens, patches, audio frames) and train the model to recover the hidden content. BERT's masked language modeling and MAE's masked image modeling are prominent examples.
- **Autoregressive prediction:** Predict the next element in a sequence given all preceding elements. The GPT family of models uses next-token prediction as its pretext task.
- **Span corruption:** Replace contiguous spans of tokens with sentinel tokens and train the model to reconstruct the original spans. T5 uses this approach.

### Contrastive methods

Contrastive methods learn representations by pulling together embeddings of semantically similar ("positive") pairs and pushing apart embeddings of dissimilar ("negative") pairs. The model does not reconstruct the input; instead, it learns an embedding space where similarity reflects semantic relatedness. [SimCLR](/wiki/simclr), [MoCo](/wiki/moco), and CLIP are well-known contrastive approaches.

### Generative and reconstructive methods

Generative pretext tasks require the model to reconstruct the original input from a corrupted or partial version. [Variational autoencoders](/wiki/variational_autoencoder) (VAEs), denoising autoencoders, and masked autoencoders (MAE) fall into this category. While generative models like [GANs](/wiki/generative_adversarial_network) can also be used for representation learning, their primary training signal comes from an adversarial game rather than direct reconstruction.

## Self-supervised learning in natural language processing

Natural language processing has been one of the most successful application domains for self-supervised learning. The key insight is that raw text contains rich structure that can be exploited as a training signal. Three principal pretext tasks have dominated NLP pretraining: masked language modeling, next-token prediction, and span corruption.

### Masked language modeling (BERT)

BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at [Google](/wiki/google) in October 2018, pioneered the masked language modeling (MLM) approach.[1] During pretraining, 15% of the input tokens are selected at random. Of these selected tokens, 80% are replaced with a special [MASK] token, 10% are replaced with a random word, and 10% are left unchanged. The model must predict the original identity of each selected token using the bidirectional context provided by the surrounding words.
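The 15% selection and 80/10/10 replacement rule can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the per-token independent selection (rather than selecting an exact 15% of positions) are all hypothetical choices made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    Assumption: each position is selected independently with select_prob.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected; no prediction target
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot assume that every non-[MASK] token in its input is correct.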

BERT also employed a secondary pretext task called next sentence prediction (NSP), in which the model received two sentences and predicted whether the second sentence followed the first in the original document. BERT was pretrained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words).

The BERT-Base configuration uses 12 transformer encoder layers, 12 [attention](/wiki/attention) heads, and a hidden size of 768, totaling 110 million parameters. BERT-Large scales to 24 layers, 16 heads, and a hidden size of 1,024, for a total of 340 million parameters. After pretraining, BERT "obtains new state-of-the-art results on eleven natural language processing tasks," including pushing the [GLUE](/wiki/glue_benchmark) score to 80.5% and [SQuAD](/wiki/squad) v1.1 test F1 to 93.2.[1]

Several variants refined the MLM objective:

- **[RoBERTa](/wiki/roberta)** (Liu et al., 2019) removed the NSP task, used dynamic masking instead of static masking, trained on a larger dataset (160 GB of text), and used larger batch sizes, achieving 2 to 20 percent improvements over BERT on several benchmarks.[22]
- **[ALBERT](/wiki/albert)** (Lan et al., 2019) introduced factorized embedding parameterization and cross-layer parameter sharing to drastically reduce the model's parameter count while maintaining performance. It replaced NSP with a sentence-order prediction task.
- **[ELECTRA](/wiki/electra)** (Clark et al., 2020) replaced masked token prediction with a replaced token detection task: a small generator network replaces some tokens, and the main discriminator network predicts which tokens were replaced. Because every input token contributes to the loss (not just the 15% that are masked), ELECTRA trains more efficiently and matches RoBERTa and XLNet performance with less than 25% of the compute.[23]

### Next-token prediction (GPT)

The GPT (Generative Pre-trained Transformer) series, developed by [OpenAI](/wiki/openai), uses autoregressive language modeling as its self-supervised pretext task. The model processes a sequence of tokens from left to right and predicts the next token at each position. The training objective is to maximize the likelihood of the next token given all preceding tokens, using causal (left-to-right) attention masking so that each position can only attend to earlier positions.
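A minimal sketch of the pieces described above: the causal mask, the one-position shift that turns a sequence into input/target pairs, and the average negative log-likelihood. The function names and the dictionary-based probability format are assumptions chosen for readability; real implementations operate on tensors of logits.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def next_token_pairs(token_ids):
    """Shift by one: the target at each position is the following token."""
    return token_ids[:-1], token_ids[1:]

def negative_log_likelihood(probs_per_position, targets):
    """Average -log p(target | prefix) over positions.

    probs_per_position[t] maps token id -> predicted probability at step t.
    """
    total = -sum(math.log(p[t]) for p, t in zip(probs_per_position, targets))
    return total / len(targets)
```

Minimizing this average negative log-likelihood is equivalent to maximizing the likelihood of each next token given its prefix.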

GPT-1 (Radford et al., 2018) demonstrated that generative pretraining on a large text corpus followed by discriminative fine-tuning could achieve strong performance across diverse NLP tasks.[2] GPT-2 (2019) scaled the approach to 1.5 billion parameters and showed that language models could perform tasks in a zero-shot setting without any fine-tuning. [GPT-3](/wiki/gpt-3) (2020) further scaled to 175 billion parameters and introduced few-shot in-context learning, where the model could perform new tasks by conditioning on a handful of examples in the prompt. GPT-4 (2023) and GPT-5 (2025) continued this trajectory, combining autoregressive pretraining with [reinforcement learning from human feedback](/wiki/rlhf) (RLHF) and other alignment techniques.

The autoregressive pretraining approach has also been adopted by many other [large language models](/wiki/large_language_model), including [LLaMA](/wiki/llama) (Meta), [Mistral](/wiki/mistral), Falcon, [Qwen](/wiki/qwen) (Alibaba), and [DeepSeek](/wiki/deepseek).

### Span corruption (T5)

[T5](/wiki/t5) (Text-to-Text Transfer Transformer), introduced by Raffel et al. at Google in 2019, unified all NLP tasks into a text-to-text format, where both the input and output are text strings.[3] Its self-supervised pretraining objective is span corruption, a denoising task that randomly selects and drops out 15% of the tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token that is unique within the example (e.g., \<X\>, \<Y\>). The model is then trained to generate the missing spans, delimited by the corresponding sentinel tokens.

For example, given the input sentence "The quick brown fox jumps over the lazy dog," the words "brown fox" and "lazy" might be dropped out. The corrupted input would become "The quick \<X\> jumps over the \<Y\> dog," and the target output would be "\<X\> brown fox \<Y\> lazy \<Z\>." Because the model must reconstruct multiple consecutive tokens at once, span corruption encourages the learning of richer contextual representations than single-token masking.
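The worked example above can be reproduced with a short sketch. It assumes the spans to drop are already chosen and uses the article's \<X\>, \<Y\>, \<Z\> notation for sentinels; the function name `span_corrupt` and the fixed sentinel list are hypothetical, and real tokenizers use their own sentinel vocabulary.

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]  # illustrative names following the text above

def span_corrupt(tokens, spans):
    """Build (corrupted_input, target) from spans to drop.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    if len(spans) >= len(SENTINELS):
        raise ValueError("not enough sentinel tokens for this many spans")
    corrupted, target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel replaces the whole span
        target.append(sentinel)
        target.extend(tokens[start:end])        # target spells out the dropped span
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(SENTINELS[len(spans)])        # final sentinel closes the target
    return corrupted, target

tokens = "The quick brown fox jumps over the lazy dog".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (7, 8)])
print(" ".join(corrupted))  # The quick <X> jumps over the <Y> dog
print(" ".join(target))     # <X> brown fox <Y> lazy <Z>
```

The target contains only the dropped spans and their sentinels, which keeps it much shorter than the input.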

T5 uses an encoder-decoder transformer architecture, unlike the encoder-only BERT or decoder-only GPT. The original T5 paper systematically evaluated a wide range of pretraining objectives, architectures, and dataset sizes using the Colossal Clean Crawled Corpus (C4), a 750 GB dataset derived from [Common Crawl](/wiki/common_crawl).[3] T5 models range from T5-Small (60 million parameters) to T5-11B (11 billion parameters). Subsequent versions include mT5 (multilingual T5), Flan-T5 (instruction-tuned), and UL2, which combines multiple pretraining objectives.

### Comparison of NLP self-supervised objectives

| Method | Pretext task | Architecture | Directionality | Key innovation | Notable models |
|---|---|---|---|---|---|
| Masked language modeling | Predict masked tokens from context | Encoder-only | Bidirectional | Learns from both left and right context simultaneously | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert) |
| Next-token prediction | Predict next token autoregressively | Decoder-only | Left-to-right (causal) | Scales naturally to generation tasks; enables in-context learning | [GPT](/wiki/generative_pre-trained_transformer), [LLaMA](/wiki/llama), [Mistral](/wiki/mistral) |
| Span corruption | Reconstruct corrupted spans delimited by sentinels | Encoder-decoder | Bidirectional encoder, autoregressive decoder | Predicts multi-token spans; shorter target sequences reduce training cost | [T5](/wiki/t5), mT5, UL2 |
| Replaced token detection | Detect which tokens were replaced by a generator | Encoder-only | Bidirectional | All tokens contribute to the loss, not just 15% | [ELECTRA](/wiki/electra) |
| Permutation language modeling | Predict tokens in random order to capture bidirectional context | Transformer-XL based (autoregressive) | Permuted | Combines benefits of autoregressive and bidirectional models | [XLNet](/wiki/xlnet) |

## Self-supervised learning in computer vision

Applying self-supervised learning to images presents a different set of challenges compared to text. Language has a natural sequential structure and discrete tokens, whereas images are high-dimensional, continuous signals without an obvious ordering. Early SSL methods in vision used hand-designed pretext tasks such as predicting image rotations, solving jigsaw puzzles, or colorizing grayscale images. While these methods yielded useful representations, they were eventually surpassed by contrastive learning and masked image modeling approaches that learn more general features.

### Early pretext tasks

Before the contrastive learning era, researchers devised several creative pretext tasks for visual SSL:

- **Rotation prediction** (Gidaris et al., 2018): The model receives an image rotated by 0, 90, 180, or 270 degrees and must predict which rotation was applied. Learning to solve this task requires understanding object orientation and scene layout.
- **Jigsaw puzzles** (Noroozi and Favaro, 2016): An image is divided into a grid of patches, the patches are shuffled, and the model must predict the correct spatial arrangement. This forces the model to learn about spatial relationships between object parts.
- **Colorization** (Zhang et al., 2016): A color image is converted to grayscale, and the model must predict the original colors. Accurate colorization requires understanding object semantics (grass is green, sky is blue).
- **Inpainting** (Pathak et al., 2016): A region of the image is removed, and the model must fill in the missing content, learning about object structure and scene context.
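As an illustration of how such pretext tasks derive labels from the data, the sketch below builds a rotation-prediction training example. It represents an image as a plain list of rows to stay self-contained; the helper names are hypothetical and this is not the original authors' code.

```python
import random

def rotate90(image, k):
    """Rotate a 2-D list-of-lists image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # transpose, then reverse the row order: one counter-clockwise quarter turn
        image = [list(row) for row in zip(*image)][::-1]
    return image

def make_rotation_example(image, rng):
    """Return (rotated_image, label); label 0..3 indexes 0/90/180/270 degrees."""
    label = rng.randrange(4)
    return rotate90(image, label), label

image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, random.Random(0))
```

The label costs nothing to produce: it is simply the index of the transformation that was applied.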

While these methods produced representations that outperformed random initialization, they often encoded task-specific biases. For example, a rotation predictor might focus on texture cues rather than semantic content. The move to contrastive and masked modeling methods addressed this limitation by learning more general-purpose features.

### SimCLR

SimCLR (A Simple Framework for Contrastive Learning of Visual Representations), introduced by Chen et al. at Google Research in February 2020, demonstrated that a straightforward contrastive framework could match or exceed earlier, more complex methods.[4] The SimCLR pipeline has four components:

1. **[Data augmentation](/wiki/data_augmentation):** Two random augmentations (random crop-and-resize, color distortion, Gaussian blur) are applied to each image in a minibatch, producing two correlated "views" of the same image.
2. **Base encoder:** A [convolutional neural network](/wiki/convolutional_neural_network) (typically [ResNet](/wiki/resnet)-50) extracts feature vectors from each augmented view.
3. **Projection head:** A small multilayer perceptron (MLP) maps the encoder output to a lower-dimensional space where the contrastive loss is applied.
4. **Contrastive loss (NT-Xent):** The Normalized Temperature-scaled Cross-Entropy loss maximizes the cosine similarity between the two views of the same image (positive pair) while minimizing similarity with all other images in the batch (negative pairs). A temperature parameter (typically 0.1) controls the sharpness of the distribution.
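The NT-Xent loss in step 4 can be sketched in plain Python. This is an unoptimized illustration that assumes non-zero embedding vectors supplied as lists of floats; the function names are hypothetical, and practical implementations compute the same quantity with batched matrix operations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(views_a, views_b, temperature=0.1):
    """NT-Xent loss for N positive pairs (views_a[i], views_b[i]).

    All 2N embeddings are pooled. For each anchor, its partner view is the
    positive and the remaining 2N - 2 embeddings act as negatives.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        logits = {j: cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i}
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denominator - logits[positive]  # cross-entropy for this anchor
    return total / (2 * n)
```

A lower temperature amplifies differences in similarity, so the hardest negatives dominate the denominator.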

SimCLR benefits from very large batch sizes (4,096 by default in the original paper) because the other images in the batch serve as its negative examples. After pretraining, the projection head is discarded, and the encoder is used for downstream tasks. SimCLR achieved 76.5% top-1 accuracy on [ImageNet](/wiki/imagenet) with linear evaluation using a 4x-wide ResNet-50 backbone, a 7% relative improvement over the previous state of the art that matched the performance of a supervised ResNet-50; with a standard-width ResNet-50 it reached 69.3%.[4] SimCLR v2 (2020) scaled the framework to larger ResNet models and introduced a semi-supervised variant that leveraged a small amount of labeled data.

### MoCo

MoCo (Momentum Contrast), introduced by He et al. at [Meta](/wiki/meta_ai) AI Research in November 2019, tackled the large-batch requirement of contrastive learning by maintaining a dynamic queue of negative representations.[5] The key components are:

- **Query encoder:** Processes the current image view and is updated by standard backpropagation.
- **Momentum encoder:** Processes a second view of the same image. Its parameters are not updated directly by gradient descent; instead, they are updated as an exponential moving average of the query encoder's parameters, using a momentum coefficient m (typically 0.999).
- **Queue:** A first-in-first-out queue stores the representations produced by the momentum encoder from recent minibatches. This decouples the number of negative examples from the batch size, allowing contrastive training with thousands of negatives even on modest hardware.
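The momentum update and the queue can be sketched as follows. Parameters are modeled as flat lists of floats and embeddings as opaque items; the names `momentum_update` and `NegativeQueue` are hypothetical and chosen only for this illustration.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """EMA update: key <- m * key + (1 - m) * query, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class NegativeQueue:
    """Fixed-size FIFO of key embeddings from recent minibatches."""

    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically
        self._items = deque(maxlen=max_size)

    def enqueue(self, keys):
        self._items.extend(keys)

    def negatives(self):
        return list(self._items)
```

With m close to 1 the momentum encoder changes slowly, which keeps embeddings produced several minibatches ago roughly consistent with the newest ones in the queue.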

MoCo v1 reached 60.6% top-1 accuracy on ImageNet under linear evaluation with a ResNet-50 while using a batch size of only 256.[5] MoCo v2 (2020) incorporated improvements inspired by SimCLR, including an MLP projection head and stronger augmentations, reaching 71.1% with a ResNet-50 at the same batch size. MoCo v3 (Chen et al., 2021) adapted the framework for [Vision Transformers](/wiki/vision_transformer) (ViT) and addressed training instability issues that arise when using transformers with contrastive objectives.

### BYOL

BYOL (Bootstrap Your Own Latent), published by Grill et al. at [DeepMind](/wiki/deepmind) in June 2020, challenged the assumption that negative pairs are essential for contrastive learning.[6] BYOL uses two networks:

- **Online network:** Consists of an encoder, a projector, and a predictor. The predictor is a key architectural element that prevents representational collapse.
- **Target network:** Consists of an encoder and a projector (no predictor). Its parameters are updated as an exponential moving average of the online network's parameters.

The online network is trained to predict the target network's representation of a differently augmented view of the same image. Because the target network updates slowly via the momentum mechanism, it provides a stable regression target. BYOL avoids the need for negative examples entirely, which makes it more robust to batch size variations.

BYOL achieved 74.3% top-1 accuracy on ImageNet using linear evaluation with a ResNet-50 encoder and 79.6% with a larger ResNet.[6] It demonstrated stable performance across batch sizes ranging from 256 to 4,096, whereas SimCLR's performance degrades significantly with smaller batches.

### SimSiam

SimSiam (Exploring Simple Siamese Representation Learning), published by Chen and He at Meta AI in 2021, further simplified non-contrastive learning by removing the momentum encoder entirely.[16] SimSiam uses a simple Siamese network with a shared encoder and a prediction MLP applied to one branch. The critical innovation is a stop-gradient operation: one branch's output is treated as a fixed target (receives no gradient), while the other branch is trained to predict it.

This design means SimSiam requires neither negative pairs (as SimCLR does), nor a momentum encoder (as BYOL and MoCo do), nor large batches. Despite its simplicity, SimSiam achieves competitive performance on ImageNet, with 71.3% top-1 accuracy under linear evaluation using ResNet-50. The paper provided theoretical analysis suggesting that SimSiam implicitly performs a form of Expectation-Maximization (EM) optimization, alternating between clustering assignments and representation updates.[16]

### DINO and DINOv2

DINO (Self-Distillation with No Labels), introduced by Caron et al. at Meta AI Research in April 2021, is a self-supervised method based on self-distillation using Vision Transformers.[7] DINO uses a student-teacher framework where both networks share the same architecture:

- The student network receives all crops of the image, both local (small) and global (large) views.
- The teacher network receives only the global crops (large views).
- The student is trained to match the teacher's output distribution using a cross-entropy loss.
- The teacher's parameters are updated as an exponential moving average of the student's parameters.
- A centering and sharpening mechanism prevents mode collapse.
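The loss and the centering step listed above can be sketched for a single pair of views. The temperature and momentum defaults below are assumptions for illustration (they are not stated in this article), and the function names are hypothetical.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    # a low teacher temperature sharpens the target distribution
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def update_center(center, teacher_batch, momentum=0.9):
    """EMA of the batch-mean teacher logits, used to center later outputs."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]
```

Centering discourages one output dimension from dominating, while sharpening discourages a uniform output; the two effects balance each other. In training, no gradient flows through the teacher branch.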

A notable discovery from DINO was that self-supervised Vision Transformers learn to segment objects without any explicit segmentation supervision. The self-attention maps of the final layer's [CLS] token clearly delineate object boundaries, a property that does not emerge as clearly in supervised ViTs or in convolutional networks.[7]

DINO achieved 80.1% top-1 accuracy on ImageNet with linear evaluation using ViT-Base. DINOv2 (Oquab et al., 2023) scaled the method to 142 million curated images and a ViT-Giant model with over 1 billion parameters, using a combination of self-distillation and masked image modeling. DINOv2 achieved state-of-the-art results across classification, segmentation, and depth estimation without any fine-tuning, producing visual features that work out-of-the-box with simple linear classifiers.[14] DINOv3 (2025) further scaled to 1.7 billion images and a 7-billion-parameter ViT teacher, narrowing the gap with fully supervised models across vision benchmarks.

### BEiT

BEiT (BERT Pre-Training of Image Transformers), introduced by Bao et al. at Microsoft Research in 2021, was the first method to make self-supervised pre-training of Vision Transformers outperform supervised pre-training.[15] BEiT adapts the masked language modeling concept from BERT to images:

1. An image is tokenized into discrete visual tokens using a pre-trained discrete variational autoencoder (dVAE) from DALL-E.
2. The image is also split into patches (e.g., 16x16 pixels).
3. Some patches are randomly masked (approximately 40%).
4. The patch sequence, with masked patches replaced by a learnable mask embedding, is fed to a Vision Transformer, and the model predicts the visual token corresponding to each masked position.

By predicting discrete visual tokens rather than raw pixels, BEiT avoids the pixel-level regression problem and instead frames pre-training as a classification task over a visual vocabulary. BEiT-Base achieved 83.2% top-1 accuracy on ImageNet-1K after fine-tuning, significantly outperforming the DeiT supervised baseline (81.8%). BEiT-Large reached 86.3% using only ImageNet-1K data.[15]

### MAE

MAE (Masked Autoencoders Are Scalable Vision Learners), introduced by He et al. at Meta AI Research in November 2021, adapted the masked prediction idea from NLP to computer vision with a simpler approach than BEiT.[8] MAE applies the following procedure:

1. An input image is divided into non-overlapping patches (e.g., 16x16 pixels).
2. A large fraction of patches (75% by default) are randomly masked.
3. Only the visible (unmasked) patches are fed through a Vision Transformer encoder.
4. A lightweight decoder takes the encoder output along with mask tokens representing the missing positions and reconstructs the pixel values of the masked patches.
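Steps 2 to 4 can be sketched as a random split of patch indices plus a reconstruction loss restricted to the masked patches. The 196-patch example assumes a 224x224 input with 16x16 patches, which is an illustrative choice; the function names are hypothetical, and patches are modeled as flat lists of pixel values.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) by random sampling."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(order[:num_visible])  # only these patches reach the encoder
    masked = sorted(order[num_visible:])   # the decoder reconstructs these
    return visible, masked

def masked_mse(predicted, original, masked):
    """Mean squared error computed over the masked patches only."""
    errors = [
        (p - o) ** 2
        for idx in masked
        for p, o in zip(predicted[idx], original[idx])
    ]
    return sum(errors) / len(errors)

visible, masked = random_masking(196)  # 14 x 14 patches
print(len(visible), len(masked))       # 49 147
```

Because the encoder sees only the 49 visible patches, its sequence length is a quarter of what a full-image pass would require.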

The MAE authors summarize their two core design choices plainly: "This is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task."[8] Because the encoder processes only 25% of the patches, pretraining is highly efficient in both computation and memory; the authors report that coupling the two designs accelerates training by 3 times or more. The high masking ratio forces the model to develop a holistic understanding of the image rather than relying on local interpolation from nearby visible patches. Unlike BEiT, MAE reconstructs raw pixel values directly using mean squared error loss, which makes it simpler because no pre-trained tokenizer is needed.

MAE with a ViT-Huge encoder achieved 87.8% top-1 accuracy on ImageNet-1K after fine-tuning (using a 448-pixel input), establishing a new state of the art among methods that use only ImageNet-1K data at the time.[8] The approach also scales well to video (VideoMAE) and audio (Audio-MAE).

### Comparison of vision SSL methods

| Method | Year | Authors / Lab | Approach | Requires negatives | Architecture | ImageNet top-1 (linear eval unless marked fine-tuned) |
|---|---|---|---|---|---|---|
| [SimCLR](/wiki/simclr) | 2020 | Chen et al. / Google | Contrastive (NT-Xent loss, large batches) | Yes | ResNet-50 (4×) | 76.5% |
| [MoCo](/wiki/moco) v2 | 2020 | He et al. / Meta AI | Contrastive (momentum queue) | Yes | ResNet-50 | 71.1% |
| [BYOL](/wiki/byol) | 2020 | Grill et al. / DeepMind | Non-contrastive (online-target prediction) | No | ResNet-50 | 74.3% |
| SimSiam | 2021 | Chen & He / Meta AI | Non-contrastive (stop-gradient Siamese) | No | ResNet-50 | 71.3% |
| Barlow Twins | 2021 | Zbontar et al. / Meta AI | Non-contrastive (redundancy reduction) | No | ResNet-50 | 73.2% |
| BEiT | 2021 | Bao et al. / Microsoft | Masked image modeling (token prediction) | No | ViT-Base (fine-tuned) | 83.2% |
| [DINO](/wiki/dino) | 2021 | Caron et al. / Meta AI | Self-distillation (student-teacher) | No | ViT-Base | 80.1% |
| [MAE](/wiki/masked_autoencoder) | 2022 | He et al. / Meta AI | Masked image modeling (pixel reconstruction) | No | ViT-Huge (fine-tuned) | 87.8% |
| DINOv2 | 2023 | Oquab et al. / Meta AI | Self-distillation at scale (142M images) | No | ViT-Giant | 86.5% |

## Self-supervised learning in speech and audio

Speech and audio present unique challenges for self-supervised learning. Audio signals are continuous waveforms with temporal structure, and the relationship between acoustic features and linguistic content is complex and variable across speakers, accents, and recording conditions. Self-supervised methods for speech typically operate on raw waveform inputs or spectral features and learn representations that encode phonetic, speaker, and prosodic information.

### Wav2Vec 2.0

Wav2Vec 2.0, introduced by Baevski et al. at Meta AI Research in June 2020, combines contrastive learning with quantization to learn speech representations from raw audio.[9] The architecture has three components:

1. **Feature encoder:** A multi-layer convolutional network processes the raw audio waveform and produces a sequence of latent speech representations at approximately 20ms resolution.
2. **Quantization module:** The latent representations are discretized through a product quantization scheme, producing a finite set of speech codes. This quantization provides the targets for the contrastive task.
3. **Transformer encoder:** The latent representations are masked (spans of consecutive time steps) and fed through a transformer encoder. The model must identify the correct quantized representation for each masked position from a set of distractors (contrastive objective).

Wav2Vec 2.0 demonstrated dramatic improvements in low-resource speech recognition. When pre-trained on 53,000 hours of unlabeled audio from LibriVox and fine-tuned on just 10 minutes of labeled data (about 48 recorded sentences), it achieved word error rates of 4.8%/8.2% on the LibriSpeech test-clean/test-other benchmarks. Using only one hour of labeled data, it outperformed the previous state of the art, which had been trained on 100 times more labeled data.[9]
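
The span masking and contrastive objective described in the component list can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the function names are hypothetical, the start probability, span length, and temperature are illustrative values, and the distractors are random vectors rather than quantized codes drawn from other masked positions.

```python
import numpy as np

def span_mask(num_steps, start_prob, span_len, rng):
    """Sample span starts, then mask span_len consecutive steps from each start."""
    starts = np.flatnonzero(rng.random(num_steps) < start_prob)
    mask = np.zeros(num_steps, dtype=bool)
    for s in starts:
        mask[s:s + span_len] = True        # spans may overlap; slicing clips at the end
    return mask

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors."""
    candidates = np.vstack([target[None, :], distractors])   # row 0 is the true target
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context)
    )                                                         # cosine similarities
    logits = sims / temperature
    logits = logits - logits.max()                            # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

rng = np.random.default_rng(0)
mask = span_mask(num_steps=500, start_prob=0.065, span_len=10, rng=rng)

target = rng.normal(size=16)               # stand-in for a quantized latent
distractors = rng.normal(size=(10, 16))    # stand-ins for distractor codes
loss_match = contrastive_loss(target, target, distractors)
loss_mismatch = contrastive_loss(distractors[0], target, distractors)
print(mask.mean(), loss_match, loss_mismatch)
```

The loss is small when the context vector points at the true target and large when it points at a distractor, which is the signal that trains the transformer at masked positions.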

### HuBERT

HuBERT (Hidden-Unit BERT), introduced by Hsu et al. at Meta AI Research in June 2021, takes a different approach to self-supervised speech representation learning.[10] Instead of contrastive learning with online quantization, HuBERT uses an offline clustering step to generate pseudo-labels:

1. **Clustering:** K-means clustering is applied to MFCC features (in the first iteration) or to features from a previously trained HuBERT model (in subsequent iterations) to produce discrete cluster assignments for each audio frame.
2. **Masked prediction:** Following the BERT paradigm, spans of the audio input are masked, and the model is trained to predict the cluster assignment of each masked frame.
3. **Iterative refinement:** After the first round of training, the model's own learned representations are used to generate improved cluster assignments, and the model is retrained. This iterative process progressively improves both the discrete labels and the learned representations.

HuBERT matches or exceeds Wav2Vec 2.0 performance across all fine-tuning subsets of LibriSpeech. When pretrained on the Libri-Light 60,000-hour dataset, HuBERT achieved state-of-the-art results on several speech processing benchmarks.[10] The approach has also been extended to multilingual settings and speech generation tasks.
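
A minimal sketch of the first HuBERT iteration, assuming random 13-dimensional vectors as a stand-in for MFCC frames and an all-zero logit matrix as a stand-in for an untrained model. The helpers (`assign`, `kmeans`, `masked_cross_entropy`) and the choice of 8 clusters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def assign(features, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per frame."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(features, k, num_iters, rng):
    """Plain Lloyd's k-means; returns centroids and a cluster id per frame."""
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(num_iters):
        labels = assign(features, centroids)
        for j in range(k):
            if np.any(labels == j):        # keep the old centroid if a cluster empties
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids, assign(features, centroids)

def masked_cross_entropy(logits, labels, mask):
    """Cross-entropy over cluster ids, averaged over masked frames only."""
    logits = logits[mask]
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(mask.sum()), labels[mask]].mean())

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))        # stand-in for MFCC frames
centroids, pseudo_labels = kmeans(frames, k=8, num_iters=10, rng=rng)

mask = np.zeros(200, dtype=bool)           # two masked spans
mask[40:50] = True
mask[120:130] = True

logits = np.zeros((200, 8))                # untrained model: uniform prediction
loss = masked_cross_entropy(logits, pseudo_labels, mask)
print(loss)                                # equals log(8) for uniform predictions
```

In later iterations the same procedure would be repeated with `frames` replaced by features from the previously trained model, which is the iterative refinement step.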

### Other speech SSL methods

Beyond Wav2Vec 2.0 and HuBERT, several other self-supervised methods have advanced speech representation learning:

- **WavLM** (Chen et al., 2022) extended HuBERT with denoising and speaker modeling objectives, achieving strong performance on the SUPERB benchmark across tasks including speech recognition, speaker verification, and spoken language understanding.
- **data2vec** (Baevski et al., 2022) proposed a unified self-supervised framework for speech, vision, and text, predicting latent representations of the full input from a masked view.
- **[Whisper](/wiki/whisper)** (Radford et al., 2022), while technically trained in a weakly supervised manner on 680,000 hours of labeled audio data, demonstrated the power of large-scale pretraining for robust speech recognition across many languages.

## Contrastive vs. non-contrastive learning

A fundamental distinction in self-supervised learning, particularly in computer vision, is between contrastive and non-contrastive methods. Both approaches learn by comparing different views of the same input, but they differ in how they prevent the model from learning trivial solutions (representational collapse).

### Contrastive learning

Contrastive methods explicitly use negative examples to prevent collapse. The loss function pulls together representations of positive pairs (different views of the same input) while pushing apart representations of negative pairs (views of different inputs). The InfoNCE loss, used in many contrastive frameworks, formalizes this as a softmax classification problem over one positive and many negative examples.
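
As a rough illustration, the softmax formulation can be written in a few lines of NumPy. This sketch uses a one-directional, in-batch variant in which row i of one view is the positive for row i of the other view and all other rows act as negatives; the function name and the temperature value are assumptions, and specific frameworks (for example NT-Xent in SimCLR) differ in how negatives are assembled.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: (z_a[i], z_b[i]) are positives, other rows negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                  # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_aligned = info_nce(z, z + 0.05 * rng.normal(size=z.shape))  # two close views
loss_random = info_nce(z, rng.normal(size=z.shape))              # unrelated "views"
print(loss_aligned, loss_random)
```

The loss is low when each embedding is closest to its own positive and high when positives are no more similar than negatives.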

The effectiveness of contrastive learning depends on the quality and quantity of negative examples. SimCLR addresses this by using very large batch sizes (4,096), while MoCo maintains a separate queue of negatives. A key challenge is that contrastive methods can become inefficient in high-dimensional representation spaces, where the number of negatives required for effective training grows substantially.

### Non-contrastive learning

Non-contrastive methods avoid the need for negative examples entirely. Instead, they prevent collapse through architectural asymmetries, regularization techniques, or information-theoretic objectives. The major non-contrastive approaches include:

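- **Asymmetric architectures (BYOL, DINO):** BYOL adds a predictor network to only the online branch and pairs it with a momentum-updated target network, so the two branches cannot collapse to the same trivial solution. DINO omits the predictor and instead combines a momentum-updated teacher with centering and sharpening of the teacher outputs.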
- **Stop-gradient (SimSiam):** Treating one branch's output as a constant (no gradient flows through it) introduces an implicit alternating optimization that avoids collapse without requiring momentum or large batches.
- **Redundancy reduction (Barlow Twins):** Measuring the cross-correlation matrix between the outputs of two identical networks and driving it toward the identity matrix ensures that different dimensions of the representation capture different information.[17]
- **Variance-Invariance-Covariance regularization (VICReg):** Combining three explicit objectives: (1) variance regularization to prevent individual dimensions from collapsing, (2) invariance to ensure representations of different views are similar, and (3) covariance regularization to decorrelate different dimensions.
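
To make one of these mechanisms concrete, the redundancy-reduction objective can be sketched as below. This is a minimal NumPy version under stated assumptions: the function name and the off-diagonal weight are illustrative, and the inputs are toy embeddings rather than network outputs.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Drive the cross-correlation matrix of two views toward the identity."""
    n, _ = z_a.shape
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)    # standardize each dimension
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    c = z_a.T @ z_b / n                                 # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()           # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum() # redundancy term
    return float(on_diag + lambda_offdiag * off_diag)

rng = np.random.default_rng(0)

# Informative embeddings: dimensions are (nearly) uncorrelated.
z = rng.normal(size=(256, 16))
loss_decorrelated = barlow_twins_loss(z, z)

# Redundant embeddings: every dimension copies the same feature.
collapsed = np.tile(rng.normal(size=(256, 1)), (1, 16))
loss_collapsed = barlow_twins_loss(collapsed, collapsed)
print(loss_decorrelated, loss_collapsed)
```

Both inputs are perfectly invariant across views, so only the off-diagonal term separates them: the redundant embedding is penalized because all of its dimensions carry the same information.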

### Trade-offs

| Aspect | Contrastive methods | Non-contrastive methods |
|---|---|---|
| Negative examples | Required; more negatives generally improve performance | Not required |
| Batch size sensitivity | Performance often depends on large batch sizes or external memory | Generally more robust to batch size |
| Loss function | InfoNCE, NT-Xent | MSE, cross-correlation, variance/covariance regularization |
| Collapse prevention | Explicit repulsion of negatives | Architectural asymmetry, regularization, or information-theoretic constraints |
| Computational cost | Can be expensive due to large batches or memory banks | Typically lower, but requires careful design to avoid collapse |
| Examples | SimCLR, MoCo, CLIP | BYOL, DINO, SimSiam, Barlow Twins, VICReg |

## Joint-embedding methods vs. generative methods

At a higher level of abstraction, self-supervised learning methods can be divided into two broad paradigms: joint-embedding methods and generative methods. This distinction, highlighted in the "Cookbook of Self-Supervised Learning" survey (Balestriero et al., 2023), captures a fundamental design choice about how the learning signal is constructed.[19]

**Joint-embedding methods** (also called embedding-based or energy-based methods) map two views of the same input into a shared representation space and train the model to make the two embeddings similar. The model never reconstructs the raw input. SimCLR, MoCo, BYOL, DINO, Barlow Twins, and VICReg all fall into this category. The advantage is that the model is free to discard low-level details (exact pixel values, noise) and focus on high-level semantic content.

**Generative methods** reconstruct some form of the original input from a corrupted or partial version. Masked language modeling (BERT), next-token prediction (GPT), and masked image modeling (MAE, BEiT) are generative in nature. These methods provide a dense training signal (every masked position contributes to the loss), but they require the model to allocate capacity to low-level reconstruction, which may not always be useful for downstream tasks.

In practice, the most successful recent systems combine elements of both. DINOv2, for example, uses a joint-embedding self-distillation objective alongside a masked image modeling objective. The I-JEPA and V-JEPA frameworks represent a hybrid approach that performs prediction in representation space rather than input space.

## Joint-Embedding Predictive Architecture (JEPA)

The Joint-Embedding Predictive Architecture (JEPA) is a framework for self-supervised learning proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence."[12] JEPA represents a departure from both contrastive methods and pixel-level generative methods, instead learning to predict in a learned abstract representation space.

### Core principles

In a JEPA, two encoder networks map inputs x and y into embedding spaces, producing representations sx and sy. A predictor network takes sx (and optionally a latent variable z) and predicts sy. The key principles are:

1. The representation sx should be maximally informative about x.
2. The representation sy should be maximally informative about y.
3. The representation sy should be easily predictable from sx.
4. The latent variable z should have minimal information content.

By predicting in representation space rather than in input space, JEPA avoids the need to model irrelevant low-level details (exact pixel values, background textures) and instead focuses on capturing high-level semantic content. This is a central motivation: LeCun argues that predicting every pixel in an image or every sample in an audio waveform wastes model capacity on perceptually irrelevant variation.[12]

### I-JEPA

I-JEPA (Image-based JEPA), introduced by Assran et al. at Meta AI in 2023, applies the JEPA framework to images.[13] The method works as follows:

1. An image is divided into patches. A context block (a subset of patches) is left visible, and several target blocks are masked.
2. A context encoder processes the visible patches to produce context representations.
3. A predictor network takes the context representations and positional information of the target blocks and predicts the target representations.
4. A target encoder (updated via exponential moving average of the context encoder) processes the full image to produce the target representations.

I-JEPA differs from MAE in that it predicts in representation space, not in pixel space. This design learns representations that emphasize semantic content over low-level texture and color information.[13]
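
Two pieces of this procedure are easy to show in isolation: the exponential moving average that updates the target encoder, and a loss measured between embeddings rather than pixels. The sketch below assumes parameters stored in a plain dictionary, a momentum of 0.9 chosen only for illustration, and random vectors in place of real patch embeddings; the function names are hypothetical.

```python
import numpy as np

def ema_update(target_params, context_params, momentum):
    """Move each target-encoder parameter a small step toward the context encoder."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

def latent_prediction_loss(predicted, target):
    """Average squared L2 distance between predicted and target patch embeddings."""
    return float(np.mean(np.sum((predicted - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
context_params = {"w": rng.normal(size=(4, 4))}
target_params = {"w": np.zeros((4, 4))}
target_params = ema_update(target_params, context_params, momentum=0.9)

# Toy embeddings for 6 target patches with 4 dimensions each.
target_repr = rng.normal(size=(6, 4))
predicted_repr = target_repr + 0.1         # a predictor that is off by 0.1 everywhere
loss = latent_prediction_loss(predicted_repr, target_repr)
print(loss)
```

In a training loop, gradients would flow only through the context encoder and predictor, while the target encoder changes solely through `ema_update`.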

### V-JEPA

V-JEPA (Video JEPA), introduced by Bardes et al. at Meta AI in 2024, extends the JEPA framework to video. The model predicts masked spatio-temporal regions in a learned latent space, learning from the temporal structure of video without any text supervision, negative examples, or pixel-level reconstruction.[21] V-JEPA pretraining is based solely on an unsupervised feature prediction objective.

V-JEPA 2 (June 2025) scaled the approach to over one million hours of internet video data and combined it with a small amount of robot interaction data. It achieved 77.3% top-1 accuracy on Something-Something v2 for motion understanding and state-of-the-art performance on human action anticipation, reaching 39.7% recall-at-5 on Epic-Kitchens-100. After post-training on less than 62 hours of unlabeled robot video, the V-JEPA 2-AC variant could be used for zero-shot robot planning.[26]

### LeJEPA

LeJEPA, introduced by LeCun and Balestriero at Meta in late 2025, simplified the JEPA framework by combining the JEPA predictive loss with SIGReg (Sketched Isotropic Gaussian Regularization). LeJEPA removes the need for many of the engineering heuristics that earlier self-supervised methods relied on, such as momentum encoders, stop-gradients, and asymmetric architectures. The method can be implemented in approximately 50 lines of code, making it one of the most accessible self-supervised learning algorithms to date.

### JEPA vs. other paradigms

| Feature | Contrastive SSL | Generative SSL (MAE) | JEPA |
|---|---|---|---|
| Prediction space | Embedding similarity | Input (pixel) space | Learned representation space |
| Negative examples | Required | Not applicable | Not required |
| What is predicted | Whether two views match | Missing pixels or tokens | Abstract representations of missing regions |
| Low-level detail modeling | Avoided via embedding space | Required (pixel reconstruction) | Avoided by design |
| Flexibility | Primarily two views of the same input | Masked input reconstruction | Spatial, temporal, and cross-modal prediction |
| Examples | SimCLR, MoCo, CLIP | MAE, BEiT | I-JEPA, V-JEPA, LeJEPA |

## Multimodal self-supervised learning

Self-supervised learning has extended beyond single modalities to learn joint representations across vision, language, and audio.

### CLIP

CLIP (Contrastive Language-Image Pre-training), introduced by Radford et al. at OpenAI in January 2021, learns visual representations from natural language supervision. CLIP jointly trains an image encoder (a Vision Transformer or ResNet) and a text encoder (a transformer-based language model) on 400 million image-text pairs collected from the internet.[11] The contrastive objective maximizes the cosine similarity between matching image-text pairs and minimizes it for non-matching pairs within each minibatch of 32,768 examples.

CLIP enables zero-shot image classification: given an image and a set of textual class descriptions, the model selects the description whose embedding is most similar to the image embedding. Without using any of the 1.28 million labeled training examples, CLIP matched the accuracy of the original supervised ResNet-50 on ImageNet in a zero-shot setting.[11] CLIP representations have become widely used as conditioning signals in text-to-image generation models such as [Stable Diffusion](/wiki/stable_diffusion) and [DALL-E](/wiki/dall-e).
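
The zero-shot decision rule can be sketched with toy vectors. The embeddings below are hand-written stand-ins, not outputs of CLIP's encoders, and the function name and prompts are hypothetical; in practice both the image and the class descriptions would be embedded by the trained encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the class description whose embedding is most similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb           # cosine similarity per description
    return int(np.argmax(sims)), sims

prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Toy 3-dimensional embeddings standing in for text-encoder outputs.
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy image embedding that lies closest to the second description.
image_emb = np.array([0.1, 0.9, 0.2])

predicted, sims = zero_shot_classify(image_emb, text_embs)
print(prompts[predicted])
```

No classifier is trained here: adding a new class only requires embedding one more text description.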

### Other multimodal methods

- **ALIGN** (Jia et al., 2021, Google) scaled image-text contrastive learning to 1.8 billion noisy image-alt-text pairs, demonstrating that scale can compensate for data noise.
- **SigLIP** (Zhai et al., 2023, Google) replaced the softmax-based contrastive loss in CLIP with a pairwise sigmoid loss, removing the need for global normalization across the batch and improving scalability.
- **ImageBind** (Girdhar et al., 2023, Meta AI) extended the idea to six modalities (images, text, audio, depth, thermal, IMU) using image-paired data as a binding modality.
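
The pairwise sigmoid loss mentioned for SigLIP can be sketched as follows. Each image-text pair is treated as an independent binary classification, so no softmax over the batch is needed. The function name, the temperature and bias values, and the toy embeddings are assumptions made for illustration.

```python
import numpy as np

def sigmoid_pairwise_loss(img, txt, temperature=10.0, bias=-10.0):
    """Sum of per-pair binary losses: +1 label for matching pairs, -1 otherwise."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = temperature * (img @ txt.T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0              # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated stably with logaddexp.
    return float(np.sum(np.logaddexp(0.0, -labels * logits)) / len(img))

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 16))
loss_matched = sigmoid_pairwise_loss(img, img.copy())                # texts align with images
loss_shifted = sigmoid_pairwise_loss(img, np.roll(img, 1, axis=0))   # every pairing is wrong
print(loss_matched, loss_shifted)
```

Because every entry of the similarity matrix contributes its own term, the loss can be computed in chunks across devices without gathering a batch-wide normalizer.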

## The self-supervised pre-training pipeline

Self-supervised learning is most commonly used as the first stage of a two-stage pipeline: pre-training followed by adaptation. This pipeline has become the dominant approach for building modern AI systems and is central to the concept of [pre-trained models](/wiki/pre-trained_model).

### Stage 1: self-supervised pre-training

A large [neural network](/wiki/neural_network) (typically a transformer) is trained on a pretext task using a large corpus of unlabeled data. The goal is to learn general-purpose representations that capture the structure of the data domain. This stage is computationally expensive (often requiring hundreds or thousands of GPU-hours) but needs to be performed only once.

### Stage 2: adaptation

The pretrained model is adapted to a specific downstream task using one of several strategies:

- **Full fine-tuning:** All parameters of the pretrained model are updated on the downstream labeled dataset. This typically achieves the highest performance but requires more labeled data and compute.
- **Linear probing:** A single linear layer (logistic regression) is trained on top of the frozen pretrained representations. This tests the quality of the representations themselves without allowing the model to adjust its features.
- **Few-shot and zero-shot inference:** For large language models like GPT-3, no gradient updates are needed. The model performs tasks by conditioning on a few examples in the prompt (few-shot) or on task instructions alone (zero-shot).
- **Parameter-efficient fine-tuning:** Methods like LoRA, adapters, and prompt tuning update only a small fraction of the model's parameters, reducing compute and memory requirements while preserving most of the pretrained knowledge.

This two-stage pipeline underpins the foundation model paradigm, in which a single large pre-trained model serves as a starting point for many different tasks and applications.
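
As an example of why parameter-efficient fine-tuning is cheap, the sketch below applies a low-rank update in the style of LoRA to a single frozen weight matrix. The layer size, rank, scaling factor, and function name are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha):
    """Frozen linear layer plus a scaled low-rank update: x W^T + (alpha / r) x A^T B^T."""
    r = a.shape[0]
    return x @ w_frozen.T + (alpha / r) * (x @ a.T) @ b.T

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8
w_frozen = rng.normal(size=(d_out, d_in))       # pretrained weight, never updated
a = rng.normal(scale=0.01, size=(r, d_in))      # trainable down-projection
b = np.zeros((d_out, r))                        # zero init: the adapter starts as a no-op

x = rng.normal(size=(2, d_in))
out = lora_forward(x, w_frozen, a, b, alpha=16)

trainable = a.size + b.size
full = w_frozen.size
print(trainable, full, trainable / full)
```

For this 768x768 layer the adapter trains 12,288 values instead of 589,824, roughly 2% of the layer, while the pretrained weight itself stays untouched.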

## Evaluation protocols

Evaluating the quality of self-supervised representations is a critical and nuanced problem. Because SSL methods do not optimize for any specific downstream task, researchers use several standardized evaluation protocols to assess how useful the learned representations are.

### Linear probing

Linear probing (also called linear evaluation) is the most widely used evaluation protocol for SSL. A linear classifier (single fully-connected layer) is trained on top of the frozen pretrained encoder using a labeled dataset such as ImageNet. The pretrained encoder's weights are not updated during this process. High linear probing accuracy indicates that the pretrained features are linearly separable with respect to the downstream task, meaning they already encode semantically meaningful information.

Linear probing is favored because it isolates the quality of the representations from the capacity of the downstream model. If a complex neural network is used for evaluation, it might compensate for poor representations through its own learning.
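
A linear probe amounts to softmax regression on fixed features. The sketch below is a minimal NumPy version under stated assumptions: two well-separated Gaussian blobs stand in for frozen encoder outputs, and the function names, learning rate, and step count are illustrative.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a single linear layer on frozen features with plain gradient descent."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n            # gradient of mean cross-entropy w.r.t. logits
        w -= lr * features.T @ grad            # only the probe is updated
        b -= lr * grad.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)

def toy_features(n):
    """Stand-in for frozen encoder outputs: two separated Gaussian blobs."""
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, 8)) + (4 * labels[:, None] - 2)
    return feats, labels

train_x, train_y = toy_features(400)
test_x, test_y = toy_features(100)
w, b = train_linear_probe(train_x, train_y, num_classes=2)
accuracy = float(np.mean((test_x @ w + b).argmax(axis=1) == test_y))
print(accuracy)
```

The features are never modified inside the loop, which is what makes the resulting accuracy a statement about the representation rather than about the classifier.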

### Fine-tuning evaluation

In fine-tuning evaluation, the entire pretrained model is trained end-to-end on the downstream task. All parameters are updated, allowing the representations to adapt to the specific task and dataset. Fine-tuning generally produces higher accuracy than linear probing because the model can adjust its features. However, it provides less insight into the intrinsic quality of the pretrained representations, since a powerful model architecture can partially compensate for weaker pre-training.

### k-Nearest Neighbors (k-NN) evaluation

k-NN evaluation extracts features from the frozen pretrained encoder for both training and test images, then classifies each test image by majority vote among its k nearest neighbors in the training set (measured by Euclidean distance or cosine similarity in the feature space). This protocol requires no training at all, making it fast and computationally lightweight. k-NN accuracy is highly correlated with linear probing accuracy when embedding normalization is applied, and the two metrics can often be used interchangeably.
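
A minimal sketch of the protocol with cosine similarity and majority voting, again using toy Gaussian blobs in place of features from a frozen encoder. The function names and the value of k are illustrative assumptions.

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Classify each test feature by majority vote among its k nearest training features."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                          # cosine similarity, (num_test, num_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar
    votes = train_labels[nearest]                  # (num_test, k) neighbor labels
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)

def toy_features(n):
    """Stand-in for frozen encoder outputs: two separated Gaussian blobs."""
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, 8)) + (4 * labels[:, None] - 2)
    return feats, labels

train_x, train_y = toy_features(400)
test_x, test_y = toy_features(100)
pred = knn_predict(train_x, train_y, test_x, k=5)
accuracy = float(np.mean(pred == test_y))
print(accuracy)
```

Normalizing the features before the dot product is the embedding normalization step referred to above; without it the ranking would depend on feature magnitude as well as direction.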

### Comparison of evaluation protocols

| Protocol | Pretrained weights | Evaluation model | Computational cost | What it measures |
|---|---|---|---|---|
| Linear probing | Frozen | Single linear layer | Low | Quality of frozen representations |
| Fine-tuning | Updated | Full pretrained model | High | Upper bound on task performance with pretrained initialization |
| k-NN | Frozen | None (nearest neighbor lookup) | Very low | Cluster structure of the representation space |
| Few-shot | Frozen or minimal adaptation | Linear or lightweight head | Low | Generalization from very few labeled examples |

## Impact on foundation models

Self-supervised learning is the engine behind the foundation model paradigm, in which a single large model is pretrained on broad data and then adapted to many downstream tasks.

### Scale and performance

The effectiveness of self-supervised pretraining scales with both model size and data volume. In NLP, scaling from GPT-1 (117M parameters, approximately 5 GB text) to GPT-3 (175B parameters, approximately 570 GB text) led to dramatic improvements in few-shot and zero-shot capabilities. In vision, DINOv2 demonstrated that self-supervised models trained on 142 million curated images produce features that rival or exceed supervised pretraining for classification, segmentation, and depth estimation.[14] DINOv3 (2025) pushed this further with 1.7 billion images and a 7-billion-parameter ViT.

### Transfer learning efficiency

Self-supervised pretrained models require far less labeled data for downstream tasks compared to training from scratch. Wav2Vec 2.0 demonstrated that just 10 minutes of labeled speech data, combined with self-supervised pretraining, can achieve competitive speech recognition.[9] In NLP, BERT fine-tuned on a few thousand labeled examples routinely outperforms models trained from scratch on much larger labeled datasets.

### Emergent capabilities

Large self-supervised models exhibit capabilities that are not explicitly trained for:

- **In-context learning:** GPT-3 and later models can learn new tasks from a few examples provided in the prompt, without any gradient updates.
- **Object segmentation:** DINO's self-attention maps reveal object boundaries without any segmentation supervision.[7]
- **Cross-lingual transfer:** Multilingual models pretrained on text from many languages can perform tasks in languages with very little training data.
- **Compositional understanding:** CLIP and similar models can compose visual and textual concepts in novel ways for zero-shot classification and retrieval.

### Summary of SSL methods across modalities

| Modality | Method | Pretext task | Key result |
|---|---|---|---|
| Text | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) | Masked language modeling | SOTA on 11 NLP benchmarks at release (2018) |
| Text | [GPT-3](/wiki/gpt-3) | Next-token prediction | Few-shot learning without fine-tuning (2020) |
| Text | [T5](/wiki/t5) | Span corruption | Unified text-to-text framework across NLP tasks (2019) |
| Images | [SimCLR](/wiki/simclr) | Contrastive (augmented views) | 76.5% ImageNet linear eval with ResNet-50 (4×) (2020) |
| Images | [MoCo](/wiki/moco) | Contrastive (momentum queue) | Decoupled batch size from number of negatives (2019) |
| Images | [BYOL](/wiki/byol) | Non-contrastive (self-prediction) | 74.3% ImageNet without negative examples (2020) |
| Images | SimSiam | Non-contrastive (stop-gradient) | Simplified non-contrastive SSL without momentum (2021) |
| Images | BEiT | Masked image modeling (token prediction) | First SSL to outperform supervised ViT pre-training (2021) |
| Images | [DINO](/wiki/dino) | Self-distillation | Emergent object segmentation in ViT attention maps (2021) |
| Images | [MAE](/wiki/masked_autoencoder) | Masked image modeling | 87.8% ImageNet fine-tuned with ViT-Huge (2022) |
| Images | I-JEPA | JEPA (representation prediction) | Semantic features without pixel reconstruction (2023) |
| Speech | Wav2Vec 2.0 | Contrastive with quantization | Competitive ASR with 10 min labeled data (2020) |
| Speech | HuBERT | Masked prediction with offline clustering | Matched/exceeded Wav2Vec 2.0 across benchmarks (2021) |
| Multimodal | [CLIP](/wiki/clip) | Image-text contrastive | Zero-shot ImageNet classification matching supervised ResNet-50 (2021) |
| Video | V-JEPA | Spatio-temporal representation prediction | Action understanding without text or pixel reconstruction (2024) |

## Applications and successes

Self-supervised learning has shown transformative results in a variety of domains, including:

- **Computer vision:** SSL techniques have been used to learn powerful representations from large-scale image datasets, which can then be fine-tuned for tasks like [object detection](/wiki/object_detection), [image segmentation](/wiki/image_segmentation), and classification. DINOv2 features serve as general-purpose visual features for medical imaging, autonomous driving, and robotics applications.

- **Natural language processing:** Language models like BERT and GPT have achieved state-of-the-art results on numerous NLP benchmarks by leveraging self-supervised pre-training on large text corpora. The GPT series demonstrated that scaling autoregressive pretraining leads to emergent few-shot and zero-shot capabilities.

- **Speech recognition:** Wav2Vec 2.0 and HuBERT have dramatically reduced the amount of labeled data required for speech recognition systems, enabling competitive performance in low-resource languages where labeled transcriptions are scarce.

- **Reinforcement learning:** SSL has been used to learn useful features from raw sensory data in reinforcement learning settings, enabling agents to learn more efficiently and generalize better across tasks. V-JEPA 2 has demonstrated the potential for self-supervised video models to support planning in physical environments.

- **Medical imaging:** Self-supervised pretraining on unlabeled medical scans, followed by fine-tuning on small labeled datasets, has improved diagnostic accuracy for radiology, pathology, and dermatology applications. Studies have shown that SSL-pretrained models can match fully supervised models while requiring 5 to 10 times fewer labeled examples.

- **Robotics:** Self-supervised visual and multimodal representations are used for manipulation tasks, navigation, and sim-to-real transfer, where labeled data is particularly difficult to collect.

## Challenges and open problems

Despite its successes, self-supervised learning faces several ongoing challenges:

- **Evaluation standards:** There is no universally accepted protocol for evaluating SSL representations. Linear probing, fine-tuning, and few-shot evaluation can yield different rankings of methods, making comparisons difficult. A 2025 study found that in-domain linear and k-NN probing accuracies are, on average, the best general predictors for out-of-domain performance.
- **Computational cost:** Many SSL methods require extensive pretraining on large compute clusters. SimCLR's large batch requirement, MAE's long pretraining schedules, and the sheer scale of data for DINOv2 and GPT-3 mean that reproducing results is prohibitively expensive for most researchers.
- **Collapse and training instability:** Non-contrastive methods can suffer from representational collapse if architectural safeguards fail. Training instability has been reported for contrastive methods applied to Vision Transformers (addressed in MoCo v3).
- **Domain specificity of pretext tasks:** A pretext task that works well for images (e.g., masking 75% of patches) may not transfer directly to other modalities like point clouds, graphs, or tabular data. Designing effective pretext tasks for new data types remains an active area of research.
- **Understanding what is learned:** The theoretical understanding of why certain pretext tasks lead to useful representations is still limited. Empirical studies suggest that data augmentation strategies may matter more than the specific pretext task, but the precise mechanisms remain unclear.
- **Fairness and bias:** Self-supervised models inherit biases present in their training data. Since SSL methods are trained on massive, often uncurated web corpora, they can amplify societal biases related to gender, race, and other attributes.

## Frequently asked questions

### Is self-supervised learning the same as unsupervised learning?

Not exactly. Both learn from unlabeled data, but self-supervised learning constructs an explicit prediction target from the data (such as a masked word or a hidden image patch), giving it a concrete supervised-style training objective. Unsupervised learning in the classical sense (clustering, density estimation, dimensionality reduction) has no such prediction target. Yann LeCun popularized the term "self-supervised learning" precisely because he felt "unsupervised learning" was a confusing and inaccurate label for this mechanism.[18]

### When did self-supervised learning become dominant?

The modern era began in 2018 with the transformer-based models GPT-1 (June 2018) and BERT (October 2018), which showed that autoregressive and masked language modeling on raw text could set new state-of-the-art results across NLP.[1][2] Contrastive vision methods such as SimCLR and MoCo followed in 2019-2020, and masked image modeling (BEiT, MAE) matured in 2021-2022.[4][5][8] By the early 2020s, self-supervised pretraining had become the standard first stage for essentially all foundation models.

### What are the main types of self-supervised learning?

The principal families are predictive or generative methods (masked language modeling in BERT, next-token prediction in GPT, masked image modeling in MAE), contrastive methods (SimCLR, MoCo, CLIP), non-contrastive joint-embedding methods (BYOL, DINO, SimSiam, Barlow Twins, VICReg), and joint-embedding predictive architectures (I-JEPA, V-JEPA) that predict in a learned representation space rather than in pixel or token space.[19]

## References

1. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*. arXiv:1810.04805.
2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." *OpenAI Technical Report*.
3. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research*, 21(140), 1-67.
4. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." *Proceedings of ICML 2020*. arXiv:2002.05709.
5. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning." *Proceedings of CVPR 2020*.
6. Grill, J.-B., Strub, F., Altché, F., et al. (2020). "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning." *Proceedings of NeurIPS 2020*.
7. Caron, M., Touvron, H., Misra, I., et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers." *Proceedings of ICCV 2021*.
8. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." *Proceedings of CVPR 2022*. arXiv:2111.06377.
9. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." *Proceedings of NeurIPS 2020*.
10. Hsu, W.-N., Bolte, B., Tsai, Y.-H., et al. (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 29, 3451-3460.
11. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of ICML 2021*.
12. LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." *OpenReview preprint*.
13. Assran, M., Duval, Q., Misra, I., et al. (2023). "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." *Proceedings of CVPR 2023*.
14. Oquab, M., Darcet, T., Moutakanni, T., et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision." *Transactions on Machine Learning Research (TMLR)*.
15. Bao, H., Dong, L., Piao, S., & Wei, F. (2021). "BEiT: BERT Pre-Training of Image Transformers." *Proceedings of ICLR 2022*.
16. Chen, X. & He, K. (2021). "Exploring Simple Siamese Representation Learning." *Proceedings of CVPR 2021*.
17. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). "Barlow Twins: Self-Supervised Learning via Redundancy Reduction." *Proceedings of ICML 2021*.
18. LeCun, Y. & Misra, I. (2021). "Self-supervised learning: The dark matter of intelligence." *Meta AI Blog*, March 4, 2021.
19. Balestriero, R., Ibrahim, M., Sobal, V., et al. (2023). "A Cookbook of Self-Supervised Learning." *arXiv preprint arXiv:2304.12210*.
20. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *Proceedings of ICLR 2013*.
21. Bardes, A., Garrido, Q., Ponce, J., et al. (2024). "V-JEPA: Latent Video Prediction for Visual Representation Learning." *arXiv preprint*.
22. Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." *arXiv preprint arXiv:1907.11692*.
23. Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." *Proceedings of ICLR 2020*.
24. LeCun, Y. (2016). "Predictive Learning." Keynote, NIPS 2016 (cake analogy: unsupervised learning is the bulk of the cake).
25. LeCun, Y. (2019). Keynote, International Solid-State Circuits Conference (ISSCC) 2019 (updated cake analogy substituting self-supervised learning for unsupervised learning).
26. Assran, M., Bardes, A., Fan, D., et al. (2025). "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning." *Meta AI*. arXiv:2506.09985.

