See also: Machine learning terms
Self-supervised learning (SSL) is a subfield of machine learning that focuses on learning representations of data without human-provided labels by exploiting the structure and inherent properties of the data itself. Rather than requiring manually annotated labels, SSL algorithms generate supervisory signals directly from the input data by defining pretext tasks that force models to learn meaningful internal representations. This approach has gained significant traction since the mid-2010s because it enables algorithms to learn useful features from large volumes of unlabeled data, thereby reducing the reliance on expensive labeled datasets. The learned representations can then be fine-tuned for a wide range of downstream tasks, such as image classification, natural language processing, speech recognition, and reinforcement learning.
Self-supervised learning has become the dominant pretraining paradigm behind modern foundation models. Nearly all large-scale AI systems introduced since 2018, including BERT, GPT, T5, CLIP, DINO, and Wav2Vec 2.0, rely on self-supervised pretraining as the first stage of their training pipeline. By learning from raw, unlabeled corpora of text, images, audio, or video, these models acquire general-purpose representations that transfer effectively to hundreds of specialized tasks with minimal labeled data.
Imagine you are given a picture book, but someone has covered up parts of every picture with sticky notes. Your job is to guess what is hidden under each sticky note. Nobody tells you the answers, but by looking at thousands of pictures and figuring out what fits, you start to understand what dogs, trees, and cars look like. That is essentially what self-supervised learning does: the computer hides parts of its own data (words in a sentence, patches of an image, segments of audio) and then tries to guess what was hidden. By solving these "fill in the blank" puzzles millions of times, the computer builds a deep understanding of language, images, or sound, all without a human teacher labeling every example.
Another analogy: think of a jigsaw puzzle. Nobody tells you what the finished picture should look like, but by fitting the pieces together you learn a lot about shapes, colors, and scenes. Self-supervised learning creates its own jigsaw puzzles from raw data, and the process of solving them teaches the model useful patterns it can later apply to real tasks like translating languages or identifying objects in photos.
The conventional supervised machine learning paradigm requires large amounts of labeled data to train accurate models. However, obtaining such labeled data can be expensive, time-consuming, and infeasible for some domains. Medical imaging, for instance, requires expert radiologists to annotate each scan, while speech transcription in low-resource languages may lack native transcribers entirely. Moreover, supervised learning models may not generalize well to new, unseen data, as they are often biased towards the specific distribution of the training set. In contrast, self-supervised learning aims to leverage the abundance of unlabeled data available in the wild, allowing models to learn meaningful representations without explicit supervision.
Unsupervised machine learning methods, such as clustering and dimensionality reduction, have long been used to analyze and discover structures in data without relying on labels. Self-supervised learning builds upon these foundations by focusing on learning rich, high-level representations of data that can be used as a starting point for various downstream tasks. By doing so, SSL bridges the gap between unsupervised learning and supervised learning, exploiting the benefits of both paradigms.
The term "self-supervised learning" was popularized by Yann LeCun, who argued that it more accurately describes the mechanism at work compared to the broader label of "unsupervised learning." In self-supervised learning, the supervision signal is not absent; it is derived automatically from the data. For example, predicting a masked word in a sentence or predicting the next frame in a video provides a concrete training objective, even though no human annotator supplied the target.
In a landmark 2021 blog post co-authored by Yann LeCun, Meta AI described self-supervised learning as the "dark matter of intelligence." The analogy draws on cosmology: just as dark matter constitutes the vast majority of the universe's mass yet remains invisible, the vast majority of learning that biological organisms perform is self-supervised rather than supervised. Humans and animals learn to understand the world largely through observation, not through labeled examples. A child does not need someone to label every object in a room to learn what a chair looks like; instead, the child builds mental models by predicting what will happen next and filling in gaps in perception.
LeCun argued that self-supervised learning is one of the most promising paths toward building AI systems that approach human-level common sense. In his well-known cake analogy, supervised learning accounts for only the thin "icing" on the cake of intelligence and reinforcement learning for the cherry on top, while self-supervised learning provides the bulk of the "cake" itself. This framing has been influential in motivating research into methods like JEPA that seek to learn world models through prediction in abstract representation spaces.
Self-supervised learning occupies a distinct position among the major machine learning paradigms. The following table summarizes the key differences.
| Aspect | Supervised learning | Unsupervised learning | Semi-supervised learning | Self-supervised learning |
|---|---|---|---|---|
| Labels required | Full labeled dataset | No labels | Small labeled set + large unlabeled set | No labels (labels derived from data) |
| Training signal | Human-provided labels | Data structure (clusters, distributions) | Combination of labels and consistency regularization | Pretext task generated from the data itself |
| Typical goal | Predict target variable | Discover hidden patterns | Improve supervised model with unlabeled data | Learn general-purpose representations |
| Common workflow | Train directly on labeled data | Clustering, density estimation, dimensionality reduction | Joint training on labeled and unlabeled data | Pre-train on pretext task, then fine-tune on downstream task |
| Data efficiency | Requires large labeled datasets | Works on unlabeled data | Reduces labeling cost | Leverages massive unlabeled corpora |
| Examples | Image classification with ImageNet labels, sentiment analysis | K-means clustering, PCA, autoencoders | FixMatch, MixMatch, pseudo-labeling | BERT (masked language modeling), SimCLR (contrastive learning), MAE (masked image modeling) |
A key distinction is that semi-supervised learning and self-supervised learning both use unlabeled data, but they do so differently. Semi-supervised methods typically train a single model jointly on a small labeled set and a large unlabeled set, using techniques such as consistency regularization or pseudo-labeling. Self-supervised methods, by contrast, define an explicit pretext task that requires no labels at all. The resulting pretrained model is then adapted to downstream tasks through transfer learning, usually via fine-tuning or linear probing.
The roots of self-supervised learning trace back to early work on distributed word representations and autoencoders. Autoencoders, which learn to compress and reconstruct their input through a bottleneck layer, represent one of the earliest forms of learning useful representations without labels. Denoising autoencoders (Vincent et al., 2008) took this further by corrupting the input and training the network to recover the original, an approach that foreshadowed modern masked prediction methods.
In 2013, Tomas Mikolov and colleagues at Google published Word2Vec, which learned word embeddings by predicting context words given a target word (skip-gram) or predicting a target word given its context (continuous bag of words, CBOW). Although the term "self-supervised" was not commonly used at the time, Word2Vec exemplified the core principle: constructing supervision from the data itself. Later methods such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) extended this idea to capture subword information and global co-occurrence statistics.
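To make the skip-gram construction concrete, the toy sketch below (a hypothetical helper in plain Python, not the Word2Vec implementation) generates (input word, context word) training pairs from raw text with a context window of two words; every pair is a supervisory signal extracted from the sentence itself, with no human labeling.

```python
# Toy sketch of skip-gram pair construction: each word is trained to
# predict its neighbors within a fixed context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, context word)
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
```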
The release of ELMo in 2018 by Peters et al. demonstrated that deep, context-dependent word representations trained with language modeling objectives could substantially improve downstream NLP tasks. This set the stage for the transformer-based self-supervised revolution that followed with BERT and GPT.
Self-supervised methods can be organized by the type of pretext task they use to extract supervision from raw data. The three broad families are predictive methods, contrastive methods, and generative (reconstructive) methods.
Predictive pretext tasks require the model to predict some part of the input from the remaining parts. Examples include predicting a masked word from its surrounding context, predicting the next token in a sequence, predicting the relative spatial position of two image patches, predicting the rotation applied to an image, and predicting future frames in a video.
Contrastive methods learn representations by pulling together embeddings of semantically similar ("positive") pairs and pushing apart embeddings of dissimilar ("negative") pairs. The model does not reconstruct the input; instead, it learns an embedding space where similarity reflects semantic relatedness. SimCLR, MoCo, and CLIP are well-known contrastive approaches.
Generative pretext tasks require the model to reconstruct the original input from a corrupted or partial version. Variational autoencoders (VAEs), denoising autoencoders, and masked autoencoders (MAE) fall into this category. While generative models like GANs can also be used for representation learning, their primary training signal comes from an adversarial game rather than direct reconstruction.
Natural language processing has been one of the most successful application domains for self-supervised learning. The key insight is that raw text contains rich structure that can be exploited as a training signal. Three principal pretext tasks have dominated NLP pretraining: masked language modeling, next-token prediction, and span corruption.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google in October 2018, pioneered the masked language modeling (MLM) approach. During pretraining, 15% of the input tokens are selected at random. Of these selected tokens, 80% are replaced with a special [MASK] token, 10% are replaced with a random word, and 10% are left unchanged. The model must predict the original identity of each selected token using the bidirectional context provided by the surrounding words.
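The following sketch illustrates the 80/10/10 corruption rule in PyTorch. It is a minimal illustration rather than BERT's actual preprocessing code; the MASK_ID and VOCAB_SIZE values correspond to the original uncased BERT vocabulary, and special tokens such as [CLS] and [SEP] are ignored for brevity.

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30522  # values from the original uncased BERT vocabulary

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Return (corrupted_ids, labels); labels are -100 at unselected positions."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100  # -100 is ignored by cross-entropy loss

    rand = torch.rand_like(input_ids, dtype=torch.float)
    corrupted = input_ids.clone()
    corrupted[selected & (rand < 0.8)] = MASK_ID              # 80% -> [MASK]
    random_pos = selected & (rand >= 0.8) & (rand < 0.9)      # 10% -> random token
    corrupted[random_pos] = torch.randint(
        VOCAB_SIZE, (int(random_pos.sum()),), device=input_ids.device
    )
    return corrupted, labels  # remaining 10% of selected tokens stay unchanged
```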
BERT also employed a secondary pretext task called next sentence prediction (NSP), in which the model received two sentences and predicted whether the second sentence followed the first in the original document. BERT was pretrained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words).
The BERT-Base configuration uses 12 transformer encoder layers, 12 attention heads, and a hidden size of 768, totaling 110 million parameters. BERT-Large scales to 24 layers, 16 heads, and a hidden size of 1,024, for a total of 340 million parameters. After pretraining, BERT set new state-of-the-art results on 11 NLP benchmarks, including SQuAD and GLUE.
Several variants refined the MLM objective:
- RoBERTa (2019) removed next sentence prediction, introduced dynamic masking, and trained longer on more data.
- ALBERT (2019) shared parameters across layers and replaced NSP with a sentence-order prediction task.
- SpanBERT (2020) masked contiguous spans of tokens rather than individual tokens.
- ELECTRA (2020) replaced masked prediction with replaced token detection, in which a discriminator learns to identify tokens substituted by a small generator network, so every input position contributes to the loss.
The GPT (Generative Pre-trained Transformer) series, developed by OpenAI, uses autoregressive language modeling as its self-supervised pretext task. The model processes a sequence of tokens from left to right and predicts the next token at each position. The training objective is to maximize the likelihood of the next token given all preceding tokens, using causal (left-to-right) attention masking so that each position can only attend to earlier positions.
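A minimal sketch of this objective is shown below; `model` stands for any causal transformer that applies left-to-right attention masking internally and returns one logit per vocabulary item at every position.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    logits = model(tokens[:, :-1])            # predict from all but the last token
    targets = tokens[:, 1:]                   # each position's label is the next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [B*(T-1), V]
        targets.reshape(-1),                  # [B*(T-1)]
    )
```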
GPT-1 (Radford et al., 2018) demonstrated that generative pretraining on a large text corpus followed by discriminative fine-tuning could achieve strong performance across diverse NLP tasks. GPT-2 (2019) scaled the approach to 1.5 billion parameters and showed that language models could perform tasks in a zero-shot setting without any fine-tuning. GPT-3 (2020) further scaled to 175 billion parameters and introduced few-shot in-context learning, where the model could perform new tasks by conditioning on a handful of examples in the prompt. GPT-4 (2023) and GPT-5 (2025) continued this trajectory, combining autoregressive pretraining with reinforcement learning from human feedback (RLHF) and other alignment techniques.
The autoregressive pretraining approach has also been adopted by many other large language models, including LLaMA (Meta), Mistral, Falcon, Qwen (Alibaba), and DeepSeek.
T5 (Text-to-Text Transfer Transformer), introduced by Raffel et al. at Google in 2019, unified all NLP tasks into a text-to-text format, where both the input and output are text strings. Its self-supervised pretraining objective is span corruption, a denoising task that randomly selects and drops out 15% of the tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single unique sentinel token (e.g., <X>, <Y>). The model is then trained to generate the missing spans, delimited by the corresponding sentinel tokens.
For example, given the input sentence "The quick brown fox jumps over the lazy dog," the words "brown fox" and "lazy" might be dropped out. The corrupted input would become "The quick <X> jumps over the <Y> dog," and the target output would be "<X> brown fox <Y> lazy <Z>." Because the model must reconstruct multiple consecutive tokens at once, span corruption encourages the learning of richer contextual representations than single-token masking.
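The toy sketch below reproduces this example at the word level. Real T5 operates on SentencePiece token IDs rather than words, and names its sentinels <extra_id_0>, <extra_id_1>, and so on, corresponding to the <X>, <Y>, <Z> placeholders above.

```python
# Toy, string-level sketch of span corruption with T5-style sentinel tokens.
def corrupt(words, spans):  # spans: list of (start, end) word indices to drop
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inp, tgt, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        inp += words[prev:s] + [sentinels[i]]   # replace span with a sentinel
        tgt += [sentinels[i]] + words[s:e]      # target: sentinel + dropped words
        prev = e
    inp += words[prev:]
    tgt += [sentinels[len(spans)]]              # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

words = "The quick brown fox jumps over the lazy dog".split()
print(corrupt(words, [(2, 4), (7, 8)]))
# ('The quick <extra_id_0> jumps over the <extra_id_1> dog',
#  '<extra_id_0> brown fox <extra_id_1> lazy <extra_id_2>')
```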
T5 uses an encoder-decoder transformer architecture, unlike the encoder-only BERT or decoder-only GPT. The original T5 paper systematically evaluated a wide range of pretraining objectives, architectures, and dataset sizes using the Colossal Clean Crawled Corpus (C4), a 750 GB dataset derived from Common Crawl. T5 models range from T5-Small (60 million parameters) to T5-11B (11 billion parameters). Subsequent versions include mT5 (multilingual T5), Flan-T5 (instruction-tuned), and UL2, which combines multiple pretraining objectives.
| Method | Pretext task | Architecture | Directionality | Key innovation | Notable models |
|---|---|---|---|---|---|
| Masked language modeling | Predict masked tokens from context | Encoder-only | Bidirectional | Learns from both left and right context simultaneously | BERT, RoBERTa, ALBERT |
| Next-token prediction | Predict next token autoregressively | Decoder-only | Left-to-right (causal) | Scales naturally to generation tasks; enables in-context learning | GPT, LLaMA, Mistral |
| Span corruption | Reconstruct corrupted spans delimited by sentinels | Encoder-decoder | Bidirectional encoder, autoregressive decoder | Predicts multi-token spans; shorter target sequences reduce training cost | T5, mT5, UL2 |
| Replaced token detection | Detect which tokens were replaced by a generator | Encoder-only | Bidirectional | All tokens contribute to the loss, not just 15% | ELECTRA |
| Permutation language modeling | Predict tokens in random order to capture bidirectional context | Autoregressive (Transformer-XL based) | Permuted | Combines benefits of autoregressive and bidirectional models | XLNet |
Applying self-supervised learning to images presents a different set of challenges compared to text. Language has a natural sequential structure and discrete tokens, whereas images are high-dimensional, continuous signals without an obvious ordering. Early SSL methods in vision used hand-designed pretext tasks such as predicting image rotations, solving jigsaw puzzles, or colorizing grayscale images. While these methods yielded useful representations, they were eventually surpassed by contrastive learning and masked image modeling approaches that learn more general features.
Before the contrastive learning era, researchers devised several creative pretext tasks for visual SSL:
- Context prediction (Doersch et al., 2015): predict the relative spatial position of one image patch with respect to another.
- Inpainting (Pathak et al., 2016): fill in a missing region of an image conditioned on its surroundings.
- Colorization (Zhang et al., 2016): predict the color channels of an image from its grayscale version.
- Jigsaw puzzles (Noroozi and Favaro, 2016): recover the correct arrangement of shuffled image tiles.
- Rotation prediction (Gidaris et al., 2018): classify which of four rotations (0°, 90°, 180°, 270°) was applied to an image.
While these methods produced representations that outperformed random initialization, they often encoded task-specific biases. For example, a rotation predictor might focus on texture cues rather than semantic content. The move to contrastive and masked modeling methods addressed this limitation by learning more general-purpose features.
SimCLR (A Simple Framework for Contrastive Learning of Visual Representations), introduced by Chen et al. at Google Research in February 2020, demonstrated that a straightforward contrastive framework could match or exceed earlier, more complex methods. The SimCLR pipeline has four components:
- A stochastic data augmentation module that produces two correlated views of each image through random cropping, color distortion, and Gaussian blur.
- A base encoder network (a ResNet) that extracts representation vectors from the augmented views.
- A small MLP projection head that maps representations into the space where the contrastive loss is applied.
- The normalized temperature-scaled cross-entropy (NT-Xent) loss, which requires each view to identify its counterpart among all other examples in the batch.
SimCLR requires very large batch sizes (4,096 in the original paper) to provide a sufficient number of negative examples. After pretraining, the projection head is discarded, and the encoder is used for downstream tasks. SimCLR achieved 69.3% top-1 accuracy on ImageNet under linear evaluation with a standard ResNet-50 backbone, rising to 76.5% with a 4x-wider ResNet-50 (4×). SimCLR v2 (2020) scaled the framework to larger ResNet models and introduced a semi-supervised variant that leveraged a small amount of labeled data.
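A compact sketch of the NT-Xent loss is given below, assuming z1 and z2 are the projection-head outputs for two augmented views of the same batch; for each embedding, the matching view must be identified among the 2N - 1 other embeddings.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # [2N, D], unit norm
    sim = z @ z.t() / tau                                # temperature-scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # positive = the other view
```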
MoCo (Momentum Contrast), introduced by He et al. at Meta AI Research in November 2019, tackled the large-batch requirement of contrastive learning by maintaining a dynamic queue of negative representations. The key components are:
- A query encoder, updated by backpropagation, that encodes the current batch.
- A key encoder, updated as an exponential moving average (momentum update) of the query encoder, which keeps the encoded keys consistent over time.
- A fixed-size queue that stores keys from recent batches and serves as a large pool of negatives, decoupling the number of negatives from the batch size.
MoCo v1 achieved competitive results with SimCLR while using a batch size of only 256. MoCo v2 (2020) incorporated improvements inspired by SimCLR, including an MLP projection head and stronger augmentations. MoCo v3 (Chen et al., 2021) adapted the framework for Vision Transformers (ViT) and addressed training instability issues that arise when using transformers with contrastive objectives.
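The two core mechanisms can be sketched as follows; the encoder modules and queue tensor are assumptions, and details from the original implementation (such as shuffling batch-norm statistics across GPUs) are omitted.

```python
import torch

@torch.no_grad()
def momentum_update(query_enc, key_enc, m: float = 0.999):
    # EMA update of the key encoder: k = m * k + (1 - m) * q
    for q, k in zip(query_enc.parameters(), key_enc.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # Drop the oldest keys and append the newest batch (queue shape: [K, D]).
    return torch.cat([queue[keys.size(0):], keys], dim=0)
```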
BYOL (Bootstrap Your Own Latent), published by Grill et al. at DeepMind in June 2020, challenged the assumption that negative pairs are essential for contrastive learning. BYOL uses two networks:
- An online network, consisting of an encoder, a projector, and a predictor, trained by gradient descent.
- A target network with the same encoder and projector architecture, whose weights are an exponential moving average of the online network's weights and which receives no gradient updates.
The online network is trained to predict the target network's representation of a differently augmented view of the same image. Because the target network updates slowly via the momentum mechanism, it provides a stable regression target. BYOL avoids the need for negative examples entirely, which makes it more robust to batch size variations.
BYOL achieved 74.3% top-1 accuracy on ImageNet using linear evaluation with a ResNet-50 encoder and 79.6% with a larger ResNet. It demonstrated stable performance across batch sizes ranging from 256 to 4,096, whereas SimCLR's performance degrades significantly with smaller batches.
SimSiam (Exploring Simple Siamese Representation Learning), published by Chen and He at Meta AI in 2021, further simplified non-contrastive learning by removing the momentum encoder entirely. SimSiam uses a simple Siamese network with a shared encoder and a prediction MLP applied to one branch. The critical innovation is a stop-gradient operation: one branch's output is treated as a fixed target (receives no gradient), while the other branch is trained to predict it.
This design means SimSiam requires neither negative pairs (like SimCLR), nor a momentum encoder (like BYOL and MoCo), nor large batches. Despite its simplicity, SimSiam achieves competitive performance on ImageNet, with 71.3% top-1 accuracy under linear evaluation using ResNet-50. The paper provided theoretical analysis suggesting that SimSiam implicitly performs a form of Expectation-Maximization (EM) optimization, alternating between clustering assignments and representation updates.
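The SimSiam update fits in a few lines; the sketch below mirrors the pseudocode style of the original paper, with `f` denoting the shared encoder plus projection MLP and `h` the prediction MLP (both assumed to be supplied by the caller).

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)          # projections of two augmented views
    p1, p2 = h(z1), h(z2)          # predictions from the prediction MLP
    def d(p, z):
        # Negative cosine similarity; .detach() is the stop-gradient.
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)  # symmetrized loss
```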
DINO (Self-Distillation with No Labels), introduced by Caron et al. at Meta AI Research in April 2021, is a self-supervised method based on self-distillation using Vision Transformers. DINO uses a student-teacher framework where both networks share the same architecture:
- The student network processes both global and small local crops of an image and is trained by gradient descent.
- The teacher network processes only the global crops, and its weights are an exponential moving average of the student's weights.
- The student is trained to match the teacher's output distribution via a cross-entropy loss, with centering and sharpening of the teacher outputs preventing representational collapse.
A notable discovery from DINO was that self-supervised Vision Transformers learn to segment objects without any explicit segmentation supervision. The self-attention maps of the final layer's [CLS] token clearly delineate object boundaries, a property that does not emerge as clearly in supervised ViTs or in convolutional networks.
DINO achieved 80.1% top-1 accuracy on ImageNet with linear evaluation using ViT-Base. DINOv2 (Oquab et al., 2023) scaled the method to 142 million curated images and a ViT-Giant model with over 1 billion parameters, using a combination of self-distillation and masked image modeling. DINOv2 achieved state-of-the-art results across classification, segmentation, and depth estimation without any fine-tuning, producing visual features that work out-of-the-box with simple linear classifiers. DINOv3 (2025) further scaled to 1.7 billion images and a 7-billion-parameter ViT teacher, narrowing the gap with fully supervised models across vision benchmarks.
BEiT (BERT Pre-Training of Image Transformers), introduced by Bao et al. at Microsoft Research in 2021, was the first method to make self-supervised pre-training of Vision Transformers outperform supervised pre-training. BEiT adapts the masked language modeling concept from BERT to images:
- Each image is represented in two ways: as a grid of patches that form the model's input, and as a sequence of discrete visual tokens produced by the pre-trained discrete VAE tokenizer from DALL-E.
- Roughly 40% of the image patches are masked using a blockwise masking strategy.
- The transformer is trained to predict the visual tokens corresponding to the masked patches.
By predicting discrete visual tokens rather than raw pixels, BEiT avoids the pixel-level regression problem and instead frames pre-training as a classification task over visual vocabulary. BEiT-Base achieved 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming the DeiT supervised baseline (81.8%). BEiT-Large reached 86.3% using only ImageNet-1K data.
MAE (Masked Autoencoders Are Scalable Vision Learners), introduced by He et al. at Meta AI Research in November 2021, adapted the masked prediction idea from NLP to computer vision with a simpler approach than BEiT. MAE applies the following procedure:
- The image is divided into non-overlapping patches, and a large random subset of them (75%) is masked out.
- The encoder, a Vision Transformer, processes only the visible patches.
- A lightweight decoder receives the encoded visible patches together with learnable mask tokens and reconstructs the pixel values of the masked patches.
- The training loss is the mean squared error between reconstructed and original pixels, computed only on the masked patches.
The asymmetric encoder-decoder design is critical: because the encoder processes only 25% of the patches, pretraining is highly efficient in both computation and memory. The high masking ratio forces the model to develop a holistic understanding of the image rather than relying on local interpolation from nearby visible patches. Unlike BEiT, MAE reconstructs raw pixel values directly using mean squared error loss, making it simpler (no need for a pre-trained tokenizer) and 3.5 times faster per epoch.
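The masking step at the heart of this efficiency can be sketched as below, under the assumption that `patches` is a batch of already-embedded patch tokens; only the surviving 25% ever enters the encoder.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)  # random score per patch
    shuffle = noise.argsort(dim=1)                   # random permutation of patches
    keep = shuffle[:, :n_keep]                       # indices of visible patches
    visible = torch.gather(
        patches, 1, keep.unsqueeze(-1).expand(B, n_keep, D)
    )
    return visible, keep  # encoder sees `visible`; decoder reconstructs the rest
```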
MAE with a ViT-Huge encoder achieved 87.8% top-1 accuracy on ImageNet after fine-tuning, establishing a new state of the art for self-supervised methods at the time. The approach also scales well to video (VideoMAE) and audio (Audio-MAE).
| Method | Year | Authors / Lab | Approach | Requires negatives | Architecture | ImageNet top-1 (linear eval unless noted) |
|---|---|---|---|---|---|---|
| SimCLR | 2020 | Chen et al. / Google | Contrastive (NT-Xent loss, large batches) | Yes | ResNet-50 (4×) | 76.5% |
| MoCo v2 | 2020 | He et al. / Meta AI | Contrastive (momentum queue) | Yes | ResNet-50 | 71.1% |
| BYOL | 2020 | Grill et al. / DeepMind | Non-contrastive (online-target prediction) | No | ResNet-50 | 74.3% |
| SimSiam | 2021 | Chen & He / Meta AI | Non-contrastive (stop-gradient Siamese) | No | ResNet-50 | 71.3% |
| Barlow Twins | 2021 | Zbontar et al. / Meta AI | Non-contrastive (redundancy reduction) | No | ResNet-50 | 73.2% |
| BEiT | 2021 | Bao et al. / Microsoft | Masked image modeling (token prediction) | No | ViT-Base (fine-tuned) | 83.2% |
| DINO | 2021 | Caron et al. / Meta AI | Self-distillation (student-teacher) | No | ViT-Base | 80.1% |
| MAE | 2022 | He et al. / Meta AI | Masked image modeling (pixel reconstruction) | No | ViT-Huge (fine-tuned) | 87.8% |
| DINOv2 | 2023 | Oquab et al. / Meta AI | Self-distillation at scale (142M images) | No | ViT-Giant | 86.5% |
Speech and audio present unique challenges for self-supervised learning. Audio signals are continuous waveforms with temporal structure, and the relationship between acoustic features and linguistic content is complex and variable across speakers, accents, and recording conditions. Self-supervised methods for speech typically operate on raw waveform inputs or spectral features and learn representations that encode phonetic, speaker, and prosodic information.
Wav2Vec 2.0, introduced by Baevski et al. at Meta AI Research in June 2020, combines contrastive learning with quantization to learn speech representations from raw audio. The architecture has three components:
- A convolutional feature encoder that maps the raw waveform to a sequence of latent speech representations.
- A transformer context network that builds contextualized representations from the latent sequence, spans of which are masked during pretraining.
- A quantization module that discretizes the latent representations into a finite vocabulary of learned speech units; the contrastive objective requires the model to identify the true quantized latent for each masked position among a set of distractors.
Wav2Vec 2.0 demonstrated dramatic improvements in low-resource speech recognition. When pre-trained on 53,000 hours of unlabeled audio from LibriVox and fine-tuned on just 10 minutes of labeled data, it achieved a word error rate of 4.8/8.2 on the LibriSpeech test-clean/test-other benchmarks. Using only one hour of labeled data, it outperformed the previous state of the art trained on 100 times more labeled data.
HuBERT (Hidden-Unit BERT), introduced by Hsu et al. at Meta AI Research in June 2021, takes a different approach to self-supervised speech representation learning. Instead of contrastive learning with online quantization, HuBERT uses an offline clustering step to generate pseudo-labels:
- Acoustic features (initially MFCCs) are clustered with k-means, and the cluster assignments serve as discrete pseudo-labels for every frame.
- The model is trained with a BERT-style masked prediction objective: spans of the input are masked, and the model predicts the cluster label of each masked frame.
- The procedure is iterated: after a round of training, features from an intermediate transformer layer replace the MFCCs for re-clustering, yielding progressively better pseudo-labels.
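The first clustering iteration can be sketched with scikit-learn as below; `mfcc_frames` is an assumed array of frame-level MFCC features collected from the unlabeled corpus, and 100 clusters matches the first HuBERT iteration.

```python
from sklearn.cluster import KMeans

# Cluster frame-level acoustic features into discrete pseudo-units.
kmeans = KMeans(n_clusters=100, n_init=10).fit(mfcc_frames)
pseudo_labels = kmeans.labels_  # one discrete unit ID per frame
# These cluster IDs become the prediction targets of the masked prediction task.
```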
HuBERT matches or exceeds Wav2Vec 2.0 performance across all fine-tuning subsets of LibriSpeech. When pretrained on the Libri-Light 60,000-hour dataset, HuBERT achieved state-of-the-art results on several speech processing benchmarks. The approach has also been extended to multilingual settings and speech generation tasks.
Beyond Wav2Vec 2.0 and HuBERT, several other self-supervised methods have advanced speech representation learning:
- WavLM (Microsoft, 2021) extends HuBERT's masked prediction with a denoising objective in which overlapped and noisy speech is simulated during pretraining, improving performance on speaker-related tasks.
- data2vec (Meta AI, 2022) predicts contextualized latent targets produced by a teacher network and applies the same framework to speech, vision, and text.
- w2v-BERT (Google, 2021) combines contrastive learning with masked prediction in a single end-to-end model.
A fundamental distinction in self-supervised learning, particularly in computer vision, is between contrastive and non-contrastive methods. Both approaches learn by comparing different views of the same input, but they differ in how they prevent the model from learning trivial solutions (representational collapse).
Contrastive methods explicitly use negative examples to prevent collapse. The loss function pulls together representations of positive pairs (different views of the same input) while pushing apart representations of negative pairs (views of different inputs). The InfoNCE loss, used in many contrastive frameworks, formalizes this as a softmax classification problem over one positive and many negative examples.
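For a query embedding q with positive key k+ and negatives k_1, ..., k_K, the InfoNCE loss can be written as:

$$
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\operatorname{sim}(q, k^{+}) / \tau\right)}{\sum_{i=0}^{K} \exp\left(\operatorname{sim}(q, k_i) / \tau\right)}
$$

where sim denotes a similarity function (typically cosine similarity or dot product), τ is a temperature hyperparameter, and by convention k_0 = k+, so the denominator ranges over the positive and all K negatives.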
The effectiveness of contrastive learning depends on the quality and quantity of negative examples. SimCLR addresses this by using very large batch sizes (4,096), while MoCo maintains a separate queue of negatives. A key challenge is that contrastive methods can become inefficient in high-dimensional representation spaces, where the number of negatives required for effective training grows substantially.
Non-contrastive methods avoid the need for negative examples entirely. Instead, they prevent collapse through architectural asymmetries, regularization techniques, or information-theoretic objectives. The major non-contrastive approaches include:
- Architectural asymmetry: BYOL and SimSiam break the symmetry between branches with a predictor network, combined with a momentum encoder (BYOL) or a stop-gradient operation (SimSiam).
- Self-distillation: DINO matches student and teacher output distributions, using centering and sharpening of the teacher outputs to avoid collapse.
- Redundancy reduction: Barlow Twins drives the cross-correlation matrix between the embeddings of two views toward the identity matrix.
- Variance-covariance regularization: VICReg explicitly maintains the variance of each embedding dimension and decorrelates pairs of dimensions.
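As one concrete example of a regularization-based objective, the sketch below implements the Barlow Twins loss; z1 and z2 are assumed to be projection outputs for two views of the same batch, and λ = 5e-3 is the weighting reported in the paper.

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5e-3):
    N, D = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)        # standardize along the batch dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / N                     # [D, D] cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # push diagonal to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push rest to 0
    return on_diag + lam * off_diag
```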
| Aspect | Contrastive methods | Non-contrastive methods |
|---|---|---|
| Negative examples | Required; more negatives generally improve performance | Not required |
| Batch size sensitivity | Performance often depends on large batch sizes or external memory | Generally more robust to batch size |
| Loss function | InfoNCE, NT-Xent | MSE, cross-correlation, variance/covariance regularization |
| Collapse prevention | Explicit repulsion of negatives | Architectural asymmetry, regularization, or information-theoretic constraints |
| Computational cost | Can be expensive due to large batches or memory banks | Typically lower, but requires careful design to avoid collapse |
| Examples | SimCLR, MoCo, CLIP | BYOL, DINO, SimSiam, Barlow Twins, VICReg |
At a higher level of abstraction, self-supervised learning methods can be divided into two broad paradigms: joint-embedding methods and generative methods. This distinction, highlighted in the "Cookbook of Self-Supervised Learning" survey (Balestriero et al., 2023), captures a fundamental design choice about how the learning signal is constructed.
Joint-embedding methods (also called embedding-based or energy-based methods) map two views of the same input into a shared representation space and train the model to make the two embeddings similar. The model never reconstructs the raw input. SimCLR, MoCo, BYOL, DINO, Barlow Twins, and VICReg all fall into this category. The advantage is that the model is free to discard low-level details (exact pixel values, noise) and focus on high-level semantic content.
Generative methods reconstruct some form of the original input from a corrupted or partial version. Masked language modeling (BERT), next-token prediction (GPT), and masked image modeling (MAE, BEiT) are generative in nature. These methods provide a dense training signal (every masked position contributes to the loss), but they require the model to allocate capacity to low-level reconstruction, which may not always be useful for downstream tasks.
In practice, the most successful recent systems combine elements of both. DINOv2, for example, uses a joint-embedding self-distillation objective alongside a masked image modeling objective. The I-JEPA and V-JEPA frameworks represent a hybrid approach that performs prediction in representation space rather than input space.
The Joint-Embedding Predictive Architecture (JEPA) is a framework for self-supervised learning proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a departure from both contrastive methods and pixel-level generative methods, instead learning to predict in a learned abstract representation space.
In a JEPA, two encoder networks map inputs x and y into embedding spaces, producing representations sx and sy. A predictor network takes sx (and optionally a latent variable z) and predicts sy. The key principles are:
- Prediction happens in representation space rather than input space, so the encoders are free to discard unpredictable or semantically irrelevant detail.
- The latent variable z captures information needed for the prediction that is not available in x, allowing the architecture to handle uncertainty about y.
- Collapse, in which the encoders map every input to the same representation, is prevented with non-contrastive techniques such as regularizing the information content of the embeddings.
By predicting in representation space rather than in input space, JEPA avoids the need to model irrelevant low-level details (exact pixel values, background textures) and instead focuses on capturing high-level semantic content. This is a central motivation: LeCun argues that predicting every pixel in an image or every sample in an audio waveform wastes model capacity on perceptually irrelevant variation.
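A schematic JEPA training step might look like the sketch below. The module names are illustrative assumptions; in practice `enc_y` is typically an exponential moving average of `enc_x`, and the optional latent variable z is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def jepa_step(enc_x, enc_y, pred, x, y):
    s_x = enc_x(x)                    # context representation
    with torch.no_grad():
        s_y = enc_y(y)                # target representation (no gradient)
    s_y_hat = pred(s_x)               # predict the target in latent space
    return F.mse_loss(s_y_hat, s_y)   # loss measured in representation space
```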
I-JEPA (Image-based JEPA), introduced by Assran et al. at Meta AI in 2023, applies the JEPA framework to images. The method works as follows:
- A single large context block is sampled from the image, and several target blocks are sampled from the remaining regions.
- A context encoder (a Vision Transformer) processes the visible context block.
- A target encoder, maintained as an exponential moving average of the context encoder, produces representations of the target blocks.
- A predictor, conditioned on positional information about each target block, predicts the target representations from the context representation, with the loss computed in representation space rather than pixel space.
I-JEPA differs from MAE in that it predicts in representation space, not in pixel space. This design learns representations that emphasize semantic content over low-level texture and color information.
V-JEPA (Video JEPA), introduced by Bardes et al. at Meta AI in 2024, extends the JEPA framework to video. The model predicts masked spatio-temporal regions in a learned latent space, learning from the temporal structure of video without any text supervision, negative examples, or pixel-level reconstruction. V-JEPA pretraining is based solely on an unsupervised feature prediction objective.
V-JEPA 2 (2025) scaled the approach to over one million hours of internet video data and combined it with a small amount of robot interaction data. It achieved 77.3% top-1 accuracy on Something-Something v2 for motion understanding and state-of-the-art performance on human action anticipation on Epic-Kitchens-100.
LeJEPA, introduced by LeCun and Balestriero at Meta in late 2025, simplified the JEPA framework by combining the JEPA predictive loss with SIGReg (Sketched Isotropic Gaussian Regularization). LeJEPA removes the need for many of the engineering heuristics that earlier self-supervised methods relied on, such as momentum encoders, stop-gradients, and asymmetric architectures. The method can be implemented in approximately 50 lines of code, making it one of the most accessible self-supervised learning algorithms to date.
| Feature | Contrastive SSL | Generative SSL (MAE) | JEPA |
|---|---|---|---|
| Prediction space | Embedding similarity | Input (pixel) space | Learned representation space |
| Negative examples | Required | Not applicable | Not required |
| What is predicted | Whether two views match | Missing pixels or tokens | Abstract representations of missing regions |
| Low-level detail modeling | Avoided via embedding space | Required (pixel reconstruction) | Avoided by design |
| Flexibility | Primarily two views of the same input | Masked input reconstruction | Spatial, temporal, and cross-modal prediction |
| Examples | SimCLR, MoCo, CLIP | MAE, BEiT | I-JEPA, V-JEPA, LeJEPA |
Self-supervised learning has extended beyond single modalities to learn joint representations across vision, language, and audio.
CLIP (Contrastive Language-Image Pre-training), introduced by Radford et al. at OpenAI in January 2021, learns visual representations from natural language supervision. CLIP jointly trains an image encoder (a Vision Transformer or ResNet) and a text encoder (a transformer-based language model) on 400 million image-text pairs collected from the internet. The contrastive objective maximizes the cosine similarity between matching image-text pairs and minimizes it for non-matching pairs within each minibatch of 32,768 examples.
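The objective can be sketched as a symmetric cross-entropy over the pairwise similarity matrix, as below; the two encoders are assumed to be provided, and the temperature (a learned parameter in the actual model) is fixed here for simplicity.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / tau                          # [N, N] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # pair i matches pair i
    return 0.5 * (F.cross_entropy(logits, targets)        # image -> text direction
                  + F.cross_entropy(logits.t(), targets)) # text -> image direction
```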
CLIP enables zero-shot image classification: given an image and a set of textual class descriptions, the model selects the description whose embedding is most similar to the image embedding. Without any fine-tuning on ImageNet, CLIP matched the accuracy of a fully supervised ResNet-50. CLIP representations have become widely used as conditioning signals in text-to-image generation models such as Stable Diffusion and DALL-E.
Self-supervised learning is most commonly used as the first stage of a two-stage pipeline: pre-training followed by adaptation. This pipeline has become the dominant approach for building modern AI systems and is central to the concept of pre-trained models.
A large neural network (typically a transformer) is trained on a pretext task using a large corpus of unlabeled data. The goal is to learn general-purpose representations that capture the structure of the data domain. This stage is computationally expensive (often requiring thousands of GPU-hours, and millions for the largest foundation models) but needs to be performed only once.
The pretrained model is adapted to a specific downstream task using one of several strategies:
- Fine-tuning: all model parameters are updated on the downstream labeled data.
- Linear probing: the pretrained encoder is frozen and only a lightweight classifier head is trained.
- Parameter-efficient fine-tuning: a small number of additional parameters (adapters, low-rank updates such as LoRA, or soft prompts) are trained while the backbone stays frozen.
- Prompting and in-context learning: for large language models, the task is specified through instructions or examples in the input, with no weight updates at all.
This two-stage pipeline is the foundation of the foundation model paradigm, in which a single large pre-trained model serves as a starting point for many different tasks and applications.
Evaluating the quality of self-supervised representations is a critical and nuanced problem. Because SSL methods do not optimize for any specific downstream task, researchers use several standardized evaluation protocols to assess how useful the learned representations are.
Linear probing (also called linear evaluation) is the most widely used evaluation protocol for SSL. A linear classifier (single fully-connected layer) is trained on top of the frozen pretrained encoder using a labeled dataset such as ImageNet. The pretrained encoder's weights are not updated during this process. High linear probing accuracy indicates that the pretrained features are linearly separable with respect to the downstream task, meaning they already encode semantically meaningful information.
Linear probing is favored because it isolates the quality of the representations from the capacity of the downstream model. If a complex neural network is used for evaluation, it might compensate for poor representations through its own learning.
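A minimal linear-probing loop might look like the sketch below; the encoder, data loader, and dimensions are assumptions, and standard details such as learning-rate schedules and data augmentation are omitted.

```python
import torch
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim: int, num_classes: int, epochs: int = 10):
    encoder.eval()                          # freeze batch-norm/dropout behavior
    for p in encoder.parameters():
        p.requires_grad_(False)             # encoder weights are never updated
    probe = torch.nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = encoder(x)          # frozen pretrained features
            loss = F.cross_entropy(probe(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```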
In fine-tuning evaluation, the entire pretrained model is trained end-to-end on the downstream task. All parameters are updated, allowing the representations to adapt to the specific task and dataset. Fine-tuning generally produces higher accuracy than linear probing because the model can adjust its features. However, it provides less insight into the intrinsic quality of the pretrained representations, since a powerful model architecture can partially compensate for weaker pre-training.
k-NN evaluation extracts features from the frozen pretrained encoder for both training and test images, then classifies each test image by majority vote among its k nearest neighbors in the training set (measured by Euclidean distance or cosine similarity in the feature space). This protocol requires no training at all, making it fast and computationally lightweight. k-NN accuracy is highly correlated with linear probing accuracy when embedding normalization is applied, and the two metrics can often be used interchangeably.
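With features already extracted from the frozen encoder (the arrays below are assumptions), the entire protocol reduces to a few lines of scikit-learn; k = 20 with cosine distance is a common choice, used for example in the DINO evaluations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Classify each test feature by majority vote among its 20 nearest training features.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(train_feats, train_labels)      # "training" just stores the feature bank
accuracy = knn.score(test_feats, test_labels)
```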
| Protocol | Pretrained weights | Evaluation model | Computational cost | What it measures |
|---|---|---|---|---|
| Linear probing | Frozen | Single linear layer | Low | Quality of frozen representations |
| Fine-tuning | Updated | Full pretrained model | High | Upper bound on task performance with pretrained initialization |
| k-NN | Frozen | None (nearest neighbor lookup) | Very low | Cluster structure of the representation space |
| Few-shot | Frozen or minimal adaptation | Linear or lightweight head | Low | Generalization from very few labeled examples |
Self-supervised learning is the engine behind the foundation model paradigm, in which a single large model is pretrained on broad data and then adapted to many downstream tasks.
The effectiveness of self-supervised pretraining scales with both model size and data volume. In NLP, scaling from GPT-1 (117M parameters, approximately 5 GB text) to GPT-3 (175B parameters, approximately 570 GB text) led to dramatic improvements in few-shot and zero-shot capabilities. In vision, DINOv2 demonstrated that self-supervised models trained on 142 million curated images produce features that rival or exceed supervised pretraining for classification, segmentation, and depth estimation. DINOv3 (2025) pushed this further with 1.7 billion images and a 7-billion-parameter ViT.
Self-supervised pretrained models require far less labeled data for downstream tasks compared to training from scratch. Wav2Vec 2.0 demonstrated that just 10 minutes of labeled speech data, combined with self-supervised pretraining, can achieve competitive speech recognition. In NLP, BERT fine-tuned on a few thousand labeled examples routinely outperforms models trained from scratch on much larger labeled datasets.
Large self-supervised models exhibit capabilities that are not explicitly trained for:
- In-context learning: GPT-3 can perform new tasks from a few examples in the prompt, despite being trained only on next-token prediction.
- Emergent segmentation: DINO's self-attention maps delineate object boundaries without any segmentation labels.
- Zero-shot transfer: CLIP classifies images into categories it never saw as explicit training labels.
- Cross-lingual transfer: multilingual models pretrained with masked language modeling can transfer task knowledge across languages with little or no target-language supervision.
| Modality | Method | Pretext task | Key result |
|---|---|---|---|
| Text | BERT | Masked language modeling | SOTA on 11 NLP benchmarks at release (2018) |
| Text | GPT-3 | Next-token prediction | Few-shot learning without fine-tuning (2020) |
| Text | T5 | Span corruption | Unified text-to-text framework across NLP tasks (2019) |
| Images | SimCLR | Contrastive (augmented views) | 76.5% ImageNet linear eval with ResNet-50 (4×) (2020) |
| Images | MoCo | Contrastive (momentum queue) | Decoupled batch size from number of negatives (2019) |
| Images | BYOL | Non-contrastive (self-prediction) | 74.3% ImageNet without negative examples (2020) |
| Images | SimSiam | Non-contrastive (stop-gradient) | Simplified non-contrastive SSL without momentum (2021) |
| Images | BEiT | Masked image modeling (token prediction) | First SSL to outperform supervised ViT pre-training (2021) |
| Images | DINO | Self-distillation | Emergent object segmentation in ViT attention maps (2021) |
| Images | MAE | Masked image modeling | 87.8% ImageNet fine-tuned with ViT-Huge (2022) |
| Images | I-JEPA | JEPA (representation prediction) | Semantic features without pixel reconstruction (2023) |
| Speech | Wav2Vec 2.0 | Contrastive with quantization | Competitive ASR with 10 min labeled data (2020) |
| Speech | HuBERT | Masked prediction with offline clustering | Matched/exceeded Wav2Vec 2.0 across benchmarks (2021) |
| Multimodal | CLIP | Image-text contrastive | Zero-shot ImageNet classification matching supervised ResNet-50 (2021) |
| Video | V-JEPA | Spatio-temporal representation prediction | Action understanding without text or pixel reconstruction (2024) |
Self-supervised learning has shown transformative results in a variety of domains, including:
Computer vision: SSL techniques have been used to learn powerful representations from large-scale image datasets, which can then be fine-tuned for tasks like object detection, image segmentation, and classification. DINOv2 features serve as general-purpose visual features for medical imaging, autonomous driving, and robotics applications.
Natural language processing: Language models like BERT and GPT have achieved state-of-the-art results on numerous NLP benchmarks by leveraging self-supervised pre-training on large text corpora. The GPT series demonstrated that scaling autoregressive pretraining leads to emergent few-shot and zero-shot capabilities.
Speech recognition: Wav2Vec 2.0 and HuBERT have dramatically reduced the amount of labeled data required for speech recognition systems, enabling competitive performance in low-resource languages where labeled transcriptions are scarce.
Reinforcement learning: SSL has been used to learn useful features from raw sensory data in reinforcement learning settings, enabling agents to learn more efficiently and generalize better across tasks. V-JEPA 2 has demonstrated the potential for self-supervised video models to support planning in physical environments.
Medical imaging: Self-supervised pretraining on unlabeled medical scans, followed by fine-tuning on small labeled datasets, has improved diagnostic accuracy for radiology, pathology, and dermatology applications. Studies have shown that SSL-pretrained models can match fully supervised models while requiring 5 to 10 times fewer labeled examples.
Robotics: Self-supervised visual and multimodal representations are used for manipulation tasks, navigation, and sim-to-real transfer, where labeled data is particularly difficult to collect.
Despite its successes, self-supervised learning faces several ongoing challenges:
- Computational cost: pretraining large models on massive unlabeled corpora requires substantial compute, energy, and engineering resources.
- Pretext-task design: the choice of pretext task and data augmentations strongly shapes which invariances are learned, and a mismatch with the downstream task can hurt transfer.
- Representational collapse: joint-embedding methods require careful mechanisms (negatives, stop-gradients, regularization) to avoid trivial constant solutions.
- Evaluation: no single protocol fully captures representation quality, and conclusions can differ across linear probing, fine-tuning, and k-NN evaluation.
- Data quality and bias: models trained on uncurated web-scale data can absorb and amplify biases present in that data.
- Theory: a complete theoretical account of why self-supervised objectives produce transferable representations remains an open problem.