Self-supervised learning (SSL) is a machine learning paradigm in which models learn representations from unlabeled data by solving pretext tasks that generate supervisory signals from the data itself. Instead of relying on human-annotated labels, self-supervised methods exploit the inherent structure of the data to create pseudo-labels or prediction targets. This approach has become the dominant pre-training strategy for foundation models across computer vision, natural language processing, speech, and multimodal AI. By 2025, virtually every major foundation model, from GPT-4 to Meta's DINOv3, relies on some form of self-supervised pre-training.
Labeled data is expensive, time-consuming, and sometimes impossible to obtain at scale. The full ImageNet dataset alone contains over 14 million human-annotated images, each labeled by hand. Medical imaging datasets require expert radiologists or pathologists. Legal and financial document labeling demands domain specialists. Self-supervised learning sidesteps this bottleneck by learning from the vast quantities of unlabeled data that exist naturally: images on the web, text corpora, audio recordings, and video streams.
The practical benefits are substantial. Models pre-trained with self-supervised objectives learn general-purpose representations that transfer effectively to downstream tasks with minimal labeled data. In many cases, SSL pre-trained models match or outperform their fully supervised counterparts. This has shifted the standard workflow in both research and industry from "collect labels, then train" to "pre-train on unlabeled data, then fine-tune with a small labeled set."
Self-supervised methods fall into three broad categories: contrastive, generative (reconstructive), and predictive.
Contrastive learning methods learn representations by pulling similar (positive) pairs closer together in embedding space while pushing dissimilar (negative) pairs apart. Positive pairs are typically created through data augmentation: two different augmented views of the same image or two overlapping text spans. Negative pairs come from other samples in the batch or a memory bank.
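The contrastive objective can be made concrete with a minimal NumPy sketch of the InfoNCE (NT-Xent) loss used by methods like SimCLR. The function name and the use of NumPy (rather than a deep learning framework) are illustrative choices, not taken from any particular implementation; in practice this loss is computed on GPU with automatic differentiation.

```python
import numpy as np

def info_nce_loss(z_i, z_j, temperature=0.5):
    """InfoNCE / NT-Xent loss over a batch of positive pairs.

    z_i, z_j: (N, D) embeddings of two augmented views; row k of z_i
    and row k of z_j form a positive pair, and every other sample in
    the batch serves as a negative.
    """
    z = np.concatenate([z_i, z_j], axis=0)            # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / temperature                       # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    N = z_i.shape[0]
    # Each sample's positive sits N rows away: i <-> i + N.
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])
    log_prob = sim[np.arange(2 * N), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
zi = rng.normal(size=(4, 8))
aligned_loss = info_nce_loss(zi, zi.copy())              # views agree
random_loss = info_nce_loss(zi, rng.normal(size=(4, 8))) # views unrelated
```

When the two views produce matching embeddings the loss is lower than for unrelated views, which is exactly the pressure that pulls positives together and pushes negatives apart.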
Prominent contrastive methods include SimCLR (Chen et al., 2020, Google) [1], MoCo (He et al., 2019, Facebook AI) [2], and CLIP (Radford et al., 2021, OpenAI) [3]. A related class of methods, sometimes called "non-contrastive," achieves similar goals without explicit negatives. BYOL (Grill et al., 2020, DeepMind) [4] uses a momentum-updated target network. Barlow Twins (Zbontar et al., 2021, Facebook AI) [5] minimizes redundancy between embedding dimensions. VICReg (Bardes et al., 2022, Meta AI) [6] regularizes variance, invariance, and covariance of embeddings.
Generative methods learn by reconstructing corrupted or incomplete inputs. The model receives a partial or noisy version of the data and must predict the missing or original content. This forces the model to build rich internal representations of the data distribution.
In NLP, masked language modeling (as in BERT) and denoising objectives (as in T5) are generative pretext tasks. In vision, masked image modeling (as in MAE and BEiT) and image inpainting serve the same purpose. Generative methods are especially effective when the reconstruction target is meaningful and non-trivial, requiring the model to understand high-level semantics rather than merely copying local patterns.
Predictive methods train models to predict properties or transformations of the input data. For example, a model might predict the rotation angle applied to an image, the relative position of image patches, or the next frame in a video. These tasks do not require the model to reconstruct the input pixel-by-pixel; instead, they test whether the model understands structural properties. Predictive methods were prominent in early self-supervised vision research and have since been incorporated into more modern frameworks.
Vision-based self-supervised learning has produced a diverse set of pretext tasks, evolving from hand-crafted puzzles to the masked modeling approaches that dominate current research.
Gidaris et al. (2018) proposed training a model to predict which of four rotation angles (0, 90, 180, or 270 degrees) was applied to an input image [7]. Recognizing the rotation requires understanding the image's content and spatial structure. Despite its simplicity, rotation prediction produced competitive representations for its time and demonstrated that geometric reasoning could drive feature learning.
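The rotation pretext task is simple enough to sketch in a few lines. This is an illustrative NumPy version (the function name is made up, not from the RotNet codebase): each image is rotated by a random multiple of 90 degrees, and the rotation index becomes the pseudo-label a classifier must predict.

```python
import numpy as np

def make_rotation_batch(images, rng=None):
    """RotNet-style pretext task: rotate each image by a random
    multiple of 90 degrees; the rotation index (0-3) is the
    pseudo-label, requiring no human annotation."""
    rng = rng or np.random.default_rng()
    labels = rng.integers(0, 4, size=len(images))          # 0, 90, 180, 270
    rotated = [np.rot90(img, k) for img, k in zip(images, labels)]
    return rotated, labels

img = np.arange(4).reshape(2, 2)
rotated, labels = make_rotation_batch([img] * 3, rng=np.random.default_rng(0))
```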
Noroozi and Favaro (2016) split images into a grid of patches, shuffled them, and trained a network to predict the correct spatial arrangement [8]. Solving jigsaw puzzles forces the model to learn about object parts, spatial relationships, and scene layout. The approach required careful design of permutation sets to balance difficulty.
Pathak et al. (2016) removed a region from an image and trained a model to fill in the missing content [9]. This context-based prediction task encouraged models to learn about object shapes, textures, and scene semantics. Inpainting served as an early precursor to the masked modeling techniques that would later achieve state-of-the-art results.
Masked Autoencoders (MAE), introduced by Kaiming He and colleagues at Meta AI in 2021, brought the masked prediction paradigm from NLP to vision with striking effectiveness [10]. MAE works by masking a large proportion (typically 75%) of random patches in an image and training a Vision Transformer (ViT) to reconstruct the missing pixels. The architecture uses an asymmetric encoder-decoder design: the encoder processes only the visible patches (making training efficient), while a lightweight decoder reconstructs the full image from the latent representation and mask tokens. A vanilla ViT-Huge model trained with MAE achieved 87.8% top-1 accuracy on ImageNet-1K, setting a new benchmark for self-supervised methods. The high masking ratio is critical: it creates a genuinely challenging task that prevents the model from simply interpolating from nearby visible patches.
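The key mechanics of MAE's masking step can be sketched as follows. This is a simplified NumPy illustration of the idea (the function name is ours, not from the MAE codebase): a random 75% of patches are dropped, and only the remaining 25% would be fed to the encoder, which is what makes pre-training efficient.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random subset of patches.

    patches: (num_patches, dim) array of flattened image patches.
    Returns the visible patches (the encoder's input), the kept
    indices, and a boolean mask marking patches to reconstruct.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = rng.permutation(n)[:n_keep]   # random visible subset
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False                   # True = masked, to be predicted
    return patches[keep_idx], keep_idx, mask

patches = np.arange(16 * 4).reshape(16, 4).astype(float)  # 16 patches, dim 4
visible, keep_idx, mask = random_masking(patches, rng=np.random.default_rng(0))
# The encoder sees only `visible` (4 of 16 patches at a 75% mask ratio);
# the decoder later reconstructs pixels at the 12 masked positions.
```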
BEiT (Bidirectional Encoder representation from Image Transformers), proposed by Bao et al. at Microsoft in 2021, took a different approach to masked image modeling [11]. Rather than reconstructing raw pixels, BEiT first tokenizes images into discrete visual tokens using the discrete variational autoencoder (dVAE) tokenizer from DALL-E. The model then masks random image patches and predicts the corresponding visual tokens. This discrete prediction target proved highly effective: BEiT-Base achieved 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch training of DeiT (81.8%). BEiT-Large reached 86.3% using only ImageNet-1K data, surpassing ViT-L pre-trained with full supervision on the larger ImageNet-22K dataset (85.2%).
Self-supervised pre-training transformed NLP even before it became standard in vision. The three foundational approaches correspond to the three most influential model families.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google in 2018, popularized masked language modeling as a pre-training objective [12]. During training, approximately 15% of tokens in the input sequence are randomly selected. Of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model must predict the original tokens based on bidirectional context. For example, given "The cat sat on the [MASK] by the fire," BERT learns to predict "mat" by attending to both the left context ("The cat sat on the") and the right context ("by the fire"). This bidirectional understanding was a major advance over prior unidirectional models and established BERT as the foundation for a wide range of NLP tasks including question answering, sentiment analysis, and named entity recognition.
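The 15% / 80-10-10 masking scheme can be sketched directly. This is an illustrative implementation, not BERT's actual tokenizer-level code; the function name and the toy vocabulary are ours.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def bert_mask(tokens, mask_prob=0.15, rng=None):
    """BERT-style corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted, labels): labels hold the original token at
    selected positions and None elsewhere (loss is computed only
    at selected positions)."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% keep the original token
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = bert_mask(tokens, mask_prob=1.0, rng=random.Random(0))
```

Leaving 10% of selected tokens unchanged matters: it prevents the model from assuming that every non-[MASK] token is guaranteed correct at fine-tuning time, when [MASK] never appears.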
The GPT (Generative Pre-trained Transformer) family, developed by OpenAI starting in 2018, uses autoregressive next token prediction as its self-supervised objective [13]. Given a sequence of tokens, the model predicts the next token at each position, using only the left context (causal masking). This deceptively simple objective, when applied at massive scale, produced the large language models that power modern conversational AI, code generation, and reasoning systems. GPT-2 (2019), GPT-3 (2020), and GPT-4 (2023) each demonstrated that scaling next token prediction to larger models and datasets yields emergent capabilities.
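Causal masking and the next-token objective can be illustrated in a few lines. This is a schematic NumPy sketch (the helper name is ours): a lower-triangular mask restricts each position to left context, and the training targets are simply the input shifted by one position.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular attention mask: position i may attend only
    to positions <= i, so every token is predicted from left context."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Next-token training pairs: the input sequence predicts itself
# shifted one step to the left.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
inputs, targets = tokens[:-1], tokens[1:]

mask = causal_mask(len(inputs))
```

Every position in the sequence yields a training signal simultaneously, which is part of why next-token prediction scales so efficiently.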
T5 (Text-to-Text Transfer Transformer), introduced by Raffel et al. at Google in 2019, frames every NLP task as a text-to-text problem and uses a denoising objective for pre-training [14]. During pre-training, spans of tokens in the input are replaced with sentinel tokens, and the model is trained to generate the missing spans. Unlike BERT's in-place prediction, T5's denoising approach uses an encoder-decoder architecture where the corrupted input goes through the encoder and the decoder generates the missing content. This formulation proved remarkably flexible: classification, translation, summarization, and question answering all become sequence-to-sequence tasks with the same architecture and pre-training strategy.
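Span corruption with sentinel tokens can be sketched as follows. This is an illustrative version with hand-picked spans (T5 samples spans randomly, with an average length of 3 and about 15% of tokens corrupted); the function name is ours, and the example sentence follows the style of the T5 paper's illustration.

```python
def t5_corrupt(tokens, spans):
    """T5-style span corruption. Each (start, end) span is replaced
    in the encoder input by a sentinel <extra_id_N>; the decoder
    target interleaves the sentinels with the removed spans."""
    src, tgt, prev = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        src += tokens[prev:start] + [sentinel]   # drop the span, keep a marker
        tgt += [sentinel] + tokens[start:end]    # decoder must emit the span
        prev = end
    src += tokens[prev:]
    return src, tgt

tokens = "Thank you for inviting me to your party last week".split()
src, tgt = t5_corrupt(tokens, [(1, 3), (6, 7)])
```

Because the decoder generates only the removed spans rather than the full sequence, targets stay short, which keeps pre-training inexpensive relative to full reconstruction.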
| Pretext task | Domain | Example method | Year | What the model predicts |
|---|---|---|---|---|
| Rotation prediction | Vision | RotNet (Gidaris et al.) | 2018 | Rotation angle (0, 90, 180, 270 degrees) |
| Jigsaw puzzle | Vision | Noroozi & Favaro | 2016 | Correct spatial arrangement of shuffled patches |
| Inpainting | Vision | Context Encoders (Pathak et al.) | 2016 | Missing image region |
| Masked image modeling | Vision | MAE (He et al.) | 2021 | Missing pixel patches |
| Masked image modeling | Vision | BEiT (Bao et al.) | 2021 | Discrete visual tokens for masked patches |
| Masked language modeling | NLP | BERT (Devlin et al.) | 2018 | Original tokens at masked positions |
| Next token prediction | NLP | GPT (Radford et al.) | 2018 | Next token given left context |
| Denoising | NLP | T5 (Raffel et al.) | 2019 | Missing text spans |
| Contrastive (image-text) | Multimodal | CLIP (Radford et al.) | 2021 | Correct image-text pairings |
The following table summarizes major self-supervised learning methods across domains.
| Method | Organization | Year | Domain | Category | Key idea |
|---|---|---|---|---|---|
| BERT [12] | Google | 2018 | NLP | Generative (MLM) | Bidirectional masked language modeling |
| GPT [13] | OpenAI | 2018 | NLP | Generative (autoregressive) | Next token prediction at scale |
| MoCo [2] | Facebook AI | 2019 | Vision | Contrastive | Momentum encoder with queue of negatives |
| T5 [14] | Google | 2019 | NLP | Generative (denoising) | Text-to-text denoising with encoder-decoder |
| SimCLR [1] | Google Brain | 2020 | Vision | Contrastive | Large-batch contrastive learning with augmentations |
| BYOL [4] | DeepMind | 2020 | Vision | Non-contrastive | Two networks with momentum; no negatives |
| SwAV | Facebook AI | 2020 | Vision | Non-contrastive | Online clustering with swapped prediction |
| DINO [15] | Facebook AI | 2021 | Vision | Self-distillation | Self-distillation with no labels; emergent segmentation |
| Barlow Twins [5] | Facebook AI | 2021 | Vision | Non-contrastive | Cross-correlation matrix pushed toward identity |
| MAE [10] | Meta AI | 2021 | Vision | Generative (MIM) | Asymmetric encoder-decoder; 75% masking ratio |
| BEiT [11] | Microsoft | 2021 | Vision | Generative (MIM) | Predicts discrete visual tokens for masked patches |
| CLIP [3] | OpenAI | 2021 | Multimodal | Contrastive | Image-text contrastive learning on 400M pairs |
| VICReg [6] | Meta AI, Inria | 2022 | Vision | Non-contrastive | Variance-invariance-covariance regularization |
| DINOv2 [16] | Meta AI | 2023 | Vision | Self-distillation + MIM | Universal visual features; 142M images; no labels |
| DINOv3 [17] | Meta AI | 2025 | Vision | Self-distillation + MIM | 7B parameter ViT; 1.7B images; Gram anchoring |
DINO (Self-Distillation with No Labels), introduced by Mathilde Caron and colleagues at Facebook AI in 2021, demonstrated that self-supervised Vision Transformers learn features with remarkable properties [15]. DINO uses a teacher-student framework where both networks are Vision Transformers. The student is trained to match the teacher's output distribution across different augmented views of the same image, and the teacher's weights are updated as an exponential moving average of the student. A key finding was that self-supervised ViT features contain explicit information about scene layout and object boundaries. The attention maps of DINO-trained ViTs naturally segment objects without ever being trained on segmentation labels.
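The teacher update at the heart of DINO is a plain exponential moving average. The sketch below is illustrative (the function name and the momentum value of 0.996, a typical choice, are ours): only the student receives gradients, while the teacher's weights trail the student's smoothly.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """DINO-style teacher update: each teacher weight becomes an
    exponential moving average of the corresponding student weight.
    The teacher is never trained by backpropagation directly."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [np.zeros(3)]
student = [np.ones(3)]            # pretend one gradient step moved the student
teacher = ema_update(teacher, student, momentum=0.9)
```

The slowly moving teacher provides stable targets across augmented views, which, together with centering and sharpening of the teacher outputs, is what keeps the method from collapsing without negative pairs.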
DINOv2, released by Meta AI in 2023, scaled the DINO approach substantially [16]. It combined self-distillation with a masked image modeling objective, training on a curated dataset of 142 million images. DINOv2 produced universal visual features that performed well across classification, segmentation, depth estimation, and image retrieval without any fine-tuning. The model established a new paradigm: train a large self-supervised backbone once, then use its frozen features as input to lightweight task-specific heads.
DINOv3, released by Meta in 2025, pushed self-supervised vision to unprecedented scale [17]. With 7 billion parameters and training on 1.7 billion images (a 12x increase over DINOv2's dataset), DINOv3 introduced Gram anchoring, a technique that prevents dense feature maps from degrading during long training schedules. For the first time, a purely self-supervised model outperformed weakly supervised counterparts across a wide range of vision tasks. Meta also distilled the 7B ViT teacher into smaller variants (ViT-B, ViT-L) and ConvNeXt architectures (T, S, B, L), making high-quality self-supervised features accessible at various compute budgets.
Self-supervised learning is the engine behind the modern foundation model paradigm. The workflow is now well established: pre-train a large model on massive unlabeled data with self-supervised objectives, then adapt it to specific tasks through fine-tuning, prompting, or feature extraction.
In NLP, this pattern emerged with BERT and GPT and has only intensified. GPT-3, with 175 billion parameters, was trained entirely with next token prediction on internet text. LLaMA (Meta, 2023) and its successors used the same autoregressive self-supervised objective. Every major large language model follows this template.
In vision, the pattern took longer to establish but is now equally standard. DINOv2 and DINOv3 serve as general-purpose vision backbones, analogous to how BERT served NLP. MAE pre-training is widely used for Vision Transformers. Specialized foundation models for medical imaging (Virchow), remote sensing, and video understanding all begin with self-supervised pre-training.
In multimodal settings, self-supervised contrastive learning (CLIP, SigLIP) provides the vision-language alignment layer that connects image understanding to language understanding. This alignment is essential for text-to-image generation, visual question answering, and multimodal assistants.
Self-supervised learning occupies a distinct position relative to supervised and unsupervised learning.
| Aspect | Supervised learning | Unsupervised learning | Self-supervised learning |
|---|---|---|---|
| Labels | Requires human-annotated labels | No labels | No labels (pseudo-labels derived from data) |
| Training signal | Explicit targets (classes, bounding boxes, etc.) | Data distribution (clustering, density) | Pretext tasks (masking, prediction, contrastive) |
| Typical goal | Predict specific outputs | Discover structure, compress data | Learn general-purpose representations |
| Data efficiency | Limited by label availability | Works with any data | Works with any data; transfers to labeled tasks |
| Examples | Classification, detection, segmentation | K-means, PCA, autoencoders | BERT, MAE, SimCLR, DINO |
| Scale | Bounded by labeling budget | Unbounded | Unbounded |
| Representation quality | Task-specific; may not generalize | May miss task-relevant features | General-purpose; strong transfer |
Supervised learning excels when abundant labeled data is available for a specific task. Unsupervised learning finds structure without any notion of tasks. Self-supervised learning bridges the gap: it uses the structure of unlabeled data to learn representations that transfer well to supervised tasks. In practice, the boundaries are fluid. Many modern training pipelines combine self-supervised pre-training with supervised fine-tuning, and the term "unsupervised" is sometimes applied loosely to self-supervised methods.
A critical distinction is that self-supervised learning defines explicit prediction tasks (fill in the mask, predict the next token, match augmented views), whereas traditional unsupervised methods like clustering or dimensionality reduction do not. This structured learning signal is what makes self-supervised representations so effective for downstream tasks.
Self-supervised learning is no longer an emerging technique; it is the default pre-training strategy for nearly every major AI system. Several developments characterize the current landscape.
Scale continues to increase. DINOv3's 7 billion parameter model trained on 1.7 billion images represents the current frontier in self-supervised vision. In language, models with hundreds of billions of parameters continue to be trained with next token prediction on trillion-token corpora.
Hybrid objectives are standard. Pure contrastive, pure generative, and pure predictive methods have converged. DINOv2 and DINOv3 combine self-distillation with masked image modeling. Many language models combine autoregressive training with instruction tuning and reinforcement learning from human feedback. The boundaries between SSL categories are increasingly blurred in practice.
Domain-specific foundation models proliferate. Self-supervised pre-training has enabled foundation models in medicine (Virchow for pathology, BiomedCLIP for biomedical vision-language), earth observation (satellite imagery models), robotics, and scientific discovery. These models train on domain-specific unlabeled data before being fine-tuned for specialized tasks.
Efficiency and distillation matter. Not every deployment can afford a 7B parameter model. Knowledge distillation from large self-supervised teachers to smaller students (as in DINOv3's ViT-B and ConvNeXt distillations) has become a standard practice. This makes the benefits of large-scale SSL accessible on edge devices and in latency-sensitive applications.
Evaluation benchmarks are evolving. As self-supervised methods saturate existing benchmarks like ImageNet, the community has shifted toward more demanding evaluations: few-shot learning, out-of-distribution generalization, dense prediction tasks (segmentation, depth estimation), and real-world robustness. These evaluations better reflect the practical value of learned representations.
Video and temporal understanding are growing frontiers. VideoMAE and related methods extend masked modeling to video, learning spatiotemporal representations from unlabeled footage. Self-supervised methods for video understanding are increasingly important for autonomous driving, surveillance, sports analytics, and content recommendation.
The trajectory is unmistakable. Self-supervised learning has moved from a niche research topic to the foundational layer upon which modern AI systems are built. The question is no longer whether to use self-supervised pre-training, but how to best design pretext tasks, curate training data, and scale models for specific applications.