See also: self-supervised learning, representation learning, metric learning, transfer learning, deep learning
Contrastive learning is a family of machine learning methods that learn representations by comparing data points against each other. The central idea is to pull representations of similar (positive) pairs closer together in an embedding space while pushing representations of dissimilar (negative) pairs apart. By structuring the learning objective around pairwise or group-wise comparisons rather than explicit class labels, contrastive learning has become one of the most successful paradigms for self-supervised learning, enabling models to learn useful visual, textual, and multimodal features from unlabeled data.
The roots of contrastive learning trace back to earlier work in metric learning and Siamese networks. Bromley et al. (1993) introduced Siamese networks for signature verification, training two weight-sharing networks to compare input pairs. Hadsell, Chopra, and LeCun (2006) formalized the contrastive loss function, which minimizes distance between same-class pairs while enforcing a margin-based separation for different-class pairs. The triplet loss introduced in FaceNet (Schroff et al., 2015) extended this idea by simultaneously considering an anchor, a positive sample, and a negative sample.
The modern era of contrastive learning began in 2018 when van den Oord et al. proposed Contrastive Predictive Coding (CPC) and introduced the InfoNCE loss. This was followed by a rapid succession of methods between 2019 and 2021, including MoCo, SimCLR, BYOL, SwAV, Barlow Twins, and CLIP. These methods demonstrated that self-supervised contrastive pretraining could match or even surpass supervised pretraining on downstream tasks such as image classification, object detection, and semantic segmentation.
Imagine you have a big box of photos. You pick up one photo of a cat and then ask yourself: "Which other photos look like this one?" You put all the cat photos in one pile and all the non-cat photos in another pile. You do this without anyone telling you what a cat is. You just notice that some photos look alike and some do not.
Contrastive learning works the same way. A computer looks at pairs of things and learns to tell which ones are similar and which ones are different. Over time, it gets really good at noticing what makes things alike or unlike, and it can use that knowledge to do all sorts of tasks later, like sorting photos, finding matching items, or understanding what a sentence means.
At its heart, contrastive learning operates on pairs of data points. Given a data point (called an anchor), the method constructs one or more positive pairs (semantically similar examples) and one or more negative pairs (semantically dissimilar examples). A neural network encoder maps each data point into a vector in an embedding space, and a contrastive loss function optimizes the encoder so that positive pairs have high similarity (small distance) and negative pairs have low similarity (large distance).
The general pipeline consists of four stages:

1. View generation: each input is transformed into two or more views, typically through random data augmentation.
2. Encoding: a neural network encoder maps each view to a representation vector.
3. Projection: a small projection head (often an MLP) maps representations into the space where the loss is computed.
4. Loss computation: a contrastive loss pulls positive pairs together and pushes negative pairs apart.
After pretraining, the projection head is discarded, and the encoder representations are used for downstream tasks, either through linear probing or fine-tuning.
Several loss functions have been developed for contrastive learning, each with different properties and trade-offs.
The original contrastive loss, introduced by Hadsell et al. (2006), operates on pairs of examples. Given two inputs x_i and x_j with representations z_i and z_j, and a binary label y indicating whether the pair is similar (y=1) or dissimilar (y=0):
L = y * d(z_i, z_j)^2 + (1 - y) * max(0, m - d(z_i, z_j))^2
where d is a distance function (typically Euclidean) and m is a margin hyperparameter. This loss pulls similar pairs together and pushes dissimilar pairs apart up to a margin m.
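The loss translates directly into a few lines of code. The following is a minimal PyTorch sketch; the function name, batched shapes, and default margin are illustrative choices, not taken from the original paper:

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z_i, z_j, y, margin=1.0):
    """Margin-based contrastive loss in the style of Hadsell et al. (2006).
    z_i, z_j: (N, D) embedding batches; y: (N,) with 1 = similar, 0 = dissimilar."""
    d = F.pairwise_distance(z_i, z_j)              # Euclidean distance per pair
    pos = y * d.pow(2)                             # pull similar pairs together
    neg = (1 - y) * F.relu(margin - d).pow(2)      # push dissimilar pairs beyond the margin
    return (pos + neg).mean()
```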
The triplet loss, introduced in FaceNet (Schroff et al., 2015), operates on triplets of (anchor, positive, negative) samples:
L = max(0, d(z_a, z_p) - d(z_a, z_n) + m)
where z_a, z_p, z_n are the embeddings of the anchor, positive, and negative samples, d is a distance function, and m is a margin. The loss pushes the anchor closer to the positive than to the negative by at least margin m. Hard negative mining, the process of selecting negatives that are close to the anchor, is important for effective training with triplet loss.
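A corresponding PyTorch sketch of the triplet loss (names and the margin value are illustrative; PyTorch also ships a built-in torch.nn.TripletMarginLoss):

```python
import torch.nn.functional as F

def triplet_loss(z_a, z_p, z_n, margin=0.2):
    """z_a, z_p, z_n: (N, D) batches of anchor, positive, and negative embeddings."""
    d_ap = F.pairwise_distance(z_a, z_p)   # anchor-positive distance
    d_an = F.pairwise_distance(z_a, z_n)   # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()
```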
The InfoNCE loss, which builds on Noise-Contrastive Estimation (NCE), was introduced by van den Oord et al. (2018) in the Contrastive Predictive Coding paper. It frames the contrastive objective as a classification problem: given an anchor, one positive sample, and K negative samples, identify the positive. The loss is:
L_InfoNCE = -log( exp(sim(z_i, z_j) / tau) / sum_k exp(sim(z_i, z_k) / tau) )
where sim is a similarity function (typically cosine similarity), tau is a temperature parameter, z_j is the positive sample, and the sum runs over the positive and all K negative samples. The InfoNCE loss has a direct connection to mutual information estimation: minimizing InfoNCE maximizes a lower bound on the mutual information between the two views.
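Because the positive is identified among the candidates by a softmax, InfoNCE reduces to cross-entropy over similarity logits. A minimal sketch under assumed shapes (one anchor, one positive, K negatives):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor: (D,), positive: (D,), negatives: (K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive.unsqueeze(0), negatives]), dim=-1)
    logits = candidates @ anchor / tau        # cosine similarity, temperature-scaled
    target = torch.zeros(1, dtype=torch.long, device=logits.device)  # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```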
The Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss, introduced in SimCLR (Chen et al., 2020), is a variant of InfoNCE. For a minibatch of N samples, each sample generates two augmented views, yielding 2N total views. For a positive pair (i, j), the loss is:
L_NT-Xent(i,j) = -log( exp(sim(z_i, z_j) / tau) / sum_{k != i} exp(sim(z_i, z_k) / tau) )
The key differences from the original InfoNCE are that NT-Xent uses cosine similarity (with L2 normalization) and treats all other 2(N-1) augmented views in the minibatch as negatives. The temperature parameter tau controls the sharpness of the distribution: lower values make the model more sensitive to differences in similarity, while higher values produce a softer distribution.
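In practice, NT-Xent is computed over the full (2N, 2N) similarity matrix at once. A compact sketch, where the self-masking trick and tensor layout are implementation choices:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """z1, z2: (N, D) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2N, D), L2-normalized
    sim = z @ z.T / tau                                  # (2N, 2N) cosine similarity logits
    n2 = sim.size(0)
    mask = torch.eye(n2, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))           # exclude self-similarity (k != i)
    # the positive for row i is its counterpart from the other view, at (i + N) mod 2N
    targets = torch.arange(n2, device=sim.device).roll(n2 // 2)
    return F.cross_entropy(sim, targets)
```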
Khosla et al. (2020) extended the contrastive loss to the fully supervised setting. Rather than having a single positive per anchor, the supervised contrastive loss treats all samples from the same class as positives:
L_SupCon = sum_{i} (-1/|P(i)|) * sum_{p in P(i)} log( exp(sim(z_i, z_p) / tau) / sum_{k != i} exp(sim(z_i, z_k) / tau) )
where P(i) is the set of all positives for anchor i (same-class samples). This formulation pulls together all same-class representations while pushing apart different-class ones. On ResNet-200, SupCon achieved 81.4% top-1 accuracy on ImageNet, surpassing the cross-entropy baseline.
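A sketch of the L_SupCon formula above (this implements the "summation outside the log" variant shown; masking details and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """z: (N, D) embeddings, labels: (N,) integer class labels."""
    z = F.normalize(z, dim=1)
    logits = z @ z.T / tau                                   # (N, N) similarity logits
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = logits.masked_fill(self_mask, float('-inf'))    # exclude k = i from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-probability over each anchor's positive set P(i)
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```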
| Loss function | Year | Inputs per step | Negatives required | Key property |
|---|---|---|---|---|
| Contrastive loss | 2006 | Pair | Yes | Margin-based, pairwise |
| Triplet loss | 2015 | Triplet (anchor, pos, neg) | Yes | Requires hard negative mining |
| InfoNCE | 2018 | 1 positive + K negatives | Yes | Bounds mutual information |
| NT-Xent | 2020 | Minibatch (2N views) | Yes (in-batch) | Cosine similarity, temperature scaling |
| SupCon | 2020 | Minibatch with labels | Yes (different classes) | Multiple positives per anchor |
How positive and negative pairs are formed is one of the most consequential design decisions in contrastive learning.
In self-supervised contrastive learning, positive pairs are typically constructed through data augmentation. Two randomly augmented versions of the same input are treated as a positive pair. The augmentations should change low-level appearance while preserving semantic content. For images, this commonly involves random cropping with resizing, color jittering, Gaussian blur, grayscale conversion, and horizontal flipping.
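As an illustration, a SimCLR-style augmentation pipeline can be written with torchvision; the exact strengths and probabilities below approximate the paper's settings and may need tuning per dataset:

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])

# Two independent draws from the pipeline yield one positive pair:
# view1, view2 = simclr_augment(img), simclr_augment(img)
```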
In supervised settings, any two samples from the same class can serve as positives. Some methods also use nearest neighbors in the embedding space as positives (e.g., NNCLR by Dwibedi et al., 2021), which provides greater diversity than augmentation-based positives alone.
Negative pairs consist of semantically different inputs. In self-supervised settings, negatives are typically all other samples in the minibatch (in-batch negatives). The effectiveness of the contrastive objective depends heavily on having a sufficient number and diversity of negatives.
There are several strategies for negative sampling:
| Strategy | Description | Used by |
|---|---|---|
| In-batch negatives | Other samples in the same minibatch serve as negatives | SimCLR |
| Memory bank / queue | A queue stores encoded representations from previous batches | MoCo |
| Hard negative mining | Negatives are selected to be close to the anchor in embedding space | FaceNet, various |
| Debiased sampling | Corrects for false negatives (samples that are actually similar to the anchor) | Chuang et al. (2020) |
A persistent challenge is false negatives: in unlabeled data, a randomly sampled "negative" might actually be semantically similar to the anchor (e.g., two different photos of dogs). False negatives introduce noise into the training signal and can degrade learned representations.
Data augmentation is the primary mechanism for constructing positive pairs in self-supervised contrastive learning and has a significant effect on representation quality.
SimCLR (Chen et al., 2020) systematically studied which augmentation combinations matter most. Their findings revealed that random cropping combined with color distortion is the single most important augmentation composition. Without color distortion, different crops from the same image can be distinguished trivially by their color histograms, leading the model to learn a shortcut rather than semantic features.
Commonly used image augmentations include:
| Augmentation | Effect | Importance |
|---|---|---|
| Random resized crop | Forces the model to recognize objects at different scales and positions | Very high |
| Color jittering | Changes brightness, contrast, saturation, and hue | Very high (prevents color histogram shortcuts) |
| Gaussian blur | Smooths the image, removing fine texture details | Moderate |
| Horizontal flip | Mirrors the image horizontally | Moderate |
| Grayscale conversion | Removes all color information | Moderate |
| Solarization | Inverts pixels above a threshold | Used in BYOL, Barlow Twins |
SwAV introduced multi-crop augmentation, which generates two global views (224x224 resolution) and several smaller local views (96x96 resolution) from each image. This approach increases the number of positive pairs without proportionally increasing compute, because the smaller crops are cheaper to process.
Contrastive learning in natural language processing requires different augmentation strategies because text is discrete and symbolic. Common approaches include:

- Dropout as minimal augmentation: the same sentence is encoded twice with different dropout masks to form a positive pair (SimCSE)
- Back-translation: translating a sentence into another language and back to produce a paraphrase
- Token-level perturbations such as word deletion, reordering, or synonym substitution
- Treating adjacent sentences or spans from the same document as positives
A large number of contrastive and contrastive-adjacent methods have been proposed since 2018. The following sections describe the most influential ones.
Contrastive Predictive Coding (van den Oord et al., 2018) was one of the first methods to formalize the modern contrastive learning framework. CPC learns representations by predicting future observations in latent space using an autoregressive model. The method introduced the InfoNCE loss and demonstrated strong results across four domains: speech, images, text, and reinforcement learning. CPC established the idea that contrastive objectives can serve as a proxy for mutual information maximization between different parts of the input signal.
MoCo (He et al., 2020) frames contrastive learning as a dictionary look-up problem. The method maintains a dynamic dictionary of encoded keys using a queue and a momentum-updated encoder:

- A queue stores key representations from preceding minibatches and serves as the pool of negatives, decoupling the number of negatives from the batch size.
- The key encoder is not updated by backpropagation; its weights are instead a momentum-based (exponential moving average) copy of the query encoder's weights, which keeps the queued keys consistent over time. Both mechanisms are sketched below.
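A minimal sketch of the two mechanisms; the tensor layout, momentum value, and the divisibility assumption are illustrative:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder weights trail the query encoder as an exponential moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue_dequeue(queue, keys, ptr):
    """queue: (K, D) tensor of stored keys; keys: (B, D) newest batch.
    Overwrites the oldest slot (assumes K is divisible by B)."""
    bsz = keys.size(0)
    queue[ptr:ptr + bsz] = keys
    return (ptr + bsz) % queue.size(0)
```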
MoCo v1 was published at CVPR 2020. MoCo v2 (Chen et al., 2020) incorporated design improvements from SimCLR, specifically an MLP projection head and stronger data augmentations, achieving better results without needing large batch sizes. MoCo v3 (Chen et al., 2021) extended the framework to Vision Transformers (ViT) with a symmetric loss and was published at ICCV 2021.
On several downstream tasks including object detection and semantic segmentation on PASCAL VOC and COCO, MoCo representations outperformed their supervised pretraining counterparts.
SimCLR (Chen et al., 2020) is a simple framework for contrastive learning of visual representations, developed at Google Research. SimCLR simplifies the contrastive learning pipeline by eliminating the need for specialized architectures, memory banks, or momentum encoders.
The SimCLR framework consists of four components:

1. A stochastic data augmentation module that produces two correlated views of each example.
2. A base encoder f(·) (a ResNet-50 in the paper) that extracts representation vectors.
3. A small MLP projection head g(·) that maps representations to the space where the contrastive loss is applied.
4. The NT-Xent contrastive loss, computed across all augmented views in the batch.
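Putting these components together, one pretraining step might look as follows. This is a sketch assuming f is a backbone encoder, g an MLP projection head, and simclr_augment and nt_xent are the helpers sketched in earlier sections:

```python
import torch

def simclr_step(f, g, images, optimizer, tau=0.1):
    """images: an iterable of PIL images; f, g: encoder and projection head."""
    x1 = torch.stack([simclr_augment(img) for img in images])   # first augmented view
    x2 = torch.stack([simclr_augment(img) for img in images])   # second augmented view
    loss = nt_xent(g(f(x1)), g(f(x2)), tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```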
Key findings from the SimCLR paper include:

- The composition of data augmentations is critical, with random cropping plus color distortion being the most effective combination.
- A learnable nonlinear projection head substantially improves the quality of the representations beneath it.
- Contrastive learning benefits from larger batch sizes and longer training than supervised learning.
- L2-normalized embeddings with an appropriately tuned temperature outperform unnormalized alternatives.
A linear classifier trained on SimCLR representations achieved 76.5% top-1 accuracy on ImageNet with a wide ResNet-50 (4x) backbone, matching the performance of a supervised ResNet-50; with a standard ResNet-50, linear evaluation reaches 69.3%.
BYOL (Grill et al., 2020) challenged the assumption that negative pairs are necessary for contrastive learning. BYOL uses two networks, an online network and a target network, that interact and learn from each other:

- The online network (encoder, projector, and predictor) is trained by gradient descent to predict the target network's projection of a different augmented view of the same image.
- The target network (encoder and projector) is not trained directly; its weights are an exponential moving average of the online network's weights.
BYOL does not use negative pairs at all. Instead, the asymmetric architecture (predictor + stop gradient on the target) prevents the trivial solution of mapping all inputs to a constant vector. An early analysis suggested that batch normalization in BYOL implicitly introduces a form of contrastive signal, but Richemond et al. (2020) subsequently showed that BYOL can work without batch normalization given appropriate architectural choices (group normalization with weight standardization).
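The asymmetry is visible directly in code. A sketch of the per-view BYOL objective and the target update, where the names and momentum value are illustrative (BYOL symmetrizes the loss over both view orderings):

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    """p_online: predictor output for one view; z_target: target projection of the other.
    Equals the MSE between L2-normalized vectors, i.e. 2 - 2 * cosine similarity."""
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)   # stop gradient on the target branch
    return 2 - 2 * (p * z).sum(dim=1).mean()

@torch.no_grad()
def ema_update(online, target, m=0.996):
    """Target weights are an exponential moving average of the online weights."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1 - m)
```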
BYOL achieved 74.3% top-1 accuracy on ImageNet with a ResNet-50 and 79.6% with a larger ResNet, while being more robust to changes in batch size and augmentation choices than SimCLR.
SwAV (Caron et al., 2020) combines contrastive learning with online clustering. Instead of comparing features directly, SwAV assigns features to a set of learnable prototype vectors and enforces consistency between the cluster assignments of different views of the same image.
The method works by:

1. Computing, for each view, a code (a soft assignment over the prototypes) using the Sinkhorn-Knopp algorithm, which enforces an equipartition constraint so that assignments are spread evenly across prototypes.
2. Making a "swapped" prediction: the features of one view must predict the code computed from the other view, and vice versa, via a cross-entropy loss.
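The equipartition step can be sketched as a few Sinkhorn-Knopp iterations, following the normalization scheme described in the SwAV paper (epsilon and the iteration count below are typical values, not prescriptions):

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: (B, K) similarities between B samples and K prototypes.
    Returns (B, K) soft assignments that use all prototypes roughly equally."""
    Q = torch.exp(scores / eps).T          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: equal total mass per prototype
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # columns: unit mass per sample
    return (Q * B).T                       # (B, K), each row sums to 1
```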
SwAV introduced multi-crop augmentation, using two standard resolution crops (224x224) and four to six lower resolution crops (96x96). This strategy increased the effective number of comparisons without significantly increasing memory usage.
SwAV achieved 75.3% top-1 accuracy on ImageNet with ResNet-50 and outperformed supervised pretraining on all considered transfer tasks.
Barlow Twins (Zbontar et al., 2021) takes a different approach inspired by neuroscientist Horace Barlow's redundancy-reduction principle. The method measures the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of the same sample and pushes this matrix toward the identity matrix.
The objective has two components:

- An invariance term, which pushes the diagonal elements of the cross-correlation matrix toward 1, making the embeddings invariant to the applied distortions.
- A redundancy-reduction term, which pushes the off-diagonal elements toward 0, decorrelating the components of the embedding vectors.
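Both terms act on a single (D, D) cross-correlation matrix, which makes the implementation compact. A sketch, where the weighting coefficient follows the paper's reported value and the standardization details are implementation choices:

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """z1, z2: (N, D) embeddings of two distorted views of the same batch."""
    N = z1.size(0)
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # standardize each dimension over the batch
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / N                           # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lam * off_diag
```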
Barlow Twins does not require large batch sizes, asymmetric architectures, stop gradients, or momentum updates. It benefits from high-dimensional output vectors (e.g., 8,192 dimensions), in contrast to most other methods that use lower-dimensional projections.
VICReg (Bardes, Ponce, and LeCun, 2022) decomposes the self-supervised learning objective into three explicit regularization terms:

- Variance: a hinge loss that keeps the standard deviation of each embedding dimension above a threshold, directly preventing collapse.
- Invariance: the mean squared distance between the embeddings of the two views.
- Covariance: a penalty on the off-diagonal entries of the embedding covariance matrix, decorrelating the dimensions.
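A sketch of the three terms; the coefficients follow the values reported in the paper, while everything else is an illustrative implementation:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """z1, z2: (N, D) embeddings of two views."""
    N, D = z1.shape
    inv = F.mse_loss(z1, z2)                           # invariance term

    def variance(z):                                   # hinge on per-dimension std
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(gamma - std).mean()

    def covariance(z):                                 # suppress off-diagonal covariance
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (N - 1)
        off = cov - torch.diag(torch.diagonal(cov))
        return off.pow(2).sum() / D

    return (sim_w * inv
            + var_w * (variance(z1) + variance(z2))
            + cov_w * (covariance(z1) + covariance(z2)))
```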
VICReg does not require weight sharing between branches, batch normalization, feature-wise normalization, output quantization, stop gradients, or memory banks, yet achieves results on par with state-of-the-art methods.
DINO (Caron et al., 2021) applies self-distillation to Vision Transformers in a contrastive-like framework. A student network is trained to match the output of a teacher network (updated via exponential moving average) across different augmented views. Unlike MoCo or SimCLR, DINO uses a cross-entropy loss between the softmax outputs of the student and teacher rather than a contrastive similarity-based loss.
DINO's self-supervised ViT features exhibit notable emerging properties: the attention maps contain explicit information about the semantic segmentation of the image, with different attention heads attending to different semantic parts. DINO achieved 80.1% top-1 accuracy on ImageNet in linear evaluation with ViT-Base.
| Method | Year | Venue | Negatives required | Key mechanism | ImageNet top-1 (linear, ResNet-50) |
|---|---|---|---|---|---|
| CPC | 2018 | arXiv | Yes | Autoregressive prediction + InfoNCE | N/A (evaluated on other tasks) |
| MoCo v1 | 2020 | CVPR | Yes | Momentum encoder + queue | 60.6% |
| SimCLR | 2020 | ICML | Yes (in-batch) | Large batch + projection head + NT-Xent | 69.3% (76.5% with ResNet-50 4x) |
| MoCo v2 | 2020 | arXiv | Yes | MoCo + SimCLR improvements | 71.1% |
| BYOL | 2020 | NeurIPS | No | Online/target networks + predictor | 74.3% |
| SwAV | 2020 | NeurIPS | No (uses prototypes) | Online clustering + multi-crop | 75.3% |
| Barlow Twins | 2021 | ICML | No | Cross-correlation + redundancy reduction | 73.2% |
| VICReg | 2022 | ICLR | No | Variance + invariance + covariance terms | 73.2% |
| DINO | 2021 | ICCV | No | Self-distillation + ViT | 75.3% (77.0% with ViT-S/16) |
A subset of self-supervised methods learn representations without explicit negative pairs. These methods, sometimes called non-contrastive, prevent collapse through architectural asymmetry, regularization, or clustering rather than through positive-negative comparisons.
The fundamental challenge for non-contrastive methods is preventing representational collapse, where the encoder maps all inputs to the same constant vector. Several strategies have been developed:
| Strategy | Methods using it | How it prevents collapse |
|---|---|---|
| Stop gradient + predictor | BYOL, SimSiam | Asymmetric gradient flow prevents trivial solution |
| Momentum encoder | BYOL, DINO | Slowly evolving target provides a stable learning signal |
| Redundancy reduction | Barlow Twins | Decorrelation of embedding dimensions ensures information is distributed |
| Variance regularization | VICReg | Explicit variance term prevents dimensional collapse |
| Online clustering | SwAV | Equipartition constraint via Sinkhorn-Knopp prevents degenerate assignments |
| Batch normalization | Implicit in several methods | Implicitly spreads representations across the batch |
SimSiam (Chen and He, 2021) demonstrated that a simple Siamese network with a stop gradient and a predictor head can learn useful representations without negative pairs, momentum encoders, or large batches. The authors showed that the stop gradient operation is essential: without it, the model collapses.
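The essential role of the stop gradient shows up as a single .detach() call. A sketch of one symmetric SimSiam step, where f denotes the encoder with its projection MLP and h the predictor (the naming is an assumption, not the paper's code):

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    z1, z2 = f(x1), f(x2)        # encoder outputs for the two views
    p1, p2 = h(z1), h(z2)        # predictor outputs

    def D(p, z):
        # negative cosine similarity; .detach() is the stop gradient
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()

    return D(p1, z2) / 2 + D(p2, z1) / 2   # symmetrized over both view orderings
```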
Contrastive learning has been applied across many domains beyond its origins in computer vision.
Contrastive pretraining has become a standard approach for learning visual representations. Self-supervised contrastive models pretrained on ImageNet or larger unlabeled datasets produce features that transfer well to downstream tasks:

- Image classification, via linear probing or full fine-tuning of the pretrained encoder
- Object detection and instance segmentation, e.g., on PASCAL VOC and COCO
- Semantic segmentation
- Semi-supervised and few-shot learning, where pretrained features reduce the number of labels required
Contrastive learning has been applied to learn sentence and document representations:

- SimCSE (Gao et al., 2021) produces strong sentence embeddings by using dropout noise as the augmentation: the same sentence is passed through the encoder twice with different dropout masks (unsupervised variant), or entailment pairs from natural language inference data serve as positives (supervised variant).
- Dense retrieval systems such as DPR (Karpukhin et al., 2020) train query and passage encoders with a contrastive objective using in-batch negatives.
CLIP (Radford et al., 2021) is the most prominent example of multimodal contrastive learning. CLIP trains separate image and text encoders to align their representations in a shared embedding space using a contrastive objective. Given a batch of N (image, text) pairs, CLIP maximizes the cosine similarity of the N correct pairs while minimizing the similarity of the N^2 - N incorrect pairs.
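The objective is a symmetric InfoNCE over the batch's N x N similarity matrix. A sketch (CLIP learns its temperature during training; a fixed tau is used here for simplicity):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """image_emb, text_emb: (N, D) outputs of the two encoders for N aligned pairs."""
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_emb, dim=1)
    logits = img @ txt.T / tau                      # (N, N); correct pairs on the diagonal
    targets = torch.arange(img.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t = F.cross_entropy(logits.T, targets)     # match each text to its image
    return (loss_i + loss_t) / 2
```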
CLIP was trained on 400 million image-text pairs scraped from the internet (the WebImageText dataset). After training, CLIP enables zero-shot image classification: the model classifies images by comparing their embeddings to text embeddings of class descriptions, without any task-specific training. CLIP matched the zero-shot accuracy of a supervised ResNet-50 on ImageNet and demonstrated strong transfer performance across dozens of datasets.
CLIP's influence extends well beyond classification. It serves as the text encoder in Stable Diffusion and other text-to-image generation systems, powers image search and retrieval systems, and has been extended to additional modalities including audio (CLAP) and video.
| Domain | Application | Example methods |
|---|---|---|
| Speech and audio | Speaker verification, speech representation learning | wav2vec 2.0, CPC |
| Graphs | Node and graph-level representation learning | GraphCL, GCC |
| Reinforcement learning | State representation learning from observations | CURL, CPC for RL |
| Time series | Temporal representation learning | TS2Vec, TNC |
| Recommendation systems | User and item representation learning | CLRec, SGL |
Contrastive learning and metric learning share the same core objective: learning an embedding space where similar items are close and dissimilar items are far apart. Both fields use loss functions based on distances or similarities between embeddings, and both rely on effective sampling of positive and negative pairs.
However, there are important differences:
| Aspect | Traditional metric learning | Modern contrastive learning |
|---|---|---|
| Supervision | Typically supervised (requires labels) | Often self-supervised (labels not required) |
| Positive pairs | Defined by class labels | Defined by data augmentation |
| Scale | Often trained on smaller, curated datasets | Designed for large-scale, unlabeled data |
| Primary goal | Learn a distance metric for retrieval or verification | Learn general-purpose representations |
| Loss functions | Contrastive loss, triplet loss, N-pair loss | InfoNCE, NT-Xent, and variants |
| Negatives per anchor | Typically 1 (triplet) or a few | Hundreds to thousands (in-batch or queued) |
Modern contrastive learning can be viewed as a scaled-up, self-supervised extension of metric learning. The InfoNCE loss generalizes the N-pair loss from metric learning, and many of the same principles (hard negative mining, embedding normalization, margin selection) apply in both settings.
The temperature parameter tau in the InfoNCE and NT-Xent losses controls how sharply the model distinguishes between positive and negative pairs. A lower temperature (e.g., tau = 0.05) makes the loss highly sensitive to small differences in similarity, producing stronger gradients for hard negatives but risking numerical instability. A higher temperature (e.g., tau = 0.5) produces a softer distribution that is more forgiving of similarity differences.
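A small numerical example makes the effect concrete (the similarity values are chosen purely for illustration):

```python
import torch

sims = torch.tensor([0.9, 0.8, 0.3])   # anchor's similarity to the positive and two negatives
for tau in (0.05, 0.5):
    print(tau, torch.softmax(sims / tau, dim=0))
# tau = 0.05 -> ~[0.88, 0.12, 0.00]: sharply peaked; hard negatives dominate the gradient
# tau = 0.5  -> ~[0.47, 0.39, 0.14]: soft distribution; differences are weighted more evenly
```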
SimCLR used tau = 0.1 for its main ImageNet experiments, MoCo used tau = 0.07, and the supervised contrastive loss found tau = 0.1 to work well. Temperature tuning requires careful experimentation, as performance can be highly sensitive to this value.
Larger batch sizes provide more in-batch negatives, which generally improves the quality of the learned representations in methods that use in-batch negatives (like SimCLR). SimCLR's performance improved substantially going from batch size 256 to 8,192. However, very large batch sizes require significant GPU memory and may necessitate distributed training.
MoCo addresses this constraint by decoupling the number of negatives from the batch size through its queue mechanism, allowing a large number of negatives (65,536) even with modest batch sizes (256).
SimCLR demonstrated that applying the contrastive loss on a projected representation (after an MLP projection head) rather than directly on the encoder output substantially improves downstream task performance. The intuition is that the projection head can discard information that is useful for the downstream task but not for the contrastive objective (e.g., color or orientation information that varies between augmented views). After pretraining, the projection head is discarded and only the encoder representations are used.
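For concreteness, a typical projection head is just a small MLP; the dimensions below follow common SimCLR-style choices (2048-d ResNet-50 features projected to 128-d) and are illustrative rather than prescriptive:

```python
import torch.nn as nn

# 2-layer MLP projection head: encoder features -> space where the loss is computed.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)
```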
Most early contrastive learning methods used ResNet architectures (especially ResNet-50) as the backbone encoder. More recent work has adopted Vision Transformers (ViT), which tend to benefit even more from self-supervised pretraining than CNNs. DINO, MoCo v3, and subsequent methods have shown that ViTs trained with contrastive or self-distillation objectives learn representations with distinctive properties, such as attention maps that capture semantic segmentation.
Despite its successes, contrastive learning has several notable limitations.
Many contrastive methods require large batch sizes (SimCLR) or large queues of negatives (MoCo) to achieve strong performance. Training SimCLR with a batch size of 8,192 requires substantial GPU memory and compute. Additionally, contrastive pretraining typically requires many more epochs than supervised training (e.g., 800-1000 epochs on ImageNet compared to 90 for supervised training).
The temperature hyperparameter strongly influences performance, and the optimal value varies across methods, datasets, and architectures. Finding a good temperature requires extensive experimentation. Some work (e.g., the temperature-free spectral contrastive loss of HaoChen et al., 2021) has proposed loss functions without a temperature parameter to address this limitation.
Even when contrastive methods avoid complete representational collapse (mapping all inputs to the same vector), they can still suffer from dimensional collapse, where the embedding vectors effectively reside in a lower-dimensional subspace of the full embedding space. This wastes representational capacity. Methods like Barlow Twins and VICReg address this directly through decorrelation or variance regularization.
In self-supervised settings, negatives are sampled randomly without knowledge of semantic similarity. Two images that happen to depict the same object or concept may be incorrectly treated as negatives, providing a misleading training signal. Debiased contrastive learning (Chuang et al., 2020) and other approaches attempt to mitigate this issue.
The quality of learned representations depends heavily on the choice of data augmentations. Augmentations that are too weak lead to trivial solutions, while augmentations that are too strong can destroy semantic information. Finding the right augmentation pipeline for a new domain or data type often requires domain-specific knowledge and experimentation.
While contrastive pretraining produces strong representations for many downstream tasks, there can be a gap between the pretraining objective (instance discrimination) and the downstream task (e.g., dense prediction or fine-grained classification). This gap can be larger than with supervised pretraining for some specific tasks, particularly those requiring fine-grained spatial information.
| Year | Development | Reference |
|---|---|---|
| 1993 | Siamese networks for signature verification | Bromley et al. |
| 2006 | Contrastive loss for dimensionality reduction | Hadsell, Chopra, LeCun |
| 2015 | Triplet loss (FaceNet) | Schroff et al. |
| 2018 | Contrastive Predictive Coding (CPC) and InfoNCE loss | van den Oord et al. |
| 2019 | MoCo v1 (Momentum Contrast) | He et al. |
| 2020 | SimCLR | Chen et al. (Google) |
| 2020 | MoCo v2 | Chen et al. (FAIR) |
| 2020 | BYOL (no negatives needed) | Grill et al. (DeepMind) |
| 2020 | SwAV (multi-crop + prototypes) | Caron et al. (FAIR) |
| 2020 | Supervised Contrastive Learning (SupCon) | Khosla et al. (Google) |
| 2021 | CLIP (contrastive language-image pretraining) | Radford et al. (OpenAI) |
| 2021 | Barlow Twins | Zbontar et al. (FAIR) |
| 2021 | DINO (self-distillation with ViT) | Caron et al. (FAIR) |
| 2021 | SimCSE (contrastive sentence embeddings) | Gao et al. (Princeton) |
| 2021 | MoCo v3 (ViT backbone) | Chen et al. (FAIR) |
| 2022 | VICReg | Bardes, Ponce, LeCun |