Contrastive learning is a machine learning technique in which a model learns meaningful data representations by comparing similar (positive) and dissimilar (negative) pairs of examples. Rather than relying on explicit labels, the model is trained to pull representations of similar examples closer together in an embedding space while pushing representations of dissimilar examples further apart. This approach has become one of the most influential paradigms in self-supervised learning, powering breakthroughs in computer vision, natural language processing, and multimodal AI.
The core mechanism of contrastive learning revolves around three elements: an anchor, a positive example, and one or more negative examples.
During training, the model encodes the anchor, the positive, and the negatives through a shared neural network (the encoder). The resulting embeddings are compared using a similarity metric such as cosine similarity. The objective function encourages the anchor and positive to have a high similarity score while the anchor and each negative receive low similarity scores.
A standard contrastive learning pipeline in computer vision works as follows:

1. Sample a mini-batch of images.
2. Apply two random augmentations (e.g., cropping, color distortion) to each image, producing two views that form a positive pair.
3. Encode every view with a shared encoder, typically followed by a projection head.
4. Compute a contrastive loss that pulls each positive pair together while pushing apart views of different images.
5. Backpropagate and update the encoder.
This pipeline, while described here for images, generalizes to text, audio, graphs, and multimodal inputs.
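As an illustration, the pipeline can be sketched end to end. The noise-based augmentation and single-layer encoder below are hypothetical stand-ins for real augmentations and deep networks, chosen only to make the data flow concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Toy stand-in for random cropping / color distortion: small additive noise.
    return x + rng.normal(scale=0.1, size=x.shape)

def encode(x, W):
    # Toy stand-in for a deep encoder: a single linear layer with ReLU.
    return np.maximum(x @ W, 0.0)

batch = rng.normal(size=(8, 16))           # 1. sample a mini-batch
v1, v2 = augment(batch), augment(batch)    # 2. two augmented views per input
W = rng.normal(size=(16, 4))               # shared encoder weights
z1, z2 = encode(v1, W), encode(v2, W)      # 3. encode both views
# 4. z1[i] and z2[i] form a positive pair; z1[i] with z2[j != i] act as
#    negatives. A contrastive loss over these pairs drives the weight update.
```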
The choice of loss function is central to contrastive learning. Three loss functions dominate the literature.
Information Noise-Contrastive Estimation (InfoNCE) was introduced by Oord et al. in 2018 as part of Contrastive Predictive Coding (CPC) [1]. It frames contrastive learning as a classification problem: given an anchor, identify the positive example from a set containing one positive and K negatives. The loss is defined as the negative log probability of selecting the correct positive. InfoNCE has a direct connection to mutual information estimation: minimizing the InfoNCE loss maximizes a lower bound on the mutual information between the anchor and the positive. It became the standard loss function used by MoCo, CLIP, and many other frameworks.
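A minimal sketch of InfoNCE for a single anchor, assuming cosine similarity; the function name and temperature value are illustrative, not from any particular implementation:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (sketch).

    anchor:    (d,) embedding
    positive:  (d,) embedding
    negatives: (K, d) embeddings
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity of the anchor to the positive (index 0) and each negative.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    # Negative log probability of picking the positive out of K+1 candidates.
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

The loss is small when the anchor is far more similar to the positive than to any negative, and grows as negatives become competitive.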
NT-Xent (Normalized Temperature-scaled Cross-Entropy) was introduced alongside SimCLR by Chen et al. in 2020 [2]. It is functionally very similar to InfoNCE but includes explicit L2 normalization of embeddings and a temperature hyperparameter that scales the similarity scores before the softmax operation. The temperature controls the sharpness of the distribution: a low temperature makes the model focus more on the hardest negatives, while a higher temperature produces a smoother distribution. NT-Xent treats all other samples in a mini-batch as negatives, making the effective number of negatives 2(N-1) for a batch of N images (since each image yields two augmented views).
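The batched computation can be sketched as follows; this is a simplified NumPy version for exposition, not SimCLR's reference implementation:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over a mini-batch (sketch). z1, z2: (N, d) arrays of the
    two augmented views; row i of z1 and row i of z2 are a positive pair."""
    n = len(z1)
    z = np.concatenate([z1, z2])                      # (2N, d) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # explicit L2 normalization
    sim = z @ z.T / temperature                       # scaled cosine logits
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # The positive for row i is its other augmented view, offset by N.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy with the other view as the correct "class".
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = np.log(np.exp(sim - row_max).sum(axis=1)) + row_max[:, 0]
    return np.mean(log_denom - sim[np.arange(2 * n), pos])
```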
Triplet loss predates the modern contrastive learning era, originating in metric learning for face recognition (Schroff et al., 2015) [3]. It directly optimizes a margin constraint: the distance between the anchor and the positive must be smaller than the distance between the anchor and the negative by at least a fixed margin. While conceptually simple, triplet loss requires careful hard negative mining to work well in practice. Random triplets often produce trivially satisfied constraints that contribute no gradient signal. This limitation led researchers to develop more scalable alternatives like InfoNCE.
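A sketch of the margin constraint on single embeddings; note how an easy negative yields exactly zero loss, and hence no gradient signal, which is what makes hard negative mining necessary:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (sketch) on 1-D numpy embeddings."""
    d_pos = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)     # hinge: zero once margin is met
```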
| Loss function | Introduced by | Year | Key characteristics | Common use cases |
|---|---|---|---|---|
| Triplet loss | Schroff et al. (Google) | 2015 | Margin-based; requires hard negative mining | Face recognition, metric learning |
| InfoNCE | Oord et al. (DeepMind) | 2018 | Softmax over similarities; lower bound on mutual information | MoCo, CPC, CLIP, general-purpose contrastive learning |
| NT-Xent | Chen et al. (Google) | 2020 | InfoNCE with L2 normalization and temperature scaling | SimCLR, large-batch contrastive learning |
Contrastive learning saw rapid methodological progress between 2019 and 2022. The table below summarizes the most influential methods.
| Method | Authors / Organization | Year | Key innovation | Negative pairs required? |
|---|---|---|---|---|
| MoCo | He et al. / Facebook AI | 2019 | Momentum-updated encoder with dynamic queue of negatives | Yes |
| SimCLR | Chen et al. / Google Brain | 2020 | Simple framework with large batches, projection head, strong augmentations | Yes |
| BYOL | Grill et al. / DeepMind | 2020 | Eliminates negatives entirely; uses two networks (online and target) with momentum updates | No |
| SwAV | Caron et al. / Facebook AI | 2020 | Online clustering with swapped prediction; multi-crop augmentation strategy | No (uses cluster assignments) |
| Barlow Twins | Zbontar et al. / Facebook AI | 2021 | Cross-correlation matrix between twin embeddings pushed toward identity matrix | No |
| VICReg | Bardes et al. / Meta AI, Inria | 2022 | Variance, invariance, and covariance regularization terms replace contrastive loss | No |
MoCo, introduced by Kaiming He and colleagues at Facebook AI Research in 2019, reframed contrastive learning as a dictionary look-up problem [4]. The central challenge it addressed was maintaining a large and consistent set of negative representations without requiring enormous batch sizes. MoCo solved this with two mechanisms: a queue that stores encoded representations from previous mini-batches, and a momentum encoder that updates the key encoder's weights as an exponential moving average (EMA) of the query encoder's weights. This decoupled the dictionary size from the batch size, allowing MoCo to work with hundreds of thousands of negatives on standard hardware. MoCo matched or exceeded supervised pre-training performance on seven downstream detection and segmentation tasks on PASCAL VOC and COCO. MoCo v2 (2020) and MoCo v3 (2021) further improved results by incorporating ideas from SimCLR and adapting the framework to Vision Transformers.
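The two mechanisms can be sketched as follows; the momentum coefficient and queue length here are toy values (the MoCo paper uses m = 0.999 and a queue of 65,536 keys), and the "keys" are placeholder arrays rather than real encoded features:

```python
import collections
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the key (momentum) encoder's parameters, MoCo-style."""
    return [m * k + (1 - m) * q for q, k in zip(query_params, key_params)]

# FIFO queue of encoded keys: the dictionary size is fixed at `maxlen`,
# independent of the mini-batch size.
queue = collections.deque(maxlen=4)
for step in range(3):
    batch_keys = [np.ones(2) * step, np.ones(2) * step]  # toy "keys"
    queue.extend(batch_keys)   # newest keys enter, oldest fall out automatically
```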
SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) was published by Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton at Google Brain in 2020 [2]. Its appeal lay in its simplicity: no memory bank, no momentum encoder, just a straightforward pipeline of augmentation, encoding, projection, and contrastive loss. SimCLR demonstrated three critical insights. First, the composition of data augmentations matters enormously; random cropping combined with color distortion proved far more effective than any single augmentation. Second, a nonlinear projection head between the encoder and the loss substantially improved representation quality. Third, contrastive learning benefits significantly from larger batch sizes. SimCLR used batches of up to 8,192 samples and required extensive compute (128 TPU v3 cores). SimCLR v2 extended the framework to semi-supervised settings, showing that large self-supervised models could be distilled into smaller models with minimal labeled data.
BYOL, developed by Jean-Baptiste Grill and colleagues at DeepMind in 2020, challenged the assumption that negative pairs were necessary for contrastive learning [5]. BYOL uses two neural networks: an online network and a target network. The online network is trained to predict the target network's representation of a different augmented view of the same image. The target network's weights are updated as an exponential moving average of the online network, similar to MoCo's momentum encoder. The lack of negative pairs raised an important theoretical question: why doesn't the model collapse to a trivial constant solution? Subsequent research attributed BYOL's stability to a combination of the predictor head in the online network, batch normalization, and the moving average update. BYOL achieved performance competitive with SimCLR on ImageNet while requiring neither large batch sizes nor a large pool of negatives.
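BYOL's regression objective can be sketched as a mean squared error between L2-normalized vectors, which is equivalent to maximizing cosine similarity between prediction and target; the networks and predictor head themselves are omitted in this simplification:

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """BYOL regression objective (sketch): MSE between L2-normalized online
    predictions and target projections. No negative pairs are involved."""
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    t = target_proj / np.linalg.norm(target_proj, axis=1, keepdims=True)
    return ((p - t) ** 2).sum(axis=1).mean()  # equals 2 - 2*cos(p, t) per pair
```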
SwAV, proposed by Mathilde Caron and colleagues at Facebook AI in 2020, replaced direct pairwise comparisons with an online clustering approach [6]. Instead of computing similarities between all pairs in a batch, SwAV assigns augmented views to a set of learnable prototype vectors (cluster centers) and then enforces consistency by predicting the cluster assignment of one view from the representation of another view. This "swapped prediction" mechanism avoids the need for explicit negative pairs and scales more gracefully than pairwise methods. SwAV also introduced multi-crop augmentation, where two standard-resolution crops and several additional low-resolution crops of each image are used during training, significantly boosting performance at minimal extra compute cost.
Barlow Twins, introduced by Jure Zbontar and colleagues at Facebook AI in 2021, took a fundamentally different approach to avoiding collapse [7]. Inspired by neuroscientist Horace Barlow's redundancy reduction principle, the method computes the cross-correlation matrix between the embeddings of two augmented views of a batch of images and pushes it as close to the identity matrix as possible. The diagonal terms are driven toward 1 (invariance), while the off-diagonal terms are driven toward 0 (redundancy reduction). This elegant objective requires no negative pairs, no asymmetric architectures, no momentum encoders, and no large batches.
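The objective can be sketched directly from the description above; the trade-off weight `lam` mirrors a commonly used default but should be treated as illustrative:

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective (sketch): push the cross-correlation matrix
    of the two views' standardized embeddings toward the identity."""
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    c = z1.T @ z2 / len(z1)                     # (d, d) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()   # invariance: diagonal -> 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy -> 0
    return on_diag + lam * off_diag
```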
VICReg, introduced by Adrien Bardes, Jean Ponce, and Yann LeCun at Meta AI and Inria in 2022, decomposed the self-supervised learning objective into three explicit, interpretable terms [8]. The variance term prevents collapse by ensuring each dimension of the embeddings maintains a minimum standard deviation. The invariance term minimizes the mean squared distance between embeddings of positive pairs. The covariance term decorrelates the embedding dimensions, preventing them from encoding redundant information. VICReg achieved performance comparable to Barlow Twins and BYOL while providing clearer insight into what each component of the loss contributes.
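The three terms can be sketched as follows; the weights follow the values commonly reported for VICReg, and the epsilon inside the variance term is illustrative:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """VICReg objective (sketch): weighted sum of invariance, variance,
    and covariance terms over two (N, d) batches of embeddings."""
    n, d = z1.shape
    invariance = ((z1 - z2) ** 2).mean()           # MSE between positive pairs

    def variance(z):                               # hinge keeps each dim's std >= 1
        std = np.sqrt(z.var(axis=0) + 1e-4)
        return np.maximum(0.0, 1.0 - std).mean()

    def covariance(z):                             # off-diagonal covariance -> 0
        zc = z - z.mean(axis=0)
        c = zc.T @ zc / (n - 1)
        return ((c ** 2).sum() - (np.diag(c) ** 2).sum()) / d

    return (sim_w * invariance
            + var_w * (variance(z1) + variance(z2))
            + cov_w * (covariance(z1) + covariance(z2)))
```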
CLIP (Contrastive Language-Image Pre-Training), introduced by Alec Radford and colleagues at OpenAI in January 2021, extended contrastive learning from a single modality to a multimodal setting connecting vision and language [9]. CLIP jointly trains two encoders: an image encoder (a Vision Transformer or ResNet variant) and a text encoder (a Transformer-based language model). Both encoders map their respective inputs into a shared embedding space. The training objective is a symmetric contrastive loss: for a batch of N (image, text) pairs, CLIP maximizes the cosine similarity of the N correct pairings while minimizing the similarity of the N^2 - N incorrect pairings.
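The symmetric loss can be sketched in NumPy; this is a simplified version (CLIP additionally learns the temperature as a trainable parameter, which is fixed here):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style sketch).
    Row i of img_emb and row i of txt_emb are a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); correct pairs on the diagonal
    n = len(logits)

    def xent(l):                         # softmax cross-entropy per row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2.0
```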
CLIP was trained on a dataset called WebImageText (WIT) containing approximately 400 million image-text pairs scraped from the internet. The scale of training data, combined with the contrastive objective, gave CLIP remarkable zero-shot transfer capabilities. Without any task-specific fine-tuning, CLIP matched the performance of a fully supervised ResNet-50 on ImageNet classification simply by comparing image embeddings against text embeddings of class descriptions (e.g., "a photo of a dog").
CLIP's influence extends far beyond classification. It serves as the backbone for text-to-image generation models such as DALL-E 2 and Stable Diffusion, which use CLIP embeddings to guide image generation from text prompts. CLIP's shared embedding space also enables image-text retrieval, visual question answering, and content moderation. OpenCLIP, an open-source reproduction, has been widely adopted by the research community. SigLIP (2023, Google) and MetaCLIP (2023, Meta) improved upon CLIP with modified loss functions and better data curation, respectively.
Contrastive learning has become a standard pre-training approach for vision models. Representations learned through contrastive objectives transfer effectively to image classification, object detection, semantic segmentation, and instance segmentation. On benchmarks like ImageNet, PASCAL VOC, and COCO, contrastive pre-training has matched or surpassed supervised pre-training, particularly when labeled data is scarce. Medical imaging has seen significant adoption, with contrastive methods enabling effective models for pathology, radiology, and dermatology where labeled data is expensive and limited.
In NLP, contrastive learning has improved sentence and document embeddings. SimCSE (Gao et al., 2021) applied contrastive objectives to learn sentence representations from either unsupervised dropout-based augmentation or supervised natural language inference pairs [10]. The resulting embeddings significantly outperformed prior methods on semantic textual similarity benchmarks. Contrastive objectives also appear in dense retrieval systems, where query and passage encoders are trained contrastively to support efficient search.
Beyond CLIP, contrastive learning connects diverse modalities. AudioCLIP extends the CLIP framework to audio. ImageBind (Meta, 2023) used contrastive learning to bind six modalities (images, text, audio, depth, thermal, and IMU data) into a shared embedding space. These multimodal contrastive models enable cross-modal retrieval, zero-shot recognition, and flexible composition of different sensor inputs.
Contrastive learning has been applied to recommendation systems to learn user and item representations. By treating user interactions as positive pairs and non-interactions as negatives, models learn embeddings that capture user preferences without requiring extensive feature engineering.
Contrastive learning is a subset of self-supervised learning. Self-supervised learning encompasses any method that derives supervisory signals from the data itself, including generative approaches (autoencoders, masked language modeling), predictive approaches (rotation prediction, jigsaw puzzles), and contrastive approaches. Contrastive methods specifically learn by comparing examples rather than reconstructing them. The boundary has blurred over time: methods like BYOL, Barlow Twins, and VICReg are sometimes called "non-contrastive" self-supervised methods because they do not explicitly use negative pairs, though they share the same goal of learning invariant representations through data augmentation.
By 2025, contrastive learning has matured from a research technique into a foundational tool used across industry and academia. Several trends define the current landscape.
First, contrastive pre-training is embedded in most foundation models. Models like CLIP, SigLIP, and their successors provide the vision-language alignment that powers multimodal assistants, image generation systems, and retrieval engines. Meta's DINOv3, released in 2025, is a 7-billion parameter Vision Transformer trained on 1.7 billion images using a self-distillation approach that builds directly on contrastive and non-contrastive self-supervised methods [11].
Second, the distinction between contrastive and non-contrastive methods has become less important in practice. Modern systems freely combine contrastive losses with masked image modeling, distillation, and other objectives. DINOv2 and DINOv3, for example, merge self-distillation with masked prediction in a single training pipeline.
Third, contrastive learning has scaled to new domains. In the life sciences, contrastive pre-training powers protein structure prediction, drug discovery, and pathology foundation models. Virchow, a ViT-Huge model trained by Paige and Microsoft on nearly 1.5 million pathology slides using DINOv2, demonstrates the technique's value in specialized medical domains [12].
Fourth, efficiency improvements have made contrastive learning more accessible. Methods like SwAV's multi-crop strategy, knowledge distillation from large to small models (as in DINOv3), and better data curation (as in MetaCLIP) reduce the computational cost of training high-quality representations.
The trajectory is clear: contrastive learning is no longer a standalone research topic but a core component integrated into the broader toolkit of representation learning, transfer learning, and foundation model training.