# Contrastive Learning

> Source: https://aiwiki.ai/wiki/contrastive_learning
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [self-supervised learning](/wiki/self-supervised_learning), [representation learning](/wiki/representation), [metric learning](/wiki/metric_learning), [transfer learning](/wiki/transfer_learning), [deep learning](/wiki/deep_learning)*

Contrastive learning is a family of [machine learning](/wiki/machine_learning) methods that learn representations by pulling similar (positive) pairs of data points closer together in an [embedding](/wiki/embeddings) space while pushing dissimilar (negative) pairs apart, without relying on explicit class labels. It is one of the dominant paradigms for [self-supervised learning](/wiki/self-supervised_learning) of visual, textual, and multimodal features from unlabeled data. The approach became influential in 2020 when SimCLR (Chen et al.) showed that "a linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50." [2] The following year, [CLIP](/wiki/clip) (Radford et al., 2021) trained image and text encoders on 400 million internet image-text pairs to enable zero-shot [image classification](/wiki/image_classification_models), matching the ImageNet accuracy of a supervised ResNet-50 with no task-specific training. [8]

## What is contrastive learning?

Contrastive learning is a family of [machine learning](/wiki/machine_learning) methods that learn representations by comparing data points against each other. The central idea is to pull representations of similar (positive) pairs closer together in an [embedding](/wiki/embeddings) space while pushing representations of dissimilar (negative) pairs apart. By structuring the learning objective around pairwise or group-wise comparisons rather than explicit class labels, contrastive learning has become one of the most successful paradigms for [self-supervised learning](/wiki/self-supervised_learning), enabling models to learn useful visual, textual, and multimodal features from unlabeled data.

The roots of contrastive learning trace back to earlier work in [metric learning](/wiki/metric_learning) and Siamese networks. Bromley et al. (1993) introduced Siamese networks for signature verification, training two weight-sharing networks to compare input pairs. Hadsell, Chopra, and LeCun (2006) formalized the contrastive loss function, which minimizes distance between same-class pairs while enforcing a margin-based separation for different-class pairs. [13] The triplet loss introduced in FaceNet (Schroff et al., 2015) extended this idea by simultaneously considering an anchor, a positive sample, and a negative sample. [12]

The modern era of contrastive learning began in 2018 when van den Oord et al. proposed Contrastive Predictive Coding (CPC) and introduced the InfoNCE loss. [1] This was followed by a rapid succession of methods between 2019 and 2021, including [MoCo](/wiki/moco), [SimCLR](/wiki/simclr), [BYOL](/wiki/byol), [SwAV](/wiki/swav), Barlow Twins, and [CLIP](/wiki/clip). These methods demonstrated that self-supervised contrastive pretraining could match or even surpass supervised pretraining on downstream tasks such as [image classification](/wiki/image_classification_models), [object detection](/wiki/object_detection), and [semantic segmentation](/wiki/semantic_segmentation).

## Explain like I'm 5 (ELI5)

Imagine you have a big box of photos. You pick up one photo of a cat and then ask yourself: "Which other photos look like this one?" You put all the cat photos in one pile and all the non-cat photos in another pile. You do this without anyone telling you what a cat is. You just notice that some photos look alike and some do not.

Contrastive learning works the same way. A computer looks at pairs of things and learns to tell which ones are similar and which ones are different. Over time, it gets really good at noticing what makes things alike or unlike, and it can use that knowledge to do all sorts of tasks later, like sorting photos, finding matching items, or understanding what a sentence means.

## Core concept

At its heart, contrastive learning operates on pairs of data points. Given a data point (called an anchor), the method constructs one or more positive pairs (semantically similar examples) and one or more negative pairs (semantically dissimilar examples). A [neural network](/wiki/neural_network) encoder maps each data point into a vector in an embedding space, and a contrastive [loss function](/wiki/loss_function) optimizes the encoder so that positive pairs have high similarity (small distance) and negative pairs have low similarity (large distance).

The general pipeline consists of four stages:

1. **Data augmentation:** Create multiple views of each data point (e.g., two randomly augmented versions of the same image).
2. **Encoding:** Pass each view through an encoder network (e.g., a [ResNet](/wiki/resnet) or [Vision Transformer](/wiki/vision_transformer)) to obtain representation vectors.
3. **Projection:** Optionally map the representations through a small projection head (typically an [MLP](/wiki/perceptron)) into a lower-dimensional space where the contrastive loss is applied.
4. **Contrastive loss:** Compute a loss that encourages the embeddings of positive pairs to be close and embeddings of negative pairs to be far apart.

After pretraining, the projection head is discarded, and the encoder representations are used for downstream tasks, either through linear probing or [fine-tuning](/wiki/fine_tuning).

## Contrastive loss functions

Several loss functions have been developed for contrastive learning, each with different properties and trade-offs.

### Contrastive loss (pairwise)

The original contrastive loss, introduced by Hadsell et al. (2006), operates on pairs of examples. [13] Given two inputs x_i and x_j with representations z_i and z_j, and a binary label y indicating whether the pair is similar (y=1) or dissimilar (y=0):

**L = y * d(z_i, z_j)^2 + (1 - y) * max(0, m - d(z_i, z_j))^2**

where d is a distance function (typically Euclidean) and m is a margin hyperparameter. This loss pulls similar pairs together and pushes dissimilar pairs apart up to a margin m.
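The pairwise loss can be illustrated with a minimal sketch in plain Python. The function names, the default margin of 1.0, and the toy vectors are assumptions chosen for illustration, not values from Hadsell et al.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_contrastive_loss(z_i, z_j, y, margin=1.0):
    """y = 1 for a similar pair, y = 0 for a dissimilar pair."""
    d = euclidean(z_i, z_j)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

# Similar pair: the loss is the squared distance (here 5^2).
similar = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)
# Dissimilar pair already farther apart than the margin: no penalty.
far_negative = pairwise_contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0)
# Dissimilar pair inside the margin: penalised by (m - d)^2.
near_negative = pairwise_contrastive_loss([0.0, 0.0], [0.6, 0.0], y=0)
```

Dissimilar pairs stop contributing once they are separated by more than the margin, so the loss does not keep pushing them apart indefinitely.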

### Triplet loss

The triplet loss, introduced in FaceNet (Schroff et al., 2015), operates on triplets of (anchor, positive, negative) samples: [12]

**L = max(0, d(z_a, z_p) - d(z_a, z_n) + m)**

where z_a, z_p, z_n are the embeddings of the anchor, positive, and negative samples, d is a distance function, and m is a margin. The loss pushes the anchor closer to the positive than to the negative by at least margin m. Hard negative mining, the process of selecting negatives that are close to the anchor, is important for effective training with triplet loss.
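As a rough sketch of the triplet objective and of hard negative mining, the following plain-Python example uses hypothetical helper names, an assumed margin of 0.2, and made-up embeddings.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(z_a, z_p, z_n, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, euclidean(z_a, z_p) - euclidean(z_a, z_n) + margin)

def hardest_negative(z_a, candidates):
    """Hard negative mining: pick the candidate closest to the anchor."""
    return min(candidates, key=lambda z: euclidean(z_a, z))

anchor, positive = [0.0, 0.0], [0.1, 0.0]
negatives = [[2.0, 0.0], [0.2, 0.0]]

easy = triplet_loss(anchor, positive, negatives[0])  # far negative: loss is 0
hard_neg = hardest_negative(anchor, negatives)       # selects [0.2, 0.0]
hard = triplet_loss(anchor, positive, hard_neg)      # 0.1 - 0.2 + 0.2
```

The easy negative produces no gradient signal at all, which is why selecting informative negatives matters for this loss.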

### InfoNCE loss

The InfoNCE loss, whose name refers to its basis in noise-contrastive estimation (NCE), was introduced by van den Oord et al. (2018) in the Contrastive Predictive Coding paper. [1] It frames the contrastive objective as a classification problem: given an anchor and a candidate set containing one positive sample and K negative samples, identify the positive. The loss is:

**L_InfoNCE = -log( exp(sim(z_i, z_j) / tau) / sum_k exp(sim(z_i, z_k) / tau) )**

where sim is a similarity function (typically [cosine similarity](/wiki/cosine_similarity)), tau is a temperature parameter, z_j is the positive sample, and the sum runs over the positive and all K negative samples. The InfoNCE loss has a direct connection to mutual information estimation: minimizing InfoNCE maximizes a lower bound on the mutual information between the two views. [1]
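The classification view of InfoNCE can be sketched in a few lines of plain Python. The function names, the temperature of 0.1, and the toy vectors below are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.1):
    """Cross-entropy of picking the positive among 1 + K candidates."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce(anchor, [1.0, 0.0], negatives)     # positive matches the anchor
misaligned = info_nce(anchor, [0.0, 1.0], negatives)  # positive looks like a negative

# When the positive is indistinguishable from K = 3 negatives,
# the loss equals log(K + 1), the value of a uniform guess.
uniform = info_nce(anchor, [0.0, 1.0], [[0.0, 1.0]] * 3)
```

The `uniform` case shows the chance-level value of the loss: with K negatives that score the same as the positive, the model can do no better than log(K + 1).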

### NT-Xent loss

The Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss, introduced in SimCLR (Chen et al., 2020), is a variant of InfoNCE. [2] For a minibatch of N samples, each sample generates two augmented views, yielding 2N total views. For a positive pair (i, j), the loss is:

**L_NT-Xent(i,j) = -log( exp(sim(z_i, z_j) / tau) / sum_{k != i} exp(sim(z_i, z_k) / tau) )**

The key differences from the original InfoNCE are that NT-Xent uses cosine similarity (with L2 normalization) and treats all other 2(N-1) augmented views in the minibatch as negatives. The temperature parameter tau controls the sharpness of the distribution: lower values make the model more sensitive to differences in similarity, while higher values produce a softer distribution.
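A minimal plain-Python sketch of NT-Xent over a batch of 2N views follows. It assumes, purely for illustration, that rows 2k and 2k+1 of the input are the two views of sample k; the function names and the temperature of 0.5 are likewise assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def nt_xent(views, tau=0.5):
    """Average NT-Xent loss; rows 2k and 2k+1 are assumed to be a positive pair."""
    z = [l2_normalize(v) for v in views]
    n = len(z)  # n = 2N views in total
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner (0<->1, 2<->3, ...)
        # Similarities to every other view; the anchor itself is excluded (k != i).
        logits = {k: sum(a * b for a, b in zip(z[i], z[k])) / tau
                  for k in range(n) if k != i}
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits.values()))
        total += log_denom - logits[j]
    return total / n

# N = 2 samples, so each anchor sees 2(N - 1) = 2 negatives.
paired = nt_xent([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
mispaired = nt_xent([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
```

The loss is lower when the two views of each sample land close together (`paired`) than when the positive partner resembles another sample (`mispaired`).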

### Supervised contrastive loss (SupCon)

Khosla et al. (2020) extended the contrastive loss to the fully supervised setting. [9] Rather than having a single positive per anchor, the supervised contrastive loss treats all samples from the same class as positives:

**L_SupCon = sum_{i} (-1/|P(i)|) * sum_{p in P(i)} log( exp(sim(z_i, z_p) / tau) / sum_{k != i} exp(sim(z_i, z_k) / tau) )**

where P(i) is the set of all positives for anchor i (same-class samples). This formulation pulls together all same-class representations while pushing apart different-class ones. On ResNet-200, SupCon achieved 81.4% top-1 accuracy on [ImageNet](/wiki/imagenet), which the authors reported as 0.8% above the best number then reported for that architecture, surpassing the cross-entropy baseline. [9]
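The multiple-positive structure of SupCon can be sketched as follows. This is an illustrative plain-Python version of the formula above, with hypothetical function names, an assumed temperature of 0.1, and toy data.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sup_con_loss(embeddings, labels, tau=0.1):
    """Sum over anchors of the average log-probability loss over same-class positives."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        logits = {k: sum(a * b for a, b in zip(z[i], z[k])) / tau
                  for k in range(n) if k != i}
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits.values()))
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
    return total

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
consistent = sup_con_loss(embeddings, labels=[0, 0, 1, 1])    # classes match the clusters
inconsistent = sup_con_loss(embeddings, labels=[0, 1, 0, 1])  # classes cut across clusters
```

When the labels agree with the geometry of the embeddings the loss is small; when same-class samples sit far apart the loss is large, which is the signal that pulls them together during training.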

### Comparison of loss functions

| Loss function | Year | Inputs per step | Negatives required | Key property |
|---|---|---|---|---|
| Contrastive loss | 2006 | Pair | Yes | Margin-based, pairwise |
| Triplet loss | 2015 | Triplet (anchor, pos, neg) | Yes | Requires hard negative mining |
| InfoNCE | 2018 | 1 positive + K negatives | Yes | Bounds mutual information |
| NT-Xent | 2020 | Minibatch (2N views) | Yes (in-batch) | Cosine similarity, temperature scaling |
| SupCon | 2020 | Minibatch with labels | Yes (different classes) | Multiple positives per anchor |

## Positive and negative pair construction

How positive and negative pairs are formed is one of the most consequential design decisions in contrastive learning.

### Positive pairs

In self-supervised contrastive learning, positive pairs are typically constructed through [data augmentation](/wiki/data_augmentation). Two randomly augmented versions of the same input are treated as a positive pair. The augmentations should change low-level appearance while preserving semantic content. For images, this commonly involves random cropping with resizing, color jittering, Gaussian blur, grayscale conversion, and horizontal flipping. [2]

In supervised settings, any two samples from the same class can serve as positives. Some methods also use nearest neighbors in the embedding space as positives (e.g., NNCLR by Dwibedi et al., 2021), which provides greater diversity than augmentation-based positives alone.

### Negative pairs

Negative pairs consist of semantically different inputs. In self-supervised settings, negatives are typically all other samples in the minibatch (in-batch negatives). The effectiveness of the contrastive objective depends heavily on having a sufficient number and diversity of negatives.

There are several strategies for negative sampling:

| Strategy | Description | Used by |
|---|---|---|
| In-batch negatives | Other samples in the same minibatch serve as negatives | SimCLR |
| Memory bank / queue | A queue stores encoded representations from previous batches | MoCo |
| Hard negative mining | Negatives are selected to be close to the anchor in embedding space | FaceNet, various |
| Debiased sampling | Corrects for false negatives (samples that are actually similar to the anchor) | Chuang et al. (2020) |

A persistent challenge is false negatives: in unlabeled data, a randomly sampled "negative" might actually be semantically similar to the anchor (e.g., two different photos of dogs). False negatives introduce noise into the training signal and can degrade learned representations.

## Data augmentation in contrastive learning

Data augmentation is the primary mechanism for constructing positive pairs in self-supervised contrastive learning and has a significant effect on representation quality.

### Image augmentations

SimCLR (Chen et al., 2020) systematically studied which augmentation combinations matter most. Their findings revealed that random cropping combined with color distortion is the single most important augmentation composition. Without color distortion, different crops from the same image can be distinguished trivially by their color histograms, leading the model to learn a shortcut rather than semantic features. [2]

Commonly used image augmentations include:

| Augmentation | Effect | Importance |
|---|---|---|
| Random resized crop | Forces the model to recognize objects at different scales and positions | Very high |
| Color jittering | Changes brightness, contrast, saturation, and hue | Very high (prevents color histogram shortcuts) |
| Gaussian blur | Smooths the image, removing fine texture details | Moderate |
| Horizontal flip | Mirrors the image horizontally | Moderate |
| Grayscale conversion | Removes all color information | Moderate |
| Solarization | Inverts pixels above a threshold | Used in BYOL, Barlow Twins |

SwAV introduced multi-crop augmentation, which generates two global views (224x224 resolution) and several smaller local views (96x96 resolution) from each image. This approach increases the number of positive pairs without proportionally increasing compute, because the smaller crops are cheaper to process. [5]

### Text augmentations

Contrastive learning in [natural language processing](/wiki/natural_language_processing) requires different augmentation strategies because text is discrete and symbolic. Common approaches include:

- **Dropout as augmentation:** SimCSE (Gao et al., 2021) passes the same sentence through the encoder twice with different [dropout](/wiki/dropout) masks, treating the two outputs as a positive pair. This minimal augmentation surprisingly achieves strong results. [10]
- **Back-translation:** Translate a sentence into another language and then back, producing a paraphrase.
- **Token-level perturbations:** Random word deletion, insertion, swapping, or synonym replacement.
- **Entailment pairs:** In supervised SimCSE, sentence pairs labeled as entailment in natural language inference datasets serve as positives, and contradiction pairs serve as hard negatives. [10]

## Methods and architectures

A large number of contrastive and contrastive-adjacent methods have been proposed since 2018. The following sections describe the most influential ones.

### Contrastive Predictive Coding (CPC)

Contrastive Predictive Coding (van den Oord et al., 2018) was one of the first methods to formalize the modern contrastive learning framework. CPC learns representations by predicting future observations in latent space using an autoregressive model. The method introduced the InfoNCE loss and demonstrated strong results across four domains: speech, images, text, and reinforcement learning. CPC established the idea that contrastive objectives can serve as a proxy for mutual information maximization between different parts of the input signal. [1]

### MoCo (Momentum Contrast)

[MoCo](/wiki/moco) (He et al., 2020) frames contrastive learning as a dictionary look-up problem. The method maintains a dynamic dictionary of encoded keys using a queue and a momentum-updated encoder: [3]

- **Queue-based dictionary:** A first-in-first-out queue stores encoded representations from previous minibatches, decoupling the dictionary size from the batch size. This allows MoCo to use a large number of negatives (e.g., 65,536) without requiring enormous batch sizes.
- **Momentum encoder:** The key encoder is updated as an exponential moving average of the query encoder, rather than by backpropagation. This produces slowly evolving, consistent keys despite the queue containing representations from different training steps.
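The two mechanisms above can be sketched with standard-library Python. The momentum coefficient, the tiny queue length, and the string placeholders standing in for encoded keys are assumptions for illustration only.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: the key encoder drifts slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# FIFO dictionary of keys. The length is 4 here for readability;
# the text above mentions sizes such as 65,536.
queue = deque(maxlen=4)
for step in range(3):
    batch_keys = [f"key_{step}_{i}" for i in range(2)]  # stand-ins for encoded keys
    queue.extend(batch_keys)  # the oldest keys are dropped automatically

# With m = 0.9 the key parameter moves only 10% of the way to the query parameter.
updated = momentum_update([1.0], [0.0], m=0.9)
```

After three steps the queue holds only the keys from the two most recent batches, so the number of available negatives depends on the queue length rather than on the batch size.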

MoCo v1 was published at CVPR 2020. [3] MoCo v2 (Chen et al., 2020) incorporated design improvements from SimCLR, specifically an MLP projection head and stronger data augmentations, achieving better results without needing large batch sizes. MoCo v3 (Chen et al., 2021) extended the framework to [Vision Transformers](/wiki/vision_transformer) (ViT) with a symmetric loss and was published at ICCV 2021.

On several downstream tasks including [object detection](/wiki/object_detection) and [semantic segmentation](/wiki/semantic_segmentation) on PASCAL VOC and [COCO](/wiki/coco_dataset), MoCo representations outperformed their supervised pretraining counterparts. [3]

### SimCLR

[SimCLR](/wiki/simclr) (Chen et al., 2020) is a simple framework for contrastive learning of visual representations, developed at [Google](/wiki/google) Research. SimCLR simplifies the contrastive learning pipeline by eliminating the need for specialized architectures, memory banks, or momentum encoders. [2]

The SimCLR framework consists of four components:

1. **Stochastic data augmentation:** Two random augmentations are applied to each image, creating a positive pair.
2. **Base encoder:** A [convolutional neural network](/wiki/convolutional_neural_network) (e.g., [ResNet-50](/wiki/resnet)) extracts representation vectors from augmented images.
3. **Projection head:** A small MLP maps representations to a lower-dimensional space where the NT-Xent loss is applied. Chen et al. found that including a nonlinear projection head substantially improved representation quality.
4. **NT-Xent loss:** The normalized temperature-scaled cross-entropy loss is computed over all positive and in-batch negative pairs.

The SimCLR paper summarized its three central findings in a single sentence: "We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning." [2] In concrete terms:

- The composition of data augmentations matters more than any individual augmentation. Random cropping combined with color distortion is the most effective combination.
- A learnable nonlinear projection head between the representation and the contrastive loss significantly improves performance.
- Contrastive learning benefits from larger batch sizes and longer training compared to supervised learning. SimCLR used batch sizes of up to 8,192.

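A linear classifier trained on SimCLR representations achieved 76.5% top-1 accuracy on ImageNet, a 7% relative improvement over the previous state of the art and matching the performance of a supervised ResNet-50. [2] This headline figure was obtained with a 4x-wide ResNet-50 encoder; with a standard-width ResNet-50, SimCLR reached 69.3% under linear evaluation. [2]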

### BYOL (Bootstrap Your Own Latent)

[BYOL](/wiki/byol) (Grill et al., 2020) challenged the assumption that negative pairs are necessary for contrastive learning. [4] As the authors put it, "BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other," and unlike prior state-of-the-art methods that rely on negative pairs, "BYOL achieves a new state of the art without them." [4] The two networks play distinct roles:

- The **online network** is trained to predict the target network's representation of the same image under a different augmented view.
- The **target network** is updated as an exponential moving average of the online network (similar to MoCo's momentum encoder).
- A **predictor head** on the online network breaks the symmetry between the two networks, preventing collapse.

BYOL does not use negative pairs at all. Instead, the asymmetric architecture (predictor + stop gradient on the target) prevents the trivial solution of mapping all inputs to a constant vector. Early follow-up analysis suggested that batch normalization in BYOL implicitly introduces a form of contrastive signal, but Richemond et al. (2020) showed that BYOL can work without batch normalization given appropriate architectural choices.

BYOL achieved 74.3% top-1 accuracy on ImageNet with a ResNet-50 and 79.6% with a larger ResNet, while being more robust to changes in batch size and augmentation choices than SimCLR. [4]

### SwAV (Swapping Assignments between Views)

[SwAV](/wiki/swav) (Caron et al., 2020) combines contrastive learning with online clustering. Instead of comparing features directly, SwAV assigns features to a set of learnable prototype vectors and enforces consistency between the cluster assignments of different views of the same image. [5]

The method works by:

1. Computing features for multiple augmented views of each image.
2. Assigning each feature to the nearest prototypes using an optimal transport algorithm (Sinkhorn-Knopp).
3. Training the network to predict the cluster assignment of one view from the features of another view ("swapped" prediction).
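The balanced assignment in step 2 can be illustrated with a small Sinkhorn-Knopp sketch in plain Python. The function name, the smoothing parameter `epsilon`, the iteration count, and the toy scores are assumptions; this is a simplified illustration rather than the reference implementation.

```python
import math

def sinkhorn(scores, epsilon=0.05, iterations=3):
    """scores[b][k]: similarity of sample b to prototype k.
    Returns soft assignments whose rows sum to 1 and whose columns are balanced."""
    num_samples, num_protos = len(scores), len(scores[0])
    q = [[math.exp(s / epsilon) for s in row] for row in scores]
    for _ in range(iterations):
        # Give every prototype (column) the same total mass, 1 / num_protos.
        col_sums = [sum(q[b][k] for b in range(num_samples)) for k in range(num_protos)]
        q = [[q[b][k] / (col_sums[k] * num_protos) for k in range(num_protos)]
             for b in range(num_samples)]
        # Give every sample (row) the same total mass, 1 / num_samples.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[b] * num_samples) for v in row] for b, row in enumerate(q)]
    return [[v * num_samples for v in row] for row in q]

# Every sample scores prototype 0 highest, so a plain argmax would ignore prototype 1.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignments = sinkhorn(scores)
column_mass = [sum(row[k] for row in assignments) for k in range(2)]
```

Although all four samples prefer prototype 0 in raw score, the balancing step reassigns the two weakest matches mostly to prototype 1, which is how the equal-partition constraint avoids the degenerate solution of one cluster absorbing everything.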

SwAV introduced multi-crop augmentation, using two standard resolution crops (224x224) and four to six lower resolution crops (96x96). This strategy increased the effective number of comparisons without significantly increasing memory usage.

SwAV achieved 75.3% top-1 accuracy on ImageNet with ResNet-50 and outperformed supervised pretraining on all considered transfer tasks. [5]

### Barlow Twins

Barlow Twins (Zbontar et al., 2021) takes a different approach inspired by neuroscientist H. Barlow's redundancy-reduction principle. The method measures the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of the same sample and pushes this matrix toward the identity matrix. [6]

The objective has two components:

- **Invariance term:** The diagonal elements of the cross-correlation matrix should be close to 1, meaning corresponding dimensions of the two representations should be highly correlated.
- **Redundancy reduction term:** The off-diagonal elements should be close to 0, meaning different dimensions should be uncorrelated (decorrelated).
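A minimal plain-Python sketch of the two terms follows. The function names, the trade-off weight `lam`, and the toy batches are illustrative assumptions, and the example assumes no embedding dimension is constant across the batch.

```python
import math

def standardize(column):
    """Zero mean, unit variance along the batch dimension (assumes non-constant input)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """z_a, z_b: batches of embeddings (lists of rows) from the two distorted views."""
    n, d = len(z_a), len(z_a[0])
    cols_a = [standardize([row[i] for row in z_a]) for i in range(d)]
    cols_b = [standardize([row[j] for row in z_b]) for j in range(d)]
    # Cross-correlation matrix between the dimensions of the two views.
    c = [[sum(x * y for x, y in zip(cols_a[i], cols_b[j])) / n for j in range(d)]
         for i in range(d)]
    invariance = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    redundancy = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return invariance + lam * redundancy

# Dimensions are uncorrelated: the cross-correlation matrix is the identity.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Both dimensions carry the same signal: off-diagonal entries equal 1.
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]

loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)
loss_redundant = barlow_twins_loss(redundant, redundant)
```

Both toy batches are perfectly invariant across views, so only the redundancy term separates them: the duplicated dimension is penalised while the decorrelated embedding incurs no loss.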

Barlow Twins does not require large batch sizes, asymmetric architectures, stop gradients, or momentum updates. It benefits from high-dimensional output vectors (e.g., 8,192 dimensions), in contrast to most other methods that use lower-dimensional projections. [6]

### VICReg (Variance-Invariance-Covariance Regularization)

VICReg (Bardes, Ponce, and LeCun, 2022) decomposes the self-supervised learning objective into three explicit regularization terms: [7]

- **Variance:** Maintains the standard deviation of each embedding dimension above a threshold, preventing informational collapse.
- **Invariance:** Minimizes the mean squared distance between embeddings of different views of the same image.
- **Covariance:** Decorrelates pairs of embedding dimensions, encouraging the network to use all available dimensions.
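The three terms can be computed separately, as in the following plain-Python sketch. The function name, the threshold `gamma`, the stabiliser `eps`, and the toy batches are assumptions; the weights used to combine the terms into a single loss are deliberately left out.

```python
import math

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two batches of embeddings."""
    n, d = len(z_a), len(z_a[0])

    # Invariance: mean squared distance between paired embeddings.
    invariance = sum(sum((a - b) ** 2 for a, b in zip(ra, rb))
                     for ra, rb in zip(z_a, z_b)) / n

    def column_stats(z):
        means = [sum(row[j] for row in z) / n for j in range(d)]
        centered = [[row[j] - means[j] for j in range(d)] for row in z]
        return centered

    def variance_term(z):
        centered = column_stats(z)
        stds = [math.sqrt(sum(r[j] ** 2 for r in centered) / (n - 1) + eps)
                for j in range(d)]
        # Hinge: penalise dimensions whose standard deviation falls below gamma.
        return sum(max(0.0, gamma - s) for s in stds) / d

    def covariance_term(z):
        centered = column_stats(z)
        total = 0.0
        for i in range(d):
            for j in range(d):
                if i != j:
                    cov = sum(r[i] * r[j] for r in centered) / (n - 1)
                    total += cov ** 2
        return total / d

    return (invariance,
            variance_term(z_a) + variance_term(z_b),
            covariance_term(z_a) + covariance_term(z_b))

collapsed = [[0.5, 0.5]] * 4  # every input mapped to the same vector
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

collapsed_terms = vicreg_terms(collapsed, collapsed)
spread_terms = vicreg_terms(spread, spread)
```

The collapsed batch has a perfect invariance score of zero, yet the variance term is large, which is how the explicit variance regulariser rules out the constant-output solution.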

VICReg does not require weight sharing between branches, batch normalization, feature-wise normalization, output quantization, stop gradients, or memory banks, yet achieves results on par with state-of-the-art methods. [7]

### DINO (Self-Distillation with No Labels)

DINO (Caron et al., 2021) applies self-distillation to [Vision Transformers](/wiki/vision_transformer) in a contrastive-like framework. A student network is trained to match the output of a teacher network (updated via exponential moving average) across different augmented views. Unlike MoCo or SimCLR, DINO uses a cross-entropy loss between the softmax outputs of the student and teacher rather than a contrastive similarity-based loss. [11]
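The student-teacher cross-entropy can be sketched in plain Python as follows. The function names, the two temperatures, and the toy logits are assumptions for illustration, and details of the published method beyond the loss itself are omitted.

```python
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student); the teacher output is a fixed target."""
    p_teacher = softmax(teacher_logits, t_teacher)
    p_student = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher_view = [1.0, 0.0, 0.0]
agree = distillation_loss([1.0, 0.0, 0.0], teacher_view)     # student matches the teacher
disagree = distillation_loss([0.0, 1.0, 0.0], teacher_view)  # student prefers another output
```

No negatives appear anywhere in this objective: the student is simply penalised when its output distribution for one view differs from the teacher's distribution for another view of the same image.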

DINO's self-supervised ViT features exhibit notable emergent properties. As the authors observed, "self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets." [11] In the attention maps, different heads attend to different semantic parts of the image. DINO achieved 80.1% top-1 accuracy on ImageNet in linear evaluation with ViT-Base, and 78.3% top-1 with a k-nearest-neighbor classifier using a small ViT. [11]

### Comparison of methods

| Method | Year | Venue | Negatives required | Key mechanism | ImageNet top-1 (linear, ResNet-50) |
|---|---|---|---|---|---|
| CPC | 2018 | arXiv | Yes | Autoregressive prediction + InfoNCE | N/A (evaluated on other tasks) |
| MoCo v1 | 2020 | CVPR | Yes | Momentum encoder + queue | 60.6% |
| SimCLR | 2020 | ICML | Yes (in-batch) | Large batch + projection head + NT-Xent | 69.3% (76.5% with ResNet-50 4x) |
| MoCo v2 | 2020 | arXiv | Yes | MoCo + SimCLR improvements | 71.1% |
| BYOL | 2020 | NeurIPS | No | Online/target networks + predictor | 74.3% |
| SwAV | 2020 | NeurIPS | No (uses prototypes) | Online clustering + multi-crop | 75.3% |
| Barlow Twins | 2021 | ICML | No | Cross-correlation + redundancy reduction | 73.2% |
| VICReg | 2022 | ICLR | No | Variance + invariance + covariance terms | 73.2% |
| DINO | 2021 | ICCV | No | Self-distillation + ViT | 75.3% (77.0% with ViT-S/16) |

## Non-contrastive alternatives

A subset of self-supervised methods learn representations without explicit negative pairs. These methods, sometimes called non-contrastive, prevent collapse through architectural asymmetry, regularization, or clustering rather than through positive-negative comparisons.

### Avoiding collapse without negatives

The fundamental challenge for non-contrastive methods is preventing representational collapse, where the encoder maps all inputs to the same constant vector. Several strategies have been developed:

| Strategy | Methods using it | How it prevents collapse |
|---|---|---|
| Stop gradient + predictor | BYOL, SimSiam | Asymmetric gradient flow prevents trivial solution |
| Momentum encoder | BYOL, DINO | Slowly evolving target provides a stable learning signal |
| Redundancy reduction | Barlow Twins | Decorrelation of embedding dimensions ensures information is distributed |
| Variance regularization | VICReg | Explicit variance term prevents dimensional collapse |
| Online clustering | SwAV | Equipartition constraint via Sinkhorn-Knopp prevents degenerate assignments |
| Batch normalization | Implicit in several methods | Implicitly spreads representations across the batch |

SimSiam (Chen and He, 2021) demonstrated that a simple Siamese network with a stop gradient and a predictor head can learn useful representations without negative pairs, momentum encoders, or large batches. The authors showed that the stop gradient operation is essential: without it, the model collapses. [14]

## What is contrastive learning used for?

Contrastive learning has been applied across many domains beyond its origins in computer vision.

### Computer vision

Contrastive pretraining has become a standard approach for learning visual representations. Self-supervised contrastive models pretrained on ImageNet or larger unlabeled datasets produce features that transfer well to downstream tasks:

- **Image classification:** Linear classifiers trained on frozen contrastive representations approach or match the accuracy of fully supervised models.
- **Object detection and segmentation:** MoCo representations outperformed supervised pretraining on [PASCAL VOC](/wiki/pascal_voc) and [COCO](/wiki/coco_dataset) detection and segmentation tasks. [3]
- **Medical imaging:** Contrastive pretraining is particularly valuable in medical imaging, where labeled data is scarce and expensive. Models pretrained with contrastive objectives on unlabeled medical images can be fine-tuned with small labeled datasets.

### Natural language processing

Contrastive learning has been applied to learn sentence and document representations:

- **SimCSE** (Gao et al., 2021) learns sentence embeddings by passing each sentence through the encoder twice with different dropout masks (unsupervised) or using natural language inference pairs (supervised). It achieved strong results on semantic textual similarity benchmarks, with the unsupervised BERT-base model reaching 76.3% average Spearman correlation and the supervised version reaching 81.6%. [10]
- **Sentence-BERT** (Reimers and Gurevych, 2019) uses Siamese BERT networks fine-tuned on NLI data, with classification, regression, or triplet objectives, to produce fixed-size sentence embeddings that can be compared with cosine similarity.
- **Dense passage retrieval** methods use contrastive training to learn query and document encoders for information retrieval.

### Multimodal learning

[CLIP](/wiki/clip) (Radford et al., 2021) is the most prominent example of multimodal contrastive learning. CLIP trains separate image and text encoders to align their representations in a shared embedding space using a contrastive objective. Given a batch of N (image, text) pairs, CLIP maximizes the cosine similarity of the N correct pairs while minimizing the similarity of the N^2 - N incorrect pairs. [8]
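The batch-level objective can be sketched as a symmetric cross-entropy over the N x N similarity matrix. The plain-Python version below uses hypothetical function names, a fixed temperature of 0.07 chosen for illustration, and one-hot toy embeddings in place of real encoder outputs.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cross_entropy_on_diagonal(logit_rows):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logit_rows):
        peak = max(row)
        total += peak + math.log(sum(math.exp(l - peak) for l in row)) - row[i]
    return total / len(logit_rows)

def clip_style_loss(image_emb, text_emb, tau=0.07):
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # logits[i][j]: scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / tau for j in range(n)]
              for i in range(n)]
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal([list(col) for col in zip(*logits)])
    return (image_to_text + text_to_image) / 2.0

images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

matched = clip_style_loss(images, texts)                    # captions aligned with images
shuffled = clip_style_loss(images, texts[1:] + texts[:1])   # captions rotated by one
incorrect_pairs = len(images) ** 2 - len(images)            # N^2 - N off-diagonal pairs
```

The N correct pairs sit on the diagonal of the similarity matrix and the remaining N^2 - N entries act as negatives in both the image-to-text and text-to-image directions.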

CLIP was trained on 400 million image-text pairs scraped from the internet (the WebImageText dataset). After training, CLIP enables zero-shot image classification: the model classifies images by comparing their embeddings to text embeddings of class descriptions, without any task-specific training. Zero-shot CLIP matched the ImageNet accuracy of a supervised ResNet-50 and demonstrated strong transfer performance across dozens of datasets. [8]
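Zero-shot classification then reduces to a nearest-neighbor lookup in the shared embedding space. In the sketch below, the function names, the prompt strings, and the hand-written vectors are all hypothetical stand-ins for the outputs of trained image and text encoders.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class description whose text embedding is most similar to the image."""
    return max(class_text_embeddings,
               key=lambda name: cosine(image_embedding, class_text_embeddings[name]))

# Made-up embeddings standing in for text-encoder outputs of class descriptions.
class_text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]  # made-up image-encoder output

prediction = zero_shot_classify(image_embedding, class_text_embeddings)
```

New classes can be added by encoding another description, with no retraining of either encoder.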

CLIP's influence extends well beyond classification. Its text encoder is used in [Stable Diffusion](/wiki/stable_diffusion) and other text-to-image generation systems, its embeddings power image search and retrieval systems, and the approach has been extended to additional modalities including audio (CLAP) and video.

### Other domains

| Domain | Application | Example methods |
|---|---|---|
| Speech and audio | Speaker verification, speech representation learning | wav2vec 2.0, CPC |
| Graphs | Node and graph-level representation learning | GraphCL, GCC |
| Reinforcement learning | State representation learning from observations | CURL, CPC for RL |
| Time series | Temporal representation learning | TS2Vec, TNC |
| Recommendation systems | User and item representation learning | CLRec, SGL |

## How does contrastive learning differ from metric learning?

Contrastive learning and [metric learning](/wiki/metric_learning) share the same core objective: learning an embedding space where similar items are close and dissimilar items are far apart. Both fields use loss functions based on distances or similarities between embeddings, and both rely on effective sampling of positive and negative pairs.

However, there are important differences:

| Aspect | Traditional metric learning | Modern contrastive learning |
|---|---|---|
| Supervision | Typically supervised (requires labels) | Often self-supervised (labels not required) |
| Positive pairs | Defined by class labels | Defined by data augmentation |
| Scale | Often trained on smaller, curated datasets | Designed for large-scale, unlabeled data |
| Primary goal | Learn a distance metric for retrieval or verification | Learn general-purpose representations |
| Loss functions | Contrastive loss, triplet loss, N-pair loss | InfoNCE, NT-Xent, and variants |
| Negatives per anchor | Typically 1 (triplet) or a few | Hundreds to thousands (in-batch or queued) |

Modern contrastive learning can be viewed as a scaled-up, self-supervised extension of metric learning. The InfoNCE loss generalizes the N-pair loss from metric learning, and many of the same principles (hard negative mining, embedding normalization, margin selection) apply in both settings.

## Key design considerations and hyperparameters

### Temperature

The temperature parameter tau in the InfoNCE and NT-Xent losses controls how sharply the model distinguishes between positive and negative pairs. A lower temperature (e.g., tau = 0.05) makes the loss highly sensitive to small differences in similarity, producing stronger gradients for hard negatives but risking numerical instability. A higher temperature (e.g., tau = 0.5) produces a softer distribution that is more forgiving of similarity differences.
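The effect of the temperature can be seen by applying a softmax to the same similarity scores at two values of tau. The function name and the toy similarities below are illustrative assumptions.

```python
import math

def softmax_with_temperature(similarities, tau):
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# One positive (0.9), one hard negative (0.8), one easy negative (0.1).
similarities = [0.9, 0.8, 0.1]

sharp = softmax_with_temperature(similarities, tau=0.05)
soft = softmax_with_temperature(similarities, tau=0.5)
```

At tau = 0.05 nearly all of the remaining probability mass falls on the hard negative and the easy negative is effectively ignored, whereas at tau = 0.5 all three candidates receive a noticeable share.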

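MoCo used tau = 0.07, and the supervised contrastive loss found tau = 0.1 to work well. SimCLR's temperature ablation favored tau = 0.1 for ImageNet linear evaluation, while its CIFAR-10 experiments used tau = 0.5. [2] Because the best value shifts with the method, dataset, and batch size, temperature usually has to be tuned empirically, and performance can be highly sensitive to it.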

### Batch size

Larger batch sizes provide more in-batch negatives, which generally improves the quality of the learned representations in methods that use in-batch negatives (like SimCLR). SimCLR's performance improved substantially going from batch size 256 to 8,192. [2] However, very large batch sizes require significant GPU memory and may necessitate distributed training.

MoCo addresses this constraint by decoupling the number of negatives from the batch size through its queue mechanism, allowing a large number of negatives (65,536) even with modest batch sizes (256). [3]
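The decoupling can be sketched with a fixed-size first-in, first-out queue. The example is a simplified illustration rather than MoCo's actual implementation: the class name `NegativeQueue` is hypothetical, tuples stand in for key embeddings, and the momentum-updated key encoder is omitted entirely.

```python
from collections import deque

class NegativeQueue:
    """Fixed-size FIFO buffer of key embeddings from earlier batches."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest items once it is full.
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

    def negatives(self):
        return list(self.keys)

queue = NegativeQueue(max_size=65536)
batch_size = 256

for step in range(300):
    # Stand-ins for the key embeddings of the current batch (assumed).
    batch_keys = [(step, i) for i in range(batch_size)]
    # In a real training loop the loss would be computed against
    # queue.negatives() here, before the current keys are enqueued.
    queue.enqueue(batch_keys)

print(len(queue.negatives()))   # number of negatives available per query
print(queue.negatives()[0])     # oldest key still in the queue
```

After 300 steps of 256 keys each, more keys have been produced than the queue can hold, so the queue stays at its maximum size and the oldest batches have been evicted. The number of negatives is therefore set by the queue length rather than the batch size.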

### Projection head

SimCLR demonstrated that applying the contrastive loss on a projected representation (after an MLP projection head) rather than directly on the encoder output substantially improves downstream task performance. [2] The intuition is that the contrastive objective rewards invariance to the augmentations, so the layer it is applied to tends to discard information that varies between augmented views (e.g., color or orientation), even though that information may be useful for downstream tasks. Placing a projection head between the encoder and the loss lets the head absorb this loss of information, so the encoder output can retain it. After pretraining, the projection head is discarded and only the encoder representations are used.
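The split between the encoder output and the projected representation can be sketched as follows. This is a toy illustration with assumed names and sizes (`ProjectionHead`, an 8-dimensional stand-in for the encoder output, random untrained weights). It only shows where the head sits, not how it is trained.

```python
import random

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(n_in, n_out, rng):
    # Random initial weights; a real head would be trained with the encoder.
    weights = [[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class ProjectionHead:
    """Two-layer MLP (linear, ReLU, linear) mapping h to z."""

    def __init__(self, dim_in, dim_hidden, dim_out, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = make_layer(dim_in, dim_hidden, rng)
        self.w2, self.b2 = make_layer(dim_hidden, dim_out, rng)

    def __call__(self, h):
        return linear(relu(linear(h, self.w1, self.b1)), self.w2, self.b2)

# Stand-in for an encoder output h (assumed values).
h = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]

head = ProjectionHead(dim_in=8, dim_hidden=8, dim_out=4)
z = head(h)  # z is what the contrastive loss sees during pretraining

# After pretraining, the head is dropped and downstream tasks consume h.
print(len(h), len(z))
```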

### Encoder architecture

Most early contrastive learning methods used [ResNet](/wiki/resnet) architectures (especially ResNet-50) as the backbone encoder. More recent work has adopted [Vision Transformers](/wiki/vision_transformer) (ViT), which tend to benefit even more from self-supervised pretraining than CNNs. DINO, MoCo v3, and subsequent methods have shown that ViTs trained with contrastive or self-distillation objectives learn representations with distinctive properties, such as attention maps that capture semantic segmentation. [11]

## Limitations and challenges

Despite its successes, contrastive learning has several notable limitations.

### Computational cost

Many contrastive methods require large batch sizes (SimCLR) or large queues of negatives (MoCo) to achieve strong performance. Training SimCLR with a batch size of 8,192 requires substantial GPU memory and compute. Additionally, contrastive pretraining typically requires many more epochs than supervised training (e.g., 800-1000 epochs on ImageNet compared to 90 for supervised training).

### Temperature sensitivity

The temperature hyperparameter strongly influences performance, and the optimal value varies across methods, datasets, and architectures. Finding a good temperature therefore tends to require extensive experimentation. Some recent work (e.g., Haochen et al., 2025) has proposed temperature-free loss functions to address this limitation.

### Dimensional collapse

Even when contrastive methods avoid complete representational collapse (mapping all inputs to the same vector), they can still suffer from dimensional collapse, where the embedding vectors effectively reside in a lower-dimensional subspace of the full embedding space. This wastes representational capacity. Methods like Barlow Twins and VICReg address this directly through decorrelation or variance regularization. [6][7]
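A simple diagnostic, and the idea behind a variance regularizer, can be sketched by measuring the standard deviation of each embedding dimension over a batch. The hinge below follows the general form of VICReg's variance term (penalizing dimensions whose standard deviation falls below a target gamma). The function names, the toy batches, and the choice gamma = 1 with eps = 1e-4 are assumptions for illustration. The check is axis-aligned: collapse onto a rotated subspace would require inspecting the spectrum of the covariance matrix instead.

```python
import math

def per_dimension_std(embeddings, eps=1e-4):
    n = len(embeddings)
    dims = len(embeddings[0])
    stds = []
    for d in range(dims):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        stds.append(math.sqrt(var + eps))  # eps keeps the sqrt well-behaved
    return stds

def variance_penalty(embeddings, gamma=1.0):
    # Hinge penalty: nonzero only for dimensions whose std is below gamma.
    stds = per_dimension_std(embeddings)
    return sum(max(0.0, gamma - s) for s in stds) / len(stds)

# Toy batches of 3-D embeddings (assumed values).
healthy = [[1.0, -1.0, 0.5], [-1.0, 1.0, -1.5], [0.0, 2.0, 1.0], [2.0, 0.0, -1.0]]
collapsed = [[1.0, -1.0, 0.3], [-1.0, 1.0, 0.3], [0.0, 2.0, 0.3], [2.0, 0.0, 0.3]]

print("healthy  :", variance_penalty(healthy))
print("collapsed:", variance_penalty(collapsed))
```

In the second batch the third coordinate is constant, so its standard deviation is close to zero and the penalty becomes positive, while the first batch incurs no penalty.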

### False negatives

In self-supervised settings, negatives are sampled randomly without knowledge of semantic similarity. Two images that happen to depict the same object or concept may be incorrectly treated as negatives, providing a misleading training signal. Debiased contrastive learning (Chuang et al., 2020) and other approaches attempt to mitigate this issue.
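A back-of-the-envelope estimate shows how common false negatives can be. The sketch below makes strong simplifying assumptions that are not from the source: classes are perfectly balanced, negatives are drawn independently and uniformly, and sharing a class label is used as a proxy for being a false negative. The function names and the negative counts are hypothetical.

```python
def expected_false_negatives(num_negatives, num_classes):
    # Each negative shares the anchor's class with probability 1 / num_classes.
    return num_negatives / num_classes

def prob_at_least_one_false_negative(num_negatives, num_classes):
    # Complement of "no negative shares the anchor's class".
    return 1.0 - (1.0 - 1.0 / num_classes) ** num_negatives

num_classes = 1000  # e.g., a dataset with 1,000 balanced classes
for num_negatives in (255, 8191):
    expected = expected_false_negatives(num_negatives, num_classes)
    prob = prob_at_least_one_false_negative(num_negatives, num_classes)
    print(f"{num_negatives} negatives: expected false negatives = {expected:.2f}, "
          f"P(at least one) = {prob:.3f}")
```

Under these assumptions, the expected number of false negatives per anchor grows linearly with the number of negatives, so the same large batches or queues that improve contrastive training also make false negatives nearly unavoidable.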

### Augmentation sensitivity

The quality of learned representations depends heavily on the choice of data augmentations. Augmentations that are too weak lead to trivial solutions, while augmentations that are too strong can destroy semantic information. Finding the right augmentation pipeline for a new domain or data type often requires domain-specific knowledge and experimentation.

### Transfer gap

While contrastive pretraining produces strong representations for many downstream tasks, there can be a gap between the pretraining objective (instance discrimination) and the downstream task (e.g., dense prediction or fine-grained classification). This gap can be larger than with supervised pretraining for some specific tasks, particularly those requiring fine-grained spatial information.

## Timeline of contrastive learning

| Year | Development | Reference |
|---|---|---|
| 1993 | Siamese networks for signature verification | Bromley et al. |
| 2006 | Contrastive loss for dimensionality reduction | Hadsell, Chopra, LeCun |
| 2015 | Triplet loss (FaceNet) | Schroff et al. |
| 2018 | Contrastive Predictive Coding (CPC) and InfoNCE loss | van den Oord et al. |
| 2019 | MoCo v1 (Momentum Contrast) | He et al. |
| 2020 | SimCLR | Chen et al. (Google) |
| 2020 | MoCo v2 | Chen et al. (FAIR) |
| 2020 | BYOL (no negatives needed) | Grill et al. (DeepMind) |
| 2020 | SwAV (multi-crop + prototypes) | Caron et al. (FAIR) |
| 2020 | Supervised Contrastive Learning (SupCon) | Khosla et al. (Google) |
| 2021 | CLIP (contrastive language-image pretraining) | Radford et al. (OpenAI) |
| 2021 | Barlow Twins | Zbontar et al. (FAIR) |
| 2021 | DINO (self-distillation with ViT) | Caron et al. (FAIR) |
| 2021 | SimCSE (contrastive sentence embeddings) | Gao et al. (Princeton) |
| 2021 | MoCo v3 (ViT backbone) | Chen et al. (FAIR) |
| 2022 | VICReg | Bardes, Ponce, LeCun |

## References

1. van den Oord, A., Li, Y., & Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." *arXiv preprint arXiv:1807.03748*.
2. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." *Proceedings of the 37th International Conference on Machine Learning (ICML 2020)*.
3. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning." *Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020)*.
4. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning." *Advances in Neural Information Processing Systems (NeurIPS 2020)*.
5. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments." *Advances in Neural Information Processing Systems (NeurIPS 2020)*.
6. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). "Barlow Twins: Self-Supervised Learning via Redundancy Reduction." *Proceedings of the 38th International Conference on Machine Learning (ICML 2021)*.
7. Bardes, A., Ponce, J., & LeCun, Y. (2022). "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning." *International Conference on Learning Representations (ICLR 2022)*.
8. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML 2021)*.
9. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., ... & Krishnan, D. (2020). "Supervised Contrastive Learning." *Advances in Neural Information Processing Systems (NeurIPS 2020)*.
10. Gao, T., Yao, X., & Chen, D. (2021). "SimCSE: Simple Contrastive Learning of Sentence Embeddings." *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)*.
11. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). "Emerging Properties in Self-Supervised Vision Transformers." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021)*.
12. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)*.
13. Hadsell, R., Chopra, S., & LeCun, Y. (2006). "Dimensionality Reduction by Learning an Invariant Mapping." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2006)*.
14. Chen, X. & He, K. (2021). "Exploring Simple Siamese Representation Learning." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)*.
