SimCLR

Updated Jun 23, 2026

SimCLR (Simple Framework for Contrastive Learning of Visual Representations) is a self-supervised learning method for computer vision in which a network is trained to recognise that two differently augmented views of the same image belong together. Introduced by Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton at Google Research and Google Brain in February 2020, it was the first contrastive learning method to match supervised pretraining on ImageNet under linear evaluation: a linear classifier trained on features from a SimCLR-pretrained ResNet-50 (4x) reaches 76.5% top-1 accuracy, "a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50."^[1] The paper, A Simple Framework for Contrastive Learning of Visual Representations, was submitted to arXiv on 13 February 2020 (arXiv:2002.05709) and published at ICML 2020.^[1]

The idea behind SimCLR is unfussy: take an image, augment it twice in different ways, and ask the network to recognise that the two distorted versions came from the same source. Repeat over a batch of thousands. With enough compute, strong augmentations, and a small projection head bolted on top of a ResNet encoder, the network learns features good enough to classify ImageNet at 76.5% top-1 accuracy under linear evaluation, which at the time matched a fully supervised baseline.^[1] That result is what put contrastive self-supervised learning on the map.

A follow-up, Big Self-Supervised Models are Strong Semi-Supervised Learners (often called SimCLR v2), appeared at NeurIPS 2020 and pushed the linear-probe number to 79.8% top-1 with a wider ResNet-152, a 4.3% relative improvement over the prior state of the art.^[2] SimCLR's recipe and findings, especially around augmentations and the projection head, were absorbed almost wholesale by the methods that followed (MoCo v2, BYOL, SwAV, DINO) and by CLIP, which is essentially contrastive learning over image-text pairs instead of image-image pairs.

What problem did SimCLR solve?

Before 2019 most of computer vision ran on supervised pretraining on ImageNet. You took a labelled dataset of about 1.28 million images across 1,000 categories, trained a convolutional neural network with cross-entropy loss, then transferred the features to whatever downstream task you cared about. This worked, but it was expensive in labels, and the features tended to be biased toward the categorisation task they were trained on.

The self-supervised alternative had been kicking around for years. The general idea: instead of using labels, design a pretext task that the model has to solve using only the structure of the image itself, and hope the representations learned along the way are useful elsewhere. Early pretext tasks included predicting the relative position of image patches (Doersch et al., 2015), solving jigsaw puzzles (Noroozi and Favaro, 2016), colourising grayscale photos (Zhang et al., 2016), and predicting image rotations (Gidaris et al., 2018). They produced features that beat random initialisation but lagged supervised pretraining by a wide margin on standard benchmarks.

Contrastive methods then took over. The key idea, traceable to Hadsell, Chopra and LeCun in 2006^[10] and refined by Wu et al. (Instance Discrimination, CVPR 2018)^[12] and van den Oord et al. (CPC, 2018),^[11] is to learn an embedding where similar things end up close together and dissimilar things end up far apart. Apply that to images by treating different augmented views of the same image as similar (positives) and views of different images as dissimilar (negatives). This is closer to a dimensionality reduction objective than to a classification one.

Two papers landed within a few months of each other. Kaiming He's group at Facebook AI Research released MoCo (Momentum Contrast) in November 2019.^[3] Hinton's group at Google released SimCLR in February 2020.^[1] Both used an InfoNCE-style contrastive objective. MoCo handled the negatives problem with a momentum-updated key encoder and a queue of past samples. SimCLR went the other direction: just make the batch enormous and use the rest of the batch as negatives.

The two papers, taken together, are what people usually point to when they say self-supervised pretraining caught up with supervised pretraining on ImageNet. On the linear-evaluation protocol, SimCLR's 76.5% / 93.2% top-1/top-5 accuracy beat the previous best self-supervised result (CPC v2, at 71.5% / 90.1%) by a clear margin.^[14]

How does the SimCLR framework work?

SimCLR is, by design, the bare minimum that works. It has four components:^[1]

A stochastic data augmentation module that produces two correlated views of each image.
A base encoder f, typically a ResNet, that maps each augmented view to a representation vector h.
A small projection head g, an MLP, that maps h to a space z where the contrastive loss is applied.
A contrastive loss called NT-Xent (Normalised Temperature-scaled Cross Entropy) defined over pairs of z vectors.

At evaluation time the projection head g is thrown away. Downstream classifiers operate on the encoder representation h, not on z. This detail matters more than it sounds.

Augmentations

For each image in a minibatch of size N, two random augmentations are sampled, producing 2N total views. The augmentations the paper actually used:^[1]

Operation	Notes
Random crop and resize	Sampled with scale in [0.08, 1.0] then resized back to the input resolution. The strongest single augmentation.
Random horizontal flip	Applied with probability 0.5.
Color jitter	Random brightness, contrast, saturation, hue. Applied with probability 0.8; the default strength for ImageNet is 1.0 (the paper's CIFAR-10 experiments use 0.5). The second strongest.
Random grayscale	Applied with probability 0.2.
Gaussian blur	Kernel size 10% of the image, sigma sampled from [0.1, 2.0]. Applied with probability 0.5.

The paper specifically did not use rotation, Sobel filtering, or cutout in the main configuration. The point of Section 3 of the paper is that the combination of crop and color jitter is doing most of the work.^[1] Crop alone leaves a giveaway: two crops from the same image often share colour histograms, so the network can shortcut by matching colours instead of learning shape. Adding colour jitter forces it to ignore that shortcut.
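The application probabilities in the table can be sketched as a small sampler. This is an illustration only: it draws which operations would fire for each view and with what parameters, and does not touch pixels. The function names and the dictionary layout are hypothetical, not taken from the official code, and the jitter strength is left out.

```python
import random

def sample_view_params(rng):
    """Sample which augmentations apply to one view (parameters only, no pixels)."""
    params = {
        # Fraction of the original image area kept by the random crop.
        "crop_area_scale": rng.uniform(0.08, 1.0),
        "flip": rng.random() < 0.5,          # horizontal flip, p = 0.5
        "color_jitter": rng.random() < 0.8,  # colour jitter, p = 0.8
        "grayscale": rng.random() < 0.2,     # grayscale, p = 0.2
        "blur": rng.random() < 0.5,          # Gaussian blur, p = 0.5
    }
    if params["blur"]:
        # Blur sigma is drawn uniformly from [0.1, 2.0].
        params["blur_sigma"] = rng.uniform(0.1, 2.0)
    return params

def sample_two_views(seed):
    """Draw two independent augmentation settings for the same image."""
    rng = random.Random(seed)
    return sample_view_params(rng), sample_view_params(rng)

view_a, view_b = sample_two_views(seed=0)
```

Because the two draws are independent, the two views of one image will usually differ in crop, colour, and blur, which is what makes the matching task non-trivial.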

Encoder

The base encoder f is a ResNet. Most of the paper's tables use ResNet-50 in widths 1x, 2x, and 4x, where the multiplier scales the channel counts. The encoder produces a 2048-dimensional output after global average pooling for ResNet-50 (1x), with proportionally higher dimensionality for the wider variants. There is nothing SimCLR-specific about the encoder. You can plug in a vision transformer or any other backbone with no change to the rest of the recipe.

Projection head

This is where the v1 paper made one of its more counterintuitive findings. After the encoder, SimCLR adds a small two-layer MLP with a hidden dimension of 2048 and an output dimension of 128. The contrastive loss is computed on these 128-dimensional projections. After pretraining, the head is discarded.^[1]

The authors compared three options: no head (use h directly for the loss), a linear head, and a nonlinear (MLP) head. As the paper reports, "a nonlinear projection is better than a linear projection (+3%), and much better than no projection (>10%)."^[1] Crucially, they also found that "the hidden layer before the projection head is a better representation than the layer after," which is why the head is thrown away and the encoder output h is used downstream.^[1] The intuition the paper offers is that the contrastive objective discourages the layer it is applied to from carrying information that distinguishes images of the same class. By placing a head between the encoder and the loss, you let the head specialise in losing that information, while h keeps richer features useful for downstream tasks.
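The relationship between h and z can be sketched in a few lines of plain Python. This is a toy sketch under stated assumptions: the dimensions are shrunk to 8 and 4 where the paper uses 2048 and 128, the weights are random, normalisation layers are omitted, and the names ProjectionHead and init_layer are hypothetical.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(rng, n_in, n_out):
    """Random Gaussian weights and zero biases (illustrative initialisation)."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class ProjectionHead:
    """Two-layer MLP g: z = W2 * relu(W1 * h + b1) + b2."""
    def __init__(self, rng, in_dim, hidden_dim, out_dim):
        self.w1, self.b1 = init_layer(rng, in_dim, hidden_dim)
        self.w2, self.b2 = init_layer(rng, hidden_dim, out_dim)

    def __call__(self, h):
        return linear(relu(linear(h, self.w1, self.b1)), self.w2, self.b2)

rng = random.Random(0)
head = ProjectionHead(rng, in_dim=8, hidden_dim=8, out_dim=4)
h = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for an encoder output
z = head(h)  # used only inside the contrastive loss during pretraining
```

After pretraining, only h would be passed to a downstream classifier; head and z are dropped.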

This architectural detail has been adopted, with variations, by basically every contrastive method since.

NT-Xent loss

Given a batch of 2N augmented samples, indexed by i, the loss for a positive pair (i, j) is

L_{i,j} = -log( exp(sim(z_i, z_j)/tau) / sum_{k=1..2N, k!=i} exp(sim(z_i, z_k)/tau) )

where sim(u, v) = u^T v / (||u|| ||v||) is cosine similarity and tau is the temperature. The total loss is the average of L_{i,j} over all 2N positive pairs in the batch (each image contributes two positive pairs, one in each direction).^[1]

Mechanically, NT-Xent is a softmax classification problem. For each z_i, you have one correct answer (its positive partner) and 2N-2 negatives (the other augmented views in the batch). You want to maximise the softmax probability of the correct answer. The temperature scales the logits: lower temperature sharpens the distribution and concentrates gradient on hard negatives, higher temperature makes everything mushier. The paper sweeps the temperature and finds an appropriate value of tau (around 0.1 in the official ImageNet configuration) matters a great deal; it also shows that L2-normalising the projections before the loss is important.^[1] There is no margin parameter, no bank of negatives, no momentum encoder, no clustering step. Just normalised dot products and a temperature-scaled softmax.
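The loss is short enough to write out in plain Python. The sketch below follows the formula above and assumes that z[2k] and z[2k+1] hold the two views of image k; that pairing convention and the function names are choices made for this example. It is written for clarity, not speed, and assumes no projection is the zero vector.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nt_xent(z, tau=0.1):
    """Mean NT-Xent loss over all 2N anchors.

    z: list of 2N projection vectors; z[2k] and z[2k+1] are views of image k.
    """
    if len(z) % 2 != 0:
        raise ValueError("expected an even number of views")
    z = [l2_normalize(v) for v in z]  # cosine similarity becomes a dot product
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # positive partner: 0<->1, 2<->3, ...
        # Logits against every other view (the positive plus 2N-2 negatives).
        logits = [dot(z[i], z[k]) / tau for k in range(n) if k != i]
        positive = dot(z[i], z[j]) / tau
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive
    return total / n

aligned = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
loss_aligned = nt_xent(aligned)
loss_mismatched = nt_xent(mismatched)
```

In the aligned batch each view sits next to its partner in embedding space, so its loss is far lower than in the mismatched batch, where each view's nearest neighbour is a negative.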

The trick is that scaling the batch size also scales the number of negatives. With a batch of 4096, every image is contrasted against 8190 negatives in a single step (2N - 2 = 8190); the paper notes that a batch of 8192 "gives us 16382 negative examples per positive pair from both augmentation views."^[1] That is why SimCLR works without the queues and memory banks that earlier contrastive methods relied on. It is also why SimCLR is hungry for compute. You need TPU pods, accelerator memory, or substantial GPU clusters to fit those batches.
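The negative counts quoted above follow directly from the batch size. A one-line check, where negatives_per_anchor is a hypothetical helper name:

```python
def negatives_per_anchor(batch_size):
    """Each of the 2N views excludes itself and its one positive partner."""
    return 2 * batch_size - 2

counts = {n: negatives_per_anchor(n) for n in (256, 4096, 8192)}
```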

Why does SimCLR work? The four findings

The most cited part of the SimCLR paper is its ablation study. The authors made four claims and produced experimental evidence for each.^[1]

1. Augmentation composition is critical. The strongest single augmentation is random cropping. The strongest pair is cropping plus colour jitter. To isolate individual transformations, the ablation applied them to only one branch and left the other as a plain crop; the paper notes that this asymmetric setup hurts absolute performance, and the final recipe augments both views with the full pipeline.

2. A nonlinear projection head between the encoder and the loss is important, and it should be discarded at evaluation time. This boosted linear-evaluation accuracy by more than 10 points compared to applying the loss directly to the encoder output.^[1]

3. Larger batch sizes and longer training help more than they do in supervised learning. Linear-evaluation accuracy continued to improve up to a batch size of 8192 and 1000 training epochs, where the supervised counterpart had long since plateaued.^[1]

4. Scaling up models helps more in self-supervised pretraining than in supervised training. ResNet-50 (4x) did better than ResNet-50 (1x), and the gap to supervised baselines shrank as the model grew.^[1]

None of these findings is, in isolation, surprising in retrospect. But the paper's contribution was to lay them out cleanly and quantitatively, which forced the field to take them seriously.

How was SimCLR trained?

SimCLR is trained on Cloud TPU v3 hardware. The paper notes 32 to 128 TPU cores were used depending on the batch size, and reports that "with 128 TPU v3 cores, it takes ~1.5 hours to train our ResNet-50 with a batch size of 4096 for 100 epochs."^[1] Training is synchronous across all replicas because the contrastive loss requires global access to the entire batch of negatives.

The optimiser is LARS (Layer-wise Adaptive Rate Scaling), which is conventional for very large batch training. The default learning rate scales linearly: lr = 0.3 * BatchSize / 256, so lr = 4.8 at batch size 4096.^[1] The paper also explores a square-root scaling rule (lr = 0.075 * sqrt(BatchSize)), which gives the same value at the default batch size and works better for smaller batches. Linear warmup runs for the first 10 epochs, followed by a cosine decay schedule with no restarts. Weight decay is 1e-6.^[1] Batch normalisation statistics are aggregated globally across all TPU replicas. With per-device statistics, the two views of an image are normalised on the same device, and the model can exploit that local leakage to raise contrastive accuracy without improving the representation.
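The two scaling rules and the warmup-plus-cosine schedule can be written down directly from the numbers above. This is a sketch, not the official schedule: it works at epoch granularity (a real training loop would update per step), it assumes the cosine decays to zero, and the function names are hypothetical.

```python
import math

def peak_lr(batch_size, rule="linear"):
    """Peak learning rate under the two scaling rules described in the text."""
    if rule == "linear":
        return 0.3 * batch_size / 256
    if rule == "sqrt":
        return 0.075 * math.sqrt(batch_size)
    raise ValueError(f"unknown rule: {rule}")

def lr_at(epoch, total_epochs, batch_size, warmup_epochs=10, rule="linear"):
    """Linear warmup for the first epochs, then cosine decay without restarts."""
    peak = peak_lr(batch_size, rule)
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

linear_4096 = peak_lr(4096, "linear")  # 0.3 * 16
sqrt_4096 = peak_lr(4096, "sqrt")      # 0.075 * 64
schedule = [lr_at(e, total_epochs=100, batch_size=4096) for e in range(101)]
```

The two rules agree at batch size 4096 and diverge below it: at batch size 256 the square-root rule gives 1.2 against 0.3 for the linear rule, which matches the observation that it suits smaller batches.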

The NT-Xent loss is sensitive to the temperature. The ablation sweeps tau over a range with L2-normalised projections and shows that without L2 normalisation the contrastive accuracy can be higher while the downstream representation gets worse.^[1] The Cloud TPU reference config in the GitHub release uses tau = 0.1 for ImageNet pretraining.^[13]

The official code is at github.com/google-research/simclr. The repo includes TensorFlow implementations for both v1 and v2.^[13]

What results did SimCLR achieve on ImageNet?

SimCLR is evaluated through linear probe, fine-tuning, and semi-supervised splits. The headline number that gets quoted is linear evaluation on ImageNet: freeze the pretrained encoder, train a single linear classifier on top, report top-1 accuracy.

Model	Pretraining	ImageNet linear top-1	Notes
ResNet-50 (1x)	Supervised	76.5%	Standard supervised baseline.^[1]
ResNet-50 (1x)	SimCLR v1, 1000 epochs	69.3%	The headline 1x result from Table 6.^[1]
ResNet-50 (2x)	SimCLR v1, 1000 epochs	74.2%	Wider model.^[1]
ResNet-50 (4x)	SimCLR v1, 1000 epochs	76.5%	Matches the supervised ResNet-50 baseline.^[1]
ResNet-152 (3x+SK)	SimCLR v2, linear eval	79.8%	The largest SimCLR v2 backbone, over 795M parameters.^[2]

Figures are top-1 accuracy under the standard linear-evaluation protocol on the ImageNet validation set. Relative to prior self-supervised work, SimCLR's 76.5% / 93.2% (top-1/top-5) beat CPC v2's 71.5% / 90.1%.^[14]

Fine-tuning the encoder on small fractions of ImageNet labels showed even larger gains. With 1% of labels (about 12,800 images), SimCLR v1 reached 85.8% top-5 image classification accuracy, which the abstract describes as "outperforming AlexNet with 100X fewer labels."^[1] On the same protocol SimCLR's 63.0% / 85.8% (top-1/top-5) far exceeded the previous best self-supervised result of 52.7% / 77.9%.^[14] SimCLR v2 pushed this further: with the ResNet-152 (3x+SK) backbone it reported about 76.6% top-1 with 1% of labels and 80.9% top-1 with 10% of labels, while a ResNet-50 reached 73.9% top-1 with just 1% of labels, "a 10x improvement in label efficiency over the previous state-of-the-art."^[2]

Transfer to other classification benchmarks (CIFAR-10, CIFAR-100, Birdsnap, SUN397, Stanford Cars, Aircraft, DTD, Pets, Caltech-101, Flowers) was competitive with or better than supervised ImageNet pretraining on most datasets in the v1 paper's transfer table.^[1] Detection and segmentation gains were positive but more modest, which became a theme in subsequent self-supervised work: contrastive features transfer well to classification but are sometimes less helpful for dense prediction tasks like object detection.

What is SimCLR v2?

SimCLR v2 was submitted on 17 June 2020 as Big Self-Supervised Models are Strong Semi-Supervised Learners (Chen, Kornblith, Swersky, Norouzi, Hinton, NeurIPS 2020, arXiv:2006.10029).^[2] It is less a redesign than a careful scaling and a new emphasis on semi-supervised learning.

The changes from v1:^[2]

The encoder was scaled up to ResNet-152 (3x) with selective kernels (SK), a model with over 795 million parameters.
The projection head was deepened from 2 to 3 layers.
The first layer of the projection head is kept at fine-tuning time and used as part of the network (the optimal layer to fine-tune from). The whole head is discarded at the linear-probe step, but partial retention helped fine-tuning.
A memory mechanism borrowed from MoCo was added, buffering the outputs of a moving-average network as extra negatives. Because large batches already supply many negatives, the paper reports a gain of only about 1% from it.

The semi-supervised pipeline runs in three stages, described by the paper as "unsupervised pretraining of a big ResNet model using SimCLRv2, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge":^[2]

Self-supervised pretraining using the SimCLR framework on the full unlabelled ImageNet set.
Supervised fine-tuning on a small labelled subset (1% or 10% of ImageNet).
Self-training distillation: use the fine-tuned big model as a teacher to label the unlabelled images, then train a smaller student on those pseudo-labels via knowledge distillation.

The distilled student tracks the teacher's accuracy in the limited-label setting at a fraction of the parameters. The v2 paper's headline framing is that a big self-supervised pretrained model, fine-tuned on a slice of labels, then distilled into a small model, can match or beat a same-sized model trained on all the labels supervised.^[2] The label efficiency gain came from pretraining; the deployment efficiency came from distillation.
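The distillation stage can be illustrated with a soft-label cross-entropy between teacher and student predictions on an unlabelled image. The sketch below assumes the student is trained against the teacher's full output distribution and that both sets of logits are softened with the same temperature; the function names and the toy logits are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.0]        # teacher is fairly sure of class 0
student_agrees = [4.0, 1.0, 0.0]
student_disagrees = [0.0, 1.0, 4.0]
loss_agree = distillation_loss(teacher, student_agrees)
loss_disagree = distillation_loss(teacher, student_disagrees)
```

No ground-truth label appears anywhere in the loss, which is why this stage can use the whole unlabelled set.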

How does SimCLR differ from MoCo, BYOL, and DINO?

SimCLR is one node in a tightly clustered family of self-supervised methods that appeared between late 2019 and 2021. The differences between them are mostly in how they get around the practical issues of contrastive learning (batch size, negative mining, representation collapse) rather than in the core idea.

Method	Year	Lab	Key mechanism	Negatives	Backbone in main results
MoCo v1	Nov 2019	FAIR	Momentum encoder + queue of past keys	Queue (~65k)	ResNet-50
SimCLR	Feb 2020	Google Brain	Large batch + NT-Xent	In-batch (~8k)	ResNet-50
MoCo v2	Mar 2020	FAIR	MoCo + SimCLR's MLP head and stronger augs	Queue	ResNet-50
BYOL	Jun 2020	DeepMind	Predict target network output, no negatives	None	ResNet-50
SwAV	Jun 2020	FAIR	Cluster assignments, swapped prediction	None (via clustering)	ResNet-50
MoCo v3	Apr 2021	FAIR	Adapt MoCo to ViT	In-batch (queue dropped)	ViT
DINO	Apr 2021	FAIR	Self-distillation, ViT backbone	None	ViT
MAE	Nov 2021	FAIR	Masked patch reconstruction with vision transformer	None	ViT

A few notes on this lineage. MoCo and SimCLR were direct competitors at first.^[3] MoCo v2 borrowed SimCLR's projection-head idea and strong augmentations, then matched SimCLR with smaller batches by keeping its queue.^[4] BYOL went the other direction and showed you could remove negatives entirely by learning to predict the output of a slow-moving target network.^[5] SwAV avoided pairwise contrast by mapping features to a small set of learned prototypes and asking that two views agree on cluster assignment.^[6] DINO and MAE are the transformer-era successors. DINO is a self-distillation cousin of BYOL, known for attention maps that pick out objects without any supervision;^[7] MAE swapped the contrastive paradigm for a much simpler reconstruction objective on masked image patches and turned out to scale better, especially for fine-tuning rather than linear probing.^[8]

If you trace the line from SimCLR forward, two things stand out. First, the projection-head trick is everywhere. Second, the field gradually moved away from explicit negatives, then away from contrastive pairs altogether, then onto transformers, with masked autoencoders eclipsing contrastive methods on most leaderboards by 2022.

How did SimCLR influence CLIP and later models?

SimCLR is often described as a stepping stone to CLIP. The connection is direct. CLIP, published by OpenAI in early 2021, applies contrastive learning to image-text pairs scraped from the web.^[9] The architecture is the same skeleton: an image encoder, a text encoder, a projection head on each side, a temperature-scaled InfoNCE loss over a large batch. The difference is that CLIP uses captions as the second view of an image instead of a colour-jittered crop. Once you accept that any aligned modality can serve as the positive pair, the SimCLR recipe generalises straightforwardly to text, audio, and video.

SimCLR-style pretraining also seeded the representation learning used in many vision-language and multimodal foundation models that followed: ALIGN, BASIC, OpenCLIP, SigLIP, EVA, and the image encoders embedded inside multimodal LLMs.

Within pure vision, SimCLR's findings about augmentation, the projection head, and large-batch contrastive losses are now standard infrastructure. Even methods that drop explicit negatives, such as BYOL and SwAV, inherited the augmentation pipeline and the head trick.

What are the limitations of SimCLR?

SimCLR's compute requirements are real. Training on a TPU v3-128 pod for 1000 epochs is far beyond what most academic labs can replicate. The paper's strongest results require batch sizes of 4096 to 8192, which is partly an algorithmic choice (more negatives per step) and partly a consequence of needing many TPU cores fed in parallel.^[1] MoCo's queue was specifically designed to ease this requirement and run on smaller hardware.^[3]

The method is sensitive to batch composition. Repeating an image in a batch can hurt training because near-duplicate negatives create false-negative gradients. In practice this is rare in ImageNet but matters for smaller curated datasets.

The linear probe is an imperfect downstream metric. SimCLR's linear-probe numbers are excellent, but the gap shrinks under fine-tuning and shrinks further on dense prediction tasks. Subsequent work, especially MAE, showed that masking-based reconstruction can produce features that probe slightly worse but fine-tune much better, which is what most practitioners actually care about.^[8]

The projection-head trick is empirical. The paper provides intuition for why discarding the head helps but no clean theoretical account.^[1] Subsequent work has tried to formalise this (with information-bottleneck arguments, among others), but the exact mechanism is still debated.

References

1. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." *Proceedings of the 37th International Conference on Machine Learning (ICML)*. arXiv:2002.05709. https://arxiv.org/abs/2002.05709
2. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. (2020). "Big Self-Supervised Models are Strong Semi-Supervised Learners." *Advances in Neural Information Processing Systems 33 (NeurIPS)*. arXiv:2006.10029. https://arxiv.org/abs/2006.10029
3. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning." *CVPR*. arXiv:1911.05722. https://arxiv.org/abs/1911.05722
4. Chen, X., Fan, H., Girshick, R., and He, K. (2020). "Improved Baselines with Momentum Contrastive Learning." arXiv:2003.04297. (MoCo v2) https://arxiv.org/abs/2003.04297
5. Grill, J.-B. et al. (2020). "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning." *NeurIPS*. arXiv:2006.07733. (BYOL) https://arxiv.org/abs/2006.07733
6. Caron, M. et al. (2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments." *NeurIPS*. arXiv:2006.09882. (SwAV) https://arxiv.org/abs/2006.09882
7. Caron, M. et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers." *ICCV*. arXiv:2104.14294. (DINO) https://arxiv.org/abs/2104.14294
8. He, K. et al. (2022). "Masked Autoencoders Are Scalable Vision Learners." *CVPR*. arXiv:2111.06377. (MAE) https://arxiv.org/abs/2111.06377
9. Radford, A. et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *ICML*. arXiv:2103.00020. (CLIP) https://arxiv.org/abs/2103.00020
10. Hadsell, R., Chopra, S., and LeCun, Y. (2006). "Dimensionality Reduction by Learning an Invariant Mapping." *CVPR*.
11. van den Oord, A., Li, Y., and Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." arXiv:1807.03748. https://arxiv.org/abs/1807.03748
12. Wu, Z., Xiong, Y., Yu, S., and Lin, D. (2018). "Unsupervised Feature Learning via Non-Parametric Instance Discrimination." *CVPR*. arXiv:1805.01978. https://arxiv.org/abs/1805.01978
13. Google Research SimCLR repository. https://github.com/google-research/simclr
14. Chen, T. and Kornblith, S. (2020). "Advancing Self-Supervised and Semi-Supervised Learning with SimCLR." Google Research Blog, April 2020. https://research.google/blog/advancing-self-supervised-and-semi-supervised-learning-with-simclr/
