CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns to associate images and text in a shared embedding space. Introduced in January 2021 by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, CLIP was trained on 400 million image-text pairs collected from the internet. The model can perform zero-shot learning on a wide range of visual tasks by matching images to natural language descriptions, without requiring any task-specific training data.
CLIP's design is simple but effective: it trains two separate encoders (one for images, one for text) to produce embeddings that land close together when the image and text describe the same concept. This approach allows CLIP to generalize across many visual classification and retrieval tasks, and it has become a foundational component in systems ranging from image generation (such as DALL-E 2 and Stable Diffusion) to content moderation, visual search, and multimodal reasoning.
Imagine you have a big box of photos and a big box of captions. CLIP is a program that learns to match each photo with its correct caption. It does this by looking at millions of photos and captions from the internet and learning what goes together. Once it has learned, you can show it a new photo it has never seen before, type in some descriptions like "a photo of a cat" and "a photo of a dog," and CLIP will tell you which description best fits the photo. The clever part is that nobody had to teach CLIP what a cat or a dog looks like by hand. It figured it out on its own just by reading captions.
Before CLIP, the standard approach to building visual recognition systems followed a two-step pipeline: first, collect a labeled dataset for a specific task (such as ImageNet for object classification), then train a convolutional neural network (CNN) or Vision Transformer (ViT) on that dataset using supervised learning. While this approach produced strong results within the training distribution, the resulting models were narrow. A model trained on ImageNet could classify 1,000 object categories, but adding a new category required collecting new labeled data and retraining. Performance also degraded significantly when the model encountered images that looked different from its training set (a problem known as distribution shift).
In parallel, natural language processing (NLP) had been moving toward more flexible, task-agnostic models. GPT-2 and GPT-3 demonstrated that a single language model, trained on broad internet text, could perform many different NLP tasks through prompting alone, with no task-specific fine-tuning. The CLIP authors asked whether a similar approach could work for vision: could a model learn visual concepts from natural language supervision at web scale, and then transfer those concepts to new tasks without additional training?
Earlier work had explored this direction. In 2013, Frome et al. introduced DeViSE, which learned to map images into a word embedding space. Li et al. (2017) used natural language supervision for zero-shot visual recognition. Joulin et al. (2016) trained CNNs to predict words in image captions. However, these methods achieved modest results compared to supervised baselines and did not scale well. CLIP built on these ideas but succeeded by dramatically increasing the scale of training data (from tens of millions to 400 million pairs) and by using a contrastive learning objective that proved more efficient than predictive objectives at this scale.
CLIP uses a dual-encoder architecture consisting of two separate neural networks: an image encoder and a text encoder. Each encoder maps its respective input into a fixed-length vector in a shared embedding space. The dimensionality of this space depends on the specific model variant (512 or 768 dimensions for ViT models, 512 to 1024 for ResNet models).
The CLIP paper evaluated two families of image encoders:
ResNet variants. The authors trained five modified ResNet models (ResNet-50, ResNet-101, and three larger variants following the EfficientNet scaling pattern: RN50x4, RN50x16, and RN50x64). These used an attention pooling mechanism instead of global average pooling, where the pooling layer performed multi-head attention with a single query token that attended to the spatial feature map.
Vision Transformer (ViT) variants. The authors also trained three ViT models: ViT-B/32, ViT-B/16, and ViT-L/14. These followed the original ViT architecture with minor modifications. The image is divided into fixed-size patches (e.g., 32x32, 16x16, or 14x14 pixels), each patch is linearly embedded, and the resulting sequence of patch embeddings is processed by a Transformer encoder. A special [CLS] token is prepended to the sequence, and its output representation serves as the image embedding. The ViT-L/14 model was later fine-tuned at 336x336 resolution (referred to as ViT-L/14@336px), which improved performance. The naming convention "ViT-B/32" means a Base-sized Transformer with 32x32 pixel patches.
The ViT-L/14 model generally outperformed all ResNet variants, including the largest RN50x64, despite requiring less compute to train.
The text encoder is a Transformer with the following specifications:
| Parameter | Value |
|---|---|
| Parameters | 63 million |
| Layers | 12 |
| Width | 512 |
| Attention heads | 8 |
| Vocabulary size | 49,152 (lower-cased byte pair encoding) |
| Max context length | 77 tokens (including [SOS] and [EOS]) |
| Architecture style | Decoder-only (causal masking), same as GPT-2 |
The text input is bracketed by [SOS] (start of sequence) and [EOS] (end of sequence) tokens. The representation at the [EOS] token position in the final layer is taken as the text embedding. This embedding is then linearly projected to match the dimensionality of the image embedding space.
Both the image encoder output and the text encoder output are linearly projected into the shared embedding space and L2-normalized. The similarity between an image and a text is computed as the cosine similarity (equivalently, the dot product of the normalized vectors) between their embeddings, scaled by a learned temperature parameter. The logit scale is parameterized as the exponential of a learnable log-temperature initialized to log(1/0.07), so similarities are initially multiplied by 1/0.07 ≈ 14.3. This scaling controls the sharpness of the softmax distribution over similarities.
CLIP was trained on a private dataset called WebImageText (WIT), constructed by OpenAI. WIT contains approximately 400 million image-text pairs scraped from the internet. The dataset was assembled by searching for images associated with a set of 500,000 text queries. These queries were derived from multiple sources:

- all words occurring at least 100 times in English Wikipedia;
- bigrams with high pointwise mutual information;
- the names of Wikipedia articles above a certain search volume;
- WordNet synsets not already covered by the above.
Each query could return up to 20,000 image-text pairs, yielding a total corpus of roughly 400 million pairs. The total text in the dataset amounts to approximately 40 gigabytes, comparable in scale to the WebText dataset used to train GPT-2. WIT has not been publicly released.
CLIP is trained using a symmetric contrastive loss, sometimes called a multi-class N-pair loss or InfoNCE loss. Given a mini-batch of N image-text pairs, the training objective works as follows:

1. Encode all N images with the image encoder and all N texts with the text encoder, projecting and L2-normalizing the embeddings.
2. Compute the NxN matrix of cosine similarities between every image and every text, scaled by the learned temperature.
3. For each image (each row), apply a softmax over the N texts and maximize the probability of its matching caption via cross-entropy.
4. For each text (each column), do the same over the N images.
The total loss is the average of the image-to-text loss and the text-to-image loss. This symmetric formulation ensures that the model learns to match images to texts and texts to images equally well. During training, the N-1 non-matching pairs in each row and column serve as negative examples.
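The paper summarizes this objective in a few lines of numpy-style pseudocode. A minimal PyTorch sketch of the same computation (the function name and tensor shapes here are illustrative, not OpenAI's code) might look like:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive (InfoNCE) loss over N matched image-text pairs.

    image_features, text_features: (N, d) outputs of the two encoders.
    logit_scale: learned scalar, initialized to 1/0.07 (the inverse temperature).
    """
    # L2-normalize so dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled pairwise similarities
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching caption for image i sits at column i
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Cross-entropy in both directions, then average
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return (loss_i2t + loss_t2i) / 2
```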
The contrastive objective was chosen over alternatives (such as predicting the exact caption given an image) because it proved significantly more efficient. The CLIP authors found that the contrastive approach was roughly 4x more efficient at zero-shot ImageNet transfer than a baseline that predicted a bag-of-words representation of the caption, making training on the full 400 million pair dataset practical.
The models were trained for 32 epochs on the WIT dataset. The training used large batch sizes (32,768 for the largest models), which is important for contrastive learning because larger batches provide more negative examples per step. Mixed-precision training was used to reduce memory requirements.
| Model | Hardware | Training time |
|---|---|---|
| RN50x64 (largest ResNet) | 592 V100 GPUs | 18 days |
| ViT-L/14 (largest ViT) | 256 V100 GPUs | 12 days |
Images are resized and center-cropped to the model's native resolution (224x224 for most models, 336x336 for ViT-L/14@336px) using bicubic interpolation. Pixel values are normalized to the [0,1] range, then standardized with the following per-channel statistics: mean = (0.48145466, 0.4578275, 0.40821073) and standard deviation = (0.26862954, 0.26130258, 0.27577711) for the R, G, and B channels respectively.
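A torchvision-based sketch of this pipeline, mirroring the preprocessing shipped with OpenAI's released checkpoints (224 px is the default resolution for most variants; the file name is a placeholder):

```python
from PIL import Image
from torchvision import transforms

# Per-channel statistics used by OpenAI's released CLIP preprocessors
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                      # scales pixel values to [0, 1]
    transforms.Normalize(CLIP_MEAN, CLIP_STD),  # per-channel standardization
])

pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))  # (3, 224, 224)
```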
One of CLIP's most significant properties is its ability to perform zero-shot image classification, meaning it can classify images into categories it was never explicitly trained on.
To classify an image using CLIP:

1. Write a text prompt for each candidate class (for example, "a photo of a dog.").
2. Encode every prompt with the text encoder and L2-normalize the resulting embeddings.
3. Encode the image with the image encoder and L2-normalize its embedding.
4. Compute the cosine similarity between the image embedding and each class embedding.
5. Predict the class with the highest similarity (a softmax over the scaled similarities yields pseudo-probabilities).
This process requires no gradient updates, no labeled training data for the specific task, and no modification to the model. The only input needed is the list of class names.
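A minimal zero-shot classification sketch using OpenAI's reference `clip` package (the label set and image path are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "horse"]                        # placeholder label set
prompts = [f"a photo of a {name}." for name in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities, softmaxed over the candidate classes
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()], probs.tolist())
```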
On ImageNet, zero-shot CLIP (ViT-L/14@336px) achieved 76.2% top-1 accuracy, matching the performance of the original supervised ResNet-50 trained directly on ImageNet. Across a broader suite of 27 evaluation datasets (spanning OCR, texture recognition, satellite imagery, action recognition, geo-localization, and more), zero-shot CLIP outperformed a fully supervised linear classifier trained on ResNet-50 features on 16 of 27 datasets.
When a linear classifier was fitted on top of CLIP's frozen features (a technique called linear probing), performance on ImageNet improved by nearly 10 percentage points. The linear probe on CLIP ViT-L/14 features outperformed the Noisy Student EfficientNet-L2 (a state-of-the-art supervised model at the time) on 21 out of 27 datasets.
CLIP showed strong robustness to natural distribution shifts. On variants of ImageNet designed to test robustness (ImageNet-V2, ImageNet-R, ImageNet Sketch, ObjectNet), zero-shot CLIP significantly outperformed standard ImageNet-trained models at equivalent ImageNet accuracy. Specifically, CLIP narrowed the "robustness gap" (the difference between in-distribution and out-of-distribution accuracy) by up to 75% compared to over 200 supervised models evaluated on the same benchmarks. This result suggests that learning from natural language supervision produces representations that are more robust than those learned from fixed label sets.
The text prompts used during zero-shot classification significantly affect CLIP's accuracy. This has led to a subfield of prompt engineering specifically for CLIP and similar vision-language models.
Simply using the bare class name (e.g., "cat") as the text input tends to produce lower accuracy than wrapping it in a descriptive template. The default template is:
"a photo of a {class}."
More specific templates can improve accuracy for particular domains. For example:
| Domain | Example template |
|---|---|
| General object recognition | "a photo of a {class}" |
| Fine-grained recognition | "a photo of a {class}, a type of pet" |
| Satellite imagery | "a satellite photo of {class}" |
| Texture classification | "a photo of a {class} texture" |
| Action recognition | "a video of a person doing {class}" |
| Food classification | "a photo of {class}, a type of food" |
| Medical imaging | "a medical image showing {class}" |
OpenAI found that averaging the text embeddings from multiple prompt templates per class improved ImageNet accuracy by approximately 3.5 percentage points. For example, using 80 different templates per class (such as "a bad photo of a {class}," "a sculpture of a {class}," "a photo of many {class}") and averaging their embeddings before computing similarity helped resolve ambiguity and improved robustness. Because text embeddings can be precomputed and cached, prompt ensembling adds no computational cost at inference time.
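A sketch of prompt ensembling with the reference `clip` package (only a handful of templates are shown; the paper's full ensemble uses 80):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sculpture of a {}.",
    "a photo of many {}.",
]

def ensembled_class_embedding(class_name):
    """Average the normalized embeddings of all templates, then renormalize."""
    texts = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = model.encode_text(texts)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    mean_emb = emb.mean(dim=0)
    return mean_emb / mean_emb.norm()

# Precompute once and cache: no extra cost when classifying images later
zero_shot_weights = torch.stack([ensembled_class_embedding(c) for c in ["cat", "dog"]])
```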
Context Optimization (CoOp), proposed by Zhou et al. (2022), replaces the hand-crafted prompt template with learnable continuous vectors that are optimized on a small set of labeled examples. While the CLIP encoders remain frozen, the prompt tokens are updated via backpropagation. CoOp significantly outperforms hand-crafted prompts in few-shot settings, with as few as 16 labeled examples per class yielding roughly 15 percentage points of improvement over zero-shot CLIP. Conditional Context Optimization (CoCoOp) extends this idea by generating input-conditional prompt tokens, improving generalization to unseen classes.
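A condensed sketch of the CoOp idea on top of the reference `clip` package (the placeholder-token trick mirrors the public CoOp implementation; the context length of 16 and other details are illustrative):

```python
import torch
import torch.nn as nn
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
for p in model.parameters():
    p.requires_grad_(False)                      # both CLIP encoders stay frozen

n_ctx = 16                                       # number of learnable context "words"
ctx_dim = model.ln_final.weight.shape[0]         # text transformer width (512 here)
ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))  # the only trainable weights

def text_features_with_learned_context(class_name):
    # Tokenize "X X ... X {class}." so the prompt has n_ctx placeholder positions
    prompt = " ".join(["X"] * n_ctx) + " " + class_name + "."
    tokens = clip.tokenize(prompt).to(device)             # (1, 77)
    emb = model.token_embedding(tokens)                    # (1, 77, ctx_dim)

    # Splice the learnable context vectors in place of the placeholder tokens
    prefix = emb[:, :1, :]                                 # [SOS]
    suffix = emb[:, 1 + n_ctx:, :]                         # class tokens, [EOS], padding
    x = torch.cat([prefix, ctx.unsqueeze(0), suffix], dim=1)

    # Run the frozen text transformer exactly as CLIP's encode_text does
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    eos = tokens.argmax(dim=-1)                            # [EOS] has the largest token id
    return x[torch.arange(x.shape[0]), eos] @ model.text_projection

# ctx is then optimized with an ordinary cross-entropy loss on the few-shot examples.
```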
CLIP and its variants have become infrastructure components in many AI systems.
CLIP plays a central role in several major image generation systems:
DALL-E 2 (unCLIP). Introduced by Ramesh et al. (2022), DALL-E 2 uses CLIP as a core component. The system consists of two stages: a "prior" model that generates a CLIP image embedding from a text caption, and a "decoder" (a diffusion model) that generates an image from the CLIP image embedding. The system is called "unCLIP" because the decoder effectively inverts the CLIP image encoder. This architecture allows the model to generate diverse images for a single prompt because many different images can map to similar CLIP embeddings.
Stable Diffusion. Stable Diffusion uses CLIP's text encoder (specifically, the ViT-L/14 text encoder) to convert text prompts into conditioning embeddings for its latent diffusion model. The text is encoded by CLIP, and the resulting embeddings guide the denoising process via cross-attention layers in the U-Net. Later versions of Stable Diffusion (SDXL and beyond) use multiple text encoders, including both CLIP and OpenCLIP variants.
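As an illustration of this conditioning path, the Hugging Face `transformers` port of the ViT-L/14 text encoder produces the per-token embeddings that the U-Net attends to (a sketch; real Stable Diffusion pipelines wrap this step internally, and the prompt is a placeholder):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a lighthouse at sunset"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    # (1, 77, 768): one embedding per token position, fed to the U-Net's
    # cross-attention layers as the conditioning signal
    conditioning = text_encoder(tokens.input_ids).last_hidden_state
```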
CLIP guidance. In diffusion-based image generation, CLIP can serve as a gradient signal to steer the generation process toward a target text description. During each denoising step, the partially generated image is encoded by CLIP's image encoder, and the gradient of the CLIP similarity score (between the image embedding and the target text embedding) is added to the denoising update. However, Nichol et al. (2022) found in the GLIDE paper that classifier-free guidance generally outperforms CLIP guidance in terms of image quality.
Because CLIP produces aligned embeddings for images and text, it can be used to build text-to-image and image-to-text retrieval systems. A database of images is encoded once, and then natural language queries are encoded at search time. The images whose embeddings are closest to the query embedding are returned as results. This approach powers visual search features in applications and is used in vector databases for multimodal retrieval.
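A small retrieval sketch with the reference `clip` package (the image corpus and query are placeholders; a production system would store the normalized embeddings in a vector database rather than an in-memory tensor):

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Index construction: encode and normalize every image once
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
with torch.no_grad():
    index = torch.cat([
        model.encode_image(preprocess(Image.open(p)).unsqueeze(0)) for p in image_paths
    ])
index = index / index.norm(dim=-1, keepdim=True)

def search(query, k=2):
    """Return the k images whose embeddings are closest to the text query."""
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query]))
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ index.T).squeeze(0)
    top = scores.topk(min(k, len(image_paths)))
    return [(image_paths[i], scores[i].item()) for i in top.indices]

print(search("a dog playing in the snow"))
```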
CLIP's zero-shot capabilities allow it to classify images against content policy categories without task-specific training data. Moderation labels can be defined as text prompts (e.g., "a photo containing violence," "safe content"), and CLIP assigns similarity scores. This makes it possible to update moderation policies by simply changing the text prompts, without retraining the model.
CLIP is used in production systems where labeled training data is scarce or the label space changes frequently. Examples include e-commerce product categorization (where new product categories are added regularly), medical image screening (where collecting labeled datasets is expensive), and anomaly detection in manufacturing (where the types of defects may be rare or novel).
Fine-tuned versions of CLIP are used to predict aesthetic quality scores for images. The LAION aesthetics predictor, for instance, is a linear model trained on top of CLIP embeddings to predict human aesthetic ratings. These scores are used to filter training data for image generation models, including Stable Diffusion.
CLIP's image encoder is used as a visual backbone in many multimodal AI systems. DeepMind's Flamingo (2022) uses a frozen vision encoder pretrained with a CLIP-style contrastive objective (a Normalizer-Free ResNet rather than OpenAI's CLIP weights) as its image understanding component. LLaVA (Large Language and Vision Assistant) connects a CLIP ViT-L/14 visual encoder to a large language model, using a simple linear projection to translate CLIP visual features into the language model's input space.
Since CLIP's release, several organizations have developed improved variants addressing different limitations of the original model.
OpenCLIP is an open-source reimplementation of CLIP developed by the LAION (Large-scale Artificial Intelligence Open Network) community. Unlike OpenAI's CLIP, which was trained on the private WIT dataset, OpenCLIP models are trained on publicly available datasets: LAION-400M, LAION-2B, and DataComp-1B. OpenCLIP reproduces and extends the original CLIP training procedure, offering models at scales ranging from ViT-B/32 to ViT-G/14 and beyond. The ViT-G/14 model trained on LAION-2B achieves 80.1% zero-shot top-1 accuracy on ImageNet. OpenCLIP also supports architectures beyond ViT, including ConvNeXt.
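Loading an OpenCLIP checkpoint follows the pattern from the `open_clip` README; the pretrained tag below names one of the public LAION-2B ViT-B/32 checkpoints and is given as an example:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```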
SigLIP (Sigmoid Loss for Language Image Pre-training), introduced by Zhai et al. (2023) at Google, replaces CLIP's softmax-based contrastive loss with a sigmoid-based loss. The key difference lies in the loss computation: while CLIP's loss requires computing a global NxN similarity matrix across the entire batch and normalizing with softmax, SigLIP evaluates each image-text pair independently using a binary sigmoid classification (is this pair a match or not?).
This change has several practical consequences:
| Property | CLIP (softmax loss) | SigLIP (sigmoid loss) |
|---|---|---|
| Loss computation | Global NxN matrix, softmax normalization | Pairwise, independent sigmoid |
| Memory scaling | Quadratic in batch size | Linear in batch size |
| Batch size fitting on 4 TPU-v4 chips | 2,048 | 4,096 |
| Performance at small batch sizes | Lower | Higher |
| Saturation batch size | ~32k | ~32k |
| ImageNet zero-shot (ViT-L, 256px) | 75.5% | 80.5% |
SigLIP is pre-trained on Google's WebLI dataset. Combined with Locked-image Tuning (LiT), SigLIP achieves 84.5% zero-shot accuracy on ImageNet when trained for two days on only four TPU-v4 chips.
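A minimal sketch of the pairwise sigmoid objective, following the formulation in the SigLIP paper (the learnable log-temperature t and bias b are scalars; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, t, b):
    """Pairwise sigmoid loss: every image-text pair in the batch is scored
    independently as a binary match/no-match problem, so no softmax
    normalization across the full batch is required."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) logits: scaled cosine similarities plus a learnable bias
    logits = torch.exp(t) * image_features @ text_features.t() + b

    n = logits.shape[0]
    # +1 on the diagonal (matching pairs), -1 everywhere else (negatives)
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # -log sigmoid(label * logit), summed over all pairs, averaged per example
    return -F.logsigmoid(labels * logits).sum() / n
```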
EVA-CLIP, developed by Sun, Fang et al. (2023) at the Beijing Academy of Artificial Intelligence (BAAI), improves CLIP training efficiency by initializing the image encoder with weights from EVA, a masked image modeling pre-trained ViT. Rather than training the visual encoder from scratch (as in the original CLIP), EVA-CLIP leverages pre-trained representations and applies improved training techniques.
The largest model, EVA-02-CLIP-E/14+ (5 billion parameters), achieves 82.0% zero-shot top-1 accuracy on ImageNet with only 9 billion training samples seen. A smaller variant, EVA-02-CLIP-L/14+ (430 million parameters), achieves 80.4% zero-shot accuracy with only 6 billion samples, making it one of the most efficient CLIP-scale models in terms of the accuracy-to-compute ratio.
CLIPA (CLIP with an Inverse Scaling Law), introduced by Li, Wang, and Xie (2023) at UC Santa Cruz, discovered that larger encoders can be trained effectively with shorter input sequences. Specifically, when using a larger image encoder, the image can be resized to a lower resolution (reducing the number of patch tokens), and the text can be truncated more aggressively. This inverse scaling law dramatically reduces the computational cost of training.
CLIPA achieves practical results on modest hardware: using 8 A100 GPUs, it reaches 63.2% zero-shot ImageNet accuracy in about 2 days and 69.3% in about 4 days. The CLIPA-v2 variant, using a ViT-G/14 encoder, achieves 83.0% zero-shot ImageNet accuracy while being roughly 33x faster to train than the equivalent OpenCLIP model. The paper was published at NeurIPS 2023.
ALIGN (A Large-scale ImaGe and Noisy-text embedding), developed by Jia et al. (2021) at Google, takes a similar approach to CLIP but uses a noisier, larger dataset of over 1 billion image-text pairs collected from alt-text attributes on web images. ALIGN uses an EfficientNet image encoder and a BERT text encoder, demonstrating that the contrastive learning approach works with different architecture choices and noisier data.
| Model | Organization | Year | Image encoder | Training data | ImageNet zero-shot (best) | Key innovation |
|---|---|---|---|---|---|---|
| CLIP | OpenAI | 2021 | ResNet / ViT | WIT (400M pairs, private) | 76.2% | Original contrastive vision-language model |
| ALIGN | Google | 2021 | EfficientNet | 1B+ alt-text pairs | 76.4% | Noisy, larger-scale data |
| OpenCLIP | LAION | 2022+ | ViT / ConvNeXt | LAION-2B, DataComp-1B | 80.1% | Open-source, public data |
| SigLIP | Google | 2023 | ViT | WebLI | 84.5% (with LiT) | Sigmoid loss, memory efficient |
| EVA-CLIP | BAAI | 2023 | EVA ViT | Merged-2B | 82.0% | Pre-trained vision encoder initialization |
| CLIPA | UC Santa Cruz | 2023 | ViT | LAION-2B | 83.0% | Inverse scaling law, training efficiency |
Despite its versatility, CLIP has several well-documented limitations.
While CLIP performs well on common object recognition, it struggles on certain specialized tasks. On the MNIST handwritten digit dataset, zero-shot CLIP achieves only 88% accuracy, below even a simple logistic regression on raw pixels and far below the near-perfect accuracy of standard supervised models. CLIP also performs poorly on fine-grained classification tasks that require distinguishing between visually similar subcategories (such as car models, bird species, or flower varieties), on counting objects in images, and on understanding spatial relationships.
CLIP has limited ability to perform tasks that require compositional or systematic reasoning. For instance, it may struggle to distinguish "a red cube on top of a blue sphere" from "a blue cube on top of a red sphere" because its contrastive training does not explicitly teach compositional understanding of spatial arrangements or attribute binding.
CLIP is vulnerable to typographic attacks, where placing text on an image can cause the model to misclassify the image based on the text content rather than the visual content. For example, writing the word "iPod" on an apple can cause CLIP to classify the image as an iPod. This vulnerability arises because CLIP has learned to associate images with text, and printed text in an image creates a strong signal that can override visual features. Research has measured an average performance drop of approximately 34.7% under typographic attacks compared to standard evaluation.
Because CLIP was trained on unfiltered internet data, it inherits the biases present in that data. Studies have shown that CLIP can encode racial, gender, and other social stereotypes. For example, images of people may be associated with different occupations or attributes depending on perceived gender or race. The OpenAI team acknowledged these biases in their paper and model card, noting that deployment of CLIP-based systems in sensitive domains requires careful bias auditing.
CLIP's text encoder has a maximum context length of 77 tokens (including the [SOS] and [EOS] tokens), which limits the complexity of text descriptions it can process. Long or detailed descriptions are truncated, potentially losing important information. Some downstream applications have addressed this by chunking long texts and aggregating their embeddings.
CLIP's zero-shot classification outputs are not well-calibrated, meaning the similarity scores do not correspond directly to reliable probability estimates. The model may produce high confidence scores even for incorrect predictions, and the scores are sensitive to the number and composition of candidate classes being evaluated.
CLIP's training data (WIT) was scraped from the English-speaking web, which introduces several biases. The model performs best on concepts that are well-represented in English internet content and performs worse on culturally specific concepts, non-Latin scripts, and specialized domain imagery that is underrepresented online.
CLIP has had a broad influence on the field of machine learning and computer vision.
The paper demonstrated that natural language supervision can serve as a scalable training signal for visual representations, opening up an alternative to the traditional approach of collecting fixed-label datasets. This insight has been adopted widely, with subsequent models like ALIGN, Florence, and CoCa building on the same principle.
CLIP's image encoder has become a standard visual backbone in multimodal systems. Models such as LLaVA, Flamingo, and many open-source vision-language models use CLIP or OpenCLIP encoders as their visual front end. CLIP's text encoder is the conditioning mechanism in Stable Diffusion and related latent diffusion models.
The CLIP embedding space has become a de facto standard for measuring image-text alignment. The CLIP Score, which measures the cosine similarity between a generated image's CLIP embedding and the conditioning text's CLIP embedding, is widely used as an automatic evaluation metric for text-to-image generation models.
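A sketch of the metric with the reference `clip` package (this computes the raw cosine similarity; some published variants additionally rescale the value by a constant and clip it at zero):

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path, caption):
    """Cosine similarity between an image's and a caption's CLIP embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = clip.tokenize([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

print(clip_score("generated.png", "a red bicycle leaning against a brick wall"))
```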
By demonstrating that a single model trained with natural language supervision could match or exceed the performance of task-specific supervised models across dozens of datasets, CLIP shifted the field toward foundation models that learn general-purpose representations from broad data, rather than narrow models trained for individual tasks.