CLIP (Contrastive Language-Image Pre-training)
CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns to associate images and text in a shared embedding space. Introduced on 5 January 2021 by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, CLIP was trained on roughly 400 million image-text pairs scraped from the internet.[1][2] The model performs zero-shot learning on a wide range of visual tasks by matching images to natural language descriptions, without requiring any task-specific training data.[1]
CLIP's design is simple but effective: it trains two separate encoders, one for images and one for text, to produce embeddings that land close together when the image and text describe the same concept. This approach allows CLIP to generalize across many visual classification and retrieval tasks, and it has become a foundational component in systems ranging from image generation (such as DALL-E 2 and Stable Diffusion) to content moderation, visual search, and multimodal reasoning.[3][4]
ELI5 (explain like I'm 5)
Imagine you have a big box of photos and a big box of captions. CLIP is a program that learns to match each photo with its correct caption. It does this by looking at millions of photos and captions from the internet and learning what goes together. Once it has learned, you can show it a new photo it has never seen before, type in some descriptions like "a photo of a cat" and "a photo of a dog," and CLIP will tell you which description best fits the photo. The clever part is that nobody had to teach CLIP what a cat or a dog looks like by hand. It figured it out on its own just by reading captions.
At a glance
| Property | Value |
|---|---|
| Developer | OpenAI |
| Initial release | 5 January 2021 |
| Paper | "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., arXiv:2103.00020) |
| Conference | ICML 2021 |
| Training data | WebImageText (WIT), approximately 400 million image-text pairs (private)[1] |
| Largest released model | ViT-L/14@336px (April 2022) |
| Image encoders | ResNet (RN50, RN101, RN50x4, RN50x16, RN50x64) and Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px)[5] |
| Text encoder | 12-layer, 8-head Transformer, 63M parameters, 512-wide[1] |
| Best zero-shot ImageNet top-1 (OpenAI weights) | 76.2% with ViT-L/14@336px[1] |
| License | MIT[6] |
Background and motivation
Before CLIP, the standard approach to building visual recognition systems followed a two-step pipeline: collect a labeled dataset for a specific task (such as ImageNet for object classification), then train a convolutional neural network (CNN) or Vision Transformer (ViT) on that dataset using supervised learning. The resulting models were narrow. A model trained on ImageNet could classify 1,000 object categories, but adding a new category required collecting new labeled data and retraining. Performance also degraded significantly when the model encountered images that looked different from its training set (a problem known as distribution shift).[1]
In parallel, natural language processing (NLP) had been moving toward more flexible, task-agnostic models. GPT-2 and GPT-3 demonstrated that a single language model trained on broad internet text could perform many different NLP tasks through prompting alone, with no task-specific fine-tuning. The CLIP authors asked whether a similar approach could work for vision: could a model learn visual concepts from natural language supervision at web scale, and then transfer those concepts to new tasks without additional training?[1]
Earlier work had explored this direction. In 2013, Frome et al. introduced DeViSE, which learned to map images into a word embedding space.[7] Joulin et al. (2016) trained CNNs to predict words in image captions, and Li et al. (2017) used natural language supervision for zero-shot visual recognition. Sariyildiz et al. and Desai and Johnson (VirTex, 2020) showed that captioning-style supervision could produce useful visual representations. However, these methods achieved modest results compared to supervised baselines, with VirTex topping out at roughly the accuracy of a ResNet-50 trained on ImageNet, and they did not scale to evaluation across dozens of datasets.[1] CLIP built on these ideas but succeeded by dramatically increasing the scale of training data, from tens of millions to 400 million pairs, and by using a contrastive learning objective that proved more efficient than predictive objectives at this scale.[1]
The paper was first posted to arXiv on 26 February 2021 and presented at the 38th International Conference on Machine Learning (ICML 2021).[2] OpenAI released the model checkpoints in stages: ViT-B/32 and RN50 in January 2021; RN101 and RN50x4 later in 2021; RN50x16 and ViT-B/16 in July 2021; RN50x64 and ViT-L/14 in January 2022; and ViT-L/14@336px in April 2022.[6]
Architecture
CLIP uses a dual-encoder architecture consisting of two separate neural networks: an image encoder and a text encoder. Each encoder maps its respective input into a fixed-length vector in a shared embedding space. The dimensionality of this space depends on the specific model variant (512 for the ViT-B models and 768 for ViT-L/14); a final linear projection on each encoder maps its output to this common dimensionality so that image and text embeddings can be compared directly.[1][5]
Image encoder
The CLIP paper evaluated two families of image encoders.
ResNet variants. The authors trained five modified ResNet models: ResNet-50, ResNet-101, and three larger variants following the EfficientNet-style compound scaling pattern, RN50x4, RN50x16, and RN50x64, where the multiplier indicates approximate compute relative to a ResNet-50. These networks used an attention pooling mechanism in place of global average pooling: a single layer of multi-head attention over the spatial feature map, with the query conditioned on the globally average-pooled representation of the image.[1]
Vision Transformer (ViT) variants. The authors also trained ViT-B/32, ViT-B/16, and ViT-L/14, closely following the architecture of Dosovitskiy et al. (2021).[8] The image is divided into fixed-size patches (32x32, 16x16, or 14x14 pixels), each patch is linearly embedded, and the resulting sequence of patch embeddings is processed by a Transformer encoder. A learnable [CLS] token is prepended to the sequence, and its output representation at the final layer serves as the image embedding. The ViT-L/14 model was subsequently fine-tuned for one additional epoch at 336x336 resolution to produce ViT-L/14@336px.[1] The naming convention "ViT-B/32" denotes a base-sized Transformer with 32x32 pixel patches.
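The patch sizes translate directly into sequence lengths, which is why smaller patches cost more compute. The arithmetic can be sketched as follows, assuming square inputs at the 224x224 and 336x336 resolutions this article gives for the released models; the helper name `vit_sequence_length` is illustrative and not part of any CLIP codebase.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Tokens seen by the Transformer: one per patch plus the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


# Resolutions follow the article: 224 px for most models, 336 px for the fine-tuned one.
VARIANTS = [
    ("ViT-B/32", 224, 32),
    ("ViT-B/16", 224, 16),
    ("ViT-L/14", 224, 14),
    ("ViT-L/14@336px", 336, 14),
]

for name, image_size, patch_size in VARIANTS:
    print(f"{name}: {vit_sequence_length(image_size, patch_size)} tokens")
```

Under these assumptions the four variants process 50, 197, 257, and 577 tokens respectively, so the 336px fine-tune more than doubles the sequence length of ViT-L/14.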
The ViT-L/14 model generally outperformed all ResNet variants, including the largest RN50x64, despite requiring less compute to train.[1] OpenAI's ViT-L/14 (vision tower) has roughly 304M parameters; combined with the text tower the full checkpoint is approximately 428M parameters.[9]
Text encoder
The text encoder is a Transformer with the following specifications:
| Parameter | Value |
|---|---|
| Parameters | 63 million[1] |
| Layers | 12 |
| Width | 512 |
| Attention heads | 8 |
| Vocabulary size | 49,152 (lower-cased byte-pair encoding) |
| Max context length | 76 tokens (plus special tokens, 77 total) |
| Architecture style | Decoder-only with causal masking, similar to GPT-2 |
The text input is bracketed by [SOS] (start of sequence) and [EOS] (end of sequence) tokens. The representation at the [EOS] token position in the final layer is taken as the text embedding. This embedding is then linearly projected to match the dimensionality of the image embedding space.[1]
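The pooling step described above can be illustrated with a small NumPy sketch. The token ids, the random activations, and the `pool_at_eos` helper are all hypothetical stand-ins; the sketch only shows how one feature vector per sequence is selected at the [EOS] position and then projected.

```python
import numpy as np


def pool_at_eos(hidden_states, token_ids, eos_id):
    """Select the final-layer feature at each sequence's first [EOS] position.

    hidden_states: (batch, length, width) final-layer activations.
    token_ids:     (batch, length) integer token ids, padded after [EOS].
    """
    eos_positions = (token_ids == eos_id).argmax(axis=1)
    batch_index = np.arange(hidden_states.shape[0])
    return hidden_states[batch_index, eos_positions]


# Hypothetical ids: 1 = [SOS], 2 = [EOS], 0 = padding (not the real vocabulary).
token_ids = np.array([[1, 7, 8, 2, 0, 0],
                      [1, 9, 2, 0, 0, 0]])
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 6, 512))  # width 512, as in the table above
projection = rng.normal(size=(512, 512))      # stand-in for the learned linear projection

text_embeddings = pool_at_eos(hidden_states, token_ids, eos_id=2) @ projection
print(text_embeddings.shape)  # (2, 512)
```

Because of causal masking, the [EOS] position is the only one that has attended to every token in the prompt, which is why it is a natural place to read off a summary of the whole sequence.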
Projection and similarity
Both the image encoder output and the text encoder output are linearly projected into the shared embedding space and L2-normalized. The similarity between an image and a text is computed as the cosine similarity (equivalently, the dot product of the normalized vectors) between their respective embeddings, scaled by a learned temperature parameter.[1] The scaling factor is parameterised as exp(t), where t is a free scalar initialised at log(1/0.07) and clipped to prevent training instability; the corresponding temperature is τ = exp(-t), which starts at 0.07. It controls the sharpness of the softmax distribution over similarities.
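A minimal NumPy sketch of this projection-and-similarity step is shown below. The feature widths (768 for the image encoder, 512 for the text encoder, 512 for the shared space), the random weights, and the function names are assumptions for illustration. The cap of 100 on the scaling factor follows the clipping reported in the CLIP paper and is exposed here as the parameter `max_scale`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def scaled_similarity(image_features, text_features, w_image, w_text, t, max_scale=100.0):
    """Project both modalities, normalize, and return temperature-scaled cosine similarities."""
    image_emb = l2_normalize(image_features @ w_image)
    text_emb = l2_normalize(text_features @ w_text)
    logit_scale = min(float(np.exp(t)), max_scale)  # exp(t), clipped for stability
    return logit_scale * (image_emb @ text_emb.T)


rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 768))  # hypothetical image-encoder outputs
text_features = rng.normal(size=(4, 512))   # hypothetical text-encoder outputs
w_image = rng.normal(size=(768, 512))       # stand-ins for the learned projections
w_text = rng.normal(size=(512, 512))
t = np.log(1 / 0.07)                        # initial value given in the text

logits = scaled_similarity(image_features, text_features, w_image, w_text, t)
print(logits.shape)  # (4, 4)
```

Since the embeddings are unit vectors, every entry of `logits` is bounded in magnitude by the scaling factor, which is what makes a single scalar sufficient to control the softmax sharpness.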
The CLIP architecture trains the entire model from scratch with no warm-start from pretrained image or text models; all weights are randomly initialised before contrastive training.[1]
Training
WebImageText (WIT) dataset
CLIP was trained on a private dataset called WebImageText (WIT), constructed by OpenAI. WIT contains approximately 400 million image-text pairs scraped from the internet. The dataset was assembled by searching for images associated with a set of 500,000 text queries. These queries were derived from multiple sources:[1]
- Words occurring at least 100 times in the English Wikipedia
- Bigrams with high pointwise mutual information
- Names of English Wikipedia articles above a certain search volume
- WordNet synsets not already in the query list
Each query could return up to 20,000 image-text pairs, yielding a total corpus of roughly 400 million pairs. The total text in the dataset amounts to roughly 40 gigabytes, comparable in scale to the WebText dataset used to train GPT-2.[1] WIT has not been publicly released. (The name conflicts with a separate Wikipedia-based dataset of the same acronym from Google Research; the two are unrelated.)
Subsequent reproduction efforts (notably MetaCLIP[10], OpenCLIP's training on LAION-400M and LAION-5B, and DataComp) helped clarify what made WIT effective, including aggressive subsampling per query, balancing across concepts, and stripping out near-duplicates.[10][11][12]
Contrastive learning objective
CLIP is trained using a symmetric contrastive loss, sometimes called a multi-class N-pair loss or InfoNCE loss (van den Oord et al., 2018).[13] Given a mini-batch of N image-text pairs, the training objective works as follows:[1]
- Encode all N images and all N texts to produce N image embeddings and N text embeddings.
- Compute the NxN matrix of cosine similarities between every image embedding and every text embedding.
- Scale the similarity matrix by exp(t), where t is the learned temperature parameter (equivalently, divide by the temperature τ = exp(-t)).
- Apply a cross-entropy loss in both directions: each image should have the highest similarity with its corresponding text, and vice versa. The correct pairing lies on the diagonal of the NxN matrix.
Concretely, the per-image loss is L_i2t = -log( exp(s_ii / τ) / Σ_j exp(s_ij / τ) ), the analogous text-to-image loss is L_t2i = -log( exp(s_ii / τ) / Σ_j exp(s_ji / τ) ), and the total loss is (L_i2t + L_t2i) / 2, averaged over the batch. The temperature τ = exp(-t) controls how sharply the softmax concentrates on the diagonal.[1] During training, the N-1 non-matching pairs in each row and column serve as in-batch negatives.
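The loss above can be written in a few lines of NumPy. This is a sketch for illustration rather than OpenAI's training code: it assumes the embeddings are already L2-normalized, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, t):
    """Symmetric contrastive loss for N matched pairs of unit-norm embeddings."""
    logits = np.exp(t) * (image_emb @ text_emb.T)  # s_ij / tau, with tau = exp(-t)
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # softmax over texts per image
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # softmax over images per text
    return 0.5 * (loss_i2t + loss_t2i)


identity = np.eye(4)  # four perfectly matched, mutually orthogonal pairs
t = np.log(1 / 0.07)
aligned = clip_contrastive_loss(identity, identity, t)
shuffled = clip_contrastive_loss(identity, np.roll(identity, 1, axis=0), t)
print(aligned < shuffled)  # True
```

When every embedding in the batch is identical, the logits carry no information and the loss equals log N, which is a convenient sanity check for an implementation.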
The contrastive objective was chosen over alternatives such as predicting the exact caption tokens given an image. The CLIP authors found that predicting a bag-of-words encoding of the caption was roughly 3x more compute-efficient at zero-shot ImageNet transfer than a Transformer language model predicting the exact tokens, and that swapping the predictive objective for the contrastive one gave a further 4x improvement, enabling the model to be trained on the full 400 million pair dataset within practical time budgets.[1]
Training details
The models were trained for 32 epochs on the WIT dataset. The training used very large global batch sizes (32,768 for the largest models), which is important for contrastive learning because larger batches provide more negative examples per step.[1] Mixed-precision training was used to reduce memory requirements, gradient checkpointing was applied to selected layers, and the cosine similarity computation was sharded across devices to fit the NxN matrix in memory. The Adam optimizer was used with a cosine learning-rate schedule and decoupled weight decay; full hyperparameters are listed in the paper's appendix.[1]
| Model | Hardware | Training time |
|---|---|---|
| RN50x64 (largest ResNet) | 592 V100 GPUs | 18 days[1] |
| ViT-L/14 (largest ViT) | 256 V100 GPUs | 12 days[1] |
The ViT-L/14@336px variant was produced by fine-tuning ViT-L/14 for one additional epoch at the higher 336x336 resolution, which improved zero-shot accuracy by roughly one percentage point on ImageNet.[1]
Image preprocessing
Images are resized and center-cropped to the model's native resolution (224x224 for most models, 336x336 for ViT-L/14@336px) using bicubic interpolation. Pixel values are scaled to [0,1] and then standardized with the following per-channel statistics:[6]
- Mean: [0.48145466, 0.4578275, 0.40821073]
- Standard deviation: [0.26862954, 0.26130258, 0.27577711]
These constants come from the WIT data distribution rather than ImageNet's standard normalisation, and most downstream pipelines reuse them when feeding images to CLIP encoders.
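The scaling and standardization steps can be sketched in NumPy as follows. The sketch assumes the image has already been resized and center-cropped, and it returns a channel-first array, which is a common convention but an assumption here; `normalize_image` is an illustrative name.

```python
import numpy as np

# Per-channel statistics listed above (RGB order).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])


def normalize_image(pixels_uint8):
    """Convert an H x W x 3 uint8 image into a standardized 3 x H x W float32 array."""
    scaled = pixels_uint8.astype(np.float32) / 255.0   # map [0, 255] to [0, 1]
    standardized = (scaled - CLIP_MEAN) / CLIP_STD     # broadcast over the channel axis
    return standardized.transpose(2, 0, 1).astype(np.float32)


rng = np.random.default_rng(0)
fake_crop = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in for a real crop
tensor = normalize_image(fake_crop)
print(tensor.shape, tensor.dtype)  # (3, 224, 224) float32
```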
Zero-shot transfer
One of CLIP's most significant properties is its ability to perform zero-shot image classification, meaning it can classify images into categories it was never explicitly trained on.
How zero-shot classification works
To classify an image using CLIP:[1]
- Define the set of candidate class labels (e.g., "cat," "dog," "car").
- Convert each class label into a text prompt, typically using a template like "a photo of a {class}."
- Encode the image with the image encoder and encode each text prompt with the text encoder.
- Compute the cosine similarity between the image embedding and each text embedding.
- Select the class whose text embedding has the highest similarity to the image embedding.
This process requires no gradient updates, no labeled training data for the specific task, and no modification to the model. The only input needed is the list of class names. Because text embeddings can be precomputed and cached, the entire zero-shot classifier reduces to a single matrix multiplication at inference time.
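The steps above reduce to one matrix product followed by an argmax. The sketch below uses tiny hand-written vectors in place of real encoder outputs, so the embeddings and the `zero_shot_classify` helper are purely illustrative.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(image_embeddings, class_embeddings, class_names):
    """Pick, for each image, the class whose text embedding is most similar."""
    similarities = image_embeddings @ class_embeddings.T  # (num_images, num_classes)
    best = similarities.argmax(axis=1)
    return [class_names[i] for i in best], similarities


class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}." for name in class_names]  # what the text encoder would see

# Toy stand-ins for encoder outputs: one axis per class.
class_embeddings = l2_normalize(np.eye(3))
image_embeddings = l2_normalize(np.array([[0.1, 0.9, 0.2],
                                          [0.8, 0.1, 0.3]]))

predictions, similarities = zero_shot_classify(image_embeddings, class_embeddings, class_names)
print(predictions)  # ['dog', 'cat']
```

In a real pipeline `class_embeddings` would be computed once from the prompts and cached, so adding or removing a class only changes one row of that matrix.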
Benchmark results
On ImageNet, zero-shot CLIP (ViT-L/14@336px) achieved 76.2% top-1 accuracy, matching the performance of the original supervised ResNet-50 trained directly on ImageNet's 1.28 million labelled examples.[1] Across a broader suite of 27 evaluation datasets spanning OCR, texture recognition, satellite imagery, action recognition, geo-localization, country identification, fine-grained pet classification, and more, zero-shot CLIP outperformed a fully supervised linear classifier trained on ResNet-50 features on 16 of 27 datasets.[1]
When a linear classifier was fitted on top of CLIP's frozen features (a technique called linear probing), performance on ImageNet improved by nearly 10 percentage points. The linear probe on CLIP ViT-L/14 features outperformed the Noisy Student EfficientNet-L2, the state-of-the-art supervised model at the time, on 21 out of 27 datasets.[1]
Distribution shift robustness
CLIP showed strong robustness to natural distribution shifts. On variants of ImageNet designed to test robustness (ImageNet-V2, ImageNet-R, ImageNet Sketch, ObjectNet, ImageNet-A), zero-shot CLIP significantly outperformed standard ImageNet-trained models at equivalent ImageNet accuracy. CLIP narrowed the "effective robustness gap," the difference between in-distribution and out-of-distribution accuracy, by up to 75% compared to more than 200 supervised models evaluated on the same benchmarks.[1] This result suggested that learning from natural language supervision produces representations that are more robust than those learned from fixed label sets, a finding that has held up under follow-up analyses[14] though the exact mechanism is debated.
Prompt engineering for CLIP
The text prompts used during zero-shot classification significantly affect CLIP's accuracy. This sensitivity has given rise to a sub-field of prompt engineering specifically for CLIP and similar vision-language models.
Prompt templates
Using the bare class name (e.g., "cat") as the text input tends to produce lower accuracy than wrapping it in a descriptive template. OpenAI's reference template is:
"a photo of a {class}."
More specific templates can improve accuracy for particular domains.[1]
| Domain | Example template |
|---|---|
| General object recognition | "a photo of a {class}" |
| Fine-grained recognition | "a photo of a {class}, a type of pet" |
| Satellite imagery | "a satellite photo of {class}" |
| Texture classification | "a photo of a {class} texture" |
| Action recognition | "a video of a person doing {class}" |
| Food classification | "a photo of {class}, a type of food" |
| Medical imaging | "a medical image showing {class}" |
Prompt ensembling
OpenAI found that averaging the text embeddings from many prompt templates per class improved accuracy by approximately 3.5 percentage points on ImageNet. The released code ships with 80 templates per class, including phrasings such as "a bad photo of a {class}," "a sculpture of a {class}," "a low-resolution photo of a {class}," and "a photo of many {class}."[1][6] Because each template's embedding can be precomputed and the average cached as a single vector per class, prompt ensembling adds no inference-time cost.
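A sketch of the ensembling step is given below, using random vectors in place of real text-encoder outputs. Normalizing each template embedding before averaging and renormalizing the mean afterwards is a common practice that is assumed here; the helper name is hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def ensemble_class_embedding(template_embeddings):
    """Average one class's per-template embeddings into a single unit vector."""
    mean = l2_normalize(template_embeddings).mean(axis=0)
    return mean / np.linalg.norm(mean)


templates = ["a photo of a {}.", "a bad photo of a {}.", "a sculpture of a {}."]
prompts = [template.format("cat") for template in templates]

rng = np.random.default_rng(0)
template_embeddings = rng.normal(size=(len(prompts), 512))  # stand-in for text-encoder outputs
cat_embedding = ensemble_class_embedding(template_embeddings)
print(cat_embedding.shape)  # (512,)
```

The result has the same shape as a single-prompt embedding, so it can replace that embedding in the classifier without any other change.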
Learned prompts (CoOp and CoCoOp)
Context Optimization (CoOp), proposed by Zhou et al. (2022), replaces the hand-crafted prompt template with learnable continuous vectors that are optimized on a small set of labeled examples. While the CLIP encoders remain frozen, the prompt tokens are updated via backpropagation. CoOp significantly outperforms hand-crafted prompts in few-shot settings, with as few as 16 labeled examples per class yielding roughly 15 percentage points of improvement over zero-shot CLIP on average.[15] Conditional Context Optimization (CoCoOp) extends this idea by generating input-conditional prompt tokens, improving generalization to unseen classes.[16]
Applications
CLIP and its variants have become infrastructure components in many AI systems.
Image generation
CLIP plays a central role in several major image generation systems.
DALL-E 2 (unCLIP). Introduced by Ramesh et al. (2022), DALL-E 2 uses CLIP as a core component.[3] The system consists of two stages: a "prior" model that generates a CLIP image embedding from a text caption, and a "decoder," a diffusion model, that generates an image from the CLIP image embedding. The system is referred to as "unCLIP" because the decoder effectively inverts the CLIP image encoder. This architecture allows the model to generate diverse images for a single prompt because many different images can map to similar CLIP embeddings. The authors experimented with autoregressive and diffusion priors and found the diffusion prior more efficient.[3]
Stable Diffusion. Stable Diffusion 1.x (Rombach et al., 2022) uses CLIP's text encoder (specifically OpenAI's frozen ViT-L/14 text encoder) to convert text prompts into conditioning embeddings for its latent diffusion model.[4] The non-pooled output sequence of the text encoder is fed into the U-Net backbone via cross-attention layers; this combination of an 860M-parameter U-Net with a 123M-parameter CLIP text encoder defined the SD 1.4 / 1.5 baseline.[4] Stable Diffusion 2.x switched the text encoder to OpenCLIP ViT-H/14, SDXL (Podell et al., 2023) used a dual setup of OpenAI's CLIP ViT-L/14 alongside OpenCLIP ViT-bigG/14 with concatenated penultimate-layer outputs, and Stable Diffusion 3 added a third T5-XXL encoder while keeping both CLIP variants in its MMDiT architecture.[17][18]
CLIP guidance. In diffusion-based image generation, CLIP can serve as a gradient signal to steer the generation process toward a target text description. During each denoising step, the partially generated image is encoded by CLIP's image encoder, and the gradient of the CLIP similarity score (between the image embedding and the target text embedding) is added to the denoising update. Nichol et al. (2022) found in the GLIDE paper that classifier-free guidance generally outperforms CLIP guidance in terms of image quality, and most modern systems use classifier-free guidance instead.[19]
Newer text-to-image stacks continue to evolve away from CLIP-only conditioning. FLUX.1, Black Forest Labs' 2024 text-to-image model, pairs a CLIP text encoder with a T5-XXL encoder; this dual-encoder pattern has become a common 2024-2025 design.[17][20] Even as systems move toward larger language models for prompt understanding, CLIP-derived embeddings remain useful for short-prompt aesthetic conditioning and for backwards compatibility with the large ecosystem of CLIP-trained LoRAs and embeddings.
Image and video search
Because CLIP produces aligned embeddings for images and text, it can be used to build text-to-image and image-to-text retrieval systems. A database of images is encoded once into CLIP space, and then natural language queries are encoded at search time. The images whose embeddings are closest to the query embedding are returned as results. This approach powers visual search features in applications such as Pinterest's "Lens" workflow, Unsplash's natural-language search, and the LAION-5B index, and it is widely used in vector databases for multimodal retrieval.[11]
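The retrieval loop can be sketched as a brute-force nearest-neighbour search over unit-norm embeddings. The index and query below are random stand-ins, and `search` is an illustrative helper; production systems typically replace the exhaustive dot product with an approximate index.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit Euclidean length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def search(query_embedding, index_embeddings, k=3):
    """Return indices and cosine scores of the k items most similar to the query."""
    scores = index_embeddings @ query_embedding  # cosine similarity for unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]


rng = np.random.default_rng(0)
index_embeddings = l2_normalize(rng.normal(size=(1000, 512)))  # encoded once, offline

# Simulate a text query whose embedding lands near image 42.
query_embedding = l2_normalize(index_embeddings[42] + 0.01 * rng.normal(size=512))

top, scores = search(query_embedding, index_embeddings, k=3)
print(top[0])  # 42
```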
Content moderation
CLIP's zero-shot capabilities allow it to classify images against content policy categories without task-specific training data. Moderation labels can be defined as text prompts (for example, "a photo containing violence," "safe content," "a photo of a weapon"), and CLIP assigns similarity scores. This makes it possible to update moderation policies by changing the text prompts, without retraining the model. The Stable Diffusion releases use a CLIP-based "safety checker" trained to detect NSFW content in generated images by comparing image embeddings to a held-out set of unsafe concept embeddings.[21]
Aesthetic scoring
Fine-tuned versions of CLIP are used to predict aesthetic quality scores for images. The LAION aesthetics predictor is a small MLP trained on top of frozen CLIP embeddings to predict human aesthetic ratings on a 1-10 scale. These scores are used to filter training data for image generation models, including the LAION-Aesthetics subset that helped train Stable Diffusion.[11]
Evaluation metric: CLIPScore
Hessel et al. (2021) introduced CLIPScore, a reference-free metric for image captioning and text-to-image generation that scores image-caption pairs by their CLIP cosine similarity.[22] CLIPScore correlates more strongly with human judgments than reference-based metrics like CIDEr and SPICE on multiple captioning benchmarks, and it has become a standard automatic metric for measuring image-text alignment in generative models. Variants like RefCLIPScore combine the reference-free signal with reference-based comparison.[22]
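As defined by Hessel et al., the score is the cosine similarity clipped at zero and multiplied by a constant w = 2.5. Both the clipping and the value of w come from that paper rather than from the description above, and the embeddings in this sketch are toy vectors rather than CLIP outputs.

```python
import numpy as np


def clip_score(image_embedding, caption_embedding, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    cosine = float(
        np.dot(image_embedding, caption_embedding)
        / (np.linalg.norm(image_embedding) * np.linalg.norm(caption_embedding))
    )
    return w * max(cosine, 0.0)


image_embedding = np.array([1.0, 2.0, 3.0])  # toy stand-in for a CLIP image embedding
print(clip_score(image_embedding, -image_embedding))  # 0.0
```

Clipping means that a caption pointing away from the image scores zero rather than a negative value, and a perfectly aligned pair scores w.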
Multimodal models
CLIP's image encoder is used as a visual backbone in many multimodal AI systems. Google DeepMind's Flamingo (2022) uses a frozen contrastively trained NFNet visual encoder built in the spirit of CLIP, while LLaVA (Liu et al., 2023) connects a CLIP ViT-L/14 visual encoder to a large language model using a simple linear projection (later upgraded to a two-layer MLP) to translate CLIP visual features into the language model's input space.[23] Many open-weight vision-language models, including MiniGPT-4, InstructBLIP, Qwen-VL, InternVL, and the LLaVA-1.5/1.6/NeXT family, use CLIP or OpenCLIP encoders as the visual front end.[23]
CLIP embeddings are also used in robotics, audio (via Wav2CLIP-style joint embedding spaces), and 3D systems such as OpenAI's Point-E, where a CLIP image encoder provides language-aligned conditioning for point cloud generation.
Variants and successors
Since CLIP's release, several organizations have developed improved variants addressing different limitations of the original model.
OpenCLIP
OpenCLIP is an open-source reimplementation of CLIP developed by Ilharco, Wightman, Schmidt, and collaborators across the LAION community and academic groups.[11] Unlike OpenAI's CLIP, which was trained on the private WIT dataset, OpenCLIP models are trained on publicly available datasets: LAION-400M, LAION-2B, and DataComp-1B.[12] OpenCLIP reproduces and extends the original CLIP training procedure, offering models at scales ranging from ViT-B/32 to ViT-G/14 and beyond, plus architectures such as ConvNeXt.
Cherti et al. ("Reproducible scaling laws for contrastive language-image learning," CVPR 2023) systematically swept model size, training samples seen, and dataset across LAION and identified power-law scaling for zero-shot classification, retrieval, linear probing, and fine-tuning, while showing that OpenAI's and OpenCLIP's models exhibit measurably different scaling behaviour despite identical architectures and similar recipes, reflecting differences in pretraining distribution.[24] OpenCLIP's flagship ViT-bigG/14 model trained on LAION-2B (39B samples seen at batch 160k) reaches 80.1% zero-shot top-1 on ImageNet.[25] LAION's earlier H/14 release achieves 78.0% top-1 and 73.4% Recall@5 on MS COCO image retrieval.[26]
SigLIP and SigLIP 2
SigLIP (Sigmoid Loss for Language Image Pre-training), introduced by Zhai et al. (2023) at Google, replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss.[27] While CLIP's loss requires computing a global NxN similarity matrix across the entire batch and normalizing with softmax, SigLIP evaluates each image-text pair independently using a binary sigmoid classification (is this pair a match or not?).
This change has several practical consequences:
| Property | CLIP (softmax loss) | SigLIP (sigmoid loss) |
|---|---|---|
| Loss computation | Global NxN matrix, softmax normalization | Pairwise, independent sigmoid |
| Memory scaling | Quadratic in batch size | Linear in batch size |
| Performance at small batch sizes | Lower | Higher |
| Saturating batch size | ~32k | ~32k |
| ImageNet zero-shot (ViT-L, 256px) | 75.5% | 80.5%[27] |
SigLIP's loss factorises across devices, which allowed Zhai et al. to scale to a million-sample batch (with diminishing returns past 32k). Combined with Locked-image Tuning (LiT), SigLIP achieves 84.5% zero-shot accuracy on ImageNet using only four TPU-v4 chips for two days of fine-tuning.[27]
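The pairwise sigmoid loss can be sketched as below. Following Zhai et al., each pair gets a label of +1 (matching) or -1 (non-matching) and the logits include a learnable bias b in addition to the learnable scale; the starting values used here (log 10 for the log-scale and -10 for the bias) are the ones reported in that paper and should be treated as assumptions of this sketch.

```python
import numpy as np


def siglip_loss(image_emb, text_emb, t_prime, b):
    """Pairwise sigmoid loss over all N x N image-text pairs of unit-norm embeddings."""
    n = image_emb.shape[0]
    logits = np.exp(t_prime) * (image_emb @ text_emb.T) + b
    labels = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal, -1 elsewhere
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably with logaddexp.
    log_likelihood = -np.logaddexp(0.0, -labels * logits)
    return -log_likelihood.sum() / n


identity = np.eye(4)  # four perfectly matched, mutually orthogonal pairs
t_prime, b = np.log(10.0), -10.0
aligned = siglip_loss(identity, identity, t_prime, b)
shuffled = siglip_loss(identity, np.roll(identity, 1, axis=0), t_prime, b)
print(aligned < shuffled)  # True
```

Each term depends on a single image-text pair, so the sum can be accumulated chunk by chunk across devices without any normalization over the full batch.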
SigLIP 2, released by Tschannen, Gritsenko, Wang, and colleagues in February 2025, unifies the sigmoid loss with decoder-based captioning (LocCa), self-distillation, masked prediction, and online data curation (ACID).[28] SigLIP 2 ships four model sizes (ViT-B at 86M, ViT-L at 303M, ViT-So400m at 400M, and ViT-g at roughly 1B parameters), is trained on a multilingual WebLI mixture (90% English, 10% other languages), and includes a NaFlex variant that supports multiple input resolutions while preserving native aspect ratios. SigLIP 2 reports 79.1% (B/16, 256px), 82.5% (L/16, 256px), 83.4% (So400m/16, 256px), and 84.5% (g/16, 256px) zero-shot ImageNet top-1, beating both the original SigLIP and OpenAI CLIP at matched compute.[28] The release is widely used as the visual encoder in 2025-era open-source multimodal LLMs.
EVA-CLIP and EVA-CLIP-18B
EVA-CLIP, developed by Sun, Fang et al. (2023) at the Beijing Academy of Artificial Intelligence (BAAI), improves CLIP training efficiency by initializing the image encoder with weights from EVA, a masked image modeling pre-trained ViT.[29] Rather than training the visual encoder from scratch, EVA-CLIP leverages pre-trained representations and applies improved training techniques such as LAMB optimisation and bfloat16 mixed precision.
The largest 2023 model, EVA-02-CLIP-E/14+ (5 billion parameters), achieves 82.0% zero-shot top-1 accuracy on ImageNet with only 9 billion training samples seen.[29] EVA-02-CLIP-L/14+ (430 million parameters) achieves 80.4% zero-shot accuracy with only 6 billion samples, making it one of the most compute-efficient CLIP-scale models per accuracy point.
In February 2024, BAAI released EVA-CLIP-18B, an 18-billion-parameter ViT-based CLIP that averages 80.7% zero-shot top-1 across 27 image-classification benchmarks while training on the openly available LAION-2B and COYO-700M combination (2B image-text pairs total). At the time of release it was the largest publicly available CLIP model, showing that EVA-style weak-to-strong scaling continued to deliver gains beyond 5B parameters.[30]
CLIPA
CLIPA ("An Inverse Scaling Law for CLIP Training"), introduced by Li, Wang, and Xie (2023) at UC Santa Cruz, identified that larger encoders can be trained effectively with shorter input sequences.[31] Specifically, when using a larger image encoder, the image can be resized to a lower resolution (reducing the number of patch tokens), and the text can be truncated more aggressively. This inverse scaling law dramatically reduces the computational cost of training.
CLIPA achieves practical results on modest hardware: using 8 A100 GPUs, it reaches 63.2% zero-shot ImageNet accuracy in about 2 days and 69.3% in about 4 days. The CLIPA-v2 variant, using a ViT-G/14 encoder, achieves 83.0% zero-shot ImageNet accuracy while being roughly 33x faster to train than the equivalent OpenCLIP model. The paper was published at NeurIPS 2023.[31]
ALIGN
ALIGN ("A Large-scale ImaGe and Noisy-text embedding"), developed by Jia et al. (2021) at Google, used a contrastive image-text objective on more than 1 billion image-text pairs collected from raw alt-text without aggressive cleaning.[32] ALIGN uses an EfficientNet image encoder and a BERT-style text encoder, showing that the contrastive approach works with different architecture choices and much noisier data, and reaching 76.4% zero-shot ImageNet top-1 with EfficientNet-L2. ALIGN appeared on arXiv within weeks of CLIP and is often discussed alongside it as the second of the two concurrent demonstrations that contrastive image-text training scales.[32]
CoCa
CoCa ("Contrastive Captioners are Image-Text Foundation Models"), by Yu et al. (2022) at Google, merges a contrastive objective (as in CLIP) with a captioning loss in a single encoder-decoder model.[33] The first half of the decoder is unimodal, producing aligned text embeddings via contrastive loss, and the second half cross-attends to image features to produce a captioning loss. CoCa reaches 86.3% zero-shot top-1 on ImageNet (with a much larger model than CLIP and additional training data) and reports state-of-the-art results across image captioning, retrieval, action recognition, and VQA tasks.[33] CoCa demonstrated that contrastive and generative objectives are complementary, an insight later folded into SigLIP 2.
MetaCLIP
MetaCLIP (Xu et al., 2023; "Demystifying CLIP Data") attempted to reverse-engineer CLIP's data curation by extracting and balancing a 500k-entry "metadata" vocabulary derived from the CLIP paper, then filtering Common Crawl with it.[10] MetaCLIP-400M (matched in size to WIT) reaches 70.8% ImageNet zero-shot on ViT-B/16 versus OpenAI's 68.3%, and a 2.5B-image scale-up further improves accuracy. The paper's headline contribution is a transparent algorithm for CLIP-style curation, together with the conclusion that data sourcing, rather than architecture or loss, accounts for much of CLIP's effectiveness.[10] A 2025 follow-up, Meta CLIP 2, extends the recipe to a worldwide multilingual setting.
DataComp and Data Filtering Networks
DataComp (Gadre et al., NeurIPS 2023 Datasets and Benchmarks) is a public benchmark for image-text dataset design centred on a 12.8B Common Crawl candidate pool.[12] Participants submit filtering or curation strategies, then evaluate by training a CLIP architecture under fixed compute and measuring accuracy across 38 downstream tasks. The reference dataset, DataComp-1B (1.4B image-text pairs), trains a ViT-L/14 to 79.2% zero-shot ImageNet top-1, 3.7 points above OpenAI's ViT-L/14 at matched compute.[12]
Apple's Data Filtering Networks (DFN; Fang et al., 2023, ICLR 2024) trained small auxiliary networks to filter raw image-text pools. DFN-5B (filtered from 43B raw pairs) produces a ViT-H/14 that reaches 83.0% ImageNet zero-shot, and Apple's release of DFN-5B-CLIP-ViT-H-14-378 (378x378 fine-tune) is widely used as a vision encoder.[34]
Comparison with DINOv2
DINOv2 (Oquab et al., Meta, 2023) is a contemporary self-supervised vision encoder that uses image-only self-distillation rather than image-text contrastive supervision. DINOv2 produces strong dense features and outperforms OpenCLIP on dense prediction tasks like segmentation and depth estimation, while CLIP retains advantages on text-conditioned tasks like zero-shot classification and OCR.[35] In modern VLMs both encoders are sometimes used in combination, with CLIP providing semantic alignment to language and DINOv2 providing fine spatial features.
Summary of CLIP variants
| Model | Organization | Year | Image encoder | Training data | ImageNet zero-shot (best) | Key innovation |
|---|---|---|---|---|---|---|
| CLIP | OpenAI | 2021 | ResNet / ViT | WIT (400M, private) | 76.2% (ViT-L/14@336)[1] | Original contrastive vision-language model |
| ALIGN | Google | 2021 | EfficientNet | 1.8B noisy alt-text pairs | 76.4%[32] | Larger noisy data |
| CoCa | Google | 2022 | ViT | JFT-3B + ALIGN | 86.3%[33] | Joint contrastive + captioning |
| OpenCLIP | LAION | 2022+ | ViT / ConvNeXt | LAION-2B, DataComp-1B | 80.1% (bigG/14)[25] | Open-source, public data, scaling laws |
| SigLIP | Google | 2023 | ViT | WebLI | 84.5% (with LiT)[27] | Sigmoid loss, memory efficient |
| EVA-CLIP | BAAI | 2023 | EVA ViT | Merged-2B | 82.0% (E/14+)[29] | Pre-trained vision encoder init |
| CLIPA | UC Santa Cruz | 2023 | ViT | LAION-2B | 83.0% (G/14)[31] | Inverse scaling law, training efficiency |
| MetaCLIP | Meta | 2023 | ViT | Common Crawl 2.5B | 80.5% (H/14)[10] | Reproduced WIT-style curation |
| DataComp-1B | DataComp consortium | 2023 | ViT | DataComp-1B (1.4B) | 79.2% (L/14)[12] | Public data curation benchmark |
| DFN | Apple / U.Washington | 2023 | ViT | DFN-5B | 83.0% (H/14)[34] | Learned data filtering |
| EVA-CLIP-18B | BAAI | 2024 | ViT (18B) | LAION-2B + COYO-700M | 80.7% (avg 27 sets)[30] | Largest open CLIP at release |
| SigLIP 2 | Google | 2025 | ViT (B, L, So400m, g) | Multilingual WebLI | 84.5% (g/16)[28] | Captioning + distillation + multilingual |
Limitations
Despite its versatility, CLIP has several well-documented limitations.
While CLIP performs well on common object recognition, it struggles on certain specialized tasks. On MNIST handwritten digits, zero-shot CLIP achieves only 88% accuracy, far below the 99.75% that humans achieve and the near-perfect accuracy of simple supervised models trained directly on MNIST.[1] CLIP also performs poorly on fine-grained classification tasks that require distinguishing visually similar subcategories (such as bird species or flower varieties in datasets like CUB-200 or Oxford Flowers), counting objects, and understanding spatial relationships such as "left of" or "above."[1]
Abstract and systematic reasoning
CLIP has limited ability to perform compositional or systematic reasoning. For instance, it may struggle to distinguish "a red cube on top of a blue sphere" from "a blue cube on top of a red sphere" because its contrastive training does not explicitly teach compositional understanding of spatial arrangements or attribute binding. Probe sets such as ARO (Attribution, Relation, and Order) and SugarCrepe have measured CLIP's compositional weaknesses systematically.[36]
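A toy example shows why an order-insensitive text representation cannot separate the two captions above. It does not run CLIP; `bag_of_words` is a hypothetical stand-in for an encoder that ignores word order, which is the failure mode the ARO authors describe as "bag-of-words" behaviour.

```python
from collections import Counter

def bag_of_words(caption):
    """Represent a caption by its word counts, discarding word order."""
    return Counter(caption.lower().split())

caption_a = "a red cube on top of a blue sphere"
caption_b = "a blue cube on top of a red sphere"

# The two captions describe different scenes but contain exactly the same words,
# so any representation built only from word counts treats them as identical.
same_bag = bag_of_words(caption_a) == bag_of_words(caption_b)
```

Here `same_bag` is `True`: swapping which colour binds to which object leaves the word counts unchanged, so a model that leans on such features has no signal to tell the scenes apart.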
Typographic attacks
CLIP is vulnerable to typographic attacks, in which text placed on an image causes the model to classify the image by the text content rather than the visual content. Goh et al. (2021) at OpenAI showed that attaching a handwritten label reading "iPod" to an apple causes CLIP to classify the apple as an iPod, sometimes with higher confidence than it assigns to the correct label on the unmodified image.[37] These attacks generalise: overlaying "$$$" on a photo of a poodle activates a "finance" neuron and pushes the prediction toward "piggy bank", while writing "robot" on a shirt can fool detectors trained on CLIP features. The same paper documented multimodal neurons inside CLIP that respond consistently to a concept across photographs, illustrations, sketches, and the printed word, which is the underlying mechanism enabling the attacks.[37]
Social biases
Because CLIP was trained on unfiltered internet data, it inherits biases present in that data. Agarwal et al. (2021) and the model card document that CLIP can encode racial, gender, age, and other social stereotypes.[38][6] In the FairFace evaluation reported on the model card, CLIP correctly classifies gender at over 96% accuracy averaged across races, but classification of race (about 93%) and age (about 63%) varies more, and denigration probes show disparities in associations between racial categories and crime- or animal-related terms. The OpenAI team explicitly notes that deployment of CLIP-based systems in sensitive domains requires careful bias auditing, and that "any deployed use case of the model, whether commercial or not, is currently out of scope."[6]
Context length limitation
CLIP's text encoder has a maximum context length of 77 token positions, two of which are taken by the start-of-text and end-of-text tokens, leaving 75 for the description itself. This limits the complexity of text descriptions it can process: long or detailed descriptions are truncated, potentially losing important information.[1] Some downstream applications address this by chunking long texts and aggregating their embeddings, or by replacing the CLIP text encoder with a longer-context one like Long-CLIP or T5.
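The chunk-and-aggregate workaround can be sketched as follows, assuming a 77-position window that includes the two special tokens. The function names are hypothetical, `encode_chunk` is a placeholder for whatever text encoder the application uses, the special-token ids are passed as parameters whose defaults are assumptions, and mean pooling is only one of several possible aggregation choices.

```python
import math

def chunk_token_ids(token_ids, context_len=77, sot=49406, eot=49407):
    """Split token ids into windows that fit the context, adding special tokens."""
    body = context_len - 2  # positions left once start and end tokens are added
    chunks = [token_ids[i:i + body] for i in range(0, len(token_ids), body)] or [[]]
    return [[sot] + chunk + [eot] for chunk in chunks]

def mean_pool(embeddings):
    """Average a list of equal-length vectors and L2-normalise the result."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def embed_long_text(token_ids, encode_chunk, context_len=77):
    """Embed each window separately, then pool the window embeddings."""
    windows = chunk_token_ids(token_ids, context_len)
    return mean_pool([encode_chunk(w) for w in windows])
```

Averaging treats every window as equally important and discards cross-window context, so it trades fidelity for compatibility with the fixed-length encoder.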
Calibration
CLIP's zero-shot classification outputs are not well-calibrated, meaning the similarity scores do not correspond directly to reliable probability estimates. The model may produce high confidence scores for incorrect predictions, and the scores are sensitive to the number and composition of candidate classes being evaluated.[1]
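A small numeric sketch illustrates the sensitivity to the candidate set. The similarity values and the logit scale of 100 are made-up assumptions for illustration, not measurements from CLIP; the point is only that a softmax over candidates redistributes probability whenever the candidate list changes.

```python
import math

def softmax(scores, scale=100.0):
    """Turn cosine similarities into probabilities over the candidate labels."""
    m = max(scores)
    exps = [math.exp(scale * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three label prompts.
sims = {"cat": 0.27, "dog": 0.25, "lynx": 0.27}

# Same image, same "cat" similarity, two different candidate lists.
p_two = softmax([sims["cat"], sims["dog"]])
p_three = softmax([sims["cat"], sims["dog"], sims["lynx"]])
```

The probability assigned to "cat" is about 0.88 with two candidates and about 0.47 once "lynx" is added, even though the underlying image-text similarity for "cat" is unchanged.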
Data limitations
CLIP's WIT training data was scraped from the English-speaking web, which produces several biases. The model performs best on concepts well-represented in English internet content and worse on culturally specific concepts, non-Latin scripts, and specialized domain imagery (medicine, satellite, scientific) that is underrepresented online.[1] Birhane, Prabhu, and Kahembwe (2021) audited LAION-400M, the open analogue of WIT, and documented misogynistic, racist, and pornographic imagery in the dataset, raising similar concerns for any CLIP-like model trained on uncurated web data.[39]
Move toward end-to-end text encoders
Modern text-to-image systems are increasingly moving away from using a frozen CLIP text encoder as the sole conditioning signal. Stable Diffusion 3 and FLUX.1 pair CLIP with T5-XXL, which produces stronger prompt understanding because it was trained on much longer text. Some 2024-2025 systems fine-tune the text encoder end-to-end with the diffusion model, accepting the additional cost in exchange for more accurate adherence to long, compositional prompts.[17] CLIP's role in text-to-image generation is shifting from "sole prompt encoder" toward "aesthetic and style encoder used in combination with a larger LLM-style text model."
Impact and legacy
CLIP has had a broad influence on the field of machine learning and computer vision.
The paper demonstrated that natural language supervision can serve as a scalable training signal for visual representations, opening an alternative to the traditional approach of collecting fixed-label datasets. This insight has been adopted widely: subsequent models such as ALIGN, Florence, CoCa, BLIP, and SigLIP, and dataset efforts such as DataComp, build on the same principle.
CLIP's image encoder has become a standard visual backbone in multimodal systems. Models such as LLaVA, MiniGPT-4, Qwen-VL, InternVL, and many open-source vision-language models use CLIP or OpenCLIP encoders as their visual front end.[23] CLIP's text encoder is the conditioning mechanism in Stable Diffusion 1.x and is one of several encoders used in SDXL, SD3, and FLUX.[4][17]
The CLIP embedding space has become a de facto standard for measuring image-text alignment. CLIPScore is widely used as an automatic evaluation metric for text-to-image generation models[22], and CLIP retrieval indices (clip-retrieval, FAISS-based search engines over LAION-5B) are common tools for dataset analysis and example mining.[11]
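For reference, the CLIPScore metric of Hessel et al. is a rescaled, clipped cosine similarity between an image embedding and a caption embedding. The sketch below assumes the embeddings have already been produced by a CLIP model and uses the rescaling weight of 2.5 from that paper; the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)
```

Clipping at zero means unrelated and actively mismatched pairs both score 0, and the weight stretches typical cosine values into a more readable range.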
By demonstrating that a single model trained with natural language supervision could match or exceed the performance of task-specific supervised models across dozens of datasets, CLIP shifted the field toward foundation models that learn general-purpose representations from broad data, rather than narrow models trained for individual tasks. Five years after its initial release, CLIP and its open-source descendants remain central infrastructure for vision-language AI.
References
- Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. "Learning Transferable Visual Models From Natural Language Supervision". *Proceedings of the 38th International Conference on Machine Learning (ICML 2021)*, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-24.
- Radford, A., et al. "Learning Transferable Visual Models From Natural Language Supervision" (camera-ready PDF). PMLR Vol. 139, 2021. https://proceedings.mlr.press/v139/radford21a/radford21a.pdf. Accessed 2026-05-24.
- Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125, 2022-04-13. https://arxiv.org/abs/2204.06125. Accessed 2026-05-24.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. "High-Resolution Image Synthesis with Latent Diffusion Models". *CVPR 2022*. https://arxiv.org/abs/2112.10752. Accessed 2026-05-24.
- "Contrastive Language-Image Pre-training", Wikipedia (used for navigation, not as citation source; OpenAI primary docs cited). https://en.wikipedia.org/wiki/Contrastive_Language-Image_Pre-training. Accessed 2026-05-24.
- OpenAI. "CLIP Model Card", GitHub openai/CLIP, 2021-01-05 (updated 2022). https://github.com/openai/CLIP/blob/main/model-card.md. Accessed 2026-05-24.
- Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T. "DeViSE: A Deep Visual-Semantic Embedding Model". *NeurIPS 2013*. https://papers.nips.cc/paper/2013/hash/7cce53cf90577442771720a370c3c723-Abstract.html. Accessed 2026-05-24.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". *ICLR 2021*. https://arxiv.org/abs/2010.11929. Accessed 2026-05-24.
- Hugging Face. "openai/clip-vit-large-patch14 model card", 2022. https://huggingface.co/openai/clip-vit-large-patch14. Accessed 2026-05-24.
- Xu, H., Xie, S., Tan, X.E., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., & Feichtenhofer, C. "Demystifying CLIP Data" (MetaCLIP). arXiv:2309.16671, 2023-09-28. https://arxiv.org/abs/2309.16671. Accessed 2026-05-24.
- Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., & Schmidt, L. "OpenCLIP". GitHub mlfoundations/open_clip, 2021. https://github.com/mlfoundations/open_clip. Accessed 2026-05-24.
- Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., et al. "DataComp: In Search of the Next Generation of Multimodal Datasets". *NeurIPS 2023 Datasets and Benchmarks (Oral)*. arXiv:2304.14108. https://arxiv.org/abs/2304.14108. Accessed 2026-05-24.
- van den Oord, A., Li, Y., & Vinyals, O. "Representation Learning with Contrastive Predictive Coding". arXiv:1807.03748, 2018-07-10. https://arxiv.org/abs/1807.03748. Accessed 2026-05-24.
- Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., & Schmidt, L. "Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)". *ICML 2022*. https://arxiv.org/abs/2205.01397. Accessed 2026-05-24.
- Zhou, K., Yang, J., Loy, C.C., & Liu, Z. "Learning to Prompt for Vision-Language Models" (CoOp). *International Journal of Computer Vision*, 130(9), 2337-2348, 2022. https://arxiv.org/abs/2109.01134. Accessed 2026-05-24.
- Zhou, K., Yang, J., Loy, C.C., & Liu, Z. "Conditional Prompt Learning for Vision-Language Models" (CoCoOp). *CVPR 2022*. https://arxiv.org/abs/2203.05557. Accessed 2026-05-24.
- Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Stable Diffusion 3). *ICML 2024*. https://arxiv.org/abs/2403.03206. Accessed 2026-05-24.
- Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis". arXiv:2307.01952, 2023-07-04. https://arxiv.org/abs/2307.01952. Accessed 2026-05-24.
- Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". *ICML 2022*. https://arxiv.org/abs/2112.10741. Accessed 2026-05-24.
- Black Forest Labs. "Announcing Black Forest Labs", 2024-08-01. https://blackforestlabs.ai/announcing-black-forest-labs/. Accessed 2026-05-24.
- CompVis. "stable-diffusion-safety-checker", Hugging Face model card, 2022. https://huggingface.co/CompVis/stable-diffusion-safety-checker. Accessed 2026-05-24.
- Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., & Choi, Y. "CLIPScore: A Reference-free Evaluation Metric for Image Captioning". *EMNLP 2021*. https://aclanthology.org/2021.emnlp-main.595/. Accessed 2026-05-24.
- Liu, H., Li, C., Wu, Q., & Lee, Y.J. "Visual Instruction Tuning" (LLaVA). *NeurIPS 2023 (Oral)*. https://arxiv.org/abs/2304.08485. Accessed 2026-05-24.
- Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. "Reproducible Scaling Laws for Contrastive Language-Image Learning". *CVPR 2023*. arXiv:2212.07143. https://arxiv.org/abs/2212.07143. Accessed 2026-05-24.
- LAION. "CLIP-ViT-bigG-14-laion2B-39B-b160k", Hugging Face model card, 2023. https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k. Accessed 2026-05-24.
- LAION. "Large scale openCLIP: L/14, H/14 and g/14 trained on LAION-2B", LAION blog, 2022-09-15. https://laion.ai/blog/large-openclip/. Accessed 2026-05-24.
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. "Sigmoid Loss for Language Image Pre-Training" (SigLIP). *ICCV 2023 (Oral)*. arXiv:2303.15343. https://arxiv.org/abs/2303.15343. Accessed 2026-05-24.
- Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., & Zhai, X. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features". arXiv:2502.14786, 2025-02-20. https://arxiv.org/abs/2502.14786. Accessed 2026-05-24.
- Sun, Q., Fang, Y., Wu, L., Wang, X., & Cao, Y. "EVA-CLIP: Improved Training Techniques for CLIP at Scale". arXiv:2303.15389, 2023-03-27. https://arxiv.org/abs/2303.15389. Accessed 2026-05-24.
- Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., & Wang, X. "EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters". arXiv:2402.04252, 2024-02-06. https://arxiv.org/abs/2402.04252. Accessed 2026-05-24.
- Li, X., Wang, Z., & Xie, C. "An Inverse Scaling Law for CLIP Training". *NeurIPS 2023*. arXiv:2305.07017. https://arxiv.org/abs/2305.07017. Accessed 2026-05-24.
- Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., & Duerig, T. "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" (ALIGN). *ICML 2021*. https://arxiv.org/abs/2102.05918. Accessed 2026-05-24.
- Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. "CoCa: Contrastive Captioners are Image-Text Foundation Models". *TMLR / arXiv:2205.01917*, 2022-05-04. https://arxiv.org/abs/2205.01917. Accessed 2026-05-24.
- Fang, A., Madappally Jose, A., Jain, A., Schmidt, L., Toshev, A., & Shankar, V. "Data Filtering Networks". *ICLR 2024*. arXiv:2309.17425. https://arxiv.org/abs/2309.17425. Accessed 2026-05-24.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., et al. "DINOv2: Learning Robust Visual Features without Supervision". *TMLR 2024 / arXiv:2304.07193*. https://arxiv.org/abs/2304.07193. Accessed 2026-05-24.
- Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., & Zou, J. "When and why vision-language models behave like bags-of-words, and what to do about it?" *ICLR 2023*. arXiv:2210.01936. https://arxiv.org/abs/2210.01936. Accessed 2026-05-24.
- Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., & Olah, C. "Multimodal Neurons in Artificial Neural Networks". *Distill*, 2021-03-04. https://distill.pub/2021/multimodal-neurons/. Accessed 2026-05-24.
- Agarwal, S., Krueger, G., Clark, J., Radford, A., Kim, J.W., & Brundage, M. "Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications". arXiv:2108.02818, 2021-08-05. https://arxiv.org/abs/2108.02818. Accessed 2026-05-24.
- Birhane, A., Prabhu, V.U., & Kahembwe, E. "Multimodal datasets: misogyny, pornography, and malignant stereotypes". arXiv:2110.01963, 2021-10-05. https://arxiv.org/abs/2110.01963. Accessed 2026-05-24.