CLIP (Contrastive Language-Image Pre-training)
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 7,621 words
CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns visual concepts from natural language by training an image encoder and a text encoder to map matching image-text pairs into the same shared embedding space. Released on 5 January 2021 and trained on roughly 400 million image-text pairs scraped from the internet, CLIP can classify images into categories it was never explicitly trained on (zero-shot), and it matched the accuracy of a supervised ResNet-50 on ImageNet without using any of that benchmark's 1.28 million labeled training examples.[1][2] In the words of the original paper, "we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on."[1]
Introduced by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, CLIP performs zero-shot learning on a wide range of visual tasks by matching images to natural language descriptions, without requiring any task-specific training data.[1] The paper's core claim is that "the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet."[1]
CLIP's design is simple but effective: it trains two separate encoders, one for images and one for text, to produce embeddings that land close together when the image and text describe the same concept. This approach allows CLIP to generalize across many visual classification and retrieval tasks, and it has become a foundational component in systems ranging from image generation (such as DALL-E 2 and Stable Diffusion) to content moderation, visual search, and multimodal reasoning.[3][4]
CLIP is a vision-language model that learns a shared embedding space for images and text, so that a picture and a caption describing it produce nearby vectors, enabling open-vocabulary, zero-shot image classification and retrieval driven purely by natural language prompts.[1] OpenAI's announcement summarized the result this way: by learning from natural language supervision rather than fixed labels, CLIP "closes the robustness gap by up to 75%" on challenging ImageNet variants while matching the original ResNet-50 on standard ImageNet zero-shot.[40][1]
Imagine you have a big box of photos and a big box of captions. CLIP is a program that learns to match each photo with its correct caption. It does this by looking at millions of photos and captions from the internet and learning what goes together. Once it has learned, you can show it a new photo it has never seen before, type in some descriptions like "a photo of a cat" and "a photo of a dog," and CLIP will tell you which description best fits the photo. The clever part is that nobody had to teach CLIP what a cat or a dog looks like by hand. It figured it out on its own just by reading captions.
| Property | Value |
|---|---|
| Developer | OpenAI |
| Initial release | 5 January 2021 |
| Paper | "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., arXiv:2103.00020) |
| Conference | ICML 2021 |
| Training data | WebImageText (WIT), approximately 400 million image-text pairs (private)[1] |
| Benchmark suite | Over 30 existing computer vision datasets[1] |
| Largest released model | ViT-L/14@336px (April 2022) |
| Image encoders | ResNet (RN50, RN101, RN50x4, RN50x16, RN50x64) and Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px)[5] |
| Text encoder | 12-layer, 8-head Transformer, 63M parameters, 512-wide[1] |
| Best zero-shot ImageNet top-1 (OpenAI weights) | 76.2% with ViT-L/14@336px[1] |
| Robustness gap closed vs supervised models | Up to 75%[1][40] |
| License | MIT[6] |
Before CLIP, the standard approach to building visual recognition systems followed a two-step pipeline: collect a labeled dataset for a specific task (such as ImageNet for object classification), then train a convolutional neural network (CNN) or Vision Transformer (ViT) on that dataset using supervised learning. The resulting models were narrow. A model trained on ImageNet could classify 1,000 object categories, but adding a new category required collecting new labeled data and retraining. Performance also degraded significantly when the model encountered images that looked different from its training set (a problem known as distribution shift).[1] The CLIP authors framed this limitation directly: "State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept."[1]
In parallel, natural language processing (NLP) had been moving toward more flexible, task-agnostic models. GPT-2 and GPT-3 demonstrated that a single language model trained on broad internet text could perform many different NLP tasks through prompting alone, with no task-specific fine-tuning. The CLIP authors asked whether a similar approach could work for vision: could a model learn visual concepts from natural language supervision at web scale, and then transfer those concepts to new tasks without additional training?[1]
Earlier work had explored this direction. In 2013, Frome et al. introduced DeViSE, which learned to map images into a word embedding space.[7] Joulin et al. (2016) trained CNNs to predict words in image captions, and Li et al. (2017) used natural language supervision for zero-shot visual recognition. Sariyildiz et al. and Desai and Johnson (VirTex, 2020) showed that captioning-style supervision could produce useful visual representations. However, these methods achieved modest results compared to supervised baselines, with VirTex topping out at roughly the accuracy of a ResNet-50 trained on ImageNet, and they did not scale to evaluation across dozens of datasets.[1] CLIP built on these ideas but succeeded by dramatically increasing the scale of training data, from tens of millions to 400 million pairs, and by using a contrastive learning objective that proved more efficient than predictive objectives at this scale.[1]
The paper was first posted to arXiv on 26 February 2021 and presented at the 38th International Conference on Machine Learning (ICML 2021).[2] OpenAI released the model checkpoints in stages: ViT-B/32 and RN50 in January 2021; RN101 and RN50x4 later in 2021; RN50x16 and ViT-B/16 in July 2021; RN50x64 and ViT-L/14 in January 2022; and ViT-L/14@336px in April 2022.[6]
CLIP uses a dual-encoder architecture consisting of two separate neural networks: an image encoder and a text encoder. Each encoder maps its input to a fixed-length vector in a shared embedding space. The dimensionality of this space depends on the model variant (512 for the ViT-B models and 768 for ViT-L/14, for example); a final linear projection on each tower brings the image and text features to this common dimensionality.[1][5]
The CLIP paper evaluated two families of image encoders.
ResNet variants. The authors trained five modified ResNet models: ResNet-50, ResNet-101, and three larger variants following the EfficientNet-style compound scaling pattern, RN50x4, RN50x16, and RN50x64, where the multiplier indicates approximate compute relative to a ResNet-50. These networks replace global average pooling with an attention pooling mechanism: a single layer of multi-head QKV attention in which the query is conditioned on the global average-pooled representation of the image and attends over the spatial feature map.[1]
Vision Transformer (ViT) variants. The authors also trained ViT-B/32, ViT-B/16, and ViT-L/14, closely following the architecture of Dosovitskiy et al. (2021).[8] The image is divided into fixed-size patches (32x32, 16x16, or 14x14 pixels), each patch is linearly embedded, and the resulting sequence of patch embeddings is processed by a Transformer encoder. A learnable [CLS] token is prepended to the sequence, and its output representation at the final layer serves as the image embedding. The ViT-L/14 model was subsequently fine-tuned for one additional epoch at 336x336 resolution to produce ViT-L/14@336px.[1] The naming convention "ViT-B/32" denotes a base-sized Transformer with 32x32 pixel patches.
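The patch arithmetic behind these names can be checked with a short sketch. The helper name `vit_sequence_length` is illustrative and not part of any CLIP codebase; it assumes a square input whose side is divisible by the patch size, plus one prepended class token.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT processes: one per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


for name, size, patch in [
    ("ViT-B/32", 224, 32),
    ("ViT-B/16", 224, 16),
    ("ViT-L/14", 224, 14),
    ("ViT-L/14@336px", 336, 14),
]:
    print(name, vit_sequence_length(size, patch))
```

Under these assumptions the four variants see 50, 197, 257, and 577 tokens respectively, which is why smaller patches and higher resolution both raise the cost of the image encoder.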
The ViT-L/14 model generally outperformed all ResNet variants, including the largest RN50x64, despite requiring less compute to train.[1] OpenAI's ViT-L/14 (vision tower) has roughly 304M parameters; combined with the text tower the full checkpoint is approximately 428M parameters.[9]
The text encoder is a Transformer with the following specifications:
| Parameter | Value |
|---|---|
| Parameters | 63 million[1] |
| Layers | 12 |
| Width | 512 |
| Attention heads | 8 |
| Vocabulary size | 49,152 (lower-cased byte-pair encoding) |
| Max context length | 76 tokens in the paper; the released code uses a 77-token context that includes the [SOS] and [EOS] tokens |
| Architecture style | Decoder-only with causal masking, similar to GPT-2 |
The text input is bracketed by [SOS] (start of sequence) and [EOS] (end of sequence) tokens. The representation at the [EOS] token position in the final layer is taken as the text embedding. This embedding is then linearly projected to match the dimensionality of the image embedding space.[1]
Both the image encoder output and the text encoder output are linearly projected into the shared embedding space and L2-normalized. The similarity between an image and a text is the cosine similarity (equivalently, the dot product of the normalized vectors) between their embeddings, divided by a learned temperature τ.[1] In practice the model learns a free scalar t and multiplies the similarities by the logit scale exp(t) = 1/τ, so that τ = exp(-t). The scalar t is initialised at log(1/0.07), which corresponds to τ = 0.07, and the logit scale is clipped at 100 to prevent training instability. The temperature controls the sharpness of the softmax distribution over similarities.
The CLIP architecture trains the entire model from scratch with no warm-start from pretrained image or text models; all weights are randomly initialised before contrastive training.[1]
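CLIP was trained on a private dataset called WebImageText (WIT), constructed by OpenAI. WIT contains approximately 400 million image-text pairs scraped from the internet. The dataset was assembled by searching for images whose text contains one of a set of 500,000 queries. The base query list is all words occurring at least 100 times in English Wikipedia, augmented with bi-grams with high pointwise mutual information, the names of all Wikipedia articles above a certain search volume, and all WordNet synsets not already in the list.[1]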
Each query could return up to 20,000 image-text pairs, yielding a total corpus of roughly 400 million pairs. The total text in the dataset amounts to roughly 40 gigabytes, comparable in scale to the WebText dataset used to train GPT-2.[1] WIT has not been publicly released. (The name conflicts with a separate Wikipedia-based dataset of the same acronym from Google Research; the two are unrelated.)
Subsequent reproduction efforts (notably MetaCLIP[10], OpenCLIP's training on LAION-400M and LAION-5B, and DataComp) helped clarify what made WIT effective, including aggressive subsampling per query, balancing across concepts, and stripping out near-duplicates.[10][11][12]
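CLIP is trained using a symmetric contrastive loss, sometimes called a multi-class N-pair loss or InfoNCE loss (van den Oord et al., 2018).[13] Given a mini-batch of N image-text pairs, both encoders embed their inputs, the embeddings are L2-normalized, and an NxN matrix of cosine similarities s_ij is computed between every image i and every text j. The N diagonal entries correspond to the true pairs, and the objective is a cross-entropy loss applied along both rows and columns so that each image selects its own caption and each caption selects its own image.[1]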
Concretely, the per-image loss is L_i2t = -log( exp(s_ii / τ) / Σ_j exp(s_ij / τ) ), the analogous text-to-image loss is L_t2i = -log( exp(s_ii / τ) / Σ_j exp(s_ji / τ) ), and the total loss is (L_i2t + L_t2i) / 2, averaged over the batch. The temperature τ = exp(-t) controls how sharply the softmax concentrates on the diagonal.[1] During training, the N-1 non-matching pairs in each row and column serve as in-batch negatives.
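The loss above can be sketched in a few lines of standard-library Python. This is an illustration of the formula rather than OpenAI's implementation: the function names are hypothetical, the temperature is fixed at 0.07 instead of being learned, and nested lists stand in for the tensors a real training loop would use.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits` (stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_sum


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for N matching (image, text) pairs."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # s[i][j] = cosine similarity of image i and text j, divided by tau.
    s = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
          for j in range(n)] for i in range(n)]
    # Image-to-text: softmax over each row, target is the diagonal entry.
    i2t = -sum(log_softmax_at(s[i], i) for i in range(n)) / n
    # Text-to-image: softmax over each column, target is the diagonal entry.
    t2i = -sum(log_softmax_at([s[j][i] for j in range(n)], i)
               for i in range(n)) / n
    return (i2t + t2i) / 2


# Toy check: correctly paired embeddings give a lower loss than swapped ones.
aligned = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)
```

In the toy check every other item in the batch acts as a negative, so swapping the two captions pushes the loss from near zero to roughly 1/τ.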
The contrastive objective was chosen over alternatives such as predicting the exact caption tokens given an image. The authors first found that a baseline predicting a bag-of-words encoding of the caption learned about 3x faster than a Transformer language model that predicts the caption itself. Then, starting from that bag-of-words baseline, they report that "we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet."[1] In other words, the contrastive objective reached a given zero-shot accuracy with roughly 4x less compute than bag-of-words prediction, and with a still larger saving relative to exact caption prediction, which is what made it practical to train on the full 400 million pair dataset within realistic time budgets.[1]
The models were trained for 32 epochs on the WIT dataset. Training used a very large global batch size of 32,768, which is important for contrastive learning because larger batches provide more negative examples per step.[1] Mixed-precision training was used to reduce memory requirements, gradient checkpointing saved additional memory, and the cosine similarity computation was sharded across devices so that each GPU computed only the part of the NxN matrix needed for its local batch. The Adam optimizer was used with a cosine learning-rate schedule and decoupled weight decay; full hyperparameters are listed in the paper's appendix.[1]
| Model | Hardware | Training time |
|---|---|---|
| RN50x64 (largest ResNet) | 592 V100 GPUs | 18 days[1] |
| ViT-L/14 (largest ViT) | 256 V100 GPUs | 12 days[1] |
The ViT-L/14@336px variant was produced by fine-tuning ViT-L/14 for one additional epoch at the higher 336x336 resolution, which improved zero-shot accuracy by roughly one percentage point on ImageNet.[1]
Images are resized and center-cropped to the model's native resolution (224x224 for most models, 336x336 for ViT-L/14@336px) using bicubic interpolation. Pixel values are scaled to [0,1] and then standardized with per-channel (RGB) means of 0.48145466, 0.4578275, and 0.40821073 and standard deviations of 0.26862954, 0.26130258, and 0.27577711.[6]
These constants come from the WIT data distribution rather than ImageNet's standard normalisation, and most downstream pipelines reuse them when feeding images to CLIP encoders.
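The standardization step can be sketched for a single pixel as follows. The constants are the ones shipped with OpenAI's released preprocessing code; the helper name `normalize_pixel` is hypothetical, and resizing and center-cropping are deliberately left out.

```python
# Per-channel statistics from OpenAI's released CLIP preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb_uint8):
    """Map one 8-bit RGB pixel to the standardized values CLIP expects."""
    scaled = [c / 255.0 for c in rgb_uint8]  # scale to [0, 1]
    return [(x - m) / s for x, m, s in zip(scaled, CLIP_MEAN, CLIP_STD)]


print(normalize_pixel((0, 0, 0)))        # a black pixel maps to negative values
print(normalize_pixel((255, 255, 255)))  # a white pixel maps to positive values
```

A full pipeline would apply the same per-channel operation to every pixel of the resized, cropped image before passing it to the image encoder.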
One of CLIP's most significant properties is its ability to perform zero-shot image classification, meaning it can classify images into categories it was never explicitly trained on. As the paper puts it, after pre-training "natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks," and the model "transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training."[1]
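To classify an image using CLIP, each candidate class name is inserted into a prompt template and embedded by the text encoder, the image is embedded by the image encoder, and the cosine similarity between the image embedding and every class embedding is computed. A softmax over the scaled similarities gives class probabilities, and the highest-scoring class is the prediction.[1]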
This process requires no gradient updates, no labeled training data for the specific task, and no modification to the model. The only input needed is the list of class names. Because text embeddings can be precomputed and cached, the entire zero-shot classifier reduces to a single matrix multiplication at inference time.
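The zero-shot procedure can be sketched with precomputed embeddings. Everything here is illustrative: `zero_shot_classify` is a hypothetical helper, the three-dimensional vectors are toy stand-ins for real encoder outputs, and the default `logit_scale` of 100 is an assumption (it is the ceiling at which the paper clips the learned scale).

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_text_embs, logit_scale=100.0):
    """Return (best_label, probabilities) given precomputed embeddings.

    class_text_embs maps each label to the embedding of its prompt,
    e.g. the text-encoder output for "a photo of a cat."
    """
    img = l2_normalize(image_emb)
    labels = list(class_text_embs)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(class_text_embs[k])))
        for k in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy vectors standing in for real encoder outputs.
text_embs = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0]}
label, probs = zero_shot_classify([0.8, 0.3, 0.1], text_embs)
print(label, probs)
```

Because the class embeddings do not depend on the image, they can be computed once and reused; adding a new category only means adding one more entry to `text_embs`.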
On ImageNet, zero-shot CLIP (ViT-L/14@336px) achieved 76.2% top-1 accuracy, matching the performance of the original supervised ResNet-50 trained directly on ImageNet's 1.28 million labelled examples.[1] The paper benchmarked CLIP on "over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification."[1] Across a representative suite of 27 of these evaluation datasets (covering OCR, texture recognition, satellite imagery, action recognition, geo-localization, country identification, fine-grained pet classification, and more), zero-shot CLIP outperformed a fully supervised linear classifier trained on ResNet-50 features on 16 of 27 datasets.[1]
When a linear classifier was fitted on top of CLIP's frozen features (a technique called linear probing), performance on ImageNet improved by nearly 10 percentage points. The linear probe on CLIP ViT-L/14 features outperformed the Noisy Student EfficientNet-L2, the best existing ImageNet model at the time, on 21 out of 27 datasets.[1]
CLIP showed strong robustness to natural distribution shifts. On variants of ImageNet designed to test robustness (ImageNet-V2, ImageNet-R, ImageNet Sketch, ObjectNet, ImageNet-A), zero-shot CLIP significantly outperformed standard ImageNet-trained models at equivalent ImageNet accuracy. CLIP narrowed the "effective robustness gap," the difference between in-distribution and out-of-distribution accuracy, by up to 75% compared to more than 200 supervised models evaluated on the same benchmarks.[1][40] This result suggested that learning from natural language supervision produces representations that are more robust than those learned from fixed label sets, a finding that has held up under follow-up analyses[14] though the exact mechanism is debated.
The text prompts used during zero-shot classification significantly affect CLIP's accuracy. The search for better prompts has grown into a sub-field of prompt engineering specific to CLIP and similar vision-language models.
Using the bare class name (e.g., "cat") as the text input tends to produce lower accuracy than wrapping it in a descriptive template. OpenAI's reference template is:
"a photo of a {class}."
More specific templates can improve accuracy for particular domains.[1]
| Domain | Example template |
|---|---|
| General object recognition | "a photo of a {class}" |
| Fine-grained recognition | "a photo of a {class}, a type of pet" |
| Satellite imagery | "a satellite photo of {class}" |
| Texture classification | "a photo of a {class} texture" |
| Action recognition | "a video of a person doing {class}" |
| Food classification | "a photo of {class}, a type of food" |
| Medical imaging | "a medical image showing {class}" |
OpenAI found that averaging the text embeddings from many prompt templates per class improved accuracy by approximately 3.5 percentage points on ImageNet. The released code ships with 80 templates per class, including phrasings such as "a bad photo of a {class}," "a sculpture of a {class}," "a low-resolution photo of a {class}," and "a photo of many {class}."[1][6] Because each template's embedding can be precomputed and the average cached as a single vector per class, prompt ensembling adds no inference-time cost.
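Prompt ensembling can be sketched as averaging normalized text embeddings and renormalizing the result. The three templates are a small subset of the phrasings quoted above, and `toy_embed_text` is a hypothetical stand-in for a real text encoder, included only so the example runs.

```python
import math

TEMPLATES = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sculpture of a {}.",
]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_embedding(class_name, embed_text, templates=TEMPLATES):
    """Average the normalized embeddings of every filled-in template."""
    embs = [l2_normalize(embed_text(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return l2_normalize(mean)  # renormalize so cosine similarity still applies


def toy_embed_text(text):
    """Hypothetical stand-in for a text encoder: offset letter counts."""
    return [text.count(ch) + 1.0 for ch in "aeot"]


cat_vector = ensemble_class_embedding("cat", toy_embed_text)
print(cat_vector)
```

The returned vector has the same shape as a single prompt embedding, so it can replace the single-template class embedding in a zero-shot classifier without any other change.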
Context Optimization (CoOp), proposed by Zhou et al. (2022), replaces the hand-crafted prompt template with learnable continuous vectors that are optimized on a small set of labeled examples. The CLIP encoders remain frozen, and only the prompt tokens are updated via backpropagation. CoOp significantly outperforms hand-crafted prompts in few-shot settings: with 16 labeled examples per class it improves on zero-shot CLIP by roughly 15 percentage points on average.[15] Conditional Context Optimization (CoCoOp) extends this idea by generating input-conditional prompt tokens, improving generalization to unseen classes.[16]
CLIP and its variants have become infrastructure components in many AI systems.
CLIP plays a central role in several major image generation systems.
DALL-E 2 (unCLIP). Introduced by Ramesh et al. (2022), DALL-E 2 uses CLIP as a core component.[3] The system consists of two stages: a "prior" model that generates a CLIP image embedding from a text caption, and a "decoder," a diffusion model, that generates an image from the CLIP image embedding. The system is referred to as "unCLIP" because the decoder effectively inverts the CLIP image encoder. This architecture allows the model to generate diverse images for a single prompt because many different images can map to similar CLIP embeddings. The authors experimented with autoregressive and diffusion priors and found the diffusion prior more efficient.[3]
Stable Diffusion. Stable Diffusion 1.x (Rombach et al., 2022) uses CLIP's text encoder (specifically OpenAI's frozen ViT-L/14 text encoder) to convert text prompts into conditioning embeddings for its latent diffusion model.[4] The non-pooled output sequence of the text encoder is fed into the U-Net backbone via cross-attention layers; this combination of an 860M-parameter U-Net with a 123M-parameter CLIP text encoder defined the SD 1.4 / 1.5 baseline.[4] Stable Diffusion 2.x switched the text encoder to OpenCLIP ViT-H/14, SDXL (Podell et al., 2023) used a dual setup of OpenAI's CLIP ViT-L/14 alongside OpenCLIP ViT-bigG/14 with concatenated penultimate-layer outputs, and Stable Diffusion 3 added a third T5-XXL encoder while keeping both CLIP variants in its MMDiT architecture.[17][18]
CLIP guidance. In diffusion-based image generation, CLIP can serve as a gradient signal to steer the generation process toward a target text description. During each denoising step, the partially generated image is encoded by CLIP's image encoder, and the gradient of the CLIP similarity score (between the image embedding and the target text embedding) is added to the denoising update. Nichol et al. (2022) found in the GLIDE paper that classifier-free guidance generally outperforms CLIP guidance in terms of image quality, and most modern systems use classifier-free guidance instead.[19]
Newer text-to-image stacks continue to evolve away from CLIP-only conditioning. FLUX.1, Black Forest Labs' 2024 text-to-image model, pairs a CLIP text encoder with a T5-XXL encoder; this dual-encoder pattern has become a common 2024-2025 design.[17][20] Even as systems move toward larger language models for prompt understanding, CLIP-derived embeddings remain useful for short-prompt aesthetic conditioning and for backwards compatibility with the large ecosystem of CLIP-trained LoRAs and embeddings.
Because CLIP produces aligned embeddings for images and text, it can be used to build text-to-image and image-to-text retrieval systems. A database of images is encoded once into CLIP space, and then natural language queries are encoded at search time. The images whose embeddings are closest to the query embedding are returned as results. This approach powers visual search features in applications such as Pinterest's "Lens" workflow, Unsplash's natural-language search, and the LAION-5B index, and it is widely used in vector databases for multimodal retrieval.[11]
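A minimal sketch of the retrieval flow, assuming embeddings have already been produced by CLIP's encoders: the function names, file names, and toy vectors are all hypothetical, and the brute-force scan stands in for the approximate nearest-neighbour index a production system would use.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(image_embs):
    """Normalize every image embedding once, at indexing time."""
    return {image_id: l2_normalize(v) for image_id, v in image_embs.items()}


def search(index, query_emb, k=2):
    """Return the k image ids with the highest cosine similarity to the query."""
    q = l2_normalize(query_emb)
    scored = [
        (sum(a * b for a, b in zip(q, v)), image_id)
        for image_id, v in index.items()
    ]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Toy image embeddings; a real system would store CLIP image-encoder outputs.
index = build_index({
    "beach.jpg": [1.0, 0.0, 0.0],
    "forest.jpg": [0.0, 1.0, 0.0],
    "city.jpg": [0.7, 0.0, 0.7],
})
# Toy query embedding standing in for an encoded text query.
results = search(index, [0.9, 0.0, 0.2], k=2)
print(results)
```

The same index serves image-to-image search unchanged: passing an image embedding as the query returns its visual neighbours.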
CLIP's zero-shot capabilities allow it to classify images against content policy categories without task-specific training data. Moderation labels can be defined as text prompts (for example, "a photo containing violence," "safe content," "a photo of a weapon"), and CLIP assigns similarity scores. This makes it possible to update moderation policies by changing the text prompts, without retraining the model. The Stable Diffusion releases use a CLIP-based "safety checker" trained to detect NSFW content in generated images by comparing image embeddings to a held-out set of unsafe concept embeddings.[21]
Fine-tuned versions of CLIP are used to predict aesthetic quality scores for images. The LAION aesthetics predictor is a small MLP trained on top of frozen CLIP embeddings to predict human aesthetic ratings on a 1-10 scale. These scores are used to filter training data for image generation models, including the LAION-Aesthetics subset that helped train Stable Diffusion.[11]
Hessel et al. (2021) introduced CLIPScore, a reference-free metric for image captioning and text-to-image generation that scores image-caption pairs by their CLIP cosine similarity.[22] CLIPScore correlates more strongly with human judgments than reference-based metrics like CIDEr and SPICE on multiple captioning benchmarks, and it has become a standard automatic metric for measuring image-text alignment in generative models. Variants like RefCLIPScore combine the reference-free signal with reference-based comparison.[22]
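The metric itself is short enough to write down. The sketch below follows the definition in Hessel et al., which rescales the cosine similarity by a constant w = 2.5 and clips negative values to zero; those two details come from that paper rather than from the description above, and the function name is illustrative.

```python
import math


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore = w * max(cos(image, caption), 0)."""
    dot = sum(a * b for a, b in zip(image_emb, caption_emb))
    norm = (math.sqrt(sum(a * a for a in image_emb))
            * math.sqrt(sum(b * b for b in caption_emb)))
    return w * max(dot / norm, 0.0)


# Toy vectors standing in for CLIP image and caption embeddings.
print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction is clipped
```

Because no reference captions enter the computation, the score can be applied directly to generated images and their prompts.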
CLIP's image encoder is used as a visual backbone in many multimodal AI systems. Google DeepMind's Flamingo (2022) uses a frozen contrastively trained NFNet visual encoder built in the spirit of CLIP, while LLaVA (Liu et al., 2023) connects a CLIP ViT-L/14 visual encoder to a large language model using a simple linear projection (later upgraded to a two-layer MLP) to translate CLIP visual features into the language model's input space.[23] Many open-weight vision-language models, including MiniGPT-4, InstructBLIP, Qwen-VL, InternVL, and the LLaVA-1.5/1.6/NeXT family, use CLIP or OpenCLIP encoders as the visual front end.[23]
CLIP embeddings are also used in robotics, audio (via Wav2CLIP-style joint embedding spaces), and 3D systems such as OpenAI's Point-E, where a CLIP image encoder provides language-aligned conditioning for point cloud generation.
Since CLIP's release, several organizations have developed improved variants addressing different limitations of the original model.
OpenCLIP is an open-source reimplementation of CLIP developed by Ilharco, Wightman, Schmidt, and collaborators across the LAION community and academic groups.[11] Unlike OpenAI's CLIP, which was trained on the private WIT dataset, OpenCLIP models are trained on publicly available datasets: LAION-400M, LAION-2B, and DataComp-1B.[12] OpenCLIP reproduces and extends the original CLIP training procedure, offering models at scales ranging from ViT-B/32 to ViT-G/14 and beyond, plus architectures such as ConvNeXt.
Cherti et al. ("Reproducible scaling laws for contrastive language-image learning," CVPR 2023) systematically swept model size, training samples seen, and dataset across LAION and identified power-law scaling for zero-shot classification, retrieval, linear probing, and fine-tuning, while showing that OpenAI's and OpenCLIP's models exhibit measurably different scaling behaviour despite identical architectures and similar recipes, reflecting differences in pretraining distribution.[24] OpenCLIP's flagship ViT-bigG/14 model trained on LAION-2B (39B samples seen at batch 160k) reaches 80.1% zero-shot top-1 on ImageNet.[25] LAION's earlier H/14 release achieves 78.0% top-1 and 73.4% Recall@5 on MS COCO image retrieval.[26]
SigLIP (Sigmoid Loss for Language Image Pre-training), introduced by Zhai et al. (2023) at Google, replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss.[27] While CLIP's loss requires computing a global NxN similarity matrix across the entire batch and normalizing with softmax, SigLIP evaluates each image-text pair independently using a binary sigmoid classification (is this pair a match or not?). As the SigLIP paper explains, "the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization."[27]
This change has several practical consequences:
| Property | CLIP (softmax loss) | SigLIP (sigmoid loss) |
|---|---|---|
| Loss computation | Global NxN matrix, softmax normalization | Pairwise, independent sigmoid |
| Memory scaling | Quadratic in batch size | Linear in batch size |
| Performance at small batch sizes | Lower | Higher |
| Saturating batch size | ~32k | ~32k |
| ImageNet zero-shot (ViT-L, 256px) | 75.5% | 80.5%[27] |
SigLIP's loss factorises across devices, which allowed Zhai et al. to scale to a million-sample batch (with diminishing returns past 32k). Combined with Locked-image Tuning (LiT), the authors report that "with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days."[27]
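For comparison with the softmax objective, the pairwise sigmoid loss can be sketched as below. The default values t = 10 and b = -10 are the initial temperature and bias described in the SigLIP paper and are held fixed here rather than learned; the function names are hypothetical and inputs are assumed to be L2-normalized already.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all N*N (image, text) combinations.

    Each pair is an independent binary problem: label +1 on the
    diagonal (true pairs), -1 everywhere else.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * c for a, c in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (t * sim + b))
    return -total / n


aligned = siglip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = siglip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)
```

No term in the double loop depends on any other pair, which is the property that lets the computation be split across devices without gathering a global normalizer.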
SigLIP 2, released by Tschannen, Gritsenko, Wang, and colleagues in February 2025, unifies the sigmoid loss with decoder-based captioning (LocCa), self-distillation, masked prediction, and online data curation (ACID).[28] SigLIP 2 ships four model sizes (ViT-B at 86M, ViT-L at 303M, ViT-So400m at 400M, and ViT-g at roughly 1B parameters), is trained on a multilingual WebLI mixture (90% English, 10% other languages), and includes a NaFlex variant that supports multiple input resolutions while preserving native aspect ratios. SigLIP 2 reports 79.1% (B/16, 256px), 82.5% (L/16, 256px), 83.4% (So400m/16, 256px), and 84.5% (g/16, 256px) zero-shot ImageNet top-1, beating both the original SigLIP and OpenAI CLIP at matched compute.[28] The release is widely used as the visual encoder in 2025-era open-source multimodal LLMs.
EVA-CLIP, developed by Sun, Fang et al. (2023) at the Beijing Academy of Artificial Intelligence (BAAI), improves CLIP training efficiency by initializing the image encoder with weights from EVA, a masked image modeling pre-trained ViT.[29] Rather than training the visual encoder from scratch, EVA-CLIP leverages pre-trained representations and applies improved training techniques such as LAMB optimisation and bfloat16 mixed precision.
The largest 2023 model, EVA-02-CLIP-E/14+ (5 billion parameters), achieves 82.0% zero-shot top-1 accuracy on ImageNet with only 9 billion training samples seen.[29] EVA-02-CLIP-L/14+ (430 million parameters) achieves 80.4% zero-shot accuracy with only 6 billion samples, making it one of the most compute-efficient CLIP-scale models per accuracy point.
In February 2024, BAAI released EVA-CLIP-18B, an 18-billion-parameter ViT-based CLIP that averages 80.7% zero-shot top-1 across 27 image-classification benchmarks while training on the openly available LAION-2B and COYO-700M combination (2B image-text pairs total). At the time of release it was the largest publicly available CLIP model, showing that EVA-style weak-to-strong scaling continued to deliver gains beyond 5B parameters.[30]
CLIPA ("An Inverse Scaling Law for CLIP Training"), introduced by Li, Wang, and Xie (2023) at UC Santa Cruz, identified that larger encoders can be trained effectively with shorter input sequences.[31] Specifically, when using a larger image encoder, the image can be resized to a lower resolution (reducing the number of patch tokens), and the text can be truncated more aggressively. This inverse scaling law dramatically reduces the computational cost of training.
CLIPA achieves practical results on modest hardware: using 8 A100 GPUs, it reaches 63.2% zero-shot ImageNet accuracy in about 2 days and 69.3% in about 4 days. The CLIPA-v2 variant, using a ViT-G/14 encoder, achieves 83.0% zero-shot ImageNet accuracy while being roughly 33x faster to train than the equivalent OpenCLIP model. The paper was published at NeurIPS 2023.[31]
ALIGN ("A Large-scale ImaGe and Noisy-text embedding"), developed by Jia et al. (2021) at Google, used a contrastive image-text objective on more than 1 billion image-text pairs collected from raw alt-text without aggressive cleaning.[32] ALIGN uses an EfficientNet image encoder and a BERT-style text encoder, showing that the contrastive approach works with different architecture choices and much noisier data, and reaching 76.4% zero-shot ImageNet top-1 with EfficientNet-L2. ALIGN appeared on arXiv within weeks of CLIP and is often discussed alongside it as the second of the two concurrent demonstrations that contrastive image-text training scales.[32]
CoCa ("Contrastive Captioners are Image-Text Foundation Models"), by Yu et al. (2022) at Google, merges a contrastive objective (as in CLIP) with a captioning loss in a single encoder-decoder model.[33] The first half of the decoder is unimodal, producing aligned text embeddings via contrastive loss, and the second half cross-attends to image features to produce a captioning loss. CoCa reaches 86.3% zero-shot top-1 on ImageNet (with a much larger model than CLIP and additional training data) and reports state-of-the-art results across image captioning, retrieval, action recognition, and VQA tasks.[33] CoCa demonstrated that contrastive and generative objectives are complementary, an insight later folded into SigLIP 2.
MetaCLIP (Xu et al., 2023; "Demystifying CLIP Data") attempted to reverse-engineer CLIP's data curation by extracting and balancing a 500k-entry "metadata" vocabulary derived from the CLIP paper, then filtering Common Crawl with it.[10] MetaCLIP-400M (matched in size to WIT) reaches 70.8% ImageNet zero-shot on ViT-B/16 versus 68.3% for OpenAI's CLIP, and a 2.5B-image scale-up further improves accuracy. The paper's headline contribution is a transparent algorithm for CLIP-style curation, supporting the conclusion that data sourcing, rather than architecture or loss, accounts for much of CLIP's effectiveness.[10] A 2025 follow-up, Meta CLIP 2, extends the recipe to a worldwide multilingual setting.
DataComp (Gadre et al., NeurIPS 2023 Datasets and Benchmarks) is a public benchmark for image-text dataset design centred on a candidate pool of 12.8B image-text pairs from Common Crawl.[12] Participants submit filtering or curation strategies, which are evaluated by training a fixed CLIP architecture under a fixed compute budget and measuring accuracy across 38 downstream tasks. The reference dataset, DataComp-1B (1.4B image-text pairs), trains a ViT-L/14 to 79.2% zero-shot ImageNet top-1, 3.7 points above OpenAI's ViT-L/14 at matched compute.[12]
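One simple family of filtering strategies scores every candidate pair with an existing CLIP model and keeps only the best-aligned fraction. The sketch below assumes the similarity scores have already been computed; the function name and the default `fraction` are illustrative assumptions, not values prescribed by the benchmark.

```python
def filter_top_fraction(pairs, scores, fraction=0.3):
    """Keep the `fraction` of pairs with the highest image-text similarity.

    `pairs` and `scores` are parallel lists; the original order of the
    surviving pairs is preserved.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(pairs) * fraction))
    # Rank indices by score, best first, then restore dataset order.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])
    return [pairs[i] for i in kept_idx]
```

A submission in this style is then judged only by the downstream accuracy of the model trained on the filtered subset, so the threshold becomes a tunable trade-off between dataset size and quality.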
Apple's Data Filtering Networks (DFN; Fang et al., 2023, ICLR 2024) trained small auxiliary networks to filter raw image-text pools. DFN-5B (filtered from 43B raw pairs) produces a ViT-H/14 that reaches 83.0% ImageNet zero-shot, and Apple's release of DFN-5B-CLIP-ViT-H-14-378 (378x378 fine-tune) is widely used as a vision encoder.[34]
DINOv2 (Oquab et al., Meta, 2023) is a contemporary self-supervised vision encoder that uses image-only self-distillation rather than image-text contrastive supervision. DINOv2 produces strong dense features and outperforms OpenCLIP on dense prediction tasks like segmentation and depth estimation, while CLIP retains advantages on text-conditioned tasks like zero-shot classification and OCR.[35] In modern VLMs both encoders are sometimes used in combination, with CLIP providing semantic alignment to language and DINOv2 providing fine spatial features.
| Model | Organization | Year | Image encoder | Training data | ImageNet zero-shot (best) | Key innovation |
|---|---|---|---|---|---|---|
| CLIP | OpenAI | 2021 | ResNet / ViT | WIT (400M, private) | 76.2% (ViT-L/14@336)[1] | Original contrastive vision-language model |
| ALIGN | Google | 2021 | EfficientNet | 1.8B noisy alt-text pairs | 76.4%[32] | Larger noisy data |
| CoCa | Google | 2022 | ViT | JFT-3B + ALIGN | 86.3%[33] | Joint contrastive + captioning |
| OpenCLIP | LAION | 2022+ | ViT / ConvNeXt | LAION-2B, DataComp-1B | 80.1% (bigG/14)[25] | Open-source, public data, scaling laws |
| SigLIP | Google | 2023 | ViT | WebLI | 84.5% (with LiT)[27] | Sigmoid loss, memory efficient |
| EVA-CLIP | BAAI | 2023 | EVA ViT | Merged-2B | 82.0% (E/14+)[29] | Pre-trained vision encoder init |
| CLIPA | UC Santa Cruz | 2023 | ViT | LAION-2B | 83.0% (G/14)[31] | Inverse scaling law, training efficiency |
| MetaCLIP | Meta | 2023 | ViT | Common Crawl 2.5B | 80.5% (H/14)[10] | Reproduced WIT-style curation |
| DataComp-1B | DataComp consortium | 2023 | ViT | DataComp-1B (1.4B) | 79.2% (L/14)[12] | Public data curation benchmark |
| DFN | Apple / U.Washington | 2023 | ViT | DFN-5B | 83.0% (H/14)[34] | Learned data filtering |
| EVA-CLIP-18B | BAAI | 2024 | ViT (18B) | LAION-2B + COYO-700M | 80.7% (avg 27 sets)[30] | Largest open CLIP at release |
| SigLIP 2 | Google | 2025 | ViT (B, L, So400m, g) | Multilingual WebLI | 84.5% (g/16)[28] | Captioning + distillation + multilingual |
Despite its versatility, CLIP has several well-documented limitations.
While CLIP performs well on common object recognition, it struggles on certain specialized tasks. On MNIST handwritten digits, zero-shot CLIP achieves only 88% accuracy, far below the 99.75% that humans achieve and the near-perfect accuracy of simple supervised models trained directly on MNIST.[1] CLIP also performs poorly on fine-grained classification tasks that require distinguishing visually similar subcategories (such as bird species or flower varieties in datasets like CUB-200 or Oxford Flowers), counting objects, and understanding spatial relationships such as "left of" or "above."[1]
CLIP has limited ability to perform compositional or systematic reasoning. For instance, it may struggle to distinguish "a red cube on top of a blue sphere" from "a blue cube on top of a red sphere" because its contrastive training does not explicitly teach compositional understanding of spatial arrangements or attribute binding. Probe sets such as ARO (Attribution, Relation, and Order) and SugarCrepe have measured CLIP's compositional weaknesses systematically.[36]
CLIP is vulnerable to typographic attacks, in which placing text on an image causes the model to misclassify the image based on the text content rather than the visual content. Goh et al. (2021) at OpenAI showed that writing the word "iPod" on an apple causes CLIP to classify the apple as an iPod, sometimes with higher confidence than the unmodified image.[37] These attacks generalise: writing "$$$" near a piggy bank object can cause activations of a "finance" neuron, while writing "robot" on a shirt can fool detectors trained on CLIP features. The same paper documented multimodal neurons inside CLIP that respond consistently to a concept across photographs, illustrations, sketches, and the printed word, which is the underlying mechanism enabling the attacks.[37]
Because CLIP was trained on unfiltered internet data, it inherits biases present in that data. Agarwal et al. (2021) and the model card document that CLIP can encode racial, gender, age, and other social stereotypes.[38][6] In the FairFace evaluation reported on the model card, CLIP correctly classifies gender at over 96% accuracy averaged across races, but classification of race (about 93%) and age (about 63%) varies more, and denigration probes show disparities in associations between racial categories and crime- or animal-related terms. The OpenAI team explicitly notes that deployment of CLIP-based systems in sensitive domains requires careful bias auditing, and that "any deployed use case of the model, whether commercial or not, is currently out of scope."[6]
CLIP's text encoder has a fixed context window: the released models use 77 token positions, two of which hold the start-of-text and end-of-text tokens, leaving 75 for the description itself. This limits the complexity of text descriptions it can process, and long or detailed descriptions are truncated, potentially losing important information.[1] Some downstream applications address this by chunking long texts and aggregating their embeddings, or by replacing the CLIP text encoder with a longer-context one like Long-CLIP or T5.
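The chunk-and-aggregate workaround can be sketched as follows. This is an illustrative outline, assuming a 77-position window in which two positions are reserved for the start and end tokens; the special-token ids are passed as parameters whose defaults are assumed to match the released tokenizer, and the step that turns each chunk into an embedding is left to whatever text encoder is in use.

```python
import math

def chunk_token_ids(token_ids, context_length=77, sot=49406, eot=49407, pad=0):
    """Split a long token sequence into padded windows of `context_length`.

    Each window is laid out as [sot] + up to (context_length - 2) tokens
    + [eot], followed by padding.
    """
    body = context_length - 2
    chunks = []
    for start in range(0, max(len(token_ids), 1), body):
        piece = token_ids[start:start + body]
        row = [sot] + piece + [eot]
        row += [pad] * (context_length - len(row))
        chunks.append(row)
    return chunks

def mean_pool(embeddings):
    """Average per-chunk embeddings and L2-normalise the result."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]
```

Mean pooling is only one possible aggregation. Because chunk boundaries fall at arbitrary token positions, phrases that straddle a boundary are split, which is one reason longer-context encoders are often preferred.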
CLIP's zero-shot classification outputs are not well-calibrated, meaning the similarity scores do not correspond directly to reliable probability estimates. The model may produce high confidence scores for incorrect predictions, and the scores are sensitive to the number and composition of candidate classes being evaluated.[1]
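The dependence on the candidate set follows directly from how zero-shot probabilities are formed: a softmax over scaled similarities to whichever prompts are supplied. The sketch below uses made-up cosine similarities and an assumed logit scale of 100 to show the same image receiving very different "confidence" for the same label once a plausible distractor is added.

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities."""
    logits = [logit_scale * s for s in similarities]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three prompts.
sims = {"dog": 0.26, "cat": 0.22, "wolf": 0.25}

two_way = zero_shot_probs([sims["dog"], sims["cat"]])
three_way = zero_shot_probs([sims["dog"], sims["cat"], sims["wolf"]])

print(round(two_way[0], 2))    # "dog" among {dog, cat}: about 0.98
print(round(three_way[0], 2))  # "dog" among {dog, cat, wolf}: about 0.72
```

The image and its similarity to "dog" are unchanged between the two calls; only the list of alternatives differs, so the probability is better read as a ranking within the candidate set than as a calibrated estimate.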
CLIP's WIT training data was scraped from the English-speaking web, which produces several biases. The model performs best on concepts well-represented in English internet content and worse on culturally specific concepts, non-Latin scripts, and specialized domain imagery (medicine, satellite, scientific) that is underrepresented online.[1] Birhane, Prabhu, and Kahembwe (2021) audited LAION-400M, the open analogue of WIT, and documented misogynistic, racist, and pornographic imagery in the dataset, raising similar concerns for any CLIP-like model trained on uncurated web data.[39]
Modern text-to-image systems are increasingly moving away from using a frozen CLIP text encoder as the sole conditioning signal. Stable Diffusion 3 and FLUX.1 pair CLIP with T5-XXL, which produces stronger prompt understanding because T5 is a far larger language model pre-trained on text sequences much longer than CLIP's short captions. Some 2024-2025 systems fine-tune the text encoder end-to-end with the diffusion model, accepting the additional cost in exchange for more accurate adherence to long, compositional prompts.[17] CLIP's role in text-to-image generation is shifting from "sole prompt encoder" toward "aesthetic and style encoder used in combination with a larger LLM-style text model."
CLIP has had a broad influence on the field of machine learning and computer vision.
The paper demonstrated that natural language supervision can serve as a scalable training signal for visual representations, opening an alternative to the traditional approach of collecting fixed-label datasets. This insight has been adopted widely: the concurrent ALIGN, later models such as Florence, CoCa, BLIP, and SigLIP, and data-centric efforts such as DataComp all build on the same principle.
CLIP's image encoder has become a standard visual backbone in multimodal systems. Models such as LLaVA, MiniGPT-4, Qwen-VL, InternVL, and many open-source vision-language models use CLIP or OpenCLIP encoders as their visual front end.[23] CLIP's text encoder is the conditioning mechanism in Stable Diffusion 1.x and is one of several encoders used in SDXL, SD3, and FLUX.[4][17]
The CLIP embedding space has become a de facto standard for measuring image-text alignment. CLIPScore is widely used as an automatic evaluation metric for text-to-image generation models[22], and CLIP retrieval indices (clip-retrieval, FAISS-based search engines over LAION-5B) are common tools for dataset analysis and example mining.[11]
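As an illustration of the metric, the reference-free form of CLIPScore is a rescaled, clipped cosine similarity between an image embedding and a caption embedding. The sketch below assumes the embeddings have already been produced by a CLIP model and uses the rescaling weight of 2.5 from the CLIPScore paper; the function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

def top_k(query_emb, index, k=3):
    """Brute-force retrieval: ids of the k most similar index entries.

    `index` maps an id to its embedding; large-scale systems would use
    an approximate nearest-neighbour library instead.
    """
    ranked = sorted(index, key=lambda key: cosine(query_emb, index[key]),
                    reverse=True)
    return ranked[:k]
```

Clipping at zero means unrelated and contradictory captions both score 0, so the metric distinguishes degrees of alignment rather than degrees of mismatch.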
By demonstrating that a single model trained with natural language supervision could match or exceed the performance of task-specific supervised models across dozens of datasets, CLIP shifted the field toward foundation models that learn general-purpose representations from broad data, rather than narrow models trained for individual tasks. Five years after its initial release, CLIP and its open-source descendants remain central infrastructure for vision-language AI.