CLIP (Contrastive Language-Image Pre-training) is a multimodal neural network developed by OpenAI that learns to associate images with natural language descriptions. Introduced in January 2021 by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, CLIP was trained on approximately 400 million image-text pairs collected from the internet using a contrastive learning objective [1]. The paper, titled "Learning Transferable Visual Models From Natural Language Supervision," was published at ICML 2021 and has since become one of the most influential works in modern AI, catalyzing the development of text-to-image generation systems like DALL-E, Stable Diffusion, and a wide range of multimodal applications.
CLIP's central insight is deceptively simple: rather than training a vision model on a fixed set of categorical labels (as was standard practice with ImageNet classification), train it to predict which textual description goes with which image. This approach produces a model that understands visual concepts in terms of natural language, enabling remarkable flexibility. CLIP can perform zero-shot classification on new tasks simply by being given textual descriptions of the target categories, without any task-specific training data or fine-tuning.
Before CLIP, the dominant paradigm in computer vision involved training models on large labeled datasets, most notably ImageNet, with its 1,000 predefined object categories. While this approach produced highly capable classifiers, it had fundamental limitations. Models trained on fixed label sets could only recognize the categories they were trained on. Extending a model to new categories required collecting and annotating new training data, then retraining or fine-tuning the model. This made visual recognition systems brittle and expensive to adapt.
Meanwhile, natural language processing had undergone a revolution with models like GPT-2 and GPT-3, which demonstrated that pre-training on massive amounts of internet text could produce models with strong zero-shot and few-shot capabilities across a wide range of tasks. The question Radford et al. asked was whether a similar approach could work for vision: could a model trained on the vast quantity of image-text pairs naturally occurring on the internet learn visual representations that generalize broadly, without relying on manually curated label sets?
Earlier work had explored related ideas. ConVIRT (2020) applied contrastive learning to medical image-text pairs, and ALIGN (Google, 2021) trained on a noisy dataset of 1.8 billion image-alt-text pairs. But CLIP was the first to demonstrate, at scale, that contrastive pre-training on image-text pairs could match or exceed the performance of fully supervised models on standard benchmarks, while simultaneously enabling zero-shot transfer to dozens of new tasks.
CLIP's architecture and training procedure are built around the idea of learning a shared embedding space where images and text can be directly compared.
CLIP consists of two separate neural network encoders that operate in parallel:
Image Encoder: Processes an input image and produces a fixed-dimensional vector representation. OpenAI trained CLIP with two families of image encoders: ResNet-based models (modified ResNets with attention pooling) and Vision Transformer (ViT) models. The best-performing variant used a ViT-L/14 architecture, which divides the image into 14x14 pixel patches and processes them through a transformer encoder with 24 layers, a hidden dimension of 1,024, and 16 attention heads.
Text Encoder: Processes a text string (such as a caption or category description) and produces a vector in the same embedding space. The text encoder is a standard transformer with 12 layers, a hidden dimension of 512, and 8 attention heads, operating on byte pair encoded (BPE) tokenized text with a maximum context length of 76 tokens.
Both encoders project their outputs into a shared embedding space, ranging from 512 to 1,024 dimensions depending on the model variant, where cosine similarity can be used to measure how well an image and a piece of text correspond to each other.
CLIP is trained using a symmetric contrastive loss inspired by the InfoNCE objective. Given a batch of N image-text pairs, the model computes the cosine similarity between every image embedding and every text embedding, producing an N x N matrix of similarity scores in which the N correct pairings lie on the diagonal.
The training objective is applied symmetrically: for each image, the matching text should be the most similar among all texts in the batch, and for each text, the matching image should be the most similar among all images. A learnable temperature parameter scales the logits before the softmax computation.
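The symmetric objective described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not OpenAI's code, and it uses a fixed temperature for simplicity where the real model learns it:

```python
# Sketch of CLIP's symmetric contrastive loss (illustrative; the real
# implementation learns the temperature and runs on GPU tensors).
import numpy as np

def softmax_xent(logits, targets):
    # Cross-entropy of each row of logits against its target column index.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products equal cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # N x N similarity matrix; the i-th image matches the i-th text.
    logits = image_emb @ text_emb.T / temperature
    targets = np.arange(logits.shape[0])

    # Symmetric: classify the right text given an image, and vice versa.
    loss_images = softmax_xent(logits, targets)    # image -> text direction
    loss_texts = softmax_xent(logits.T, targets)   # text -> image direction
    return (loss_images + loss_texts) / 2
```

With perfectly aligned embeddings the loss approaches zero; with random embeddings each row is roughly a uniform guess over the batch, so the loss approaches log N.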
This contrastive approach is far more efficient than alternative strategies the authors considered. They initially experimented with a generative objective (predicting the exact text for each image, similar to an image captioning model), but found it learned slowly; a simpler bag-of-words prediction baseline was 3x more efficient, and switching to the contrastive objective improved efficiency by a further 4x [1]. The contrastive formulation also scales naturally with batch size, as larger batches provide more negative examples per positive pair.
CLIP was trained on a proprietary dataset called WebImageText (WIT), consisting of approximately 400 million image-text pairs scraped from the internet. OpenAI constructed WIT by searching for images associated with a set of 500,000 text queries derived from Wikipedia and other sources, filtering for quality and relevance. The dataset was not publicly released.
The scale and diversity of WIT is central to CLIP's success. By training on hundreds of millions of naturally occurring image-text pairs covering an enormous range of visual concepts, CLIP learned to associate images with a far broader vocabulary of concepts than any fixed-category dataset could provide.
Training CLIP at scale required substantial computational resources. In the original paper, the authors trained 5 ResNet variants and 3 ViT variants, each for 32 epochs on WIT. The largest ResNet model (RN50x64) required 18 days of training on 592 NVIDIA V100 GPUs. The largest ViT model (ViT-L/14) took 12 days on 256 V100 GPUs. OpenAI later trained a ViT-L/14 variant at 336-pixel resolution, which further improved performance.
The original CLIP paper evaluated several model configurations to study the effect of architecture and scale on performance.
| Model | Image Encoder | Image Resolution | Parameters (approx.) | ImageNet Zero-Shot Top-1 (%) |
|---|---|---|---|---|
| RN50 | ResNet-50 | 224 | 102M | 58.2 |
| RN101 | ResNet-101 | 224 | 120M | 62.2 |
| RN50x4 | ResNet-50 (4x width) | 288 | 178M | 65.8 |
| RN50x16 | ResNet-50 (16x width) | 384 | 420M | 70.6 |
| RN50x64 | ResNet-50 (64x width) | 448 | 840M | 73.6 |
| ViT-B/32 | ViT-Base, 32px patches | 224 | 151M | 63.2 |
| ViT-B/16 | ViT-Base, 16px patches | 224 | 150M | 68.3 |
| ViT-L/14 | ViT-Large, 14px patches | 224 | 428M | 75.5 |
| ViT-L/14@336px | ViT-Large, 14px patches | 336 | 428M | 76.2 |
The ViT-L/14@336px model achieved the highest zero-shot accuracy at 76.2% on ImageNet, matching the performance of a fully supervised ResNet-50 trained on the full ImageNet training set, despite never seeing a single ImageNet training example [1].
CLIP's most celebrated capability is zero-shot image classification: classifying images into categories the model was never explicitly trained on. The process works as follows:
Define categories in natural language: For a given classification task, each category is described using a text prompt. For example, for ImageNet classification, the 1,000 class names are each embedded in a template like "a photo of a {class name}."
Compute text embeddings: The text encoder processes each category description and produces a set of text embedding vectors, one per category.
Compute image embedding: The image encoder processes the input image and produces a single image embedding vector.
Match by similarity: The image embedding is compared to all text embeddings using cosine similarity. The category with the highest similarity is the predicted class.
This process requires no gradient updates, no labeled training data for the target task, and no modification to the model. The user simply describes what they want to classify in natural language.
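The matching step at the heart of this procedure can be sketched with toy embeddings. The `encode_image`/`encode_text` calls of a real CLIP checkpoint are replaced here by hand-built vectors; only the similarity-ranking logic is shown:

```python
# Zero-shot classification by cosine similarity, using toy embeddings in
# place of real CLIP encoder outputs (illustrative sketch).
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    # Normalize, then rank classes by cosine similarity to the image.
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True)
    sims = class_text_embs @ image_emb
    return class_names[int(np.argmax(sims))]

# Toy example: three classes embedded in a 4-dimensional space.
classes = ["cat", "dog", "car"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
img = np.array([0.1, 0.9, 0.0, 0.1])   # closest to the "dog" direction
print(zero_shot_classify(img, text_embs, classes))  # → dog
```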
The authors found that the choice of text prompt significantly affects zero-shot performance. A bare class name like "dog" performed worse than a contextualized prompt like "a photo of a dog." To further improve accuracy, they used prompt ensembling: computing text embeddings for multiple prompt templates per class (e.g., "a photo of a dog," "a good photo of a dog," "a close-up photo of a dog") and averaging the resulting embeddings. Using 80 different prompt templates improved ImageNet zero-shot accuracy by approximately 3.5 percentage points over single-prompt evaluation [1].
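Prompt ensembling reduces to embedding each template, normalizing, averaging, and re-normalizing. The sketch below uses a hypothetical `encode_text` stand-in (here a toy seeded-random embedding) rather than CLIP's actual text encoder:

```python
# Prompt ensembling sketch: average normalized embeddings over several
# prompt templates per class. toy_encode stands in for CLIP's text
# encoder and is purely illustrative.
import numpy as np

TEMPLATES = [
    "a photo of a {}.",
    "a good photo of a {}.",
    "a close-up photo of a {}.",
]

def ensemble_class_embedding(class_name, encode_text):
    embs = []
    for template in TEMPLATES:
        e = encode_text(template.format(class_name))
        embs.append(e / np.linalg.norm(e))   # normalize each prompt embedding
    mean = np.mean(embs, axis=0)
    return mean / np.linalg.norm(mean)       # re-normalize the average

def toy_encode(text):
    # Deterministic fake embedding keyed on the text (illustration only).
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.normal(size=8)

dog_emb = ensemble_class_embedding("dog", toy_encode)
```

In the real pipeline, these ensembled class embeddings are computed once and reused for every image, so the extra templates add no per-image cost.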
CLIP's results demonstrated several breakthrough capabilities that reshaped the field.
The headline result was that CLIP's ViT-L/14@336px achieved 76.2% top-1 accuracy on ImageNet in a zero-shot setting, matching a fully supervised ResNet-50 (76.1%). This was striking because CLIP had never seen any ImageNet training images or labels. The model's visual understanding was derived entirely from natural language supervision on web-scraped image-text pairs.
Beyond ImageNet, the authors evaluated CLIP on 27 diverse datasets spanning general object classification, fine-grained recognition (flowers, cars, aircraft, food), scene recognition, texture classification, satellite imagery, and OCR. CLIP achieved competitive or state-of-the-art zero-shot performance on many of these benchmarks, outperforming task-specific supervised baselines in some cases.
| Dataset | Task | CLIP Zero-Shot Accuracy (%) | Best Prior Zero-Shot (%) |
|---|---|---|---|
| ImageNet | Object classification | 76.2 | 11.5 (Visual N-Grams) |
| CIFAR-10 | Object classification | 95.7 | - |
| CIFAR-100 | Object classification | 77.5 | - |
| STL-10 | Object classification | 99.3 | - |
| Stanford Cars | Fine-grained (car models) | 77.3 | - |
| Food101 | Fine-grained (food) | 92.9 | - |
| Oxford Pets | Fine-grained (pet breeds) | 93.5 | - |
| Flowers102 | Fine-grained (flower species) | 79.2 | - |
| Country211 | Geolocation | 31.8 | - |
| SST-2 | Sentiment analysis (from images of text) | 60.2 | - |
CLIP showed notably better robustness to natural distribution shifts compared to ImageNet-trained models. On variants like ImageNet-V2, ImageNet-R, ImageNet-Sketch, and ObjectNet (which test how well models generalize beyond the standard ImageNet test set), CLIP maintained higher accuracy relative to its baseline performance. This suggested that learning from natural language descriptions produces more generalizable visual features than learning from a fixed label taxonomy.
For the strongest models, CLIP uses a Vision Transformer as the image encoder. The input image is divided into fixed-size patches (e.g., 14x14 pixels for ViT-L/14). Each patch is linearly embedded and combined with learnable positional encodings. The resulting sequence of patch embeddings is processed through multiple transformer layers with multi-head self-attention and feed-forward networks. The final image representation is taken from a special [CLS] token prepended to the sequence, which is then linearly projected into the shared embedding space.
For ResNet-based variants, CLIP uses a modified ResNet architecture. The modifications include replacing the global average pooling layer with an attention pooling mechanism (a single layer of multi-head attention), using anti-aliased rect-2 blur pooling, and adjusting the stem to use three convolutions instead of one. These modifications improve performance while maintaining the overall ResNet structure.
The text encoder is a standard transformer with masked self-attention (causal attention, as in GPT). Text is tokenized using byte pair encoding with a vocabulary of 49,152 tokens and a maximum sequence length of 76 tokens. The representation of the [EOS] (end of sequence) token at the final layer serves as the text embedding, which is linearly projected into the shared space.
Both the image and text representations are linearly projected into a shared embedding space and L2-normalized. Similarity between an image and text is measured using cosine similarity, scaled by a learnable temperature parameter that is optimized during training as a log-parameterized multiplicative scalar.
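The log-parameterization of the temperature can be made concrete in a short sketch. The initialization to 1/0.07 and the clamp preventing scales above 100 follow the paper; variable names are illustrative:

```python
# Sketch of CLIP's learnable logit scale. Optimizing the log of the
# scale guarantees it stays positive; the clamp at 100 follows the paper.
import numpy as np

log_logit_scale = np.log(1 / 0.07)   # learnable parameter in the real model

def scaled_logits(image_emb, text_emb):
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scale = min(np.exp(log_logit_scale), 100.0)   # clamp, as in the paper
    return scale * image_emb @ text_emb.T
```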
CLIP's impact on the AI field has been profound and wide-ranging, extending far beyond the zero-shot classification results reported in the original paper.
Perhaps CLIP's most consequential downstream impact was its role in enabling modern text-to-image generation. DALL-E 2 (OpenAI, 2022) used CLIP embeddings as the bridge between text and image generation: a prior model maps CLIP text embeddings to CLIP image embeddings, which then condition a diffusion model to generate images [2]. Stable Diffusion and other latent diffusion models use CLIP's text encoder to produce the conditioning signal that guides image generation through cross-attention [3]. Without CLIP's ability to create a semantically meaningful shared space between language and vision, these generation systems would not have been possible in their current form.
CLIP demonstrated that joint vision-language pre-training could produce versatile visual backbones. The CLIP ViT encoder has become one of the most widely adopted image encoders in the field, serving as the visual component in multimodal models including LLaVA, GPT-4V, InstructBLIP, and many other vision-language models. These systems typically freeze or lightly fine-tune CLIP's image encoder and connect it to a large language model through a projection layer or adapter.
CLIP's text-image alignment enabled a new category of vision models that can detect or segment objects described in natural language, rather than from a fixed set of categories. Systems like OWL-ViT (Google), Grounding DINO, and ODISE leverage CLIP features to perform open-vocabulary object detection and segmentation, where the user specifies target objects via text prompts.
CLIP's shared embedding space naturally supports cross-modal search. Users can search an image database using text queries (or search a text database using image queries) by computing similarities in the CLIP embedding space. This capability powers numerous commercial and open-source image search systems.
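A text-to-image search over a pre-embedded database is just a ranked cosine-similarity lookup. The sketch below assumes embeddings have already been computed (the vectors here are toy values, not real CLIP outputs):

```python
# Cross-modal retrieval sketch: rank database image embeddings by cosine
# similarity to a query embedding from the other modality.
import numpy as np

def search_images(query_emb, image_embs, top_k=3):
    query_emb = query_emb / np.linalg.norm(query_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ query_emb
    # Indices of the top_k most similar images, best first.
    return np.argsort(-sims)[:top_k]
```

Production systems apply the same ranking via approximate nearest-neighbor indexes so that databases with millions of images can be searched without a full scan.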
CLIP-based classifiers are widely used for automated content moderation, NSFW filtering, and image safety classification. Because CLIP can classify images using arbitrary text descriptions, it can be rapidly adapted to new content categories without collecting labeled training data.
The success of CLIP inspired a rich ecosystem of variants, reproductions, and improvements.
| Model | Organization | Year | Key Difference from CLIP | Notable Achievement |
|---|---|---|---|---|
| OpenCLIP | LAION / ML Foundations | 2022-present | Open-source reproduction; trained on LAION-2B and DataComp-1B | Reproduced and exceeded original CLIP performance with open data |
| SigLIP | Google | 2023 | Sigmoid loss instead of softmax contrastive loss | Better performance at smaller batch sizes; more efficient training |
| CLIPA | Various | 2023 | Aggressive training efficiency optimizations | 81.8% ImageNet zero-shot within $14,000 compute budget |
| EVA-CLIP | BAAI | 2023 | Initialized from EVA pre-trained ViT; scaled to ViT-E (4.4B params) | 82.0% zero-shot ImageNet with ViT-E/14 |
| MetaCLIP | Meta | 2023 | Curated training data using metadata from CLIP's data pipeline | Outperformed original CLIP with same architecture; better data curation |
| ALIGN | Google | 2021 | Trained on 1.8B noisy image-alt-text pairs; EfficientNet encoder | Demonstrated scale can compensate for noisier data |
| DFN (Data Filtering Networks) | Apple | 2023 | Used data quality filtering networks to curate training data | Improved data efficiency and final model quality |
| InternVL | Shanghai AI Lab | 2024 | Scaled vision-language alignment to 6B vision encoder | Strong performance on vision-language benchmarks |
OpenCLIP is an open-source reproduction of CLIP developed by the ML Foundations group, with contributions from researchers at LAION, University of Washington, and other institutions [4]. OpenCLIP provides training code, evaluation tools, and pre-trained models covering a wide range of architectures and training datasets, including LAION-400M, LAION-2B, and DataComp-1B. OpenCLIP models have matched and in some configurations exceeded the performance of OpenAI's original CLIP models. The project also published detailed studies on the scaling properties of contrastive language-image learning, establishing reproducible scaling laws for the CLIP training paradigm.
SigLIP, introduced by Google Research in 2023, replaces CLIP's softmax-based contrastive loss with a simpler sigmoid loss applied independently to each image-text pair in the batch [5]. This change eliminates the need for a global normalization across the batch, making the loss function more memory-efficient and enabling training with larger batch sizes on the same hardware. SigLIP also removes the need for inter-device communication of embeddings during distributed training. Empirically, SigLIP performs comparably to or better than CLIP, particularly at smaller batch sizes (4,000 to 8,000), where CLIP's softmax loss becomes less effective due to fewer negative examples. Google also trained a multilingual SigLIP variant supporting multiple languages.
MetaCLIP, developed by researchers at Meta, Facebook AI Research, New York University, and the University of Washington, focused on understanding and improving the data curation process used to train CLIP models [6]. The authors reverse-engineered CLIP's data collection methodology and proposed a metadata-based data curation algorithm that balances concept frequencies across the training set. Using this curated data pipeline, MetaCLIP matched or exceeded the original CLIP's performance on standard benchmarks while using publicly available data sources. Meta subsequently released MetaCLIP 2, which further scaled the curation recipe.
EVA-CLIP, developed by the Beijing Academy of Artificial Intelligence (BAAI), combined the EVA pre-training approach (masked image modeling with CLIP feature targets) with standard CLIP contrastive training [7]. By initializing the vision encoder from an EVA-pretrained model rather than from scratch, EVA-CLIP achieved stronger performance, especially at very large model scales. EVA-CLIP scaled the vision encoder to ViT-E (4.4 billion parameters), achieving 82.0% zero-shot ImageNet accuracy.
CLIP and its variants are used across a wide range of practical applications.
CLIP text encoders serve as the primary text conditioning mechanism in most modern image generation systems. In Stable Diffusion 1.x and 2.x, the CLIP text encoder converts user prompts into embedding vectors that guide the denoising process through cross-attention layers in the U-Net [3]. Stable Diffusion 3 uses two CLIP text encoders alongside a T5 text encoder. DALL-E 2 conditions on CLIP image embeddings generated from CLIP text embeddings via a prior model.
CLIP is widely used for classification tasks where labeled training data is scarce or unavailable. Researchers and practitioners can classify images into any set of categories by simply providing textual descriptions, making CLIP especially valuable for specialized domains (medical imaging, satellite analysis, industrial inspection) where collecting labeled datasets is expensive.
CLIP powers cross-modal retrieval systems that allow searching images with text and vice versa. This is used in commercial photo libraries, e-commerce product search, content management systems, and research tools.
CLIP's image encoder serves as the visual backbone in many visual question answering and image captioning systems. Models like LLaVA, InstructBLIP, and BLIP-2 connect CLIP's visual features to large language models to enable conversational visual understanding.
CLIP features have been adopted in robotics for grounding natural language instructions in visual observations. Systems like CLIPort and SayCan use CLIP to connect language commands with visual perception, enabling robots to identify and manipulate objects described in natural language.
Despite its versatility, CLIP has several well-documented limitations.
CLIP struggles with tasks requiring fine-grained discrimination (distinguishing between similar car models or bird species at expert level), counting objects in images, understanding spatial relationships ("the cat is to the left of the dog"), and performing systematic compositional reasoning. The contrastive training objective encourages matching images to captions at a holistic level, but does not explicitly teach the model to parse detailed spatial or quantitative information.
CLIP is famously vulnerable to typographic attacks: overlaying text on an image can cause CLIP to misclassify it. For example, placing the word "iPod" on a picture of an apple can cause CLIP to classify it as an iPod rather than an apple [1]. This vulnerability arises because CLIP has learned to heavily weight textual content visible in images, treating it as a strong signal for classification. Studies have shown that typographic attacks cause an average performance drop of approximately 34.7% across CLIP models [8].
CLIP inherits biases present in its internet-scraped training data. Audits have found that CLIP misclassified 4.9% of images into non-human categories, with photos of Black individuals experiencing the highest misclassification rate at roughly 14% [1]. The model also encodes gender and racial stereotypes; for instance, it associates certain occupations more strongly with particular genders or racial groups. These biases propagate to downstream systems that rely on CLIP embeddings.
While CLIP shows improved robustness to natural distribution shifts (variations in style, context, and viewpoint), it is more vulnerable to adversarial attacks than standard ImageNet-trained models. Under synthetic perturbations and transfer-based attacks, CLIP's accuracy degrades by an average of 11.8% more than comparable standard models [8]. CLIP also produces overconfident incorrect predictions when attacked, making failure detection harder.
OpenAI did not release the WIT training dataset, making it impossible for external researchers to fully audit what the model learned, identify specific biases, or reproduce the exact training setup. This opacity has been partially addressed by community efforts like OpenCLIP and MetaCLIP, which train on publicly available data.
CLIP's text encoder has a maximum context length of 76 tokens, which limits its ability to process detailed or lengthy text descriptions. This constraint becomes apparent in applications like image generation, where users write complex multi-sentence prompts that exceed CLIP's context window. Later text encoders (such as T5) have been added alongside CLIP in systems like Stable Diffusion 3 to address this limitation.
As of early 2026, CLIP remains one of the most widely used and cited models in AI research and industry. Its core architecture has proven remarkably durable.
CLIP's text encoder continues to serve as the default text conditioning mechanism in diffusion-based image generation, though newer systems increasingly supplement it with larger language model encoders. The CLIP ViT image encoder remains the most common visual backbone in vision-language models, though alternatives like SigLIP and DINOv2 are gaining ground for specific use cases.
The CLIP training paradigm has influenced virtually every subsequent vision-language model. The idea of learning visual representations through natural language supervision, rather than discrete labels, has become a foundational principle in modern AI. Research continues to refine the contrastive learning objective, improve data curation, scale to larger models and datasets, and address CLIP's known limitations around compositionality, bias, and adversarial robustness.
Recent developments include the integration of CLIP-style alignment into video understanding models, 3D scene understanding, audio-visual learning, and medical imaging. The concept of aligning modalities through contrastive learning in a shared embedding space, which CLIP popularized for vision and language, has become a general-purpose tool applied across many combinations of modalities.