CLIP Score (also written as CLIPScore or CLIP-S) is an automatic evaluation metric that measures the compatibility between an image and a text caption using CLIP (Contrastive Language-Image Pre-training) embeddings. Introduced by Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi in their 2021 paper "CLIPScore: A Reference-free Evaluation Metric for Image Captioning," the metric computes the cosine similarity between the visual and textual embeddings produced by a pretrained CLIP model. Unlike traditional captioning metrics such as BLEU, METEOR, and CIDEr, CLIP Score does not require human-written reference captions, making it a reference-free metric. It was presented at the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) and has since become one of the most widely adopted metrics for evaluating both image captioning systems and text-to-image generation models.
Before CLIP Score, evaluation of image captioning systems relied heavily on reference-based metrics borrowed from machine translation and text summarization. These metrics compare a generated caption against one or more human-written reference captions and include:
| Metric | Type | What It Measures |
|---|---|---|
| BLEU | N-gram overlap | Precision of n-gram matches between candidate and reference |
| METEOR | Alignment-based | Unigram matches with stemming, synonymy, and paraphrasing |
| ROUGE | N-gram overlap | Recall-oriented n-gram overlap |
| CIDEr | TF-IDF weighted | Consensus-based n-gram similarity using TF-IDF weighting |
| SPICE | Semantic | Semantic propositional content via scene graphs |
| BERTScore | Embedding-based | Token-level similarity using BERT contextual embeddings |
These metrics share a fundamental limitation: they require human-written reference captions. Collecting high-quality reference captions is expensive and time-consuming. More importantly, there are many valid ways to describe an image, and reference-based metrics penalize correct captions that happen to use different words or sentence structures than the references. The captions "A dog plays in the park" and "A puppy running on grass" may describe the same image accurately, but n-gram overlap metrics would give the second caption a low score if only the first were available as a reference.
OpenAI released CLIP (Contrastive Language-Image Pre-training) in January 2021. CLIP was trained on approximately 400 million image-text pairs collected from the internet using a contrastive learning objective. During training, CLIP learned to map images and their corresponding text descriptions to nearby points in a shared embedding space, while pushing non-matching image-text pairs apart. The resulting model demonstrated remarkable zero-shot transfer learning capabilities across a wide range of visual tasks.
Hessel et al. recognized that CLIP's learned alignment between visual and textual representations could serve as the foundation for a captioning evaluation metric. If CLIP could reliably judge whether an image and a piece of text "go together," then it could evaluate caption quality without needing any reference captions at all. This insight led to the development of CLIP Score.
CLIP Score measures the semantic similarity between an image and a caption by computing the cosine similarity of their CLIP embeddings. The process involves three steps:
1. Encode the image with CLIP's visual encoder to obtain an image embedding.
2. Encode the candidate caption (with the "A photo depicts" prefix described below) with CLIP's text encoder to obtain a text embedding.
3. Compute the cosine similarity between the two embeddings, clip it at zero, and rescale it.
The CLIP Score for a candidate caption with text embedding c and an image with visual embedding v is defined as:
CLIP-S(c, v) = w * max(cos(c, v), 0)
Where:
- c is the CLIP text embedding of the candidate caption
- v is the CLIP visual embedding of the image
- w is a scaling constant, set to 2.5 in the original paper
The cosine similarity itself is computed as:
cos(c, v) = (c · v) / (||c|| * ||v||)
Since CLIP embeddings are L2-normalized before the computation, the dot product of the normalized vectors directly yields the cosine similarity.
The scaling constant w = 2.5 was chosen by the authors to rescale the scores into a more interpretable range. Raw cosine similarities from CLIP typically fall in a narrow band, so multiplying by 2.5 spreads the scores out, making differences between good and poor captions more apparent.
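A minimal NumPy sketch of this computation, operating on precomputed embeddings (the random vectors below are stand-ins for real CLIP encoder outputs):

```python
import numpy as np

def clip_score(text_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0), given precomputed embeddings."""
    # L2-normalize so the dot product equals the cosine similarity.
    c = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return w * max(float(c @ v), 0.0)

# Random stand-ins; real embeddings come from CLIP's text and image encoders.
rng = np.random.default_rng(0)
text_emb, image_emb = rng.normal(size=512), rng.normal(size=512)
print(clip_score(text_emb, image_emb))  # always falls in [0, 2.5]
```

With the clipping and w = 2.5 scaling, the output is bounded between 0 and 2.5, matching the range described below.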
The original CLIPScore implementation prepends the text "A photo depicts" to every candidate caption before passing it through CLIP's text encoder. This prompt engineering technique leverages CLIP's training distribution, where many of the 400 million training pairs consisted of photos with descriptive alt-text. By framing the caption as a description of a photo, the prompt helps CLIP produce more discriminative embeddings. The authors note that changing this prompt affects the resulting scores, and they recommend keeping it fixed for reproducibility.
In the original implementation with w = 2.5, CLIP Score values typically range from 0 to approximately 2.5. A score of 0 indicates no alignment between the image and caption (or negative alignment, which is clipped), while higher scores indicate stronger semantic correspondence.
In the PyTorch Metrics (torchmetrics) implementation, the formula uses a factor of 100 instead of 2.5:
CLIPScore(I, C) = max(100 * cos(E_I, E_C), 0)
This variant produces scores bounded between 0 and 100, which can be more intuitive to interpret.
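Since the two conventions differ only in the scaling constant, converting between them is a one-liner; a quick sketch (the function and convention names here are illustrative, not part of either library):

```python
def rescale_clip_score(raw_cosine: float, convention: str = "original") -> float:
    """Map a raw CLIP cosine similarity onto a reporting convention."""
    clipped = max(raw_cosine, 0.0)  # both conventions clip negatives to zero
    factors = {"raw": 1.0, "original": 2.5, "torchmetrics": 100.0}
    return factors[convention] * clipped

print(rescale_clip_score(0.5, "original"))      # 1.25
print(rescale_clip_score(0.5, "torchmetrics"))  # 50.0
```

Reporting which convention was used matters: the same raw similarity of 0.5 appears as 1.25 in the original paper's scale and as 50 on the torchmetrics scale.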
While CLIP Score is designed to work without references, the authors also proposed RefCLIPScore for situations where reference captions are available. RefCLIPScore combines the reference-free CLIP Score with a text-only similarity component.
RefCLIPScore is computed as the harmonic mean (F-score) of two components: the reference-free CLIP-S between the candidate caption and the image, and the maximum cosine similarity between the candidate caption and the reference captions in CLIP's text embedding space.
The formula is:
RefCLIP-S = (2 * CLIP-S(c, v) * max_r cos(c, r)) / (CLIP-S(c, v) + max_r cos(c, r))
Where:
- CLIP-S(c, v) is the reference-free CLIP Score defined above
- r ranges over the CLIP text embeddings of the available reference captions
- max_r cos(c, r) is the highest cosine similarity between the candidate caption embedding and any reference embedding
The harmonic mean is a natural choice for combining precision-like and recall-like components, as it penalizes cases where one score is very low while the other is high. A caption that perfectly matches the image but uses completely different language from all references, or one that copies a reference verbatim but does not match the image, would both receive a lower RefCLIPScore than a caption that performs well on both dimensions.
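Given precomputed embeddings, the harmonic-mean combination can be sketched as follows (random vectors would stand in for real CLIP embeddings; this is a sketch of the formula above, not the official implementation):

```python
import numpy as np

def ref_clip_score(text_emb, image_emb, ref_embs, w: float = 2.5) -> float:
    """Harmonic mean of CLIP-S(c, v) and the best candidate-reference cosine."""
    unit = lambda x: x / np.linalg.norm(x)  # L2-normalize
    c, v = unit(text_emb), unit(image_emb)
    clip_s = w * max(float(c @ v), 0.0)                  # CLIP-S(c, v)
    ref_sim = max(float(c @ unit(r)) for r in ref_embs)  # max_r cos(c, r)
    if clip_s <= 0.0 or ref_sim <= 0.0:
        return 0.0  # harmonic mean is zero when either component vanishes
    return 2 * clip_s * ref_sim / (clip_s + ref_sim)
```

Note that the guard clause also handles the division-by-zero case that the raw formula leaves implicit.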
Hessel et al. evaluated CLIP Score against human judgments on several established benchmarks. The primary evaluation measured Kendall's tau correlation between metric scores and human quality ratings. The key results are summarized below.
| Metric | Kendall τ_c |
|---|---|
| BLEU-4 | 30.8 |
| ROUGE | 32.3 |
| METEOR | 41.8 |
| CIDEr | 43.9 |
| SPICE | 44.9 |
| CLIP-S (CLIPScore) | 51.2 |
On the Flickr8K-Expert corpus, where expert annotators rated caption quality, CLIP Score achieved a Kendall tau correlation of 51.2. This outperformed all reference-based metrics, including CIDEr (43.9) and SPICE (44.9), despite not using any reference captions.
| Metric | Kendall τ_b |
|---|---|
| BLEU-4 | 16.9 |
| ROUGE | 19.9 |
| METEOR | 22.3 |
| SPICE | 24.4 |
| CIDEr | 24.6 |
| CLIP-S (CLIPScore) | 34.4 |
On the Flickr8K-CF (CrowdFlower) corpus, which uses crowd-sourced quality judgments, CLIP Score achieved 34.4, substantially higher than CIDEr's 24.6.
| Metric | Kendall τ_c |
|---|---|
| BLEU-4 | 45.5 |
| ROUGE | 45.5 |
| METEOR | 46.2 |
| CIDEr | 48.0 |
| SPICE | 49.9 |
| CLIP-S (CLIPScore) | 53.8 |
| RefCLIP-S | 55.4 |
On the Composite dataset, CLIP Score reached 53.8 and RefCLIPScore further improved this to 55.4, demonstrating the benefit of incorporating reference information when available.
The paper also reported other notable results: CLIP-S achieved strong pairwise-preference accuracy on the Pascal-50S task and proved highly sensitive to hallucinated objects on the FOIL benchmark, while the authors noted weaker performance on news captions that depend on named entities and external context.
CLIP Score's original and primary use case is evaluating image captioning models. Because no references are required, it enables researchers to benchmark captioning systems on domains where reference captions are unavailable, to evaluate at scale without expensive annotation, and to avoid penalizing correct captions that merely phrase things differently from the references.
CLIP Score has become one of the standard metrics for evaluating text-to-image generation models such as Stable Diffusion, DALL-E, Imagen, and Midjourney. In this context, the metric measures how well a generated image matches the input text prompt.
For example, the Hugging Face Diffusers library includes CLIP Score as a built-in evaluation metric for comparing diffusion model checkpoints. When comparing Stable Diffusion v1.4 and v1.5 on sample prompts, typical CLIP Scores (using the torchmetrics 0-100 scale) were 34.9 and 36.2 respectively, indicating that v1.5 produced images slightly more aligned with the input prompts.
Major text-to-image benchmarks that incorporate CLIP Score include:
| Benchmark | Introduced By | Description |
|---|---|---|
| DrawBench | Google (Imagen) | ~200 prompts across 11 categories including colors, counting, spatial relations, and text rendering |
| PartiPrompts | Google (Parti) | 1,600 prompts spanning 12 categories and 11 challenge aspects |
| HEIM | Stanford CRFM | Holistic evaluation framework for image generation models |
| T2I-CompBench | Huang et al. | Compositional text-to-image generation benchmark |
| GenEval | Ghosh et al. | Object-focused text-to-image evaluation |
A related variant called CLIP Directional Similarity measures the consistency of image edits. Given an original image, an edited image, an original caption, and a modified caption, this metric computes:
DirectionalSimilarity = cos(E_edited - E_original, E_caption_new - E_caption_old)
This measures whether the direction of change in image space aligns with the direction of change in text space. It is particularly useful for evaluating image editing models like InstructPix2Pix.
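Given the four embeddings, the metric is just a cosine between two difference vectors; a minimal NumPy sketch (argument names are illustrative):

```python
import numpy as np

def directional_similarity(img_orig, img_edit, txt_orig, txt_edit) -> float:
    """Cosine between the image-edit direction and the caption-edit direction."""
    d_img = img_edit - img_orig  # direction of change in image embedding space
    d_txt = txt_edit - txt_orig  # direction of change in text embedding space
    return float(d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt)))
```

A score near 1 means the edit moved the image embedding in the same direction the caption change moved the text embedding; a score near 0 means the edit is unrelated to the requested change.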
Beyond evaluation, CLIP similarity is also used inside generative pipelines themselves, for example as a guidance signal in CLIP-guided diffusion sampling and as a reranking criterion for selecting the best of several generated candidates (the original DALL-E reranked its samples with CLIP in this way).
The official CLIPScore implementation is available on GitHub (jmhessel/clipscore). It uses the command-line interface:
```shell
python clipscore.py candidates.json /path/to/images --references_json refs.json
```
Key implementation details: the official implementation uses the ViT-B/32 CLIP backbone, prepends the "A photo depicts" prompt to each candidate, applies the w = 2.5 rescaling, and additionally computes RefCLIPScore when a references JSON file is supplied.
The torchmetrics library provides a widely used implementation:
```python
import torch
from torchmetrics.multimodal import CLIPScore

# Instantiate the metric with a specific CLIP backbone.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Score a (here random) uint8-range image tensor against a caption;
# values fall in [0, 100] under the torchmetrics scaling.
score = metric(torch.randint(255, (3, 224, 224)), "a photo of a cat")
```
Supported CLIP model variants in torchmetrics include:
| Model | Resolution | Notes |
|---|---|---|
| openai/clip-vit-base-patch16 | 224x224 | Base model with patch size 16 |
| openai/clip-vit-base-patch32 | 224x224 | Base model with patch size 32 |
| openai/clip-vit-large-patch14 | 224x224 | Default in torchmetrics; larger model |
| openai/clip-vit-large-patch14-336 | 336x336 | Higher resolution variant |
| zer0int/LongCLIP-L-Diffusers | 224x224 | Supports 248-token sequences (vs. 77 default) |
The Hugging Face Diffusers library includes CLIP Score computation as part of its evaluation documentation, with code examples for comparing diffusion model checkpoints.
Since CLIP Score relies entirely on the CLIP model, it inherits all of CLIP's biases and limitations: societal and demographic biases absorbed from its web-scraped training data, weakness on tasks CLIP handles poorly (such as counting and reading rendered text), and an English-centric training distribution that degrades scores for non-English captions.
One of the most significant limitations of CLIP Score is its weak compositional reasoning: CLIP embeddings behave much like a bag of words, so captions that swap attributes, objects, or spatial relations (for example, "a red cube on a blue sphere" versus "a blue cube on a red sphere") can receive nearly identical scores. Benchmarks such as Winoground and ARO show CLIP performing at or near chance on such distinctions.
Standard CLIP models support a maximum input sequence of 77 tokens. For longer captions or detailed prompts, the text is truncated, potentially losing important information. LongCLIP variants have been developed to address this, supporting up to 248 tokens.
CLIP Scores tend to cluster in a narrow range, making it difficult to distinguish between subtly different caption qualities. Embedding-based metrics like CLIP Score, HPS, and BLIP-2 typically concentrate their scores around 0.25 to 0.5 (on the raw cosine similarity scale), reducing their discriminative power for fine-grained comparisons.
When CLIP Score is used to evaluate images generated by models that were themselves trained or guided using CLIP (such as some Stable Diffusion variants or DALL-E 2's CLIP-based prior), the evaluation becomes somewhat circular. The generator may have learned to produce images that score well on CLIP without necessarily improving perceptual quality.
The limitations of CLIP Score have motivated the development of several alternative and complementary metrics:
| Metric | Year | Approach | Key Improvement |
|---|---|---|---|
| ImageReward | 2023 | Trained on 137K expert comparisons | Outperforms CLIP by 38.6% on human preference prediction |
| HPS v2 | 2023 | Trained on 798K human preference choices | 83.3% accuracy vs. CLIP's 65.1% on preference prediction |
| PickScore | 2023 | Fine-tuned on Pick-a-Pic human preference dataset | Better alignment with user preferences for generated images |
| HPSv3 | 2025 | Wide-spectrum preference modeling | Highest alignment with human judgments among preference models |
These metrics fine-tune CLIP or similar models on human preference data specifically collected for evaluating generated images, resulting in substantially better correlation with what humans actually prefer.
| Metric | Year | Approach |
|---|---|---|
| TIFA | 2023 | Generates question-answer pairs from prompts and checks if a VQA model can answer them from the image |
| DSG | 2023 | Decomposes prompts into scene graphs and evaluates element-by-element |
| VQAScore | 2024 | Uses image-to-text generation models to evaluate alignment; significantly outperforms CLIP Score on compositional prompts |
VQA-based metrics address CLIP Score's compositional weakness by breaking down evaluation into individual verifiable questions.
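The decomposition idea behind TIFA-style metrics can be illustrated with a toy scoring loop; `answer_question` here is a stub standing in for a real VQA model queried about the generated image, and in practice the question-answer pairs would be generated from the prompt by a language model:

```python
def vqa_alignment_score(questions_with_answers, answer_question) -> float:
    """Fraction of prompt-derived questions the VQA model answers correctly."""
    correct = sum(
        1 for question, expected in questions_with_answers
        if answer_question(question) == expected
    )
    return correct / len(questions_with_answers)

# Toy stub: a real implementation would run VQA against the generated image.
# Here the "image" shows a brown dog, while the prompt asked for a black one.
stub = {"Is there a dog?": "yes", "What color is the dog?": "brown",
        "Is the dog on grass?": "yes"}.get

qa = [("Is there a dog?", "yes"), ("What color is the dog?", "black"),
      ("Is the dog on grass?", "yes")]
print(vqa_alignment_score(qa, stub))  # 2 of 3 questions match
```

Because each question is verified independently, a single compositional error (the wrong dog color) lowers the score even when the overall image-text cosine similarity would be high.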
For evaluating the overall quality of a generative model (rather than individual image-text pairs), distribution-based metrics remain important:
- Fréchet Inception Distance (FID): compares the distributions of real and generated images in the feature space of an Inception network
- Inception Score (IS): measures the quality and diversity of generated images from a classifier's predicted label distributions
These metrics are complementary to CLIP Score: FID and IS measure image quality and diversity without considering text alignment, while CLIP Score measures text alignment without directly assessing visual quality.
The choice of CLIP model variant affects the resulting scores. Larger models (e.g., ViT-L/14) generally produce more discriminative scores than smaller ones (e.g., ViT-B/32), but they also require more computation. For evaluating Stable Diffusion outputs, using the same CLIP variant that was employed during the model's training (typically openai/clip-vit-large-patch14) provides the most meaningful results.
Several factors affect CLIP Score reproducibility: the CLIP model variant and checkpoint, the image preprocessing pipeline (resizing and interpolation method), the prompt template applied to captions, truncation of captions beyond 77 tokens, and numerical precision (e.g., float16 versus float32 inference).
For fair comparisons, researchers should report the exact CLIP model used, the preprocessing pipeline, and the hardware configuration.
CLIP Score should be interpreted comparatively rather than absolutely. A score of 0.28 versus 0.25 (on the raw cosine similarity scale) is meaningful as a relative comparison, but there is no universal threshold that defines a "good" score. The distribution of scores varies by domain, CLIP model variant, and the complexity of the text prompts.
When reporting CLIP Scores, it is important to specify which scaling convention is being used (the original w = 2.5 scaling, the torchmetrics 0-100 scaling, or raw cosine similarity) to avoid confusion.
CLIP Score has had a substantial impact on the fields of image captioning and text-to-image generation: it established reference-free evaluation as a viable paradigm, became a default metric in text-to-image benchmarks and libraries, and directly inspired the preference-trained and VQA-based successors described above.
Since its publication in 2021, the paper has become one of the most cited works in the image captioning evaluation literature, and its influence has only grown with the explosion of text-to-image generation research from 2022 through 2025.