# CLIP Score

> Source: https://aiwiki.ai/wiki/clip_score
> Updated: 2026-06-24
> Categories: AI Benchmarks, Computer Vision, Image Generation, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**CLIP Score** (also written **CLIPScore** or **CLIP-S**) is a reference-free automatic evaluation metric that measures how well a text caption matches an image, computed as the rescaled [cosine similarity](/wiki/cosine_similarity) of the image and text embeddings produced by [CLIP](/wiki/clip) (Contrastive Language-Image Pre-training). It is defined as CLIP-S(c, v) = w * max(cos(c, v), 0) with a rescaling constant w = 2.5, where c is the CLIP text embedding of the candidate caption and v is the CLIP visual [embedding](/wiki/embeddings) of the image. [1] Introduced by Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi in the 2021 paper "CLIPScore: A Reference-free Evaluation Metric for Image Captioning," presented at the Conference on Empirical Methods in [Natural Language Processing](/wiki/natural_language_processing) (EMNLP), CLIP Score is "reference-free" because, unlike BLEU, METEOR, [CIDEr](/wiki/cider_metric), and SPICE, it requires no human-written reference captions. [1] The authors reported "the surprising empirical finding that CLIP ... can be used for robust automatic evaluation of image captioning without the need for references," achieving the highest correlation with human judgements among the metrics they tested. [1] It has since become one of the most widely adopted metrics for evaluating both [image captioning](/wiki/image_captioning) systems and [text-to-image generation](/wiki/text_to_image) models, where it measures how faithfully a generated image matches its input prompt.

## Background and Motivation

### Why were reference-free metrics needed?

Before CLIP Score, evaluation of image captioning systems relied heavily on reference-based metrics borrowed from [machine translation](/wiki/machine_translation) and text summarization. These metrics compare a generated caption against one or more human-written reference captions and include:

| Metric | Type | What It Measures |
|--------|------|------------------|
| [BLEU](/wiki/bleu) | N-gram overlap | Precision of n-gram matches between candidate and reference |
| METEOR | Alignment-based | Unigram matches with stemming, synonymy, and paraphrasing |
| ROUGE | N-gram overlap | Recall-oriented n-gram overlap |
| CIDEr | TF-IDF weighted | Consensus-based n-gram similarity using TF-IDF weighting |
| SPICE | Semantic | Semantic propositional content via scene graphs |
| [BERTScore](/wiki/bert) | Embedding-based | Token-level similarity using [BERT](/wiki/bert) contextual embeddings |

These metrics share a fundamental limitation: they require human-written reference captions. Collecting high-quality reference captions is expensive and time-consuming. More importantly, there are many valid ways to describe an image, and reference-based metrics penalize correct captions that happen to use different words or sentence structures than the references. For example, "A dog plays in the park" and "A puppy running on grass" may both describe the same image accurately, but n-gram overlap metrics would give the second caption a low score if only the first were available as a reference. The paper frames this as a mismatch with how people actually work: reference-based evaluation "is in contrast to the reference-free manner in which humans assess caption quality." [1]

### The Arrival of CLIP

[OpenAI](/wiki/openai) released CLIP (Contrastive Language-Image Pre-training) in January 2021. CLIP was trained on approximately 400 million image-text pairs collected from the internet using a [contrastive learning](/wiki/contrastive_learning) objective; the dataset was assembled by running 500,000 search queries (common unigrams, bigrams, named entities, and similar terms) and collecting up to 20,000 image-text pairs per query. [2] During training, CLIP learned to map images and their corresponding text descriptions to nearby points in a shared [embedding](/wiki/embeddings) space, while pushing non-matching image-text pairs apart. The resulting model demonstrated strong zero-shot [transfer learning](/wiki/transfer_learning) capabilities across a wide range of visual tasks. [2]

Hessel et al. recognized that CLIP's learned alignment between visual and textual representations could serve as the foundation for a captioning evaluation metric. If CLIP could reliably judge whether an image and a piece of text "go together," then it could evaluate caption quality without needing any reference captions at all. This insight led to the development of CLIP Score. [1]

## How does CLIP Score work?

### Core Computation

CLIP Score measures the semantic similarity between an image and a caption by computing the cosine similarity of their CLIP embeddings. The process involves three steps:

1. **Encode the image**: Pass the image through CLIP's visual [encoder](/wiki/encoder) (typically a [Vision Transformer](/wiki/vision_transformer) or [ResNet](/wiki/resnet)) to produce a visual embedding vector **v**.
2. **Encode the caption**: Pass the caption through CLIP's text encoder (a [Transformer](/wiki/transformer)-based model) to produce a textual embedding vector **c**. In the original implementation, the caption is prepended with the prompt "A photo depicts" before encoding. [1]
3. **Compute cosine similarity**: Calculate the cosine similarity between the two normalized embedding vectors.

The original paper uses the CLIP ViT-B/32 model, whose text and image networks each output a single 512-dimensional vector; the image network represents pictures via a Vision Transformer with 12 transformer layers and 86 million parameters, and the text network is a 12-layer transformer over a vocabulary of roughly 49,000 byte-pair-encoding token types. [1]

### Mathematical Formula

The CLIP Score for an image *I* and candidate caption *C* is defined as:

```
CLIP-S(c, v) = w * max(cos(c, v), 0)
```

Where:
- **c** is the CLIP text embedding of the candidate caption
- **v** is the CLIP visual embedding of the image
- **cos(c, v)** is the cosine similarity between the two embeddings
- **w** is a constant scaling factor set to **2.5** in the original implementation
- The **max(., 0)** operation clips negative similarity values to zero

The cosine similarity itself is computed as:

```
cos(c, v) = (c · v) / (||c|| * ||v||)
```

Since CLIP embeddings are L2-normalized before the computation, the dot product of the normalized vectors directly yields the cosine similarity. To produce a corpus-level score, the metric is simply averaged over all (candidate, image) pairs, and the authors note that "this evaluation does not depend on underlying references." [1]
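
The computation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the official implementation: it assumes the embeddings have already been produced by CLIP's encoders, and the function names (`cosine_similarity`, `clip_s`, `corpus_clip_s`) and the toy 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(c, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(c) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(ci * vi for ci, vi in zip(c, v))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_c == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_c * norm_v)


def clip_s(c, v, w=2.5):
    """CLIP-S(c, v) = w * max(cos(c, v), 0)."""
    return w * max(cosine_similarity(c, v), 0.0)


def corpus_clip_s(pairs, w=2.5):
    """Mean CLIP-S over (text_embedding, image_embedding) pairs."""
    scores = [clip_s(c, v, w) for c, v in pairs]
    if not scores:
        raise ValueError("need at least one pair")
    return sum(scores) / len(scores)


# Toy vectors standing in for 512-dimensional CLIP embeddings.
text_emb = [1.0, 0.0, 0.0]
image_emb = [0.6, 0.8, 0.0]

print(clip_s(text_emb, image_emb))          # cosine of 0.6, rescaled by w = 2.5
print(clip_s(text_emb, [-1.0, 0.0, 0.0]))   # negative cosine is clipped to zero
```

The explicit normalization inside `cosine_similarity` makes the sketch work on unnormalized inputs as well; with L2-normalized embeddings it reduces to a plain dot product.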

The scaling constant *w* = 2.5 was chosen by the authors to rescale the scores into a more interpretable range. As they explain in the paper, "we never observe a negative cosine similarity, and we generally observe values ranging from roughly zero to roughly .4. The particular value of w we advocate for, w = 2.5, attempts to stretch the range of the score distribution to [0,1]." [1]

### The "A photo depicts" Prompt

The original CLIPScore implementation prepends the text "A photo depicts" to every candidate caption before passing it through CLIP's text encoder. The authors found that "prefixing candidates with the prompt: 'A photo depicts' improved correlations slightly" over feeding the bare caption. [1] A plausible explanation is that the prefix moves the input closer to CLIP's training distribution, in which many of the 400 million training pairs consisted of photos with descriptive alt-text, so framing the caption as a description of a photo may help CLIP produce more discriminative embeddings. The prompt is hard-coded in the official repository, and the authors recommend keeping it fixed for reproducibility. [1][11]

### How fast is CLIP Score?

CLIP Score is computationally cheap because it requires only a single forward pass through each encoder, with no reference comparison. The authors report that "on our single consumer GPU and hard drive, roughly 4K image-candidate pairings can be processed per minute" using the ViT-B/32 backbone. [1]

### Score Range

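In the original implementation with *w* = 2.5, CLIP Score is bounded between 0 and 2.5 in principle, because cosine similarity cannot exceed 1. In practice, since the authors observe raw cosine similarities of roughly 0 to 0.4, scores mostly fall between 0 and about 1. A score of 0 indicates no alignment between the image and caption (or negative alignment, which is clipped), while higher scores indicate stronger semantic correspondence.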

In the [PyTorch](/wiki/pytorch) Metrics (torchmetrics) implementation, the formula uses a factor of 100 instead of 2.5:

```
CLIPScore(I, C) = max(100 * cos(E_I, E_C), 0)
```

This variant produces scores bounded between 0 and 100, which can be more intuitive to interpret. [11]
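
Because the two conventions differ only in the multiplier applied to the clipped cosine similarity, converting between them is simple arithmetic. The helper names below are hypothetical and the sketch covers the scaling only; it does not account for differences in CLIP backbone, prompt prefix, or preprocessing between implementations, which also change the underlying cosine similarity.

```python
def cosine_to_clip_s(cos_sim, w=2.5):
    """Raw cosine similarity -> original CLIPScore scale (w * max(cos, 0))."""
    return w * max(cos_sim, 0.0)


def cosine_to_torchmetrics(cos_sim):
    """Raw cosine similarity -> 0-100 scale (max(100 * cos, 0))."""
    return max(100.0 * cos_sim, 0.0)


def torchmetrics_to_clip_s(score_100, w=2.5):
    """0-100 scale -> original scale; valid because both are already clipped at 0."""
    return w * score_100 / 100.0


raw = 0.3  # an illustrative cosine similarity, not a measured value
print(cosine_to_clip_s(raw))        # about 0.75 on the w = 2.5 scale
print(cosine_to_torchmetrics(raw))  # about 30 on the 0-100 scale
```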

## What is RefCLIPScore?

While CLIP Score is designed to work without references, the authors also proposed **RefCLIPScore** for situations where reference captions are available. The paper found via information-gain experiments that CLIP Score "is complementary to existing reference-based metrics that emphasize text-text similarities," which motivated a reference-augmented version that "achieves even higher correlation." [1] RefCLIPScore combines the reference-free CLIP Score with a text-only similarity component.

### How RefCLIPScore Works

RefCLIPScore is computed as the harmonic mean (F-score) of two components:

1. **Image-text score**: The standard CLIP Score measuring how well the caption matches the image.
2. **Text-text score**: The maximum cosine similarity between the CLIP text embedding of the candidate caption and the CLIP text embeddings of the available reference captions.

The formula is:

```
RefCLIP-S = (2 * CLIP-S(c, v) * max_r cos(c, r)) / (CLIP-S(c, v) + max_r cos(c, r))
```

Where:
- **CLIP-S(c, v)** is the standard CLIP Score
- **max_r cos(c, r)** is the maximum cosine similarity between the candidate and any reference caption, computed in CLIP's text embedding space
- The harmonic mean balances both components, requiring strong performance on each
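
A minimal sketch of the formula above, assuming the CLIP-S value and the candidate-to-reference cosine similarities have already been computed. The function name `ref_clip_s` is hypothetical, and clipping the text-text term at zero is an assumption of this sketch that keeps the harmonic mean well-defined.

```python
def ref_clip_s(clip_s_score, ref_cosines):
    """Harmonic mean of CLIP-S and the best candidate-reference cosine."""
    if not ref_cosines:
        raise ValueError("need at least one reference caption")
    # Best match against any reference; clipped at zero (assumption of this sketch).
    text_text = max(max(ref_cosines), 0.0)
    if clip_s_score + text_text == 0.0:
        return 0.0  # both components are zero, so the mean is zero
    return 2.0 * clip_s_score * text_text / (clip_s_score + text_text)


# Illustrative numbers: one image-text score and three reference similarities.
print(ref_clip_s(0.8, [0.5, 0.9, 0.7]))  # uses the best reference match, 0.9
```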

### Why Harmonic Mean?

The harmonic mean is a natural choice for combining precision-like and recall-like components, as it penalizes cases where one score is very low while the other is high. A caption that perfectly matches the image but uses completely different language from all references, or one that copies a reference verbatim but does not match the image, would both receive a lower RefCLIPScore than a caption that performs well on both dimensions.
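
A short worked example with made-up numbers shows the effect. Both captions below have the same arithmetic mean (0.5), but the harmonic mean sharply penalizes the unbalanced one:

```latex
\begin{aligned}
\text{Caption A: } & \text{CLIP-S} = 0.9,\ \max_r \cos(c, r) = 0.1
  &&\Rightarrow\ \text{H-Mean} = \frac{2 \cdot 0.9 \cdot 0.1}{0.9 + 0.1} = 0.18 \\
\text{Caption B: } & \text{CLIP-S} = 0.5,\ \max_r \cos(c, r) = 0.5
  &&\Rightarrow\ \text{H-Mean} = \frac{2 \cdot 0.5 \cdot 0.5}{0.5 + 0.5} = 0.50
\end{aligned}
```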

## How well does CLIP Score correlate with human judgment?

### Correlation with Human Judgments

Hessel et al. evaluated CLIP Score against human judgments on several established benchmarks. The primary evaluation measured Kendall's tau correlation between metric scores and human quality ratings. The key results are summarized below.
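
For readers unfamiliar with the statistic, the sketch below computes Kendall's τ_b between metric scores and human ratings in plain Python. It is an illustration only, with a hypothetical function name and toy data; the τ_c variant used for some corpora normalizes ties differently and is not implemented here. In practice a library routine such as `scipy.stats.kendalltau`, which accepts a `variant` argument, would normally be used. Since τ lies in [-1, 1], the tabulated values in this section are presumably τ multiplied by 100.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both rankings: ignored
            elif dx == 0:
                ties_x += 1         # tied only in x
            elif dy == 0:
                ties_y += 1         # tied only in y
            elif dx * dy > 0:
                concordant += 1     # both rankings order the pair the same way
            else:
                discordant += 1     # the rankings disagree on the pair
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0.0:
        raise ValueError("tau-b is undefined when one sequence is constant")
    return (concordant - discordant) / denom


metric_scores = [0.2, 0.5, 0.7, 0.9]   # toy metric outputs
human_ratings = [1, 2, 2, 4]           # toy ratings on a 1-4 scale
print(kendall_tau_b(metric_scores, human_ratings))
```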

#### Flickr8K-Expert Dataset

Flickr8K-Expert contains roughly 17,000 expert human judgments across 5,664 images, where annotators graded captions on a scale of 1 to 4 (4 = "caption describes the image without any errors"). [1]

| Metric | Kendall τ_c |
|--------|------------|
| BLEU-4 | 30.8 |
| ROUGE | 32.3 |
| METEOR | 41.8 |
| CIDEr | 43.9 |
| SPICE | 44.9 |
| **CLIP-S (CLIPScore)** | **51.2** |
| **RefCLIP-S** | **53.0** |

On the Flickr8K-Expert corpus, CLIP Score achieved a Kendall tau correlation of 51.2, and RefCLIPScore reached 53.0. CLIP Score outperformed all reference-based metrics, including CIDEr (43.9) and SPICE (44.9), despite using no reference captions at all. [1]

#### Flickr8K-CF Dataset

| Metric | Kendall τ_b |
|--------|------------|
| BLEU-4 | 16.9 |
| ROUGE | 19.9 |
| METEOR | 22.2 |
| SPICE | 24.4 |
| CIDEr | 24.6 |
| **CLIP-S (CLIPScore)** | **34.4** |
| **RefCLIP-S** | **36.4** |

Flickr8K-CF (CrowdFlower) is a set of about 145,000 binary quality judgments gathered over 48,000 image-caption pairs (1,000 unique images), scored by the mean proportion of "yes" annotations. [1] On this corpus, CLIP Score achieved 34.4 and RefCLIPScore 36.4, substantially higher than CIDEr's 24.6.

#### Composite Dataset

| Metric | Kendall τ_c |
|--------|------------|
| BLEU-4 | 30.6 |
| ROUGE | 32.4 |
| METEOR | 38.9 |
| CIDEr | 37.7 |
| SPICE | 40.3 |
| **CLIP-S (CLIPScore)** | **53.8** |
| **RefCLIP-S** | **55.4** |

The Composite dataset contains about 12,000 human judgments spanning images from MSCOCO, Flickr8k, and Flickr30k. [1] On this corpus, CLIP Score reached 53.8 and RefCLIPScore further improved this to 55.4, demonstrating the benefit of incorporating reference information when available.

### Additional Findings

The paper also reported several other notable results:

- **Alt-text evaluation**: CLIP Score achieved a Kendall tau of 48.4 when evaluating the quality of 2,800 human-rated alt-texts on Twitter images, demonstrating applicability beyond standard captioning benchmarks. [1]
- **Robustness on FOIL**: On the FOIL adversarial dataset, where a single noun phrase is swapped to make a caption incorrect, CLIP Score reliably assigned higher scores to the true caption over the foil, even without access to references. [1]
- **Generalization**: On a set of 250 personal photographs never posted to the internet, CLIP Score "correctly recovers majority human preference in 86% of cases," against a human agreement ceiling of 93% for the same corpus, suggesting the metric generalizes beyond memorized training data. [1]
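
The FOIL and personal-photo results are both pairwise comparisons: the metric is counted as correct when it scores the preferred (or true) caption above the alternative. A minimal sketch of that accuracy calculation follows; the function name and the numbers are invented for illustration and are not data from the paper.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the preferred caption gets the strictly higher score.

    score_pairs: iterable of (score_of_preferred, score_of_alternative).
    """
    score_pairs = list(score_pairs)
    if not score_pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for preferred, other in score_pairs if preferred > other)
    return wins / len(score_pairs)


# Toy CLIP-S values for (true caption, foil caption) on four images.
pairs = [(0.81, 0.62), (0.70, 0.74), (0.55, 0.31), (0.90, 0.45)]
print(pairwise_accuracy(pairs))  # 3 of 4 pairs are ranked correctly
```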

## What is CLIP Score used for?

### Image Captioning Evaluation

CLIP Score's original and primary use case is evaluating [image captioning](/wiki/image_captioning) models. It enables researchers to:

- Compare captioning models without expensive reference caption collection
- Evaluate captions on new domains where reference captions do not exist
- Obtain automatic quality scores that correlate well with human judgments

### Text-to-Image Generation Evaluation

CLIP Score has become one of the standard metrics for evaluating [text-to-image generation](/wiki/text_to_image) models such as [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall-e), [Imagen](/wiki/imagen), and [Midjourney](/wiki/midjourney). In this context, the metric measures how well a generated image matches the input text prompt; as the Hugging Face Diffusers documentation puts it, "higher CLIP scores imply higher compatibility," and the score "was found to have high correlation with human judgement." [12]

For example, the Hugging Face Diffusers library includes CLIP Score as a built-in evaluation metric for comparing [diffusion model](/wiki/diffusion_model) checkpoints. When comparing Stable Diffusion v1.4 and v1.5 on six sample prompts (using the torchmetrics 0-100 scale with the ViT-B/16 backbone), the reported CLIP Scores were 34.91 and 36.21 respectively, indicating that v1.5 produced images slightly more aligned with the input prompts. [12]

Major text-to-image benchmarks that incorporate CLIP Score include:

| Benchmark | Introduced By | Description |
|-----------|---------------|-------------|
| DrawBench | Google (Imagen) | ~200 prompts across 11 categories including colors, counting, spatial relations, and text rendering |
| PartiPrompts | Google (Parti) | 1,600 prompts spanning 12 categories and 11 challenge aspects |
| HEIM | Stanford CRFM | Holistic evaluation framework for image generation models |
| T2I-CompBench | Huang et al. | Compositional text-to-image generation benchmark |
| GenEval | Ghosh et al. | Object-focused text-to-image evaluation |

### CLIP Directional Similarity

A related variant called **CLIP Directional Similarity** measures the consistency of image edits. Given an original image, an edited image, an original caption, and a modified caption, this metric computes:

```
DirectionalSimilarity = cos(E_edited - E_original, E_caption_new - E_caption_old)
```

This measures whether the direction of change in image space aligns with the direction of change in text space. It is particularly useful for evaluating image editing models like InstructPix2Pix. [12]
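
As a rough sketch of the formula above, assuming the four CLIP embeddings are already available as plain lists of floats (the function names and the 2-dimensional toy vectors are hypothetical):

```python
import math


def _difference(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def _cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    return dot / norm


def directional_similarity(img_original, img_edited, txt_old, txt_new):
    """cos(E_edited - E_original, E_caption_new - E_caption_old)."""
    image_direction = _difference(img_edited, img_original)
    text_direction = _difference(txt_new, txt_old)
    return _cosine(image_direction, text_direction)


# Toy embeddings: the image and the caption both move along the second axis.
print(directional_similarity([1.0, 0.0], [1.0, 1.0], [0.5, 0.0], [0.5, 2.0]))
```

A value near 1 means the edit moved the image embedding in the same direction as the caption change; a value near -1 means it moved the opposite way.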

### Model Training and Selection

Beyond evaluation, CLIP Score is also used during generation and model development:

- **Guidance during generation**: Some diffusion model sampling procedures use CLIP similarity to guide the generation process toward better text alignment.
- **Model selection**: CLIP Score helps practitioners select the best model checkpoints or the best generated images from a batch.
- **[Prompt engineering](/wiki/prompt_engineering)**: Researchers use CLIP Score to optimize text prompts for better image generation results.
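
Selecting the best image from a batch amounts to taking the candidate with the highest score. The sketch below assumes the scores have already been computed against a single prompt; `select_best`, the file names, and the values are illustrative only.

```python
def select_best(candidates, scores):
    """Return the (candidate, score) pair with the highest CLIP Score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need equally many candidates and scores, at least one")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


# Hypothetical batch of generated images and their scores for one prompt.
images = ["img_0.png", "img_1.png", "img_2.png"]
scores = [0.71, 0.83, 0.64]
print(select_best(images, scores))
```

Images chosen this way should not then be evaluated with the same metric, since the selection step already optimizes for it.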

## Implementation and Libraries

### Official Implementation

The official CLIPScore implementation is available on GitHub (jmhessel/clipscore). It is run from the command line:

```bash
python clipscore.py candidates.json /path/to/images --references_json refs.json
```

Key implementation details: [11]
- Uses the CLIP ViT-B/32 model by default
- Prepends "A photo depicts" to all captions
- GPU execution uses float16 precision while CPU uses float32, which can produce slightly different results (the repository's example run yields a CLIPScore of 0.8585 on CPU versus 0.8584 on GPU)
- Image compression and resizing affect scores, so the authors provide SHA checksums for reproducibility

### PyTorch Metrics (torchmetrics)

The torchmetrics library provides a widely used implementation:

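```python
import torch
from torchmetrics.multimodal import CLIPScore

# Load the metric with a specific CLIP backbone.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Random pixel values stand in for a real image tensor of shape (C, H, W).
image = torch.randint(255, (3, 224, 224))

# The result is a tensor holding the score on the 0-100 scale.
score = metric(image, "a photo of a cat")
```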

Supported CLIP model variants in torchmetrics include: [11]

| Model | Resolution | Notes |
|-------|-----------|-------|
| openai/clip-vit-base-patch16 | 224x224 | Base model with patch size 16 |
| openai/clip-vit-base-patch32 | 224x224 | Base model with patch size 32 |
| openai/clip-vit-large-patch14 | 224x224 | Default in torchmetrics; larger model |
| openai/clip-vit-large-patch14-336 | 336x336 | Higher resolution variant |
| zer0int/LongCLIP-L-Diffusers | 224x224 | Supports 248-token sequences (vs. 77 default) |

### Hugging Face Diffusers

The Hugging Face Diffusers library includes CLIP Score computation as part of its evaluation documentation, with code examples for comparing diffusion model checkpoints. The documentation now notes that it "has grown outdated given the emergence of existing evaluation frameworks for diffusion models," pointing users toward HEIM, T2I-CompBench, and GenEval. [12]

### Additional Libraries

- **clip-score** (PyPI package by Taited): A lightweight Python package for quick CLIP similarity calculations
- **DALLE_clip_score**: A specialized script for evaluating DALL-E model outputs using CLIP-based scoring

## What are the limitations of CLIP Score?

### Inherited Biases from CLIP

Since CLIP Score relies entirely on the CLIP model, it inherits all of CLIP's biases and limitations. The Hugging Face documentation cautions that "both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased." [12] Specific issues include:

- **Training data biases**: CLIP was trained on internet-sourced image-text pairs, which reflect the biases present in web data. The metric may systematically favor certain types of content.
- **Portrait bias**: Studies have shown that CLIP Score produces higher scores for portraits of people compared to landscapes, animals, or objects. Generated images of well-known public figures receive even higher scores, likely because such content was heavily represented in CLIP's training data.
- **Style sensitivity**: CLIP may not accurately distinguish between artistic styles, and artist name references in prompts can produce misleading scores.

### Compositional Understanding

One of the most significant limitations of CLIP Score is its weak compositional reasoning:

- **Attribute binding**: CLIP struggles to correctly associate attributes with objects. "A red car and a blue house" versus "a blue car and a red house" may receive similar CLIP Scores despite describing very different scenes.
- **Spatial relations**: The metric is largely insensitive to spatial relationships like "above," "below," "left of," and "right of."
- **Counting**: CLIP has difficulty distinguishing between different numbers of objects (e.g., "two cats" vs. "five cats").
- **Negation**: Phrases with negation ("a room without chairs") are poorly handled because CLIP tends to focus on the positive content words.

### Token Length Limitation

Standard CLIP models support a maximum input sequence of 77 tokens. For longer captions or detailed prompts, the text is truncated, potentially losing important information. LongCLIP variants have been developed to address this, supporting up to 248 tokens. [11]

### Narrow Score Distribution

CLIP Scores tend to cluster in a narrow range, making it difficult to distinguish between subtly different caption qualities. The paper itself observes that raw CLIP cosine similarities "generally [range] from roughly zero to roughly .4," which is precisely why the *w* = 2.5 rescaling was introduced. [1] Embedding-based scores such as CLIP Score, HPS, and BLIP-2-based similarity typically concentrate around 0.25 to 0.5 on the raw cosine similarity scale, reducing their discriminative power for fine-grained comparisons.

### Circular Evaluation Problem

When CLIP Score is used to evaluate images generated by models that were themselves trained or guided using CLIP (such as some [Stable Diffusion](/wiki/stable_diffusion) variants or DALL-E 2's CLIP-based prior), the evaluation becomes somewhat circular. The generator may have learned to produce images that score well on CLIP without necessarily improving perceptual quality.

## How does CLIP Score compare to other metrics?

The limitations of CLIP Score have motivated the development of several alternative and complementary metrics:

### Human Preference Models

| Metric | Year | Approach | Key Improvement |
|--------|------|----------|----------------|
| ImageReward | 2023 | Trained on 137K expert comparisons | Outperforms CLIP by 38.6% on human preference prediction [3] |
| HPS v2 | 2023 | Trained on 798K human preference choices | 83.3% accuracy vs. CLIP's 65.1% on preference prediction [4] |
| PickScore | 2023 | Fine-tuned on Pick-a-Pic human preference dataset | Better alignment with user preferences for generated images |
| HPSv3 | 2025 | Wide-spectrum preference modeling; trained on HPDv3 (1.08M text-image pairs, 1.17M pairwise comparisons); VLM-based with uncertainty-aware ranking loss; presented at ICCV 2025 | Highest alignment with human judgments among preference models |

These metrics [fine-tune](/wiki/fine_tuning) CLIP or similar models on human preference data specifically collected for evaluating generated images, resulting in substantially better correlation with what humans actually prefer.

### VQA-Based Metrics

| Metric | Year | Approach |
|--------|------|----------|
| TIFA | 2023 | Generates question-answer pairs from prompts and checks if a VQA model can answer them from the image [5] |
| DSG | 2023 | Decomposes prompts into scene graphs and evaluates element-by-element |
| VQAScore | 2024 | Uses image-to-text generation models to evaluate alignment; significantly outperforms CLIP Score on compositional prompts [6] |

VQA-based metrics address CLIP Score's compositional weakness by breaking down evaluation into individual verifiable questions. [GenAI-Bench](/wiki/genai_bench), a 1,600-prompt compositional benchmark accompanying VQAScore, has been adopted by Google DeepMind (Imagen 3 and Imagen 4), ByteDance Seed, NVIDIA, and others as a standard evaluation suite, reflecting a broader shift away from CLIP Score toward VQA-based evaluation for frontier text-to-image models. [13]
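
The general idea behind question-answering metrics can be sketched as the fraction of prompt-derived questions that a VQA model answers as expected. The code below is a schematic in the spirit of the description above, not the TIFA, DSG, or VQAScore implementation: `vqa_faithfulness` is a hypothetical name, and `stub_vqa` is a lookup table standing in for a real VQA model.

```python
def vqa_faithfulness(image, qa_pairs, answer_fn):
    """Fraction of (question, expected_answer) pairs the VQA model gets right."""
    if not qa_pairs:
        raise ValueError("need at least one question")
    correct = 0
    for question, expected in qa_pairs:
        predicted = answer_fn(image, question)
        # Case-insensitive exact match; real systems may use softer matching.
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)


# Questions that might be derived from the prompt "two black cats".
qa_pairs = [
    ("What animal is shown?", "cat"),
    ("What color is the cat?", "black"),
    ("How many cats are there?", "two"),
]

# Stand-in for a VQA model: this imaginary image shows only one black cat.
canned_answers = {
    "What animal is shown?": "cat",
    "What color is the cat?": "black",
    "How many cats are there?": "one",
}


def stub_vqa(image, question):
    return canned_answers[question]


print(vqa_faithfulness("generated.png", qa_pairs, stub_vqa))  # 2 of 3 correct
```

Unlike a single cosine similarity, the per-question results show which part of the prompt (here, the count) the image failed to satisfy.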

### Other Embedding-Based Metrics

- **PAC-S (Positive-Augmented Contrastive Learning Score)**: Presented at CVPR 2023, improves upon CLIP Score by using augmented contrastive learning. [7]
- **SigLIP-based scores**: Use sigmoid-based contrastive loss instead of CLIP's softmax-based loss, offering improved calibration.
- **L-CLIPScore**: A lightweight variant that achieves comparable performance to CLIP Score while being more computationally efficient.

### Distribution-Based Metrics

For evaluating the overall quality of a generative model (rather than individual image-text pairs), distribution-based metrics remain important:

- **[FID (Frechet Inception Distance)](/wiki/frechet_inception_distance)**: Measures the distance between distributions of real and generated images. Lower is better.
- **IS (Inception Score)**: Measures quality and diversity of generated images.
- **KID (Kernel Inception Distance)**: An unbiased alternative to FID.

These metrics are complementary to CLIP Score: FID and IS measure image quality and diversity without considering text alignment, while CLIP Score measures text alignment without directly assessing visual quality.

## Practical Considerations

### Choosing a CLIP Model

The choice of CLIP model variant affects the resulting scores. Larger models (e.g., ViT-L/14) generally produce more discriminative scores than smaller ones (e.g., ViT-B/32), but they also require more computation. For evaluating [Stable Diffusion](/wiki/stable_diffusion) outputs, using the same CLIP variant that was employed during the model's training (typically openai/clip-vit-large-patch14) provides the most meaningful results. [12]

### Reproducibility

Several factors affect CLIP Score reproducibility: [11]

- **Hardware precision**: GPU (float16) and CPU (float32) execution produce slightly different scores.
- **Image preprocessing**: Different image resizing and compression methods change the resulting scores.
- **Model version**: Different CLIP checkpoints yield different score distributions.
- **Prompt prefix**: Changing or removing the "A photo depicts" prefix alters the scores.

For fair comparisons, researchers should report the exact CLIP model used, the preprocessing pipeline, and the hardware configuration.

### Score Interpretation

CLIP Score should be interpreted comparatively rather than absolutely. A score of 0.28 versus 0.25 (on the raw cosine similarity scale) is meaningful as a relative comparison, but there is no universal threshold that defines a "good" score. The distribution of scores varies by domain, CLIP model variant, and the complexity of the text prompts.

When reporting CLIP Scores, it is important to specify which scaling convention is being used (the original *w* = 2.5 scaling, the torchmetrics 0-100 scaling, or raw cosine similarity) to avoid confusion.
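
One way to make these details hard to omit is to store each score together with the settings that produced it. The record below is only a suggestion; the class name, field names, and example values are hypothetical and do not correspond to any library's API.

```python
from dataclasses import dataclass, asdict

ALLOWED_SCALES = ("raw_cosine", "w=2.5", "0-100")


@dataclass(frozen=True)
class ClipScoreReport:
    """A score plus the settings needed to interpret and reproduce it."""
    score: float
    scale: str           # one of ALLOWED_SCALES
    model: str           # CLIP checkpoint used for both encoders
    prompt_prefix: str   # text prepended to captions, or "" if none
    precision: str       # e.g. "float16" on GPU, "float32" on CPU
    preprocessing: str   # free-form description of resizing and cropping

    def __post_init__(self):
        if self.scale not in ALLOWED_SCALES:
            raise ValueError(f"scale must be one of {ALLOWED_SCALES}")


# Illustrative values only, not a measured result.
report = ClipScoreReport(
    score=0.76,
    scale="w=2.5",
    model="ViT-B/32",
    prompt_prefix="A photo depicts",
    precision="float16",
    preprocessing="resize and center crop to 224x224",
)
print(asdict(report))
```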

## Influence and Impact

CLIP Score has had a substantial impact on the fields of image captioning and text-to-image generation:

- **Widespread adoption**: It is widely reported in text-to-image generation research, including work on models such as [DALL-E 2](/wiki/dall-e), [Stable Diffusion](/wiki/stable_diffusion), [Imagen](/wiki/imagen), and Parti.
- **Benchmark inclusion**: CLIP Score is a component of holistic evaluation frameworks like HEIM (Stanford), T2I-CompBench, and GenEval.
- **Spawning improvements**: Its limitations, particularly around compositional understanding, have directly motivated the development of better metrics like VQAScore, TIFA, and DSG.
- **Democratized evaluation**: By eliminating the need for reference captions, CLIP Score made it possible for anyone to evaluate captioning and generation systems, lowering the barrier to entry for research in these areas.

The original paper has become one of the most cited works in the image captioning evaluation literature, and its influence has only grown with the explosion of text-to-image generation research from 2022 through 2025.

## See Also

- [CLIP](/wiki/clip)
- [Cosine Similarity](/wiki/cosine_similarity)
- [Contrastive Learning](/wiki/contrastive_learning)
- [Computer Vision](/wiki/computer_vision)
- [Vision Transformer](/wiki/vision_transformer)
- [Stable Diffusion](/wiki/stable_diffusion)
- [DALL-E](/wiki/dall-e)
- [Imagen](/wiki/imagen)
- [Frechet Inception Distance](/wiki/frechet_inception_distance)
- [GenAI-Bench](/wiki/genai_bench)
- [ZeroBench](/wiki/zerobench)
- [Multimodal AI](/wiki/multimodal_ai)
- [Embeddings](/wiki/embeddings)
- [Generative AI](/wiki/generative_ai)
- [Diffusion Model](/wiki/diffusion_model)

## References

1. Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., & Choi, Y. (2021). "CLIPScore: A Reference-free Evaluation Metric for Image Captioning." *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. Association for Computational Linguistics. arXiv:2104.08718.

2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2103.00020.

3. Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., & Dong, Y. (2023). "ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation." *Advances in Neural Information Processing Systems (NeurIPS) 36*.

4. Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., & Li, H. (2023). "Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis." arXiv:2306.09341.

5. Hu, Y., Liu, B., Kasai, J., Wang, Y., Ostendorf, M., Krishna, R., & Smith, N. A. (2023). "TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*.

6. Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., & Ramanan, D. (2024). "Evaluating Text-to-Visual Generation with Image-to-Text Generation." *Proceedings of the European Conference on Computer Vision (ECCV)*.

7. Sarto, S., Barraco, M., Cornia, M., Baraldi, L., & Cucchiara, R. (2023). "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.

8. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)*.

9. Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). "CIDEr: Consensus-based Image Description Evaluation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

10. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., ... & Norouzi, M. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." *Advances in Neural Information Processing Systems (NeurIPS) 35*.

11. Lightning AI. "CLIPScore." *TorchMetrics Documentation*. https://lightning.ai/docs/torchmetrics/stable/multimodal/clip_score.html ; Hessel, J. "clipscore" (official implementation). GitHub: https://github.com/jmhessel/clipscore

12. Hugging Face. "Evaluating Diffusion Models." *Diffusers Documentation*. https://huggingface.co/docs/diffusers/conceptual/evaluation

13. Lin, Z., et al. (2024). "GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation." arXiv:2406.13743.