CIDEr
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,247 words
CIDEr is an automatic evaluation metric for image captioning that measures how well a machine-generated caption matches the consensus of several human reference captions for the same image. The name stands for Consensus-based Image Description Evaluation, and the metric was introduced by Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh in a paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2015. [1] A variant called CIDEr-D became one of the official scores on the MS COCO captioning evaluation server, and for several years it served as the headline number that captioning systems competed on. [2]
The central idea is consensus. A single image can be described correctly in many ways, so any one reference caption is a poor gold standard. CIDEr instead treats a set of human captions as the target and rewards a candidate for using the words and phrases that most humans tend to use, while discounting phrasings that are rare or idiosyncratic. To do that it borrows term frequency-inverse document frequency (TF-IDF) weighting from information retrieval, which lets common, low-information n-grams (such as "a" or "there is") count for little and reserves weight for n-grams that actually distinguish one image from another.
Before CIDEr, captioning research mostly reused metrics built for other tasks. BLEU came from machine translation and ROUGE from summarization, and METEOR was also a translation metric. These compare a candidate against references by counting overlapping words or n-grams, but they were not designed around the fact that a caption set encodes a distribution over acceptable descriptions. The CIDEr authors argued that a good metric for description should track human consensus directly: a caption is good to the degree that it says what people generally agree the image shows. [1]
To study consensus they also built two datasets with unusually deep annotation, PASCAL-50S and ABSTRACT-50S, each providing 50 human-written sentences per image rather than the usual handful, and they collected human judgments with a triplet protocol that asks which of two candidate sentences is more like a reference sentence. [1] Two design choices follow from the consensus goal. First, TF-IDF weighting down-weights n-grams that appear across many images and so carry little descriptive content. Second, the score is averaged over a whole set of references rather than computed against a single one, so a candidate is rewarded for agreeing with the crowd, not with one arbitrary writer.
CIDEr scores a candidate caption against a set of reference captions for the same image. Captions are first tokenized and reduced to their stems (in the original CIDEr), and each is represented by the n-grams it contains for n from 1 to 4.
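The n-gram representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the helper names `ngram_counts` and `all_ngram_counts` are hypothetical, lowercasing plus whitespace splitting stands in for a real tokenizer, and stemming is skipped.

```python
from collections import Counter


def ngram_counts(caption: str, n: int) -> Counter:
    """Count the n-grams of length n in a caption."""
    # Assumption: lowercase + whitespace split replaces the real tokenizer,
    # and no stemming is applied.
    tokens = caption.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def all_ngram_counts(caption: str, max_n: int = 4) -> dict:
    """Map each n in 1..max_n to the n-gram counts of the caption."""
    return {n: ngram_counts(caption, n) for n in range(1, max_n + 1)}


counts = all_ngram_counts("A dog chases a ball")
```

A caption shorter than n tokens simply yields an empty counter for that n.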
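Each n-gram is given a TF-IDF weight. For an n-gram omega_k in a sentence, the weight g_k combines two factors: a term frequency, the count of omega_k in the sentence divided by the total count of n-grams in that sentence, and an inverse document frequency, the logarithm of the number of images |I| divided by the number of images whose reference captions contain omega_k at least once.
In the paper's notation: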
g_k(s_ij) = [ h_k(s_ij) / sum_l h_l(s_ij) ] * log( |I| / sum_p min(1, sum_q h_k(s_pq)) )
Here h_k(s_ij) is the number of times omega_k occurs in reference sentence s_ij. The IDF term is what makes n-grams that show up everywhere contribute almost nothing, since for a very common n-gram the ratio inside the logarithm approaches one and the log approaches zero. [1]
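A toy calculation may make the weighting concrete. The sketch below follows the g_k formula on a made-up corpus of three images with one reference each. The function names and the data are hypothetical, and treating an n-gram that appears in no reference set as having document frequency 1 is an assumption added here so the logarithm stays defined.

```python
import math
from collections import Counter


def document_frequency(reference_sets):
    """For each n-gram, count the images whose references contain it at least once."""
    df = Counter()
    for refs in reference_sets:  # one list of n-gram Counters per image
        seen = set()
        for counts in refs:
            seen.update(counts)
        df.update(seen)
    return df


def tfidf_weights(counts, df, num_images):
    """g_k = (h_k / sum_l h_l) * log(|I| / df_k) for every n-gram in `counts`."""
    total = sum(counts.values())
    weights = {}
    for ngram, h in counts.items():
        # Assumption: unseen n-grams get df = 1 so the log is defined.
        idf = math.log(num_images / max(1, df[ngram]))
        weights[ngram] = (h / total) * idf
    return weights


# Hypothetical unigram counts: three images, one reference caption each.
reference_sets = [
    [Counter({("a",): 1, ("dog",): 1})],
    [Counter({("a",): 1, ("cat",): 1})],
    [Counter({("a",): 1, ("bird",): 1})],
]
df = document_frequency(reference_sets)
weights = tfidf_weights(reference_sets[0][0], df, num_images=len(reference_sets))
```

Here "a" occurs in the references of all three images, so its weight is exactly zero, while "dog" occurs for one image only and keeps a weight of 0.5 * log(3).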
Given these weights, the score for a single n-gram length n is the average cosine similarity between the candidate's weighted n-gram vector and each reference's weighted n-gram vector. For a candidate c_i and a reference set S_i with m references:
CIDEr_n(c_i, S_i) = (1/m) * sum_j [ g^n(c_i) . g^n(s_ij) ] / ( ||g^n(c_i)|| * ||g^n(s_ij)|| )
The vectors g^n are formed from the weights of all n-grams of length n. Because cosine similarity normalizes by vector magnitude, the measure responds both to whether the candidate includes important reference n-grams (a recall-like behavior) and to whether it avoids n-grams the references lack (a precision-like behavior). [1]
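The per-length score can be sketched as an average of cosine similarities over sparse weight vectors. The dictionaries below are hypothetical stand-ins for TF-IDF vectors, and returning 0 for an all-zero vector is an assumption of this sketch.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {ngram: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # Assumption: an all-zero vector scores 0 instead of raising an error.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cider_n(candidate_vec: dict, reference_vecs: list) -> float:
    """Average cosine similarity between the candidate and each reference."""
    return sum(cosine(candidate_vec, r) for r in reference_vecs) / len(reference_vecs)


# Hypothetical weights: the candidate matches the first reference and
# shares nothing with the second.
cand = {("dog",): 0.5, ("ball",): 0.5}
refs = [{("dog",): 0.5, ("ball",): 0.5}, {("cat",): 1.0}]
score = cider_n(cand, refs)
```

The candidate agrees fully with one reference and not at all with the other, so the average comes out at about 0.5.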
The final CIDEr score combines the four n-gram lengths with uniform weights:
CIDEr(c_i, S_i) = sum_{n=1..4} w_n * CIDEr_n(c_i, S_i), with w_n = 1/4
Unigrams capture word choice, while bigrams through 4-grams capture short phrases and local word order. [1]
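As a small worked example of the uniform weighting, the sketch below combines four per-length scores. The input values are invented for illustration, and `combine_cider` is a hypothetical helper name.

```python
def combine_cider(per_length_scores, weights=None):
    """Weighted sum of CIDEr_n scores; defaults to uniform weights 1/N."""
    n = len(per_length_scores)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * s for w, s in zip(weights, per_length_scores))


# Hypothetical scores for n = 1, 2, 3, 4: strong word overlap, weaker phrase overlap.
score = combine_cider([0.8, 0.5, 0.3, 0.1])
```

With these inputs the result is about 0.425, well below the unigram score alone, because the longer n-grams pull the average down when word order diverges from the references.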
The authors found that plain CIDEr could be gamed: a sentence that humans rate poorly can still earn a high score if it repeats high-weight n-grams or stuffs in long strings of confident-sounding phrases. CIDEr-D is a hardened version designed for the MS COCO server, and it differs from the original in three ways. [1][2]
First, it clips n-gram counts. The candidate count for each n-gram is capped at the number of times that n-gram appears in the reference, which stops a model from inflating its score by repeating a good word many times. Second, it adds a Gaussian penalty on the length difference between the candidate and each reference, so captions that are much longer or shorter than the references are pushed down. Third, it drops stemming and multiplies the result by a factor of 10 so the numbers land in a range similar to the other COCO metrics. The standard deviation of the length penalty is set to sigma = 6. The per-length CIDEr-D formula is:
CIDEr-D_n(c_i, S_i) = (10/m) * sum_j exp( -( l(c_i) - l(s_ij) )^2 / (2 * sigma^2) ) * [ min(g^n(c_i), g^n(s_ij)) . g^n(s_ij) ] / ( ||g^n(c_i)|| * ||g^n(s_ij)|| )
where l(c_i) and l(s_ij) are the candidate and reference sentence lengths, and the exponential term is the Gaussian length penalty. The four lengths are again averaged with weights 1/4. [2] The two metrics rank systems very similarly in practice; the CIDEr paper reports a Spearman correlation of about 0.94 between CIDEr and CIDEr-D, so the modifications mainly close gaming loopholes rather than change the overall ordering of good systems. [1]
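The effect of clipping and the length penalty can be seen in a minimal sketch of the per-length formula. The weight vectors here are hypothetical toy values (raw counts standing in for TF-IDF weights), `cider_d_n` is an illustrative name, and skipping empty vectors is an assumption of this sketch.

```python
import math


def cider_d_n(cand_vec, cand_len, ref_vecs, ref_lens, sigma=6.0):
    """Per-length CIDEr-D: clipped dot product, cosine normalization, Gaussian length penalty."""
    norm_c = math.sqrt(sum(w * w for w in cand_vec.values()))
    total = 0.0
    for ref_vec, ref_len in zip(ref_vecs, ref_lens):
        norm_r = math.sqrt(sum(w * w for w in ref_vec.values()))
        if norm_c == 0.0 or norm_r == 0.0:
            continue  # assumption: empty vectors contribute 0
        # Clipping: min(g(c), g(s)) . g(s), element-wise over shared n-grams.
        clipped_dot = sum(
            min(w, ref_vec[k]) * ref_vec[k]
            for k, w in cand_vec.items()
            if k in ref_vec
        )
        penalty = math.exp(-((cand_len - ref_len) ** 2) / (2 * sigma ** 2))
        total += penalty * clipped_dot / (norm_c * norm_r)
    return 10.0 * total / len(ref_vecs)


# Hypothetical toy vectors: one reference "dog ball" of length 2.
ref = {("dog",): 1.0, ("ball",): 1.0}
honest = cider_d_n({("dog",): 1.0, ("ball",): 1.0}, 2, [ref], [2])
# A candidate that repeats "dog" three times ("dog dog dog ball", length 4).
repeated = cider_d_n({("dog",): 3.0, ("ball",): 1.0}, 4, [ref], [2])
```

The matching candidate scores about 10, the maximum on this scale. The repeating candidate gains nothing from the extra copies of "dog" in the numerator, pays for them in its own norm, and takes a small length penalty, so it lands well below the honest caption.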
In everyday usage people often say "CIDEr" when they mean the CIDEr-D score that the COCO server returns, which is worth keeping in mind when reading reported numbers.
CIDEr-D is most associated with MS COCO Captions, the dataset and evaluation server that anchored captioning research through the late 2010s. COCO provides five reference captions per image (the configuration labeled c5), plus a subset of 5,000 images with 40 references each (c40) for more stable scoring. The deeper reference pool suits CIDEr well, since the metric was designed to lean on many references. [2]
The COCO evaluation server reports a standard panel of metrics together: BLEU-1 through BLEU-4, ROUGE-L, METEOR, and CIDEr-D. [2] Later work added SPICE, a metric based on scene-graph tuples, and many papers report CIDEr-D alongside SPICE because the two capture different things. CIDEr-D is sensitive to fluent, consensus phrasing, while SPICE is sensitive to whether the right objects, attributes, and relations are mentioned.
CIDEr-D also became a training target, not just an evaluation score. Self-critical sequence training, introduced by Steven Rennie and colleagues at CVPR 2017, optimizes CIDEr-D directly with a REINFORCE-style policy gradient and uses the model's own greedy-decoded caption as the baseline. Doing so lifted the best reported COCO CIDEr score from 104.9 to 112.3, and the technique became a standard final stage in captioning pipelines. [3] Training against the metric improves it reliably, though it also tends to sharpen whatever blind spots the metric has.
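The self-critical idea can be sketched for a single sampled caption, assuming the rewards have already been computed by a CIDEr-D scorer. The function name, the log-probabilities, and the reward values below are all hypothetical, and a real training loop would compute this inside an automatic-differentiation framework with the rewards treated as constants.

```python
def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE loss for one sampled caption, with the greedy caption's reward as baseline."""
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the log-probability of samples that beat
    # the greedy caption and lowers it for samples that fall short.
    return -advantage * sum(sample_logprobs)


# Hypothetical per-token log-probabilities of a sampled caption.
logprobs = [-0.5, -1.0, -0.25]
loss_better = self_critical_loss(logprobs, sample_reward=1.10, greedy_reward=0.95)
loss_worse = self_critical_loss(logprobs, sample_reward=0.80, greedy_reward=0.95)
```

Using the greedy caption as the baseline means no separate value network is needed: a sample is reinforced only to the extent that it outscores what the model would produce at test time.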
The original paper validated CIDEr against the consensus judgments collected on PASCAL-50S and ABSTRACT-50S. Using 48 reference sentences, CIDEr reproduced human pairwise preferences about 84 percent of the time on both datasets, against a human agreement rate of roughly 90 percent on PASCAL-50S and 83 percent on ABSTRACT-50S. METEOR was the closest competitor at around 80 to 82 percent, and the BLEU and ROUGE variants trailed further behind, especially when only a few references were available. [1] On the Flickr8K caption-quality dataset, CIDEr reached a Spearman correlation of 0.58 with human ratings, slightly ahead of METEOR at 0.56. [1]
How strongly CIDEr tracks humans depends a lot on how the question is framed. When the SPICE authors measured system-level (corpus-level) Pearson correlation against human quality rankings of the MS COCO 2015 challenge entries, CIDEr came in at 0.43, with METEOR at 0.53 and SPICE at 0.88. [4] These figures look much weaker than the PASCAL-50S accuracies, but they answer a harder question (ranking whole systems that are already close in quality) rather than the easier one of telling a clearly good caption from a clearly bad one. Both kinds of result appear in the literature, so it helps to note which is being cited.
CIDEr inherits the basic weakness of any n-gram method: it rewards surface overlap, so a correct caption phrased with different words than the references can be scored low. Paraphrase, synonymy, and reordering are only partly handled by the mix of unigrams through 4-grams, and the TF-IDF weighting softens this without giving the metric any model of meaning.
The metric also needs a healthy pool of references to behave well. Its TF-IDF statistics and cosine-against-the-set averaging are most meaningful when there are several captions per image; with only one or two references the consensus signal is thin and scores grow noisy. That is part of why CIDEr is dependable on COCO, with its five or forty references, and weaker on datasets with sparse annotation.
Gaming remains a concern even with CIDEr-D. Clipping and the length penalty block the most obvious tricks, but because the metric can be optimized directly during training, models can drift toward generic, high-frequency phrasing that scores well without describing an image vividly or accurately. A later metric, CIDEr-R, was proposed specifically to handle datasets where caption length varies a great deal, a regime where the fixed Gaussian penalty of CIDEr-D is a poor fit. [5]
A different line of work questions the reference-based framing altogether. CLIP Score, introduced by Jack Hessel and colleagues at EMNLP 2021, scores a caption by the cosine similarity between the image and the caption in the shared embedding space of CLIP, using no reference captions at all (CLIP-S = 2.5 * max(cos(image, caption), 0)). [6] On the Flickr8K-Expert benchmark it reached a Kendall correlation of 51.2 with human judgments, ahead of CIDEr at 43.9 and SPICE at 44.9, and it showed a similar advantage on the Composite benchmark. [6] Reference-free metrics like this sidestep CIDEr's dependence on reference quality and quantity, though they bring their own biases from the vision-language model they rely on. In current practice within computer vision, CIDEr-D is rarely used alone; teams report it together with SPICE and, increasingly, with embedding-based scores so that the picture does not hinge on a single n-gram metric.
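The CLIP-S formula quoted above reduces to a rescaled, clamped cosine similarity. The sketch below applies it to plain Python lists. The embeddings are hypothetical placeholders, and no CLIP model is loaded or called here; in practice the two vectors would come from CLIP's image and text encoders.

```python
import math


def clip_s(image_emb, caption_emb, w=2.5):
    """CLIP-S = w * max(cos(image, caption), 0) for two embedding vectors."""
    dot = sum(a * b for a, b in zip(image_emb, caption_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in caption_emb)
    )
    return w * max(dot / norm, 0.0)


# Hypothetical 2-dimensional embeddings, for illustration only.
aligned = clip_s([1.0, 0.0], [1.0, 0.0])
opposed = clip_s([1.0, 0.0], [-1.0, 0.0])
```

Perfectly aligned vectors give the maximum of 2.5, and any negative cosine is clamped to zero.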
| Metric | Year | Basis | References needed | Reported human correlation |
|---|---|---|---|---|
| BLEU-4 | 2002 | Modified n-gram precision (n up to 4) | Yes | System-level Pearson 0.05 on COCO 2015 [4] |
| ROUGE-L | 2004 | Longest common subsequence | Yes | System-level Pearson 0.15 on COCO 2015 [4] |
| METEOR | 2005 | Unigram matches with stems and synonyms | Yes | System-level Pearson 0.53 on COCO 2015 [4] |
| CIDEr-D | 2015 | TF-IDF n-gram (n=1 to 4) cosine, clipping, length penalty | Yes, several preferred | System-level Pearson 0.43 on COCO 2015 [4]; ~84% pairwise accuracy on PASCAL-50S [1] |
| SPICE | 2016 | Scene-graph tuple F-score | Yes | System-level Pearson 0.88 on COCO 2015 [4] |
| CLIP Score | 2021 | Image-text cosine similarity (CLIP), reference-free | No | Kendall tau 51.2 on Flickr8K-Expert [6] |
The figures in the last column are not all measured the same way, so they show trends rather than a single ranking. The system-level Pearson figures come from the SPICE evaluation, the PASCAL-50S figure is a pairwise accuracy, and the CLIP Score figure is a Kendall correlation at the caption level. [1][4][6]