CIDEr

Updated Jun 29, 2026

CIDEr (Consensus-based Image Description Evaluation) is an automatic evaluation metric for image captioning that scores a machine-generated caption by how closely it matches the consensus of several human reference captions for the same image. Each sentence is represented as a TF-IDF (term frequency-inverse document frequency) weighted vector over n-grams of length 1 to 4, and the score is the cosine similarity between the candidate caption and each reference caption, averaged over the references and over the four n-gram lengths. CIDEr was introduced by Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh in a paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2015. ^[1] A hardened variant called CIDEr-D became one of the official scores on the MS COCO captioning evaluation server, and for several years it served as the headline number that captioning systems competed on. ^[2]

The central idea is consensus. A single image can be described correctly in many ways, so any one reference caption is a poor gold standard. CIDEr instead treats a set of human captions as the target and rewards a candidate for using the words and phrases that most humans tend to use, while discounting phrasings that are rare or idiosyncratic. To do that it borrows TF-IDF weighting from information retrieval, which lets common, low-information n-grams (such as "a" or "there is") count for little and reserves weight for n-grams that actually distinguish one image from another. As the authors put it, the paper contributes "a new automated metric (CIDEr) that captures consensus" between human descriptions. ^[1]

Why was CIDEr created?

Before CIDEr, captioning research mostly reused metrics built for other tasks. BLEU and ROUGE came from machine translation and summarization, respectively, and METEOR was also a translation metric. These metrics compare a candidate against references by counting overlapping words or n-grams, but they were not designed around the fact that a caption set encodes a distribution over acceptable descriptions. The CIDEr authors argued that a good metric for description should track human consensus directly: a caption is good to the degree that it says what people generally agree the image shows. ^[1]

To study consensus they also built two datasets with unusually deep annotation, PASCAL-50S and ABSTRACT-50S, each providing 50 human-written sentences per image rather than the usual handful, and they collected human judgments with a triplet protocol that asks which of two candidate sentences is more like a reference sentence. ^[1] Two design choices follow from the consensus goal. First, TF-IDF weighting down-weights n-grams that appear across many images and so carry little descriptive content. Second, the score is averaged over a whole set of references rather than computed against a single one, so a candidate is rewarded for agreeing with the crowd, not with one arbitrary writer.

How does CIDEr work?

CIDEr scores a candidate caption against a set of reference captions for the same image. Captions are first tokenized and reduced to their stems (in the original CIDEr), and each is represented by the n-grams it contains for n from 1 to 4.

Each n-gram is given a TF-IDF weight. For an n-gram omega_k in a sentence, the weight g_k combines two factors:

Term frequency: the count of omega_k in that sentence divided by the total count of all n-grams in the sentence.
Inverse document frequency: the logarithm of the total number of images in the dataset, |I|, divided by the number of images for which omega_k appears in at least one reference caption.

In the paper's notation:

g_k(s_ij) = [ h_k(s_ij) / sum_l h_l(s_ij) ] * log( |I| / sum_p min(1, sum_q h_k(s_pq)) )

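Here h_k(s_ij) is the number of times omega_k occurs in s_ij, the j-th reference sentence for image i. The sum over l in the first factor runs over all n-grams in the vocabulary, p runs over the images in the dataset, and q runs over the reference sentences of image p. The IDF term is what makes n-grams that show up everywhere contribute almost nothing, since for a very common n-gram the ratio inside the logarithm approaches one and the log approaches zero. ^[1]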
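The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the toy corpus, the whitespace tokenization, the absence of stemming, and the function names (`ngrams`, `document_frequency`, `tfidf_weights`) are all assumptions made for the example, as is the choice to treat an n-gram unseen in any reference as having document frequency 1.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequency(refs_per_image, n):
    """Count, for each n-gram, the images whose references contain it at least once."""
    df = Counter()
    for refs in refs_per_image:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)  # each image contributes at most 1 per n-gram
    return df

def tfidf_weights(sentence, df, num_images, n):
    """Compute the weight g_k for every n-gram of length n in one sentence."""
    counts = Counter(ngrams(sentence.split(), n))
    total = sum(counts.values())
    weights = {}
    for gram, count in counts.items():
        tf = count / total
        # Assumption: unseen n-grams get document frequency 1 to avoid dividing by zero.
        idf = math.log(num_images / max(1, df[gram]))
        weights[gram] = tf * idf
    return weights

# Hypothetical toy corpus: three images with two references each.
refs_per_image = [
    ["a dog runs on the grass", "a brown dog runs outside"],
    ["a cat sits on a sofa", "a cat rests on the couch"],
    ["a man rides a bike", "a person rides a bicycle"],
]
df1 = document_frequency(refs_per_image, 1)
weights = tfidf_weights("a dog runs on the grass", df1, len(refs_per_image), 1)
for gram, weight in sorted(weights.items()):
    print(gram, round(weight, 4))
```

In this toy corpus the unigram "a" occurs in the references of every image, so its weight is exactly zero, while "dog" occurs for only one image and keeps a positive weight.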

Given these weights, the score for a single n-gram length n is the average cosine similarity between the candidate's weighted n-gram vector and each reference's weighted n-gram vector. For a candidate c_i and a reference set S_i with m references:

CIDEr_n(c_i, S_i) = (1/m) * sum_j  [ g^n(c_i) . g^n(s_ij) ] / ( ||g^n(c_i)|| * ||g^n(s_ij)|| )

The vectors g^n are formed from the weights of all n-grams of length n. Because cosine similarity normalizes by vector magnitude, the measure responds both to whether the candidate includes important reference n-grams (a recall-like behavior) and to whether it avoids n-grams the references lack (a precision-like behavior). ^[1]

The final CIDEr score combines the four n-gram lengths with uniform weights:

CIDEr(c_i, S_i) = sum_{n=1..4} w_n * CIDEr_n(c_i, S_i),   with w_n = 1/4

Unigrams capture word choice, while bigrams through 4-grams capture short phrases and local word order. ^[1]
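A minimal sketch of the two formulas above, assuming the TF-IDF weight vectors have already been computed and are stored as sparse dictionaries keyed by n-gram. The function names and the toy weights are illustrative only, and returning 0 for an all-zero vector is an assumption rather than something the paper specifies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def cider_n(cand_vec, ref_vecs):
    """Average cosine similarity between the candidate and each reference for one n."""
    return sum(cosine(cand_vec, ref) for ref in ref_vecs) / len(ref_vecs)

def cider(cand_vecs_by_n, ref_vecs_by_n):
    """Combine n = 1..4 with uniform weights w_n = 1/4."""
    return sum(0.25 * cider_n(cand_vecs_by_n[n], ref_vecs_by_n[n]) for n in (1, 2, 3, 4))

# Made-up weights: the candidate matches the first reference and shares nothing with the second.
cand = {("dog",): 1.0, ("runs",): 1.0}
refs = [{("dog",): 1.0, ("runs",): 1.0}, {("cat",): 2.0}]
per_length = cider_n(cand, refs)

# For brevity the same toy vectors are reused for every n-gram length.
score = cider({n: cand for n in (1, 2, 3, 4)}, {n: refs for n in (1, 2, 3, 4)})
print(round(per_length, 3), round(score, 3))
```

Because the candidate agrees fully with one of the two references and not at all with the other, the average lands halfway between the two extremes, which is the consensus behavior the metric is built around.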

What is the difference between CIDEr and CIDEr-D?

The authors found that plain CIDEr could be gamed: a sentence that humans rate poorly can still earn a high score if it repeats high-weight n-grams or stuffs in long strings of confident-sounding phrases. CIDEr-D is a hardened version designed for the MS COCO server, and it differs from the original in three ways. ^[1]^[2]

First, it clips n-gram counts. The candidate count for each n-gram is capped at the number of times that n-gram appears in the reference, which stops a model from inflating its score by repeating a good word many times. As the COCO Captions paper illustrates, a candidate saying "The The The" gets credit for only one "The" if that word occurs at most once in any individual reference. ^[2] In the formula, the clipping appears as an element-wise minimum between the candidate and reference weight vectors. Second, it adds a Gaussian penalty on the length difference between the candidate and each reference, with standard deviation sigma = 6, so captions that are much longer or shorter than the references are pushed down. ^[2] Third, it drops stemming. The result is also multiplied by a factor of 10 so the numbers land in a range similar to the other COCO metrics. ^[2] The per-length CIDEr-D formula is:

CIDEr-D_n(c_i, S_i) = (10/m) * sum_j  exp( -( l(c_i) - l(s_ij) )^2 / (2 * sigma^2) )
                              * [ min(g^n(c_i), g^n(s_ij)) . g^n(s_ij) ] / ( ||g^n(c_i)|| * ||g^n(s_ij)|| )

where l(c_i) and l(s_ij) are the candidate and reference sentence lengths, and the exponential term is the Gaussian length penalty. The four lengths are again averaged with weights 1/4. ^[2] The two metrics rank systems very similarly in practice; the CIDEr paper reports a Spearman correlation of about 0.94 between CIDEr and CIDEr-D, so the modifications mainly close gaming loopholes rather than change the overall ordering of good systems. ^[1]
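The per-length formula can be sketched as follows, again assuming precomputed weight vectors stored as dictionaries. The function name `cider_d_n`, the `(weights, length)` pairing of each reference, and the decision to skip references with an all-zero vector are assumptions for illustration and are not taken from the COCO evaluation code.

```python
import math

def cider_d_n(cand_vec, cand_len, refs, sigma=6.0):
    """CIDEr-D for one n-gram length.

    cand_vec: dict of n-gram -> weight for the candidate
    cand_len: candidate length in tokens
    refs:     list of (weight_dict, length_in_tokens) pairs, one per reference
    """
    norm_c = math.sqrt(sum(w * w for w in cand_vec.values()))
    total = 0.0
    for ref_vec, ref_len in refs:
        norm_r = math.sqrt(sum(w * w for w in ref_vec.values()))
        if norm_c == 0.0 or norm_r == 0.0:
            continue  # assumption: empty vectors contribute nothing
        # Clip each candidate weight at the reference weight before the dot product.
        clipped_dot = sum(
            min(w, ref_vec.get(k, 0.0)) * ref_vec.get(k, 0.0)
            for k, w in cand_vec.items()
        )
        # Gaussian penalty on the length difference.
        penalty = math.exp(-((cand_len - ref_len) ** 2) / (2 * sigma ** 2))
        total += penalty * clipped_dot / (norm_c * norm_r)
    return 10.0 * total / len(refs)

reference = [({("dog",): 1.0}, 5)]
exact = cider_d_n({("dog",): 1.0}, 5, reference)      # identical weights and length
repeated = cider_d_n({("dog",): 3.0}, 5, reference)   # weight inflated by repetition
too_long = cider_d_n({("dog",): 1.0}, 17, reference)  # 12 tokens longer than the reference
print(round(exact, 3), round(repeated, 3), round(too_long, 3))
```

With plain cosine similarity the repeated candidate would score the same as the exact one, because the two vectors point in the same direction. Clipping leaves the numerator unchanged while the candidate's norm grows, so repetition lowers the score instead.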

In everyday usage people often say "CIDEr" when they mean the CIDEr-D score that the COCO server returns, which is worth keeping in mind when reading reported numbers.

How is CIDEr used on MS COCO?

CIDEr-D is most associated with MS COCO Captions, the dataset and evaluation server that anchored captioning research through the late 2010s. COCO provides five reference captions per image (the configuration labeled c5), plus a subset of 5,000 images with 40 references each (c40) for more stable scoring. The deeper reference pool suits CIDEr well, since the metric was designed to lean on many references. ^[2]

The COCO evaluation server reports a standard panel of metrics together: BLEU-1 through BLEU-4, ROUGE-L, METEOR, and CIDEr-D. ^[2] Later work added SPICE, a metric based on scene-graph tuples, and many papers report CIDEr-D alongside SPICE because the two capture different things. CIDEr-D is sensitive to fluent, consensus phrasing, while SPICE is sensitive to whether the right objects, attributes, and relations are mentioned.

CIDEr-D also became a training target, not just an evaluation score. Self-critical sequence training, introduced by Steven Rennie and colleagues at CVPR 2017, optimizes CIDEr-D directly with a REINFORCE-style policy gradient and uses the model's own greedy-decoded caption as the baseline. In the paper's words, the method established "a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7" on the MS COCO evaluation server, and the technique became a standard final stage in captioning pipelines. ^[3] Training against the metric improves it reliably, though it also tends to sharpen whatever blind spots the metric has.
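The core of the self-critical update can be sketched for a single caption. This is a simplified illustration under stated assumptions: the rewards are taken to be CIDEr-D scores computed elsewhere, the log-probabilities are plain floats rather than framework tensors, and the function name `scst_loss` is hypothetical. In a real training loop the gradient would flow only through the log-probabilities, with the rewards treated as constants.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss for one sampled caption with a greedy baseline.

    sample_logprobs: log-probabilities of the tokens in the sampled caption
    sample_reward:   metric score (e.g. CIDEr-D) of the sampled caption
    greedy_reward:   metric score of the model's own greedy-decoded caption
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the greedy
    # caption and lowers the probability of samples that fall short of it.
    return -advantage * sum(sample_logprobs)

# Made-up numbers: the sampled caption scores 1.2 and the greedy caption 0.9.
better_sample = scst_loss([-0.5, -1.0], sample_reward=1.2, greedy_reward=0.9)
worse_sample = scst_loss([-0.5, -1.0], sample_reward=0.6, greedy_reward=0.9)
print(round(better_sample, 3), round(worse_sample, 3))
```

Using the greedy caption as the baseline means no separate value network is needed, and a sample only receives a positive learning signal when it outscores what the model would produce at test time.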

How well does CIDEr correlate with human judgment?

The original paper validated CIDEr against the consensus judgments collected on PASCAL-50S and ABSTRACT-50S. Using 48 reference sentences, CIDEr reproduced human pairwise preferences about 84 percent of the time on both datasets, compared with human agreement of roughly 90 percent on PASCAL-50S and 83 percent on ABSTRACT-50S. METEOR was the closest competitor at around 80 to 82 percent, and the BLEU and ROUGE variants trailed further behind, especially when only a few references were available. ^[1] On the Flickr8K caption-quality dataset, CIDEr reached a Spearman correlation of 0.58 with human ratings, slightly ahead of METEOR at 0.56. ^[1]

How strongly CIDEr tracks humans depends a lot on how the question is framed. When the SPICE authors measured system-level (corpus-level) Pearson correlation against human quality rankings of the MS COCO 2015 challenge entries, CIDEr came in at 0.43, with METEOR at 0.53 and SPICE at 0.88. ^[4] These figures look much weaker than the PASCAL-50S accuracies, but they answer a harder question (ranking whole systems that are already close in quality) rather than the easier one of telling a clearly good caption from a clearly bad one. Both kinds of result appear in the literature, so it helps to note which is being cited.
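The two framings correspond to different computations, which a short sketch can make concrete. All numbers below are invented for illustration, and the helper names `pairwise_accuracy` and `pearson` are hypothetical; neither function reproduces the exact protocol of the cited papers.

```python
import math

def pairwise_accuracy(metric_pairs, human_prefers_first):
    """Fraction of caption pairs where the metric picks the caption humans preferred.

    metric_pairs:        list of (score_of_first, score_of_second) tuples
    human_prefers_first: list of bools, True if humans preferred the first caption
    """
    agree = sum((a > b) == h for (a, b), h in zip(metric_pairs, human_prefers_first))
    return agree / len(human_prefers_first)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Caption-level framing: does the metric agree with humans on individual pairs?
accuracy = pairwise_accuracy(
    [(0.9, 0.2), (0.1, 0.4), (0.5, 0.3), (0.2, 0.6)],
    [True, False, False, False],
)

# System-level framing: one metric score and one human score per captioning system.
system_metric = [1.0, 2.0, 3.0]
system_human = [2.0, 4.0, 6.0]
correlation = pearson(system_metric, system_human)
print(accuracy, round(correlation, 3))
```

A metric can do well on the first computation and poorly on the second, because the system-level correlation is computed over a handful of systems whose scores may differ only slightly.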

What are the limitations of CIDEr?

CIDEr inherits the basic weakness of any n-gram method: it rewards surface overlap, so a correct caption phrased with different words than the references can be scored low. Paraphrase, synonymy, and reordering are only partly handled by the mix of unigrams through 4-grams, and the TF-IDF weighting softens this without giving the metric any model of meaning.

The metric also needs a healthy pool of references to behave well. Its TF-IDF statistics and cosine-against-the-set averaging are most meaningful when there are several captions per image; with only one or two references the consensus signal is thin and scores grow noisy. That is part of why CIDEr is dependable on COCO, with its five or forty references, and weaker on datasets with sparse annotation.

Gaming remains a concern even with CIDEr-D. Clipping and the length penalty block the most obvious tricks, but because the metric can be optimized directly during training, models can drift toward generic, high-frequency phrasing that scores well without describing an image vividly or accurately. A later metric, CIDEr-R, was proposed specifically to handle datasets where caption length varies a great deal, a regime where the fixed Gaussian penalty of CIDEr-D is a poor fit. ^[5]

A different line of work questions the reference-based framing altogether. CLIP Score, introduced by Jack Hessel and colleagues at EMNLP 2021, scores a caption by the cosine similarity between the image and the caption in the shared embedding space of CLIP, using no reference captions at all (CLIP-S = 2.5 * max(cos(image, caption), 0)). ^[6] On the Flickr8K-Expert benchmark it reached a Kendall correlation of 51.2 with human judgments (tau reported on a scale multiplied by 100), ahead of CIDEr at 43.9 and SPICE at 44.9, and it showed a similar advantage on the Composite benchmark. ^[6] Reference-free metrics like this sidestep CIDEr's dependence on reference quality and quantity, though they bring their own biases from the vision-language model they rely on. In current practice within computer vision, CIDEr-D is rarely used alone; teams report it together with SPICE and, increasingly, with embedding-based scores so that the picture does not hinge on a single n-gram metric.
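The CLIP-S formula quoted above reduces to a rescaled, clipped cosine similarity. The sketch below assumes the image and caption embeddings have already been produced by a CLIP model and are passed in as plain lists of floats. It does not call any CLIP library, and the function name `clip_s` and the two-dimensional toy vectors are purely illustrative.

```python
import math

def clip_s(image_emb, caption_emb, w=2.5):
    """CLIP-S = w * max(cos(image, caption), 0) for two embedding vectors."""
    dot = sum(a * b for a, b in zip(image_emb, caption_emb))
    norm_image = math.sqrt(sum(a * a for a in image_emb))
    norm_caption = math.sqrt(sum(b * b for b in caption_emb))
    cos = dot / (norm_image * norm_caption)
    return w * max(cos, 0.0)

aligned = clip_s([1.0, 0.0], [1.0, 0.0])    # embeddings point the same way
unrelated = clip_s([1.0, 0.0], [0.0, 1.0])  # orthogonal embeddings
opposed = clip_s([1.0, 0.0], [-1.0, 0.0])   # negative cosine is clipped to zero
print(aligned, unrelated, opposed)
```

No reference captions appear anywhere in the computation, which is the sense in which the metric is reference-free.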

How does CIDEr compare with related metrics?

| Metric | Year | Basis | References needed | Reported human correlation |
| --- | --- | --- | --- | --- |
| BLEU-4 | 2002 | Modified n-gram precision (n up to 4) | Yes | System-level Pearson 0.05 on COCO 2015 ^[4] |
| ROUGE-L | 2004 | Longest common subsequence | Yes | System-level Pearson 0.15 on COCO 2015 ^[4] |
| METEOR | 2005 | Unigram matches with stems and synonyms | Yes | System-level Pearson 0.53 on COCO 2015 ^[4] |
| CIDEr-D | 2015 | TF-IDF n-gram (n=1 to 4) cosine, clipping, length penalty | Yes, several preferred | System-level Pearson 0.43 on COCO 2015 ^[4]; ~84% pairwise accuracy on PASCAL-50S ^[1] |
| SPICE | 2016 | Scene-graph tuple F-score | Yes | System-level Pearson 0.88 on COCO 2015 ^[4] |
| CLIP Score | 2021 | Image-text cosine similarity (CLIP), reference-free | No | Kendall tau 51.2 on Flickr8K-Expert ^[6] |

The figures in the correlation column are not all measured the same way, so they show trends rather than a single ranking. The system-level Pearson figures come from the SPICE evaluation, while the CLIP Score figure is a Kendall correlation at the caption level. ^[4]^[6]

References

1. Vedantam, Ramakrishna; Zitnick, C. Lawrence; Parikh, Devi. "CIDEr: Consensus-based Image Description Evaluation." CVPR 2015. https://arxiv.org/abs/1411.5726
2. Chen, Xinlei; Fang, Hao; Lin, Tsung-Yi; Vedantam, Ramakrishna; Gupta, Saurabh; Dollar, Piotr; Zitnick, C. Lawrence. "Microsoft COCO Captions: Data Collection and Evaluation Server." arXiv preprint, 2015. https://arxiv.org/abs/1504.00325
3. Rennie, Steven J.; Marcheret, Etienne; Mroueh, Youssef; Ross, Jerret; Goel, Vaibhava. "Self-critical Sequence Training for Image Captioning." CVPR 2017. https://arxiv.org/abs/1612.00563
4. Anderson, Peter; Fernando, Basura; Johnson, Mark; Gould, Stephen. "SPICE: Semantic Propositional Image Caption Evaluation." ECCV 2016. https://arxiv.org/abs/1607.08822
5. Santos, Gabriel Oliveira dos; Colombini, Esther Luna; Avila, Sandra. "CIDEr-R: Robust Consensus-based Image Description Evaluation." Findings of EMNLP 2021. https://arxiv.org/abs/2109.13701
6. Hessel, Jack; Holtzman, Ari; Forbes, Maxwell; Le Bras, Ronan; Choi, Yejin. "CLIPScore: A Reference-free Evaluation Metric for Image Captioning." EMNLP 2021. https://aclanthology.org/2021.emnlp-main.595/
7. Vedantam, Ramakrishna. "cider: Code for computing the CIDEr metric." GitHub repository, 2015. https://github.com/vrama91/cider
8. Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing. "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002. https://aclanthology.org/P02-1040/
9. Lin, Chin-Yew. "ROUGE: A Package for Automatic Evaluation of Summaries." Text Summarization Branches Out, ACL 2004. https://aclanthology.org/W04-1013/
10. Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollar, Piotr; Zitnick, C. Lawrence. "Microsoft COCO: Common Objects in Context." ECCV 2014. https://arxiv.org/abs/1405.0312
