COCO (Common Objects in Context) is a large-scale dataset for object detection, image segmentation, keypoint detection, and image captioning. Created by a team led by Tsung-Yi Lin at Microsoft Research and introduced at the European Conference on Computer Vision (ECCV) in 2014, COCO has become the most widely used benchmark for evaluating computer vision models. The dataset contains over 330,000 images with more than 1.5 million object instances spanning 80 object categories, all annotated with per-instance segmentation masks. With over 49,000 citations as of 2025, the original COCO paper is one of the most cited publications in the history of computer vision research.
Before COCO, the dominant benchmark for object detection and recognition was the PASCAL VOC dataset, which contained 20 object categories and roughly 11,500 images. While PASCAL VOC served the community well from 2005 onward, it had limitations in scale, category diversity, and scene complexity. Many images in existing datasets were "iconic," meaning they showed a single, centered object against a clean background. Real-world scenes, by contrast, contain multiple objects in cluttered environments, with partial occlusions and varying scales.
The COCO project was motivated by the need for a dataset that captured this complexity. The authors, Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick, drew from institutions including Cornell University, California Institute of Technology, Brown University, UC Irvine, and Microsoft Research. Lubomir Bourdev (Facebook AI Research) and Ross Girshick (Microsoft Research, Redmond) also contributed to the project. Their goal was to build a dataset with three key properties: a large number of object instances per image, rich contextual relationships between objects, and precise spatial localization through per-instance segmentation masks rather than just bounding boxes.
The paper, "Microsoft COCO: Common Objects in Context," was published at ECCV 2014 and made available on arXiv in May 2014 (arXiv:1405.0312).
All images in COCO were sourced from Flickr, a photo-sharing platform where users upload images under various Creative Commons licenses. The choice of Flickr was deliberate: the platform tends to host photographs taken by amateur photographers in everyday settings, which produces more natural, non-iconic images compared to stock photography or web-scraped search results.
To find images that contained multiple objects in realistic contexts, the dataset creators searched Flickr using pairwise combinations of object category names. For example, rather than searching for "dog" alone (which tends to return iconic, centered images of dogs), they searched for pairs such as "dog + car" or "person + bicycle." This strategy produced images where objects appeared at various scales, viewpoints, and levels of occlusion, reflecting how objects naturally co-occur in real scenes.
The annotations are released under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits redistribution and commercial use with proper attribution. The images themselves retain the Creative Commons licenses chosen by their original Flickr uploaders, which are recorded per image in the annotation files.
COCO defines 80 object categories for detection and segmentation tasks. The original paper proposed 91 categories, but 11 of them (hat, shoe, eye glasses, plate, mirror, window, desk, door, blender, hair brush, and street sign) lacked sufficient segmentation annotations and were dropped from the released dataset. The categories were selected to represent objects that a young child would easily recognize and that occur frequently in everyday life. The 80 categories are organized into 12 supercategories:
| Supercategory | Object Categories |
|---|---|
| Person | person |
| Vehicle | bicycle, car, motorcycle, airplane, bus, train, truck, boat |
| Outdoor | traffic light, fire hydrant, stop sign, parking meter, bench |
| Animal | bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe |
| Accessory | backpack, umbrella, handbag, tie, suitcase |
| Sports | frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket |
| Kitchen | bottle, wine glass, cup, fork, knife, spoon, bowl |
| Food | banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake |
| Furniture | chair, couch, potted plant, bed, dining table, toilet |
| Electronic | tv, laptop, mouse, remote, keyboard, cell phone |
| Appliance | microwave, oven, toaster, sink, refrigerator |
| Indoor | book, clock, vase, scissors, teddy bear, hair drier, toothbrush |
Each category was chosen to have a sufficient number of instances across the dataset to support training and evaluation of statistical models.
The COCO dataset has been released in multiple versions. The 2017 reorganization is the most commonly used today. The labeled images are the same across versions; the 2017 release simply redistributed them across splits. It also published a separate set of roughly 123,000 unlabeled images to support semi-supervised and unsupervised research.
| Statistic | Value |
|---|---|
| Total images | ~330,000 |
| Labeled images (with annotations) | ~200,000 |
| Object instances (thing categories) | ~1,500,000 |
| Thing categories | 80 |
| Stuff categories (COCO-Stuff) | 91 |
| Panoptic categories (combined) | 133 (80 things + 53 stuff) |
| Captions | Over 1.5 million (5 per image) |
| Keypoint-annotated person instances | ~250,000 (17 keypoints each) |
| Average object instances per image | 7.7 |
| Average categories per image | 3.5 |
The dataset split changed between COCO 2014 and COCO 2017. Based on community feedback, the 2017 release moved from an 83K/41K train/val configuration to a larger training set and smaller validation set.
| Split | Version | Images | Notes |
|---|---|---|---|
| train2014 | 2014 | 82,783 | Original training split |
| val2014 | 2014 | 40,504 | Original validation split |
| test2014 | 2014 | 40,775 | Original test split |
| train2017 | 2017 | 118,287 | Reorganized (train2014 + part of val2014) |
| val2017 | 2017 | 5,000 | Smaller validation set (subset of val2014) |
| test2017 | 2017 | 40,670 | Test images (annotations withheld) |
The test set is further divided into test-dev (used for development and frequent evaluation on public leaderboards) and test-challenge (used for official competition submissions). Roughly half of the test images (about 20,000) belong to each subset. Researchers must submit predictions to the official COCO evaluation server to obtain test results, preventing overfitting to the test set.
COCO provides five distinct types of annotations, each supporting different computer vision tasks.
Every object instance in COCO is annotated with both an axis-aligned bounding box and a pixel-level segmentation mask. The bounding box is defined by four values (x, y, width, height). The segmentation mask is stored as a polygon or, for crowd regions, as run-length encoding (RLE). This dual annotation supports both traditional bounding box detection and instance segmentation, where the goal is to produce a precise mask for each object. The dataset contains approximately 1.5 million object instances across the 80 categories, with an average of 7.7 instances per image. This density is significantly higher than PASCAL VOC (2.3 instances per image) and ImageNet (3.0 instances per image).
An "iscrowd" flag distinguishes individual object annotations from crowd annotations. When a region contains many closely packed instances of the same category (for example, a crowd of people), a single crowd annotation with an RLE mask is used instead of individual polygon masks.
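The record layout described above can be sketched in a few lines of plain Python. The annotation below is invented for illustration (real records come from files such as `instances_val2017.json`), but the field names follow the COCO format:

```python
# Minimal sketch of a COCO-style instance annotation record.
# All values below are invented for illustration only.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 18,                    # "dog" in the released category list
    "bbox": [100.0, 50.0, 80.0, 120.0],   # [x, y, width, height]
    "area": 9600.0,                       # mask area in pixels
    "iscrowd": 0,
    # iscrowd == 0: segmentation is a list of polygons, each a flat
    # [x1, y1, x2, y2, ...] list of vertex coordinates.
    "segmentation": [[100.0, 50.0, 180.0, 50.0, 180.0, 170.0, 100.0, 170.0]],
}

def describe(ann):
    """Return (bbox_area, mask_format) for a COCO-style annotation."""
    x, y, w, h = ann["bbox"]
    # iscrowd == 1 means the mask is stored as run-length encoding (RLE)
    # in a dict with "counts" and "size"; otherwise it is polygon(s).
    fmt = "rle" if ann["iscrowd"] == 1 else "polygon"
    return w * h, fmt

print(describe(annotation))  # (9600.0, 'polygon')
```

In practice the official `pycocotools` API handles loading and mask decoding; this sketch only shows how the `iscrowd` flag switches the mask representation.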
COCO includes keypoint annotations for over 250,000 person instances across more than 200,000 images. Each person is annotated with 17 body keypoints: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Each keypoint is stored as a triplet (x, y, v), where x and y are pixel coordinates and v is a visibility flag (v=0 means not labeled, v=1 means labeled but occluded, v=2 means labeled and visible). A skeleton definition connects these keypoints into a body structure for pose estimation tasks.
Each image in COCO comes with five independently written natural-language captions, totaling over 1.5 million captions across the dataset. These captions were collected through Amazon Mechanical Turk with specific guidelines requiring annotators to describe all important objects, their attributes, and their spatial relationships in at least eight words, while avoiding proper names and speculative statements. The widely used Karpathy split, proposed by Andrej Karpathy and Fei-Fei Li in 2015, divides the COCO 2014 data into training (113,287 images), validation (5,000 images), and test (5,000 images) subsets specifically optimized for captioning and vision-language research.
Stuff segmentation provides pixel-level labels for amorphous, uncountable regions such as sky, grass, road, water, and wall. Unlike "thing" categories (discrete, countable objects), "stuff" categories represent materials and surfaces. The COCO-Stuff extension, introduced by Holger Caesar, Jasper Uijlings, and Vittorio Ferrari at CVPR 2018, augmented all 164,000 images from COCO 2017 with annotations for 91 stuff categories. Combined with the 80 thing categories and 1 unlabeled class, COCO-Stuff provides dense pixel-level labels for 172 categories across every image.
Panoptic segmentation unifies instance segmentation of "things" and semantic segmentation of "stuff" into a single coherent output. Introduced by Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar at CVPR 2019, this task requires every pixel in an image to be assigned both a semantic label and an instance ID (for thing classes). The COCO panoptic annotations cover 133 categories: 80 thing classes and 53 stuff classes (a curated subset of the full 91 stuff categories). The Panoptic Quality (PQ) metric was proposed alongside this task to evaluate performance in a unified manner.
COCO annotations were collected through a multi-stage pipeline using Amazon Mechanical Turk (AMT), a crowdsourcing platform. The pipeline was designed to ensure high-quality, consistent annotations at scale.
In the first stage, workers determined which of the object categories were present in each image. To achieve high recall, each image was labeled by eight independent workers. Workers were shown images and asked to indicate the presence or absence of objects from each of 11 supercategories, then from specific categories within those supercategories. This hierarchical approach reduced the cognitive burden compared to asking about all categories simultaneously. This stage required approximately 20,000 worker hours.
For each category identified in an image, workers marked individual instances by placing a cross on each object. Each worker was asked to label at most 10 instances of a given category per image. Multiple workers annotated the same image to ensure all instances were found, and a verification stage confirmed the locations. This stage required approximately 10,000 worker hours.
In the final and most labor-intensive stage, workers traced precise segmentation masks around each identified instance. This process required roughly 22 worker hours per 1,000 segmentations. To maintain quality, workers were required to complete a mandatory training task before participating, and only approximately one in three workers passed the qualification test. The segmentation masks were further verified in a separate quality-control round.
The entire annotation effort produced over 2.5 million labeled instances across 328,000 images, representing one of the largest crowdsourced annotation projects in computer vision at the time.
COCO introduced a set of evaluation metrics that have become the standard for object detection and segmentation benchmarking. These metrics differ from those used by PASCAL VOC in important ways.
The primary metric in COCO evaluation is Average Precision (AP), which measures the area under the precision-recall curve. Unlike PASCAL VOC, which computes AP at a single Intersection over Union (IoU) threshold of 0.5, COCO computes AP averaged over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This metric, denoted AP@[.50:.05:.95], rewards models that produce tightly localized predictions, not just roughly correct bounding boxes. COCO also uses 101-point interpolated precision (sampling the precision-recall curve at 101 recall levels from 0 to 1 in steps of 0.01), providing finer-grained evaluation than the 11-point interpolation used by earlier versions of PASCAL VOC.
In COCO notation, "AP" without any qualifier refers to this averaged metric. This differs from other benchmarks where AP might refer to a single-threshold value. The term "mAP" (mean Average Precision) is sometimes used interchangeably with AP in other contexts, but in COCO, AP is already averaged over all categories.
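The threshold-averaging idea can be sketched as follows. This is a simplified illustration, not the full pycocotools pipeline, which additionally handles prediction-to-ground-truth matching, 101-point interpolation, and per-category averaging:

```python
def iou(box_a, box_b):
    """IoU of two boxes in COCO [x, y, width, height] format."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# The 10 COCO IoU thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

def coco_style_ap(ap_at_threshold):
    """Average per-threshold AP values (given as a callable t -> AP)
    over the 10 COCO thresholds, mimicking AP@[.50:.05:.95]."""
    return sum(ap_at_threshold(t) for t in THRESHOLDS) / len(THRESHOLDS)

print(round(iou([0, 0, 10, 10], [5, 5, 10, 10]), 4))  # 0.1429
```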
| Metric | Definition |
|---|---|
| AP | Average Precision averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05 (primary metric) |
| AP@50 (AP50) | AP at IoU threshold 0.50 (comparable to the PASCAL VOC metric) |
| AP@75 (AP75) | AP at IoU threshold 0.75 (strict localization requirement) |
| AP-Small (APS) | AP for small objects (area < 32 x 32 pixels) |
| AP-Medium (APM) | AP for medium objects (32 x 32 < area < 96 x 96 pixels) |
| AP-Large (APL) | AP for large objects (area > 96 x 96 pixels) |
Average Recall measures the maximum recall achievable given a fixed number of detections per image, averaged over all categories and the same 10 IoU thresholds (0.50 to 0.95).
| Metric | Definition |
|---|---|
| AR@1 | Average Recall with at most 1 detection per image |
| AR@10 | Average Recall with at most 10 detections per image |
| AR@100 | Average Recall with at most 100 detections per image |
| AR-Small | AR@100 for small objects (area < 32 x 32 pixels) |
| AR-Medium | AR@100 for medium objects (32 x 32 < area < 96 x 96 pixels) |
| AR-Large | AR@100 for large objects (area > 96 x 96 pixels) |
In COCO, small objects (area under 32 x 32 pixels) make up about 41% of all instances, medium objects account for about 34%, and large objects about 24%. This distribution makes the size-stratified metrics especially informative for understanding model behavior across different scales.
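The size buckets above translate directly into code. Note that the official evaluation uses segmentation-mask area (not bounding-box area) for this classification:

```python
SMALL_MAX = 32 ** 2    # 1024 px
MEDIUM_MAX = 96 ** 2   # 9216 px

def size_bucket(area):
    """Classify an object into COCO's small/medium/large size buckets
    by its mask area in pixels."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"

print(size_bucket(500), size_bucket(5000), size_bucket(50000))
# small medium large
```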
For keypoint detection, COCO uses Object Keypoint Similarity (OKS), which plays a role analogous to IoU in object detection. OKS measures the distance between predicted and ground-truth keypoint locations, normalized by the scale of the person and per-keypoint constants that account for annotation noise. AP and AR metrics are then computed using OKS thresholds instead of IoU thresholds, following the same 0.50 to 0.95 range.
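A simplified OKS computation might look like the sketch below. The per-keypoint constants are placeholders here; the official values (derived from annotation-noise statistics per keypoint type) ship with pycocotools:

```python
import math

def oks(pred, gt, vis, area, k):
    """Simplified Object Keypoint Similarity.

    pred, gt : lists of (x, y) keypoint coordinates
    vis      : ground-truth visibility flags (only v > 0 points count)
    area     : scale of the person instance (segment area, in pixels)
    k        : per-keypoint constants accounting for annotation noise
    """
    num, labeled = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, vis, k):
        if v == 0:            # unlabeled keypoints are ignored
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2.0 * area * ki ** 2))
        labeled += 1
    return num / labeled if labeled else 0.0

# Perfect predictions on two labeled keypoints give OKS = 1.0.
print(oks([(10, 10), (20, 20)], [(10, 10), (20, 20)],
          vis=[2, 2], area=900.0, k=[0.1, 0.1]))  # 1.0
```

Like IoU, OKS falls off smoothly from 1.0 (perfect placement) toward 0 as predicted keypoints drift from the ground truth, with the falloff scaled by person size.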
For panoptic segmentation, the Panoptic Quality metric decomposes performance into a Recognition Quality (RQ) component and a Segmentation Quality (SQ) component, such that PQ = RQ x SQ. This decomposition provides interpretable insights into whether errors stem from missed detections (low RQ) or imprecise segmentations (low SQ). PQ is reported separately for thing classes and stuff classes, as well as averaged across all categories.
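The PQ decomposition follows directly from its definition: predicted and ground-truth segments matched with IoU above 0.5 are true positives, and PQ = SQ x RQ. A sketch over precomputed match IoUs:

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Panoptic Quality from a list of matched-segment IoUs (the TPs)
    plus counts of unmatched predictions (FP) and ground truths (FN).

    Returns (pq, sq, rq) with pq == sq * rq.
    """
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                        # mean IoU over true positives
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # an F1-like recognition term
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(round(sq, 4), round(rq, 4), round(pq, 4))  # 0.7 0.6667 0.4667
```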
Image captioning performance on COCO is measured using standard natural language processing metrics: BLEU (1-gram through 4-gram), METEOR, ROUGE-L, CIDEr-D, and SPICE. The COCO Caption Evaluation Server computes all of these metrics against the five reference captions for each image.
The COCO Challenge series has been held annually since 2015, co-located with major computer vision conferences. These competitions have driven rapid progress across multiple tasks.
| Year | Venue | Challenge Tracks |
|---|---|---|
| 2015 | ICCV 2015 | Object Detection (bounding box + segmentation), Captioning |
| 2016 | ECCV 2016 | Object Detection, Keypoint Detection, Captioning |
| 2017 | ICCV 2017 | Object Detection, Keypoint Detection, Stuff Segmentation |
| 2018 | ECCV 2018 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
| 2019 | ICCV 2019 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
| 2020 | ECCV 2020 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
The early workshops (2015 and 2016) were held jointly with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Starting in 2018, the COCO workshops were held jointly with the Mapillary Vistas dataset challenge. Notable milestones include the introduction of keypoint detection in 2016, stuff segmentation in 2017, and both panoptic segmentation and DensePose in 2018. Bounding-box-only detection was phased out after 2017 in favor of instance segmentation with masks.
Several extensions have been built on top of the COCO dataset to address additional research needs.
COCO-Stuff (Caesar et al., CVPR 2018) provides dense pixel-level annotations for 91 stuff categories across all 164,000 images in COCO 2017. Combined with the 80 thing categories, COCO-Stuff enables holistic scene understanding where every pixel receives a semantic label. The stuff categories include surfaces (road, pavement, sand), natural elements (sky, clouds, sea, river), vegetation (grass, tree, flower), buildings and structures (wall, building, fence), textiles (carpet, curtain, blanket), food materials, and other environmental elements.
The COCO Captions component (Chen et al., 2015) provides five human-written captions per image for over 200,000 images. The caption evaluation server supports automatic metrics including BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. COCO Captions has become the primary benchmark for image captioning research and is widely used in training vision-language models.
COCO-WholeBody (Jin et al., ECCV 2020) extends the keypoint annotations from 17 body keypoints to 133 keypoints per person, including detailed annotations for the face (68 keypoints), hands (42 keypoints, 21 per hand), and feet (6 keypoints). This extension supports whole-body pose estimation and fine-grained gesture recognition.
DensePose-COCO (Guler et al., CVPR 2018) maps pixels of person instances to UV coordinates on a 3D body surface model, enabling dense pose estimation. The dataset provides annotations for approximately 50,000 person instances with over 5 million manually annotated correspondences. Annotators first segmented each person into semantic body parts, then mapped sampled points to rendered views of a 3D body model to obtain surface coordinates.
COCONut (Deng et al., 2024) is a modernized re-annotation of the COCO dataset with higher-quality segmentation masks, addressing annotation noise and inconsistencies found in the original COCO annotations through improved annotation procedures.
COCO-Text (Veit et al., 2016) adds text detection and recognition annotations to COCO images, identifying text instances in natural scenes. It contains annotations for over 173,000 text instances in more than 63,000 images.
COCO occupies a specific niche among object detection and segmentation benchmarks. The following table compares COCO with other prominent datasets in the field.
| Feature | PASCAL VOC 2012 | COCO 2017 | Open Images V7 | LVIS |
|---|---|---|---|---|
| Year introduced | 2012 | 2014 (reorganized 2017) | 2016 (V7 in 2022) | 2019 |
| Total images | ~11,500 | ~330,000 | ~9,000,000 | ~164,000 |
| Object categories | 20 | 80 | 600 (detection) | 1,203 |
| Object instances | ~27,450 | ~1,500,000 | ~16,000,000 (boxes) | ~2,000,000 |
| Instance segmentation | Subset only (~2,900 images) | Yes (all instances) | Yes (350 classes) | Yes (1,203 classes) |
| Avg. instances per image | ~2.3 | ~7.7 | ~8.4 | ~12.0 |
| Keypoint annotations | No | Yes (17 body keypoints) | No | No |
| Captions | No | Yes (5 per image) | Yes (localized narratives) | No |
| Panoptic segmentation | No | Yes (133 classes) | Limited | No |
| Stuff segmentation | No | Yes (91 classes via COCO-Stuff) | No | No |
| Long-tail distribution | No | No | Partial | Yes |
| Primary AP metric | AP@50 (single threshold) | AP@[.50:.05:.95] (10 thresholds) | AP@50 | AP (frequency-weighted) |
| Primary use case | Detection, segmentation | Detection, segmentation, captioning, pose | Large-scale detection | Fine-grained, rare-object segmentation |
PASCAL VOC is smaller and simpler, making it useful for quick prototyping but limited for evaluating models on complex, cluttered scenes. Open Images, created by Google, provides far more images and categories but uses a different annotation methodology with human-verified machine-generated labels. LVIS (Large Vocabulary Instance Segmentation), released in 2019 by Agrim Gupta, Piotr Dollar, and Ross Girshick, reuses the same COCO images but provides annotations for over 1,200 categories following a natural long-tail distribution. LVIS divides categories into frequent (appearing in more than 100 training images), common (11 to 100 images), and rare (1 to 10 images) groups, making it especially valuable for evaluating models on uncommon object categories.
The COCO annotation format has become a de facto standard in the computer vision community. Annotations are stored in JSON files with a specific structure.
| Top-Level Field | Description |
|---|---|
| info | Dataset metadata (version, description, year, contributor, date) |
| licenses | Image license information (id, name, URL) |
| images | List of image records (id, file_name, width, height, date_captured) |
| annotations | List of annotation records (varies by task type) |
| categories | List of category definitions (id, name, supercategory) |
For object detection, each annotation record includes an image ID, category ID, bounding box coordinates in [x, y, width, height] format, segmentation mask (as a polygon or run-length encoding), area, and an "iscrowd" flag that distinguishes individual instances from crowd regions.
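A minimal detection-style file with all five top-level fields might look like the toy example below, serialized with the standard library. Every id and value is invented for illustration:

```python
import json

# Toy COCO-style detection file with all five top-level fields.
coco = {
    "info": {"description": "toy dataset", "version": "1.0", "year": 2024},
    "licenses": [{"id": 1, "name": "CC BY 4.0", "url": ""}],
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "bbox": [10.0, 20.0, 100.0, 200.0],   # [x, y, width, height]
        "segmentation": [[10.0, 20.0, 110.0, 20.0,
                          110.0, 220.0, 10.0, 220.0]],
        "area": 20000.0, "iscrowd": 0,
    }],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}

# Round-trip through JSON, as a real annotation file would be stored.
loaded = json.loads(json.dumps(coco))
print(sorted(loaded.keys()))
# ['annotations', 'categories', 'images', 'info', 'licenses']
```

Annotations cross-reference images and categories by integer id, which is why tooling can stream or filter each list independently.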
Many annotation tools, training frameworks (including PyTorch, TensorFlow, and Detectron2), and evaluation libraries natively support the COCO JSON format. The official Python API (pycocotools) provides utilities for loading annotations, visualizing results, and computing evaluation metrics. This widespread adoption has made COCO format the common interchange standard for object detection datasets beyond COCO itself.
COCO has had a profound influence on the development of deep learning for visual recognition. Several factors contribute to its lasting impact.
First, the dataset's emphasis on instance segmentation pushed the field beyond bounding-box detection toward pixel-level understanding. Models such as Mask R-CNN (He et al., 2017), which jointly predicts bounding boxes and segmentation masks, were developed and evaluated primarily on COCO.
Second, the multi-IoU-threshold AP metric introduced by COCO raised the bar for localization quality. Under the PASCAL VOC metric (AP@50), any prediction with at least 50% IoU against the ground truth counted as correct, however loose the fit. The COCO metric, by averaging over thresholds up to IoU 0.95, incentivized models to produce much more precise predictions.
Third, COCO's rich annotations across multiple tasks (detection, segmentation, keypoints, captions) enabled the development of multi-task and multi-modal models. Vision-language models such as CLIP, GPT-4V, and various image captioning architectures frequently use COCO data for training or evaluation.
Fourth, the annual challenge series created a competitive ecosystem that accelerated progress. Between 2015 and 2023, detection AP on the COCO test-dev leaderboard climbed from the high 30s to the mid 60s, reflecting advances from convolutional neural networks with hand-crafted components (such as Faster R-CNN) to end-to-end transformer-based architectures (such as DETR and its successors).
Fifth, the COCO JSON format and evaluation API have become foundational infrastructure in the computer vision toolchain. Virtually every object detection paper published since 2015 reports COCO metrics, and many new datasets adopt the COCO format for compatibility.
Sixth, models pre-trained on COCO serve as starting points for detection and segmentation tasks in other domains, including medical imaging, autonomous driving, satellite imagery analysis, and robotics. COCO pre-trained weights are available for most major detection frameworks.
The original paper has accumulated over 49,000 citations on Google Scholar, making it one of the most cited computer vision papers ever published.
Despite its wide adoption, COCO has several known limitations. The 80 object categories, while covering common everyday objects, represent only a small fraction of the visual concepts that humans can recognize. The dataset primarily contains images from North America and Europe, introducing geographic and cultural biases in the types of objects, scenes, and contexts represented. Annotation quality, while generally high, varies across instances; some segmentation masks contain errors, particularly for objects with complex boundaries or heavy occlusion. The dataset's object categories follow a relatively uniform distribution, which does not reflect the natural long-tail frequency of objects in the real world. This limitation motivated the creation of LVIS, which annotates over 1,200 categories on the same COCO images with a natural long-tail distribution. Researchers have also noted biases in the caption annotations, including gender and racial stereotypes present in the natural-language descriptions. As detection AP scores on COCO continue to climb, some researchers have raised concerns that the benchmark may be approaching saturation, prompting interest in harder evaluation protocols and more challenging datasets.