COCO (Common Objects in Context) is a large-scale dataset for object detection, image segmentation, keypoint detection, and image captioning. Created by a team led by Tsung-Yi Lin at Microsoft Research and introduced at the European Conference on Computer Vision (ECCV) in 2014, COCO has become the most widely used benchmark for evaluating computer vision models. The dataset contains over 330,000 images with more than 1.5 million object instances spanning 80 object categories, all annotated with per-instance segmentation masks. With over 49,000 citations as of 2025, the original COCO paper is one of the most cited publications in the history of computer vision research.
Before COCO, the dominant benchmark for object detection and recognition was the PASCAL VOC dataset, which contained 20 object categories and roughly 11,500 images. While PASCAL VOC served the community well from 2005 onward, it had limitations in scale, category diversity, and scene complexity. Many images in existing datasets were "iconic," meaning they showed a single, centered object against a clean background. Real-world scenes, by contrast, contain multiple objects in cluttered environments, with partial occlusions and varying scales.
The COCO project was motivated by the need for a dataset that captured this complexity. The authors, Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick, drew from institutions including Cornell University, California Institute of Technology, Brown University, UC Irvine, and Microsoft Research. Lubomir Bourdev (Facebook AI Research) and Ross Girshick (Microsoft Research, Redmond) also contributed to the project. Their goal was to build a dataset with three key properties: a large number of object instances per image, rich contextual relationships between objects, and precise spatial localization through per-instance segmentation masks rather than just bounding boxes.
The paper, "Microsoft COCO: Common Objects in Context," was published at ECCV 2014 and made available on arXiv in May 2014 (arXiv:1405.0312).
All images in COCO were sourced from Flickr, a photo-sharing platform where users upload images under various Creative Commons licenses. The choice of Flickr was deliberate: the platform tends to host photographs taken by amateur photographers in everyday settings, which produces more natural, non-iconic images compared to stock photography or web-scraped search results.
To find images that contained multiple objects in realistic contexts, the dataset creators searched Flickr using pairwise combinations of object category names. For example, rather than searching for "dog" alone (which tends to return iconic, centered images of dogs), they searched for pairs such as "dog + car" or "person + bicycle." This strategy produced images where objects appeared at various scales, viewpoints, and levels of occlusion, reflecting how objects naturally co-occur in real scenes.
The annotations are released under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits redistribution and commercial use with proper attribution. The images themselves retain the Creative Commons licenses chosen by their original Flickr uploaders, which are recorded per image in the annotation files.
COCO defines 80 object categories for detection and segmentation tasks. The original paper proposed 91 categories, but 11 of them (hat, shoe, eye glasses, plate, mirror, window, desk, door, blender, hair brush, and street sign) lacked sufficient segmentation annotations and were dropped from the released dataset. The categories were selected to represent objects that a young child would easily recognize and that occur frequently in everyday life. The 80 categories are organized into 12 supercategories:
| Supercategory | Object Categories |
|---|---|
| Person | person |
| Vehicle | bicycle, car, motorcycle, airplane, bus, train, truck, boat |
| Outdoor | traffic light, fire hydrant, stop sign, parking meter, bench |
| Animal | bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe |
| Accessory | backpack, umbrella, handbag, tie, suitcase |
| Sports | frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket |
| Kitchen | bottle, wine glass, cup, fork, knife, spoon, bowl |
| Food | banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake |
| Furniture | chair, couch, potted plant, bed, dining table, toilet |
| Electronic | tv, laptop, mouse, remote, keyboard, cell phone |
| Appliance | microwave, oven, toaster, sink, refrigerator |
| Indoor | book, clock, vase, scissors, teddy bear, hair drier, toothbrush |
Each category was chosen to have a sufficient number of instances across the dataset to support training and evaluation of statistical models.
The COCO dataset has been released in multiple versions. The 2017 reorganization is the most commonly used today. The labeled images are the same across versions; the 2017 release simply redistributed them across splits. It also published a separate set of roughly 123,000 unlabeled images to support semi-supervised and unsupervised research.
| Statistic | Value |
|---|---|
| Total images | ~330,000 |
| Labeled images (with annotations) | ~200,000 |
| Object instances (thing categories) | ~1,500,000 |
| Thing categories | 80 |
| Stuff categories (COCO-Stuff) | 91 |
| Panoptic categories (combined) | 133 (80 things + 53 stuff) |
| Captions | Over 1.5 million (5 per image) |
| Keypoint-annotated person instances | ~250,000 (17 keypoints each) |
| Average object instances per image | 7.7 |
| Average categories per image | 3.5 |
The dataset split changed between COCO 2014 and COCO 2017. Based on community feedback, the 2017 release moved from an 83K/41K train/val configuration to a larger training set and smaller validation set.
| Split | Version | Images | Notes |
|---|---|---|---|
| train2014 | 2014 | 82,783 | Original training split |
| val2014 | 2014 | 40,504 | Original validation split |
| test2014 | 2014 | 40,775 | Original test split |
| train2017 | 2017 | 118,287 | Reorganized (train2014 + part of val2014) |
| val2017 | 2017 | 5,000 | Smaller validation set (subset of val2014) |
| test2017 | 2017 | 40,670 | Test images (annotations withheld) |
The test set is further divided into test-dev (used for development and frequent evaluation on public leaderboards) and test-challenge (used for official competition submissions). Roughly half of the test images (about 20,000) belong to each subset. Researchers must submit predictions to the official COCO evaluation server to obtain test results, preventing overfitting to the test set.
COCO provides five distinct types of annotations, each supporting different computer vision tasks.
Every object instance in COCO is annotated with both an axis-aligned bounding box and a pixel-level segmentation mask. The bounding box is defined by four values (x, y, width, height). The segmentation mask is stored as a polygon or, for crowd regions, as run-length encoding (RLE). This dual annotation supports both traditional bounding box detection and instance segmentation, where the goal is to produce a precise mask for each object. The dataset contains approximately 1.5 million object instances across the 80 categories, with an average of 7.7 instances per image. This density is significantly higher than PASCAL VOC (2.3 instances per image) and ImageNet (3.0 instances per image).
An "iscrowd" flag distinguishes individual object annotations from crowd annotations. When a region contains many closely packed instances of the same category (for example, a crowd of people), a single crowd annotation with an RLE mask is used instead of individual polygon masks.
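The record layout described above can be sketched in a few lines of plain Python. The annotation below is invented for illustration (real records come from files such as `instances_val2017.json`), but the field names follow the COCO format:

```python
# Minimal sketch of a COCO-style instance annotation record.
# All values below are invented for illustration only.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 18,                    # "dog" in the released category list
    "bbox": [100.0, 50.0, 80.0, 120.0],   # [x, y, width, height]
    "area": 9600.0,                       # mask area in pixels
    "iscrowd": 0,
    # iscrowd == 0: segmentation is a list of polygons, each a flat
    # [x1, y1, x2, y2, ...] list of vertex coordinates.
    "segmentation": [[100.0, 50.0, 180.0, 50.0, 180.0, 170.0, 100.0, 170.0]],
}

def describe(ann):
    """Return (bbox_area, mask_format) for a COCO-style annotation."""
    x, y, w, h = ann["bbox"]
    # iscrowd == 1 means the mask is stored as run-length encoding (RLE)
    # in a dict with "counts" and "size"; otherwise it is polygon(s).
    fmt = "rle" if ann["iscrowd"] == 1 else "polygon"
    return w * h, fmt

print(describe(annotation))  # (9600.0, 'polygon')
```

In practice the official `pycocotools` API handles loading and mask decoding; this sketch only shows how the `iscrowd` flag switches the mask representation.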
COCO includes keypoint annotations for over 250,000 person instances across more than 200,000 images. Each person is annotated with 17 body keypoints: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Each keypoint is stored as a triplet (x, y, v), where x and y are pixel coordinates and v is a visibility flag (v=0 means not labeled, v=1 means labeled but occluded, v=2 means labeled and visible). A skeleton definition connects these keypoints into a body structure for pose estimation tasks.
Each image in COCO comes with five independently written natural-language captions, totaling over 1.5 million captions across the dataset. These captions were collected through Amazon Mechanical Turk with specific guidelines requiring annotators to describe all important objects, their attributes, and their spatial relationships in at least eight words, while avoiding proper names and speculative statements. The widely used Karpathy split, proposed by Andrej Karpathy and Fei-Fei Li in 2015, divides the COCO 2014 data into training (113,287 images), validation (5,000 images), and test (5,000 images) subsets specifically optimized for captioning and vision-language research.
Stuff segmentation provides pixel-level labels for amorphous, uncountable regions such as sky, grass, road, water, and wall. Unlike "thing" categories (discrete, countable objects), "stuff" categories represent materials and surfaces. The COCO-Stuff extension, introduced by Holger Caesar, Jasper Uijlings, and Vittorio Ferrari at CVPR 2018, augmented all 164,000 images from COCO 2017 with annotations for 91 stuff categories. Combined with the 80 thing categories and 1 unlabeled class, COCO-Stuff provides dense pixel-level labels for 172 categories across every image.
Panoptic segmentation unifies instance segmentation of "things" and semantic segmentation of "stuff" into a single coherent output. Introduced by Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar at CVPR 2019, this task requires every pixel in an image to be assigned both a semantic label and an instance ID (for thing classes). The COCO panoptic annotations cover 133 categories: 80 thing classes and 53 stuff classes (a curated subset of the full 91 stuff categories). The Panoptic Quality (PQ) metric was proposed alongside this task to evaluate performance in a unified manner.
COCO annotations were collected through a multi-stage pipeline using Amazon Mechanical Turk (AMT), a crowdsourcing platform. The pipeline was designed to ensure high-quality, consistent annotations at scale.
In the first stage, workers determined which of the object categories were present in each image. To achieve high recall, each image was labeled by eight independent workers. Workers were shown images and asked to indicate the presence or absence of objects from each of 11 supercategories, then from specific categories within those supercategories. This hierarchical approach reduced the cognitive burden compared to asking about all categories simultaneously. This stage required approximately 20,000 worker hours.
For each category identified in an image, workers marked individual instances by placing a cross on each object. Each worker was asked to label at most 10 instances of a given category per image. Multiple workers annotated the same image to ensure all instances were found, and a verification stage confirmed the locations. This stage required approximately 10,000 worker hours.
In the final and most labor-intensive stage, workers traced precise segmentation masks around each identified instance. This process required roughly 22 worker hours per 1,000 segmentations. To maintain quality, workers were required to complete a mandatory training task before participating, and only approximately one in three workers passed the qualification test. The segmentation masks were further verified in a separate quality-control round.
The entire annotation effort produced over 2.5 million labeled instances across 328,000 images, representing one of the largest crowdsourced annotation projects in computer vision at the time.
COCO introduced a set of evaluation metrics that have become the standard for object detection and segmentation benchmarking. These metrics differ from those used by PASCAL VOC in important ways.
The primary metric in COCO evaluation is Average Precision (AP), which measures the area under the precision-recall curve. Unlike PASCAL VOC, which computes AP at a single Intersection over Union (IoU) threshold of 0.5, COCO computes AP averaged over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This metric, denoted AP@[.50:.05:.95], rewards models that produce tightly localized predictions, not just roughly correct bounding boxes. COCO also uses 101-point interpolated precision (sampling the precision-recall curve at 101 recall levels from 0 to 1 in steps of 0.01), providing finer-grained evaluation than the 11-point interpolation used by earlier versions of PASCAL VOC.
In COCO notation, "AP" without any qualifier refers to this averaged metric. This differs from other benchmarks where AP might refer to a single-threshold value. The term "mAP" (mean Average Precision) is sometimes used interchangeably with AP in other contexts, but in COCO, AP is already averaged over all categories.
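The threshold-averaging idea can be sketched as follows. This is a simplified illustration, not the full pycocotools pipeline, which additionally handles prediction-to-ground-truth matching, 101-point interpolation, and per-category averaging:

```python
def iou(box_a, box_b):
    """IoU of two boxes in COCO [x, y, width, height] format."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# The 10 COCO IoU thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

def coco_style_ap(ap_at_threshold):
    """Average per-threshold AP values (given as a callable t -> AP)
    over the 10 COCO thresholds, mimicking AP@[.50:.05:.95]."""
    return sum(ap_at_threshold(t) for t in THRESHOLDS) / len(THRESHOLDS)

print(round(iou([0, 0, 10, 10], [5, 5, 10, 10]), 4))  # 0.1429
```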
| Metric | Definition |
|---|---|
| AP | Average Precision averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05 (primary metric) |
| AP@50 (AP50) | AP at IoU threshold 0.50 (comparable to the PASCAL VOC metric) |
| AP@75 (AP75) | AP at IoU threshold 0.75 (strict localization requirement) |
| AP-Small (APS) | AP for small objects (area < 32 x 32 pixels) |
| AP-Medium (APM) | AP for medium objects (32 x 32 < area < 96 x 96 pixels) |
| AP-Large (APL) | AP for large objects (area > 96 x 96 pixels) |
Average Recall measures the maximum recall achievable given a fixed number of detections per image, averaged over all categories and the same 10 IoU thresholds (0.50 to 0.95).
| Metric | Definition |
|---|---|
| AR@1 | Average Recall with at most 1 detection per image |
| AR@10 | Average Recall with at most 10 detections per image |
| AR@100 | Average Recall with at most 100 detections per image |
| AR-Small | AR@100 for small objects (area < 32 x 32 pixels) |
| AR-Medium | AR@100 for medium objects (32 x 32 < area < 96 x 96 pixels) |
| AR-Large | AR@100 for large objects (area > 96 x 96 pixels) |
In COCO, small objects (area under 32 x 32 pixels) make up about 41% of all instances, medium objects account for about 34%, and large objects about 24%. This distribution makes the size-stratified metrics especially informative for understanding model behavior across different scales.
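The size buckets above translate directly into code. Note that the official evaluation uses segmentation-mask area (not bounding-box area) for this classification:

```python
SMALL_MAX = 32 ** 2    # 1024 px
MEDIUM_MAX = 96 ** 2   # 9216 px

def size_bucket(area):
    """Classify an object into COCO's small/medium/large size buckets
    by its mask area in pixels."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"

print(size_bucket(500), size_bucket(5000), size_bucket(50000))
# small medium large
```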
For keypoint detection, COCO uses Object Keypoint Similarity (OKS), which plays a role analogous to IoU in object detection. OKS measures the distance between predicted and ground-truth keypoint locations, normalized by the scale of the person and per-keypoint constants that account for annotation noise. AP and AR metrics are then computed using OKS thresholds instead of IoU thresholds, following the same 0.50 to 0.95 range.
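A simplified OKS computation might look like the sketch below. The per-keypoint constants are placeholders here; the official values (derived from annotation-noise statistics per keypoint type) ship with pycocotools:

```python
import math

def oks(pred, gt, vis, area, k):
    """Simplified Object Keypoint Similarity.

    pred, gt : lists of (x, y) keypoint coordinates
    vis      : ground-truth visibility flags (only v > 0 points count)
    area     : scale of the person instance (segment area, in pixels)
    k        : per-keypoint constants accounting for annotation noise
    """
    num, labeled = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, vis, k):
        if v == 0:            # unlabeled keypoints are ignored
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2.0 * area * ki ** 2))
        labeled += 1
    return num / labeled if labeled else 0.0

# Perfect predictions on two labeled keypoints give OKS = 1.0.
print(oks([(10, 10), (20, 20)], [(10, 10), (20, 20)],
          vis=[2, 2], area=900.0, k=[0.1, 0.1]))  # 1.0
```

Like IoU, OKS falls off smoothly from 1.0 (perfect placement) toward 0 as predicted keypoints drift from the ground truth, with the falloff scaled by person size.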
For panoptic segmentation, the Panoptic Quality metric decomposes performance into a Recognition Quality (RQ) component and a Segmentation Quality (SQ) component, such that PQ = RQ x SQ. This decomposition provides interpretable insights into whether errors stem from missed detections (low RQ) or imprecise segmentations (low SQ). PQ is reported separately for thing classes and stuff classes, as well as averaged across all categories.
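The PQ decomposition follows directly from its definition: predicted and ground-truth segments matched with IoU above 0.5 are true positives, and PQ = SQ x RQ. A sketch over precomputed match IoUs:

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Panoptic Quality from a list of matched-segment IoUs (the TPs)
    plus counts of unmatched predictions (FP) and ground truths (FN).

    Returns (pq, sq, rq) with pq == sq * rq.
    """
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                        # mean IoU over true positives
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # an F1-like recognition term
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(round(sq, 4), round(rq, 4), round(pq, 4))  # 0.7 0.6667 0.4667
```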
Image captioning performance on COCO is measured using standard natural language processing metrics: BLEU (1-gram through 4-gram), METEOR, ROUGE-L, CIDEr-D, and SPICE. The COCO Caption Evaluation Server computes all of these metrics against the five reference captions for each image.
The COCO Challenge series has been held annually since 2015, co-located with major computer vision conferences. These competitions have driven rapid progress across multiple tasks.
| Year | Venue | Challenge Tracks |
|---|---|---|
| 2015 | ICCV 2015 | Object Detection (bounding box + segmentation), Captioning |
| 2016 | ECCV 2016 | Object Detection, Keypoint Detection, Captioning |
| 2017 | ICCV 2017 | Object Detection, Keypoint Detection, Stuff Segmentation |
| 2018 | ECCV 2018 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
| 2019 | ICCV 2019 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
| 2020 | ECCV 2020 | Instance Segmentation, Panoptic Segmentation, Keypoint Detection, DensePose |
The early workshops (2015 and 2016) were held jointly with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Starting in 2018, the COCO workshops were held jointly with the Mapillary Vistas dataset challenge. Notable milestones include the introduction of keypoint detection in 2016, stuff segmentation in 2017, and both panoptic segmentation and DensePose in 2018. Bounding-box-only detection was phased out after 2017 in favor of instance segmentation with masks.
Several extensions have been built on top of the COCO dataset to address additional research needs.
COCO-Stuff (Caesar et al., CVPR 2018) provides dense pixel-level annotations for 91 stuff categories across all 164,000 images in COCO 2017. Combined with the 80 thing categories, COCO-Stuff enables holistic scene understanding where every pixel receives a semantic label. The stuff categories include surfaces (road, pavement, sand), natural elements (sky, clouds, sea, river), vegetation (grass, tree, flower), buildings and structures (wall, building, fence), textiles (carpet, curtain, blanket), food materials, and other environmental elements.
The COCO Captions component (Chen et al., 2015) provides five human-written captions per image for over 200,000 images. The caption evaluation server supports automatic metrics including BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. COCO Captions has become the primary benchmark for image captioning research and is widely used in training vision-language models.
COCO-WholeBody (Jin et al., ECCV 2020) extends the keypoint annotations from 17 body keypoints to 133 keypoints per person, including detailed annotations for the face (68 keypoints), hands (42 keypoints, 21 per hand), and feet (6 keypoints). This extension supports whole-body pose estimation and fine-grained gesture recognition.
DensePose-COCO (Guler et al., CVPR 2018) maps pixels of person instances to UV coordinates on a 3D body surface model, enabling dense pose estimation. The dataset provides annotations for approximately 50,000 person instances with over 5 million manually annotated correspondences. Annotators first segmented each person into semantic body parts, then mapped sampled points to rendered views of a 3D body model to obtain surface coordinates.
COCONut (Deng et al., 2024) is a modernized re-annotation of the COCO dataset with higher-quality segmentation masks, addressing annotation noise and inconsistencies found in the original COCO annotations through improved annotation procedures.
COCO-Text (Veit et al., 2016) adds text detection and recognition annotations to COCO images, identifying text instances in natural scenes. It contains annotations for over 173,000 text instances in more than 63,000 images.
COCO occupies a specific niche among object detection and segmentation benchmarks. The following table compares COCO with other prominent datasets in the field.
| Feature | PASCAL VOC 2012 | COCO 2017 | Open Images V7 | LVIS |
|---|---|---|---|---|
| Year introduced | 2012 | 2014 (reorganized 2017) | 2016 (V7 in 2022) | 2019 |
| Total images | ~11,500 | ~330,000 | ~9,000,000 | ~164,000 |
| Object categories | 20 | 80 | 600 (detection) | 1,203 |
| Object instances | ~27,450 | ~1,500,000 | ~16,000,000 (boxes) | ~2,000,000 |
| Instance segmentation | Subset only (~2,900 images) | Yes (all instances) | Yes (350 classes) | Yes (1,203 classes) |
| Avg. instances per image | ~2.3 | ~7.7 | ~8.4 | ~12.0 |
| Keypoint annotations | No | Yes (17 body keypoints) | No | No |
| Captions | No | Yes (5 per image) | Yes (localized narratives) | No |
| Panoptic segmentation | No | Yes (133 classes) | Limited | No |
| Stuff segmentation | No | Yes (91 classes via COCO-Stuff) | No | No |
| Long-tail distribution | No | No | Partial | Yes |
| Primary AP metric | AP@50 (single threshold) | AP@[.50:.05:.95] (10 thresholds) | AP@50 | AP (frequency-weighted) |
| Primary use case | Detection, segmentation | Detection, segmentation, captioning, pose | Large-scale detection | Fine-grained, rare-object segmentation |
PASCAL VOC is smaller and simpler, making it useful for quick prototyping but limited for evaluating models on complex, cluttered scenes. Open Images, created by Google, provides far more images and categories but uses a different annotation methodology with human-verified machine-generated labels. LVIS (Large Vocabulary Instance Segmentation), released in 2019 by Agrim Gupta, Piotr Dollar, and Ross Girshick, reuses the same COCO images but provides annotations for over 1,200 categories following a natural long-tail distribution. LVIS divides categories into frequent (appearing in more than 100 training images), common (11 to 100 images), and rare (1 to 10 images) groups, making it especially valuable for evaluating models on uncommon object categories.
The COCO annotation format has become a de facto standard in the computer vision community. Annotations are stored in JSON files with a specific structure.
| Top-Level Field | Description |
|---|---|
| info | Dataset metadata (version, description, year, contributor, date) |
| licenses | Image license information (id, name, URL) |
| images | List of image records (id, file_name, width, height, date_captured) |
| annotations | List of annotation records (varies by task type) |
| categories | List of category definitions (id, name, supercategory) |
For object detection, each annotation record includes an image ID, category ID, bounding box coordinates in [x, y, width, height] format, segmentation mask (as a polygon or run-length encoding), area, and an "iscrowd" flag that distinguishes individual instances from crowd regions.
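A minimal detection-style file with all five top-level fields might look like the toy example below, serialized with the standard library. Every id and value is invented for illustration:

```python
import json

# Toy COCO-style detection file with all five top-level fields.
coco = {
    "info": {"description": "toy dataset", "version": "1.0", "year": 2024},
    "licenses": [{"id": 1, "name": "CC BY 4.0", "url": ""}],
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "bbox": [10.0, 20.0, 100.0, 200.0],   # [x, y, width, height]
        "segmentation": [[10.0, 20.0, 110.0, 20.0,
                          110.0, 220.0, 10.0, 220.0]],
        "area": 20000.0, "iscrowd": 0,
    }],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}

# Round-trip through JSON, as a real annotation file would be stored.
loaded = json.loads(json.dumps(coco))
print(sorted(loaded.keys()))
# ['annotations', 'categories', 'images', 'info', 'licenses']
```

Annotations cross-reference images and categories by integer id, which is why tooling can stream or filter each list independently.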
Many annotation tools, training frameworks (including PyTorch, TensorFlow, and Detectron2), and evaluation libraries natively support the COCO JSON format. The official Python API (pycocotools) provides utilities for loading annotations, visualizing results, and computing evaluation metrics. This widespread adoption has made COCO format the common interchange standard for object detection datasets beyond COCO itself.
COCO has had a profound influence on the development of deep learning for visual recognition. Several factors contribute to its lasting impact.
First, the dataset's emphasis on instance segmentation pushed the field beyond bounding-box detection toward pixel-level understanding. Models such as Mask R-CNN (He et al., 2017), which jointly predicts bounding boxes and segmentation masks, were developed and evaluated primarily on COCO.
Second, the multi-IoU-threshold AP metric introduced by COCO raised the bar for localization quality. Under the PASCAL VOC metric (AP@50), any prediction with at least 50% IoU against the ground truth counted as correct, however loose the fit. The COCO metric, by averaging over thresholds up to IoU 0.95, incentivized models to produce much more precise predictions.
Third, COCO's rich annotations across multiple tasks (detection, segmentation, keypoints, captions) enabled the development of multi-task and multi-modal models. Vision-language models such as CLIP, GPT-4V, and various image captioning architectures frequently use COCO data for training or evaluation.
Fourth, the annual challenge series created a competitive ecosystem that accelerated progress. Between 2015 and 2023, detection AP on the COCO test-dev leaderboard climbed from the high 30s to the mid 60s, reflecting advances from convolutional neural networks with hand-crafted components (such as Faster R-CNN) to end-to-end transformer-based architectures (such as DETR and its successors).
Fifth, the COCO JSON format and evaluation API have become foundational infrastructure in the computer vision toolchain. Virtually every object detection paper published since 2015 reports COCO metrics, and many new datasets adopt the COCO format for compatibility.
Sixth, models pre-trained on COCO serve as starting points for detection and segmentation tasks in other domains, including medical imaging, autonomous driving, satellite imagery analysis, and robotics. COCO pre-trained weights are available for most major detection frameworks.
The original paper has accumulated over 49,000 citations on Google Scholar, making it one of the most cited computer vision papers ever published.
Despite its wide adoption, COCO has several known limitations. The 80 object categories, while covering common everyday objects, represent only a small fraction of the visual concepts that humans can recognize. The dataset primarily contains images from North America and Europe, introducing geographic and cultural biases in the types of objects, scenes, and contexts represented. Annotation quality, while generally high, varies across instances; some segmentation masks contain errors, particularly for objects with complex boundaries or heavy occlusion. The dataset's object categories follow a relatively uniform distribution, which does not reflect the natural long-tail frequency of objects in the real world. This limitation motivated the creation of LVIS, which annotates over 1,200 categories on the same COCO images with a natural long-tail distribution. Researchers have also noted biases in the caption annotations, including gender and racial stereotypes present in the natural-language descriptions. As detection AP scores on COCO continue to climb, some researchers have raised concerns that the benchmark may be approaching saturation, prompting interest in harder evaluation protocols and more challenging datasets.