# PASCAL VOC

> Source: https://aiwiki.ai/wiki/pascal_voc
> Updated: 2026-06-23
> Categories: AI Benchmarks, Computer Vision, Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**PASCAL VOC** (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes) is a long-running benchmark dataset and annual challenge for object recognition, [object detection](/wiki/object_detection), segmentation, action classification, and person layout in [computer vision](/wiki/computer_vision). The challenge ran every year from 2005 to 2012 under the umbrella of the PASCAL Network of Excellence, an EU-funded research consortium, and produced a series of standardized image datasets, ground-truth annotations, evaluation protocols, and reference software that shaped how the field measured progress for more than a decade [1][2]. The defining 20-class taxonomy and the VOC2007 set of 9,963 images with 24,640 annotated objects became the canonical sandbox for detection research, evaluated with mean Average Precision at an [Intersection over Union](/wiki/intersection_over_union_iou) threshold of 0.5 [1][13]. The organizers described the challenge as "a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures" [1].

Though later overtaken in size by [ImageNet](/wiki/imagenet) and in scope by [MS COCO](/wiki/coco_dataset), PASCAL VOC shaped a generation of detection research. Its 20-class catalogue served as the shared object vocabulary for detectors including [R-CNN](/wiki/r_cnn), [Fast R-CNN](/wiki/fast_r_cnn), [Faster R-CNN](/wiki/faster_r_cnn), [YOLO](/wiki/yolo), and [SSD](/wiki/ssd_object_detection) [3][4][5][6][17]. Mean Average Precision at a single IoU threshold of 0.5 is still routinely reported as "PASCAL-style AP," and the challenge's bounding-box and segmentation annotation conventions seeded the data formats used by most modern detection benchmarks [1][2].

Following the death of organizer Mark Everingham in 2012, the PAMI Mark Everingham Prize was instituted to honor researchers who make selfless contributions to the computer vision community through datasets, software, challenges, and other forms of service rather than through individual research breakthroughs [7][8].

## What was the PASCAL Network of Excellence?

PASCAL stood for Pattern Analysis, Statistical Modelling and Computational Learning. It was a Network of Excellence funded by the European Union under the Sixth Framework Programme for research and technological development. The official start date was 1 December 2003, and the network was coordinated by John Shawe-Taylor, then at the University of Southampton, with extensive participation from the University of Edinburgh, ETH Zurich, KU Leuven, the University of Oxford, the University of Leeds, and Microsoft Research Cambridge among many other institutions across Europe [9].

The network's stated aim was to build a Europe-wide distributed institute pioneering principled methods of pattern analysis, statistical modelling, and computational learning. PASCAL ran until roughly 2008, after which a follow-on network called PASCAL2 continued under the Seventh Framework Programme. PASCAL2 ran from 2008 to about 2013 and refocused on adaptive cognitive systems and robotics [10].

The Visual Object Classes Challenge was one of several community competitions launched within PASCAL. From the network's perspective the challenge was a relatively small project, but it grew into the most visible scientific output associated with the PASCAL brand and is, today, the artifact most people remember when they hear "PASCAL" in a vision context [1][2].

## Who organized PASCAL VOC?

The core team behind VOC remained largely stable across the eight years the challenge ran [1][2]:

- **Mark Everingham** (University of Leeds) served as the operational lead and chief annotator-wrangler. He took on the bulk of the annotation pipeline, the evaluation server, and the development kit. Everingham died in 2012; the [precision-recall](/wiki/precision_recall) evaluation and the development kit released that year were largely his work, and the PAMI Mark Everingham Prize was later established in his memory.
- **Luc Van Gool** (ETH Zurich and KU Leuven) brought computer vision research expertise and helped supply images and infrastructure.
- **Christopher K. I. Williams** (University of Edinburgh) contributed statistical and machine-learning expertise, particularly in evaluation methodology.
- **John Winn** (Microsoft Research Cambridge, then [Microsoft Research](/wiki/microsoft_research)) helped design the segmentation task and the Bayesian aspects of evaluation.
- **Andrew Zisserman** (University of Oxford, Visual Geometry Group) contributed images, annotation guidelines, and broader research direction.

Additional collaborators included S. M. Ali Eslami, Yusuf Aytar, and Alexander Sorokin, who joined for individual editions to handle annotation, segmentation tooling, and crowdsourcing. The 2015 retrospective paper was authored by Everingham (posthumously), Eslami, Van Gool, Williams, Winn, and Zisserman [2].

## What were the PASCAL VOC editions (2005-2012)?

The challenge grew from a small four-class pilot in 2005 to a 20-class benchmark with multiple parallel competitions by 2012. Each year's edition was released as a development kit including images, XML annotations, evaluation code, and a results submission server. The numbers below are taken from the official VOC challenge pages [1][11][12][13][14]:

| Edition | Classes | Total images | Annotated objects | Notes |
|---|---|---|---|---|
| VOC2005 | 4 | 1,578 | 2,209 | Pilot; classes were motorbikes, bicycles, people, cars. Used existing image collections. Tasks: classification and detection. |
| VOC2006 | 10 | 5,304 | 9,507 | Added bus, cat, cow, dog, horse, sheep. First train/val/test split designed from scratch. |
| VOC2007 | 20 | 9,963 | 24,640 | Established the canonical 20-class taxonomy. Introduced segmentation taster and person layout taster. **Test annotations were released after the challenge** (on 6 November 2007), which is why VOC2007 became the most-used VOC benchmark in the literature [13]. |
| VOC2008 | 20 | 4,340 (trainval) | 10,363 | Segmentation became a full competition. Test annotations not released. |
| VOC2009 | 20 | 7,054 (trainval) | 17,218 | Cumulative design: dataset built by augmenting VOC2008 with newly labeled images; trainval was reused for VOC2010 and beyond. 3,211 segmentations. |
| VOC2010 | 20 | 10,103 (trainval) | 23,374 | 4,203 segmentations. Action classification taster introduced. |
| VOC2011 | 20 | 11,530 (trainval) | 27,450 | 5,034 segmentations. Action classification became a competition. |
| VOC2012 | 20 | 11,540 (trainval) | 27,450 | 6,929 segmentations. Final edition. Person layout taster competition retained. |

The "cumulative" design is worth a note: from 2009 onward, each year's training and validation set was a superset of the previous year's, which let researchers train on the union of all PASCAL data without worrying about overlapping splits. Counting the withheld test images as well, the full dataset comes to roughly 21,738 images for VOC2010 and around 31,000 by VOC2012 [1][2].

A persistent quirk is that test set annotations were only ever published for VOC2007. From 2008 onwards, ground truth for the test partition was kept private and evaluation was performed via an online submission server. The most reproducible reported numbers in the literature were therefore on VOC2007's test split, where labels had been released, locking in VOC2007 as the de facto sandbox for years of detection research [2].

No formal VOC challenge was run after 2012, but the Oxford VGG website continued to host the VOC2012 evaluation server, and as of the mid-2020s it was still receiving submissions [11].

## What are the 20 PASCAL VOC object classes?

From VOC2007 onwards, the challenge fixed a 20-class taxonomy organized into four super-categories. The list, by category, is [1][2][13]:

- **Person**: person.
- **Animals**: bird, cat, cow, dog, horse, sheep.
- **Vehicles**: aeroplane, bicycle, boat, bus, car, motorbike, train.
- **Indoor**: bottle, chair, dining table, potted plant, sofa, tv/monitor.
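
For readers wiring this taxonomy into code, the grouping above can be written as a small lookup table. This is only a sketch: the variable names are hypothetical, the single-token spellings (`diningtable`, `pottedplant`, `tvmonitor`) follow the identifiers commonly seen in VOC annotation files, and the index convention (background at 0, classes in alphabetical order from 1 to 20) is the one usually assumed for segmentation masks. Both should be checked against the development kit in use.

```python
# Hypothetical names; the grouping mirrors the four super-categories above.
VOC_SUPER_CATEGORIES = {
    "person": ["person"],
    "animal": ["bird", "cat", "cow", "dog", "horse", "sheep"],
    "vehicle": ["aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"],
    "indoor": ["bottle", "chair", "diningtable", "pottedplant", "sofa", "tvmonitor"],
}

# Flatten and sort alphabetically to get a stable class order.
VOC_CLASSES = sorted(
    name for group in VOC_SUPER_CATEGORIES.values() for name in group
)

# Assumed index convention: 0 is background, object classes follow from 1 to 20.
CLASS_TO_INDEX = {"background": 0}
CLASS_TO_INDEX.update({name: i + 1 for i, name in enumerate(VOC_CLASSES)})

print(len(VOC_CLASSES))
print(CLASS_TO_INDEX["aeroplane"], CLASS_TO_INDEX["tvmonitor"])
```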

This 20-class set has occasionally been criticized as parochial (no kitchen utensils, no fine-grained animal distinctions, no traffic signs) but it was deliberately chosen to span recognizable everyday objects with enough variety in appearance and context to make detection and segmentation genuinely hard [2]. Classes were curated rather than scraped, and the team avoided easy cases such as hand-staged product photos in favor of categories with substantial intra-class variation. The result is that even with only 20 classes, VOC remained difficult well into the deep-learning era. Detection mAP on VOC2007 climbed from roughly 33% with deformable part models in 2010 to above 80% with fully tuned convolutional detectors by 2017, but few methods have ever truly saturated the benchmark [2][3][4].

## What tasks does PASCAL VOC include?

VOC was structured as a set of parallel competitions rather than a single task. Over the years, five distinct tasks emerged [1][2][12]:

- **Image classification**: For each of the 20 classes, predict whether at least one instance is present in the image. The output is a per-class score and the evaluation is per-class Average Precision over the test set.
- **Object detection**: For each class, return a list of bounding boxes with confidence scores. A detection counts as correct if it has [Intersection over Union](/wiki/intersection_over_union_iou) of at least 0.5 with a ground-truth [bounding box](/wiki/bounding_box) of the same class. Evaluation is per-class AP and mean AP across the 20 classes. This is the headline task most people associate with PASCAL VOC.
- **Semantic segmentation**: Assign each pixel of the image to one of the 20 object classes or to a "background" class. Introduced as a taster in VOC2007 and promoted to a full competition from VOC2008. See [semantic segmentation](/wiki/semantic_segmentation).
- **Action classification**: Given the bounding box of a person in an image, predict which of a small set of actions (jumping, phoning, playing instrument, reading, riding bike, riding horse, running, taking photo, using computer, walking) the person is performing. Introduced as a taster in VOC2010 and promoted to a full competition in VOC2011. VOC2012 added a variant where the person was specified by a single point on the body rather than a bounding box.
- **Person layout**: Predict bounding boxes for the parts of a person (head, hands, feet). Always a taster competition.

PASCAL VOC never included an [instance segmentation](/wiki/instance_segmentation) competition in the modern sense. The segmentation task was scored on per-class pixel labels, so two cats in the same image counted as a single "cat" region. That gap was filled by [MS COCO](/wiki/coco_dataset) and, several years later, by [Mask R-CNN](/wiki/mask_r_cnn) [15][23].

## How were PASCAL VOC images annotated?

Most VOC images were sourced from Flickr, often via creative-commons searches over class-specific keywords. The team deliberately spread image collection across multiple seed annotators and time periods to limit photographer-bias artifacts. Once images were collected, the annotation team applied a strict protocol that became, in practice, the template later copied by other detection benchmarks [1][2].

For object detection, every instance of any of the 20 classes had to be annotated in every image. That "complete annotation" rule meant that when an image was used as a negative example for class X, the system could trust that the image really did not contain class X. Crowdsourced datasets often relax this rule and pay the price in evaluation noise.

Each annotated object received an axis-aligned bounding box drawn tightly around the visible part of the object, plus three flags [2]:

- **truncated**: the object extends beyond the image frame.
- **occluded**: a substantial portion of the object is hidden by other foreground objects.
- **difficult**: the object is hard to recognize even for a human. Difficult examples are excluded from the standard evaluation metric.
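
The box-plus-flags record described above is stored as XML, and reading it takes only a few lines. The sketch below parses a made-up annotation with Python's standard `xml.etree.ElementTree`. The sample values and the function name `parse_voc_objects` are illustrative, and treating a missing flag as 0 is an assumption of this sketch rather than a documented rule.

```python
import xml.etree.ElementTree as ET

# Made-up annotation for illustration; real files carry more fields.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <truncated>1</truncated>
    <occluded>0</occluded>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <truncated>0</truncated>
    <difficult>1</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc_objects(xml_text):
    """Return one dict per <object>, holding its class, box, and flags."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            "bbox": tuple(
                int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")
            ),
            # Assumption: a flag that is absent from the file is treated as 0.
            "truncated": int(obj.findtext("truncated", default="0")),
            "occluded": int(obj.findtext("occluded", default="0")),
            "difficult": int(obj.findtext("difficult", default="0")),
        })
    return objects

objects = parse_voc_objects(SAMPLE_XML)
# Objects flagged "difficult" are the ones left out of the standard metric.
scored = [o for o in objects if not o["difficult"]]
print([o["name"] for o in scored])
```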

For segmentation, every pixel of the foreground objects was painted with a class label, and every pixel of the surrounding scene was labeled "background." Object boundaries were softened with a thin "void" band labeled 255, used to absorb annotation jitter and excluded from evaluation. In VOC2008 and later, the team also produced separate "object" segmentations that distinguished different instances of the same class, even though no instance segmentation competition was run.

The 2015 retrospective paper analyzed inter-annotator agreement and found that human-vs-human bounding box overlap typically had IoU near 0.85 on most classes, which gives a sense of the ceiling on what detection algorithms can be expected to achieve [2].

## How is PASCAL VOC evaluated (mAP at IoU 0.5)?

PASCAL VOC introduced a vocabulary of evaluation metrics that, with minor variations, is still in use [1][2][16]. The basic building blocks are precision and recall, computed by walking down a confidence-sorted list of detections.

For classification and detection, the per-class score is **Average Precision** (AP), the area under the precision-recall curve. The challenge then reports [mean Average Precision](/wiki/mean_average_precision) (mAP) by averaging AP across the 20 classes.

Where VOC's AP differs from later benchmarks is in how the precision-recall curve is interpolated:

- **VOC2007 AP (11-point interpolation)**: The precision-recall curve is sampled at 11 equally spaced recall values (0.0, 0.1, ..., 1.0). At each sample point, the maximum precision achieved at any recall greater than or equal to that point is recorded. AP is the mean of those 11 precision values. This was a coarse approximation chosen partly for simplicity and partly to smooth out small zigzags in the curve.
- **VOC2010 AP (all-points interpolation)**: From VOC2010 onwards, AP was computed as the area under the full precision-recall curve, with precision interpolated to be monotonically non-increasing as recall increases. Concretely, the curve is sampled at every unique recall value where a true positive is added, and AP is the sum of rectangular areas under the resulting step function. This tracks the true area under the PR curve more closely than the 11-point scheme [16].
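
The two interpolation schemes can be compared side by side in a few lines of NumPy. The sketch below is not the development kit's code: the function names are hypothetical, and it assumes the detections for one class have already been sorted by confidence and labeled as true or false positives.

```python
import numpy as np

def precision_recall(tp_flags, num_gt):
    """Recall and precision after each detection in a confidence-sorted list."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    return tp / num_gt, tp / (tp + fp)

def ap_11_point(recall, precision):
    """VOC2007-style AP: mean over t = 0.0, 0.1, ..., 1.0 of the best
    precision achieved at any recall >= t."""
    total = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        total += precision[mask].max() if mask.any() else 0.0
    return total / 11.0

def ap_all_points(recall, precision):
    """VOC2010-style AP: area under the monotone precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Sweep right to left so precision never increases with recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Add one rectangle for every point where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Three detections (hit, miss, hit) against two ground-truth objects.
recall, precision = precision_recall([1, 0, 1], num_gt=2)
print(round(ap_11_point(recall, precision), 4))    # 0.8485
print(round(ap_all_points(recall, precision), 4))  # 0.8333
```

On this toy input the two schemes disagree by about 1.5 points, which is why results under the two protocols should not be compared directly.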

Both AP variants share the same definition of a true positive: a predicted box is correct if its IoU with a ground-truth box of the same class is at least 0.5 and that ground-truth box has not already been matched by a higher-confidence prediction. When several detections land on the same object, only the highest-confidence one counts as a true positive; the rest are scored as false positives.
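
The matching rule above can be sketched as a greedy pass over the detections of one class in one image. The function names are hypothetical. The `+ 1` in the box arithmetic assumes inclusive integer pixel coordinates, a common convention for VOC-style boxes. Handling of "difficult" objects is left out for brevity.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Assumption: coordinates are inclusive pixel indices, hence the +1.
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)

def match_detections(detections, gt_boxes, thresh=0.5):
    """Label detections as true (True) or false (False) positives.

    detections: list of (confidence, box) for one class in one image.
    Returns labels in descending-confidence order.
    """
    matched = [False] * len(gt_boxes)
    labels = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, g) for g in gt_boxes]
        # Pick the ground-truth box with the highest overlap, if any exist.
        best = max(range(len(gt_boxes)), key=lambda j: overlaps[j], default=None)
        if best is not None and overlaps[best] >= thresh and not matched[best]:
            matched[best] = True
            labels.append(True)
        else:
            # Too little overlap, or a duplicate of an already-matched object.
            labels.append(False)
    return labels

gt = [(10, 10, 109, 109)]
dets = [
    (0.9, (10, 10, 109, 109)),    # exact hit
    (0.8, (15, 15, 114, 114)),    # overlaps the same object: duplicate
    (0.7, (200, 200, 250, 250)),  # no overlap at all
]
print(match_detections(dets, gt))  # [True, False, False]
```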

For segmentation, the headline metric is the **mean Intersection-over-Union (mIoU)** computed across the 21 classes (20 foreground classes plus background). For each class, IoU is the number of pixels correctly labeled as that class divided by the union of pixels labeled as that class in either the prediction or the ground truth, summed over the test set. The class-wise IoUs are then averaged to give mIoU. This is essentially the same metric used by virtually every modern semantic segmentation benchmark.
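
A minimal sketch of this computation accumulates one confusion matrix over all images and reads the per-class IoU off its diagonal. The function name is hypothetical. The sketch assumes integer label maps in which 255 marks pixels to ignore, and it skips classes that never appear, which is a convenience for small inputs rather than part of the official protocol.

```python
import numpy as np

def mean_iou(preds, gts, num_classes=21, void_label=255):
    """Return (mIoU, confusion matrix) accumulated over a set of label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred).ravel().astype(np.int64)
        gt = np.asarray(gt).ravel().astype(np.int64)
        keep = gt != void_label  # ignored pixels do not enter the counts
        # Encode each (ground truth, prediction) pair as a single index.
        idx = gt[keep] * num_classes + pred[keep]
        counts = np.bincount(idx, minlength=num_classes ** 2)
        conf += counts.reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    # Union = predicted pixels + ground-truth pixels - correctly labeled pixels.
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    present = union > 0  # assumption: skip classes absent from both maps
    return float((tp[present] / union[present]).mean()), conf

# Tiny 3-class example; the last ground-truth pixel is marked as ignored.
gt = [[0, 0, 1], [1, 2, 255]]
pred = [[0, 1, 1], [1, 2, 2]]
miou, _ = mean_iou([pred], [gt], num_classes=3)
print(round(miou, 4))  # 0.7222
```

Summing counts over the whole set before dividing matches the "summed over the test set" wording above and differs from averaging per-image IoU scores.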

A notable difference between VOC's metric and the metric used by [MS COCO](/wiki/coco_dataset) is that COCO averages AP over 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05, rather than fixing the threshold at 0.5. COCO's metric punishes loose boxes more aggressively and rewards tight localization. "PASCAL-style AP" (single threshold of 0.5) and "COCO-style AP" (averaged over thresholds) are usually reported side by side in modern detection papers, and a model that wins under one metric may lose under the other [15].
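
The effect of the two threshold schemes on a single box is easy to see in isolation. The sketch below is not a full AP computation. It only counts at how many thresholds a detection with a given IoU would be accepted as a match, and the helper name `fraction_accepted` is hypothetical.

```python
# One threshold for PASCAL-style AP, ten for COCO-style AP.
VOC_THRESHOLDS = [0.5]
COCO_THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]  # 0.50, 0.55, ..., 0.95

def fraction_accepted(iou_value, thresholds):
    """Fraction of thresholds at which a box with this IoU counts as a match."""
    return sum(iou_value >= t for t in thresholds) / len(thresholds)

for name, iou_value in (("loose", 0.6), ("tight", 0.9)):
    print(
        name,
        fraction_accepted(iou_value, VOC_THRESHOLDS),
        fraction_accepted(iou_value, COCO_THRESHOLDS),
    )
# loose 1.0 0.3
# tight 1.0 0.9
```

Both boxes count fully under the single 0.5 threshold, but the loose box is accepted at only three of the ten COCO thresholds. This is the sense in which the averaged metric rewards tight localization.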

## How did PASCAL VOC influence object detection research?

It is hard to overstate how central VOC2007 became to object-detection research between roughly 2008 and 2016. A reasonable summary: if a paper claimed a new object detector during that period, it almost certainly reported VOC2007 mAP, and probably VOC2010 or VOC2012 as well [2][3][4][5].

A brief tour of representative milestones, all calibrated against VOC:

- **Deformable Part Models (Felzenszwalb et al., 2010)**: Reached around 33% mAP on VOC2007. The DPM detector dominated PASCAL VOC for several years and was still being cited in 2015 as a strong non-deep baseline.
- **R-CNN (Girshick et al., 2014)**: The first widely cited demonstration of [convolutional features](/wiki/computer_vision) for [object detection](/wiki/object_detection). Reported 53.7% mAP on VOC2010 and 53.3% on VOC2012, more than 30% relative improvement over the prior best. R-CNN combined selective search region proposals with a fine-tuned CNN classifier [3].
- **Fast R-CNN (Girshick, 2015)**: Streamlined R-CNN's pipeline, replacing per-region forward passes with a single shared feature map and ROI pooling. Reported around 68% mAP when trained on VOC2007 trainval plus VOC2012 trainval [4].
- **Faster R-CNN (Ren et al., 2015)**: Introduced the Region Proposal Network, removing the dependency on selective search. Reported around 73% mAP on VOC2007 with VGG-16 features [5].
- **YOLO (Redmon et al., 2016)** and **SSD (Liu et al., 2016)**: Single-shot detectors that ran far faster than two-stage pipelines on VOC2007, with YOLO in particular trading several points of mAP for speed, opening the door to real-time detection [6][17].
- **Fully Convolutional Networks (Long, Shelhamer, Darrell, 2015)**: Established the use of FCNs for semantic segmentation, with PASCAL VOC2012 as the primary benchmark. Reached around 62% mIoU on VOC2012 test, an enormous jump over previous methods. See [FCN](/wiki/fcn) [18].

Beyond raw numbers, VOC influenced the field in subtler ways. The annotation conventions, including the truncated/occluded/difficult flags and the "complete annotation" rule, were copied by COCO and many follow-on datasets. The fact that nearly every detection paper today reports per-class AP plus mAP, with class names listed in a consistent table, is a VOC convention [1][2][15].

The 2015 retrospective paper, published in [IJCV](/wiki/ijcv), pulled together eight years of challenge data and offered a much-cited analysis of where object detectors were succeeding and failing as of the deep-learning transition. The paper showed that even the strongest detectors of 2014 still struggled with small objects, unusual viewpoints, and heavy occlusion, all themes that later motivated the design of [MS COCO](/wiki/coco_dataset) [2].

## What replaced PASCAL VOC, and what were its limitations?

By the mid-2010s, the limitations of PASCAL VOC had become a recurring topic at vision conferences like [CVPR](/wiki/cvpr) and [ICCV](/wiki/iccv). The four most commonly cited shortcomings are [2][15]:

1. **Class count**: 20 classes is small. Real applications such as autonomous driving, retail, or warehouse robotics need to distinguish many more categories.
2. **Image count**: Around 10,000 test images per edition is small by modern standards. The signal-to-noise ratio in mAP scores is limited, and small differences between methods are hard to call statistically significant.
3. **Scene complexity**: VOC images tend to feature one or two prominent objects against a clean background. Real-world scenes are messier, with dozens of overlapping objects, severe occlusion, and a much heavier tail of object scales.
4. **Lack of small objects**: Annotators were told to skip very small or hard-to-recognize objects. This means VOC is a poor benchmark for testing whether a detector handles small objects well.

Three successor benchmarks largely supplanted VOC:

- **[ImageNet](/wiki/imagenet)** (Deng et al., 2009; ILSVRC challenges from 2010 onward): Pushed image classification to 1,000 fine-grained categories with millions of training images. ImageNet did include a detection track (DET) starting in 2013 with 200 categories, but for classification it became the obvious benchmark of choice [19][20].
- **[MS COCO](/wiki/coco_dataset)** (Lin et al., 2014, with continued growth through 2017): 80 object categories, around 330,000 images, 1.5 million object instances, full instance segmentations. COCO's averaged-IoU AP metric is more discriminating than VOC's, and the dataset's emphasis on cluttered everyday scenes made it the natural successor for detection and instance segmentation. By around 2016, most new detection papers led with COCO numbers [15].
- **[Open Images](/wiki/open_images)** (Kuznetsova et al., 2018) and **[LVIS](/wiki/lvis)** (Gupta et al., 2019): Pushed further into the long tail. Open Images V4 had 600 boxable classes and around 9 million training images. LVIS introduced 1,203 categories with a deliberately long-tailed distribution [21][22].

Despite these successors, VOC2007 and VOC2012 remained relevant as compact, well-understood evaluation sets. They are still used as auxiliary benchmarks alongside COCO, particularly in transfer learning, semi-supervised detection, and few-shot detection studies. Many researchers also still report VOC2012 segmentation mIoU when introducing new semantic segmentation models, because the test server is still up [11].

## Legacy and the Everingham Prize

Mark Everingham died in October 2012 after a long illness. His co-organizers wrote in the preface to the VOC2012 challenge: "Mark was the key member of the VOC project, and it would have been impossible without his selfless contributions. We have all benefited tremendously from working with Mark." The 2015 retrospective paper was published with Everingham as posthumous first author [2].

In the wake of his death, the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence (TCPAMI) established the **PAMI Mark Everingham Prize**. The prize is awarded to a researcher or team of researchers "who have made a selfless contribution of significant benefit to other members of the computer vision community." The award explicitly recognizes service-oriented contributions including challenges, datasets, open-source software, textbooks, and educational resources, as opposed to research breakthroughs that are honored by other prizes such as the Marr Prize [7][8].

The prize is presented annually, alternating between [ECCV](/wiki/eccv) in even years and [ICCV](/wiki/iccv) in odd years. Recipients receive USD 3,000 and a plaque [7]. Some illustrative recipients include [7][8]:

- **2013 (ICCV)**: P. Jonathon Phillips, for the FERET and FRVT face recognition datasets and challenges; and Gary Bradski with the OpenCV team, for the OpenCV open-source library.
- **2014 (ECCV)**: Terry Boult and Geoffrey Boult, for long-running service to the computer vision community in conference and workshop management.
- **2015 (ICCV)**: Daniel Scharstein and Richard Szeliski, for the Middlebury stereo, optical flow, and MRF benchmarks.
- Subsequent years honored, among others, the creators of the KITTI autonomous-driving benchmark, the TRECVid video retrieval evaluation, the UCF action recognition datasets, and ImageNet itself.

Beyond the prize, VOC's legacy is structural. Today's detection and segmentation pipelines, from torchvision to Ultralytics's YOLO toolchain, ship with VOC dataloaders by default, and "VOC format" is shorthand for a specific XML annotation layout that countless other datasets now mimic. PASCAL VOC was the project that made object detection a measurable, comparable, and ultimately tractable problem [1][2][11].

## See also

- [Computer vision](/wiki/computer_vision)
- [Object detection](/wiki/object_detection)
- [Image classification](/wiki/image_classification)
- [Semantic segmentation](/wiki/semantic_segmentation)
- [Instance segmentation](/wiki/instance_segmentation)
- [COCO dataset](/wiki/coco_dataset)
- [ImageNet](/wiki/imagenet)
- [Open Images](/wiki/open_images)
- [LVIS](/wiki/lvis)
- [Mean Average Precision](/wiki/mean_average_precision)
- [Intersection over Union](/wiki/intersection_over_union_iou)
- [R-CNN](/wiki/r_cnn)
- [Fast R-CNN](/wiki/fast_r_cnn)
- [Faster R-CNN](/wiki/faster_r_cnn)
- [YOLO](/wiki/yolo)
- [SSD](/wiki/ssd_object_detection)
- [Mask R-CNN](/wiki/mask_r_cnn)
- [FCN](/wiki/fcn)
- [Mark Everingham](/wiki/mark_everingham)
- [Andrew Zisserman](/wiki/andrew_zisserman)
- [Luc Van Gool](/wiki/luc_van_gool)
- [PASCAL Network](/wiki/pascal_network)

## References

1. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). "The Pascal Visual Object Classes (VOC) Challenge." *International Journal of Computer Vision*, 88(2), 303-338. https://link.springer.com/article/10.1007/s11263-009-0275-4
2. Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). "The Pascal Visual Object Classes Challenge: A Retrospective." *International Journal of Computer Vision*, 111(1), 98-136. https://link.springer.com/article/10.1007/s11263-014-0733-5
3. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." *CVPR 2014*. https://arxiv.org/abs/1311.2524
4. Girshick, R. (2015). "Fast R-CNN." *ICCV 2015*. https://arxiv.org/abs/1504.08083
5. Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." *NeurIPS 2015*. https://arxiv.org/abs/1506.01497
6. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). "You Only Look Once: Unified, Real-Time Object Detection." *CVPR 2016*. https://arxiv.org/abs/1506.02640
7. The Computer Vision Foundation. "PAMI Mark Everingham Prize." https://www.thecvf.com/?page_id=529
8. IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence. "PAMI Mark Everingham Prize." https://tc.computer.org/tcpami/awards/pami-mark-everingham-prize/
9. CORDIS, European Commission. "Pattern analysis, statistical modelling and computational Learning (PASCAL)." Project ID 506778, Sixth Framework Programme. https://cordis.europa.eu/project/id/506778
10. CORDIS, European Commission. "Pattern Analysis, Statistical Modelling and Computational Learning 2 (PASCAL2)." Project ID 216886, Seventh Framework Programme. https://cordis.europa.eu/project/id/216886
11. The PASCAL VOC Project (Visual Geometry Group, University of Oxford). "The PASCAL Visual Object Classes Homepage." http://host.robots.ox.ac.uk/pascal/VOC/
12. The PASCAL VOC Project. "The PASCAL Visual Object Classes Challenge 2012 (VOC2012)." http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
13. The PASCAL VOC Project. "The PASCAL Visual Object Classes Challenge 2007 (VOC2007)." http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
14. The PASCAL VOC Project. "The PASCAL Visual Object Classes Challenge 2011 (VOC2011)." http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html
15. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., & Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context." *ECCV 2014*. https://arxiv.org/abs/1405.0312
16. Padilla, R., Netto, S. L., & da Silva, E. A. B. (2020). "A Survey on Performance Metrics for Object-Detection Algorithms." *International Conference on Systems, Signals and Image Processing*. https://github.com/rafaelpadilla/Object-Detection-Metrics
17. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). "SSD: Single Shot MultiBox Detector." *ECCV 2016*. https://arxiv.org/abs/1512.02325
18. Long, J., Shelhamer, E., & Darrell, T. (2015). "Fully Convolutional Networks for Semantic Segmentation." *CVPR 2015*. https://arxiv.org/abs/1411.4038
19. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." *CVPR 2009*. https://ieeexplore.ieee.org/document/5206848
20. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., et al. (2015). "ImageNet Large Scale Visual Recognition Challenge." *International Journal of Computer Vision*, 115(3), 211-252. https://arxiv.org/abs/1409.0575
21. Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., et al. (2020). "The Open Images Dataset V4." *International Journal of Computer Vision*, 128(7), 1956-1981. https://arxiv.org/abs/1811.00982
22. Gupta, A., Dollar, P., & Girshick, R. (2019). "LVIS: A Dataset for Large Vocabulary Instance Segmentation." *CVPR 2019*. https://arxiv.org/abs/1908.03195
23. He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). "Mask R-CNN." *ICCV 2017*. https://arxiv.org/abs/1703.06870

