PASCAL VOC
Last reviewed
Apr 28, 2026
Sources
23 citations
Review status
Source-backed
Revision
v1 · 3,918 words
PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes) is a long-running benchmark dataset and annual challenge for object recognition, detection, segmentation, action classification, and person layout in computer vision. The challenge ran every year from 2005 to 2012 under the umbrella of the PASCAL Network of Excellence, an EU-funded research consortium, and produced a series of standardized image datasets, ground-truth annotations, evaluation protocols, and reference software that shaped how the field measured progress for more than a decade [1][2].
Though later overtaken in size by ImageNet and in scope by MS COCO, PASCAL VOC's influence is hard to overstate. The 20-class catalogue introduced for VOC2007 became the canonical "set of objects" used in countless object-detection papers, including R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD [3][4][5][6]. The challenge's notion of mean Average Precision computed at an Intersection over Union threshold of 0.5 is still routinely reported as "PASCAL-style AP," and the bounding-box and segmentation annotation conventions seeded the data formats used by most modern detection benchmarks [1][2].
Following the death of organizer Mark Everingham in 2012, the PAMI Mark Everingham Prize was instituted to honor researchers who make selfless contributions to the computer vision community through datasets, software, challenges, and other forms of service rather than through individual research breakthroughs [7][8].
PASCAL stood for Pattern Analysis, Statistical Modelling and Computational Learning. It was a Network of Excellence funded by the European Union under the Sixth Framework Programme for research and technological development. The official start date was 1 December 2003, and the network was coordinated by John Shawe-Taylor, then at the University of Southampton, with extensive participation from the University of Edinburgh, ETH Zurich, KU Leuven, the University of Oxford, the University of Leeds, and Microsoft Research Cambridge among many other institutions across Europe [9].
The network's stated aim was to build a Europe-wide distributed institute pioneering principled methods of pattern analysis, statistical modelling, and computational learning. PASCAL ran until roughly 2008, after which a follow-on network called PASCAL2 continued under the Seventh Framework Programme. PASCAL2 ran from 2008 to about 2013 and refocused on adaptive cognitive systems and robotics [10].
The Visual Object Classes Challenge was one of several community competitions launched within PASCAL. From the network's perspective the challenge was a relatively small project, but it grew into the most visible scientific output associated with the PASCAL brand and is, today, the artifact most people remember when they hear "PASCAL" in a vision context [1][2].
The core team behind VOC (Mark Everingham, Luc Van Gool, Christopher Williams, John Winn, and Andrew Zisserman) remained largely stable across the eight years the challenge ran [1][2].
Additional collaborators included S. M. Ali Eslami, Yusuf Aytar, and Alexander Sorokin, who joined for individual editions to handle annotation, segmentation tooling, and crowdsourcing. The 2015 retrospective paper was authored by Everingham (posthumously), Eslami, Van Gool, Williams, Winn, and Zisserman [2].
The challenge grew from a small four-class pilot in 2005 to a 20-class benchmark with multiple parallel competitions by 2012. Each year's edition was released as a development kit including images, XML annotations, evaluation code, and a results submission server. The numbers below are taken from the official VOC challenge pages [1][11][12][13][14]:
| Edition | Classes | Total images | Annotated objects | Notes |
|---|---|---|---|---|
| VOC2005 | 4 | 1,578 | 2,209 | Pilot; classes were motorbikes, bicycles, people, cars. Used existing image collections. Tasks: classification and detection. |
| VOC2006 | 10 | 5,304 | 9,507 | Added bus, cat, cow, dog, horse, sheep. First train/val/test split designed from scratch. |
| VOC2007 | 20 | 9,963 | 24,640 | Established the canonical 20-class taxonomy. Introduced segmentation taster and person layout taster. Test annotations were released after the challenge, which is why VOC2007 became the most-used VOC benchmark in the literature. |
| VOC2008 | 20 | 4,340 (trainval) | 10,363 | Segmentation became a full competition. Test annotations not released. |
| VOC2009 | 20 | 7,054 (trainval) | 17,218 | Cumulative design: dataset built by augmenting VOC2008 with newly labeled images; trainval was reused for VOC2010 and beyond. 3,211 segmentations. |
| VOC2010 | 20 | 10,103 (trainval) | 23,374 | 4,203 segmentations. Action classification taster introduced. |
| VOC2011 | 20 | 11,530 (trainval) | 27,450 | 5,034 segmentations. Action classification became a competition. |
| VOC2012 | 20 | 11,540 (trainval) | 27,450 | 6,929 segmentations. Final edition. Person layout taster competition retained. |
The "cumulative" design is worth a note: from 2009 onward, each year's training and validation set was a superset of the previous year's, which let researchers train on the union of all PASCAL data without worrying about overlapping splits. Counting test images as well, the total image pool is roughly 21,738 images for VOC2010 and around 31,000 by VOC2012 once all unreleased test images are included [1][2].
A persistent quirk is that VOC2007 was the last edition for which test set annotations were published. From 2008 onwards, ground truth for the test partition was kept private and evaluation was performed via an online submission server. The most easily reproduced numbers in the literature were therefore those on VOC2007's test split, where labels had been released, which locked in VOC2007 as the de facto sandbox for years of detection research [2].
No formal VOC challenge was run after 2012, but the Oxford VGG website continued to host the VOC2012 evaluation server, and as of the mid-2020s it was still receiving submissions [11].
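From VOC2007 onwards, the challenge fixed a 20-class taxonomy organized into four super-categories [1][2]: person (person); animal (bird, cat, cow, dog, horse, sheep); vehicle (aeroplane, bicycle, boat, bus, car, motorbike, train); and indoor (bottle, chair, dining table, potted plant, sofa, tv/monitor).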
This 20-class set has occasionally been criticized as parochial (no kitchen utensils, no fine-grained animal distinctions, no traffic signs) but it was deliberately chosen to span recognizable everyday objects with enough variety in appearance and context to make detection and segmentation genuinely hard [2]. Classes were curated rather than scraped, and the team avoided easy cases such as hand-staged product photos in favor of categories with substantial intra-class variation. The result is that even with only 20 classes, VOC remained difficult well into the deep-learning era. Detection mAP on VOC2007 climbed from roughly 33% with deformable part models in 2010 to above 80% with fully tuned convolutional detectors by 2017, but few methods have ever truly saturated the benchmark [2][3][4].
VOC was structured as a set of parallel competitions rather than a single task. Over the years, five distinct tasks emerged: classification, detection, segmentation, action classification, and person layout [1][2][12].
PASCAL VOC never included an instance segmentation competition in the modern sense. The evaluated pixel labels were per-class, so two cats in the same image were scored as a single "cat" region. That gap was filled several years later by MS COCO as a benchmark and by models such as Mask R-CNN [15].
Most VOC images were sourced from Flickr, often via creative-commons searches over class-specific keywords. The team deliberately spread image collection across multiple seed annotators and time periods to limit photographer-bias artifacts. Once images were collected, the annotation team applied a strict protocol that became, in practice, the template later copied by other detection benchmarks [1][2].
For object detection, every instance of any of the 20 classes had to be annotated in every image. That "complete annotation" rule meant that when an image was used as a negative example for class X, the system could trust that the image really did not contain class X. Crowdsourced datasets often relax this rule and pay the price in evaluation noise.
Each annotated object received an axis-aligned bounding box drawn tightly around the visible part of the object, plus three flags: truncated, occluded, and difficult [2].
For segmentation, every pixel of the foreground objects was painted with a class label, and every pixel of the surrounding scene was labeled "background." Object boundaries were softened with a thin "void" band labeled 255, used to absorb annotation jitter and excluded from evaluation. In VOC2008 and later, the team also produced separate "object" segmentations that distinguished different instances of the same class, even though no instance segmentation competition was run.
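The per-object record described above (a class name, a tight box, and the truncated, occluded, and difficult flags) is stored in one XML file per image. The following is a minimal sketch of reading such a file with Python's standard library. The sample document is hand-written for illustration, not taken from the dataset, and treating a missing flag as 0 is an assumption of this sketch rather than a rule from the development kit.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation written by hand in the VOC XML layout.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <truncated>0</truncated>
    <occluded>1</occluded>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>375</ymax></bndbox>
  </object>
</annotation>
"""


def _flag(obj, tag):
    """Read a 0/1 flag; a missing element is treated as 0 (assumption)."""
    text = obj.findtext(tag)
    return bool(int(text)) if text is not None else False


def parse_voc_annotation(xml_text):
    """Return the filename and a list of per-object dicts from one annotation."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # int(float(...)) tolerates coordinates written with a decimal point.
        bbox = tuple(int(float(box.findtext(k)))
                     for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append({
            "name": obj.findtext("name"),
            "bbox": bbox,
            "truncated": _flag(obj, "truncated"),
            "occluded": _flag(obj, "occluded"),
            "difficult": _flag(obj, "difficult"),
        })
    return {"filename": root.findtext("filename"), "objects": objects}


annotation = parse_voc_annotation(SAMPLE_XML)
for obj in annotation["objects"]:
    print(obj["name"], obj["bbox"], obj["difficult"])
```

Defaulting absent flags keeps the reader usable on files that omit one of the three elements, as the second object in the sample does.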
The 2015 retrospective paper analyzed inter-annotator agreement and found that human-vs-human bounding box overlap typically had IoU near 0.85 on most classes, which gives a sense of the ceiling on what detection algorithms can be expected to achieve [2].
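To give a feel for what an IoU in that range means, here is a small sketch of the overlap computation for two axis-aligned boxes. It assumes inclusive pixel coordinates, hence the `+ 1` in widths and heights; the two boxes are made-up values standing in for two annotators' drawings of the same object.

```python
def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes with inclusive pixel coordinates."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    inter_h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


# Two hypothetical 100x100 boxes, the second shifted by 5 pixels in x and y.
annotator_1 = (10, 10, 109, 109)
annotator_2 = (15, 15, 114, 114)
print(round(box_iou(annotator_1, annotator_2), 3))  # 0.822
```

A shift of only 5 pixels on a 100-pixel box already drops IoU to about 0.82, which suggests how tight the human agreement level reported above is.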
PASCAL VOC introduced a vocabulary of evaluation metrics that, with minor variations, is still in use [1][2][16]. The basic building blocks are precision and recall, computed by walking down a confidence-sorted list of detections.
For classification and detection, the per-class score is Average Precision (AP), the area under the precision-recall curve. The challenge then reports mean Average Precision (mAP) by averaging AP across the 20 classes.
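Where VOC's AP differs from later benchmarks is in how the precision-recall curve is interpolated. Editions before VOC2010 used 11-point interpolated AP, which averages the maximum precision at recall levels 0, 0.1, ..., 1.0. From VOC2010 onwards, AP was computed as the area under the monotonically decreasing precision envelope, using every distinct recall value.
Both AP variants share the same definition of a true positive: a predicted box is considered correct if its IoU with a ground-truth box of the same class is at least 0.5, and that ground-truth box has not already been matched by a higher-confidence prediction of the same class. When several detections cover the same object, the highest-scoring one counts as a true positive and the rest count as false positives.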
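The matching rule and the two AP variants can be sketched as follows. This is an illustrative simplification, not the development kit's evaluation code: it handles one class in one image, ignores the "difficult" flag, and uses made-up boxes and scores. The 11-point variant averages the best precision at recall levels 0, 0.1, ..., 1.0, while the all-point variant integrates the monotonically decreasing precision envelope.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes, inclusive pixel coordinates."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def match_detections(detections, gt_boxes, iou_thresh=0.5):
    """Mark each (score, box) detection as TP or FP, highest score first."""
    matched = [False] * len(gt_boxes)
    flags = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        ious = [box_iou(box, g) for g in gt_boxes]
        best = int(np.argmax(ious)) if ious else -1
        is_tp = best >= 0 and ious[best] >= iou_thresh and not matched[best]
        if is_tp:
            matched[best] = True  # later detections of this object become FPs
        flags.append(is_tp)
    return np.array(flags, dtype=bool)


def average_precision(tp_flags, num_gt, eleven_point=False):
    """AP from score-sorted TP/FP flags, in either interpolation style."""
    flags = np.asarray(tp_flags, dtype=bool)
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / num_gt
    precision = tp / np.maximum(tp + fp, 1)
    if eleven_point:
        ap = 0.0
        for t in np.linspace(0.0, 1.0, 11):
            reached = recall >= t
            ap += (precision[reached].max() if reached.any() else 0.0) / 11
        return float(ap)
    # All-point: pad the curve, take the precision envelope, sum rectangle areas.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))


# Hypothetical example: two objects, three detections (one is a duplicate).
gt_boxes = [(10, 10, 109, 109), (200, 200, 299, 299)]
detections = [
    (0.9, (12, 12, 111, 111)),     # matches the first object
    (0.8, (20, 20, 119, 119)),     # duplicate of the first object
    (0.7, (205, 195, 304, 294)),   # matches the second object
]
flags = match_detections(detections, gt_boxes)
ap_all = average_precision(flags, len(gt_boxes))
ap_11 = average_precision(flags, len(gt_boxes), eleven_point=True)
print(flags.tolist(), round(ap_all, 4), round(ap_11, 4))
```

On this toy input the duplicate detection is scored as a false positive, and the two interpolation styles give slightly different AP values for the same detections, which is why papers usually state which variant they report.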
For segmentation, the headline metric is the mean Intersection-over-Union (mIoU) computed across the 21 classes (20 foreground classes plus background). For each class, IoU is the number of pixels correctly labeled as that class divided by the union of pixels labeled as that class in either the prediction or the ground truth, summed over the test set. The class-wise IoUs are then averaged to give mIoU. This is essentially the same metric used by virtually every modern semantic segmentation benchmark.
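The computation above can be sketched with a confusion matrix accumulated over images. The snippet below is a minimal illustration on a made-up 3-class label map rather than the full 21-class setting. It assumes void pixels carry label 255 in the ground truth and are dropped before counting, and it skips classes that never appear, which is a convenience of this sketch for tiny inputs.

```python
import numpy as np

VOID_LABEL = 255  # boundary band excluded from evaluation


def confusion_matrix(pred, gt, num_classes, void_label=VOID_LABEL):
    """Rows are ground-truth classes, columns are predicted classes."""
    keep = gt != void_label
    index = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean of per-class IoU = TP / (predicted + ground truth - TP)."""
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0  # assumption: ignore classes absent from both maps
    return float((tp[present] / union[present]).mean())


# Hypothetical 3x4 label maps; class 0 plays the role of background.
gt = np.array([[0, 0, 1, 1],
               [0, 255, 1, 1],
               [2, 2, 2, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 0],
                 [2, 2, 0, 0]])

# For a real test set, sum the per-image matrices before calling mean_iou.
conf = confusion_matrix(pred, gt, num_classes=3)
miou = mean_iou(conf)
print(conf.tolist(), round(miou, 4))
```

Accumulating counts across images before dividing matches the "summed over the test set" wording, and differs from averaging per-image IoU values.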
A notable difference between VOC's metric and the metric used by MS COCO is that COCO averages AP over 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05, rather than fixing the threshold at 0.5. COCO's metric punishes loose boxes more aggressively and rewards tight localization. "PASCAL-style AP" (single threshold of 0.5) and "COCO-style AP" (averaged over thresholds) are usually reported side by side in modern detection papers, and a model that wins under one metric may lose under the other [15].
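The effect of the threshold sweep can be illustrated without a full evaluator. The sketch below only counts at how many of the ten COCO thresholds a single matched box would still qualify as correct; it is not an AP implementation, and the two IoU values are made up.

```python
PASCAL_THRESHOLD = 0.5
# 0.50, 0.55, ..., 0.95 (rounded to avoid floating-point drift)
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Number of COCO IoU thresholds at which a match with this IoU counts."""
    return sum(iou >= t for t in COCO_THRESHOLDS)


loose_box_iou = 0.62   # hypothetical loosely localized detection
tight_box_iou = 0.92   # hypothetical tightly localized detection

for iou in (loose_box_iou, tight_box_iou):
    print(iou, iou >= PASCAL_THRESHOLD, thresholds_passed(iou))
```

Both boxes are true positives under the single 0.5 threshold, but the loose one qualifies at only 3 of the 10 COCO thresholds against 9 for the tight one, which is the sense in which the COCO metric rewards tight localization.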
VOC2007 was central to object-detection research between roughly 2008 and 2016. A reasonable summary: if a paper claimed a new object detector during that period, it almost certainly reported VOC2007 mAP, and probably VOC2010 or VOC2012 as well [2][3][4][5].
A brief tour of representative milestones, all calibrated against VOC:
Beyond raw numbers, VOC influenced the field in subtler ways. The annotation conventions, including the truncated/occluded/difficult flags and the "complete annotation" rule, were copied by COCO and many follow-on datasets. The fact that nearly every detection paper today reports per-class AP plus mAP, with class names listed in a consistent table, is a VOC convention [1][2][15].
The 2015 retrospective paper, published in IJCV, pulled together eight years of challenge data and offered a much-cited analysis of where object detectors were succeeding and failing as of the deep-learning transition. The paper showed that even the strongest detectors of 2014 still struggled with small objects, unusual viewpoints, and heavy occlusion, all themes that later motivated the design of MS COCO [2].
By the mid-2010s, the limitations of PASCAL VOC had become a recurring topic at vision conferences like CVPR and ICCV. The four most commonly cited shortcomings are [2][15]:
Three successor benchmarks largely supplanted VOC:
Despite these successors, VOC2007 and VOC2012 remained relevant as compact, well-understood evaluation sets. They are still used as auxiliary benchmarks alongside COCO, particularly in transfer learning, semi-supervised detection, and few-shot detection studies. Many researchers also still report VOC2012 segmentation mIoU when introducing new semantic segmentation models, because the test server is still up [11].
Mark Everingham died in 2012 after a long illness. His co-organizers wrote in the preface to the VOC2012 challenge: "Mark was the key member of the VOC project, and it would have been impossible without his selfless contributions. We have all benefited tremendously from working with Mark." The 2015 retrospective paper was published with Everingham as posthumous first author [2].
In the wake of his death, the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence (TCPAMI) established the PAMI Mark Everingham Prize. The prize is awarded to a researcher or team of researchers "who have made a selfless contribution of significant benefit to other members of the computer vision community." The award explicitly recognizes service-oriented contributions including challenges, datasets, open-source software, textbooks, and educational resources, as opposed to research breakthroughs that are honored by other prizes such as the Marr Prize [7][8].
The prize is presented annually, alternating between ECCV in even years and ICCV in odd years. Recipients receive USD 3,000 and a plaque [7]. Some illustrative recipients include [7][8]:
Beyond the prize, VOC's legacy is structural. Today's detection and segmentation pipelines, from torchvision to Ultralytics's YOLO toolchain, ship with VOC dataloaders by default, and "VOC format" is shorthand for a specific XML annotation layout that countless other datasets now mimic. PASCAL VOC was the project that made object detection a measurable, comparable, and ultimately tractable problem [1][2][11].