LVIS (Large Vocabulary Instance Segmentation)
Last reviewed
Apr 30, 2026
Sources
19 citations
Review status
Source-backed
Revision
v1 · 4,000 words
LVIS (pronounced "el-vis") is a large-scale instance segmentation benchmark that targets the long-tailed regime of object detection. It contains roughly 2 million high-quality instance masks across 1,203 entry-level object categories, annotated on the same approximately 164,000 images that make up Microsoft COCO 2017. Categories were chosen to reflect the natural Zipfian frequency distribution of objects in everyday photographs, so most classes have only a handful of training examples while a few common classes have tens of thousands. The dataset was introduced by Agrim Gupta, Piotr Dollar, and Ross Girshick of Facebook AI Research at CVPR 2019 and rapidly became the standard benchmark for long-tailed recognition in detection and segmentation.
LVIS was deliberately designed around two awkward truths about real-world recognition. First, the standard benchmarks at the time, PASCAL VOC and Microsoft COCO, covered only 20 to 80 common categories that were nearly balanced by construction. Second, exhaustively annotating every category on every image costs one check per image-category pair, so the effort grows with the product of image count and vocabulary size. LVIS introduced a federated annotation protocol that annotates each image only for a subset of relevant categories and pairs it with a federated evaluation metric that does not penalize predictions on missing labels. The combination of long-tailed vocabulary, careful annotation, and a fair evaluation protocol turned LVIS into a controlled testbed for class imbalance research.
Until 2019 the dominant detection benchmarks were closed-set and class-balanced by construction. PASCAL VOC 2007 and 2012 covered 20 categories. Microsoft COCO, released by Lin et al. in 2014, expanded that to 80 thing categories with hundreds to thousands of training instances per class. ImageNet detection used 200 categories. None reflected the actual frequency of objects in everyday photographs, where a small number of categories (people, chairs, cars) dominate while a long tail of categories (toothpicks, water heaters, dressers) appear infrequently but matter for real applications.
Gupta, Dollar, and Girshick argued that this gap mattered for science as well as deployment. With thousands of training examples per class, even noisy classifiers learn good decision boundaries; with single-digit training examples per class, the same architectures collapse. The LVIS paper sets the goal as studying "detection in the regime where there are a few thousand or more total categories with a power-law distribution of training instances per category." The dataset was sized to make this study possible and evaluated in a way that did not let methods cheat by ignoring the tail.
A second motivation was the cost of annotation. For 1,200 categories on 164,000 images, exhaustive labeling would require almost 200 million image-category checks. The federated annotation strategy resolved this by annotating each image for only a subset of categories while tracking exactly which categories were exhaustively annotated on each image, so that evaluation could be honest about the missing labels.
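The annotation-cost argument above is simple arithmetic, sketched below. The image and category counts are the figures quoted in the text; the per-image subset size is a hypothetical value chosen only to illustrate the scale of the saving, not a statistic from LVIS.

```python
# Figures quoted in the text.
NUM_CATEGORIES = 1_203
NUM_IMAGES = 164_000
# Illustrative assumption only: NOT an LVIS statistic.
HYPOTHETICAL_CATEGORIES_PER_IMAGE = 20


def image_category_checks(num_images: int, categories_per_image: int) -> int:
    """Number of (image, category) decisions when each image is checked
    for `categories_per_image` categories."""
    return num_images * categories_per_image


# Exhaustive labeling checks every category on every image.
exhaustive = image_category_checks(NUM_IMAGES, NUM_CATEGORIES)
# A federated scheme checks only a subset of categories per image.
federated = image_category_checks(NUM_IMAGES, HYPOTHETICAL_CATEGORIES_PER_IMAGE)

print(f"exhaustive checks: {exhaustive:,}")  # exhaustive checks: 197,292,000
print(f"federated checks:  {federated:,}")   # federated checks:  3,280,000
```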
LVIS reuses the image set from COCO 2017 but rebuilds the label set from scratch. The image distribution is therefore familiar to anyone who has worked with COCO, while the annotation density and category structure are different.
The full LVIS v1.0 release draws on the approximately 164,000 images of COCO 2017, split as roughly 100,000 train, 19,800 validation, 19,800 test-dev, and 19,800 test-challenge. The exact counts in the lvis-api repository are 100,170 training, 19,809 validation, 19,822 test-dev, and 19,822 test-challenge images, which sum to 159,623. Because the images are identical to COCO 2017, models can swap freely between the two label sets and any pretraining infrastructure built for COCO transfers without change.
The LVIS vocabulary contains 1,203 entry-level categories drawn from WordNet synsets. Each category is associated with a synset definition, a name, a list of synonyms, and a position in the WordNet hierarchy. "Entry-level" follows Eleanor Rosch's psycholinguistic notion of basic-level categories, the level at which people most naturally name objects. A photo of a poodle would typically be labeled "dog" rather than "poodle" or "animal." Mapping categories to synsets prevents fragmentation from synonyms (sofa vs couch) and provides a hierarchy that researchers can exploit for hierarchical losses.
The initial v0.5 release in May 2019 contained 1,230 categories after quality filtering. v1.0, released in May 2020, settled on the 1,203-category vocabulary that is still used today. Categories cover a broad range of household objects, food items, animals, vehicles, sports equipment, clothing, and tools.
The key conceptual contribution of LVIS is its three-bin frequency split, which assigns each category to one of three buckets based on how many training images contain at least one instance of it. The thresholds are fixed in the official LVIS release.
| Bucket | Symbol | Definition (training images per category) | Approximate share of categories |
|---|---|---|---|
| Rare | r | 1 to 10 images | About 30 percent |
| Common | c | 11 to 100 images | About 35 percent |
| Frequent | f | More than 100 images | About 35 percent |
Approximately 75 percent of categories appear in 100 or fewer training images in v0.5, and the tail remains a clear majority of categories in v1.0 (roughly 65 percent by the approximate shares in the table above). The rare bucket contains hundreds of categories that are essentially few-shot classes by COCO standards, while the most frequent few have tens of thousands of training instances. This stark distribution is the entire point: it forces methods to handle the head and the tail at the same time.
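The bucket thresholds in the table can be written as a small function. This is a minimal sketch: the function name and the toy image counts are made up for illustration, and only the thresholds (1 to 10, 11 to 100, more than 100 training images) come from the text.

```python
from collections import Counter


def frequency_bucket(train_image_count: int) -> str:
    """Map the number of training images containing a category to its bucket."""
    if train_image_count < 1:
        raise ValueError("a category needs at least one training image")
    if train_image_count <= 10:
        return "r"  # rare
    if train_image_count <= 100:
        return "c"  # common
    return "f"      # frequent


# Toy image counts (hypothetical, not taken from the dataset).
toy_counts = {"person": 50_000, "chair": 9_000, "water_heater": 42, "toothpick": 7}
buckets = {name: frequency_bucket(n) for name, n in toy_counts.items()}
print(buckets)
print(Counter(buckets.values()))
```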
The LVIS paper documents a six-stage crowdsourced annotation pipeline executed on Amazon Mechanical Turk. The pipeline avoids any machine-in-the-loop labeling so that masks are not biased toward what current detectors find easy.
The outputs of this pipeline are two sets per category, the positive set (images with all instances of that category exhaustively annotated) and the negative set (images known to contain no instances of that category). All other (image, category) pairs are simply unknown and must be ignored at evaluation time. This is what "federated" means in LVIS: a single dataset that is the union of many smaller per-category datasets, each of which is internally consistent but only covers a slice of the image set.
Quality control was unusually aggressive. Roughly 18 percent of initially detected categories were dropped during pruning, and about 10 percent of marked instances were dropped for visual consistency. The remaining masks are reported by Gupta et al. to be substantially crisper and more boundary-accurate than the COCO masks they replace. The lvis-api repository emphasizes that LVIS uses higher-quality polygons than COCO, and several follow-up studies (including the COCO-relabeling work by Zhao et al.) cite LVIS as evidence that COCO masks underestimate true object boundaries.
LVIS has shipped two main public releases.
| Release | Date | Images | Categories | Notes |
|---|---|---|---|---|
| LVIS v0.5 | May 2019 | About 82,000 (57k train, 5k val, 20k test) | 1,230 | Initial release with the CVPR 2019 paper. Used in the LVIS Challenge at ICCV 2019. |
| LVIS v1.0 | May 2020 | About 164,000 (100k train, 20k val, 20k test-dev, 20k test-challenge) | 1,203 | Expanded to the full COCO 2017 image set. Standard release used by all subsequent papers and the 2020 and 2021 challenges. |
v0.5 was an interim release while annotation was still in progress. v1.0 made use of the complete COCO 2017 image set and refined the category list, which is why category counts dropped slightly from 1,230 to 1,203. From mid-2020 onward virtually all reported numbers in the literature use v1.0.
LVIS adapts the COCO mask AP protocol but adds two crucial twists: per-frequency reporting and federated handling of missing labels. Understanding the evaluation rules is essential to interpreting LVIS numbers.
The primary metric is mask Average Precision (AP), computed as the mean of mask AP across IoU thresholds 0.5, 0.55, ..., 0.95 (10 thresholds in steps of 0.05), averaged across categories. The intersection-over-union is computed between predicted and ground-truth instance masks. AP50 (IoU at 0.5) and AP75 (IoU at 0.75) are reported as well, but the headline number is the IoU-averaged AP, in line with the COCO convention.
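As a rough illustration of the two ingredients described above, the sketch below computes mask IoU for two binary masks and builds the ten-threshold grid. It uses plain nested lists instead of an array library, and it stops short of the full precision-recall computation, which the official evaluation code handles. All function names are illustrative.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equally sized lists of rows of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return intersection / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values in steps of 0.05).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


def average_over_thresholds(ap_per_threshold):
    """Headline AP: the mean of the AP values computed at each threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


pred = [[1, 1, 1, 0],
        [1, 1, 1, 0]]
gt = [[0, 1, 1, 1],
      [0, 1, 1, 1]]
iou = mask_iou(pred, gt)       # 4 shared pixels out of 8 covered pixels
print(iou)                     # 0.5
print(thresholds_passed(iou))  # [0.5]
```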
A subtle but important detail is that LVIS averages AP across categories rather than across detections or images. This means a method that nails the head classes and ignores the tail will score much worse on LVIS than on COCO, because the tail makes up most of the categories.
The distinctive LVIS metrics are AP_r, AP_c, and AP_f, the mean AP over the rare, common, and frequent buckets respectively. These let researchers see at a glance whether an improvement comes from the head or the tail.
| Metric | Meaning |
|---|---|
| AP | Mean AP across all 1,203 categories at IoU 0.5:0.05:0.95 |
| AP50 | Mean AP at IoU 0.5 (single threshold) |
| AP75 | Mean AP at IoU 0.75 |
| AP_r | Mean AP averaged over rare categories only (1-10 train images) |
| AP_c | Mean AP averaged over common categories only (11-100 train images) |
| AP_f | Mean AP averaged over frequent categories only (>100 train images) |
| AP_box | Bounding-box variant for detection-only methods |
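The per-bucket metrics in the table are plain means over subsets of categories. The sketch below assumes per-category AP values have already been computed; the category names, AP values, and bucket assignments are toy data invented for illustration.

```python
from statistics import mean


def summarize_ap(per_category_ap, bucket_of):
    """Category-averaged AP overall and for the rare/common/frequent buckets."""
    summary = {"AP": mean(per_category_ap.values())}
    for bucket in ("r", "c", "f"):
        values = [ap for cat, ap in per_category_ap.items() if bucket_of[cat] == bucket]
        # A bucket with no categories has no defined mean.
        summary[f"AP_{bucket}"] = mean(values) if values else None
    return summary


# Toy inputs (hypothetical values, not published results).
per_category_ap = {"person": 0.75, "chair": 0.5, "water_heater": 0.25, "toothpick": 0.0}
bucket_of = {"person": "f", "chair": "f", "water_heater": "c", "toothpick": "r"}

print(summarize_ap(per_category_ap, bucket_of))
# {'AP': 0.375, 'AP_r': 0.0, 'AP_c': 0.25, 'AP_f': 0.625}
```

Because every category contributes equally to the overall mean, the two strong head classes in this toy example cannot hide the zero on the rare class.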
In the original paper, a standard Mask R-CNN with a ResNet-50-FPN backbone reaches around 21 percent overall AP on LVIS v0.5 when trained with resampling. Without resampling the same model scores only about 3 percent AP_r; repeat factor sampling raises AP_r to about 13 percent, the first concrete demonstration that simple sampling tricks materially help the tail. These numbers became the de facto baselines for everyone who followed.
Because an image is exhaustively annotated for only a subset of categories, naively penalizing predictions on the missing categories would punish correct detections that happened to land on unannotated objects. The federated AP definition in LVIS only counts an image's predictions for category c if that image is in the positive or negative set for c. Predictions on images outside both sets are simply ignored for that category. The bookkeeping is handled by the lvis-api Python package, which is built on top of the COCO API.
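The filtering rule described above can be sketched in a few lines. This is not the lvis-api implementation; it is a simplified illustration in which the prediction format, the function name, and the toy sets are all assumptions.

```python
def keep_for_federated_eval(predictions, positive_images, negative_images):
    """Keep only predictions whose image is in the positive or negative set
    of the predicted category; all other predictions are ignored."""
    kept = []
    for pred in predictions:
        category, image_id = pred["category"], pred["image_id"]
        in_positive = image_id in positive_images.get(category, set())
        in_negative = image_id in negative_images.get(category, set())
        if in_positive or in_negative:
            kept.append(pred)
    return kept


# Toy federated structure: "dog" is exhaustively annotated on image 1,
# verified absent on image 2, and unknown on image 3.
positive_images = {"dog": {1}}
negative_images = {"dog": {2}}
predictions = [
    {"image_id": 1, "category": "dog", "score": 0.9},  # scored against ground truth
    {"image_id": 2, "category": "dog", "score": 0.8},  # counts as a false positive
    {"image_id": 3, "category": "dog", "score": 0.7},  # ignored: label unknown
]

kept = keep_for_federated_eval(predictions, positive_images, negative_images)
print([p["image_id"] for p in kept])  # [1, 2]
```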
This design choice is what makes large-vocabulary evaluation tractable at all. Without federated AP, building a fair benchmark for 1,200 categories on 164,000 images would require nearly 200 million annotation decisions, which is infeasible. With federated AP, annotation effort scales with the actual number of category instances rather than the cross product.
The LVIS organizers ran a series of competitions to drive method development.
| Competition | Venue and year | Notes |
|---|---|---|
| LVIS Challenge | ICCV 2019 (October) | First challenge, held on v0.5. The winning entry, from which the Equalization Loss work by Tan et al. emerged, used Hybrid Task Cascade with class-balanced sampling and several long-tail tricks. |
| LVIS Challenge | ECCV 2020 | First challenge on v1.0. Many entries used the federated loss and decoupled training schedules. |
| LVIS Challenge | CVPR 2021 | Final official challenge edition, won by methods built on Cascade Mask R-CNN with CBNetV2 backbones and Swin-L. |
After 2021 the formal challenges wound down, but LVIS remained the standard test bed for long-tailed methods. New papers continued to report v1.0 numbers as a default, often alongside COCO numbers for comparison.
LVIS has driven the development of an entire subfield of long-tailed visual recognition. The methods listed below all report results on LVIS and represent the main families of approaches.
| Method | Authors | Venue | Key idea |
|---|---|---|---|
| Repeat Factor Sampling (RFS) | Gupta, Dollar, Girshick | CVPR 2019 | For each category c, define rc = max(1, sqrt(t/fc)) where fc is the fraction of training images containing c and t is a threshold. Repeat each image by the max rc over its categories. Simple and effective baseline. |
| Class-balanced sampling and re-weighting | Various | Pre-2019 baseline | Sample images so each category appears with similar frequency, or weight the loss inversely to category frequency. Helps the tail but can hurt the head. |
| Equalization Loss (EQL) | Tan, Wang, Wang, Liu, Liu, Yan | CVPR 2020 | Ignores the discouraging gradients that abundant negative samples from head categories would otherwise impose on rare categories. Won the LVIS Challenge 2019. |
| Equalization Loss v2 (EQLv2) | Tan, Lu, Wang, Yan, Liu | CVPR 2021 | Per-category gradient guided reweighting that balances positives and negatives independently. Outperforms EQL by about 4 AP overall and 14 to 18 AP on rare categories. |
| Decoupled training (cRT, tau-norm, LWS) | Kang, Xie, Rohrbach, Yan, Gordo, Feng, Kalantidis | ICLR 2020 | First train representations with natural sampling, then re-train or normalize the classifier with class-balanced sampling. Surprisingly effective with no architectural changes. |
| Federated Loss | Zhou, Koltun, Krahenbuhl | arXiv 2021 (CenterNet2 / probabilistic two-stage) | Sample a subset of negative classes per image based on the federated annotation structure. Outperforms EQL on v1.0. |
| Seesaw Loss | Wang, Zhang, Zang, Cao, Pang, Gong, Chen, Liu, Loy, Lin | CVPR 2021 | Dynamic mitigation factor reduces gradients on tail negatives, compensation factor penalizes false positives. State of the art at the time on LVIS without bells and whistles. |
| BAGS / DropLoss / LOCE | Various | 2020-2021 | Group-wise classification, randomized loss dropping, and online category equalization variations. |
| Detic | Zhou, Girdhar, Joulin, Krahenbuhl, Misra | ECCV 2022 | Trains the classifier branch on image-level supervision from ImageNet-21k and uses CLIP text embeddings as the classifier weight. Reaches 41.7 mAP on LVIS rare classes and scales detection to over 20,000 categories. |
A few patterns are visible across these methods. Loss-side fixes such as EQL, EQLv2, Federated Loss, and Seesaw Loss prevent gradients from abundant head categories from drowning out the tail. Sampler-side fixes such as repeat factor sampling change the input distribution. Decoupled training treats the representation and the classifier as separately optimizable. Vision-language methods such as Detic sidestep the long tail by importing supervision from much larger image-level datasets.
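The repeat factor formula from the table, r_c = max(1, sqrt(t / f_c)), can be sketched as follows. The function names, the toy image set, and the threshold value are illustrative assumptions. How fractional repeat factors are turned into whole repetitions is left out, since the text does not specify it.

```python
import math


def category_repeat_factors(image_categories, t):
    """r_c = max(1, sqrt(t / f_c)), where f_c is the fraction of training
    images that contain category c."""
    num_images = len(image_categories)
    images_with = {}
    for categories in image_categories.values():
        for c in set(categories):
            images_with[c] = images_with.get(c, 0) + 1
    return {
        c: max(1.0, math.sqrt(t / (count / num_images)))
        for c, count in images_with.items()
    }


def image_repeat_factors(image_categories, t):
    """Each image is repeated by the largest r_c among its categories."""
    r_c = category_repeat_factors(image_categories, t)
    return {
        image_id: max((r_c[c] for c in categories), default=1.0)
        for image_id, categories in image_categories.items()
    }


# Toy training set: "person" is in every image, "toothpick" in one of four.
toy_images = {
    0: ["person"],
    1: ["person"],
    2: ["person"],
    3: ["person", "toothpick"],
}
# t = 1.0 is chosen only so the toy numbers come out round.
print(category_repeat_factors(toy_images, t=1.0))  # {'person': 1.0, 'toothpick': 2.0}
print(image_repeat_factors(toy_images, t=1.0))     # {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0}
```

Images that contain only head categories keep a factor of 1, so the sampler oversamples the tail without discarding any head data.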
Most modern detectors and segmentation networks report LVIS numbers. The table below lists representative entries, all using v1.0 unless otherwise noted.
| Model | Year | Backbone | Approximate mask AP (LVIS v1.0 unless noted) | Notes |
|---|---|---|---|---|
| Mask R-CNN | 2017 (LVIS baseline 2019) | ResNet-50-FPN | 21 (v0.5, with RFS) | The default baseline reported in the original LVIS paper. |
| Cascade Mask R-CNN | 2018 | ResNet-101-FPN | About 26 to 28 | Cai and Vasconcelos. Multi-stage cascade with rising IoU thresholds. |
| Hybrid Task Cascade (HTC) | 2019 | ResNeXt-101 | About 30 to 33 | Chen et al. CVPR 2019. Won the COCO 2018 challenge and was the basis of strong LVIS entries. |
| CenterMask2 | 2020 | VoVNet | About 30 to 35 | Lee and Park. One-stage anchor-free instance segmentation with backbone variants. |
| CenterNet2 + Federated Loss | 2021 | ResNet-50 | About 32 | Zhou, Koltun, Krahenbuhl. Probabilistic two-stage detection. |
| Mask2Former | 2022 | Swin-L | About 50 (v1.0) | Cheng et al. CVPR 2022. Universal segmentation architecture. |
| ViTDet | 2022 | ViT-H + MAE pretraining | 48.1 mask AP (v1.0) | Li, Mao, Girshick, He. ECCV 2022. 5.0 points higher than the 2021 challenge winner's strong baseline. |
| Mask DINO | 2023 | Swin-L | About 50 to 53 | Li et al. CVPR 2023. Unified DETR-style detection and segmentation. |
| Detic | 2022 | Swin-B + CLIP | 41.7 mAP (rare classes on LVIS) | Trains on ImageNet-21k labels. Designed for open-vocabulary scaling. |
Numbers are approximate because exact configurations (RFS, federated loss, EQL, multi-scale testing) vary across papers, and the 2019 baseline was measured on v0.5 while later entries use v1.0, so the comparison is indicative rather than exact. The lesson is still consistent: between the 2019 baseline at about 21 mask AP and the 2022 ViTDet entry at 48.1 mask AP, reported performance more than doubled in three years, with most of the gain on rare categories coming from long-tail-specific methods rather than backbone scaling alone.
LVIS is one of several detection and segmentation datasets that target different combinations of vocabulary size, image count, and annotation density.
| Dataset | Year | Categories | Images | Annotation type | Notes |
|---|---|---|---|---|---|
| PASCAL VOC | 2005-2012 | 20 | About 11k (VOC 2012) | Bounding boxes; segmentation masks for a subset of images | Original detection benchmark; long obsolete for instance segmentation. |
| Microsoft COCO | 2014 | 80 thing categories | About 328k (118k train, 5k val 2017) | Polygonal instance masks; about 2.5M instances | Roughly balanced; standard benchmark for general detection and segmentation. |
| LVIS v1.0 | 2020 | 1,203 entry-level categories | About 164k (same images as COCO 2017) | Federated polygonal masks; about 2M instances | Long-tailed by construction; high-quality masks. |
| Objects365 | 2019 | 365 categories | About 600k images, 10M+ bounding boxes | Bounding boxes only | Shao et al. ICCV 2019. Used heavily for detection pretraining. |
| Open Images V5/V6 | 2018-2020 | 600 boxable classes (350 with masks) | About 9M images | Bounding boxes plus masks on a 944k subset | Largest publicly available detection dataset by image count. |
| V3Det | 2023 | 13,204 categories | About 243k images | Bounding boxes plus rich descriptions | Wang et al. ICCV 2023. Ten times larger vocabulary than LVIS; targets vast vocabulary detection. |
| ADE20K | 2017 | 150 evaluation classes | About 25k images (about 20k for training) | Per-pixel semantic and instance labels | Used mostly for semantic and panoptic segmentation. |
| Cityscapes | 2016 | 19 classes (8 things) | 5,000 fine + 20,000 coarse | Per-pixel labels in urban driving | Specialized to autonomous driving scenes. |
LVIS is unique in this list for its combination of high-quality instance masks, large but tractable vocabulary, federated evaluation protocol, and inheritance of the COCO image set. Open Images has far more images and V3Det a far larger vocabulary, but both rely mostly on box-level annotations, and V3Det is too new to have the same depth of methods built around it. Objects365 is a useful pretraining corpus but has only 365 classes, annotated with boxes only. COCO remains the de facto general benchmark, and LVIS is the de facto long-tailed benchmark.
The LVIS ecosystem is built around a set of well-maintained open-source tools.
| Resource | Description |
|---|---|
| lvisdataset.org | Official project website with paper, downloads, and challenge information. |
| github.com/lvis-dataset/lvis-api | Python API for loading, manipulating, and evaluating LVIS data. Pip-installable as pip install lvis. Built on top of the COCO API. |
| Detectron2 LVIS configs | Official Mask R-CNN, Cascade Mask R-CNN, and ViTDet configurations for LVIS in the Detectron2 model zoo. Includes pretrained weights. |
| MMDetection LVIS configs | OpenMMLab's MMDetection framework provides LVIS configs for Mask R-CNN, HTC, Seesaw Loss, EQL, EQLv2, and many more. |
| Annotation downloads | LVIS v1.0 train, val, and test annotation JSON files (about 1 GB compressed). Combined with COCO 2017 images (about 19 GB), the full working set is around 20 GB. |
| Pretrained checkpoints | Detectron2 and MMDetection both publish pretrained models with public mask AP and per-frequency breakdowns. |
Using LVIS in practice is straightforward for anyone familiar with COCO. The image directory layout matches COCO 2017, and the annotation JSON format extends COCO's format with extra fields that encode the federated structure, namely which categories are known to be absent from an image and which are not exhaustively annotated on it.
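To make the federated structure concrete, the sketch below parses a tiny COCO-style JSON document and derives positive and negative image sets per category. The field names `neg_category_ids`, `not_exhaustive_category_ids`, and `frequency` are assumptions modelled on the LVIS v1.0 annotation files and should be checked against the actual download; the sample content and helper names are invented. Real projects would normally go through the lvis-api package rather than parse the file by hand.

```python
import json
from collections import defaultdict

# Tiny stand-in for an annotation file (invented content, assumed field names).
SAMPLE_JSON = """
{
  "categories": [
    {"id": 1, "name": "dog", "frequency": "f"},
    {"id": 2, "name": "toothpick", "frequency": "r"}
  ],
  "images": [
    {"id": 10, "neg_category_ids": [2], "not_exhaustive_category_ids": []},
    {"id": 11, "neg_category_ids": [1], "not_exhaustive_category_ids": []}
  ],
  "annotations": [
    {"id": 100, "image_id": 10, "category_id": 1}
  ]
}
"""


def federated_sets(dataset):
    """Build per-category positive and negative image sets."""
    positive, negative = defaultdict(set), defaultdict(set)
    for ann in dataset["annotations"]:
        positive[ann["category_id"]].add(ann["image_id"])
    for image in dataset["images"]:
        for category_id in image.get("neg_category_ids", []):
            negative[category_id].add(image["id"])
    return dict(positive), dict(negative)


def label_status(image_id, category_id, positive, negative):
    """Return 'positive', 'negative', or 'unknown' for an (image, category) pair."""
    if image_id in positive.get(category_id, set()):
        return "positive"
    if image_id in negative.get(category_id, set()):
        return "negative"
    return "unknown"


dataset = json.loads(SAMPLE_JSON)
positive, negative = federated_sets(dataset)
print(label_status(10, 1, positive, negative))  # positive
print(label_status(11, 1, positive, negative))  # negative
print(label_status(11, 2, positive, negative))  # unknown
```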
LVIS is significant in several ways. It established the standard long-tailed instance segmentation benchmark and made class imbalance a central research topic in detection. It introduced federated annotation and federated evaluation, which other large-vocabulary benchmarks (including V3Det and several open-vocabulary benchmarks) have since adopted. It clarified that AP averaged uniformly over categories is a much harsher metric than the per-image AP that many older benchmarks used implicitly, and it forced the community to confront the gap between head and tail performance.
LVIS also drove the maturation of an entire methodological subfield. Repeat factor sampling, equalization losses, decoupled training, federated loss, Seesaw Loss, and image-level supervision via Detic all trace their lineage to LVIS evaluation results. Many of these techniques transfer cleanly to other long-tailed problems, including long-tailed image classification, long-tailed semantic segmentation, and long-tailed scene graph generation. Subsequent vast-vocabulary datasets such as V3Det acknowledge that simply increasing the number of classes does not automatically test long-tail handling unless the distribution is also long-tailed and the evaluation respects per-category averaging.