LVIS (Large Vocabulary Instance Segmentation)

Updated Jun 25, 2026

LVIS (Large Vocabulary Instance Segmentation, pronounced "el-vis") is a large-scale instance segmentation benchmark for computer vision that targets the long-tailed regime of object detection. It contains roughly 2 million high-quality instance masks across 1,203 entry-level object categories (1,230 in the earlier v0.5 release), annotated on the same approximately 164,000 images that make up Microsoft COCO 2017 ^[1]^[2]. LVIS was introduced by Agrim Gupta, Piotr Dollar, and Ross Girshick of Facebook AI Research at CVPR 2019, and it rapidly became the standard benchmark for long-tailed recognition in detection and segmentation ^[1].

Categories were chosen to reflect the natural Zipfian frequency distribution of objects in everyday photographs, so most classes have only a handful of training examples while a few common classes have tens of thousands. As the paper puts it, "Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples" ^[1]. The dataset deliberately addresses two awkward truths about real-world recognition. First, the standard benchmarks at the time, PASCAL VOC and Microsoft COCO, covered only 20 to 80 common categories that were nearly balanced by construction ^[1]^[4]. Second, annotating every category on every image becomes prohibitively expensive as the vocabulary grows, because the number of (image, category) checks is the product of vocabulary size and image count. LVIS introduced a federated annotation protocol that annotates each image only for a subset of relevant categories and pairs it with a federated evaluation metric that does not penalize predictions on missing labels ^[1]. The combination of long-tailed vocabulary, careful annotation, and a fair evaluation protocol turned LVIS into a controlled testbed for class imbalance research.

What is LVIS used for?

LVIS is the de facto benchmark for measuring how well detectors and segmenters handle a long-tailed category distribution, where a few classes are common and most are rare. It supports three tasks on the same annotations: instance segmentation, object detection, and semantic segmentation. Because the images are identical to COCO 2017, LVIS is also widely used to stress-test methods that already report COCO numbers, and to study few-shot and open-vocabulary detection where the rare classes have only single-digit training examples.

Why was LVIS created?

Until 2019 the dominant detection benchmarks were closed-set and class-balanced by construction. PASCAL VOC 2007 and 2012 covered 20 categories. Microsoft COCO, released by Lin et al. in 2014, expanded that to 80 thing categories with hundreds to thousands of training instances per class ^[4]. ImageNet detection used 200 categories. None reflected the actual frequency of objects in everyday photographs, where a small number of categories (people, chairs, cars) dominate while a long tail of categories (toothpicks, water heaters, dressers) appear infrequently but matter for real applications.

Gupta, Dollar, and Girshick argued that this gap mattered for science as well as deployment. With thousands of training examples per class, even noisy classifiers learn good decision boundaries; with single-digit training examples per class, the same architectures collapse. The paper's stated goal is to build "a new dataset for Large Vocabulary Instance Segmentation" in which the authors "plan to collect ~2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images" ^[1]. The dataset was sized to make this study possible and evaluated in a way that did not let methods cheat by ignoring the tail.

A second motivation was the cost of annotation. For 1,200 categories on 164,000 images, exhaustive labeling would require almost 200 million image-category checks. The federated annotation strategy resolved this by annotating each image for only a subset of categories while tracking exactly which categories were exhaustively annotated on each image, so that evaluation could be honest about the missing labels ^[1].
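
The "almost 200 million" figure follows directly from the two numbers quoted above. A minimal worked check, using only the rounded counts from the text (the variable names are illustrative):

```python
# Rounded figures quoted in the text: about 1,200 categories and 164,000 images.
NUM_CATEGORIES = 1_200
NUM_IMAGES = 164_000

# Exhaustive labeling needs one present/absent decision per (image, category) pair.
exhaustive_checks = NUM_CATEGORIES * NUM_IMAGES

print(f"{exhaustive_checks:,} image-category checks")  # 196,800,000 image-category checks
```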

How was LVIS constructed?

LVIS reuses the image set from COCO 2017 but rebuilds the label set from scratch ^[1]^[4]. The image distribution is therefore familiar to anyone who has worked with COCO, while the annotation density and category structure are different.

Images

The full LVIS v1.0 release is described as covering approximately 164,000 images, split as roughly 100,000 train, 19,800 validation, 19,800 test-dev, and 19,800 test-challenge ^[2]^[3]. The exact counts in the lvis-api repository are 100,170 training, 19,809 validation, 19,822 test-dev, and 19,822 test-challenge images, which sum to 159,623 ^[3]. The training split alone holds about 1.2 million annotations and the validation split about 244,000 ^[2]. Because the images are identical to COCO 2017, models can swap freely between the two label sets and any pretraining infrastructure built for COCO transfers without change.

What is the long-tail problem in LVIS?

The key conceptual contribution of LVIS is its three-bin frequency split, which assigns each category to one of three buckets based on how many training images contain at least one instance of it ^[1]. The thresholds are fixed in the official LVIS release.

Bucket	Symbol	Definition (training images per category)	Approximate share of categories
Rare	r	1 to 10 images	About 30 percent
Common	c	11 to 100 images	About 35 percent
Frequent	f	More than 100 images	About 35 percent

Approximately 75 percent of categories appear in 100 or fewer training images in v0.5, and the situation is similar in v1.0 ^[1]. The rare bucket contains hundreds of categories that are essentially few-shot classes by COCO standards; the most frequent few have tens of thousands of training instances. This stark distribution is the entire point: it forces methods to handle the head and the tail at the same time.
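
The bucket rule in the table above is simple enough to state as code. The following is a minimal sketch of that rule only; the function name and the example counts are hypothetical, not taken from the LVIS release:

```python
def frequency_bucket(train_image_count: int) -> str:
    """Map a category's training-image count to 'r', 'c', or 'f'."""
    if train_image_count < 1:
        raise ValueError("a category needs at least one training image")
    if train_image_count <= 10:
        return "r"   # rare: 1 to 10 images
    if train_image_count <= 100:
        return "c"   # common: 11 to 100 images
    return "f"       # frequent: more than 100 images


# Hypothetical image counts, chosen only to exercise each bucket.
example_counts = {"birdhouse": 4, "stapler": 37, "chair": 12_000}
buckets = {name: frequency_bucket(n) for name, n in example_counts.items()}
print(buckets)  # {'birdhouse': 'r', 'stapler': 'c', 'chair': 'f'}
```

Note that the rule counts training images containing the category, not instances, so a category with many instances packed into a few images can still land in the rare bucket.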

Annotation pipeline

The LVIS paper documents a six-stage crowdsourced annotation pipeline executed on Amazon Mechanical Turk ^[1]. The pipeline avoids any machine-in-the-loop labeling so that masks are not biased toward what current detectors find easy.

1. Object spotting. Annotators view an image and mark instances of objects that they can name. The process is iterative: each annotator extends the list with at least one new category that earlier annotators missed, until two independent annotators converge on the same set.
2. Exhaustive instance marking. For each spotted category on each image, five annotators mark every additional instance of that category in the image so that the (image, category) pair becomes exhaustively annotated.
3. Instance segmentation. Workers convert the point-level annotations into detailed polygonal segmentation masks.
4. Segment verification. Up to five annotators independently grade each mask. Stages 3 and 4 iterate until roughly 99 percent of masks pass quality standards.
5. Full recall verification. A separate pass confirms exhaustive annotation for each (image, category) pair, requiring agreement from at least four of five annotators.
6. Negative labels. Annotators check categories that are likely absent from each image and confirm them as negatives, populating the per-category negative image set.

The outputs of this pipeline are two sets per category, the positive set (images with all instances of that category exhaustively annotated) and the negative set (images known to contain no instances of that category) ^[1]. All other (image, category) pairs are simply unknown and must be ignored at evaluation time. This is what "federated" means in LVIS: a single dataset that is the union of many smaller per-category datasets, each of which is internally consistent but only covers a slice of the image set.
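
The three possible states of an (image, category) pair can be pictured with a small data structure. This is an illustrative sketch, not the format used by the official annotation files; the class name, method name, and image ids are all hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class CategorySlice:
    """One constituent dataset: the images vetted for a single category."""
    positive_images: set = field(default_factory=set)  # all instances annotated
    negative_images: set = field(default_factory=set)  # verified to contain none

    def status(self, image_id: int) -> str:
        if image_id in self.positive_images:
            return "positive"
        if image_id in self.negative_images:
            return "negative"
        return "unknown"  # never checked for this category; ignored at evaluation


# The federated dataset is the union of the per-category slices.
federated = {
    "toothpick": CategorySlice(positive_images={1, 7}, negative_images={2, 3}),
    "chair": CategorySlice(positive_images={1, 2, 3}, negative_images={7}),
}
print(federated["toothpick"].status(7))  # positive
print(federated["toothpick"].status(5))  # unknown
```

The same image can be positive for one category, negative for another, and unknown for the rest, which is why each slice keeps its own image sets.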

Quality control was unusually aggressive. Roughly 18 percent of initially detected categories were dropped during pruning, and about 10 percent of marked instances were dropped for visual consistency ^[1]. The remaining masks are reported by Gupta et al. to be substantially crisper and more boundary-accurate than the COCO masks they replace. The lvis-api repository emphasizes that LVIS uses higher-quality polygons than COCO ^[3], and several follow-up studies cite LVIS as evidence that COCO masks underestimate true object boundaries.

What is a federated dataset?

The "federated" design is the structural idea that makes a 1,200-category benchmark practical. The LVIS paper defines it directly: "A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category" ^[1]. Formally the dataset is the union over categories of each category's positive set and negative set. Because the exhaustive-annotation guarantee only has to hold within each small per-category dataset, the whole collection never needs to be exhaustively annotated for every category at once, which is what cuts annotation cost from roughly 200 million decisions down to a feasible number ^[1].

At test time an algorithm does not know which constituent datasets a given image belongs to, so it must predict as if every category will be evaluated; the evaluation oracle then scores each category only on its own constituent (positive and negative) images ^[1]. Several later large-vocabulary and open-vocabulary benchmarks adopted the same federated structure.

How is LVIS evaluated?

LVIS adapts the COCO mask AP protocol but adds two crucial twists: per-frequency reporting and federated handling of missing labels ^[1]^[4]. Understanding the evaluation rules is essential to interpreting LVIS numbers.

Mask AP

The primary metric is mask Average Precision (AP), computed as the mean of mask AP across IoU thresholds 0.5, 0.55, ..., 0.95 (10 thresholds in steps of 0.05), averaged across categories ^[1]. The intersection-over-union is computed between predicted and ground-truth instance masks. AP50 (IoU at 0.5) and AP75 (IoU at 0.75) are reported as well, but the headline number is the IoU-averaged AP, in line with the COCO convention.
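
To make the matching criterion concrete, the sketch below computes mask IoU for a toy pair of binary masks and lists the ten thresholds. It shows only the overlap test that AP is built on, not the full precision-recall computation; the function name and the toy masks are illustrative assumptions:

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


# Ten thresholds 0.50, 0.55, ..., 0.95, rounded to avoid float drift.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

predicted = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]
ground_truth = [[1, 1, 1],
                [1, 1, 1],
                [0, 0, 0]]

iou = mask_iou(predicted, ground_truth)          # 4 shared pixels / 6 in the union
matched_at = [t for t in IOU_THRESHOLDS if iou >= t]
print(matched_at)  # [0.5, 0.55, 0.6, 0.65]
```

A prediction like this one counts as a match at the four loosest thresholds and as a miss at the six strictest, which is why the IoU-averaged AP rewards tight masks more than AP50 does.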

A subtle but important detail is that LVIS averages AP across categories rather than across detections or images. This means a method that nails the head classes and ignores the tail will score much worse on LVIS than on COCO, because the tail makes up most of the categories ^[1].

Per-frequency AP

The distinctive LVIS metrics are AP_r, AP_c, and AP_f, the mean AP over the rare, common, and frequent buckets respectively ^[1]. These let researchers see at a glance whether an improvement comes from the head or the tail.

Metric	Meaning
AP	Mean AP across all 1,203 categories at IoU 0.5:0.05:0.95
AP50	Mean AP at IoU 0.5 (single threshold)
AP75	Mean AP at IoU 0.75
AP_r	Mean AP averaged over rare categories only (1-10 train images)
AP_c	Mean AP averaged over common categories only (11-100 train images)
AP_f	Mean AP averaged over frequent categories only (>100 train images)
AP_box	Bounding-box variant for detection-only methods
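
The per-frequency metrics are plain means over subsets of the per-category APs. A minimal sketch, assuming the per-category APs (already averaged over IoU thresholds) and the bucket labels are available as dictionaries; the names and values are hypothetical:

```python
from statistics import mean


def per_frequency_ap(ap_by_category, bucket_by_category):
    """Average per-category AP overall and within each frequency bucket."""
    summary = {"AP": mean(ap_by_category.values())}
    for bucket in ("r", "c", "f"):
        values = [ap for cat, ap in ap_by_category.items()
                  if bucket_by_category[cat] == bucket]
        # A bucket with no categories has no defined mean; report None.
        summary[f"AP_{bucket}"] = mean(values) if values else None
    return summary


# Hypothetical per-category APs, not results from any real model.
ap_by_category = {"birdhouse": 0.05, "stapler": 0.20, "chair": 0.45, "person": 0.50}
bucket_by_category = {"birdhouse": "r", "stapler": "c", "chair": "f", "person": "f"}

summary = per_frequency_ap(ap_by_category, bucket_by_category)
print(summary)
```

Because every category carries equal weight in the overall mean, a single rare category moves AP as much as a frequent category with thousands of instances.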

A standard Mask R-CNN with a ResNet-50-FPN backbone reaches around 21 percent overall AP on LVIS v0.5 in the original paper when trained with resampling, but scores only about 3 percent AP_r without it ^[1]^[5]. Repeat factor sampling raises AP_r to about 13 percent, the first concrete demonstration that simple sampling tricks materially help the tail ^[1]. These numbers became the de facto baselines for everyone who followed.

Federated AP

Because an image is exhaustively annotated for only a subset of categories, naively penalizing predictions on the missing categories would punish correct detections that happened to land on unannotated objects. The federated AP definition in LVIS only counts an image's predictions for category c if that image is in the positive or negative set for c ^[1]. Predictions on images outside both sets are simply ignored for that category. The bookkeeping is handled by the lvis-api Python package, which is built on top of the COCO API ^[3].

This design choice is what makes large-vocabulary evaluation tractable at all. Without federated AP, building a fair benchmark for 1,200 categories on 164,000 images would require nearly 200 million annotation decisions, which is infeasible ^[1]. With federated AP, annotation effort scales with the actual number of category instances rather than the cross product.
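
The filtering rule behind federated AP can be sketched in a few lines. This is an illustration of the rule as described above, not the lvis-api implementation, and it leaves out any further bookkeeping the official evaluator performs; the function name, data layout, and values are assumptions:

```python
def keep_for_evaluation(predictions, positive_images, negative_images):
    """Drop predictions on images whose status is unknown for their category.

    predictions: list of (image_id, category, score) tuples.
    positive_images / negative_images: dict mapping category -> set of image ids.
    """
    kept = []
    for image_id, category, score in predictions:
        vetted = (positive_images.get(category, set())
                  | negative_images.get(category, set()))
        if image_id in vetted:
            kept.append((image_id, category, score))
    return kept


positive_images = {"toothpick": {1}, "chair": {1, 2}}
negative_images = {"toothpick": {2}, "chair": set()}
predictions = [
    (1, "toothpick", 0.9),  # image 1 is positive for toothpick: scored
    (2, "toothpick", 0.4),  # image 2 is negative for toothpick: scored as a false positive
    (3, "toothpick", 0.8),  # image 3 is unknown for toothpick: ignored
    (3, "chair", 0.7),      # image 3 is unknown for chair: ignored
]
kept = keep_for_evaluation(predictions, positive_images, negative_images)
print(kept)  # [(1, 'toothpick', 0.9), (2, 'toothpick', 0.4)]
```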

When were the LVIS challenges held?

The LVIS organizers ran a series of competitions to drive method development ^[2]^[19].

Competition	Year	Notes
LVIS Challenge	ICCV 2019 (October)	First challenge, held on v0.5. Winner used Hybrid Task Cascade with class-balanced sampling and several long-tail tricks. The Equalization Loss work by Tan et al. emerged from this challenge and won first place.
LVIS Challenge	ECCV 2020	First challenge on v1.0. Many entries used the federated loss and decoupled training schedules.
LVIS Challenge	CVPR 2021	Final official challenge edition, won by methods built on Cascade Mask R-CNN with CBNetV2 backbones and Swin-L.

After 2021 the formal challenges wound down, but LVIS remained the standard test bed for long-tailed methods. New papers continued to report v1.0 numbers as a default, often alongside COCO numbers for comparison.

What are the main versions of LVIS?

LVIS has shipped two main public releases ^[1]^[2].

Release	Date	Images	Categories	Notes
LVIS v0.5	May 2019	About 82,000 (57k train, 5k val, 20k test)	1,230	Initial release with the CVPR 2019 paper. Used in the LVIS Challenge at ICCV 2019.
LVIS v1.0	2020	About 164,000 (100k train, 20k val, 20k test-dev, 20k test-challenge)	1,203	Expanded to the full COCO 2017 image set. Standard release used by all subsequent papers and the 2020 and 2021 challenges.

v0.5 was an interim release while annotation was still in progress, with 1,230 categories on about 82,000 images ^[1]. v1.0 made use of the complete COCO 2017 image set and refined the category list, which is why category counts dropped slightly from 1,230 to 1,203 ^[2]. From mid-2020 onward virtually all reported numbers in the literature use v1.0.

What long-tailed methods are evaluated on LVIS?

LVIS has driven the development of an entire subfield of long-tailed visual recognition. The methods listed below all report results on LVIS and represent the main families of approaches.

Method	Authors	Venue	Key idea
Repeat Factor Sampling (RFS)	Gupta, Dollar, Girshick	CVPR 2019	For each category c, define rc = max(1, sqrt(t/fc)) where fc is the fraction of training images containing c and t is a threshold. Repeat each image by the max rc over its categories. Simple and effective baseline.
Class-balanced sampling and re-weighting	Various	Pre-2019 baseline	Sample images so each category appears with similar frequency, or weight the loss inversely to category frequency. Helps the tail but can hurt the head.
Equalization Loss (EQL)	Tan, Wang, Li, Li, Ouyang, Yin, Yan	CVPR 2020	Masks out the discouraging gradients that rare categories receive as negatives from abundant head-category samples, so the tail is not suppressed. Won the LVIS Challenge 2019.
Equalization Loss v2 (EQLv2)	Tan, Lu, Zhang, Yin, Li	CVPR 2021	Gradient-guided per-category reweighting that balances positive and negative gradients independently. Outperforms EQL by about 4 AP overall and 14 to 18 AP on rare categories.
Decoupled training (cRT, tau-norm, LWS)	Kang, Xie, Rohrbach, Yan, Gordo, Feng, Kalantidis	ICLR 2020	First train representations with natural sampling, then re-train or normalize the classifier with class-balanced sampling. Surprisingly effective with no architectural changes.
Federated Loss	Zhou, Koltun, Krahenbuhl	arXiv 2021 (CenterNet2 / probabilistic two-stage)	Sample a subset of negative classes per image based on the federated annotation structure. Outperforms EQL on v1.0.
Seesaw Loss	Wang, Zhang, Zang, Cao, Pang, Gong, Chen, Liu, Loy, Lin	CVPR 2021	Dynamic mitigation factor reduces gradients on tail negatives, compensation factor penalizes false positives. State of the art at the time on LVIS without bells and whistles.
BAGS / DropLoss / LOCE	Various	2020-2021	Group-wise classification, randomized loss dropping, and online category equalization variations.
Detic	Zhou, Girdhar, Joulin, Krahenbuhl, Misra	ECCV 2022	Trains the classifier branch on image-level supervision from ImageNet-21k and uses CLIP text embeddings as the classifier weight. Reaches 41.7 mAP on LVIS rare classes and scales detection to over 20,000 categories.

A few patterns are visible across these methods. Loss-side fixes such as EQL, EQLv2, Federated Loss, and Seesaw Loss prevent gradients from abundant head categories from drowning out the tail ^[6]^[7]^[9]^[10]. Sampler-side fixes such as repeat factor sampling change the input distribution ^[1]. Decoupled training treats the representation and the classifier as separately optimizable ^[8]. Vision-language methods such as Detic sidestep the long tail by importing supervision from much larger image-level datasets ^[11].
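
The repeat factor formula from the table, rc = max(1, sqrt(t/fc)), can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and input layout are hypothetical, the default threshold is an assumption rather than a value given in this article, and the step that turns a fractional factor into a whole number of repeats is left out:

```python
import math
from collections import Counter


def repeat_factors(image_categories, t=0.001):
    """Return a per-image repeat factor: the max of r_c over the image's categories.

    image_categories: dict mapping image id -> set of category names.
    t: frequency threshold (the default here is an assumption, not from the text).
    """
    num_images = len(image_categories)
    images_per_category = Counter(
        cat for cats in image_categories.values() for cat in cats
    )
    # f_c is the fraction of training images that contain category c.
    r_c = {
        cat: max(1.0, math.sqrt(t / (count / num_images)))
        for cat, count in images_per_category.items()
    }
    # Images with no annotated category keep the neutral factor 1.0.
    return {
        image_id: max((r_c[cat] for cat in cats), default=1.0)
        for image_id, cats in image_categories.items()
    }


# Toy set: "person" appears in all 100 images, "toothpick" in only one.
image_categories = {i: {"person"} for i in range(100)}
image_categories[0] = {"person", "toothpick"}

factors = repeat_factors(image_categories, t=0.04)
print(factors[0], factors[1])  # 2.0 1.0
```

With t = 0.04 the toothpick category has fc = 0.01, so its factor is sqrt(4) = 2 and the one image containing it is oversampled, while images holding only the frequent category stay at 1.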

Which models report LVIS results?

Most modern detectors and segmentation networks report LVIS numbers. The table below lists representative entries, all using v1.0 unless otherwise noted.

Model	Year	Backbone	Approximate mask AP on LVIS v1.0	Notes
Mask R-CNN	2017 (LVIS baseline 2019)	ResNet-50-FPN	21 (v0.5, with RFS)	The default baseline reported in the original LVIS paper.
Cascade Mask R-CNN	2018	ResNet-101-FPN	About 26 to 28	Cai and Vasconcelos. Multi-stage cascade with rising IoU thresholds.
Hybrid Task Cascade (HTC)	2019	ResNeXt-101	About 30 to 33	Chen et al. CVPR 2019. Won the COCO 2018 challenge and was the basis of strong LVIS entries.
CenterMask2	2020	VoVNet	About 30 to 35	Lee and Park. One-stage anchor-free instance segmentation with backbone variants.
CenterNet2 + Federated Loss	2021	ResNet-50	About 32	Zhou, Koltun, Krahenbuhl. Probabilistic two-stage detection.
Mask2Former	2022	Swin-L	About 50 (v1.0)	Cheng et al. CVPR 2022. Universal segmentation architecture.
ViTDet	2022	ViT-H + MAE pretraining	48.1 mask AP (v1.0)	Li, Mao, Girshick, He. ECCV 2022. 5.0 points higher than the 2021 challenge winner's strong baseline.
Mask DINO	2023	Swin-L	About 50 to 53	Li et al. CVPR 2023. Unified DETR-style detection and segmentation.
Detic	2022	Swin-B + CLIP	41.7 mAP (rare classes on LVIS)	Trains on ImageNet-21k labels. Designed for open-vocabulary scaling.

Numbers are approximate because exact configurations (RFS, federated loss, EQL, multi-scale testing) vary across papers. The trend is nonetheless clear: from the 2019 baseline at about 21 mask AP (on v0.5) to the 2022 ViTDet entry at 48.1 mask AP (on v1.0), reported performance more than doubled in three years, with most of the gain on rare categories coming from long-tail-specific methods rather than backbone scaling alone ^[1]^[12].

How is LVIS different from COCO?

LVIS is one of several detection and segmentation datasets that target different combinations of vocabulary size, image count, and annotation density. The shortest answer is that LVIS uses the same images as COCO 2017 but replaces COCO's 80 roughly balanced categories with 1,203 long-tailed categories, higher-quality masks, and a federated evaluation metric ^[1]^[4].

Dataset	Year	Categories	Images	Annotation type	Notes
PASCAL VOC	2005-2012	20	About 11k (VOC 2012)	Bounding boxes; segmentation masks on a subset of images	Original detection benchmark; long obsolete for instance segmentation.
Microsoft COCO	2014	80 thing categories	About 328k (118k train, 5k val 2017)	Polygonal instance masks; about 2.5M instances	Roughly balanced; standard benchmark for general detection and segmentation.
LVIS v1.0	2020	1,203 entry-level categories	About 164k (same images as COCO 2017)	Federated polygonal masks; about 2M instances	Long-tailed by construction; high-quality masks.
Objects365	2019	365 categories	About 600k images, 10M+ bounding boxes	Bounding boxes only	Shao et al. ICCV 2019. Used heavily for detection pretraining.
Open Images V5/V6	2018-2020	600 boxable classes (350 with masks)	About 9M images	Bounding boxes plus masks on a 944k subset	Largest publicly available detection dataset by image count.
V3Det	2023	13,204 categories	About 243k images	Bounding boxes plus rich descriptions	Wang et al. ICCV 2023. Ten times larger vocabulary than LVIS; targets vast vocabulary detection.
ADE20K	2017	150 evaluation classes	About 25k training images	Per-pixel semantic and instance labels	Used mostly for semantic and panoptic segmentation.
Cityscapes	2016	19 classes (8 things)	5,000 fine + 20,000 coarse	Per-pixel labels in urban driving	Specialized to autonomous driving scenes.

LVIS is unique in this list for its combination of high-quality instance masks, large but tractable vocabulary, federated evaluation protocol, and inheritance of the COCO image set ^[1]^[4]. V3Det has a far larger vocabulary and Open Images far more images, but both rely mostly on box-level annotations, and V3Det is too new to have the same depth of methods built around it ^[16]. Objects365 is a useful pretraining corpus but has only 365 classes with box-only annotations ^[15]. COCO remains the de facto general benchmark, and LVIS is the de facto long-tailed benchmark.

What open-source resources support LVIS?

The LVIS ecosystem is built around a set of well-maintained open-source tools.

Resource	Description
lvisdataset.org	Official project website with paper, downloads, and challenge information.
github.com/lvis-dataset/lvis-api	Python API for loading, manipulating, and evaluating LVIS data. Pip-installable as `pip install lvis`. Built on top of the COCO API.
Detectron2 LVIS configs	Official Mask R-CNN, Cascade Mask R-CNN, and ViTDet configurations for LVIS in the Detectron2 model zoo. Includes pretrained weights.
MMDetection LVIS configs	Open-MMLab's MMDetection framework provides LVIS configs for Mask R-CNN, HTC, Seesaw Loss, EQL, EQLv2, and many more.
Annotation downloads	LVIS v1.0 train, val, and test annotation JSON files (about 1 GB compressed). Combined with COCO 2017 images (about 19 GB), the full working set is around 20 GB.
Pretrained checkpoints	Detectron2 and MMDetection both publish pretrained models with public mask AP and per-frequency breakdowns.

Using LVIS in practice is straightforward for anyone familiar with COCO. The image directory layout matches COCO 2017, and the annotation JSON format is a federated extension of COCO's format with extra fields for the positive and negative image sets per category ^[3].
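
As a rough illustration of working with the annotation JSON directly, the sketch below derives per-category positive and negative image sets from an already parsed file. The key names (`images`, `annotations`, `image_id`, `category_id`, `neg_category_ids`) are assumptions about the layout and should be checked against the lvis-api documentation; the sample dictionary is hand-made, not real LVIS data:

```python
from collections import defaultdict


def build_category_sets(dataset):
    """Derive per-category positive and negative image-id sets from a parsed JSON dict."""
    positives = defaultdict(set)
    negatives = defaultdict(set)
    # Any image carrying an annotation for a category counts as a positive for it.
    for ann in dataset["annotations"]:
        positives[ann["category_id"]].add(ann["image_id"])
    # Assumed layout: each image record lists the categories verified as absent.
    for image in dataset["images"]:
        for category_id in image.get("neg_category_ids", []):
            negatives[category_id].add(image["id"])
    return dict(positives), dict(negatives)


# Tiny hand-made stand-in for a parsed annotation file.
dataset = {
    "images": [
        {"id": 1, "neg_category_ids": [20]},
        {"id": 2, "neg_category_ids": [10]},
    ],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 10},
        {"id": 101, "image_id": 2, "category_id": 20},
        {"id": 102, "image_id": 2, "category_id": 20},
    ],
}
positives, negatives = build_category_sets(dataset)
print(positives)  # {10: {1}, 20: {2}}
print(negatives)  # {20: {1}, 10: {2}}
```

In practice the dictionary would come from `json.load` on a downloaded annotation file, and the lvis-api package is the safer route for evaluation because it already implements the federated bookkeeping.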

Why is LVIS significant?

LVIS is significant in several ways. It established the standard long-tailed instance segmentation benchmark and made class imbalance a central research topic in detection ^[1]. It introduced federated annotation and federated evaluation, which other large-vocabulary benchmarks (including V3Det and several open-vocabulary benchmarks) have since adopted ^[1]^[16]. It clarified that AP averaged uniformly over categories is a much harsher metric than the per-image AP that many older benchmarks used implicitly, and it forced the community to confront the gap between head and tail performance.

LVIS also drove the maturation of an entire methodological subfield. Repeat factor sampling, equalization losses, decoupled training, federated loss, Seesaw Loss, and image-level supervision via Detic all trace their lineage to LVIS evaluation results ^[1]^[6]^[7]^[8]^[9]^[10]^[11]. Many of these techniques transfer cleanly to other long-tailed problems, including long-tailed image classification, long-tailed semantic segmentation, and long-tailed scene graph generation. Subsequent vast-vocabulary datasets such as V3Det acknowledge that simply increasing the number of classes does not automatically test long-tail handling unless the distribution is also long-tailed and the evaluation respects per-category averaging ^[16].

ELI5: What is LVIS in simple terms?

Imagine you want to teach a computer to find and outline every object in a photo, not just common ones like people and cars, but rare ones like a stapler or a birdhouse. Most older datasets only taught the computer about a few dozen common objects, and they gave it thousands of example photos for each one. LVIS is harder and more realistic: it covers about 1,200 different kinds of objects, and just like in real life, a handful of those objects show up all the time while most show up only once or twice. That makes it a tough test, because the computer has to get good at the rare things too, not just the easy popular ones.

References

1. Gupta, A., Dollar, P., Girshick, R. (2019). "LVIS: A Dataset for Large Vocabulary Instance Segmentation". CVPR 2019. arXiv:1908.03195. https://arxiv.org/abs/1908.03195
2. LVIS Project Website. Facebook AI Research. https://www.lvisdataset.org/
3. lvis-dataset/lvis-api. Python API for the LVIS Dataset. GitHub. https://github.com/lvis-dataset/lvis-api
4. Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., Dollar, P. (2014). "Microsoft COCO: Common Objects in Context". ECCV 2014. arXiv:1405.0312.
5. He, K., Gkioxari, G., Dollar, P., Girshick, R. (2017). "Mask R-CNN". ICCV 2017. arXiv:1703.06870.
6. Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., Yan, J. (2020). "Equalization Loss for Long-Tailed Object Recognition". CVPR 2020. arXiv:2003.05176.
7. Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q. (2021). "Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection". CVPR 2021. arXiv:2012.08548.
8. Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., Kalantidis, Y. (2020). "Decoupling Representation and Classifier for Long-Tailed Recognition". ICLR 2020. arXiv:1910.09217.
9. Zhou, X., Koltun, V., Krahenbuhl, P. (2021). "Probabilistic two-stage detection" (CenterNet2 with Federated Loss). arXiv:2103.07461.
10. Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C. C., Lin, D. (2021). "Seesaw Loss for Long-Tailed Instance Segmentation". CVPR 2021. arXiv:2008.10032.
11. Zhou, X., Girdhar, R., Joulin, A., Krahenbuhl, P., Misra, I. (2022). "Detecting Twenty-thousand Classes using Image-level Supervision" (Detic). ECCV 2022. arXiv:2201.02605.
12. Li, Y., Mao, H., Girshick, R., He, K. (2022). "Exploring Plain Vision Transformer Backbones for Object Detection" (ViTDet). ECCV 2022. arXiv:2203.16527.
13. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., Girdhar, R. (2022). "Masked-attention Mask Transformer for Universal Image Segmentation" (Mask2Former). CVPR 2022. arXiv:2112.01527.
14. Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C. C., Lin, D. (2019). "Hybrid Task Cascade for Instance Segmentation". CVPR 2019. arXiv:1901.07518.
15. Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J. (2019). "Objects365: A Large-Scale, High-Quality Dataset for Object Detection". ICCV 2019.
16. Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D. (2023). "V3Det: Vast Vocabulary Visual Detection Dataset". ICCV 2023. arXiv:2304.03752.
17. Detectron2 Model Zoo (LVIS configs). Facebook AI Research. https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md
18. MMDetection LVIS Configs. Open-MMLab. https://github.com/open-mmlab/mmdetection/tree/main/configs/lvis
19. LVIS Challenge at ICCV 2019 / ECCV 2020 / CVPR 2021. Workshop materials. https://www.lvisdataset.org/challenge
