Instance segmentation

Updated Jun 23, 2026

Instance segmentation is the computer vision task of detecting every object instance in an image and producing a pixel-precise mask for each one, so that two cats on a couch become two separate masks rather than one merged region. It differs from semantic segmentation, which labels each pixel by category but does not separate individual objects, by additionally assigning a distinct identity to every "thing" instance. The task therefore combines the goals of object detection, where each object is found with a bounding box, and image segmentation, where each pixel is labeled. Its standard accuracy measure is the COCO mask Average Precision (AP), reported as the mean over 10 mask intersection-over-union (IoU) thresholds from 0.50 to 0.95 across 80 object categories.^[2]^[22]

The modern formulation crystallized around 2014 with Hariharan et al.'s "Simultaneous Detection and Segmentation" (SDS) at ECCV, and the field accelerated dramatically after the 2017 ICCV best paper, Mask R-CNN by He, Gkioxari, Dollar, and Girshick, which the authors described as "a conceptually simple, flexible, and general framework for object instance segmentation."^[1]^[5] By 2022 a single transformer architecture, Mask2Former, reached 50.1 mask AP on COCO, and in April 2023 Meta's promptable Segment Anything Model (SAM) was released alongside SA-1B, a dataset of over 1 billion masks on 11 million images, the largest segmentation corpus ever assembled.^[17]^[20] Today the task is one of the most heavily benchmarked problems in computer vision, with standard evaluations on Microsoft COCO, Cityscapes, LVIS, and ADE20K, and a method lineage that runs through two-stage detectors, single-shot networks, transformer-based set prediction, and promptable foundation models.^[2]^[20]

What does an instance segmentation system output?

Given an input image, an instance segmentation system must output, for each detected object, a class label drawn from a fixed vocabulary, a confidence score, and a binary mask the same size as the image (or aligned to the image grid) marking the pixels that belong to that specific instance. Two instances of the same category receive separate masks with separate identities. Masks from different instances may overlap (for example, a person partially in front of a chair), and the evaluation usually scores each instance independently rather than requiring the masks to partition the pixels.

Formally, given an image I, the system produces a set of triples {(c_i, s_i, M_i)} where c_i is a class index, s_i is a scalar confidence in [0, 1], and M_i is a binary mask. The number of instances is not fixed and may be zero. This unordered, variable-cardinality output is what makes the task hard: detection and segmentation must be solved jointly, and the model has to decide both how many objects are present and where each one ends.
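As a concrete picture of this output format, the sketch below stores each triple as a small Python record. The name `InstancePrediction`, the helper functions, and the class index 17 are illustrative assumptions rather than part of any particular library, and masks are plain nested lists of 0/1 values to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # H x W grid of 0/1 values


@dataclass
class InstancePrediction:
    class_id: int   # c_i: index into the fixed vocabulary (17 below is arbitrary)
    score: float    # s_i: confidence in [0, 1]
    mask: Mask      # M_i: binary mask aligned to the image grid


def mask_area(mask: Mask) -> int:
    """Number of foreground pixels in a mask."""
    return sum(sum(row) for row in mask)


def overlap_pixels(a: Mask, b: Mask) -> int:
    """Number of pixels claimed by both masks."""
    return sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


# Two objects of the same class on a 2x4 "image": separate masks, separate identities.
predictions = [
    InstancePrediction(class_id=17, score=0.98, mask=[[1, 1, 0, 0], [1, 1, 0, 0]]),
    InstancePrediction(class_id=17, score=0.91, mask=[[0, 1, 1, 1], [0, 0, 1, 1]]),
]

shared = overlap_pixels(predictions[0].mask, predictions[1].mask)
```

The list may have any length, including zero, and nothing in the structure forbids the two masks from sharing a pixel, which matches the way the task is usually evaluated.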

How does instance segmentation differ from semantic and panoptic segmentation?

Instance segmentation sits in a small family of pixel-level recognition tasks. Each one trades off a different combination of category labels and instance separation. In short, semantic segmentation labels pixels by class but merges same-class objects; instance segmentation separates "things" but ignores background "stuff"; and panoptic segmentation does both at once.

Task	Output per pixel	Instance ID?	Stuff vs things	Example
Image classification	One class label for the whole image	No	Things only	"This is a photo of a dog."
Object detection	Bounding box and class for each object	Yes (per box)	Things only	Boxes around three cars.
Semantic segmentation	One class label per pixel	No	Both stuff and things	All sky pixels labeled "sky", all car pixels labeled "car" but merged.
Instance segmentation	Class label per pixel, plus instance separation	Yes, for things	Things only	Each car gets its own mask; sky and road are usually ignored.
Panoptic segmentation	Class label per pixel, plus instance ID for things	Yes for things, no for stuff	Both	A complete partition of the image: sky as one stuff region, three separate cars as three thing instances.

The panoptic formulation was proposed by Kirillov, He, Girshick, Rother, and Dollar in their 2019 CVPR paper "Panoptic Segmentation".^[8] Their goal was to unify semantic and instance segmentation under a single output format and a single metric, the panoptic quality (PQ). The vocabulary they introduced is widely used: things are countable objects with well-defined shapes (people, cars, dogs); stuff covers amorphous regions of similar texture or material (sky, road, grass).^[8] Instance segmentation in the strict sense only operates on the thing classes, while panoptic segmentation requires correct labels for both.
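The relationship between the three outputs can be illustrated on a toy label map. The sketch below assumes a made-up 2x5 panoptic annotation in which every pixel holds a (class name, instance id) pair and stuff pixels use instance id 0; the class names and helper functions are hypothetical.

```python
THINGS = {"car"}  # countable classes; everything else is treated as stuff

# Toy panoptic map: each pixel is (class_name, instance_id); stuff uses id 0.
panoptic = [
    [("sky", 0), ("sky", 0), ("sky", 0), ("sky", 0), ("sky", 0)],
    [("car", 1), ("car", 1), ("road", 0), ("car", 2), ("car", 2)],
]


def to_semantic(pan):
    """Semantic segmentation: keep the class, drop the instance id."""
    return [[cls for cls, _ in row] for row in pan]


def to_instances(pan, things):
    """Instance segmentation: one pixel list per thing instance, stuff ignored."""
    instances = {}
    for y, row in enumerate(pan):
        for x, (cls, inst) in enumerate(row):
            if cls in things:
                instances.setdefault((cls, inst), []).append((y, x))
    return instances


semantic = to_semantic(panoptic)
instances = to_instances(panoptic, THINGS)
```

In the semantic view the two cars collapse into one "car" label, while the instance view keeps them apart and discards sky and road entirely.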

Which datasets are used to benchmark instance segmentation?

Progress on instance segmentation has been driven by a handful of large, carefully annotated datasets. Different datasets emphasize different difficulties: COCO focuses on common everyday objects in cluttered scenes, Cityscapes on urban driving, LVIS on long-tailed vocabularies, and ADE20K on broad scene parsing.

Dataset	Year	Categories	Images	Notes
Microsoft COCO	2014	80 thing classes (of 91 categories defined in the original paper)	About 328k images, roughly 118k train and 5k validation in the 2017 split	Lin et al., "Microsoft COCO: Common Objects in Context". 2.5M labeled instances. The de facto benchmark for general instance segmentation.
Cityscapes	2016	19 semantic classes; 8 thing classes for instance evaluation (person, rider, car, truck, bus, train, motorcycle, bicycle)	5,000 images with fine annotations and 20,000 with coarse annotations from 50 cities	Cordts et al. Urban driving scenes; standard benchmark for autonomous driving research.
LVIS	2019	1,203 entry-level categories	About 164k images, around 2M instance masks	Gupta, Dollar, Girshick. Long-tailed (Zipfian) distribution; designed to expose how detectors fail on rare classes.
ADE20K	2017	150 evaluation classes covering both stuff and things	About 25k images in total, roughly 20k of them for training	Zhou et al. Used for scene parsing and panoptic segmentation; broader than COCO but smaller per-class.
YouTube-VIS	2019	40 categories	2,883 videos in the 2019 split, expanded to 3,859 in 2021	Yang et al. The first large benchmark for video instance segmentation; tracks each instance across frames.
Mapillary Vistas	2017	66 classes for instance/panoptic	25,000 high-resolution street-level images	Used for street-scene understanding at global scale.
Open Images	2017 onward	350 classes with masks	About 2.7M instance segmentations on a 944k subset (Open Images V5)	Google's large-scale dataset; instance masks were added in V5 (2019).

The COCO benchmark deserves special mention because the COCO API and its evaluation protocol have become the lingua franca of the field. The dataset, released by Lin and collaborators at Microsoft Research and Cornell, contains complex everyday scenes with multiple objects per image.^[2] It is small enough to train on a single multi-GPU server but large enough to drive meaningful generalization. See COCO dataset for a longer treatment.

How is instance segmentation evaluated?

The primary metric for COCO-style instance segmentation is mean Average Precision (AP) computed on mask intersection-over-union (IoU). For each (image, class) pair, predicted masks are matched to ground-truth masks at a series of IoU thresholds, precision-recall curves are computed, and the area under each curve is averaged.^[22] Averaging over a range of IoU thresholds rather than a single value rewards detectors with better boundary localization, which is why the COCO challenge replaced the older single-threshold PASCAL VOC metric.^[22]
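The matching and averaging steps can be sketched for a single image and a single class. This is a simplified illustration, not the COCO API: it assumes greedy matching in descending score order and 101-point interpolated precision, and it omits crowd regions, per-image detection caps, and area ranges. All function names are hypothetical.

```python
def mask_iou(a, b):
    """Intersection-over-union of two equal-sized 0/1 grids."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa | pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 0.0


def average_precision(preds, gts, iou_thr):
    """AP for one class at one IoU threshold.

    preds: list of (score, mask); gts: list of masks.
    """
    if not gts:
        return 0.0
    matched, tp_flags = set(), []
    for _, mask in sorted(preds, key=lambda p: -p[0]):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground truth can be claimed once
            iou = mask_iou(mask, gt)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
        tp_flags.append(best_j is not None)

    # Precision and recall after each detection, in score order.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sample precision at 101 evenly spaced recall levels.
    total = 0.0
    for i in range(101):
        r = i / 100
        total += next((p for p, rec in zip(precisions, recalls) if rec >= r), 0.0)
    return total / 101


def coco_style_ap(preds, gts):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
```

A prediction whose mask IoU with its ground truth is 2/3 counts as a true positive at thresholds 0.50 through 0.65 and as a false positive at the six stricter ones, so a boundary that is only roughly right is penalized even though the object was found.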

The COCO protocol reports several variants:

Metric	Definition
AP	Mean of AP at IoU thresholds 0.50, 0.55, 0.60, ..., 0.95 (10 thresholds, step 0.05). Averaged over all 80 categories. The headline number.
AP50	AP at a single IoU threshold of 0.50. Easier; commonly reported by older PASCAL VOC papers.
AP75	AP at IoU 0.75. A stricter localization requirement.
APs, APm, APl	AP restricted to small (area < 32^2 pixels), medium (32^2 to 96^2), and large (> 96^2) objects respectively.
AR1, AR10, AR100	Average recall when allowed at most 1, 10, or 100 detections per image.

The Cityscapes instance benchmark uses a similar AP averaged over IoU 0.5 to 0.95 in steps of 0.05, restricted to the 8 thing classes.^[4]^[23]

For panoptic outputs the field uses panoptic quality (PQ), defined by Kirillov et al. (2019) as PQ = (sum of IoU over true positives) / (TP + 0.5 * FP + 0.5 * FN).^[8] PQ factors cleanly into segmentation quality (SQ, the average IoU of matched segments) times recognition quality (RQ, an F1 score over segments).^[8]
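The factorization can be checked numerically. The sketch below assumes segments have already been matched (Kirillov et al. count a predicted and a ground-truth segment as a true positive when their IoU exceeds 0.5); the function name and the example IoU values are illustrative.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from the IoUs of matched segments and the error counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # average IoU of matched segments
    rq = tp / denom                             # F1-style recognition score
    pq = sum(matched_ious) / denom
    return pq, sq, rq


# Three matched segments, one unmatched prediction, one missed ground truth.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
```

With these inputs SQ is 0.8, RQ is 0.75, and PQ is their product, 0.6.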

More recent work has argued that mask AP is biased toward interior pixels and underweights boundary errors. Cheng, Girshick, Dollar, Berg, and Kirillov proposed Boundary IoU (CVPR 2021) as a complementary metric that focuses on a thin band around mask boundaries.^[16]
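A minimal sketch of the idea follows. It assumes the boundary band is obtained by subtracting an eroded copy of the mask from the mask itself, with the band width `d` given as a number of 3x3 erosion steps; the paper instead ties the width to the image diagonal, so `d` and the function names here are simplifying assumptions.

```python
def erode(mask):
    """One step of 3x3 binary erosion; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            keep = 1
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w) or mask[ny][nx] == 0:
                        keep = 0
            out[y][x] = keep
    return out


def boundary_band(mask, d=1):
    """Mask pixels that lie within d erosion steps of the mask contour."""
    eroded = mask
    for _ in range(d):
        eroded = erode(eroded)
    return [[m - e for m, e in zip(mr, er)] for mr, er in zip(mask, eroded)]


def boundary_iou(gt, pred, d=1):
    """IoU computed only over the boundary bands of the two masks."""
    gb, pb = boundary_band(gt, d), boundary_band(pred, d)
    inter = sum(a & b for ra, rb in zip(gb, pb) for a, b in zip(ra, rb))
    union = sum(a | b for ra, rb in zip(gb, pb) for a, b in zip(ra, rb))
    return inter / union if union else 0.0


# A 6x6 object, and a prediction that misses its rightmost column.
gt = [[1] * 6 for _ in range(6)]
pred = [[1, 1, 1, 1, 1, 0] for _ in range(6)]
score = boundary_iou(gt, pred)
```

For this pair the ordinary mask IoU is 30/36, while the boundary-band score is 14/24, so the same one-column error costs noticeably more under the boundary-focused measure.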

What are the main families of instance segmentation methods?

The history of instance segmentation can be told as four overlapping waves: proposal-based pioneers, two-stage detectors with mask heads, single-shot dense predictors, and transformer-based set prediction. Promptable foundation models like SAM sit somewhat orthogonal to this taxonomy but interact with all of them.

Proposal-based pioneers (2014-2016)

Hariharan et al.'s SDS paper at ECCV 2014 set the template: generate region proposals, classify them, and refine the masks. They built on R-CNN with category-specific top-down figure-ground predictions.^[1] Pinheiro, Collobert, and Dollar's DeepMask (NeurIPS 2015) replaced the hand-engineered proposal step with a fully convolutional network that directly predicted class-agnostic segment proposals plus an objectness score.^[3] Pinheiro et al.'s follow-up SharpMask added a top-down refinement to recover sharper object boundaries. This line of work established that good masks could be produced without reliance on edges or superpixels.

Two-stage detection-based methods (Mask R-CNN family)

In 2017 He, Gkioxari, Dollar, and Girshick introduced Mask R-CNN at ICCV, where it won the best paper award (Marr Prize).^[5] The authors framed it as "a conceptually simple, flexible, and general framework for object instance segmentation," and the idea is structurally simple: extend Faster R-CNN with a third output branch that predicts a binary mask for each region of interest, in parallel with the existing classification and bounding-box regression branches.^[5]

The key technical contributions were:

RoIAlign, a pooling operator that uses bilinear interpolation instead of integer-coordinate quantization. The earlier RoIPool produced misaligned features that hurt mask quality far more than they hurt box accuracy. RoIAlign fixed this by sampling features at sub-pixel locations.^[5]
A small fully convolutional mask head that produced per-class masks at a fixed resolution, typically 28 by 28 pixels, on top of the RoI feature map. The output had K channels for K categories; during training only the channel of the ground-truth class was supervised, and at inference the channel of the predicted class was used, decoupling class prediction from mask prediction.^[5]
A per-pixel binary cross-entropy loss on the predicted mask, summed with the standard classification and box-regression losses from Faster R-CNN.^[5]
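The RoIAlign idea in the first item can be sketched in a few lines: divide the region into bins, sample each bin at fractional positions with bilinear interpolation, and average the samples. The code below is an illustrative simplification on a single-channel feature map; the function names, the `(y0, x0, y1, x1)` box convention, and the default of 2x2 samples per bin are assumptions, and pixel-center offset conventions are ignored.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, box, out_size=2, samples=2):
    """Average bilinear samples inside each output bin; no coordinate rounding."""
    by0, bx0, by1, bx1 = box
    bin_h = (by1 - by0) / out_size
    bin_w = (bx1 - bx0) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    y = by0 + (i + (sy + 0.5) / samples) * bin_h
                    x = bx0 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feature, y, x))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# A feature map that is linear in position, f(y, x) = x + 10 * y.
feature = [[x + 10 * y for x in range(4)] for y in range(4)]
pooled = roi_align(feature, box=(0.5, 0.5, 2.5, 2.5))
```

Because the box corners stay fractional throughout, a box that starts at 0.5 is not snapped to 0 or 1, which is the quantization step that RoIPool performed and that degraded mask alignment.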

Mask R-CNN ran at about 5 frames per second and reported top results on all three tracks of the COCO suite of challenges (instance segmentation, bounding-box detection, and person keypoint detection), outperforming earlier single-model entries including the COCO 2016 challenge winners.^[5] Its conceptual simplicity made it the default baseline for years afterward.

Follow-on work pushed accuracy further. Cascade Mask R-CNN (Cai and Vasconcelos, 2018-2019) trained a sequence of detectors with increasing IoU thresholds. Hybrid Task Cascade (HTC) by Chen et al. (2019) interleaved detection and segmentation across cascade stages and added a semantic segmentation branch to provide context. PointRend (Kirillov et al. 2020) treated mask prediction as a rendering problem, refining masks at uncertain points with an MLP and producing crisp boundaries at high resolution.^[14]

Single-shot and anchor-free methods

Two-stage methods are accurate but slow. From around 2019 the field produced a wave of one-stage instance segmentation networks designed for real-time deployment.

YOLACT (Bolya, Zhou, Xiao, Lee, 2019) was the first method to crack 30 fps on COCO with reasonable accuracy: 29.8 mask AP at 33.5 fps on a single Titan Xp.^[10] It split the problem into two parallel branches: a fully convolutional network that produced a small set of prototype masks for the whole image, and a per-detection branch that predicted linear coefficients to combine those prototypes. The instance mask was the linear combination of the prototypes, passed through a sigmoid, cropped to the predicted box, and thresholded at 0.5.^[10] The same paper introduced Fast NMS, and the follow-up YOLACT++ added deformable convolutions and a fast mask re-scoring branch.
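The prototype-and-coefficient assembly can be sketched as follows. In the YOLACT paper a sigmoid is applied to the linear combination before thresholding; the sketch keeps that step but omits cropping to the predicted box, and the function name and toy values are illustrative assumptions.

```python
import math


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine k prototype maps with k per-detection coefficients into one binary mask.

    prototypes: list of k grids of shape H x W; coeffs: list of k floats.
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask


# Two toy prototypes: one fires on the left column, one on the right.
prototypes = [
    [[1, 0], [1, 0]],
    [[0, 1], [0, 1]],
]
left_object = assemble_mask(prototypes, coeffs=[4.0, -4.0])
right_object = assemble_mask(prototypes, coeffs=[-4.0, 4.0])
```

The prototypes are shared by every detection in the image, so adding an instance costs only one small coefficient vector and one weighted sum, which is where much of the speed comes from.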

SOLO ("Segmenting Objects by Locations", ECCV 2020) and SOLOv2 (NeurIPS 2020) by Wang et al. reframed the problem yet again. Instead of detecting then segmenting, SOLO divided the image into a grid and made the cell containing an object's center responsible for predicting that object's mask.^[13] SOLOv2 introduced a dynamic mask head that decoupled mask kernel learning from mask feature learning and a Matrix NMS for fast post-processing.^[13] A lightweight SOLOv2 reached 31.3 fps and 37.1 AP on COCO.^[13]
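The location-based assignment can be illustrated with a small sketch. In SOLO the image is divided into an S x S grid and the cell that contains an object's center is responsible for that object's mask; the code below assumes the center of mass of the ground-truth mask as the center and assigns exactly one cell per object, whereas the published method also labels a small region around the center. The function names are hypothetical.

```python
def center_of_mass(mask):
    """Mean (y, x) position of the foreground pixels of a non-empty mask."""
    pts = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    return sum(y for y, _ in pts) / n, sum(x for _, x in pts) / n


def assign_grid_cell(mask, grid_size):
    """Return the (row, col) grid cell that contains the object's center."""
    h, w = len(mask), len(mask[0])
    cy, cx = center_of_mass(mask)
    # +0.5 moves from pixel index to pixel-center coordinates.
    row = min(int((cy + 0.5) / h * grid_size), grid_size - 1)
    col = min(int((cx + 0.5) / w * grid_size), grid_size - 1)
    return row, col


upper_left = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
lower_right = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
cells = [assign_grid_cell(m, grid_size=2) for m in (upper_left, lower_right)]
```

Two objects whose centers fall in different cells are separated by construction, with no boxes or proposals involved.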

Other one-stage approaches include CondInst (Tian et al., ECCV 2020), which used dynamic convolutional filters conditioned on each instance, and BlendMask (Chen et al., CVPR 2020), which combined a top-down attention map with bottom-up base masks. YOLOv5-seg, YOLOv8-seg, and Ultralytics' more recent variants extended the YOLO real-time detector family to masks. RTMDet (2022) reported 52.8 box AP on COCO at over 300 fps on an RTX 3090, and its instance segmentation variant RTMDet-Ins reached roughly 44.6 mask AP, an extreme on the speed-accuracy trade-off.

Transformer-based set prediction

The transformer wave began with DETR (Carion et al., ECCV 2020) at Facebook AI Research.^[12] DETR reframed object detection as direct set prediction: a transformer encoder-decoder consumed a CNN feature map together with a fixed set of learned object queries and produced a fixed-cardinality set of (class, box) predictions.^[12] A bipartite Hungarian matching loss aligned predictions with ground-truth objects without anchors or NMS.^[12] DETR could be extended to panoptic segmentation by adding a small mask head on top of each query.^[12]
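The bipartite matching step can be sketched on a toy cost matrix. For readability this example finds the minimum-cost assignment by brute force over permutations, which gives the same answer as the Hungarian algorithm on small inputs; production code usually calls a dedicated solver such as `scipy.optimize.linear_sum_assignment`. The function name and the cost values are made up for illustration.

```python
from itertools import permutations


def match_predictions(cost):
    """Minimum-cost one-to-one assignment of ground truths to predictions.

    cost[i][j] is the cost of pairing prediction i with ground truth j.
    Assumes at least as many predictions (queries) as ground truths.
    Returns ({gt_index: pred_index}, total_cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {g: p for g, p in enumerate(best_perm)}, best_total


# Three queries, two ground-truth objects.
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
assignment, total = match_predictions(cost)
```

Here ground truth 0 is paired with query 1 and ground truth 1 with query 0; the leftover query 2 is trained to predict "no object", which is how duplicates are suppressed without NMS.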

MaskFormer by Cheng, Schwing, and Kirillov (NeurIPS 2021) generalized this idea: predict a set of binary masks, each tagged with a single global class label.^[15] They showed that mask classification works for both semantic and panoptic segmentation, reaching 55.6 mIoU on ADE20K and 52.7 PQ on COCO, and that it scales better than per-pixel classification when the number of classes is large.^[15]

Mask2Former (Cheng, Misra, Schwing, Kirillov, Girdhar, CVPR 2022) was the breakthrough. It introduced masked attention, a cross-attention variant that restricts attention to the region predicted by the previous mask, and a multi-scale high-resolution feature decoder.^[17] With a Swin-L backbone, Mask2Former reached 50.1 AP for instance segmentation on COCO, 57.8 PQ for panoptic segmentation on COCO, and 57.7 mIoU for semantic segmentation on ADE20K, leading three benchmarks with one architecture.^[17] The authors reported that it "outperforms the best specialized architectures by a significant margin" and argued that a single universal architecture reduces research effort by at least three times compared with maintaining a separate specialized model for each task.^[17]
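Masked attention can be sketched for a single query as a softmax in which pixels outside the previous layer's predicted mask receive a score of negative infinity. The code below is a simplified illustration on flat Python lists; the function name is hypothetical, and the fallback to full attention when the previous mask is empty is stated here as an assumption about how that degenerate case is handled.

```python
import math


def masked_attention_weights(logits, prev_mask):
    """Attention weights of one query over all pixels, restricted to prev_mask.

    logits: query-key scores, one per pixel; prev_mask: 0/1 flags, one per pixel.
    """
    if not any(prev_mask):
        # Assumption: an empty mask falls back to attending everywhere.
        prev_mask = [1] * len(logits)
    masked = [l if m else float("-inf") for l, m in zip(logits, prev_mask)]
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    norm = sum(exps)
    return [e / norm for e in exps]


# The third pixel has the largest score but lies outside the previous mask.
weights = masked_attention_weights([1.0, 1.0, 5.0, 1.0], prev_mask=[1, 1, 0, 0])
```

The high-scoring third pixel receives zero weight, so the query can only gather features from the region it already believes belongs to its object.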

Mask DINO (Li et al., CVPR 2023) unified DETR-style detection and Mask2Former-style segmentation, reaching about 54.7 AP on COCO instance segmentation.^[19] Other transformer-based contenders include OneFormer, kMaX-DeepLab, and the segmentation variants of BEiT-3 (around 54.8 mask AP on COCO with very large vision transformers).

Promptable and foundation models

In April 2023 Meta AI released the Segment Anything Model (SAM) by Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollar, and Girshick, introducing what the paper called "the Segment Anything (SA) project: a new task, model, and dataset for image segmentation."^[20] SAM was trained on a new dataset called SA-1B, with over 1 billion masks on 11 million licensed images, the largest segmentation corpus ever assembled.^[20] Architecturally, SAM has three parts: a heavy Vision Transformer image encoder, a lightweight prompt encoder that consumes points, boxes, or coarse masks, and a small mask decoder that turns the encoded image and prompt into a set of candidate masks.^[20] See Segment Anything Model (SAM) for more detail.

SAM was designed for a new task formulation called promptable segmentation: given any prompt about an object (a click, a box, a few coarse strokes), return a valid mask.^[20] The authors reported that its zero-shot transfer to new domains is "often competitive with or even superior to prior fully supervised results," frequently matching trained baselines without any fine-tuning.^[20] SAM is class-agnostic by design; pairing it with an open-vocabulary detector like Grounding DINO yields Grounded SAM, which acts as a class-aware instance segmentation pipeline driven by text prompts.

Meta released SAM 2 in July 2024 (Ravi et al.), extending the same prompt-based interface to video.^[21] SAM 2 uses a streaming memory mechanism so the model can track an object across frames after a single prompt and is reportedly six times faster than SAM 1 on images while needing roughly three times fewer interactions for comparable video accuracy.^[21] Other related foundation models include SEEM ("Segment Everything Everywhere All at Once") and SAM-HQ, a higher-quality variant.

Which backbone networks power instance segmentation models?

Under the hood, almost every modern instance segmentation system is built on a convolutional neural network or vision transformer feature extractor. The choice of backbone trades off accuracy for compute and memory.

Family	Examples	Notes
Plain ResNets	ResNet-50, ResNet-101	Default Mask R-CNN backbone in 2017. Still common as a baseline.
ResNeXt	ResNeXt-101 32x8d	Grouped convolutions; modest accuracy gain over ResNet at similar FLOPs.
HRNet	HRNetV2	Maintains high-resolution features throughout, useful for fine boundaries.
ConvNeXt	ConvNeXt-T/S/B/L	A modernized pure ConvNet family from Liu et al. (CVPR 2022); competitive with transformers.
Swin Transformer	Swin-T, Swin-S, Swin-B, Swin-L	Liu et al. (ICCV 2021); shifted-window attention; the standard backbone for Mask2Former.
Vision Transformer	ViT-B/L/H	Dosovitskiy et al. (2021); used in SAM, often pretrained on enormous web data.
Self-supervised pretraining	DINOv2, EVA, MAE	These produce strong general-purpose features that transfer well to segmentation.
FPN	Feature Pyramid Network	Lin et al. (CVPR 2017); not a backbone itself but a multi-scale feature aggregator used on top of nearly every backbone above.

What loss functions train an instance segmentation model?

Detection-based methods like Mask R-CNN sum three losses: a softmax classification loss over object categories, a smooth-L1 box-regression loss, and a per-pixel binary cross-entropy mask loss.^[5] The mask loss is computed only for the ground-truth class channel, which decouples class prediction from mask shape and avoids competition between classes.^[5]
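The per-class decoupling can be made concrete with a small sketch of the mask term for one region of interest. It assumes raw logits of shape K x m x m stored as nested lists and averages a numerically stable binary cross-entropy over the pixels of the ground-truth channel only; the function names are illustrative.

```python
import math


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mask_head_loss(mask_logits, gt_class, gt_mask):
    """Average per-pixel BCE on the ground-truth class channel only.

    mask_logits: K channels of m x m logits; gt_mask: m x m grid of 0/1.
    """
    channel = mask_logits[gt_class]
    losses = [
        bce_with_logit(l, t)
        for lrow, trow in zip(channel, gt_mask)
        for l, t in zip(lrow, trow)
    ]
    return sum(losses) / len(losses)


gt_mask = [[1, 0], [1, 0]]
logits_a = [[[0.0, 0.0], [0.0, 0.0]], [[9.0, -9.0], [9.0, -9.0]]]
logits_b = [[[0.0, 0.0], [0.0, 0.0]], [[-3.0, 7.0], [2.0, 2.0]]]
loss_a = mask_head_loss(logits_a, gt_class=0, gt_mask=gt_mask)
loss_b = mask_head_loss(logits_b, gt_class=0, gt_mask=gt_mask)
```

The two losses are identical because only channel 0 is read: whatever the other class channel predicts has no effect, so the masks of different classes never compete.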

Set-prediction methods like DETR, MaskFormer, and Mask2Former use a different recipe. They first solve a Hungarian assignment between predicted and ground-truth instances, minimizing a matching cost that combines classification probability, box overlap, and mask similarity.^[12]^[15] They then back-propagate a loss summed over the matched pairs only, typically a cross-entropy classification loss plus a binary cross-entropy mask loss combined with a Dice loss for better behavior on small masks.^[15]^[17] The Hungarian matching ensures that exactly one prediction is responsible for each ground-truth instance, so post-processing like non-maximum suppression is unnecessary.^[12]
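The reason for pairing cross-entropy with Dice can be seen on a tiny example. The sketch below works on flattened probability maps; the function names, the smoothing constant, and the equal default weights are assumptions made for illustration, not values taken from any of the cited papers.

```python
import math


def bce(probs, target, eps=1e-7):
    """Mean binary cross-entropy over pixels; probabilities are clamped for safety."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice(probs, target, smooth=1.0):
    """Dice loss: 1 minus a smoothed overlap ratio between prediction and target."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(target) + smooth)


def mask_loss(probs, target, w_bce=1.0, w_dice=1.0):
    """Weighted sum of the two terms (weights are hypothetical defaults)."""
    return w_bce * bce(probs, target) + w_dice * dice(probs, target)


# A 100-pixel map with a single foreground pixel, and a prediction that misses it.
target = [1] + [0] * 99
probs = [0.01] * 100
bce_value = bce(probs, target)
dice_value = dice(probs, target)
```

The averaged cross-entropy stays small because 99 background pixels are predicted well, while the Dice term remains large because the one foreground pixel was missed, which is the behavior that helps on small masks.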

SAM's training loss is a focal loss plus a Dice loss on each predicted mask, with an additional IoU prediction head trained with mean squared error so the model can rank its own outputs at inference time.^[20]

Practical considerations

Instance segmentation systems live or die on a few engineering choices that often matter more than headline accuracy.

Anchor-based vs anchor-free. Mask R-CNN inherits anchor boxes from Faster R-CNN. Newer methods like SOLO, CondInst, and Mask2Former are anchor-free, which simplifies the pipeline and removes a sensitive hyperparameter (anchor scales and aspect ratios). Set-prediction transformers go further and dispense with NMS as well.

Mask resolution. Mask R-CNN predicts each instance mask at 28 by 28 and then upsamples to the box.^[5] This is a pragmatic choice that keeps the mask head cheap, but it limits boundary detail. PointRend, transformer methods, and SAM use higher resolutions or iterative refinement to recover sharp edges.^[14]

Class-agnostic vs class-specific masks. Mask R-CNN predicts K mask channels per RoI but supervises only one.^[5] SAM and the early DeepMask line predict a single class-agnostic mask.^[3]^[20] Class-agnostic masks transfer better across vocabularies and are essential for prompt-driven workflows; class-specific masks may be slightly more accurate when the vocabulary is fixed and small.

Long-tail handling. On LVIS the head classes have thousands of training instances and the tail classes have fewer than ten.^[11] Standard losses overfit the head and underfit the tail. Common remedies include repeat-factor sampling (Gupta et al. 2019), federated loss (LVIS challenge baselines), equalization losses (Tan et al. 2020, 2021), and decoupled training of representation and classifier.^[11]
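Repeat-factor sampling can be sketched directly from its definition in the LVIS paper: each category c gets a factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing c, and each image is repeated according to the largest factor among its categories. The function name, the toy dataset, and the threshold value used in the example are illustrative assumptions.

```python
import math


def repeat_factors(image_categories, t=0.001):
    """Per-image repeat factors for oversampling images that contain rare classes.

    image_categories: one set of category ids per training image.
    t: frequency threshold below which a category is oversampled.
    """
    n = len(image_categories)
    counts = {}
    for cats in image_categories:
        for c in cats:
            counts[c] = counts.get(c, 0) + 1
    # Category-level factor: 1 for frequent classes, larger for rare ones.
    cat_rf = {c: max(1.0, math.sqrt(t / (k / n))) for c, k in counts.items()}
    # Image-level factor: driven by the rarest category present in the image.
    return [max((cat_rf[c] for c in cats), default=1.0) for cats in image_categories]


# 100 images of dogs, one of which also contains a rare class.
images = [{"dog"}] * 99 + [{"dog", "okapi"}]
factors = repeat_factors(images, t=0.04)
```

With this threshold the single image containing the rare class is seen about twice per epoch while the other 99 are seen once; fractional factors are usually realized by stochastic rounding at the start of each epoch.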

Weakly and semi-supervised approaches. Mask annotations are expensive (Lin et al. estimated tens of seconds per polygon), so a parallel literature trains instance segmenters from weaker signals: bounding boxes only (BoxInst, Tian et al. 2021), image-level labels, scribbles, or unlabeled data with self-training.^[2] SAM, with its near-zero-cost prompts, can also act as a labeling assistant inside this loop.

Real-time deployment. Production systems on cars, drones, or AR glasses need tens of milliseconds per frame, not hundreds. The YOLO-seg family, RTMDet-Ins, and SOLOv2-Lite live in this regime. For server-side workloads, accuracy-first methods like Mask2Former or Mask DINO are preferred.

What is instance segmentation used for?

Instance segmentation underpins a wide range of products and research areas.

Autonomous driving. Separating individual pedestrians, cyclists, and vehicles is a core perception requirement. Cityscapes, BDD100K, and NuScenes are common benchmarks; production stacks at Waymo, Cruise, Tesla, and Mobileye all rely on per-instance masks for tracking and motion prediction.^[4]
Medical imaging. Cell and nucleus segmentation in pathology, organ segmentation in CT and MRI, polyp detection in endoscopy, and lesion delineation all use instance segmentation. The Kaggle 2018 Data Science Bowl on nucleus segmentation popularized Mask R-CNN in the medical community. SAM has been adapted in MedSAM and SAM-Med2D for medical fine-tuning.
Robotics and manipulation. Robotic grasping needs to know exactly which pixels belong to the object being picked. Instance masks feed directly into 6-DoF grasp planners and bin-picking systems.
Augmented reality. Real-time portrait masking, hair segmentation, and object substitution in AR filters depend on per-instance masks running on mobile GPUs.
Video editing and visual effects. Rotoscoping, the manual frame-by-frame masking of actors, is being partially automated by SAM 2 and Mask2Former-VIS.^[21]
Satellite and aerial imagery. Building footprint extraction, vehicle counting, and crop-field delineation use instance segmentation on overhead imagery. The SpaceNet building challenges and xView2 are common benchmarks.
Agriculture. Per-plant or per-fruit segmentation supports yield prediction, weed mapping, and selective harvesting.
Retail and industrial inspection. Shelf monitoring, self-checkout, and defect detection on production lines all rely on per-instance masks to count, identify, or localize products and flaws.

What are the open challenges in instance segmentation?

Despite roughly a decade of intensive work, several problems remain genuinely hard.

Occlusion. When two instances of the same class overlap heavily, current methods often merge them or produce broken masks. Amodal segmentation, which asks the model to predict the full extent of an object including hidden parts, is an active subfield.

Small objects. APs (small-object AP) typically sits tens of points below APl on COCO; the original Mask R-CNN with a ResNet-101-FPN backbone, for example, reported 15.5 APs against 52.4 APl.^[5] Higher input resolution helps but is expensive. Multi-scale architectures and crop-and-rescale strategies are common workarounds.

Long-tailed and open-vocabulary recognition. On LVIS the gap between rare and frequent classes is large.^[11] Open-vocabulary instance segmentation, where the system is asked about categories it has never seen during training, is an active frontier driven by CLIP-based classifiers and Grounded SAM-style pipelines.

Real-time deployment under tight compute budgets. A self-driving car may have only a few milliseconds of latency budget per camera, and edge devices have far less compute than a workstation GPU.

Video temporal consistency. Per-frame masks predicted independently flicker and drift between frames. Video instance segmentation methods like MaskTrack R-CNN (2019), VisTR, IDOL, and Mask2Former-VIS try to enforce temporal coherence, and SAM 2's memory module is a recent foundation-model approach to the same problem.^[9]^[21]

Foundation models: fine-tune or prompt? SAM and SAM 2 are powerful but class-agnostic.^[20]^[21] The community is still working out the right interface between large promptable models and downstream tasks: fine-tune the encoder, train adapters, distill into a small specialist, or wrap the model in a prompt-engineered pipeline. The answer probably differs by domain.

Annotation cost. Pixel-precise masks remain expensive to collect, especially for long-tailed or specialized vocabularies. Box-supervised, scribble-supervised, and self-supervised approaches help but still trail full supervision.

References

[1] Hariharan, B., Arbelaez, P., Girshick, R., Malik, J. (2014). "Simultaneous Detection and Segmentation". ECCV 2014. arXiv:1407.1808.
[2] Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., Dollar, P. (2014). "Microsoft COCO: Common Objects in Context". ECCV 2014. arXiv:1405.0312.
[3] Pinheiro, P. O., Collobert, R., Dollar, P. (2015). "Learning to Segment Object Candidates" (DeepMask). NeurIPS 2015. arXiv:1506.06204.
[4] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016). "The Cityscapes Dataset for Semantic Urban Scene Understanding". CVPR 2016. arXiv:1604.01685.
[5] He, K., Gkioxari, G., Dollar, P., Girshick, R. (2017). "Mask R-CNN". ICCV 2017 best paper (Marr Prize). arXiv:1703.06870.
[6] Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017). "Feature Pyramid Networks for Object Detection". CVPR 2017.
[7] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A. (2017). "Scene Parsing through ADE20K Dataset". CVPR 2017.
[8] Kirillov, A., He, K., Girshick, R., Rother, C., Dollar, P. (2019). "Panoptic Segmentation". CVPR 2019. arXiv:1801.00868.
[9] Yang, L., Fan, Y., Xu, N. (2019). "Video Instance Segmentation". ICCV 2019. arXiv:1905.04804.
[10] Bolya, D., Zhou, C., Xiao, F., Lee, Y. J. (2019). "YOLACT: Real-time Instance Segmentation". ICCV 2019. arXiv:1904.02689.
[11] Gupta, A., Dollar, P., Girshick, R. (2019). "LVIS: A Dataset for Large Vocabulary Instance Segmentation". CVPR 2019. arXiv:1908.03195.
[12] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). "End-to-End Object Detection with Transformers" (DETR). ECCV 2020. arXiv:2005.12872.
[13] Wang, X., Zhang, R., Kong, T., Li, L., Shen, C. (2020). "SOLOv2: Dynamic and Fast Instance Segmentation". NeurIPS 2020. arXiv:2003.10152.
[14] Kirillov, A., Wu, Y., He, K., Girshick, R. (2020). "PointRend: Image Segmentation as Rendering". CVPR 2020.
[15] Cheng, B., Schwing, A. G., Kirillov, A. (2021). "Per-Pixel Classification is Not All You Need for Semantic Segmentation" (MaskFormer). NeurIPS 2021. arXiv:2107.06278.
[16] Cheng, B., Girshick, R., Dollar, P., Berg, A. C., Kirillov, A. (2021). "Boundary IoU: Improving Object-Centric Image Segmentation Evaluation". CVPR 2021.
[17] Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., Girdhar, R. (2022). "Masked-attention Mask Transformer for Universal Image Segmentation" (Mask2Former). CVPR 2022. arXiv:2112.01527.
[18] Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S. (2022). "A ConvNet for the 2020s" (ConvNeXt). CVPR 2022.
[19] Li, F., Zhang, H., Liu, S., Zhang, L., Ni, L. M., Shum, H.-Y. (2023). "Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation". CVPR 2023.
[20] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollar, P., Girshick, R. (2023). "Segment Anything" (SAM). arXiv:2304.02643.
[21] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Radle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollar, P., Feichtenhofer, C. (2024). "SAM 2: Segment Anything in Images and Videos". arXiv:2408.00714.
[22] COCO Detection and Segmentation Evaluation. Microsoft COCO consortium. https://cocodataset.org/#detection-eval.
[23] Cityscapes benchmarks. https://www.cityscapes-dataset.com/benchmarks/.
