Instance segmentation
Last reviewed
Apr 30, 2026
Sources
23 citations
Review status
Source-backed
Revision
v1 · 4,301 words
Instance segmentation is the computer vision task of detecting every object instance in an image and producing a pixel-precise mask for each one. Unlike semantic segmentation, which assigns a class label to each pixel but lumps all pixels of the same category together, instance segmentation also separates individual instances of the same class. Two cats sitting on a couch become two distinct masks, not one merged "cat" region. The task therefore combines the goals of object detection, where each object is identified with a bounding box, with the goals of image segmentation, where each pixel is labeled.
The modern formulation of instance segmentation crystallized around 2014 with Hariharan et al.'s "Simultaneous Detection and Segmentation" (SDS) at ECCV, and the field accelerated dramatically after the 2017 ICCV best paper, Mask R-CNN by He, Gkioxari, Dollar, and Girshick. Today the task is one of the most heavily benchmarked problems in computer vision, with standard evaluations on the Microsoft COCO, Cityscapes, LVIS, and ADE20K datasets, and a method lineage that runs through two-stage detectors, single-shot networks, transformer-based set prediction, and promptable foundation models such as the Segment Anything Model (SAM).
Given an input image, an instance segmentation system must output, for each detected object, a class label drawn from a fixed vocabulary, a confidence score, and a binary mask the same size as the image (or aligned to the image grid) marking the pixels that belong to that specific instance. Two instances of the same category receive separate masks with separate identities. A common assumption is that masks may overlap (for example, a person partially in front of a chair) but the evaluation usually treats each instance independently rather than enforcing a partition of the pixels.
Formally, given an image I, the system produces a set of triples {(c_i, s_i, M_i)} where c_i is a class index, s_i is a scalar confidence in [0, 1], and M_i is a binary mask. The number of instances is not fixed and may be zero. This variable-cardinality output is what makes the task hard: detection and segmentation must be solved jointly, and the model has to decide both how many objects are present and where each one ends.
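As an illustration of this output format, the triples can be held in a small container type. The sketch below is only one possible representation; the names `Instance` and `make_example` and the class index 17 are hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Instance:
    class_id: int     # c_i: index into a fixed class vocabulary
    score: float      # s_i: confidence in [0, 1]
    mask: np.ndarray  # M_i: boolean array of shape (H, W)


def make_example(height: int = 4, width: int = 6) -> List[Instance]:
    """Build two hypothetical instances of the same class whose masks overlap."""
    left = np.zeros((height, width), dtype=bool)
    left[:, :4] = True   # columns 0..3
    right = np.zeros((height, width), dtype=bool)
    right[:, 3:] = True  # columns 3..5, so column 3 is shared
    return [
        Instance(class_id=17, score=0.92, mask=left),
        Instance(class_id=17, score=0.81, mask=right),
    ]


instances = make_example()
# Masks are independent, so the same pixel may belong to more than one instance.
overlap = int(np.logical_and(instances[0].mask, instances[1].mask).sum())
```

The list may be empty, and nothing forces the masks to partition the image, which matches the description above.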
Instance segmentation sits in a small family of pixel-level recognition tasks. Each one trades off a different combination of category labels and instance separation.
| Task | Output | Instance ID? | Stuff vs things | Example |
|---|---|---|---|---|
| Image classification | One class label for the whole image | No | Things only | "This is a photo of a dog." |
| Object detection | Bounding box and class for each object | Yes (per box) | Things only | Boxes around three cars. |
| Semantic segmentation | One class label per pixel | No | Both stuff and things | All sky pixels labeled "sky", all car pixels labeled "car" but merged. |
| Instance segmentation | Class label per pixel, plus instance separation | Yes, for things | Things only | Each car gets its own mask; sky and road are usually ignored. |
| Panoptic segmentation | Class label per pixel, plus instance ID for things | Yes for things, no for stuff | Both | A complete partition of the image: sky as one stuff region, three separate cars as three thing instances. |
The panoptic formulation was proposed by Kirillov, He, Girshick, Rother, and Dollar in their 2019 CVPR paper "Panoptic Segmentation". Their goal was to unify semantic and instance segmentation under a single output format and a single metric, the panoptic quality (PQ). The vocabulary they introduced is widely used: things are countable objects with well-defined shapes (people, cars, dogs); stuff covers amorphous regions of similar texture or material (sky, road, grass). Instance segmentation in the strict sense only operates on the thing classes, while panoptic segmentation requires correct labels for both.
Progress on instance segmentation has been driven by a handful of large, carefully annotated datasets. Different datasets emphasize different difficulties: COCO focuses on common everyday objects in cluttered scenes, Cityscapes on urban driving, LVIS on long-tailed vocabularies, and ADE20K on broad scene parsing.
| Dataset | Year | Categories | Images | Notes |
|---|---|---|---|---|
| Microsoft COCO | 2014 | 80 thing classes with instance masks (the original paper defined 91 object categories) | About 328k images, roughly 118k train and 5k validation in the 2017 split | Lin et al., "Microsoft COCO: Common Objects in Context". 2.5M labeled instances. The de facto benchmark for general instance segmentation. |
| Cityscapes | 2016 | 19 semantic classes; 8 thing classes for instance evaluation (person, rider, car, truck, bus, train, motorcycle, bicycle) | 5,000 images with fine annotations and 20,000 with coarse annotations from 50 cities | Cordts et al. Urban driving scenes; standard benchmark for autonomous driving research. |
| LVIS | 2019 | 1,203 entry-level categories | About 164k images, around 2M instance masks | Gupta, Dollar, Girshick. Long-tailed (Zipfian) distribution; designed to expose how detectors fail on rare classes. |
| ADE20K | 2017 | 150 evaluation classes covering both stuff and things | About 25k training images | Zhou et al. Used for scene parsing and panoptic segmentation; broader than COCO but smaller per-class. |
| YouTube-VIS | 2019 | 40 categories | 2,883 videos in the 2019 split, expanded to 3,859 in 2021 | Yang et al. The first large benchmark for video instance segmentation; tracks each instance across frames. |
| Mapillary Vistas | 2017 | 66 classes for instance/panoptic | 25,000 high-resolution street-level images | Used for street-scene understanding at global scale. |
| Open Images | 2017 onward | 350 classes with masks | About 2.7M instance segmentations on a 944k subset (Open Images V5) | Google's large-scale dataset; instance masks were added in V5 (2019). |
The COCO benchmark deserves special mention because the COCO API and its evaluation protocol have become the lingua franca of the field. The dataset, released by Lin and collaborators at Microsoft Research and Cornell, contains complex everyday scenes with multiple objects per image. It is small enough to train on a single multi-GPU server but large enough to drive meaningful generalization. See COCO dataset for a longer treatment.
The primary metric for COCO-style instance segmentation is mean Average Precision (AP) computed on mask intersection-over-union (IoU). Within each image and class, predicted masks are matched to ground-truth masks in order of confidence at a series of IoU thresholds. The matches are then pooled across the dataset to form one precision-recall curve per class and threshold, and the areas under those curves are averaged.
The COCO protocol reports several variants:
| Metric | Definition |
|---|---|
| AP | Mean of AP at IoU thresholds 0.50, 0.55, 0.60, ..., 0.95 (10 thresholds, step 0.05). Averaged over all 80 categories. The headline number. |
| AP50 | AP at a single IoU threshold of 0.50. Easier; commonly reported by older PASCAL VOC papers. |
| AP75 | AP at IoU 0.75. A stricter localization requirement. |
| APs, APm, APl | AP restricted to small (area < 32^2 pixels), medium (32^2 to 96^2), and large (> 96^2) objects respectively. |
| AR1, AR10, AR100 | Average recall when allowed at most 1, 10, or 100 detections per image. |
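The headline AP can be illustrated for a single class with a simplified sketch. It assumes the mask IoUs between predictions and ground truths are already available as a matrix, and it leaves out several parts of the real COCO protocol (pooling over many images, crowd regions, area ranges, and the cap on detections per image). The function names are hypothetical.

```python
import numpy as np


def mask_iou(a, b):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def average_precision(scores, ious, iou_thr):
    """AP for one class at one IoU threshold.

    scores: (N,) prediction confidences.
    ious:   (N, G) mask IoU between every prediction and every ground truth.
    """
    scores = np.asarray(scores, dtype=float)
    ious = np.asarray(ious, dtype=float)
    if scores.size == 0 or ious.size == 0:
        return 0.0
    num_pred, num_gt = ious.shape
    order = np.argsort(-scores, kind="stable")  # most confident first
    matched = np.zeros(num_gt, dtype=bool)
    tp = np.zeros(num_pred)
    for rank, p in enumerate(order):
        # Best ground truth that has not been claimed by a stronger prediction.
        candidates = np.where(matched, -1.0, ious[p])
        g = int(np.argmax(candidates))
        if candidates[g] >= iou_thr:
            matched[g] = True
            tp[rank] = 1.0
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, num_pred + 1)
    recall = cum_tp / num_gt
    # Make precision monotonically non-increasing in recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sample precision at 101 evenly spaced recall levels.
    grid = np.linspace(0.0, 1.0, 101)
    idx = np.searchsorted(recall, grid, side="left")
    sampled = np.where(idx < num_pred, precision[np.minimum(idx, num_pred - 1)], 0.0)
    return float(sampled.mean())


def coco_style_ap(scores, ious):
    """Mean of AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([average_precision(scores, ious, t) for t in thresholds]))
```

A prediction that overlaps its ground truth with IoU 0.62 counts as a true positive at the three loosest thresholds and as a false positive at the other seven, which is why the averaged AP rewards tight masks more than AP50 does.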
The Cityscapes instance benchmark uses a similar AP averaged over IoU 0.5 to 0.95 in steps of 0.05, restricted to the 8 thing classes.
For panoptic outputs the field uses panoptic quality (PQ), defined by Kirillov et al. (2019) as PQ = (sum of IoU over true positives) / (TP + 0.5 * FP + 0.5 * FN). PQ factors cleanly into segmentation quality (SQ, the average IoU of matched segments) times recognition quality (RQ, an F1 score over segments).
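The factorization can be checked numerically with a minimal sketch. It assumes segment matching has already been done (in the panoptic protocol a predicted and a ground-truth segment match when their IoU exceeds 0.5), so the function only takes the IoUs of the matched pairs and the counts of unmatched segments. The function name is hypothetical.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every true-positive (matched) segment pair.
    num_fp:       predicted segments with no match.
    num_fn:       ground-truth segments with no match.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # average IoU of matched segments
    rq = tp / denom                             # F1 score over segments
    return sq * rq, sq, rq


# Two matches (IoU 0.8 and 0.6), one spurious prediction, one missed object.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

Here SQ is 0.7 and RQ is 2/3, so PQ is their product, about 0.467, which equals (0.8 + 0.6) / (2 + 0.5 + 0.5).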
More recent work has argued that mask AP is biased toward interior pixels and underweights boundary errors. Cheng, Girshick, Dollar, Berg, and Kirillov proposed Boundary IoU (CVPR 2021) as a complementary metric that focuses on a thin band around mask boundaries.
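The idea of scoring only a thin band inside each mask can be sketched as follows. This is a simplified illustration, not the reference implementation: it assumes a fixed band width `d` in pixels and a 4-connected erosion, and the function names are hypothetical.

```python
import numpy as np


def erode(mask, steps):
    """Erode a boolean mask `steps` times with a 4-connected neighbourhood.

    Pixels outside the image are treated as background.
    """
    out = np.asarray(mask, dtype=bool)
    for _ in range(steps):
        p = np.pad(out, 1, constant_values=False)
        out = (
            p[1:-1, 1:-1]
            & p[:-2, 1:-1] & p[2:, 1:-1]   # up and down neighbours
            & p[1:-1, :-2] & p[1:-1, 2:]   # left and right neighbours
        )
    return out


def boundary_band(mask, d):
    """Pixels of the mask that lie within d erosion steps of its contour."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~erode(mask, d)


def boundary_iou(gt, pred, d=1):
    """IoU computed only on the boundary bands of the two masks."""
    g, p = boundary_band(gt, d), boundary_band(pred, d)
    union = np.logical_or(g, p).sum()
    return float(np.logical_and(g, p).sum()) / float(union) if union else 0.0
```

For a 6 by 6 square shifted by one pixel, the ordinary mask IoU is 30/42 (about 0.71) while this boundary score is 1/3, showing how the band-restricted metric penalizes the same contour error more heavily on a large object.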
The history of instance segmentation can be told as four overlapping waves: proposal-based pioneers, two-stage detectors with mask heads, single-shot dense predictors, and transformer-based set prediction. Promptable foundation models like SAM sit somewhat orthogonal to this taxonomy but interact with all of them.
Hariharan et al.'s SDS paper at ECCV 2014 set the template: generate region proposals, classify them, and refine the masks. They built on R-CNN with category-specific top-down figure-ground predictions. Pinheiro, Collobert, and Dollar's DeepMask (NeurIPS 2015) replaced the hand-engineered proposal step with a fully convolutional network that directly predicted class-agnostic segment proposals plus an objectness score. Pinheiro et al.'s follow-up SharpMask added a top-down refinement to recover sharper object boundaries. This line of work established that good masks could be produced without reliance on edges or superpixels.
In 2017 He, Gkioxari, Dollar, and Girshick introduced Mask R-CNN at ICCV, where it won the best paper award. The idea is structurally simple: extend Faster R-CNN with a third output branch that predicts a binary mask for each region of interest, in parallel with the existing classification and bounding-box regression branches.
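The key technical contributions were RoIAlign, which replaces the coarse quantization of RoIPool with bilinear interpolation so that extracted features stay aligned with the input pixels; a small fully convolutional mask head applied to each region of interest; and per-class binary masks trained with a sigmoid loss, which decouples mask prediction from classification.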
Mask R-CNN ran at about 5 frames per second and reported top results in all three tracks of the COCO suite of challenges (instance segmentation, bounding-box detection, and person keypoint detection), outperforming the single-model COCO 2016 challenge winners on each. Its conceptual simplicity made it the default baseline for years afterward.
Follow-on work pushed accuracy further. Cascade Mask R-CNN (Cai and Vasconcelos, 2018-2019) trained a sequence of detectors with increasing IoU thresholds. Hybrid Task Cascade (HTC) by Chen et al. (2019) interleaved detection and segmentation across cascade stages and added a semantic segmentation branch to provide context. PointRend (Kirillov et al. 2020) treated mask prediction as a rendering problem, refining masks at uncertain points with an MLP and producing crisp boundaries at high resolution.
Two-stage methods are accurate but slow. From around 2019 the field produced a wave of one-stage instance segmentation networks designed for real-time deployment.
YOLACT (Bolya, Zhou, Xiao, Lee, 2019) was the first method to crack 30 fps on COCO with reasonable accuracy: 29.8 mask AP at 33.5 fps on a single Titan Xp. It split the problem into two parallel branches: a fully convolutional network that produced a small set of prototype masks for the whole image, and a per-detection branch that predicted linear coefficients to combine those prototypes. The instance mask was simply the linear combination passed through a sigmoid, cropped to the predicted box, and thresholded at 0.5. YOLACT also introduced Fast NMS, a parallelizable approximation of standard NMS; YOLACT++ added deformable convolutions, better anchor choices, and a fast mask re-scoring branch.
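The prototype-and-coefficient assembly can be sketched in a few lines. This is a minimal illustration under assumed array layouts (prototypes as height by width by k, integer boxes with exclusive end coordinates); the function name is hypothetical and the sketch omits the network that produces the inputs.

```python
import numpy as np


def assemble_masks(prototypes, coefficients, boxes, threshold=0.5):
    """Combine shared prototypes into per-instance masks.

    prototypes:   (H, W, k) prototype maps shared by the whole image.
    coefficients: (n, k) one coefficient vector per detection.
    boxes:        n boxes as (x1, y1, x2, y2) integer pixels, end-exclusive.
    Returns a boolean array of shape (n, H, W).
    """
    logits = prototypes @ coefficients.T      # (H, W, n) linear combination
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid
    h, w, n = probs.shape
    masks = np.zeros((n, h, w), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # Keep only the part of the mask inside the detection's box.
        masks[i, y1:y2, x1:x2] = probs[y1:y2, x1:x2, i] > threshold
    return masks
```

Because the prototypes are computed once per image and each detection adds only a k-dimensional vector, the cost per instance is a single small matrix product, which is what made the design fast.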
SOLO ("Segmenting Objects by Locations", ECCV 2020) and SOLOv2 (NeurIPS 2020) by Wang et al. reframed the problem yet again. Instead of detecting then segmenting, SOLO divided the image into a grid and made the cell containing an object's center responsible for predicting that object's category and mask. SOLOv2 introduced a dynamic mask head that decoupled mask kernel learning from mask feature learning and a Matrix NMS for fast post-processing. A lightweight SOLOv2 reached 31.3 fps and 37.1 AP on COCO.
Other one-stage approaches include CondInst (Tian et al., ECCV 2020), which used dynamic convolutional filters conditioned on each instance, and BlendMask (Chen et al., CVPR 2020), which combined a top-down attention map with bottom-up base masks. YOLOv5-seg, YOLOv8-seg, and Ultralytics' more recent variants extended the YOLO real-time detector family to masks. RTMDet (2022) reported 52.8 box AP on COCO at over 300 fps on an RTX 3090, an extreme on the speed-accuracy trade-off for detection, and its RTMDet-Ins variants carried the same design over to instance masks.
The transformer wave began with DETR (Carion et al., ECCV 2020) at Facebook AI Research. DETR reframed object detection as direct set prediction: a transformer encoder-decoder consumed a CNN feature map together with a fixed set of learned object queries and produced a fixed-cardinality set of (class, box) predictions. A bipartite Hungarian matching loss aligned predictions with ground-truth objects without anchors or NMS. DETR could be extended to panoptic segmentation by adding a small mask head on top of each query.
MaskFormer by Cheng, Schwing, and Kirillov (NeurIPS 2021) generalized this idea: predict a set of binary masks, each tagged with a single global class label. They showed that mask classification works for both semantic and panoptic segmentation, reaching 55.6 mIoU on ADE20K and 52.7 PQ on COCO, and that it scales better than per-pixel classification when the number of classes is large.
Mask2Former (Cheng, Misra, Schwing, Kirillov, Girdhar, CVPR 2022) was the breakthrough. It introduced masked attention, a cross-attention variant that restricts attention to the region predicted by the previous mask, and a multi-scale high-resolution feature decoder. With a Swin-L backbone, Mask2Former reached 50.1 AP for instance segmentation on COCO, 57.8 PQ for panoptic segmentation on COCO, and 57.7 mIoU for semantic segmentation on ADE20K, leading three benchmarks with one architecture. This consolidation reduced the engineering surface for segmentation research substantially.
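Masked attention can be illustrated with a single-head sketch. It assumes the previous layer's mask predictions have already been binarized and flattened, and it omits the learned projections, multiple heads, and residual connections of a real decoder layer. The function name is hypothetical.

```python
import numpy as np


def masked_cross_attention(queries, keys, values, prev_masks):
    """Cross-attention restricted to each query's previously predicted mask.

    queries:    (N, d) one vector per object query.
    keys:       (P, d) one vector per image location (P = H * W).
    values:     (P, d)
    prev_masks: (N, P) boolean, True where the previous layer's mask is foreground.
    Returns (output of shape (N, d), attention weights of shape (N, P)).
    """
    d = queries.shape[1]
    logits = queries @ keys.T / np.sqrt(d)
    allowed = np.asarray(prev_masks, dtype=bool).copy()
    # A query whose previous mask is empty falls back to attending everywhere,
    # otherwise its softmax would be taken over no locations at all.
    allowed[~allowed.any(axis=1)] = True
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)  # exp(-inf) is exactly 0 for masked-out locations
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

Locations outside the previous mask receive exactly zero weight, so each query refines its features from the region it already believes to be its object.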
Mask DINO (Li et al., CVPR 2023) unified DETR-style detection and Mask2Former-style segmentation, reaching about 54.7 AP on COCO instance segmentation. Other transformer-based contenders include OneFormer, kMaX-DeepLab, and the segmentation variants of BEiT-3 (around 54.8 mask AP on COCO with very large vision transformers).
In April 2023 Meta AI released the Segment Anything Model (SAM) by Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollar, and Girshick. SAM was trained on a new dataset called SA-1B, with over 1 billion masks on 11 million licensed images, the largest segmentation corpus ever assembled. Architecturally, SAM has three parts: a heavy Vision Transformer image encoder, a lightweight prompt encoder that consumes points, boxes, or coarse masks, and a small mask decoder that turns the encoded image and prompt into a set of candidate masks. See Segment Anything Model (SAM) for more detail.
SAM was designed for a new task formulation called promptable segmentation: given any prompt about an object (a click, a box, a few coarse strokes), return a valid mask. It generalizes zero-shot to new domains, often matching fully supervised baselines without any fine-tuning. SAM is class-agnostic by design; pairing it with an open-vocabulary detector like Grounding DINO yields Grounded SAM, which acts as a class-aware instance segmentation pipeline driven by text prompts.
Meta released SAM 2 in July 2024 (Ravi et al.), extending the same prompt-based interface to video. SAM 2 uses a streaming memory mechanism so the model can track an object across frames after a single prompt and is reportedly six times faster than SAM 1 on images while needing roughly three times fewer interactions for comparable video accuracy. Other related foundation models include SEEM ("Segment Everything Everywhere All at Once") and SAM-HQ, a higher-quality variant.
Under the hood, almost every modern instance segmentation system is built on a convolutional neural network or vision transformer feature extractor. The choice of backbone trades off accuracy for compute and memory.
| Family | Examples | Notes |
|---|---|---|
| Plain ResNets | ResNet-50, ResNet-101 | Default Mask R-CNN backbone in 2017. Still common as a baseline. |
| ResNeXt | ResNeXt-101 32x8d | Grouped convolutions; modest accuracy gain over ResNet at similar FLOPs. |
| HRNet | HRNetV2 | Maintains high-resolution features throughout, useful for fine boundaries. |
| ConvNeXt | ConvNeXt-T/S/B/L | A modernized pure ConvNet family from Liu et al. (CVPR 2022); competitive with transformers. |
| Swin Transformer | Swin-T, Swin-S, Swin-B, Swin-L | Liu et al. (ICCV 2021); shifted-window attention; the standard backbone for Mask2Former. |
| Vision Transformer | ViT-B/L/H | Dosovitskiy et al. (2021); used in SAM, often pretrained on enormous web data. |
| Self-supervised pretraining | DINOv2, EVA, MAE | These produce strong general-purpose features that transfer well to segmentation. |
| FPN | Feature Pyramid Network | Lin et al. (CVPR 2017); not a backbone itself but a multi-scale feature aggregator used on top of nearly every backbone above. |
Detection-based methods like Mask R-CNN sum three losses: a softmax classification loss over object categories, a smooth-L1 box-regression loss, and a per-pixel binary cross-entropy mask loss. The mask loss is computed only for the ground-truth class channel, which decouples class prediction from mask shape and avoids competition between classes.
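The class-decoupled mask loss can be sketched as below. The array shapes and the function name are assumptions made for illustration; the sketch covers only the mask term, not the classification or box terms.

```python
import numpy as np


def mask_rcnn_mask_loss(mask_logits, gt_classes, gt_masks):
    """Average per-pixel binary cross-entropy on the ground-truth class channel.

    mask_logits: (R, K, m, m) one m x m logit map per RoI and class.
    gt_classes:  (R,) integer ground-truth class of each RoI.
    gt_masks:    (R, m, m) binary target masks.
    """
    rois = np.arange(len(gt_classes))
    z = mask_logits[rois, gt_classes]  # (R, m, m): other class channels are ignored
    y = np.asarray(gt_masks, dtype=float)
    # Numerically stable BCE with logits: max(z, 0) - z * y + log(1 + exp(-|z|)).
    bce = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(bce.mean())
```

Changing the logits of any channel other than the ground-truth class leaves the loss unchanged, which is the decoupling described above: the classification branch alone decides which channel is read out at inference time.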
Set-prediction methods like DETR, MaskFormer, and Mask2Former use a different recipe. They first solve a Hungarian assignment between predicted and ground-truth instances, minimizing a matching cost that combines classification probability with box overlap (DETR) or mask similarity (MaskFormer and Mask2Former). They then back-propagate a loss over the matched pairs, typically a cross-entropy classification loss plus a binary cross-entropy mask loss combined with a Dice loss for better behavior on small masks; unmatched predictions are supervised only to predict a "no object" class. The Hungarian matching ensures that exactly one prediction is responsible for each ground-truth instance, so post-processing like non-maximum suppression is unnecessary.
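The matching step can be illustrated with a small sketch that builds a cost from class probability and a Dice term and then finds the minimum-cost one-to-one assignment. To stay dependency-light it searches all assignments by brute force, which is only feasible for a handful of queries; it is a stand-in for the Hungarian algorithm, which finds the same optimum in polynomial time (for example via SciPy's `scipy.optimize.linear_sum_assignment`). The function names, the cost weights, and the flattened mask layout are assumptions.

```python
import itertools

import numpy as np


def dice_cost(pred_probs, gt_masks, eps=1.0):
    """(N, G) matrix of 1 - Dice between soft predictions and binary targets.

    pred_probs: (N, P) mask probabilities, flattened over pixels.
    gt_masks:   (G, P) binary ground-truth masks, flattened over pixels.
    """
    p = np.asarray(pred_probs, dtype=float)[:, None, :]
    g = np.asarray(gt_masks, dtype=float)[None, :, :]
    inter = (p * g).sum(-1)
    denom = p.sum(-1) + g.sum(-1)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def match(class_probs, pred_probs, gt_classes, gt_masks, w_cls=1.0, w_dice=1.0):
    """Return (prediction index, ground-truth index) pairs of minimum total cost.

    class_probs: (N, K) class probabilities per query; requires N >= G.
    """
    num_gt = len(gt_classes)
    if num_gt == 0:
        return []
    class_probs = np.asarray(class_probs, dtype=float)
    cost = -w_cls * class_probs[:, list(gt_classes)] + w_dice * dice_cost(pred_probs, gt_masks)
    gt_idx = list(range(num_gt))
    best, best_cost = None, np.inf
    # Brute-force search over injective assignments (illustration only).
    for perm in itertools.permutations(range(cost.shape[0]), num_gt):
        total = cost[list(perm), gt_idx].sum()
        if total < best_cost:
            best, best_cost = perm, total
    return [(int(p), j) for j, p in enumerate(best)]
```

Every ground-truth instance ends up with exactly one query, and queries left unmatched are the ones trained toward "no object".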
SAM's training loss is a focal loss plus a Dice loss on each predicted mask, with an additional IoU prediction head trained with mean squared error so the model can rank its own outputs at inference time.
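A loss of this shape can be sketched as follows. The focal parameters `gamma` and `alpha`, the smoothing constant, and the 20 to 1 weighting between the focal and Dice terms are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Per-pixel focal loss, averaged; down-weights easy pixels."""
    p = _sigmoid(logits)
    pt = np.where(target == 1, p, 1.0 - p)       # probability of the true label
    a = np.where(target == 1, alpha, 1.0 - alpha)
    return float((-a * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-12, 1.0))).mean())


def dice_loss(logits, target, eps=1.0):
    """1 - Dice overlap between the soft prediction and the target."""
    p = _sigmoid(logits)
    return float(1.0 - (2.0 * (p * target).sum() + eps) / (p.sum() + target.sum() + eps))


def sam_style_loss(logits, target, predicted_iou, w_focal=20.0, w_dice=1.0):
    """Focal + Dice on the mask, plus squared error on the predicted IoU."""
    target = np.asarray(target, dtype=float)
    pred_mask = logits > 0
    inter = np.logical_and(pred_mask, target == 1).sum()
    union = np.logical_or(pred_mask, target == 1).sum()
    true_iou = inter / union if union else 0.0
    iou_loss = (predicted_iou - true_iou) ** 2
    return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target) + iou_loss
```

The IoU term trains the model to estimate its own mask quality, which is what lets it rank several candidate masks for an ambiguous prompt.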
Instance segmentation systems live or die on a few engineering choices that often matter more than headline accuracy.
Anchor-based vs anchor-free. Mask R-CNN inherits anchor boxes from Faster R-CNN. Newer methods like SOLO, CondInst, and Mask2Former are anchor-free, which simplifies the pipeline and removes a sensitive hyperparameter (anchor scales and aspect ratios). Set-prediction transformers go further and dispense with NMS as well.
Mask resolution. Mask R-CNN predicts each instance mask at 28 by 28 and then upsamples to the box. This is a pragmatic choice that keeps the mask head cheap, but it limits boundary detail. PointRend, transformer methods, and SAM use higher resolutions or iterative refinement to recover sharp edges.
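The upsample-and-paste step can be sketched as below. For brevity it uses nearest-neighbour resizing where real implementations typically interpolate bilinearly, and it assumes an integer box that lies fully inside the image; the function name is hypothetical.

```python
import numpy as np


def paste_mask(mask_prob, box, image_hw, threshold=0.5):
    """Resize a low-resolution RoI mask to its box and paste it into the image.

    mask_prob: (m, m) mask probabilities from the mask head, for example 28 x 28.
    box:       (x1, y1, x2, y2) integer pixels, end-exclusive, inside the image.
    image_hw:  (H, W) size of the output mask.
    """
    x1, y1, x2, y2 = box
    height, width = image_hw
    box_h, box_w = y2 - y1, x2 - x1
    mh, mw = mask_prob.shape
    # Map the center of every output pixel back to a cell of the small mask.
    rows = np.minimum(((np.arange(box_h) + 0.5) * mh / box_h).astype(int), mh - 1)
    cols = np.minimum(((np.arange(box_w) + 0.5) * mw / box_w).astype(int), mw - 1)
    resized = mask_prob[rows[:, None], cols[None, :]]
    full = np.zeros((height, width), dtype=bool)
    full[y1:y2, x1:x2] = resized > threshold
    return full
```

For a 560-pixel-wide object each of the 28 columns of the predicted mask is stretched over 20 image pixels, which is the source of the blocky boundaries that higher-resolution or refinement-based methods aim to remove.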
Class-agnostic vs class-specific masks. Mask R-CNN predicts K mask channels per RoI but supervises only one. SAM and the early DeepMask line predict a single class-agnostic mask. Class-agnostic masks transfer better across vocabularies and are essential for prompt-driven workflows; class-specific masks may be slightly more accurate when the vocabulary is fixed and small.
Long-tail handling. On LVIS the head classes have thousands of training instances and the tail classes have fewer than ten. Standard losses overfit the head and underfit the tail. Common remedies include repeat-factor sampling (Gupta et al. 2019), federated loss (LVIS challenge baselines), equalization losses (Tan et al. 2020, 2021), and decoupled training of representation and classifier.
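Repeat-factor sampling can be sketched in a few lines. Each category gets a factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing the category and t is a frequency threshold, and each image takes the largest factor among its categories. The input layout (a list of category lists, one per image), the function name, and the threshold used in the example are assumptions for illustration.

```python
import math


def repeat_factors(image_categories, t=0.001):
    """Per-image repeat factors for oversampling images with rare categories.

    image_categories: one list of category names (or ids) per training image.
    t:                frequency threshold below which a category is oversampled.
    """
    num_images = len(image_categories)
    image_counts = {}
    for cats in image_categories:
        for c in set(cats):  # count images, not instances
            image_counts[c] = image_counts.get(c, 0) + 1
    category_rf = {
        c: max(1.0, math.sqrt(t / (count / num_images)))
        for c, count in image_counts.items()
    }
    # An image is repeated according to its rarest category.
    return [
        max((category_rf[c] for c in set(cats)), default=1.0)
        for cats in image_categories
    ]


# 99 images with only a frequent class, 1 image that also has a rare class.
dataset = [["person"]] * 99 + [["person", "unicycle"]]
factors = repeat_factors(dataset, t=0.04)
```

With these toy numbers the rare-class image gets a factor of about 2 and every other image keeps a factor of 1, so an epoch sees the rare image roughly twice as often. A training loop would typically round the factor stochastically to decide how many copies of each image to include.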
Weakly and semi-supervised approaches. Mask annotations are expensive (Lin et al. estimated tens of seconds per polygon), so a parallel literature trains instance segmenters from weaker signals: bounding boxes only (BoxInst, Tian et al. 2021), image-level labels, scribbles, or unlabeled data with self-training. SAM, with its near-zero-cost prompts, can also act as a labeling assistant inside this loop.
Real-time deployment. Production systems on cars, drones, or AR glasses need tens of milliseconds per frame, not hundreds. The YOLO-seg family, RTMDet-Ins, and SOLOv2-Lite live in this regime. For server-side workloads, accuracy-first methods like Mask2Former or Mask DINO are preferred.
Instance segmentation underpins a wide range of products and research areas.
Despite roughly a decade of intensive work, several problems remain genuinely hard.
Occlusion. When two instances of the same class overlap heavily, current methods often merge them or produce broken masks. Amodal segmentation, which asks the model to predict the full extent of an object including hidden parts, is an active subfield.
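Small objects. APs (small-object AP) is consistently far below APl on COCO, often by 30 points or more; the original Mask R-CNN with a ResNet-101-FPN backbone, for example, reported 15.5 APs against 52.4 APl. Higher input resolution helps but is expensive. Multi-scale architectures and crop-and-rescale strategies are common workarounds.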
Long-tailed and open-vocabulary recognition. On LVIS the gap between rare and frequent classes is large. Open-vocabulary instance segmentation, where the system is asked about categories it has never seen during training, is an active frontier driven by CLIP-based classifiers and Grounded SAM-style pipelines.
Real-time deployment under tight compute budgets. A self-driving car may have only a few milliseconds of latency budget per camera, and edge devices have far less compute than a workstation GPU.
Video temporal consistency. Per-frame masks predicted independently flicker and drift between frames. Video instance segmentation methods like MaskTrack R-CNN (2019), VisTR, IDOL, and Mask2Former-VIS try to enforce temporal coherence, and SAM 2's memory module is a recent foundation-model approach to the same problem.
Foundation models: fine-tune or prompt? SAM and SAM 2 are powerful but class-agnostic. The community is still working out the right interface between large promptable models and downstream tasks: fine-tune the encoder, train adapters, distill into a small specialist, or wrap the model in a prompt-engineered pipeline. The answer probably differs by domain.
Annotation cost. Pixel-precise masks remain expensive to collect, especially for long-tailed or specialized vocabularies. Box-supervised, scribble-supervised, and self-supervised approaches help but still trail full supervision.