Image Segmentation Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,491 words
Image segmentation models are computer vision systems that perform pixel-level classification, assigning a class label or instance identity to every pixel in an input image. Unlike image classification, which produces a single label for an entire image, and unlike object detection, which outputs rectangular bounding boxes, segmentation produces dense per-pixel masks. The output is typically a label map with the same spatial dimensions as the input, or a set of binary masks paired with class labels.
The field splits into three sub-tasks. Semantic segmentation assigns each pixel a class label (such as "road" or "sky") without distinguishing separate instances. Instance segmentation separates individual object instances, producing one mask per object even when several share the same class. Panoptic segmentation, formalized by Kirillov and colleagues in 2018, unifies the two by labeling every pixel with both a class and an instance identifier where applicable, treating "things" (countable objects like cars and people) and "stuff" (amorphous regions like grass and sky) within a single output.
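The three output formats can be compared on a toy scene. The following is a minimal NumPy sketch, not a reference implementation: the class ids, the instance ids, and the `class * 1000 + instance` panoptic encoding are assumptions chosen for illustration.

```python
import numpy as np

# Toy 4x6 scene. Class 0 = sky ("stuff"), class 1 = car ("thing").
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance ids: 0 means "no instance" (stuff); the two cars get ids 1 and 2.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation output: one binary mask per object.
instance_masks = [instance == i for i in np.unique(instance) if i != 0]

# Panoptic output: a single id per pixel that carries class and instance.
# The offset-based encoding is an assumption made for this sketch.
OFFSET = 1000
panoptic = semantic * OFFSET + instance

panoptic_ids = sorted(int(v) for v in np.unique(panoptic))
print(len(instance_masks), panoptic_ids)
```

The semantic map cannot tell the two cars apart, the instance masks say nothing about the sky, and the panoptic map covers both with one id per pixel.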
See also: Computer Vision Models and Tasks.
Before deep learning, segmentation relied on hand-engineered features and probabilistic graphical models. Graph cuts (Boykov and Jolly, 2001) formulated segmentation as energy minimization over a pixel graph. Markov random fields and conditional random fields modeled spatial dependencies between neighboring labels. Normalized cuts (Shi and Malik, 2000), mean-shift clustering, and watershed transforms were also widely used. These methods had no learned notion of object categories.
The modern era began with Fully Convolutional Networks (FCN, Long, Shelhamer, and Darrell, 2014, arXiv:1411.4038), the first end-to-end convolutional neural network trained for pixel-wise prediction. FCN recast the fully connected layers of classifiers like VGG as convolutions, so the network accepts inputs of any size, and added upsampling to produce a label map at the original resolution, with skip connections merging coarse semantic features with finer appearance features.
U-Net (Ronneberger, Fischer, and Brox, 2015, arXiv:1505.04597) became the dominant architecture for medical imaging. Its symmetric encoder-decoder with skip connections allows training on very few annotated images and inspired countless variants. SegNet (Badrinarayanan, Kendall, and Cipolla, 2015, arXiv:1511.00561) used a similar design but stored max-pooling indices for memory-efficient upsampling.
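The pooling-index idea attributed to SegNet above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a single-channel map, 2x2 non-overlapping pooling, and even input dimensions. The function names are hypothetical and not taken from any SegNet code base.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also returns the argmax position in each block."""
    h, w = x.shape
    # Rearrange into (h/2, w/2, 4): the last axis lists one 2x2 block row by row.
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // 2, w // 2, 4)
    return blocks.max(axis=-1), blocks.argmax(axis=-1)

def max_unpool(pooled, indices):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    h, w = pooled.shape
    blocks = np.zeros((h, w, 4), dtype=pooled.dtype)
    rows, cols = np.indices((h, w))
    blocks[rows, cols, indices] = pooled
    # Invert the rearrangement used in max_pool_with_indices.
    return blocks.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)

x = np.array([
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 9, 0],
    [7, 0, 0, 0],
])
pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices)
print(pooled)
print(restored)
```

Only one small integer per pooled value has to be kept for the decoder, rather than the full encoder feature map, which is where the memory saving comes from.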
The DeepLab family (Chen et al., 2014 to 2018) introduced atrous (dilated) convolutions to expand the receptive field without losing spatial resolution, plus Atrous Spatial Pyramid Pooling (ASPP) for multi-scale context. DeepLabv3+ (Chen et al., 2018, arXiv:1802.02611) reached 89.0% mIoU on Pascal VOC 2012 and 82.1% on Cityscapes.
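How dilation widens the receptive field can be seen in one dimension. The sketch below is a plain NumPy illustration rather than DeepLab code. It uses "valid" mode for brevity, so the output gets shorter; real networks pad the input so that the spatial size is preserved. The function name and the `rate` parameter name are assumptions.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate=1):
    """Valid-mode 1-D cross-correlation with dilation `rate`."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Number of input positions covered by one output sample.
    span = (len(kernel) - 1) * rate + 1
    n_out = len(signal) - span + 1
    out = np.empty(n_out)
    for i in range(n_out):
        taps = signal[i : i + span : rate]  # every `rate`-th sample
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(10)
k = [1, 1, 1]
dense = atrous_conv1d(x, k, rate=1)    # each output covers 3 positions
dilated = atrous_conv1d(x, k, rate=3)  # same 3 weights, covers 7 positions
print(dense[0], dilated[0])
```

Both calls use the same three weights. Only the spacing between taps changes, which is why atrous convolution enlarges context without adding parameters or pooling.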
PSPNet (Zhao et al., 2016, arXiv:1612.01105) added a pyramid pooling module that aggregates global context at multiple region sizes and won the 2016 ImageNet Scene Parsing Challenge. HRNet (Sun et al., 2019, arXiv:1908.07919) maintained high-resolution representations throughout the network instead of recovering them from low-resolution features.
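A rough sketch of pyramid pooling follows. It makes several simplifying assumptions: bin sizes of 1, 2, 3, and 6, feature-map dimensions divisible by every bin size, nearest-neighbor upsampling, and no per-level convolution to reduce channels. It illustrates the aggregation idea only and is not PSPNet's implementation.

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 3, 6)):
    """feat: (C, H, W) with H and W divisible by every bin size."""
    c, h, w = feat.shape
    levels = [feat]
    for b in bins:
        # Average over a b x b grid of regions: result has shape (C, b, b).
        pooled = feat.reshape(c, b, h // b, b, w // b).mean(axis=(2, 4))
        # Nearest-neighbor upsampling back to (C, H, W).
        up = pooled.repeat(h // b, axis=1).repeat(w // b, axis=2)
        levels.append(up)
    # Concatenate the original features with every context level.
    return np.concatenate(levels, axis=0)

feat = np.arange(72, dtype=float).reshape(2, 6, 6)
out = pyramid_pool(feat)
print(out.shape)
```

After concatenation, every pixel carries its local features plus summaries of progressively larger regions, up to the whole image at bin size 1.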
Mask R-CNN (He, Gkioxari, Dollár, and Girshick, 2017, arXiv:1703.06870) extended Faster R-CNN with a third head that predicts a binary mask per region of interest. It won several 2017 COCO challenge tracks and remained the standard baseline for years. Single-stage successors include YOLACT (Bolya et al., 2019), which combined prototype masks with per-instance coefficients for real-time speed; SOLO (Wang et al., 2019), which mapped each instance to a unique grid location; and BlendMask (Chen et al., 2020).
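The prototype-and-coefficient idea behind YOLACT reduces to a linear combination followed by a sigmoid. The sketch below uses hand-made prototypes and coefficients as stand-ins for network outputs, and it omits the box cropping and other post-processing of the real model. The function name and threshold are assumptions.

```python
import numpy as np

def assemble_masks(prototypes, coefficients, threshold=0.5):
    """prototypes: (H, W, K); coefficients: (N, K) -> boolean masks (N, H, W)."""
    # Each instance mask is a weighted sum of the K shared prototypes.
    logits = np.einsum("hwk,nk->nhw", prototypes, coefficients)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

# Two toy prototypes: one fires on the left half, one on the right half.
prototypes = np.zeros((2, 4, 2))
prototypes[:, :2, 0] = 1.0
prototypes[:, 2:, 1] = 1.0

# Instance 0 adds prototype 0 and subtracts prototype 1; instance 1 does the reverse.
coefficients = np.array([[4.0, -4.0], [-4.0, 4.0]])

masks = assemble_masks(prototypes, coefficients)
print(masks.astype(int))
```

Because the prototypes are computed once per image and each instance only needs K coefficients, adding more instances costs little, which is the source of the speed advantage described above.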
The panoptic task was introduced by Kirillov et al. (2018, arXiv:1801.00868) with the Panoptic Quality (PQ) metric, which combines segmentation and recognition quality in a single score. Early models attached semantic and instance branches to a shared backbone: Panoptic FPN (Kirillov et al., 2019), UPSNet (Xiong et al., 2019), and Panoptic-DeepLab (Cheng et al., 2020).
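PQ can be computed for a single class as follows. This is a minimal sketch, assuming predicted and ground-truth segments are given as lists of boolean masks that do not overlap within each list. A prediction matches a ground-truth segment when their IoU exceeds 0.5, which makes the match unique under that assumption. The helper names are hypothetical.

```python
import numpy as np

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def panoptic_quality(pred_masks, gt_masks):
    """Return (PQ, SQ, RQ) for one class."""
    matched_ious = []
    used = set()
    for gt in gt_masks:
        for j, pred in enumerate(pred_masks):
            if j in used:
                continue
            score = iou(pred, gt)
            if score > 0.5:  # true positive
                matched_ious.append(score)
                used.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred_masks) - tp  # unmatched predictions
    fn = len(gt_masks) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU of matches
    rq = tp / denom                             # F1-style detection score
    return sq * rq, sq, rq

def span(start, stop, length=10):
    m = np.zeros(length, dtype=bool)
    m[start:stop] = True
    return m

gt = [span(0, 4), span(6, 10)]
pred = [span(0, 3), span(5, 6)]
pq, sq, rq = panoptic_quality(pred, gt)
print(pq, sq, rq)
```

In this toy case one segment is matched with IoU 0.75, one prediction is spurious, and one ground-truth segment is missed, so SQ is 0.75, RQ is 0.5, and PQ is their product, 0.375.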
The vision transformer reshaped segmentation. SETR (Zheng et al., 2020, arXiv:2012.15840) replaced the convolutional backbone with a pure ViT encoder, reaching 50.28% mIoU on ADE20K. Segmenter (Strudel et al., 2021, arXiv:2105.05633) added a mask transformer decoder inspired by DETR.
MaskFormer (Cheng, Schwing, and Kirillov, 2021, arXiv:2107.06278) made a conceptual shift: instead of classifying each pixel, predict a set of binary masks each tagged with a class, similar to how DETR predicts boxes. Its successor Mask2Former (Cheng et al., 2021, arXiv:2112.01527) added masked attention that restricts cross-attention to predicted mask regions and demonstrated that a single architecture could reach state of the art on all three segmentation tasks: 57.7 mIoU on ADE20K, 50.1 AP on COCO instance, and 57.8 PQ on COCO panoptic.
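One way to turn a set of class-tagged masks back into a per-pixel label map is to weight each mask by its class probabilities and take the per-pixel argmax. The sketch below assumes the class logits carry a trailing "no object" column, as in DETR-style models. The logits are hand-made rather than network outputs, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masks_to_semantic(class_logits, mask_logits):
    """class_logits: (Q, C + 1), last column = "no object".
    mask_logits: (Q, H, W). Returns an (H, W) label map."""
    class_probs = softmax(class_logits)[:, :-1]      # drop "no object"
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # per-query soft masks
    # Sum over queries: score of class c at pixel (h, w).
    per_class = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return per_class.argmax(axis=0)

# Two queries, two classes. Query 0 is class 0 on the left column,
# query 1 is class 1 on the right column.
class_logits = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
mask_logits = np.array([
    [[6.0, -6.0], [6.0, -6.0]],
    [[-6.0, 6.0], [-6.0, 6.0]],
])
labels = masks_to_semantic(class_logits, mask_logits)
print(labels)
```

The same set of masks can instead be kept separate to produce instance or panoptic output, which is what lets one architecture serve all three tasks.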
SegFormer (Xie et al., 2021, arXiv:2105.15203) from NVIDIA paired a hierarchical transformer encoder with a lightweight MLP decoder. OneFormer (Jain et al., 2022, arXiv:2211.06220) added a task token that conditions a single set of weights to perform any of the three tasks, outperforming task-specific Mask2Former variants. Swin Transformer backbones became common in top-of-leaderboard systems.
The Segment Anything Model (SAM, Kirillov et al., Meta AI, April 2023, arXiv:2304.02643) introduced promptable segmentation: given a point click, bounding box, or rough mask, SAM outputs candidate masks for the indicated object. It was trained on SA-1B, a dataset with over 1.1 billion masks across 11 million images, roughly 400 times more masks than any previous segmentation dataset. SAM pairs a ViT-H image encoder with a lightweight prompt encoder and mask decoder, so the expensive image embedding is computed once per image and each subsequent prompt is processed in milliseconds.
SAM 2 (Ravi et al., Meta AI, August 2024, arXiv:2408.00714) extended SAM to video, treating images as single-frame videos and adding a streaming memory module that propagates object identity across frames. It is roughly six times faster than SAM on images. Follow-ups include Semantic-SAM (Li et al., 2023) for class predictions; Grounded-SAM (IDEA Research, 2023), which chains Grounding DINO detection with SAM for masks; SEEM (Zou et al., 2023) for multi-type prompts; and OpenSeeD for open-vocabulary detection and segmentation. Florence-2 (Microsoft, 2023) is a unified vision-language model that includes segmentation.
Lightweight variants target edge inference. MobileSAM (Zhang et al., June 2023) distilled the ViT-H encoder into a ViT-Tiny, FastSAM (Zhao et al., 2023) replaced the encoder with a YOLO-style CNN, and EfficientSAM (Xiong et al., December 2023) used masked image pretraining.
| Model | Released | Authors / lab | Task | Notes |
|---|---|---|---|---|
| FCN | Nov 2014 | Long, Shelhamer, Darrell (Berkeley) | Semantic | First end-to-end CNN segmenter |
| U-Net | May 2015 | Ronneberger et al. (Freiburg) | Semantic (biomedical) | Encoder-decoder with skip connections |
| SegNet | Nov 2015 | Badrinarayanan, Kendall, Cipolla (Cambridge) | Semantic | VGG16 encoder with pooling indices |
| PSPNet | Dec 2016 | Zhao et al. (CUHK, SenseTime) | Semantic | Pyramid pooling module |
| Mask R-CNN | Mar 2017 | He et al. (FAIR) | Instance | Adds mask head to Faster R-CNN |
| DeepLabv3+ | Feb 2018 | Chen et al. (Google) | Semantic | Atrous separable convolution, ASPP |
| HRNet | Aug 2019 | Sun et al. (MSRA) | Semantic | Keeps high-resolution streams throughout |
| DETR-Panoptic | May 2020 | Carion et al. (FAIR) | Panoptic | Set prediction with transformer |
| SegFormer | May 2021 | Xie et al. (NVIDIA / HKU) | Semantic | Hierarchical ViT + MLP decoder |
| MaskFormer | Jul 2021 | Cheng, Schwing, Kirillov (FAIR / UIUC) | Semantic, panoptic | Mask classification instead of per-pixel |
| Mask2Former | Dec 2021 | Cheng et al. (FAIR / UIUC) | Universal | Masked attention, top results on all three tasks |
| OneFormer | Nov 2022 | Jain et al. (SHI Labs / IIT Roorkee) | Universal | Single weights for all three tasks |
| Painter | Dec 2022 | Wang et al. (BAAI) | In-context | Vision tasks via image pairs |
| SAM | Apr 2023 | Kirillov et al. (Meta AI) | Promptable | ViT-H 636M params, SA-1B dataset |
| Grounded-SAM | Apr 2023 | IDEA Research | Text-prompted | Grounding DINO plus SAM |
| SEEM | Apr 2023 | Zou et al. (Microsoft / UW) | Promptable | Multi-prompt segmentation |
| MobileSAM | Jun 2023 | Zhang et al. (Kyung Hee University) | Edge | ViT-Tiny encoder distilled from SAM |
| EfficientSAM | Dec 2023 | Xiong et al. (Meta AI) | Efficient | Masked image pretraining for SAM |
| MedSAM | Jan 2024 | Ma et al. (U. Toronto) | Medical | Fine-tuned SAM on 1.57M medical pairs |
| SAM 2 | Aug 2024 | Ravi et al. (Meta AI) | Image + video | Streaming memory for video, 6x faster than SAM |
Progress has tracked closely with dataset scale. SA-1B alone contains roughly 400 times more masks than any previously published segmentation corpus, and SAM 2 introduced SA-V with annotated video sequences.
| Dataset | Year | Size | Classes | Notes |
|---|---|---|---|---|
| Pascal VOC 2012 | 2012 | ~1.5K test, 10K train+val | 21 | Classic benchmark for semantic segmentation |
| PASCAL Context | 2014 | 10K images | 459 (59 common) | Dense labels for whole scenes |
| Cityscapes | 2016 | 5K fine + 20K coarse | 19 | Urban driving, 1024x2048 resolution |
| ADE20K | 2017 | ~25K (20K train, 2K val) | 150 (eval) | Scene parsing, Zhou et al. |
| COCO | 2014, panoptic 2018 | ~118K train | 80 things, 53 stuff, 133 panoptic | Most influential general benchmark |
| Mapillary Vistas | 2017 | 25K | 152 | Street-level driving worldwide |
| LVIS | 2019 | 164K | 1203 (long-tail) | Federated annotation, rare categories |
| KITTI | 2012, segmentation 2015 | 200 (semantic), 200 (instance) | 19 | Autonomous driving from Karlsruhe |
| NYU Depth v2 | 2012 | 1449 labeled, 407K total | 40 | Indoor RGB-D scenes |
| SA-1B | 2023 | 11M images, 1.1B masks | Class-agnostic | SAM training set |
| SA-V | 2024 | 51K videos, 643K masklets | Class-agnostic | SAM 2 video training set |
Standard metrics include mIoU (mean Intersection-over-Union), the average IoU across classes; pixel accuracy, biased toward common classes and rarely used alone; AP at mask-level IoU thresholds for instances, as adopted by COCO; PQ (Panoptic Quality), which factors into segmentation quality SQ and recognition quality RQ; and boundary F-score for edge precision.
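The difference between pixel accuracy and mIoU is easiest to see on an imbalanced example. The following NumPy sketch derives both from a confusion matrix. The function names are illustrative, and classes absent from both prediction and ground truth are skipped when averaging, which is one common convention rather than a universal rule.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    return np.trace(cm) / cm.sum()

def mean_iou(cm):
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0  # ignore classes that never occur
    return (inter[present] / union[present]).mean()

# 100 pixels: 90 background (class 0) and 10 object (class 1).
target = np.array([0] * 90 + [1] * 10)
# A degenerate model that predicts background everywhere.
pred = np.zeros(100, dtype=int)

cm = confusion_matrix(pred, target, num_classes=2)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
print(acc, miou)
```

The degenerate predictor scores 0.9 pixel accuracy but only 0.45 mIoU, because the missed class contributes an IoU of zero to the average. This is the bias toward common classes noted above.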
Before SAM, each new domain typically required training or fine-tuning a dedicated segmenter. With SAM, a common 2024 to 2025 pipeline produces a coarse prompt (a click, a box, a CLIP similarity map, or output from Grounding DINO), passes it to SAM for an accurate mask, then optionally classifies the region using a vision-language model. Grounded-SAM is a widely used instance of this pattern and supports arbitrary free-text prompts.
SAM is class-agnostic by design. Open-vocabulary models like OpenSeg and OpenSeeD fold class understanding directly into segmentation by aligning mask embeddings with text embeddings from contrastive language pretraining.
SAM 2 generalizes the SAM interface to video via a memory module that stores object features across frames and tracks an instance through occlusion and motion blur. On image segmentation it is around six times faster than SAM and ships with a larger video training set (SA-V).
Medical and scientific variants have proliferated. MedSAM (Ma et al., 2024, Nature Communications) fine-tunes SAM on over 1.57 million medical image-mask pairs across ten modalities and shows that one foundation model can handle CT, MRI, ultrasound, and pathology tasks competitively. EfficientSAM and FastSAM target phones and edge accelerators where the original 636M-parameter ViT-H encoder is too large.
Fine-grained boundaries remain difficult. Hair, fur, transparent objects like glass and water, and adjacent regions of similar texture (grass against trees, sky against fog) confuse even the largest models. Small objects suffer because feature-map resolution drops through the network and upsampling smears them. Annotation noise also hurts: human annotators disagree on boundaries by several pixels, so models scoring within annotator variance get little further gradient signal.
Foundation models bring their own caveats. SAM and SAM 2 are class-agnostic, so they cannot answer "what is this?" without a partner classifier. They were trained mostly on natural images and degrade on out-of-distribution domains like medical scans or microscopy unless fine-tuned. Real-time inference on edge devices is constrained: SAM's ViT-H image encoder dominates the runtime, taking roughly 0.15 seconds per image on an NVIDIA A100 by Meta's own figures and far longer on mobile hardware. 3D and volumetric segmentation, which matters for medical and robotic use, is only beginning to receive foundation-model-scale attention.