Image segmentation

Image segmentation is the computer vision task of partitioning a digital image into multiple regions by assigning every pixel a label, producing a pixel-level map of what each part of the image contains. It differs from image classification, which labels a whole image, and from object detection, which draws bounding boxes: segmentation predicts a class, an object instance, or both for each individual pixel. The three main variants are semantic segmentation (one class label per pixel), instance segmentation (a separate mask per object), and panoptic segmentation (both at once). Influential models include Fully Convolutional Networks and U-Net (2015), Mask R-CNN (2017), and Meta AI's Segment Anything Model (SAM, 2023), a foundation model trained on over 1 billion masks across 11 million images. ^[13]

Introduction

Image segmentation is a fundamental task in computer vision that involves partitioning a digital image into multiple segments, or groups of pixels, each corresponding to a meaningful region or object. The goal is to simplify the representation of an image so that it becomes easier to analyze, interpret, or process. Every pixel in an image is assigned to a category, an object instance, or both, depending on the type of segmentation being performed.

Unlike image classification, which assigns a single label to an entire image, and object detection, which draws bounding boxes around objects, image segmentation operates at the pixel level. This fine-grained understanding of visual scenes is critical for applications ranging from medical diagnosis to autonomous driving and satellite image analysis.

Image segmentation has evolved from simple threshold-based methods in the 1970s and 1980s to deep learning architectures that achieve high accuracy on complex benchmarks. The introduction of convolutional neural networks (CNNs) to segmentation in the mid-2010s marked a dramatic shift in the field, and more recent transformer-based approaches have continued to improve on those results.

What are the types of image segmentation?

There are three primary types of image segmentation, each addressing a different level of visual understanding.

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image. All pixels belonging to the same object category receive the same label, regardless of whether they belong to different individual objects. For example, in a street scene, all cars would be labeled as "car" and all pedestrians as "person," but no distinction would be made between individual cars or individual pedestrians.

The segmentation literature divides visual categories into two broad groups. "Stuff" refers to amorphous, uncountable regions such as sky, road, grass, and water. "Things" are countable object categories such as cars, people, and animals. Semantic segmentation labels both stuff and things but does not separate individual object instances.

Instance Segmentation

Instance segmentation extends the concept of semantic segmentation by not only classifying each pixel but also distinguishing between separate instances of the same class. In a scene with three cars, instance segmentation would produce three distinct masks, one for each car, rather than merging them into a single "car" region. Instance segmentation typically focuses on "things" (countable objects) and does not label background or amorphous regions.

Panoptic Segmentation

Panoptic segmentation, first formalized by Alexander Kirillov and colleagues in a 2018 paper (published at CVPR 2019), combines semantic and instance segmentation into a unified task. ^[9] As the authors put it, panoptic segmentation "unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance)." ^[9] Every pixel in the image must receive both a semantic class label and an instance ID. For "stuff" categories like sky or road, all pixels of the same class share a single label. For "things" categories like people or vehicles, each individual object gets a unique instance ID. ^[9] The term "panoptic" comes from the Greek words "pan" (all) and "optic" (vision), reflecting the goal of capturing everything visible in an image.

Type	What It Labels	Distinguishes Instances?	Handles Stuff?	Handles Things?
Semantic segmentation	Every pixel gets a class label	No	Yes	Yes
Instance segmentation	Pixels of individual objects	Yes	No	Yes
Panoptic segmentation	Every pixel gets a class label and instance ID	Yes	Yes	Yes
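
The difference between the three outputs can be illustrated on a toy label map. The following is a minimal sketch, not a standard file format: the class ids, the `THING_CLASSES` set, and the `to_panoptic` helper are hypothetical names chosen for illustration.

```python
# Toy 2x4 scene. Class ids are hypothetical: 0 = sky ("stuff"), 1 = car ("thing").
SKY, CAR = 0, 1
THING_CLASSES = {CAR}

# Semantic map: one class id per pixel; the two cars are indistinguishable.
semantic = [
    [SKY, SKY, SKY, SKY],
    [CAR, CAR, SKY, CAR],
]

# Instance map: one id per object for "things" pixels, 0 elsewhere.
instance = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
]


def to_panoptic(semantic, instance):
    """Pair each pixel's class with an instance id (0 for stuff classes)."""
    return [
        [(c, i if c in THING_CLASSES else 0) for c, i in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]


panoptic = to_panoptic(semantic, instance)
# Distinct (class, instance) pairs: one sky segment and two separate car segments.
segments = sorted({pixel for row in panoptic for pixel in row})
print(segments)
```

In this sketch the semantic map alone cannot tell the two cars apart, the instance map alone says nothing about the sky, and the panoptic map carries both pieces of information for every pixel.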

History of Image Segmentation

Image segmentation has a long history that predates the deep learning era by several decades. Classical approaches relied on hand-crafted features, mathematical morphology, and optimization techniques.

Thresholding

Thresholding is the simplest and oldest method for image segmentation. It converts a grayscale image into a binary image by selecting a cutoff value (the threshold): pixels above the threshold are assigned to one class, and pixels below it to another. Otsu's method, proposed by Nobuyuki Otsu in 1979, automated the threshold selection process by minimizing intra-class variance of pixel intensities. ^[18] Thresholding works well for images with clear bimodal intensity distributions but struggles with complex scenes, varying lighting, or overlapping intensity ranges.
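
As a rough illustration of Otsu's criterion, the sketch below searches every candidate threshold and keeps the one with the lowest weighted intra-class variance. It is a brute-force version written for readability, assuming 8-bit integer intensities; the function name `otsu_threshold` and the sample pixel values are illustrative, and practical implementations use a faster cumulative-histogram formulation.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that minimizes the weighted intra-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    best_t, best_score = 0, float("inf")
    for t in range(levels - 1):
        low, high = hist[: t + 1], hist[t + 1 :]
        n_low, n_high = sum(low), sum(high)
        if n_low == 0 or n_high == 0:
            continue  # one class would be empty, so t does not split the data
        offset = t + 1  # intensity of the first bin in `high`
        mean_low = sum(v * c for v, c in enumerate(low)) / n_low
        mean_high = sum((v + offset) * c for v, c in enumerate(high)) / n_high
        var_low = sum(c * (v - mean_low) ** 2 for v, c in enumerate(low)) / n_low
        var_high = (
            sum(c * (v + offset - mean_high) ** 2 for v, c in enumerate(high))
            / n_high
        )
        score = (n_low * var_low + n_high * var_high) / total
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# A clearly bimodal toy image: four dark pixels and four bright pixels.
pixels = [10, 12, 11, 13, 200, 202, 201, 199]
t = otsu_threshold(pixels)
binary = [p > t for p in pixels]
print(t, binary)
```

On a bimodal input like this one, any threshold between the two intensity clusters gives the same minimal score, and the sketch returns the first such value it encounters.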

Region Growing

Region growing starts from one or more seed pixels and iteratively expands outward by adding neighboring pixels that satisfy a similarity criterion, such as a small difference in intensity or color. The process continues until no more pixels can be added. While straightforward to implement, region growing is sensitive to the choice of seed points and similarity thresholds, and it can produce inconsistent results in images with gradual intensity transitions.
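
The procedure can be sketched as a breadth-first search from a single seed. This is one simple variant, assuming 4-connectivity and a similarity test against the seed intensity; other variants compare against the running region mean or use 8-connectivity. The name `region_grow` and the toy image are illustrative.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a 4-connected region from `seed`.

    A neighbor joins the region when its intensity differs from the seed
    intensity by at most `tolerance`.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    frontier.append((nr, nc))
    return region


image = [
    [10, 11, 50, 52],
    [12, 10, 51, 53],
    [11, 13, 12, 50],
]
region = region_grow(image, seed=(0, 0), tolerance=5)
print(sorted(region))
```

Changing the seed to a bright pixel or tightening the tolerance yields a different region, which is the sensitivity the paragraph above describes.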

Watershed Transform

The watershed algorithm, introduced by Serge Beucher and Christian Lantuejoul in 1979, treats an image as a topographic surface where pixel values represent elevation. The algorithm simulates flooding from regional minima: water rises from each minimum, and barriers (watershed lines) are built where water from different sources meets. These barriers define the segment boundaries. The watershed transform is effective for separating touching or overlapping objects but is prone to over-segmentation, often requiring preprocessing (such as marker-based approaches) to produce useful results.

Graph Cuts

Graph-based methods model an image as a graph where pixels (or small regions) are nodes and edges connect neighboring pixels, weighted by similarity. The segmentation problem is then framed as a graph partitioning problem. Yuri Boykov and Marie-Paule Jolly published a foundational paper in 2001 on interactive graph cuts for optimal boundary and region segmentation, where users provide seed points for foreground and background, and the algorithm finds the globally optimal cut. ^[15] GrabCut, introduced by Carsten Rother, Vladimir Kolmogorov, and Andrew Blake in 2004, simplified the interaction to a single bounding box and used iterative graph cuts with Gaussian Mixture Models for more automated foreground extraction. ^[16] Normalized cuts (Shi and Malik, 2000) offered another influential graph partitioning framework that balanced the cut cost against segment sizes. ^[19]

Superpixels

Superpixel algorithms group pixels into small, perceptually meaningful regions that respect object boundaries. Rather than working with individual pixels, downstream algorithms can operate on these compact regions, reducing computational cost. Simple Linear Iterative Clustering (SLIC), proposed by Radhakrishna Achanta and colleagues in 2012, became one of the most popular superpixel methods due to its speed, simplicity, and the quality of the resulting segments. ^[17] SLIC adapts k-means clustering in a five-dimensional space of color and spatial coordinates to produce compact, roughly uniform superpixels. ^[17]
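
The five-dimensional clustering relies on a distance that mixes color difference with spatial distance. The sketch below follows the distance measure given in the SLIC paper, D = sqrt(d_c^2 + (d_s / S)^2 * m^2), where S is the grid interval between initial cluster centers and m is a compactness weight; those symbols come from the paper rather than from the text above, and the function name `slic_distance` is illustrative.

```python
import math


def slic_distance(pixel, center, grid_step, compactness):
    """Combined color and spatial distance between a pixel and a cluster center.

    `pixel` and `center` are (l, a, b, x, y) tuples: CIELAB color plus position.
    `grid_step` is S, the spacing of the initial cluster centers, and
    `compactness` is m, which trades color similarity against spatial proximity.
    """
    d_color = math.dist(pixel[:3], center[:3])
    d_space = math.dist(pixel[3:], center[3:])
    return math.sqrt(d_color ** 2 + (d_space / grid_step) ** 2 * compactness ** 2)


# For N pixels and k superpixels, the grid interval is S = sqrt(N / k).
grid_step = math.sqrt(10_000 / 100)

same_color = slic_distance((50, 0, 0, 3, 4), (50, 0, 0, 0, 0), grid_step, 10)
other_color = slic_distance((60, 0, 0, 3, 4), (50, 0, 0, 0, 0), grid_step, 10)
print(same_color, other_color)
```

A larger compactness value makes spatial distance dominate and produces more regular superpixels, while a smaller value lets superpixels follow color boundaries more closely.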

Deep Learning for Image Segmentation

The application of deep neural networks to image segmentation transformed the field beginning in 2014 and 2015. Deep learning methods learn feature representations directly from data, eliminating the need for hand-crafted features and dramatically improving accuracy on challenging benchmarks.

Fully Convolutional Networks (FCN)

The paper "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long, Evan Shelhamer, and Trevor Darrell, presented at CVPR 2015, is widely regarded as the work that launched modern deep learning-based segmentation. ^[1] The authors showed that "convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation." ^[1] The key insight was to adapt classification networks (AlexNet, VGGNet, GoogLeNet) into fully convolutional networks by replacing their fully connected layers with convolutional layers. ^[1] This allowed the network to accept inputs of arbitrary size and produce dense, pixel-wise predictions.

FCN introduced the concept of using skip connections to combine coarse, high-level semantic information from deeper layers with fine, low-level spatial information from earlier layers. The authors proposed three variants: FCN-32s, which upsampled predictions by a factor of 32 in a single step; FCN-16s, which combined predictions from the final layer and a shallower layer before upsampling; and FCN-8s, which fused predictions from three layers for the finest output. ^[1] FCN-8s achieved 62.2% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 benchmark, a 20% relative improvement over prior methods at the time, while inference for a typical image took less than one fifth of a second. ^[1] The extended version of the paper was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in 2016.

U-Net

U-Net, introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015, was designed specifically for biomedical image segmentation, where annotated training data is often scarce. ^[2] The architecture has a distinctive U-shaped structure consisting of a contracting path (encoder) and an expanding path (decoder) connected by skip connections. ^[2]

The encoder follows a standard CNN pattern: repeated blocks of two 3x3 convolutions, each followed by a ReLU activation, with a 2x2 max pooling operation at the end of each block for downsampling. At each downsampling step, the number of feature channels doubles. The decoder mirrors this structure, using 2x2 up-convolutions to restore spatial resolution. At each step in the decoder, the upsampled feature map is concatenated with the corresponding feature map from the encoder via a skip connection, preserving fine-grained spatial details that would otherwise be lost during downsampling. ^[2]
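
To make the shape bookkeeping concrete, the following sketch traces the spatial size and channel count through the network without doing any actual convolution. It assumes the configuration of the original paper, which is not spelled out above: a 572x572 input, 64 channels in the first block, four downsampling steps, and unpadded convolutions that each trim two pixels. The helper name `unet_trace` is illustrative.

```python
def unet_trace(size=572, base_channels=64, depth=4):
    """Trace (stage, spatial size, channels) through a U-Net.

    Assumes unpadded 3x3 convolutions, so each pair of convolutions
    shrinks the feature map by 4 pixels per side length.
    """
    shapes = []
    channels = base_channels
    # Contracting path: two 3x3 convs, then 2x2 max pooling halves the size
    # and the next block doubles the channel count.
    for _ in range(depth):
        size -= 4
        shapes.append(("enc", size, channels))
        size //= 2
        channels *= 2
    # Bottleneck: two more convolutions at the lowest resolution.
    size -= 4
    shapes.append(("bottleneck", size, channels))
    # Expanding path: a 2x2 up-convolution doubles the size and halves the
    # channels; after concatenation with the cropped encoder map, two 3x3
    # convs bring the channel count back to the halved value.
    for _ in range(depth):
        size *= 2
        channels //= 2
        size -= 4
        shapes.append(("dec", size, channels))
    return shapes


for stage, size, channels in unet_trace():
    print(f"{stage:10s} {size:4d} x {size:<4d} {channels:5d} channels")
```

Under these assumptions the output map is smaller than the input (388 versus 572 pixels per side), which is why encoder features have to be cropped before concatenation. Many later implementations use padded convolutions instead, which keeps input and output sizes equal.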

U-Net won both of the most challenging categories (phase contrast and DIC microscopy) at the ISBI 2015 Cell Tracking Challenge by a large margin, achieving an average IoU of 92% on the PhC-U373 dataset compared to 83% for the second-place method. ^[2] The architecture's reliance on heavy data augmentation to compensate for limited training samples made it especially popular in medical imaging, where large annotated datasets are rare. U-Net and its many variants (U-Net++, Attention U-Net, TransUNet, nnU-Net) remain among the most widely used segmentation architectures across medical imaging, satellite analysis, and other domains.

DeepLab Series

The DeepLab family of models, developed primarily by Liang-Chieh Chen and colleagues at Google, introduced several influential ideas for semantic segmentation across four major versions.

DeepLabv1 (Chen et al., ICLR 2015) combined a deep CNN based on VGG-16 with atrous (dilated) convolutions and a fully connected Conditional Random Field (CRF). ^[3] Atrous convolutions insert gaps (zeros) between filter weights, enlarging the receptive field without increasing the number of parameters or reducing spatial resolution. The CRF post-processing step sharpened object boundaries by modeling pairwise pixel relationships. ^[3]
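
The effect of the dilation rate is easiest to see in one dimension. The sketch below is an illustrative, unpadded 1-D version (the function name `atrous_conv1d` is hypothetical): the same three weights cover three input samples at rate 1 and five input samples at rate 2.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid (unpadded) 1-D cross-correlation with dilation `rate`.

    Kernel taps are spaced `rate` samples apart, so the kernel spans
    (len(kernel) - 1) * rate + 1 input samples with no extra weights.
    """
    span = (len(kernel) - 1) * rate + 1
    return [
        sum(w * signal[start + i * rate] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)    # each output sees 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # each output sees 5 samples
print(dense, dilated)
```

In a real network the input is padded so that the output keeps its resolution; the unpadded form is used here only to keep the receptive-field arithmetic visible.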

DeepLabv2 (Chen et al., TPAMI 2017) introduced Atrous Spatial Pyramid Pooling (ASPP), which applies atrous convolutions at multiple dilation rates in parallel. ^[4] Each branch captures context at a different spatial scale, and their outputs are fused to produce a multi-scale feature representation. This design allowed the model to handle objects of widely varying sizes within a single image. DeepLabv2 also adopted ResNet as its backbone. ^[4]

DeepLabv3 (Chen et al., 2017, arXiv:1706.05587) refined the ASPP module by augmenting it with image-level features (global average pooling) to capture broader context. ^[5] It also incorporated batch normalization to stabilize training and removed the CRF post-processing step used in earlier versions, simplifying the pipeline while maintaining strong performance. ^[5]

DeepLabv3+ (Chen et al., ECCV 2018) added a lightweight decoder module to the DeepLabv3 encoder, creating a proper encoder-decoder architecture. ^[6] It also adopted depthwise separable convolutions (inspired by the Xception architecture) in both the ASPP module and the decoder, improving both speed and accuracy. DeepLabv3+ achieved 89.0% mIoU on PASCAL VOC 2012 and 82.1% mIoU on Cityscapes, both without any post-processing. ^[6]

Version	Year	Key Innovation	CRF Used?	Backbone
DeepLabv1	2015	Atrous convolution + CRF	Yes	VGG-16
DeepLabv2	2017	Atrous Spatial Pyramid Pooling (ASPP)	Yes	ResNet-101
DeepLabv3	2017	Improved ASPP + image-level features, no CRF	No	ResNet-101
DeepLabv3+	2018	Encoder-decoder + depthwise separable convolution	No	Modified Xception

PSPNet

The Pyramid Scene Parsing Network (PSPNet), proposed by Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia at CVPR 2017, addressed the problem of capturing global context for scene parsing. ^[7] Many segmentation errors arise because models fail to consider the broader scene: for example, misclassifying a boat on water as a car because the model only looks at local features.

PSPNet's core contribution is the Pyramid Pooling Module (PPM), which divides the feature map into grids of different sizes (typically 1x1, 2x2, 3x3, and 6x6), applies global average pooling within each grid cell, and then upsamples and concatenates the results with the original feature map. ^[7] This produces a representation that encodes context at multiple spatial granularities, from global scene-level information to local fine-grained details.
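
A single-channel sketch of the pooling and concatenation step is shown below. It is a simplification under stated assumptions: the feature map is square and divisible by every bin size, nearest-neighbor upsampling stands in for the interpolation used in practice, and the channel-reducing convolutions of the real module are omitted. All function names are illustrative.

```python
def average_pool_to_grid(feature, bins):
    """Average a square 2-D map over a bins x bins grid of equal cells."""
    size = len(feature)
    assert size % bins == 0, "this sketch assumes the size divides evenly"
    cell = size // bins
    pooled = []
    for br in range(bins):
        row = []
        for bc in range(bins):
            values = [
                feature[r][c]
                for r in range(br * cell, (br + 1) * cell)
                for c in range(bc * cell, (bc + 1) * cell)
            ]
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


def upsample_nearest(pooled, size):
    """Expand a bins x bins map back to size x size by repeating each cell."""
    cell = size // len(pooled)
    return [[pooled[r // cell][c // cell] for c in range(size)] for r in range(size)]


def pyramid_pool(feature, bin_sizes=(1, 2, 3, 6)):
    """Return the original map followed by one context map per pyramid level."""
    size = len(feature)
    levels = [
        upsample_nearest(average_pool_to_grid(feature, bins), size)
        for bins in bin_sizes
    ]
    return [feature] + levels


feature = [[r * 6 + c for c in range(6)] for r in range(6)]
channels = pyramid_pool(feature)
print(len(channels), channels[1][0][0], channels[2][0][0])
```

The 1x1 level gives every position the same global average, while the 6x6 level on this 6x6 input reproduces the original values, which shows the range from scene-level to local context.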

PSPNet used a dilated ResNet as its backbone and achieved first place in the ImageNet Scene Parsing Challenge 2016. It reached 85.4% mIoU on PASCAL VOC 2012 and 80.2% mIoU on the Cityscapes test set. ^[7]

Mask R-CNN

Mask R-CNN, published by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at ICCV 2017, became the dominant framework for instance segmentation. ^[8] The authors described it as "a conceptually simple, flexible, and general framework for object instance segmentation." ^[8] The model extends Faster R-CNN, a two-stage object detector, by adding a parallel branch that predicts a binary segmentation mask for each detected object, alongside the existing branches for bounding box regression and classification. ^[8]

A key technical contribution of Mask R-CNN is RoIAlign, which replaces the RoI Pooling operation from Faster R-CNN. RoI Pooling uses quantized (rounded) coordinates when extracting features from regions of interest, introducing spatial misalignments. RoIAlign instead uses bilinear interpolation to compute exact feature values at non-integer locations, preserving precise spatial correspondence. ^[8] This seemingly small change led to significant improvements in mask quality.
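
The difference between the two lookups can be shown at a single sampling point. This is only a sketch of the interpolation idea: a full RoIAlign layer samples several points per output bin and pools them, and the quantized lookup here uses `floor` as a stand-in for the rounding step. Function names and the toy feature map are illustrative.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2-D map at a real-valued (y, x) location inside the map."""
    y0, x0 = math.floor(y), math.floor(x)
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """RoI Pooling style lookup: snap coordinates to the integer grid first."""
    return feature[math.floor(y)][math.floor(x)]


feature = [
    [0.0, 10.0],
    [20.0, 30.0],
]
aligned = bilinear_sample(feature, 0.5, 0.5)
snapped = quantized_sample(feature, 0.5, 0.5)
print(aligned, snapped)
```

At the center of the four cells the interpolated value blends all four neighbors, whereas the quantized lookup returns the value of one corner, which is the half-pixel misalignment that RoIAlign is meant to remove.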

The mask branch itself is a small fully convolutional network applied to each region of interest, predicting a binary mask for each class independently. Mask R-CNN runs at approximately 5 frames per second and outperformed all single-model entries in the COCO 2016 instance segmentation, object detection, and keypoint detection challenges. ^[8] The framework has been extended to numerous tasks including human pose estimation and 3D object reconstruction.

Transformer-Based Segmentation

The success of the transformer architecture in natural language processing inspired researchers to apply self-attention mechanisms to visual segmentation. Transformer-based models have gradually overtaken purely convolutional architectures on major benchmarks.

SegFormer

SegFormer, proposed by Enze Xie and colleagues at NeurIPS 2021, introduced a clean and efficient design for semantic segmentation using transformers. ^[10] The architecture consists of two main components: a hierarchical transformer encoder and a lightweight All-MLP decoder. ^[10]

The encoder produces multi-scale feature maps (at 1/4, 1/8, 1/16, and 1/32 of the input resolution) using overlapped patch merging and an efficient form of self-attention that shortens the key and value sequences. Notably, SegFormer avoids positional encodings, relying instead on a 3x3 depthwise convolution inside its feed-forward blocks (Mix-FFN) to supply location information, which allows it to handle variable input resolutions at test time without interpolation artifacts. ^[10] The MLP decoder aggregates features from all encoder stages by unifying their channel dimensions, upsampling to a common resolution, concatenating them, and applying a final MLP to produce predictions.

The authors released a family of models from SegFormer-B0 (lightweight, 3.8M parameters) to SegFormer-B5 (high-performance, 84.7M parameters). The mid-sized SegFormer-B4 reached 50.3% mIoU on ADE20K with 64M parameters, which the authors reported as 5 times smaller and 2.2 points more accurate than the previous best method. ^[10] SegFormer-B5 achieved 84.0% mIoU on the Cityscapes validation set and demonstrated strong zero-shot robustness on corrupted versions of the dataset (Cityscapes-C). ^[10]

Mask2Former

Mask2Former (Cheng et al., CVPR 2022) established a unified architecture capable of handling semantic, instance, and panoptic segmentation with a single model. ^[11] Building on the earlier MaskFormer work, which framed segmentation as a mask classification problem, Mask2Former introduced masked attention to constrain cross-attention within predicted mask regions rather than attending to the full image. ^[11] This localized attention mechanism improved both efficiency and accuracy.
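
The core of masked attention can be sketched for one query as a softmax in which positions outside the predicted mask receive a score of negative infinity, so they get zero weight. This is a simplified single-query, single-head illustration with hypothetical names; it assumes at least one position lies inside the mask and leaves out the projections and multi-scale features of the real decoder.

```python
import math


def masked_attention_weights(scores, mask):
    """Softmax over `scores`, restricted to positions where `mask` is True.

    Assumes at least one entry of `mask` is True.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Attention scores of one query against four image positions. The last
# position has the highest raw score but lies outside the predicted mask.
scores = [2.0, 1.0, 0.0, 3.0]
mask = [True, True, False, False]
weights = masked_attention_weights(scores, mask)
print(weights)
```

Because excluded positions contribute nothing, each query aggregates features only from the region it is currently predicting, which is the localization effect described above.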

The architecture consists of a backbone (ResNet or Swin Transformer), a pixel decoder that produces multi-scale feature maps, and a transformer decoder that generates a set of mask predictions and class labels from learnable queries. Mask2Former set new state-of-the-art results across multiple benchmarks: 57.8 PQ on COCO panoptic, 50.1 AP on COCO instance, and 57.7 mIoU on ADE20K semantic segmentation. ^[11]

OneFormer

OneFormer, introduced by Jitesh Jain and colleagues at CVPR 2023, took the unified segmentation concept a step further. ^[12] While Mask2Former trains separate models for each task (semantic, instance, panoptic), OneFormer uses a single model trained once on panoptic annotations with a task-conditioned joint training strategy. A task token conditions the model on which segmentation task it should perform, making the architecture task-dynamic at inference time. ^[12]

OneFormer also introduces a query-text contrastive loss that uses a text encoder to create better distinctions between tasks and between classes. With a single trained model, OneFormer outperformed task-specific Mask2Former models on ADE20K, Cityscapes, and COCO across all three segmentation tasks. ^[12]

Segment Anything Model (SAM)

The Segment Anything Model (SAM), released by Meta AI in April 2023 and presented at ICCV 2023, represents a paradigm shift toward foundation models for segmentation. ^[13] Rather than training a model for a specific set of classes, SAM is designed as a general-purpose, promptable segmentation system: per the paper, "the model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks." ^[13]

Architecture

SAM consists of three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder is a Vision Transformer (ViT) pretrained with Masked Autoencoders (MAE) that produces image embeddings. The prompt encoder handles various types of input prompts, including points (foreground or background), bounding boxes, rough masks, and free-form text. The lightweight mask decoder combines image and prompt embeddings using a modified transformer architecture and produces segmentation masks in real time. ^[13]

A key property of SAM is ambiguity awareness. When a prompt is ambiguous (for example, a single point on a person's shirt could refer to the shirt, the person, or the whole group), SAM outputs multiple valid masks at different levels of granularity. ^[13]

SA-1B Dataset

SAM was trained on the SA-1B dataset, which contains over 1.1 billion segmentation masks across 11 million high-resolution, diverse, and privacy-respecting images. ^[13] The dataset was built using a data engine with three stages: in the first stage, human annotators created masks with SAM's assistance; in the second stage, a mix of automatic and manual annotation was used; and in the third stage, masks were generated fully automatically by the model. ^[13] At the time of its release, SA-1B was the largest segmentation dataset available, exceeding previous datasets by orders of magnitude in mask count.

SAM demonstrated strong zero-shot transfer capabilities, matching or exceeding the performance of fully supervised models on many tasks without any task-specific training. ^[13]

SAM 2

SAM 2, released by Meta in July 2024, extends the Segment Anything concept to video. ^[14] While SAM processes individual images, SAM 2 can track and segment objects across video frames, handling occlusions, reappearances, and changes in object appearance over time. ^[14]

SAM 2 uses a transformer architecture with streaming memory that stores information about the target object from previously processed frames. Users can provide prompts (points, boxes, or masks) on any frame, and the model propagates the segmentation forward and backward through the video. According to the paper, in video segmentation SAM 2 achieves "better accuracy, using 3x fewer interactions than prior approaches," and for image segmentation it is "more accurate and 6x faster than the Segment Anything Model (SAM)." ^[14]

SAM 2 was trained on the SA-V dataset, the largest video segmentation dataset to date, containing approximately 600,000 masklets (spatiotemporal masks) collected on about 51,000 videos spanning 47 countries. ^[14]^[20] Meta reports that SA-V contains 53 times more masks than any existing video segmentation dataset. ^[14] Both SAM 2 and the SA-V dataset were released under permissive open-source licenses (Apache 2.0 and CC BY 4.0, respectively). ^[14]^[20]

Evaluation Metrics

Evaluating segmentation models requires metrics that capture different aspects of prediction quality. The choice of metric depends on the segmentation type being evaluated.

Pixel Accuracy

Pixel accuracy is the simplest metric: it measures the percentage of pixels in the image that are correctly classified. While intuitive, pixel accuracy can be misleading in datasets with class imbalance. If 90% of an image is background, a model that predicts everything as background achieves 90% pixel accuracy despite being useless for identifying foreground objects.

Intersection over Union (IoU) and Mean IoU

Intersection over Union, also called the Jaccard Index, is the standard metric for semantic segmentation. For a given class, IoU is computed as the area of overlap between the predicted and ground-truth regions divided by the area of their union:

IoU = TP / (TP + FP + FN)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. Mean IoU (mIoU) averages the IoU across all classes and is the primary metric for benchmarks like PASCAL VOC, Cityscapes, and ADE20K.
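
The contrast between pixel accuracy and mIoU can be checked on the background-only example from the pixel accuracy discussion. The sketch below works on flattened label lists and uses illustrative function names; it assumes every evaluated class appears in the prediction or the ground truth, since otherwise the IoU denominator would be zero.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def class_iou(pred, truth, cls):
    """IoU = TP / (TP + FP + FN) for a single class."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, truth))
    fp = sum(p == cls and t != cls for p, t in zip(pred, truth))
    fn = sum(p != cls and t == cls for p, t in zip(pred, truth))
    return tp / (tp + fp + fn)


def mean_iou(pred, truth, classes):
    """Unweighted average of the per-class IoU values."""
    return sum(class_iou(pred, truth, c) for c in classes) / len(classes)


# Ten pixels: 90% background (class 0) and 10% foreground (class 1).
truth = [0] * 9 + [1]
pred = [0] * 10  # a model that predicts background everywhere

accuracy = pixel_accuracy(pred, truth)
miou = mean_iou(pred, truth, classes=[0, 1])
print(accuracy, miou)
```

The foreground class has an IoU of zero, so averaging over classes pulls mIoU far below the pixel accuracy, which makes the failure on the minority class visible.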

Average Precision (AP)

For instance segmentation, Average Precision is the standard metric, borrowed from object detection. A predicted mask is considered a true positive if its IoU with a ground-truth mask exceeds a threshold. AP is computed across a range of IoU thresholds (typically 0.50 to 0.95 in steps of 0.05 on the COCO benchmark), and the results are averaged.

Panoptic Quality (PQ)

Panoptic Quality, proposed by Kirillov et al. alongside the panoptic segmentation task, provides a unified evaluation for both stuff and things classes. ^[9] PQ is decomposed into two factors:

PQ = SQ × RQ

Segmentation Quality (SQ) is the average IoU of matched segments. Recognition Quality (RQ) is the F1 score of the segment matching, capturing how well the model detects segments. PQ is computed per class and then averaged, ensuring that both large and small classes contribute equally to the final score. ^[9]
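
For a single class, the decomposition can be sketched as follows. The sketch assumes the matching step has already been done; in the paper by Kirillov et al., a predicted and a ground-truth segment count as a match when their IoU is greater than 0.5, a rule not stated in the text above. The function name and the example numbers are illustrative.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) for one class.

    `matched_ious` holds the IoU of every matched (true positive) pair of
    predicted and ground-truth segments.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    # SQ: average IoU over matched segments.
    sq = sum(matched_ious) / tp
    # RQ: F1 score of the matching, TP / (TP + FP/2 + FN/2).
    rq = tp / (tp + 0.5 * num_false_positives + 0.5 * num_false_negatives)
    return sq * rq, sq, rq


# Two matched segments, one spurious prediction, one missed ground-truth segment.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
print(round(pq, 4), round(sq, 4), round(rq, 4))
```

The final benchmark score would then be the mean of the per-class PQ values.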

Metric	Segmentation Type	What It Measures	Range
Pixel accuracy	Semantic	Percentage of correctly labeled pixels	0 to 100%
mIoU	Semantic	Average overlap between predicted and ground-truth regions per class	0 to 100%
AP	Instance	Precision of mask predictions across IoU thresholds	0 to 100%
PQ	Panoptic	Combined segmentation and recognition quality	0 to 100%

Benchmark Datasets

Progress in image segmentation has been driven by standardized benchmark datasets that allow researchers to compare methods under controlled conditions.

PASCAL VOC

The PASCAL Visual Object Classes (VOC) dataset, particularly the 2012 edition, is one of the oldest and most widely used segmentation benchmarks. It contains roughly 11,000 images covering 20 object classes plus background, with pixel-level annotations for semantic segmentation. PASCAL VOC served as the primary benchmark for many early deep learning segmentation methods, including FCN and DeepLab. Although smaller than newer datasets, it remains a standard reference point for evaluating segmentation algorithms.

Cityscapes

Cityscapes is a large-scale dataset for urban scene understanding, collected from street-level views in 50 cities across Germany and neighboring countries. It includes 5,000 images with fine pixel-level annotations (2,975 training, 500 validation, 1,525 test) and an additional 20,000 images with coarse annotations. The dataset covers 19 semantic classes relevant to driving scenarios, including road, car, pedestrian, building, and vegetation. Cityscapes is a primary benchmark for autonomous driving segmentation research.

ADE20K

ADE20K, developed by researchers at MIT, is a large-scale scene parsing dataset containing more than 25,000 images with dense annotations spanning 150 object and stuff categories. The dataset covers a wide variety of indoor and outdoor scenes, making it more diverse than driving-focused datasets like Cityscapes. ADE20K served as the basis for the ImageNet Scene Parsing Challenge and is one of the standard benchmarks for evaluating semantic segmentation models.

COCO

The Common Objects in Context (COCO) dataset is one of the most comprehensive benchmarks for visual recognition. For segmentation tasks, COCO provides over 200,000 images with annotations covering 80 object categories for instance segmentation and 133 categories for panoptic segmentation (80 things + 53 stuff). COCO's diverse scenes, with multiple objects per image at various scales and in complex arrangements, make it a challenging and widely adopted benchmark.

SA-1B

SA-1B, introduced alongside SAM in 2023, is the largest segmentation dataset to date. It contains 1.1 billion automatically generated masks across 11 million images. ^[13] While SA-1B is not used as a traditional benchmark for comparing models, its scale and diversity have made it the foundational training set for promptable segmentation research. The images are high-resolution, geographically diverse, and processed to protect privacy.

Dataset	Images	Classes	Annotation Type	Primary Use
PASCAL VOC 2012	~11,000	20 + background	Semantic	Semantic segmentation
Cityscapes	5,000 fine + 20,000 coarse	19	Semantic, instance, panoptic	Autonomous driving
ADE20K	25,000+	150	Semantic	Scene parsing
COCO	200,000+	80 things + 53 stuff	Instance, panoptic	General-purpose benchmarking
SA-1B	11 million	Class-agnostic	Masks (no class labels)	Promptable segmentation training

Key Models Compared

The following table summarizes the major deep learning models discussed in this article, along with their target segmentation tasks, year of publication, and notable contributions.

Model	Year	Task	Key Contribution	Notable Result
FCN	2015	Semantic	End-to-end pixel-wise prediction with adapted classification networks	62.2% mIoU on PASCAL VOC 2012
U-Net	2015	Semantic (biomedical)	Symmetric encoder-decoder with skip connections	92% IoU on ISBI 2015 PhC-U373
DeepLabv1	2015	Semantic	Atrous convolutions + CRF	Improved boundary localization
DeepLabv2	2017	Semantic	Atrous Spatial Pyramid Pooling (ASPP)	Multi-scale object segmentation
PSPNet	2017	Semantic	Pyramid Pooling Module for global context	85.4% mIoU on PASCAL VOC 2012
Mask R-CNN	2017	Instance	Mask branch + RoIAlign added to Faster R-CNN	Outperformed the COCO 2016 challenge winners
DeepLabv3	2017	Semantic	Improved ASPP + image-level features	Comparable to state of the art without CRF
DeepLabv3+	2018	Semantic	Encoder-decoder with depthwise separable convolution	89.0% mIoU on PASCAL VOC 2012
SegFormer	2021	Semantic	Hierarchical transformer + MLP decoder	84.0% mIoU on Cityscapes
Mask2Former	2022	Universal	Masked attention for panoptic, instance, and semantic	57.8 PQ on COCO panoptic
OneFormer	2023	Universal	Task-conditioned joint training with one model	Outperforms task-specific Mask2Former
SAM	2023	Promptable	Foundation model trained on 1.1B masks	Strong zero-shot transfer
SAM 2	2024	Promptable (image + video)	Streaming memory for video, 6x faster than SAM	3x fewer interactions for video segmentation

What is image segmentation used for?

Image segmentation has found practical use across a wide range of industries and research fields.

Medical Imaging

In healthcare, segmentation is used to delineate anatomical structures and pathological regions in CT scans, MRI images, X-rays, and histopathology slides. Tumor detection and volumetric measurement, organ segmentation for surgical planning, cell counting in microscopy, and retinal vessel segmentation for diagnosing eye diseases are all common applications. U-Net and its derivatives dominate this space, largely because they perform well even with limited annotated training data, which is typical in clinical settings.

Autonomous Driving

Self-driving vehicles rely on real-time semantic and panoptic segmentation to understand their surroundings. Segmentation models identify drivable surfaces, lane markings, vehicles, pedestrians, cyclists, traffic signs, and obstacles from camera feeds. Accurate pixel-level understanding is essential for safe navigation, path planning, and collision avoidance. The Cityscapes dataset was specifically created to support research in this domain, and models like DeepLabv3+ and transformer-based architectures are commonly deployed in autonomous driving pipelines.

Satellite and Remote Sensing Imagery

Segmentation of satellite and aerial images supports land use classification, urban planning, deforestation monitoring, flood mapping, crop health assessment, and disaster response. The ability to classify every pixel in a satellite image into categories such as forest, water, urban area, and agricultural land is valuable for environmental science, government agencies, and agricultural businesses.

Video Editing and Visual Effects

Image and video segmentation powers features like background removal, object selection, and rotoscoping in video editing software. Tools built on segmentation models allow users to isolate subjects from backgrounds, apply selective effects, and create composites. SAM and SAM 2 have made these capabilities more accessible by enabling interactive, promptable segmentation that works without task-specific training.

Robotics

Robots performing manipulation, navigation, or inspection tasks use segmentation to identify and locate objects in their environment. Grasping a specific item from a cluttered shelf, for instance, requires the robot to segment individual objects and estimate their shapes. Instance segmentation is especially important in robotic manipulation, where the robot needs to distinguish between multiple similar objects.
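To illustrate why instance-level output matters for manipulation, the sketch below takes an instance ID map (0 for background, a distinct positive integer per object) and reports each object's centroid and pixel area, which a grasp planner could use as a starting point. The ID-map convention and the function name are assumptions made for this example.

```python
import numpy as np

def instance_centroids(instance_map):
    """Map each instance id (> 0) to (row centroid, col centroid, pixel area)."""
    results = {}
    for inst_id in np.unique(instance_map):
        if inst_id == 0:  # 0 is treated as background
            continue
        rows, cols = np.nonzero(instance_map == inst_id)
        results[int(inst_id)] = (float(rows.mean()), float(cols.mean()), int(rows.size))
    return results

# Two identical 3 x 3 objects; a semantic map would give both the same label.
instance_map = np.zeros((10, 10), dtype=np.int64)
instance_map[1:4, 1:4] = 1
instance_map[6:9, 5:8] = 2
info = instance_centroids(instance_map)
```

A semantic label map alone would merge the two objects into one class region, which is why the separate IDs are needed to target a single item.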

Augmented Reality

Augmented reality (AR) applications use real-time segmentation to overlay digital content onto the physical world. Accurate segmentation of people, surfaces, and objects allows AR systems to place virtual objects realistically, apply body or face filters, and enable occlusion handling where virtual objects appear behind real ones.

Challenges and Open Problems

Despite significant advances, image segmentation still faces several challenges.

Boundary precision. Predicting exact object boundaries remains difficult, especially for objects with irregular shapes, thin structures (such as bicycle spokes or tree branches), or fuzzy edges. Many models produce masks that are slightly dilated or eroded compared to the true object boundary.
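The effect of a small boundary error depends strongly on object shape. The sketch below simulates a prediction that is dilated by one pixel and compares the resulting IoU for a compact square and a thin bar. The 4-connected dilation is a simplified stand-in for real model error, chosen only to keep the example free of extra dependencies.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def dilate(mask):
    """Grow a boolean mask by one pixel (4-connected neighbourhood)."""
    p = np.pad(mask, 1)  # pad with False so shifted views stay in bounds
    return (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
            | p[1:-1, :-2] | p[1:-1, 2:])

compact = np.zeros((40, 40), dtype=bool)
compact[10:30, 10:30] = True   # 20 x 20 square
thin = np.zeros((40, 40), dtype=bool)
thin[19:21, 10:30] = True      # 2 x 20 bar, e.g. a branch or spoke

iou_compact = iou(compact, dilate(compact))
iou_thin = iou(thin, dilate(thin))
```

The same one-pixel error costs the thin bar far more IoU than the square, which is one reason thin structures are hard to score well on.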

Class imbalance. In many real-world datasets, some classes occupy far more pixels than others. Models can become biased toward majority classes and fail to segment rare but important objects. Techniques like class-weighted loss functions, oversampling, and focal loss partially address this issue.
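Focal loss and class weighting can be written in a few lines. Focal loss scales the per-pixel cross-entropy term by $(1 - p_t)^{\gamma}$, where $p_t$ is the predicted probability of the true class, so well-classified pixels contribute less. The NumPy sketch below is a simplified illustration: it takes probabilities rather than logits and averages over pixels without normalizing by the weights, which may differ from what a given deep learning framework does.

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, class_weights=None):
    """Mean focal loss over pixels.

    probs: (N, C) predicted class probabilities per pixel (rows sum to 1).
    targets: (N,) integer class labels.
    gamma: focusing parameter; gamma=0 reduces to (weighted) cross-entropy.
    class_weights: optional (C,) per-class weights, e.g. larger for rare classes.
    """
    p_t = probs[np.arange(len(targets)), targets]  # probability of the true class
    p_t = np.clip(p_t, 1e-12, 1.0)                 # avoid log(0)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if class_weights is not None:
        loss = loss * np.asarray(class_weights)[targets]
    return loss.mean()

# One easy pixel (p_t = 0.9) and one hard pixel (p_t = 0.4).
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4]])
targets = np.array([0, 1])

ce = focal_loss(probs, targets, gamma=0.0)     # plain cross-entropy
fl = focal_loss(probs, targets, gamma=2.0)     # easy pixel is down-weighted
weighted = focal_loss(probs, targets, gamma=0.0, class_weights=[1.0, 5.0])
```

With `gamma=2.0` the easy pixel's contribution shrinks by a factor of 100 while the hard pixel's shrinks by less than 3, so training signal concentrates on the pixels the model gets wrong.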

Domain shift. Models trained on one dataset often perform poorly when applied to images from a different domain (for example, a model trained on daytime driving scenes tested on nighttime images). Domain adaptation and domain generalization remain active research topics.

Real-time performance. Many applications, particularly autonomous driving and robotics, require segmentation at high frame rates. Balancing accuracy with computational efficiency is an ongoing tradeoff, though lightweight architectures and hardware acceleration continue to close this gap.

3D and volumetric segmentation. Extending 2D segmentation to 3D data, such as medical CT volumes or LiDAR point clouds, introduces additional complexity. While 3D extensions of U-Net and other architectures exist, 3D segmentation remains more computationally expensive and less mature than its 2D counterpart.

Open-vocabulary segmentation. Traditional models are limited to a fixed set of predefined classes. Open-vocabulary segmentation aims to segment objects based on arbitrary text descriptions, bridging segmentation with large language models and vision-language models. This is an active and rapidly evolving area of research.

References

Long, J., Shelhamer, E., and Darrell, T. "Fully Convolutional Networks for Semantic Segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2015. https://arxiv.org/abs/1411.4038
Ronneberger, O., Fischer, P., and Brox, T. "U-Net: Convolutional Networks for Biomedical Image Segmentation." *International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)*, 2015. https://arxiv.org/abs/1505.04597
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs." *International Conference on Learning Representations (ICLR)*, 2015.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs." *IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)*, 2017.
Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. "Rethinking Atrous Convolution for Semantic Image Segmentation." *arXiv preprint arXiv:1706.05587*, 2017.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2018. https://arxiv.org/abs/1802.02611
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. "Pyramid Scene Parsing Network." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2017.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. "Mask R-CNN." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2017. https://arxiv.org/abs/1703.06870
Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. "Panoptic Segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2019. https://arxiv.org/abs/1801.00868
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., and Luo, P. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." *Advances in Neural Information Processing Systems (NeurIPS)*, 2021. https://arxiv.org/abs/2105.15203
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. "Masked-attention Mask Transformer for Universal Image Segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2022.
Jain, J., Li, J., Chiu, M. T., Hassani, A., Orlov, N., and Shi, H. "OneFormer: One Transformer to Rule Universal Image Segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2023.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. "Segment Anything." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2023. https://arxiv.org/abs/2304.02643
Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., and Feichtenhofer, C. "SAM 2: Segment Anything in Images and Videos." *arXiv preprint arXiv:2408.00714*, 2024. https://arxiv.org/abs/2408.00714
Boykov, Y. and Jolly, M.-P. "Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2001.
Rother, C., Kolmogorov, V., and Blake, A. "GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts." *ACM Transactions on Graphics (SIGGRAPH)*, 2004.
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. "SLIC Superpixels Compared to State-of-the-Art Superpixel Methods." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2012.
Otsu, N. "A Threshold Selection Method from Gray-Level Histograms." *IEEE Transactions on Systems, Man, and Cybernetics*, 1979.
Shi, J. and Malik, J. "Normalized Cuts and Image Segmentation." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2000.
Meta AI. "Introducing Meta Segment Anything Model 2 (SAM 2)." Meta AI Blog, July 2024. https://ai.meta.com/research/sam2/
