Image segmentation is a fundamental task in computer vision that involves partitioning a digital image into multiple segments, or groups of pixels, each corresponding to a meaningful region or object. The goal is to simplify the representation of an image so that it becomes easier to analyze, interpret, or process. Every pixel in an image is assigned to a category, an object instance, or both, depending on the type of segmentation being performed.
Unlike image classification, which assigns a single label to an entire image, and object detection, which draws bounding boxes around objects, image segmentation operates at the pixel level. This fine-grained understanding of visual scenes is critical for applications ranging from medical diagnosis to autonomous driving and satellite image analysis.
Image segmentation has evolved from simple threshold-based methods in the 1970s and 1980s to sophisticated deep learning architectures that achieve near-human accuracy on complex benchmarks. The introduction of convolutional neural networks (CNNs) to segmentation in the mid-2010s marked a dramatic shift in the field, and more recent transformer-based approaches have continued to push the boundaries of what is possible.
There are three primary types of image segmentation, each addressing a different level of visual understanding.
Semantic segmentation assigns a class label to every pixel in an image. All pixels belonging to the same object category receive the same label, even when they belong to distinct individual objects. For example, in a street scene, all cars would be labeled as "car" and all pedestrians as "person," but no distinction would be made between individual cars or individual pedestrians.
Segmentation research divides the visual world into two broad categories. "Stuff" refers to amorphous, uncountable regions such as sky, road, grass, and water. "Things" are countable object categories such as cars, people, and animals. Semantic segmentation labels both stuff and things but does not separate individual object instances.
Instance segmentation extends the concept of semantic segmentation by not only classifying each pixel but also distinguishing between separate instances of the same class. In a scene with three cars, instance segmentation would produce three distinct masks, one for each car, rather than merging them into a single "car" region. Instance segmentation typically focuses on "things" (countable objects) and does not label background or amorphous regions.
Panoptic segmentation, first formalized by Alexander Kirillov and colleagues in a 2018 paper (published at CVPR 2019), combines semantic and instance segmentation into a unified task. Every pixel in the image must receive both a semantic class label and an instance ID. For "stuff" categories like sky or road, all pixels of the same class share a single label. For "things" categories like people or vehicles, each individual object gets a unique instance ID. The term "panoptic" comes from the Greek words "pan" (all) and "optic" (vision), reflecting the goal of capturing everything visible in an image.
| Type | What It Labels | Distinguishes Instances? | Handles Stuff? | Handles Things? |
|---|---|---|---|---|
| Semantic segmentation | Every pixel gets a class label | No | Yes | Yes |
| Instance segmentation | Pixels of individual objects | Yes | No | Yes |
| Panoptic segmentation | Every pixel gets a class label and instance ID | Yes | Yes | Yes |
Image segmentation has a long history that predates the deep learning era by several decades. Classical approaches relied on hand-crafted features, mathematical morphology, and optimization techniques.
Thresholding is the simplest and oldest method for image segmentation. It converts a grayscale image into a binary image by selecting a cutoff value (the threshold): pixels above the threshold are assigned to one class, and pixels below it to another. Otsu's method, proposed by Nobuyuki Otsu in 1979, automated the threshold selection process by minimizing intra-class variance of pixel intensities. Thresholding works well for images with clear bimodal intensity distributions but struggles with complex scenes, varying lighting, or overlapping intensity ranges.
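Otsu's criterion can be implemented in a few lines. The sketch below (the function name `otsu_threshold` is illustrative, not from any library) sweeps every candidate threshold and keeps the one that maximizes between-class variance, which is equivalent to minimizing intra-class variance:

```python
import numpy as np

def otsu_threshold(image):
    """Pick the threshold maximizing between-class variance
    (equivalently, minimizing intra-class variance)."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2      # between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# A synthetic bimodal image: dark background, bright square
img = np.full((64, 64), 40, dtype=np.uint8)
img[16:48, 16:48] = 200
t = otsu_threshold(img)
mask = img > t          # binary segmentation
```

On a cleanly bimodal image like this one, any threshold between the two peaks separates the square from the background; real images rarely cooperate so well.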
Region growing starts from one or more seed pixels and iteratively expands outward by adding neighboring pixels that satisfy a similarity criterion, such as a small difference in intensity or color. The process continues until no more pixels can be added. While straightforward to implement, region growing is sensitive to the choice of seed points and similarity thresholds, and it can produce inconsistent results in images with gradual intensity transitions.
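A minimal region-growing sketch, assuming a grayscale image, a single seed, and 4-connectivity (the function `region_grow` and its `tol` parameter are illustrative choices, not a standard API):

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10):
    """Grow a region from `seed`, adding 4-connected neighbors whose
    intensity is within `tol` of the seed pixel's intensity."""
    h, w = image.shape
    seed_val = int(image[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                if abs(int(image[nr, nc]) - seed_val) <= tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask

img = np.full((32, 32), 30, dtype=np.uint8)
img[8:24, 8:24] = 180                      # bright object on dark background
mask = region_grow(img, seed=(16, 16), tol=10)
```

Note the sensitivity the text describes: comparing against the seed's intensity (rather than the running region mean) keeps the criterion stable, but moving the seed or changing `tol` by a few units can change the result substantially on images with soft edges.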
The watershed algorithm, introduced by Serge Beucher and Christian Lantuejoul in 1979, treats an image as a topographic surface where pixel values represent elevation. The algorithm simulates flooding from regional minima: water rises from each minimum, and barriers (watershed lines) are built where water from different sources meets. These barriers define the segment boundaries. The watershed transform is effective for separating touching or overlapping objects but is prone to over-segmentation, often requiring preprocessing (such as marker-based approaches) to produce useful results.
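The flooding process can be sketched as a priority queue keyed by elevation. This is a simplified marker-based variant (names like `marker_watershed` are illustrative): flooding starts only from labeled markers, which is the standard remedy for over-segmentation mentioned above.

```python
import heapq
import numpy as np

def marker_watershed(elevation, markers):
    """Flood from labeled markers in order of increasing elevation.
    Each unlabeled pixel takes the label of the basin that reaches it
    first; basins meeting implicitly defines the boundary."""
    h, w = elevation.shape
    labels = markers.copy()
    heap, counter = [], 0            # counter breaks ties in the heap
    for r in range(h):
        for c in range(w):
            if labels[r, c] > 0:
                heapq.heappush(heap, (elevation[r, c], counter, r, c))
                counter += 1
    while heap:
        _, _, r, c = heapq.heappop(heap)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and labels[nr, nc] == 0:
                labels[nr, nc] = labels[r, c]
                heapq.heappush(heap, (elevation[nr, nc], counter, nr, nc))
                counter += 1
    return labels

# Two flat valleys separated by a ridge at column 8
elev = np.zeros((8, 16))
elev[:, 8] = 10.0                          # the ridge
markers = np.zeros((8, 16), dtype=int)
markers[4, 2], markers[4, 13] = 1, 2       # one marker per valley
labels = marker_watershed(elev, markers)
```

Water from each marker fills its valley at elevation 0 and only crosses the ridge last, so the ridge column ends up as the boundary between the two basins. Without markers, every regional minimum would seed its own basin, which is exactly the over-segmentation problem.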
Graph-based methods model an image as a graph where pixels (or small regions) are nodes and edges connect neighboring pixels, weighted by similarity. The segmentation problem is then framed as a graph partitioning problem. Yuri Boykov and Marie-Paule Jolly published a foundational paper in 2001 on interactive graph cuts for optimal boundary and region segmentation, where users provide seed points for foreground and background, and the algorithm finds the globally optimal cut. GrabCut, introduced by Carsten Rother, Vladimir Kolmogorov, and Andrew Blake in 2004, simplified the interaction to a single bounding box and used iterative graph cuts with Gaussian Mixture Models for more automated foreground extraction. Normalized cuts (Shi and Malik, 2000) offered another influential graph partitioning framework that balanced the cut cost against segment sizes.
Superpixel algorithms group pixels into small, perceptually meaningful regions that respect object boundaries. Rather than working with individual pixels, downstream algorithms can operate on these compact regions, reducing computational cost. Simple Linear Iterative Clustering (SLIC), proposed by Radhakrishna Achanta and colleagues in 2012, became one of the most popular superpixel methods due to its speed, simplicity, and the quality of the resulting segments. SLIC adapts k-means clustering in a five-dimensional space of color and spatial coordinates to produce compact, roughly uniform superpixels.
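The core of SLIC is plain k-means in a joint color-spatial space. The sketch below (`simple_slic` is an illustrative name) works on a grayscale image and, unlike real SLIC, searches all centers rather than a local 2S x 2S window, trading speed for brevity:

```python
import numpy as np

def simple_slic(image, n_segments=16, compactness=10.0, n_iters=5):
    """SLIC sketch: k-means over (intensity, row, col), with the
    spatial term scaled by (compactness / S)^2, S = grid interval."""
    h, w = image.shape
    S = int(np.sqrt(h * w / n_segments))       # grid interval
    rows = np.arange(S // 2, h, S)             # centers on a regular grid
    cols = np.arange(S // 2, w, S)
    centers = np.array([[image[r, c], r, c] for r in rows for c in cols],
                       dtype=float)
    rr, cc = np.mgrid[0:h, 0:w]
    for _ in range(n_iters):
        d_color = (image[None] - centers[:, 0, None, None]) ** 2
        d_space = ((rr[None] - centers[:, 1, None, None]) ** 2 +
                   (cc[None] - centers[:, 2, None, None]) ** 2)
        dist = d_color + (compactness / S) ** 2 * d_space
        labels = dist.argmin(axis=0)           # assignment step
        for k in range(len(centers)):          # update step
            m = labels == k
            if m.any():
                centers[k] = [image[m].mean(), rr[m].mean(), cc[m].mean()]
    return labels

img = np.tile(np.linspace(0, 255, 32), (32, 1))   # horizontal gradient
labels = simple_slic(img)
```

The `compactness` parameter plays the role described in the SLIC paper: larger values weight the spatial distance more heavily, yielding more regular, grid-like superpixels; smaller values let superpixels follow intensity boundaries more closely.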
The application of deep neural networks to image segmentation transformed the field beginning in 2014 and 2015. Deep learning methods learn feature representations directly from data, eliminating the need for hand-crafted features and dramatically improving accuracy on challenging benchmarks.
The paper "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long, Evan Shelhamer, and Trevor Darrell, presented at CVPR 2015, is widely regarded as the work that launched modern deep learning-based segmentation. The key insight was to adapt classification networks (AlexNet, VGGNet, GoogLeNet) into fully convolutional networks by replacing their fully connected layers with convolutional layers. This allowed the network to accept inputs of arbitrary size and produce dense, pixel-wise predictions.
FCN introduced the concept of using skip connections to combine coarse, high-level semantic information from deeper layers with fine, low-level spatial information from earlier layers. The authors proposed three variants: FCN-32s, which upsampled predictions by a factor of 32 in a single step; FCN-16s, which combined predictions from the final layer and a shallower layer before upsampling; and FCN-8s, which fused predictions from three layers for the finest output. FCN-8s achieved 62.2% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 benchmark, a 20% relative improvement over prior methods at the time. The extended version of the paper was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in 2016.
U-Net, introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015, was designed specifically for biomedical image segmentation, where annotated training data is often scarce. The architecture has a distinctive U-shaped structure consisting of a contracting path (encoder) and an expanding path (decoder) connected by skip connections.
The encoder follows a standard CNN pattern: repeated blocks of two 3x3 convolutions, each followed by a ReLU activation and a 2x2 max pooling operation for downsampling. At each downsampling step, the number of feature channels doubles. The decoder mirrors this structure, using 2x2 up-convolutions to restore spatial resolution. At each step in the decoder, the upsampled feature map is concatenated with the corresponding feature map from the encoder via a skip connection, preserving fine-grained spatial details that would otherwise be lost during downsampling.
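The channel and resolution bookkeeping above can be traced with a few lines of Python. This sketch assumes padded ("same") convolutions, so resolutions halve and double exactly; the original U-Net used unpadded convolutions, which shrink each map slightly (the function `unet_shapes` is illustrative):

```python
def unet_shapes(in_size=256, base_ch=64, depth=4):
    """Trace (channels, resolution) through a U-Net-style network:
    channels double at each 2x2 downsampling; on the way up, each
    upsampled map is concatenated with its encoder counterpart."""
    enc, ch, size = [], base_ch, in_size
    for _ in range(depth):
        enc.append((ch, size))       # saved for the skip connection
        ch, size = ch * 2, size // 2
    bottleneck = (ch, size)
    dec = []
    for skip_ch, skip_size in reversed(enc):
        size *= 2                    # 2x2 up-convolution
        ch //= 2
        dec.append((ch + skip_ch, skip_size))  # concat doubles channels
    return enc, bottleneck, dec

enc, bottleneck, dec = unet_shapes()
```

With the defaults, the encoder stages are (64, 256), (128, 128), (256, 64), (512, 32), the bottleneck is (1024, 16), and each decoder stage briefly carries twice its convolution width because of the concatenated skip features.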
U-Net won both of the most challenging categories (phase contrast and DIC microscopy) at the ISBI 2015 Cell Tracking Challenge by a large margin, achieving an average IoU of 92% on the PhC-U373 dataset compared to 83% for the second-place method. The architecture's reliance on heavy data augmentation to compensate for limited training samples made it especially popular in medical imaging, where large annotated datasets are rare. U-Net and its many variants (U-Net++, Attention U-Net, TransUNet, nnU-Net) remain among the most widely used segmentation architectures across medical imaging, satellite analysis, and other domains.
The DeepLab family of models, developed primarily by Liang-Chieh Chen and colleagues at Google, introduced several influential ideas for semantic segmentation across four major versions.
DeepLabv1 (Chen et al., ICLR 2015) combined a deep CNN based on VGG-16 with atrous (dilated) convolutions and a fully connected Conditional Random Field (CRF). Atrous convolutions insert gaps (zeros) between filter weights, enlarging the receptive field without increasing the number of parameters or reducing spatial resolution. The CRF post-processing step sharpened object boundaries by modeling pairwise pixel relationships.
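The receptive-field effect of atrous convolution is easiest to see in one dimension. In this sketch (`atrous_conv1d` is an illustrative name), a 3-tap kernel at rate 2 spans 5 input samples while still using only 3 weights:

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1-D atrous (dilated) convolution: `rate - 1` implicit zeros sit
    between kernel taps, widening the receptive field from len(kernel)
    to rate * (len(kernel) - 1) + 1 without adding parameters."""
    k = len(kernel)
    span = rate * (k - 1) + 1            # effective receptive field
    out = np.zeros(len(signal) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * signal[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
same = atrous_conv1d(x, [1.0, 1.0, 1.0], rate=1)  # ordinary convolution
wide = atrous_conv1d(x, [1.0, 1.0, 1.0], rate=2)  # receptive field of 5
```

At rate 1 the first output sums x[0] + x[1] + x[2]; at rate 2 it sums x[0] + x[2] + x[4], skipping every other sample. Stacking such layers grows the receptive field exponentially while keeping the feature map at full resolution, which is exactly what dense prediction needs.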
DeepLabv2 (Chen et al., TPAMI 2017) introduced Atrous Spatial Pyramid Pooling (ASPP), which applies atrous convolutions at multiple dilation rates in parallel. Each branch captures context at a different spatial scale, and their outputs are fused to produce a multi-scale feature representation. This design allowed the model to handle objects of widely varying sizes within a single image. DeepLabv2 also adopted ResNet as its backbone.
DeepLabv3 (Chen et al., 2017, arXiv:1706.05587) refined the ASPP module by augmenting it with image-level features (global average pooling) to capture broader context. It also incorporated batch normalization to stabilize training and removed the CRF post-processing step used in earlier versions, simplifying the pipeline while maintaining strong performance.
DeepLabv3+ (Chen et al., ECCV 2018) added a lightweight decoder module to the DeepLabv3 encoder, creating a proper encoder-decoder architecture. It also adopted depthwise separable convolutions (inspired by the Xception architecture) in both the ASPP module and the decoder, improving both speed and accuracy. DeepLabv3+ achieved 89.0% mIoU on PASCAL VOC 2012 and 82.1% mIoU on Cityscapes.
| Version | Year | Key Innovation | CRF Used? | Backbone |
|---|---|---|---|---|
| DeepLabv1 | 2015 | Atrous convolution + CRF | Yes | VGG-16 |
| DeepLabv2 | 2017 | Atrous Spatial Pyramid Pooling (ASPP) | Yes | ResNet-101 |
| DeepLabv3 | 2017 | Improved ASPP + image-level features, no CRF | No | ResNet-101 |
| DeepLabv3+ | 2018 | Encoder-decoder + depthwise separable convolution | No | Modified Xception |
The Pyramid Scene Parsing Network (PSPNet), proposed by Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia at CVPR 2017, addressed the problem of capturing global context for scene parsing. Many segmentation errors arise because models fail to consider the broader scene: for example, misclassifying a boat on water as a car because the model only looks at local features.
PSPNet's core contribution is the Pyramid Pooling Module (PPM), which divides the feature map into grids of different sizes (typically 1x1, 2x2, 3x3, and 6x6), applies global average pooling within each grid cell, and then upsamples and concatenates the results with the original feature map. This produces a representation that encodes context at multiple spatial granularities, from global scene-level information to local fine-grained details.
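A single-channel sketch of the Pyramid Pooling Module (the function `pyramid_pooling` is illustrative; the real module also applies a 1x1 convolution per branch and uses bilinear rather than nearest-neighbor upsampling):

```python
import numpy as np

def pyramid_pooling(feat, grid_sizes=(1, 2, 3, 6)):
    """Average-pool a feature map over n x n grids, upsample each
    pooled map back to full size, and stack with the original."""
    h, w = feat.shape
    pyramids = [feat]
    for n in grid_sizes:
        pooled = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                cell = feat[i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
                pooled[i, j] = cell.mean()
        # nearest-neighbor upsampling back to (h, w)
        up = pooled[np.arange(h) * n // h][:, np.arange(w) * n // w]
        pyramids.append(up)
    return np.stack(pyramids)        # (1 + len(grid_sizes), h, w)

feat = np.arange(64, dtype=float).reshape(8, 8)
out = pyramid_pooling(feat)
```

The 1x1 branch reduces to global average pooling, giving every pixel access to a scene-level summary; the finer grids preserve progressively more spatial detail, which is how the module mixes global and local context.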
PSPNet used a dilated ResNet as its backbone and achieved first place in the ImageNet Scene Parsing Challenge 2016. It reached 85.4% mIoU on PASCAL VOC 2012 and 80.2% mIoU on the Cityscapes test set.
Mask R-CNN, published by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at ICCV 2017, became the dominant framework for instance segmentation. The model extends Faster R-CNN, a two-stage object detector, by adding a parallel branch that predicts a binary segmentation mask for each detected object, alongside the existing branches for bounding box regression and classification.
A key technical contribution of Mask R-CNN is RoIAlign, which replaces the RoI Pooling operation from Faster R-CNN. RoI Pooling uses quantized (rounded) coordinates when extracting features from regions of interest, introducing spatial misalignments. RoIAlign instead uses bilinear interpolation to compute exact feature values at non-integer locations, preserving precise spatial correspondence. This seemingly small change led to significant improvements in mask quality.
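The bilinear interpolation at the heart of RoIAlign is simple to state in isolation. This sketch (`bilinear_sample` is an illustrative name) computes a feature value at a non-integer location from its four surrounding cells, which is exactly what RoI Pooling's coordinate rounding throws away:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a feature map at a fractional (y, x) location by
    bilinearly weighting the four surrounding cells."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
center = bilinear_sample(feat, 0.5, 0.5)   # equal blend of all four cells
```

RoIAlign evaluates several such samples per output bin and averages them, so the extracted features shift smoothly with the proposed box instead of jumping at cell boundaries; that sub-pixel fidelity is what improves mask quality.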
The mask branch itself is a small fully convolutional network applied to each region of interest, predicting a binary mask for each class independently. Mask R-CNN runs at approximately 5 frames per second and outperformed all single-model entries in the COCO 2016 instance segmentation, object detection, and keypoint detection challenges. The framework has been extended to numerous tasks including human pose estimation and 3D object reconstruction.
The success of the transformer architecture in natural language processing inspired researchers to apply self-attention mechanisms to visual segmentation. Transformer-based models have gradually overtaken purely convolutional architectures on major benchmarks.
SegFormer, proposed by Enze Xie and colleagues at NeurIPS 2021, introduced a clean and efficient design for semantic segmentation using transformers. The architecture consists of two main components: a hierarchical transformer encoder and a lightweight All-MLP decoder.
The encoder produces multi-scale feature maps (at 1/4, 1/8, 1/16, and 1/32 of the input resolution) using a mix of self-attention and efficient operations. Notably, SegFormer avoids positional encodings, which allows it to handle variable input resolutions at test time without interpolation artifacts. The MLP decoder aggregates features from all encoder stages by unifying their channel dimensions, upsampling to a common resolution, concatenating them, and applying a final MLP to produce predictions.
The authors released a family of models from SegFormer-B0 (lightweight, 3.8M parameters) to SegFormer-B5 (high-performance, 84.7M parameters). SegFormer-B5 achieved 84.0% mIoU on the Cityscapes validation set and demonstrated strong zero-shot robustness on corrupted versions of the dataset.
Mask2Former (Cheng et al., CVPR 2022) established a unified architecture capable of handling semantic, instance, and panoptic segmentation with a single model. Building on the earlier MaskFormer work, which framed segmentation as a mask classification problem, Mask2Former introduced masked attention to constrain cross-attention within predicted mask regions rather than attending to the full image. This localized attention mechanism improved both efficiency and accuracy.
The architecture consists of a backbone (ResNet or Swin Transformer), a pixel decoder that produces multi-scale feature maps, and a transformer decoder that generates a set of mask predictions and class labels from learnable queries. Mask2Former set new state-of-the-art results across multiple benchmarks: 57.8 PQ on COCO panoptic, 50.1 AP on COCO instance, and 57.7 mIoU on ADE20K semantic segmentation.
OneFormer, introduced by Jitesh Jain and colleagues at CVPR 2023, took the unified segmentation concept a step further. While Mask2Former trains separate models for each task (semantic, instance, panoptic), OneFormer uses a single model trained once on panoptic annotations with a task-conditioned joint training strategy. A task token conditions the model on which segmentation task it should perform, making the architecture task-dynamic at inference time.
OneFormer also introduces a query-text contrastive loss that uses a text encoder to create better distinctions between tasks and between classes. With a single trained model, OneFormer outperformed task-specific Mask2Former models on ADE20K, Cityscapes, and COCO across all three segmentation tasks.
The Segment Anything Model (SAM), released by Meta AI in April 2023 and presented at ICCV 2023, represents a paradigm shift toward foundation models for segmentation. Rather than training a model for a specific set of classes, SAM is designed as a general-purpose, promptable segmentation system.
SAM consists of three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder is a Vision Transformer (ViT) pretrained with Masked Autoencoders (MAE) that produces image embeddings. The prompt encoder handles various types of input prompts, including points (foreground or background), bounding boxes, rough masks, and free-form text. The lightweight mask decoder combines image and prompt embeddings using a modified transformer architecture and produces segmentation masks in real time.
A key property of SAM is ambiguity awareness. When a prompt is ambiguous (for example, a single point on a person's shirt could refer to the shirt, the person, or the whole group), SAM outputs multiple valid masks at different levels of granularity.
SAM was trained on the SA-1B dataset, which contains 1.1 billion segmentation masks across 11 million high-resolution, diverse, and privacy-protecting images. The dataset was built using a data engine with three stages: in the first stage, human annotators created masks with SAM's assistance; in the second stage, a mix of automatic and manual annotation was used; and in the third stage, masks were generated fully automatically by the model. SA-1B is the largest segmentation dataset ever created, dwarfing previous datasets by orders of magnitude.
SAM demonstrated strong zero-shot transfer capabilities, matching or exceeding the performance of fully supervised models on many tasks without any task-specific training.
SAM 2, released by Meta in July 2024, extends the Segment Anything concept to video. While SAM processes individual images, SAM 2 can track and segment objects across video frames, handling occlusions, reappearances, and changes in object appearance over time.
SAM 2 uses a transformer architecture with streaming memory that stores information about the target object from previously processed frames. Users can provide prompts (points, boxes, or masks) on any frame, and the model propagates the segmentation forward and backward through the video. The model achieves better accuracy than prior video segmentation approaches while requiring three times fewer user interactions.
SAM 2 was trained on the SA-V dataset, the largest video segmentation dataset to date, containing approximately 600,000 masklets (spatiotemporal masks) from about 51,000 videos spanning 47 countries. For image segmentation, SAM 2 is also more accurate and six times faster than the original SAM. Both SAM 2 and the SA-V dataset were released under permissive open-source licenses (Apache 2.0 and CC BY 4.0, respectively).
Evaluating segmentation models requires metrics that capture different aspects of prediction quality. The choice of metric depends on the segmentation type being evaluated.
Pixel accuracy is the simplest metric: it measures the percentage of pixels in the image that are correctly classified. While intuitive, pixel accuracy can be misleading in datasets with class imbalance. If 90% of an image is background, a model that predicts everything as background achieves 90% pixel accuracy despite being useless for identifying foreground objects.
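The imbalance problem is easy to demonstrate numerically. In this toy example, a model that predicts "background" for every pixel scores 90% pixel accuracy while achieving zero IoU on the foreground class:

```python
import numpy as np

gt = np.zeros((10, 10), dtype=int)
gt[4:6, 3:8] = 1                      # 10 foreground pixels (10% of image)
pred = np.zeros_like(gt)              # predicts background everywhere

pixel_acc = (pred == gt).mean()       # 0.9, despite finding nothing
inter = np.logical_and(pred == 1, gt == 1).sum()
union = np.logical_or(pred == 1, gt == 1).sum()
iou_fg = inter / union                # 0.0 -- IoU exposes the failure
```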
Intersection over Union, also called the Jaccard Index, is the standard metric for semantic segmentation. For a given class, IoU is computed as the area of overlap between the predicted and ground-truth regions divided by the area of their union:
IoU = TP / (TP + FP + FN)
where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. Mean IoU (mIoU) averages the IoU across all classes and is the primary metric for benchmarks like PASCAL VOC, Cityscapes, and ADE20K.
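The formula translates directly into code. This sketch (`mean_iou` is an illustrative name; benchmark implementations differ in details such as ignore labels) computes per-class IoU from TP, FP, and FN pixel counts and averages over the classes present:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes
    that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        if tp + fp + fn == 0:
            continue                  # class absent everywhere: skip it
        ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))

gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
miou = mean_iou(pred, gt, num_classes=2)   # (4/5 + 3/4) / 2 = 0.775
```

One pixel of class 1 is mislabeled as class 0, so it counts as a false positive for class 0 and a false negative for class 1, lowering both per-class IoUs.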
For instance segmentation, Average Precision is the standard metric, borrowed from object detection. A predicted mask is considered a true positive if its IoU with a ground-truth mask exceeds a threshold. AP is computed across a range of IoU thresholds (typically 0.50 to 0.95 in steps of 0.05 on the COCO benchmark), and the results are averaged.
Panoptic Quality, proposed by Kirillov et al. alongside the panoptic segmentation task, provides a unified evaluation for both stuff and things classes. PQ is decomposed into two factors:
PQ = SQ x RQ
Segmentation Quality (SQ) is the average IoU of matched segments. Recognition Quality (RQ) is the F1 score of the segment matching, capturing how well the model detects segments. PQ is computed per class and then averaged, ensuring that both large and small classes contribute equally to the final score.
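Given the matched segments for one class, PQ reduces to a few lines. In this sketch (`panoptic_quality` is an illustrative name), `matches` holds the IoU of each matched prediction/ground-truth pair (pairs count as matched only when IoU > 0.5, which the PQ definition guarantees is unique):

```python
def panoptic_quality(matches, num_pred, num_gt):
    """PQ = SQ x RQ for one class. Unmatched predictions are false
    positives; unmatched ground-truth segments are false negatives."""
    tp = len(matches)
    fp = num_pred - tp
    fn = num_gt - tp
    if tp == 0:
        return 0.0
    sq = sum(matches) / tp                   # mean IoU of matched pairs
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)     # F1-style recognition term
    return sq * rq

# Two matches (IoUs 0.8 and 0.6), one spurious prediction,
# one missed ground-truth segment
pq = panoptic_quality([0.8, 0.6], num_pred=3, num_gt=3)
```

Here SQ = 0.7 and RQ = 2/3, so PQ is roughly 0.467: the score is pulled down both by imperfect mask overlap and by the detection errors, which is the intended decomposition.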
| Metric | Segmentation Type | What It Measures | Range |
|---|---|---|---|
| Pixel accuracy | Semantic | Percentage of correctly labeled pixels | 0 to 100% |
| mIoU | Semantic | Average overlap between predicted and ground-truth regions per class | 0 to 100% |
| AP | Instance | Precision of mask predictions across IoU thresholds | 0 to 100% |
| PQ | Panoptic | Combined segmentation and recognition quality | 0 to 100% |
Progress in image segmentation has been driven by standardized benchmark datasets that allow researchers to compare methods under controlled conditions.
The PASCAL Visual Object Classes (VOC) dataset, particularly the 2012 edition, is one of the oldest and most widely used segmentation benchmarks. It contains roughly 11,000 images covering 20 object classes plus background, with pixel-level annotations for semantic segmentation. PASCAL VOC served as the primary benchmark for many early deep learning segmentation methods, including FCN and DeepLab. Although smaller than newer datasets, it remains a standard reference point for evaluating segmentation algorithms.
Cityscapes is a large-scale dataset for urban scene understanding, collected from street-level views in 50 cities across Germany and neighboring countries. It includes 5,000 images with fine pixel-level annotations (2,975 training, 500 validation, 1,525 test) and an additional 20,000 images with coarse annotations. The dataset covers 19 semantic classes relevant to driving scenarios, including road, car, pedestrian, building, and vegetation. Cityscapes is a primary benchmark for autonomous driving segmentation research.
ADE20K, developed by researchers at MIT, is a large-scale scene parsing dataset containing more than 25,000 images with dense annotations spanning 150 object and stuff categories. The dataset covers a wide variety of indoor and outdoor scenes, making it more diverse than driving-focused datasets like Cityscapes. ADE20K served as the basis for the ImageNet Scene Parsing Challenge and is one of the standard benchmarks for evaluating semantic segmentation models.
The Common Objects in Context (COCO) dataset is one of the most comprehensive benchmarks for visual recognition. For segmentation tasks, COCO provides over 200,000 images with annotations covering 80 object categories for instance segmentation and 133 categories for panoptic segmentation (80 things + 53 stuff). COCO's diverse scenes, with multiple objects per image at various scales and in complex arrangements, make it a challenging and widely adopted benchmark.
SA-1B, introduced alongside SAM in 2023, is the largest segmentation dataset to date. It contains 1.1 billion automatically generated masks across 11 million images. While SA-1B is not used as a traditional benchmark for comparing models, its scale and diversity have made it the foundational training set for promptable segmentation research. The images are high-resolution, geographically diverse, and processed to protect privacy.
| Dataset | Images | Classes | Annotation Type | Primary Use |
|---|---|---|---|---|
| PASCAL VOC 2012 | ~11,000 | 20 + background | Semantic | Semantic segmentation |
| Cityscapes | 5,000 fine + 20,000 coarse | 19 | Semantic, instance, panoptic | Autonomous driving |
| ADE20K | 25,000+ | 150 | Semantic | Scene parsing |
| COCO | 200,000+ | 80 things + 53 stuff | Instance, panoptic | General-purpose benchmarking |
| SA-1B | 11 million | Class-agnostic | Masks (no class labels) | Promptable segmentation training |
The following table summarizes the major deep learning models discussed in this article, along with their target segmentation tasks, year of publication, and notable contributions.
| Model | Year | Task | Key Contribution | Notable Result |
|---|---|---|---|---|
| FCN | 2015 | Semantic | End-to-end pixel-wise prediction with adapted classification networks | 62.2% mIoU on PASCAL VOC 2012 |
| U-Net | 2015 | Semantic (biomedical) | Symmetric encoder-decoder with skip connections | 92% IoU on ISBI 2015 PhC-U373 |
| DeepLabv1 | 2015 | Semantic | Atrous convolutions + CRF | Improved boundary localization |
| DeepLabv2 | 2017 | Semantic | Atrous Spatial Pyramid Pooling (ASPP) | Multi-scale object segmentation |
| PSPNet | 2017 | Semantic | Pyramid Pooling Module for global context | 85.4% mIoU on PASCAL VOC 2012 |
| Mask R-CNN | 2017 | Instance | Mask branch + RoIAlign added to Faster R-CNN | COCO 2016 challenge winner |
| DeepLabv3 | 2017 | Semantic | Improved ASPP + image-level features | Comparable to state of the art without CRF |
| DeepLabv3+ | 2018 | Semantic | Encoder-decoder with depthwise separable convolution | 89.0% mIoU on PASCAL VOC 2012 |
| SegFormer | 2021 | Semantic | Hierarchical transformer + MLP decoder | 84.0% mIoU on Cityscapes |
| Mask2Former | 2022 | Universal | Masked attention for panoptic, instance, and semantic | 57.8 PQ on COCO panoptic |
| OneFormer | 2023 | Universal | Task-conditioned joint training with one model | Outperforms task-specific Mask2Former |
| SAM | 2023 | Promptable | Foundation model trained on 1.1B masks | Strong zero-shot transfer |
| SAM 2 | 2024 | Promptable (image + video) | Streaming memory for video, 6x faster than SAM | 3x fewer interactions for video segmentation |
Image segmentation has found practical use across a wide range of industries and research fields.
In healthcare, segmentation is used to delineate anatomical structures and pathological regions in CT scans, MRI images, X-rays, and histopathology slides. Tumor detection and volumetric measurement, organ segmentation for surgical planning, cell counting in microscopy, and retinal vessel segmentation for diagnosing eye diseases are all common applications. U-Net and its derivatives dominate this space, largely because they perform well even with limited annotated training data, which is typical in clinical settings.
Self-driving vehicles rely on real-time semantic and panoptic segmentation to understand their surroundings. Segmentation models identify drivable surfaces, lane markings, vehicles, pedestrians, cyclists, traffic signs, and obstacles from camera feeds. Accurate pixel-level understanding is essential for safe navigation, path planning, and collision avoidance. The Cityscapes dataset was specifically created to support research in this domain, and models like DeepLabv3+ and transformer-based architectures are commonly deployed in autonomous driving pipelines.
Segmentation of satellite and aerial images supports land use classification, urban planning, deforestation monitoring, flood mapping, crop health assessment, and disaster response. The ability to classify every pixel in a satellite image into categories such as forest, water, urban area, and agricultural land is valuable for environmental science, government agencies, and agricultural businesses.
Image and video segmentation powers features like background removal, object selection, and rotoscoping in video editing software. Tools built on segmentation models allow users to isolate subjects from backgrounds, apply selective effects, and create composites. SAM and SAM 2 have made these capabilities more accessible by enabling interactive, promptable segmentation that works without task-specific training.
Robots performing manipulation, navigation, or inspection tasks use segmentation to identify and locate objects in their environment. Grasping a specific item from a cluttered shelf, for instance, requires the robot to segment individual objects and estimate their shapes. Instance segmentation is especially important in robotic manipulation, where the robot needs to distinguish between multiple similar objects.
Augmented reality (AR) applications use real-time segmentation to overlay digital content onto the physical world. Accurate segmentation of people, surfaces, and objects allows AR systems to place virtual objects realistically, apply body or face filters, and enable occlusion handling where virtual objects appear behind real ones.
Despite significant advances, image segmentation still faces several challenges.
Boundary precision. Predicting exact object boundaries remains difficult, especially for objects with irregular shapes, thin structures (such as bicycle spokes or tree branches), or fuzzy edges. Many models produce masks that are slightly dilated or eroded compared to the true object boundary.
Class imbalance. In many real-world datasets, some classes occupy far more pixels than others. Models can become biased toward majority classes and fail to segment rare but important objects. Techniques like class-weighted loss functions, oversampling, and focal loss partially address this issue.
Domain shift. Models trained on one dataset often perform poorly when applied to images from a different domain (for example, a model trained on daytime driving scenes tested on nighttime images). Domain adaptation and domain generalization remain active research topics.
Real-time performance. Many applications, particularly autonomous driving and robotics, require segmentation at high frame rates. Balancing accuracy with computational efficiency is an ongoing tradeoff, though lightweight architectures and hardware acceleration continue to close this gap.
3D and volumetric segmentation. Extending 2D segmentation to 3D data, such as medical CT volumes or LiDAR point clouds, introduces additional complexity. While 3D extensions of U-Net and other architectures exist, 3D segmentation remains more computationally expensive and less mature than its 2D counterpart.
Open-vocabulary segmentation. Traditional models are limited to a fixed set of predefined classes. Open-vocabulary segmentation aims to segment objects based on arbitrary text descriptions, bridging segmentation with large language models and vision-language models. This is an active and rapidly evolving area of research.