DeepLab

Updated Jun 28, 2026

DeepLab is a family of deep convolutional neural network architectures for semantic segmentation, developed by Liang-Chieh Chen and collaborators at UCLA and Google between 2014 and 2018 ^[1]^[2]^[4]. Its signature techniques are atrous (dilated) convolution, which enlarges a filter's field of view without adding parameters or reducing resolution, and Atrous Spatial Pyramid Pooling (ASPP), which captures multi-scale context in a single pass ^[2]. The family evolved through four main versions: v1 and v2 paired the network with a fully connected Conditional Random Field (CRF) to sharpen boundaries; v3 dropped the CRF and strengthened ASPP; and v3+ added an encoder-decoder structure for sharper object edges, reaching 89.0 percent mean Intersection-over-Union (mIoU) on PASCAL VOC 2012 and 82.1 percent on Cityscapes ^[3]^[4].

The family popularised atrous (dilated) convolution for dense prediction, introduced ASPP for capturing multi-scale context, and added an encoder-decoder structure with atrous separable convolutions, holding state-of-the-art results on the PASCAL VOC and Cityscapes benchmarks for several years ^[1]^[2]^[3]^[4]. The line of work continued at Google Research after the v3+ release with extensions to panoptic segmentation (Panoptic-DeepLab, MaX-DeepLab, kMaX-DeepLab) ^[5]^[6]^[7].

The original DeepLab paper, "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs" (arXiv:1412.7062), was first posted in December 2014 and presented at ICLR 2015 ^[1]. Subsequent versions were published in TPAMI (v2, 2017), as a technical report (v3, 2017), and at ECCV 2018 (v3+) ^[2]^[3]^[4]. Together, the DeepLab papers have been cited tens of thousands of times and form one of the most widely used baselines in image segmentation research and production ^[2].

What is DeepLab?

Semantic segmentation is the task of assigning a class label to every pixel in an image, producing a dense prediction map at the same spatial resolution as the input. DeepLab models tackle this problem with deep convolutional networks repurposed from image classification. The central design challenge that DeepLab addresses is that classification networks aggressively reduce spatial resolution through pooling and strided convolutions to produce a small, semantically rich feature map, which is the opposite of what dense prediction needs. DeepLab's solution is to keep the deep features semantically rich while preserving (or recovering) spatial resolution, and to expose those features to multiple receptive-field sizes so that objects at different scales can be segmented in one forward pass.

The v2 paper states the goal directly: it makes "three main contributions that are experimentally shown to have substantial practical merit," namely atrous convolution, ASPP, and the combination of deep nets with fully connected CRFs ^[2]. The family is best understood as a four-step evolution. DeepLab v1 introduced the atrous (dilated) convolution and a fully connected conditional random field (CRF) post-processor ^[1]. DeepLab v2 added the Atrous Spatial Pyramid Pooling (ASPP) module and switched the backbone to ResNet-101 ^[2]. DeepLab v3 improved ASPP with batch normalisation and image-level features and removed the CRF ^[3]. DeepLab v3+ added a lightweight decoder for sharper boundaries and adopted depthwise atrous separable convolutions, often paired with an Xception backbone ^[4].

Who created DeepLab?

DeepLab originated as a UCLA / Google collaboration. The first paper was authored by Liang-Chieh Chen (then a PhD student at UCLA, advised by Alan Yuille), George Papandreou (Google), Iasonas Kokkinos (then at Ecole Centrale Paris / INRIA), Kevin Murphy (Google), and Alan L. Yuille (UCLA) ^[1]. The DeepLab v3 and v3+ papers shifted the author list toward Google Research, with Florian Schroff and Hartwig Adam joining and Yukun Zhu added on v3+ ^[3]^[4].

Liang-Chieh Chen joined Google after his PhD and continued the line of work, authoring or co-authoring Panoptic-DeepLab (CVPR 2020), Axial-DeepLab (ECCV 2020), MaX-DeepLab (CVPR 2021), and kMaX-DeepLab (ECCV 2022) ^[5]^[6]^[7]. He later moved to ByteDance.

Paper	Year	Venue	First / corresponding author
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (DeepLab v1)	2014 / 2015	arXiv 1412.7062, ICLR 2015	Liang-Chieh Chen
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (DeepLab v2)	2016 / 2017	arXiv 1606.00915, TPAMI 2017	Liang-Chieh Chen
Rethinking Atrous Convolution for Semantic Image Segmentation (DeepLab v3)	2017	arXiv 1706.05587	Liang-Chieh Chen
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLab v3+)	2018	arXiv 1802.02611, ECCV 2018	Liang-Chieh Chen
Panoptic-DeepLab	2019 / 2020	arXiv 1911.10194, CVPR 2020	Bowen Cheng
MaX-DeepLab	2020 / 2021	arXiv 2012.00759, CVPR 2021	Huiyu Wang
kMaX-DeepLab	2022	ECCV 2022	Qihang Yu

Why did DeepLab matter?

Before DeepLab, the dominant approach to semantic segmentation with deep nets was the Fully Convolutional Network (FCN) of Long, Shelhamer and Darrell, presented at CVPR 2015 ^[8]. FCN demonstrated that a classification network could be turned into a dense predictor by replacing its fully connected layers with convolutions and adding a deconvolution stage to upsample the small output back to the input size. The technique worked, but the predictions were blurry, especially around object boundaries, because the backbone had downsampled the feature map by a factor of 32 before the upsample.

DeepLab proposed a different fix. Rather than upsample blurry features, it kept the features at a higher resolution in the first place, using atrous convolution to expand the receptive field without striding ^[1]. It then used a fully connected CRF as a post-processing step to sharpen boundaries by enforcing pixel-pair consistency across the whole image ^[1]^[14]. This combination produced sharper segmentations than FCN at comparable cost, and it pushed the PASCAL VOC 2012 test set mIoU from FCN's roughly 62 percent to 71.6 percent in the original DeepLab v1 ^[1]^[8].

The design choices that DeepLab introduced (atrous convolution, ASPP, encoder-decoder refinement) became standard tools in segmentation. Many segmentation papers published between 2015 and 2020 built on DeepLab's components, compared against DeepLab as a baseline, or both. Atrous convolution in particular spread well beyond segmentation and is now common in object detection and dense prediction more generally.

How did DeepLab evolve from v1 to v3+?

The four main DeepLab releases each address a specific limitation of the previous one. The table below summarises the key change in each version, the backbone typically used, the use of CRF post-processing, and the headline result on the PASCAL VOC 2012 test server.

Version	Year	Backbone	Key new ideas	CRF	PASCAL VOC 2012 test mIoU
DeepLab v1	2014 / ICLR 2015	VGG-16	Atrous ("hole") convolution; fully connected CRF post-processing	Yes	71.6 percent
DeepLab v2	2016 / TPAMI 2017	ResNet-101 (also VGG-16 in ablations)	Atrous Spatial Pyramid Pooling (ASPP); multi-scale inputs	Yes	79.7 percent
DeepLab v3	2017	ResNet-101 / ResNet-50	Improved ASPP with batch normalisation and image-level features; cascaded atrous; CRF dropped	No	86.9 percent (with JFT-300M pretraining)
DeepLab v3+	2018 / ECCV 2018	Xception-65, Xception-71 (also ResNet-101, MobileNet-v2)	Encoder-decoder structure; atrous depthwise separable convolutions	No	89.0 percent

The v3 result of 86.9 percent on PASCAL VOC 2012 test is reported with a ResNet-101 backbone pretrained on both ImageNet and the Google-internal JFT-300M dataset; without JFT, DeepLab v3 reaches 85.7 percent, still ahead of every earlier DeepLab version despite using no CRF ^[3]. The v3+ result of 89.0 percent uses an Xception-65 backbone with JFT-300M and COCO pretraining ^[4].

DeepLab v1 (2014 / ICLR 2015)

The original DeepLab paper combined two ideas. First, it adapted VGG-16 for dense prediction by replacing the last two pooling layers' striding with atrous convolutions, which kept the output stride at 8 instead of the original 32. Second, it ran a dense conditional random field (the model of Krahenbuhl and Koltun, 2011) on the network's softmax output as a post-processing step, producing a sharper, more spatially consistent segmentation ^[1]^[14]. The paper frames the problem as a localisation issue: "responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation," a gap it closes by "combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF)" ^[1]. The system reached 71.6 percent mIoU on the PASCAL VOC 2012 test set and ran at about 8 frames per second on a contemporary GPU ^[1].
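
The pairwise term of the dense CRF can be sketched in a few lines. This is a minimal illustration of the two-kernel form (an appearance kernel on position and colour, plus a smoothness kernel on position only) from Krahenbuhl and Koltun ^[14]; the function name `pairwise_affinity` and every default weight and bandwidth below are illustrative placeholders, not the values tuned in the DeepLab papers.

```python
import math

def pairwise_affinity(p_i, p_j, rgb_i, rgb_j,
                      w_appearance=4.0, w_smooth=3.0,
                      sigma_alpha=60.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Affinity between two pixels; all defaults are illustrative placeholders."""
    d_pos = sum((a - b) ** 2 for a, b in zip(p_i, p_j))      # squared pixel distance
    d_rgb = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))  # squared colour distance
    # Appearance kernel: nearby pixels with similar colour attract strongly.
    appearance = math.exp(-d_pos / (2 * sigma_alpha ** 2)
                          - d_rgb / (2 * sigma_beta ** 2))
    # Smoothness kernel: nearby pixels attract regardless of colour.
    smoothness = math.exp(-d_pos / (2 * sigma_gamma ** 2))
    return w_appearance * appearance + w_smooth * smoothness

# Two neighbouring pixels with almost the same colour ...
similar = pairwise_affinity((10, 10), (12, 10), (200, 30, 30), (198, 32, 29))
# ... versus the same two positions straddling a red/green colour edge.
across_edge = pairwise_affinity((10, 10), (12, 10), (200, 30, 30), (20, 180, 40))
```

In the CRF this affinity is charged as a penalty only when the two pixels receive different labels, so `similar > across_edge` means the model discourages a label change inside a uniform region far more than across a colour edge.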

DeepLab v2 (2016 / TPAMI 2017)

The TPAMI version added two important pieces. The backbone was upgraded from VGG-16 to ResNet-101, which gave a substantial accuracy boost on its own. More importantly, the paper introduced Atrous Spatial Pyramid Pooling (ASPP), which, in the authors' words, "probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales" ^[2]. In practice the same feature map is processed in parallel by several 3x3 atrous convolutions at different dilation rates (typically 6, 12, 18, 24); in v2 the branch outputs are fused by summation (concatenation followed by a 1x1 projection arrived with v3), capturing context at multiple scales without a costly image pyramid. With ASPP, multi-scale inputs, and dense CRF, DeepLab v2 reached 79.7 percent mIoU on PASCAL VOC 2012 test and set new records on PASCAL-Context, PASCAL-Person-Part, and Cityscapes ^[2].

DeepLab v3 (2017)

"Rethinking Atrous Convolution for Semantic Image Segmentation" simplified the system. The CRF post-processor was removed; the paper reports that the "proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark" ^[3]. The new ASPP module added batch normalisation to all parallel branches and added a global pooling branch (a 1x1 convolution applied to global average pooled features, then upsampled) to encode image-level context. The paper also explored cascaded atrous convolutions, replacing strided blocks deep in the network with atrous blocks at progressively larger rates. With ResNet-101 pretrained on ImageNet plus JFT-300M, DeepLab v3 reached 86.9 percent mIoU on PASCAL VOC 2012 test ^[3].

DeepLab v3+ (2018 / ECCV 2018)

DeepLab v3+ kept the v3 ASPP encoder but added a small decoder. Encoder features from ASPP (at output stride 16) are upsampled by 4x and concatenated with low-level features from an early backbone block (typically conv2 of ResNet or the entry-flow block of Xception); the fused features are refined by a few 3x3 convolutions and then upsampled by another 4x to full resolution ^[4]. This recovers sharper object boundaries that are otherwise lost when upsampling directly from output stride 16 to full resolution.
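
The decoder's data flow is easiest to follow as a trace of tensor shapes. The sketch below is only bookkeeping, not a network: the function name is hypothetical, the channel counts (256 encoder channels, low-level features reduced to 48 channels) and the 21-class output are assumptions chosen for illustration, and input sizes are assumed to be divisible by the encoder stride.

```python
def v3plus_decoder_shapes(height, width, num_classes=21,
                          encoder_stride=16, low_level_stride=4,
                          encoder_channels=256, low_level_channels=48):
    """Trace (channels, height, width) through a v3+-style decoder."""
    if height % encoder_stride or width % encoder_stride:
        raise ValueError("this sketch assumes sizes divisible by the encoder stride")
    up = encoder_stride // low_level_stride            # 4x for stride 16 -> 4
    enc_h, enc_w = height // encoder_stride, width // encoder_stride
    low_h, low_w = height // low_level_stride, width // low_level_stride
    return {
        "encoder": (encoder_channels, enc_h, enc_w),
        "encoder_upsampled": (encoder_channels, enc_h * up, enc_w * up),
        "low_level": (low_level_channels, low_h, low_w),
        # Concatenation stacks channels; spatial sizes must already match.
        "concatenated": (encoder_channels + low_level_channels, low_h, low_w),
        # Final upsampling returns to the input resolution.
        "logits": (num_classes, low_h * low_level_stride, low_w * low_level_stride),
    }

shapes = v3plus_decoder_shapes(512, 512)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

For a 512x512 input the encoder output is 32x32, which only lines up with the 128x128 low-level features after the first 4x upsampling; that alignment is what makes the skip connection possible.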

The paper also proposed atrous separable convolutions: depthwise separable convolutions in which the depthwise convolution is itself atrous. This dramatically reduces compute compared to a dense atrous 3x3 while preserving accuracy. With an Xception-65 backbone pretrained on ImageNet plus JFT-300M, plus pretraining on COCO, DeepLab v3+ "achieves the test set performance of 89.0% and 82.1% on PASCAL VOC 2012 and Cityscapes datasets without any post-processing" ^[4].
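
The saving from a separable convolution can be checked with simple arithmetic. The sketch below counts weights per layer (which also equals multiply-adds per output position), ignoring biases and batch normalisation; the 256-channel width is an illustrative assumption, and the helper names are hypothetical.

```python
def dense_conv_cost(c_in, c_out, k=3):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_cost(c_in, c_out, k=3):
    """Depthwise k x k (one filter per input channel) plus 1x1 pointwise."""
    return k * k * c_in + c_in * c_out

dense = dense_conv_cost(256, 256)
separable = separable_conv_cost(256, 256)
ratio = dense / separable
print(dense, separable, round(ratio, 2))
```

The dilation rate does not appear in either formula: spreading the taps apart changes where the filter samples, not how many weights it has, so the atrous variant of a separable convolution keeps the same cost.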

Successor models

After v3+, the DeepLab name continued at Google Research with extensions targeting panoptic segmentation (the joint task of semantic and instance segmentation):

Panoptic-DeepLab (Cheng et al., CVPR 2020) added two prediction heads to a DeepLab encoder: a semantic head and a class-agnostic instance head that predicts instance centers and per-pixel offsets. It was the first bottom-up panoptic system to match top-down ones in accuracy, reaching 84.2 percent mIoU, 39.0 percent AP, and 65.5 percent PQ on Cityscapes test ^[5].
Axial-DeepLab (Wang et al., ECCV 2020) replaced the convolutional backbone with axial self-attention layers, an early transformer-flavoured backbone for dense prediction.
MaX-DeepLab (Wang et al., CVPR 2021) was the first end-to-end panoptic segmentation model, using a dual-path mask transformer and a panoptic-quality-inspired bipartite matching loss. It removed the need for box detection and post-processing heuristics ^[6].
kMaX-DeepLab (Yu et al., ECCV 2022) reformulated the cross-attention in a mask transformer as k-means clustering, simplifying training and improving accuracy on COCO and Cityscapes panoptic benchmarks ^[7].
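
The centre-and-offset idea behind Panoptic-DeepLab's instance head can be sketched as a grouping step: each pixel votes for a location by adding its predicted offset to its own coordinates and is assigned to the nearest predicted centre. This is a simplified illustration with hypothetical names and hand-written inputs; it omits centre detection from the heatmap and the restriction to "thing" pixels described in the paper ^[5].

```python
def group_pixels(offsets, centers):
    """Assign each pixel to the index of its nearest predicted instance centre.

    offsets: {(y, x): (dy, dx)} predicted offset from each pixel to its centre
    centers: list of (y, x) predicted instance centres
    """
    assignment = {}
    for (y, x), (dy, dx) in offsets.items():
        ty, tx = y + dy, x + dx                      # where this pixel "votes"
        assignment[(y, x)] = min(
            range(len(centers)),
            key=lambda k: (centers[k][0] - ty) ** 2 + (centers[k][1] - tx) ** 2,
        )
    return assignment

centers = [(2, 2), (2, 8)]
offsets = {
    (2, 3): (0, -1),   # votes for (2, 2)
    (2, 7): (0, 1),    # votes for (2, 8)
    (1, 5): (1, -3),   # far from both, but its offset points at (2, 2)
}
instance_ids = group_pixels(offsets, centers)
```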

What is atrous convolution?

Innovation	First introduced	What it does
Atrous (dilated) convolution	DeepLab v1	Inserts "holes" between filter taps, widening a 3x3 filter's field of view to (2r+1)x(2r+1) at dilation rate r without adding parameters or reducing spatial resolution. Lets a classification backbone produce dense feature maps at output stride 8 or 16 instead of 32.
Fully connected CRF	DeepLab v1	Post-processes the softmax output with a dense pairwise CRF (Krahenbuhl and Koltun 2011) to sharpen boundaries by enforcing colour and position consistency between pixel pairs.
Atrous Spatial Pyramid Pooling (ASPP)	DeepLab v2	Applies several parallel 3x3 atrous convolutions with different dilation rates to the same feature map, capturing multi-scale context without resampling the image.
Improved ASPP with image-level features	DeepLab v3	Adds batch normalisation to ASPP branches and a global pooling branch that encodes image-level context, then concatenates and projects all branches together.
Encoder-decoder with low-level skip	DeepLab v3+	Adds a lightweight decoder that fuses the ASPP encoder output with early-layer features, producing sharper boundaries than direct upsampling.
Atrous depthwise separable convolution	DeepLab v3+	Replaces dense atrous convolutions with depthwise separable variants in both the backbone (Xception) and ASPP, cutting compute substantially.

The v2 paper describes the core idea as "convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks" that "allows us to explicitly control the resolution at which feature responses are computed" ^[2]. Concretely, a standard 3x3 convolution looks at three contiguous taps in each spatial dimension. An atrous 3x3 convolution with rate r looks at three taps spaced r pixels apart, leaving r-1 holes between neighbouring taps. The field of view grows from 3x3 to (2r+1)x(2r+1), but the number of parameters and the per-pixel multiply-add cost are unchanged. By removing the stride from a layer deep in the backbone and dilating the convolutions that follow it, the network keeps the same receptive field while preserving the spatial resolution of the feature map. This is the technical trick that lets DeepLab use a classification backbone at output stride 8 or 16 instead of 32.
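
A one-dimensional version makes the mechanics concrete. The sketch below is a plain-Python illustration (the function names are hypothetical) that uses "valid" padding and the cross-correlation convention common in deep learning libraries; the same three weights are applied with and without holes.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous convolution: taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1              # effective field of view
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]

def effective_kernel_size(kernel_size, rate):
    """Field of view of a kernel after inserting rate-1 holes between taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

signal = list(range(10))         # 0, 1, ..., 9
kernel = [1, 0, -1]              # three weights in both cases

dense = atrous_conv1d(signal, kernel, rate=1)    # taps at i, i+1, i+2
dilated = atrous_conv1d(signal, kernel, rate=3)  # taps at i, i+3, i+6
```

With rate 3 the filter still has three weights but spans seven samples, matching the 2r+1 rule; `effective_kernel_size(3, 6)` gives 13 for the smallest ASPP rate mentioned in this article.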

What is Atrous Spatial Pyramid Pooling (ASPP)?

Objects in natural images appear at many scales. A naive way to handle this is to run the network on the input at several scales and merge the predictions, which multiplies compute. ASPP instead runs several atrous convolutions in parallel on the same feature map, each with a different dilation rate (e.g. 6, 12, 18) ^[2]. Each branch sees the same features through a different receptive field, so collectively they capture small objects, mid-scale objects, and large-context regions in a single pass. The branches are concatenated and fused with a 1x1 convolution. From DeepLab v3 onward, ASPP also includes batch normalisation on every branch and an image-level global-pooling branch ^[3].
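
A toy one-dimensional ASPP shows how the branches differ. This is a sketch under loud assumptions: the input is a single-channel signal rather than a feature map, every branch uses the same fixed averaging weights instead of learned ones, the final 1x1 fusion is replaced by a plain mean, and all names are hypothetical.

```python
def atrous_same(signal, kernel, rate):
    """Atrous filter with zero padding, so the output length equals the input."""
    n, center = len(signal), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - center) * rate
            if 0 <= j < n:                    # taps outside the signal read zero
                acc += w * signal[j]
        out.append(acc)
    return out

def toy_aspp(signal, rates=(6, 12, 18)):
    box = [1 / 3, 1 / 3, 1 / 3]               # placeholder for learned weights
    branches = [atrous_same(signal, box, r) for r in rates]
    # Image-level branch: global average, broadcast to every position.
    branches.append([sum(signal) / len(signal)] * len(signal))
    # "Concatenate": one feature vector per position, one entry per branch.
    return [list(column) for column in zip(*branches)]

signal = [0] * 20 + [1] * 20                  # a step edge at position 20
features = toy_aspp(signal)
fused = [sum(v) / len(v) for v in features]   # stand-in for the 1x1 fusion
```

At position 5 the rate-6 and rate-12 branches see only zeros, the rate-18 branch already reaches across the step, and the image-level branch reports the global mean of 0.5: the same location is described at several scales at once.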

Which backbone networks does DeepLab use?

DeepLab is backbone-agnostic in principle, and the official TensorFlow implementation supports several ^[11]:

Backbone	Used in DeepLab version	Notes
VGG-16	v1, v2 (ablations)	The original backbone in DeepLab v1.
ResNet-101	v2, v3, v3+	The standard "server-side" backbone for v2 and v3; the modified ResNet-101 used in DeepLab keeps output stride 16 via atrous convolution. See ResNet.
ResNet-50	v3, v3+ (light variants)	A lighter, faster alternative to ResNet-101.
Xception-65, Xception-71	v3+	Aligned Xception variants modified to be fully convolutional and to use atrous separable convolutions.
MobileNet-v2, MobileNet-v3	v3, v3+ (mobile)	Used for mobile and on-device segmentation. See MobileNet.
PNASNet	v3+ (research variants)	Backbone discovered by neural architecture search.
Auto-DeepLab / HNASNet	Auto-DeepLab (Liu et al., CVPR 2019)	Backbone and meta-architecture jointly searched for segmentation.
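
How a backbone is held at output stride 8 or 16 can be expressed as a small scheduling rule: keep each stage's stride until the target output stride is reached, then set later strides to 1 and multiply the dilation rate instead. The sketch below abstracts a classification backbone as five stride-2 stages (total stride 32); that abstraction and the function name are assumptions for illustration, not the layout of any specific network.

```python
def atrous_schedule(stage_strides, output_stride):
    """Return (applied_stride, dilation_for_following_convs) for each stage."""
    current_stride, rate, plan = 1, 1, []
    for stride in stage_strides:
        if current_stride >= output_stride:
            # Target resolution reached: drop the stride, dilate what follows.
            rate *= stride
            plan.append((1, rate))
        else:
            current_stride *= stride
            plan.append((stride, rate))
    return plan

backbone = [2, 2, 2, 2, 2]                  # five downsampling stages
os32 = atrous_schedule(backbone, 32)        # ordinary classification network
os16 = atrous_schedule(backbone, 16)
os8 = atrous_schedule(backbone, 8)
```

At output stride 8 the last two stages lose their stride and the convolutions after them run at rates 2 and 4, which compensates for the downsampling that no longer happens.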

How does DeepLab perform on benchmarks?

DeepLab models were the standard top-of-leaderboard entries on PASCAL VOC and Cityscapes from 2015 through about 2019. The headline numbers, all on the official test servers and as reported in the source papers, are:

Benchmark	DeepLab version	mIoU	Notes
PASCAL VOC 2012 test	v1	71.6 percent	VGG-16 backbone, dense CRF ^[1]
PASCAL VOC 2012 test	v2	79.7 percent	ResNet-101 backbone, ASPP, dense CRF ^[2]
PASCAL VOC 2012 test	v3	86.9 percent	ResNet-101, JFT-300M pretraining, no CRF ^[3]
PASCAL VOC 2012 test	v3+	89.0 percent	Xception-65, COCO + JFT pretraining, no CRF ^[4]
Cityscapes test	v3+	82.1 percent	Xception-71, no CRF ^[4]
Cityscapes test	Panoptic-DeepLab	84.2 percent (mIoU); 65.5 percent PQ	Xception-71 backbone ^[5]

For reference, FCN reported about 62 percent mIoU on PASCAL VOC 2012 test (2015) ^[8], and PSPNet reached 85.4 percent on PASCAL VOC 2012 and 80.2 percent on Cityscapes (2017) ^[9]. DeepLab v3+ reached 89.0 percent on PASCAL VOC 2012 in 2018 ^[4].
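
All of these figures are mean Intersection-over-Union scores. As a reminder of what the metric measures, here is a minimal sketch that computes it from a confusion matrix; the function name and the tiny two-class matrix are made up for illustration, and skipping classes that never occur is one common convention rather than the rule of any particular benchmark.

```python
def mean_iou(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed pixels of c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # wrongly labelled c
        union = tp + fp + fn
        if union > 0:                                        # skip absent classes
            ious.append(tp / union)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)

# Two classes: 60 true background pixels, 40 true object pixels.
confusion = [
    [50, 10],
    [5, 35],
]
score = mean_iou(confusion)
```

Here the per-class IoUs are 50/65 and 35/50, so the mean is roughly 0.73 even though 85 percent of pixels are labelled correctly; IoU penalises both missed and spurious pixels for every class.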

What software implements DeepLab?

Implementation	Maintainer	Notes
tensorflow/models/research/deeplab	Google (TensorFlow team)	The official TensorFlow implementation, covering v1 through v3+, with backbones for MobileNet-v2/v3, Xception, ResNet-50/101, PNASNet, and Auto-DeepLab. Apache 2.0 license. ^[11]
torchvision.models.segmentation	PyTorch	Built-in DeepLab v3 with `deeplabv3_resnet50`, `deeplabv3_resnet101`, and `deeplabv3_mobilenet_v3_large` model builders; the pretrained weights are trained on a subset of COCO restricted to the PASCAL VOC categories. ^[12]
kazuto1011/deeplab-pytorch	Community	Widely used PyTorch reimplementation of DeepLab v2 trained on COCO-Stuff and PASCAL VOC.
VainF/DeepLabV3Plus-Pytorch	Community	Popular PyTorch reimplementation of DeepLab v3 and v3+ for VOC and Cityscapes.
MMSegmentation	OpenMMLab	The OpenMMLab segmentation toolbox includes DeepLab v3 and v3+ as standard reference models.
Detectron2 (projects)	Meta AI	Includes Panoptic-DeepLab as a reference panoptic segmentation project.
Panoptic-DeepLab (bowenc0221)	Original authors	Official PyTorch reimplementation of Panoptic-DeepLab for the CVPR 2020 paper. ^[5]

What is DeepLab used for?

DeepLab models are deployed in production across several domains where dense per-pixel prediction is needed:

Autonomous driving: lane, road, vehicle, pedestrian, and free-space segmentation, typically benchmarked on Cityscapes, BDD100K, and KITTI.
Medical imaging: organ, lesion, and tumour segmentation in CT, MRI, and pathology slides. DeepLab v3+ with a ResNet-101 or MobileNet backbone is a common drop-in baseline alongside U-Net-style models.
Satellite and aerial imagery: building footprints, road networks, agricultural plots, deforestation, and damage assessment.
Robotics: scene understanding for manipulation and navigation, often coupled with depth or instance segmentation heads.
Photo and video editing: portrait segmentation for background blur, virtual backgrounds, and effects. Portrait Mode on some Google Pixel generations and similar mobile features are based on DeepLab variants with MobileNet backbones.
Augmented reality: real-time scene segmentation for occlusion handling and surface understanding.
Industrial inspection: defect segmentation on manufactured parts, surface inspection, and quality control.

How does DeepLab compare with other segmentation methods?

Model	Year	Authors	Core idea	Notes
FCN	2015	Long, Shelhamer, Darrell	Classification net with fully convolutional head and skip connections	The first deep, end-to-end semantic segmentation network. About 62 percent mIoU on PASCAL VOC 2012 test. ^[8]
DeepLab v1-v3+	2014-2018	Chen et al.	Atrous convolution + ASPP + (v3+) encoder-decoder	The dominant family of dense-prediction segmentation models from 2015 to about 2020. ^[1]^[2]^[3]^[4]
U-Net	2015	Ronneberger, Fischer, Brox	Symmetric encoder-decoder with skip connections	Designed for biomedical segmentation; the standard baseline for medical imaging.
SegNet	2015 / 2017	Badrinarayanan, Kendall, Cipolla	Encoder-decoder with pooling-index unpooling	Aimed at memory-efficient road-scene segmentation.
PSPNet	2017	Zhao et al.	Pyramid pooling module on top of a deep backbone	85.4 percent on PASCAL VOC 2012 and 80.2 percent on Cityscapes test. Direct contemporary of DeepLab v3. ^[9]
RefineNet	2017	Lin et al.	Multi-path refinement with residual chained pooling	Strong PASCAL VOC and Cityscapes results.
HRNet	2019	Sun et al.	Maintains a high-resolution branch throughout the network	Avoids downsampling-then-upsampling; strong on segmentation, pose, and detection.
Mask R-CNN	2017	He et al.	Adds a mask head to Faster R-CNN	Solves instance segmentation, a different task.
SegFormer	2021	Xie et al.	Hierarchical transformer encoder, lightweight MLP decoder	Showed transformers could match DeepLab on Cityscapes with simpler decoders.
Mask2Former	2022	Cheng et al.	Mask classification with masked attention	A unified semantic, instance, and panoptic model that exceeded DeepLab and PSPNet on most leaderboards by 2022. ^[10]
OneFormer	2023	Jain et al.	Single transformer trained jointly on all three segmentation tasks	Outperforms specialised Mask2Former on ADE20K, Cityscapes, and COCO.

DeepLab v3+ remained competitive with these later models on many practical workloads through 2024, in part because it trains relatively quickly, runs efficiently on a single GPU, and ships with mature open-source code in both TensorFlow and PyTorch.

What is DeepLab's influence on computer vision?

The DeepLab papers are among the most cited works in computer vision. The v2 (TPAMI) paper alone has tens of thousands of citations on Google Scholar, and the v1, v3, and v3+ papers are each in the high thousands or more ^[2]. A few specific influences are worth noting:

Atrous (dilated) convolution became a default tool in segmentation, dense detection, and audio/sequence modelling. The same idea, under the name dilated convolution, is the core of WaveNet and many speech models.
ASPP and close variants appear in dozens of follow-up segmentation models and were widely copied into instance and panoptic segmentation pipelines.
The v3+ decoder became a template for adding low-level skip features to dense prediction heads.
The official tensorflow/models/research/deeplab repository has been a starting point for many production segmentation systems ^[11].
Liang-Chieh Chen's continued work on Panoptic-DeepLab, MaX-DeepLab, and kMaX-DeepLab carried the DeepLab name into the panoptic and transformer eras ^[5]^[6]^[7].

What are DeepLab's limitations?

DeepLab has several practical limitations:

Compute cost at high resolution. Producing dense predictions at output stride 8 with a ResNet-101 or Xception backbone is expensive. Real-time use on full HD images typically requires switching to a MobileNet backbone or sacrificing output stride.
Boundary precision. The original DeepLab v1 and v2 needed a dense CRF post-processor specifically because the network output was too blurry on object boundaries ^[1]. The v3+ decoder helps a lot but does not fully solve the problem; thin structures (poles, wires) remain difficult.
CRF was slow. The dense CRF used in v1 and v2 added meaningful latency. Removing it in v3 was partly a recognition that the cost no longer justified the small accuracy gain ^[3].
Not directly an instance or panoptic model. DeepLab v1 through v3+ produce semantic segmentations only. Solving instance segmentation or panoptic segmentation requires the later Panoptic-DeepLab or MaX-DeepLab extensions, or pairing with Mask R-CNN ^[5]^[6].
Surpassed by transformers on hard benchmarks. Mask2Former, Mask DINO, OneFormer, and SAM-derived methods now hold the top entries on COCO panoptic, ADE20K, and Cityscapes panoptic ^[10]. DeepLab v3+ is no longer state of the art; it serves instead as a strong, easy-to-train baseline.
Backbone choice matters a lot. Most published DeepLab numbers rely on heavy ImageNet (and sometimes JFT-300M) pretraining. Reproducing the SOTA results without those weights is hard.
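
The compute-cost point can be made concrete by counting feature-map positions at each output stride. The sketch below is rough arithmetic under a simple ceiling-division assumption (real implementations may differ by a row or column depending on padding), and the helper name is hypothetical.

```python
import math

def feature_positions(height, width, output_stride):
    """Number of spatial positions in the final feature map."""
    return math.ceil(height / output_stride) * math.ceil(width / output_stride)

# Full HD input (1080 x 1920) at the three output strides discussed here.
full_hd = {os_: feature_positions(1080, 1920, os_) for os_ in (32, 16, 8)}
for os_, count in full_hd.items():
    print(f"output stride {os_:2d}: {count} positions")
```

Halving the output stride roughly quadruples the number of positions that every late-stage layer and every ASPP branch must process, which is why output stride 8 is usually reserved for offline or server-side inference.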

Is DeepLab still used in 2026?

By 2026 the public state of the art in semantic and panoptic segmentation has moved toward transformer-based mask classification (Mask2Former, OneFormer) and foundation segmentation models such as the Segment Anything Model (SAM and SAM 2) ^[10]. DeepLab is no longer the headline architecture in new segmentation papers.

That said, DeepLab v3 and v3+ remain widely used in production. They train quickly on a single GPU, they have well-tested open-source implementations in both TensorFlow and PyTorch (including a built-in torchvision.models.segmentation.deeplabv3_* API), they are easy to deploy on mobile via MobileNet backbones, and they ship with predictable, reproducible Cityscapes and PASCAL VOC numbers ^[11]^[12]. For a team that needs a strong semantic segmentation baseline, DeepLab v3+ is still a common first choice.

In simple terms (ELI5)

Imagine you want a computer to colour in a photo so that every pixel of road is grey, every car is blue, and every tree is green. That is semantic segmentation. The hard part is that the networks that recognise objects shrink the picture down to a tiny thumbnail to figure out what is in it, so when they blow it back up the edges come out fuzzy. DeepLab fixes this with a trick called atrous convolution: it spreads the network's "eyes" further apart so they can see a big area without shrinking the picture, keeping the edges crisp. It also looks at the same scene through several differently sized windows at once (ASPP) so it can spot both a tiny dog and a huge bus in one glance. Early versions cleaned up the edges with an extra smoothing step (a CRF), and later ones (v3+) added a small "sharpening" stage instead.

References

1. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv:1412.7062. Presented at ICLR 2015. https://arxiv.org/abs/1412.7062
2. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834-848. arXiv:1606.00915. https://arxiv.org/abs/1606.00915
3. Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587. https://arxiv.org/abs/1706.05587
4. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. arXiv:1802.02611. https://arxiv.org/abs/1802.02611
5. Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S., Adam, H., and Chen, L.-C. (2020). Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation. CVPR 2020. arXiv:1911.10194. https://arxiv.org/abs/1911.10194
6. Wang, H., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C. (2021). MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers. CVPR 2021. arXiv:2012.00759. https://arxiv.org/abs/2012.00759
7. Yu, Q., Wang, H., Qiao, S., Collins, M., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C. (2022). k-means Mask Transformer (kMaX-DeepLab). ECCV 2022.
8. Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR 2015. arXiv:1411.4038. https://arxiv.org/abs/1411.4038
9. Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyramid Scene Parsing Network. CVPR 2017. arXiv:1612.01105. https://arxiv.org/abs/1612.01105
10. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. (2022). Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former). CVPR 2022. arXiv:2112.01527.
11. Official TensorFlow DeepLab implementation, tensorflow/models/research/deeplab. Apache 2.0 license. https://github.com/tensorflow/models/tree/master/research/deeplab
12. PyTorch torchvision DeepLabV3 documentation. https://docs.pytorch.org/vision/main/models/deeplabv3.html
13. Liang-Chieh Chen, DeepLab project page. http://liangchiehchen.com/projects/DeepLab.html
14. Krahenbuhl, P., and Koltun, V. (2011). Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NeurIPS 2011.
