DeepLab
Last reviewed
May 1, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,999 words
DeepLab is a family of deep convolutional neural network architectures for [[semantic_segmentation|semantic segmentation]], developed by Liang-Chieh Chen and collaborators at UCLA and Google between 2014 and 2018. The DeepLab family introduced atrous (dilated) convolutions for dense prediction, atrous spatial pyramid pooling (ASPP) for capturing multi-scale context, and an encoder-decoder structure with atrous separable convolutions, achieving state-of-the-art results on the PASCAL VOC and Cityscapes benchmarks for several years. The line of work continued at Google Research after the v3+ release with extensions to panoptic segmentation (Panoptic-DeepLab, MaX-DeepLab, kMaX-DeepLab).
The original DeepLab paper, "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs" (arXiv:1412.7062), was first posted in December 2014 and presented at ICLR 2015. Subsequent versions were published in TPAMI (v2, 2017), as a technical report (v3, 2017), and at ECCV 2018 (v3+). Together, the DeepLab papers have been cited tens of thousands of times and form one of the most widely used baselines in [[image_segmentation|image segmentation]] research and production.
Semantic segmentation is the task of assigning a class label to every pixel in an image, producing a dense prediction map at the same spatial resolution as the input. DeepLab models tackle this problem with deep convolutional networks repurposed from image classification. The central design challenge that DeepLab addresses is that classification networks aggressively reduce spatial resolution through pooling and strided convolutions to produce a small, semantically rich feature map, which is the opposite of what dense prediction needs. DeepLab's solution is to keep the deep features semantically rich while preserving (or recovering) spatial resolution, and to expose those features to multiple receptive-field sizes so that objects at different scales can be segmented in one forward pass.
The family is best understood as a four-step evolution. DeepLab v1 introduced the atrous (dilated) convolution and a fully connected conditional random field (CRF) post-processor. DeepLab v2 added the Atrous Spatial Pyramid Pooling (ASPP) module and switched the backbone to ResNet-101. DeepLab v3 improved ASPP with batch normalisation and image-level features and removed the CRF. DeepLab v3+ added a lightweight decoder for sharper boundaries and adopted depthwise atrous separable convolutions, often paired with an Xception backbone.
DeepLab originated as a UCLA / Google collaboration. The first paper was authored by Liang-Chieh Chen (then a PhD student at UCLA, advised by Alan Yuille), George Papandreou (Google), Iasonas Kokkinos (then at École Centrale Paris / INRIA), Kevin Murphy (Google), and Alan L. Yuille (UCLA). The DeepLab v3 and v3+ papers shifted the author list toward Google Research, with Florian Schroff and Hartwig Adam joining and Yukun Zhu added on v3+.
Liang-Chieh Chen joined Google after his PhD and continued the line of work, authoring or co-authoring Panoptic-DeepLab (CVPR 2020), Axial-DeepLab (ECCV 2020), MaX-DeepLab (CVPR 2021), and kMaX-DeepLab (ECCV 2022). He later moved to ByteDance.
| Paper | Year | Venue | First / corresponding author |
|---|---|---|---|
| Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (DeepLab v1) | 2014 / 2015 | arXiv 1412.7062, ICLR 2015 | Liang-Chieh Chen |
| DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (DeepLab v2) | 2016 / 2017 | arXiv 1606.00915, TPAMI 2017 | Liang-Chieh Chen |
| Rethinking Atrous Convolution for Semantic Image Segmentation (DeepLab v3) | 2017 | arXiv 1706.05587 | Liang-Chieh Chen |
| Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLab v3+) | 2018 | arXiv 1802.02611, ECCV 2018 | Liang-Chieh Chen |
| Panoptic-DeepLab | 2019 / 2020 | arXiv 1911.10194, CVPR 2020 | Bowen Cheng |
| MaX-DeepLab | 2020 / 2021 | arXiv 2012.00759, CVPR 2021 | Huiyu Wang |
| kMaX-DeepLab | 2022 | ECCV 2022 | Qihang Yu |
Before DeepLab, the dominant approach to semantic segmentation with deep nets was the Fully Convolutional Network (FCN) of Long, Shelhamer and Darrell, presented at CVPR 2015. FCN demonstrated that a classification network could be turned into a dense predictor by replacing its fully connected layers with convolutions and adding a deconvolution stage to upsample the small output back to the input size. The technique worked, but the predictions were blurry, especially around object boundaries, because the backbone had downsampled the feature map by a factor of 32 before the upsample.
DeepLab proposed a different fix. Rather than upsample blurry features, it kept the features at a higher resolution in the first place, using atrous convolution to expand the receptive field without striding. It then used a fully connected CRF as a post-processing step to sharpen boundaries by enforcing pixel-pair consistency across the whole image. This combination produced sharper segmentations than FCN at comparable cost, and it pushed the PASCAL VOC 2012 test set mIoU from FCN's roughly 62 percent to 71.6 percent in the original DeepLab v1.
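The pixel-pair consistency enforced by the fully connected CRF can be written as an energy over the label assignment. The form below follows the dense CRF of Krähenbühl and Koltun as it is commonly stated for DeepLab. The symbols are notation introduced here rather than taken from the text above: $x_i$ is the label of pixel $i$, $P(x_i)$ the network's softmax probability, $p_i$ the pixel position, $I_i$ its colour, and $w_1, w_2, \sigma_\alpha, \sigma_\beta, \sigma_\gamma$ are hyperparameters.

$$
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j), \qquad \theta_i(x_i) = -\log P(x_i)
$$

$$
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[\, w_1 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j\rVert^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\gamma^2}\right) \right]
$$

Here $\mu(x_i, x_j) = 1$ when $x_i \neq x_j$ and $0$ otherwise, so only pixel pairs with different labels are penalised. The first kernel penalises giving different labels to nearby pixels of similar colour, and the second penalises different labels for nearby pixels regardless of colour.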
The design choices that DeepLab introduced (atrous convolution, ASPP, encoder-decoder refinement) became standard tools in segmentation. Most segmentation papers published between 2015 and 2020 either built on DeepLab's components, compared against DeepLab as a baseline, or both. Atrous convolution in particular spread well beyond segmentation and is now common in object detection and dense prediction more generally.
The four main DeepLab releases each address a specific limitation of the previous one. The table below summarises the key change in each version, the backbone typically used, the use of CRF post-processing, and the headline result on the PASCAL VOC 2012 test server.
| Version | Year | Backbone | Key new ideas | CRF | PASCAL VOC 2012 test mIoU |
|---|---|---|---|---|---|
| DeepLab v1 | 2014 / ICLR 2015 | VGG-16 | Atrous ("hole") convolution; fully connected CRF post-processing | Yes | 71.6 percent |
| DeepLab v2 | 2016 / TPAMI 2017 | ResNet-101 (also VGG-16 in ablations) | Atrous Spatial Pyramid Pooling (ASPP); multi-scale inputs | Yes | 79.7 percent |
| DeepLab v3 | 2017 | ResNet-101 / ResNet-50 | Improved ASPP with batch normalisation and image-level features; cascaded atrous; CRF dropped | No | 86.9 percent (with JFT-300M pretraining) |
| DeepLab v3+ | 2018 / ECCV 2018 | Xception-65, Xception-71 (also ResNet-101, MobileNet-v2) | Encoder-decoder structure; atrous depthwise separable convolutions | No | 89.0 percent |
The v3 result of 86.9 percent on PASCAL VOC 2012 test is reported with a ResNet-101 backbone pretrained on both ImageNet and the (Google-internal) JFT-300M dataset. Without JFT pretraining the paper reports 85.7 percent, about one point lower. The v3+ result of 89.0 percent uses an Xception-65 backbone with JFT-300M and COCO pretraining.
The original DeepLab paper combined two ideas. First, it adapted VGG-16 for dense prediction by removing the stride from the last two pooling layers and using atrous convolutions in the layers that follow, which kept the output stride at 8 instead of the original 32. Second, it ran a dense conditional random field (the model of Krähenbühl and Koltun, 2011) on the network's softmax output as a post-processing step, producing a sharper, more spatially consistent segmentation. The system reached 71.6 percent mIoU on the PASCAL VOC 2012 test set, and the network's dense forward pass ran at about 8 frames per second on a contemporary GPU.
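The stride-for-dilation trade described above can be sketched as a small planning routine. This is a minimal illustration, not the authors' code: the function name `plan_output_stride` and the five-stage stride-2 layout are assumptions chosen to resemble a VGG-16-style classifier, and output strides are assumed to be powers of two.

```python
def plan_output_stride(downsample_strides, output_stride):
    """For each downsampling layer, decide whether to keep its stride or to
    drop it and dilate the convolutions that follow instead.

    Returns a list of (applied_stride, dilation_of_following_convs) pairs.
    """
    plan = []
    current_stride = 1  # cumulative stride of the layers kept so far
    dilation = 1        # dilation the following convolutions must use
    for stride in downsample_strides:
        if current_stride >= output_stride:
            # Target resolution reached: remove the stride and compensate by
            # dilating every later convolution by the same factor.
            dilation *= stride
            plan.append((1, dilation))
        else:
            current_stride *= stride
            plan.append((stride, dilation))
    return plan


# Hypothetical backbone with five stride-2 downsampling layers (overall stride 32).
backbone = [2, 2, 2, 2, 2]
print(plan_output_stride(backbone, 32))  # every stride kept, no dilation
print(plan_output_stride(backbone, 8))   # last two strides traded for dilation 2, then 4
```

With a target output stride of 8, the last two downsampling layers lose their stride and the convolutions after them run at rates 2 and 4, which matches the "last two pooling layers" description.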
The TPAMI version added two important pieces. The backbone was upgraded from VGG-16 to ResNet-101, which gave a substantial accuracy boost on its own. More importantly, the paper introduced Atrous Spatial Pyramid Pooling (ASPP), in which the same feature map is processed in parallel by several 3x3 atrous convolutions at different dilation rates (typically 6, 12, 18, 24). The outputs are concatenated and fused, capturing context at multiple scales without a costly image pyramid. With ASPP, multi-scale inputs, and dense CRF, DeepLab v2 reached 79.7 percent mIoU on PASCAL VOC 2012 test and set new records on PASCAL-Context, PASCAL-Person-Part, and Cityscapes.
"Rethinking Atrous Convolution for Semantic Image Segmentation" simplified the system. The CRF post-processor was removed; the authors found that with a sufficiently strong backbone and a better ASPP, CRF refinement no longer made a meaningful difference. The new ASPP module added batch normalisation to all parallel branches and added a global pooling branch (a 1x1 convolution applied to global average pooled features, then upsampled) to encode image-level context. The paper also explored cascaded atrous convolutions, replacing strided blocks deep in the network with atrous blocks at progressively larger rates. With ResNet-101 pretrained on ImageNet plus JFT-300M, DeepLab v3 reached 86.9 percent mIoU on PASCAL VOC 2012 test.
DeepLab v3+ kept the v3 ASPP encoder but added a small decoder. Encoder features from ASPP (at output stride 16) are upsampled by 4x and concatenated with low-level features from an early backbone block (typically conv2 of ResNet or the entry-flow block of Xception); the fused features are refined and then upsampled by another 4x to the input resolution. This recovers sharper object boundaries that are otherwise lost when upsampling directly from output stride 16 to full resolution.
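The resolution bookkeeping of this decoder can be traced with a few lines of arithmetic. The sketch below is illustrative only: `decoder_shapes` is a hypothetical helper, the channel counts (256 for the ASPP output, 48 for the reduced low-level features) are assumed defaults not stated above, and the input size is assumed to be divisible by the encoder stride.

```python
def decoder_shapes(input_size, encoder_stride=16, low_level_stride=4,
                   aspp_channels=256, low_level_channels=48):
    """Trace spatial sizes and channel counts through a v3+-style decoder."""
    height, width = input_size
    encoder = (height // encoder_stride, width // encoder_stride)
    low_level = (height // low_level_stride, width // low_level_stride)

    # First upsample: bring encoder features to the low-level resolution.
    factor = encoder_stride // low_level_stride
    upsampled = (encoder[0] * factor, encoder[1] * factor)
    if upsampled != low_level:
        raise ValueError("input size must be divisible by the encoder stride")

    # Concatenation stacks the two feature maps along the channel axis.
    fused_channels = aspp_channels + low_level_channels

    # Second upsample: back to the input resolution.
    output = (upsampled[0] * low_level_stride, upsampled[1] * low_level_stride)
    return {
        "encoder": encoder,
        "first_upsample_factor": factor,
        "fused": upsampled,
        "fused_channels": fused_channels,
        "output": output,
    }


print(decoder_shapes((512, 512)))
```

For a 512x512 input this gives a 32x32 encoder map, a 128x128 fused map, and a 512x512 output, so each of the two upsampling steps only has to cover a factor of 4 rather than 16 at once.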
The paper also proposed atrous separable convolutions: depthwise separable convolutions in which the depthwise convolution is itself atrous. This substantially reduces compute compared to a dense atrous 3x3 convolution while preserving accuracy. With an Xception-65 backbone pretrained on ImageNet, JFT-300M, and COCO, DeepLab v3+ reached 89.0 percent mIoU on PASCAL VOC 2012 test; with an Xception-71 backbone it reached 82.1 percent on Cityscapes test. Both results are without CRF post-processing.
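The compute saving from separable convolutions follows from a simple count of multiply-adds per output position. The sketch below compares a dense 3x3 convolution with its depthwise separable counterpart; the function names and the 256-channel example are assumptions made for illustration, and biases are ignored. Dilation does not appear in either formula because it changes where the taps land, not how many there are.

```python
def dense_conv_cost(c_in, c_out, k=3):
    """Multiply-adds per output position (also the weight count) of a dense k x k conv."""
    return k * k * c_in * c_out


def separable_conv_cost(c_in, c_out, k=3):
    """Depthwise k x k conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


dense = dense_conv_cost(256, 256)
separable = separable_conv_cost(256, 256)
print(dense, separable, round(dense / separable, 1))
```

For 256 input and 256 output channels the dense layer needs 589,824 multiply-adds per position against 67,840 for the separable one, a reduction of roughly 8.7x.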
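After v3+, the DeepLab name continued at Google Research with extensions targeting panoptic segmentation (the joint task of [[semantic_segmentation|semantic]] and [[instance_segmentation|instance segmentation]]): Panoptic-DeepLab (CVPR 2020), Axial-DeepLab (ECCV 2020), MaX-DeepLab (CVPR 2021), and kMaX-DeepLab (ECCV 2022).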
| Innovation | First introduced | What it does |
|---|---|---|
| Atrous (dilated) convolution | DeepLab v1 | Inserts "holes" between filter taps, so a 3x3 filter at rate r covers a (2r+1)x(2r+1) window without adding parameters or reducing spatial resolution. Lets a classification backbone produce dense feature maps at output stride 8 or 16 instead of 32. |
| Fully connected CRF | DeepLab v1 | Post-processes the softmax output with a dense pairwise CRF (Krähenbühl and Koltun 2011) to sharpen boundaries by enforcing colour and position consistency between pixel pairs. |
| Atrous Spatial Pyramid Pooling (ASPP) | DeepLab v2 | Applies several parallel 3x3 atrous convolutions with different dilation rates to the same feature map, capturing multi-scale context without resampling the image. |
| Improved ASPP with image-level features | DeepLab v3 | Adds batch normalisation to ASPP branches and a global pooling branch that encodes image-level context, then concatenates and projects all branches together. |
| Encoder-decoder with low-level skip | DeepLab v3+ | Adds a lightweight decoder that fuses the ASPP encoder output with early-layer features, producing sharper boundaries than direct upsampling. |
| Atrous depthwise separable convolution | DeepLab v3+ | Replaces dense atrous convolutions with depthwise separable variants in both the backbone (Xception) and ASPP, cutting compute substantially. |
A standard 3x3 convolution looks at three contiguous taps in each spatial dimension. An atrous 3x3 convolution with rate r looks at three taps spaced r pixels apart, leaving r-1 holes between neighbouring taps. The filter's field of view grows from 3x3 to (2r+1)x(2r+1), but the number of parameters and the multiply-add cost per output position are unchanged. When a stride is removed from a layer deep in a backbone and the convolutions after it are made atrous at a matching rate, those layers keep the receptive field they had in the original network while the feature map keeps its higher spatial resolution. This is the technical trick that lets DeepLab use a classification backbone at output stride 8 or 16 instead of 32.
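A one-dimensional version makes the tap spacing easy to see. The following is a minimal pure-Python sketch, with zero padding at the borders and cross-correlation rather than flipped-kernel convolution (the convention deep-learning frameworks use); `atrous_conv1d` and `field_of_view` are illustrative names, not part of any DeepLab codebase.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Zero-padded 1D atrous convolution that preserves the input length."""
    centre = len(kernel) // 2
    output = []
    for i in range(len(signal)):
        acc = 0
        for j, weight in enumerate(kernel):
            # Neighbouring taps are `rate` samples apart instead of adjacent.
            idx = i + (j - centre) * rate
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        output.append(acc)
    return output


def field_of_view(kernel_size, rate):
    """Span covered by one atrous filter; equals 2r + 1 for a 3-tap kernel."""
    return (kernel_size - 1) * rate + 1


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
print(atrous_conv1d(impulse, [1, 1, 1], rate=1))  # responses at indices 3, 4, 5
print(atrous_conv1d(impulse, [1, 1, 1], rate=3))  # responses at indices 1, 4, 7
print(field_of_view(3, 3))                        # 7 samples, with the same 3 weights
```

Both calls use the same three weights and produce an output of the same length as the input; only the spacing of the taps, and therefore the span each output sees, changes with the rate.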
Objects in natural images appear at many scales. A naive way to handle this is to run the network on the input at several scales and merge the predictions, which multiplies compute. ASPP instead runs several atrous convolutions in parallel on the same feature map, each with a different dilation rate (e.g. 6, 12, 18). Each branch sees the same features through a different receptive field, so collectively they capture small objects, mid-scale objects, and large-context regions in a single pass. The branches are concatenated and fused with a 1x1 convolution.
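The parallel-branch structure can be sketched in one dimension with plain Python. This is a toy illustration under stated assumptions, not the published module: each branch is a single-channel 3-tap atrous filter, "concatenation" is a per-position tuple of branch responses, and the 1x1 fusion is a weighted sum with hypothetical weights `fuse_weights`.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded 1D atrous convolution that preserves the input length."""
    centre = len(kernel) // 2
    output = []
    for i in range(len(signal)):
        acc = 0
        for j, weight in enumerate(kernel):
            idx = i + (j - centre) * rate
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        output.append(acc)
    return output


def aspp_1d(features, rates=(6, 12, 18), kernel=(1, 1, 1), fuse_weights=(1, 1, 1)):
    """Run one atrous branch per rate on the same features, then fuse them."""
    branches = [atrous_conv1d(features, kernel, rate) for rate in rates]
    # "Concatenate": one vector of branch responses per position.
    stacked = list(zip(*branches))
    # "1x1 convolution": a per-position weighted sum across branches.
    fused = [sum(w * v for w, v in zip(fuse_weights, vec)) for vec in stacked]
    return stacked, fused


features = [0] * 41
features[20] = 1  # a single activation in the middle of the feature map
stacked, fused = aspp_1d(features)
print(stacked[14], stacked[8], stacked[2])  # each offset is seen by exactly one branch
print(fused[20])                            # all three branches respond at the centre
```

A single activation is picked up 6, 12, and 18 positions away by the respective branches, which is how the fused output at each position combines evidence from several context sizes in one pass.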
DeepLab is backbone-agnostic in principle, and the official TensorFlow implementation supports several:
| Backbone | Used in DeepLab version | Notes |
|---|---|---|
| VGG-16 | v1, v2 (ablations) | The original backbone in DeepLab v1. |
| ResNet-101 | v2, v3, v3+ | The standard "server-side" backbone for v2 and v3; the modified ResNet-101 used in DeepLab keeps output stride 16 via atrous convolution. See [[resnet]]. |
| ResNet-50 | v3, v3+ (light variants) | A lighter, faster alternative to ResNet-101. |
| Xception-65, Xception-71 | v3+ | Aligned Xception variants modified to be fully convolutional and to use atrous separable convolutions. |
| MobileNet-v2, MobileNet-v3 | v3, v3+ (mobile) | Used for mobile and on-device segmentation. See [[mobilenet]]. |
| PNASNet | v3+ (research variants) | Backbone discovered by neural architecture search. |
| Auto-DeepLab / HNASNet | Auto-DeepLab (Liu et al., CVPR 2019) | Backbone and meta-architecture jointly searched for segmentation. |
DeepLab models were the standard top-of-leaderboard entries on PASCAL VOC and Cityscapes from 2015 through about 2019. The headline numbers, all on the official test servers and as reported in the source papers, are:
| Benchmark | DeepLab version | mIoU | Notes |
|---|---|---|---|
| PASCAL VOC 2012 test | v1 | 71.6 percent | VGG-16 backbone, dense CRF |
| PASCAL VOC 2012 test | v2 | 79.7 percent | ResNet-101 backbone, ASPP, dense CRF |
| PASCAL VOC 2012 test | v3 | 86.9 percent | ResNet-101, JFT-300M pretraining, no CRF |
| PASCAL VOC 2012 test | v3+ | 89.0 percent | Xception-65, COCO + JFT pretraining, no CRF |
| Cityscapes test | v3+ | 82.1 percent | Xception-71, no CRF |
| Cityscapes test | Panoptic-DeepLab | 84.2 percent (mIoU); 65.5 percent PQ | Xception-71 backbone |
For reference, FCN reported about 62 percent mIoU on PASCAL VOC 2012 test (2015), and PSPNet reached 85.4 percent on PASCAL VOC 2012 and 80.2 percent on Cityscapes (2017). DeepLab v3+ reached 89.0 percent on PASCAL VOC 2012 in 2018.
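All of the figures above are mean intersection-over-union (mIoU) scores. As a reminder of what the metric measures, here is a minimal sketch that computes it from a confusion matrix. The function name and the two-class example are illustrative, and benchmark-specific details such as ignore labels are not reproduced.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(true_pos / union)
    return sum(ious) / len(ious)


# Toy two-class example: rows are ground truth, columns are predictions.
confusion = [
    [50, 10],
    [5, 35],
]
print(round(mean_iou(confusion), 4))
```

In this example the per-class IoUs are 50/65 and 35/50, so the mean is about 0.73. Because every class contributes equally regardless of pixel count, small or rare classes weigh as much as large ones.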
| Implementation | Maintainer | Notes |
|---|---|---|
| tensorflow/models/research/deeplab | Google (TensorFlow team) | The official TensorFlow implementation, covering v1 through v3+, with backbones for MobileNet-v2/v3, Xception, ResNet-50/101, PNASNet, and Auto-DeepLab. Apache 2.0 license. |
| torchvision.models.segmentation | PyTorch | Built-in DeepLab v3 with deeplabv3_resnet50, deeplabv3_resnet101, and deeplabv3_mobilenet_v3_large model builders, with pretrained weights trained on a COCO subset restricted to the PASCAL VOC categories. |
| kazuto1011/deeplab-pytorch | Community | Widely used PyTorch reimplementation of DeepLab v2 trained on COCO-Stuff and PASCAL VOC. |
| VainF/DeepLabV3Plus-Pytorch | Community | Popular PyTorch reimplementation of DeepLab v3 and v3+ for VOC and Cityscapes. |
| MMSegmentation | OpenMMLab | The OpenMMLab segmentation toolbox includes DeepLab v3 and v3+ as standard reference models. |
| Detectron2 (projects) | Meta AI | Includes Panoptic-DeepLab as a reference panoptic segmentation project. |
| Panoptic-DeepLab (bowenc0221) | Original authors | Official PyTorch reimplementation of Panoptic-DeepLab for the CVPR 2020 paper. |
DeepLab models are deployed in production across several domains where dense per-pixel prediction is needed:
| Model | Year | Authors | Core idea | Notes |
|---|---|---|---|---|
| FCN | 2015 | Long, Shelhamer, Darrell | Classification net with fully convolutional head and skip connections | The first deep, end-to-end semantic segmentation network. About 62 percent mIoU on PASCAL VOC 2012 test. |
| DeepLab v1–v3+ | 2014–2018 | Chen et al. | Atrous convolution + ASPP + (v3+) encoder-decoder | The dominant family of dense-prediction segmentation models from 2015 to about 2020. |
| U-Net | 2015 | Ronneberger, Fischer, Brox | Symmetric encoder-decoder with skip connections | Designed for biomedical segmentation; the standard baseline for medical imaging. |
| SegNet | 2015 / 2017 | Badrinarayanan, Kendall, Cipolla | Encoder-decoder with pooling-index unpooling | Aimed at memory-efficient road-scene segmentation. |
| PSPNet | 2017 | Zhao et al. | Pyramid pooling module on top of a deep backbone | 85.4 percent on PASCAL VOC 2012 and 80.2 percent on Cityscapes test. Direct contemporary of DeepLab v3. |
| RefineNet | 2017 | Lin et al. | Multi-path refinement with residual chained pooling | Strong PASCAL VOC and Cityscapes results. |
| HRNet | 2019 | Sun et al. | Maintains a high-resolution branch throughout the network | Avoids downsampling-then-upsampling; strong on segmentation, pose, and detection. |
| [[mask_r_cnn\|Mask R-CNN]] | 2017 | He et al. | Adds a mask head to Faster R-CNN | |
| SegFormer | 2021 | Xie et al. | Hierarchical transformer encoder, lightweight MLP decoder | Showed transformers could match DeepLab on Cityscapes with simpler decoders. |
| Mask2Former | 2022 | Cheng et al. | Mask classification with masked attention | A unified semantic, instance, and panoptic model that exceeded DeepLab and PSPNet on most leaderboards by 2022. |
| OneFormer | 2023 | Jain et al. | Single transformer trained jointly on all three segmentation tasks | Outperforms specialised Mask2Former on ADE20K, Cityscapes, and COCO. |
DeepLab v3+ remained competitive with these later models on many practical workloads through 2024, in part because it trains relatively quickly, runs efficiently on a single GPU, and ships with mature open-source code in both TensorFlow and PyTorch.
The DeepLab papers are among the most cited works in computer vision. The v2 (TPAMI) paper alone has tens of thousands of citations on Google Scholar, and the v1, v3, and v3+ papers are each in the high thousands or more. A few specific influences are worth noting:
DeepLab has some real limitations that are worth being honest about:
By 2026 the public state of the art in semantic and panoptic segmentation has moved toward transformer-based mask classification (Mask2Former, OneFormer) and foundation segmentation models such as the Segment Anything Model (SAM and SAM 2). DeepLab is no longer the headline architecture in new segmentation papers.
That said, DeepLab v3 and v3+ remain widely used in production. They train quickly on a single GPU, they have well-tested open-source implementations in both TensorFlow and PyTorch (including a built-in torchvision.models.segmentation.deeplabv3_* API), they are easy to deploy on mobile via MobileNet backbones, and they ship with predictable, reproducible Cityscapes and PASCAL VOC numbers. For a team that needs a strong semantic segmentation baseline, DeepLab v3+ is still a common first choice.