# DeepLab

> Source: https://aiwiki.ai/wiki/deeplab
> Updated: 2026-06-09
> Categories: Computer Vision, Deep Learning, Google
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**DeepLab** is a family of deep convolutional neural network architectures for [semantic segmentation](/wiki/semantic_segmentation), developed by Liang-Chieh Chen and collaborators at UCLA and Google between 2014 and 2018. The DeepLab family introduced atrous (dilated) convolutions for dense prediction, atrous spatial pyramid pooling (ASPP) for capturing multi-scale context, and an encoder-decoder structure with atrous separable convolutions, achieving state-of-the-art results on the PASCAL VOC and Cityscapes benchmarks for several years. The line of work continued at Google Research after the v3+ release with extensions to panoptic segmentation (Panoptic-DeepLab, MaX-DeepLab, kMaX-DeepLab).

The original DeepLab paper, "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs" (arXiv:1412.7062), was first posted in December 2014 and presented at ICLR 2015. Subsequent versions were published in TPAMI (v2, 2017), as a technical report (v3, 2017), and at ECCV 2018 (v3+). Together, the DeepLab papers have been cited tens of thousands of times and form one of the most widely used baselines in [image segmentation](/wiki/image_segmentation) research and production.

## Definition and overview

Semantic segmentation is the task of assigning a class label to every pixel in an image, producing a dense prediction map at the same spatial resolution as the input. DeepLab models tackle this problem with deep convolutional networks repurposed from image classification. The central design challenge that DeepLab addresses is that classification networks aggressively reduce spatial resolution through pooling and strided convolutions to produce a small, semantically rich feature map, which is the opposite of what dense prediction needs. DeepLab's solution is to keep the deep features semantically rich while preserving (or recovering) spatial resolution, and to expose those features to multiple receptive-field sizes so that objects at different scales can be segmented in one forward pass.

The family is best understood as a four-step evolution. DeepLab v1 introduced the atrous (dilated) convolution and a fully connected conditional random field (CRF) post-processor. DeepLab v2 added the Atrous Spatial Pyramid Pooling (ASPP) module and switched the backbone to ResNet-101. DeepLab v3 improved ASPP with batch normalisation and image-level features and removed the CRF. DeepLab v3+ added a lightweight decoder for sharper boundaries and adopted atrous depthwise separable convolutions, often paired with an Xception backbone.

## Authors and origin

DeepLab originated as a UCLA / Google collaboration. The first paper was authored by Liang-Chieh Chen (then a PhD student at UCLA, advised by Alan Yuille), George Papandreou (Google), Iasonas Kokkinos (then at École Centrale Paris / INRIA), Kevin Murphy (Google), and Alan L. Yuille (UCLA). The DeepLab v3 and v3+ papers shifted the author list toward Google Research, with Florian Schroff and Hartwig Adam joining and Yukun Zhu added on v3+.

Liang-Chieh Chen joined Google after his PhD and continued the line of work, authoring or co-authoring Panoptic-DeepLab (CVPR 2020), Axial-DeepLab (ECCV 2020), MaX-DeepLab (CVPR 2021), and kMaX-DeepLab (ECCV 2022). He later moved to ByteDance.

| Paper | Year | Venue | First / corresponding author |
|---|---|---|---|
| Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (DeepLab v1) | 2014 / 2015 | arXiv 1412.7062, ICLR 2015 | Liang-Chieh Chen |
| DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (DeepLab v2) | 2016 / 2017 | arXiv 1606.00915, TPAMI 2017 | Liang-Chieh Chen |
| Rethinking Atrous Convolution for Semantic Image Segmentation (DeepLab v3) | 2017 | arXiv 1706.05587 | Liang-Chieh Chen |
| Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLab v3+) | 2018 | arXiv 1802.02611, ECCV 2018 | Liang-Chieh Chen |
| Panoptic-DeepLab | 2019 / 2020 | arXiv 1911.10194, CVPR 2020 | Bowen Cheng |
| MaX-DeepLab | 2020 / 2021 | arXiv 2012.00759, CVPR 2021 | Huiyu Wang |
| kMaX-DeepLab | 2022 | ECCV 2022 | Qihang Yu |

## Why it mattered

Before DeepLab, the dominant approach to semantic segmentation with deep nets was the Fully Convolutional Network (FCN) of Long, Shelhamer and Darrell, presented at CVPR 2015. FCN demonstrated that a classification network could be turned into a dense predictor by replacing its fully connected layers with convolutions and adding a deconvolution stage to upsample the small output back to the input size. The technique worked, but the predictions were blurry, especially around object boundaries, because the backbone had downsampled the feature map by a factor of 32 before the upsample.

DeepLab proposed a different fix. Rather than upsample blurry features, it kept the features at a higher resolution in the first place, using atrous convolution to expand the receptive field without striding. It then used a fully connected CRF as a post-processing step to sharpen boundaries by enforcing pixel-pair consistency across the whole image. This combination produced sharper segmentations than FCN at comparable cost, and it pushed the PASCAL VOC 2012 test set mIoU from FCN's roughly 62 percent to 71.6 percent in the original DeepLab v1.

The design choices that DeepLab introduced (atrous convolution, ASPP, encoder-decoder refinement) became standard tools in segmentation. Most segmentation papers published between 2015 and 2020 either built on DeepLab's components, compared against DeepLab as a baseline, or both. Atrous convolution in particular spread well beyond segmentation and is now common in object detection and dense prediction more generally.

## Versions and evolution

The four main DeepLab releases each address a specific limitation of the previous one. The table below summarises the key change in each version, the backbone typically used, the use of CRF post-processing, and the headline result on the PASCAL VOC 2012 test server.

| Version | Year | Backbone | Key new ideas | CRF | PASCAL VOC 2012 test mIoU |
|---|---|---|---|---|---|
| DeepLab v1 | 2014 / ICLR 2015 | VGG-16 | Atrous ("hole") convolution; fully connected CRF post-processing | Yes | 71.6 percent |
| DeepLab v2 | 2016 / TPAMI 2017 | ResNet-101 (also VGG-16 in ablations) | Atrous Spatial Pyramid Pooling (ASPP); multi-scale inputs | Yes | 79.7 percent |
| DeepLab v3 | 2017 | ResNet-101 / ResNet-50 | Improved ASPP with batch normalisation and image-level features; cascaded atrous; CRF dropped | No | 86.9 percent (with JFT-300M pretraining) |
| DeepLab v3+ | 2018 / ECCV 2018 | Xception-65, Xception-71 (also ResNet-101, MobileNet-v2) | Encoder-decoder structure; atrous depthwise separable convolutions | No | 89.0 percent |

The v3 result of 86.9 percent on PASCAL VOC 2012 test is reported with a ResNet-101 backbone pretrained on both ImageNet and the (Google-internal) JFT-300M dataset. Without JFT pretraining the paper reports 85.7 percent, about one point lower. The v3+ result of 89.0 percent uses an Xception-65 backbone and the COCO + JFT pretraining recipe.
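
The mIoU figures in the table are mean intersection-over-union scores. As a rough illustration of how such a score is formed, the sketch below computes mIoU from a confusion matrix on a toy pair of label maps. The function name `mean_iou` and the toy data are illustrative assumptions; benchmark servers typically accumulate the confusion matrix over the whole test set and skip "void" pixels, which this minimal version does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                                   # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp    # wrongly labelled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy 2x3 label maps, flattened row by row, with three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Averaging per-class IoU, rather than per-pixel accuracy, keeps small classes from being swamped by large background regions.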

### DeepLab v1 (2014 / ICLR 2015)

The original DeepLab paper combined two ideas. First, it adapted VGG-16 for dense prediction by removing the striding of the last two pooling layers and using atrous convolutions in the layers that follow, which kept the output stride at 8 instead of the original 32. Second, it ran a dense conditional random field (the model of Krähenbühl and Koltun, 2011) on the network's softmax output as a post-processing step, producing a sharper, more spatially consistent segmentation. The system reached 71.6 percent mIoU on the PASCAL VOC 2012 test set. The paper reports dense network computation at about 8 frames per second on a contemporary GPU, with CRF inference adding roughly half a second per image.
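
The stride-to-dilation conversion described above can be sketched as a small planning routine. This is a simplified illustration, not the authors' code: `atrous_plan` is a hypothetical helper, and the backbone is modelled as nothing more than a list of per-stage strides.

```python
def atrous_plan(stage_strides, output_stride):
    """Plan how each stage runs so the network stops at `output_stride`.

    Returns one (applied_stride, dilation) pair per stage, where `dilation`
    is the rate used by the convolutions that follow that stage's
    (possibly removed) downsampling step.
    """
    current_stride = 1
    dilation = 1
    plan = []
    for stride in stage_strides:
        if current_stride >= output_stride:
            # Target density reached: drop the stride and dilate instead.
            dilation *= stride
            plan.append((1, dilation))
        else:
            current_stride *= stride
            plan.append((stride, dilation))
    return plan


# Five stride-2 downsampling steps, as in a VGG-16-like classifier (overall stride 32).
classifier_strides = [2, 2, 2, 2, 2]

dense_plan = atrous_plan(classifier_strides, output_stride=8)
# dense_plan == [(2, 1), (2, 1), (2, 1), (1, 2), (1, 4)]

# Feature-map side length for a 512-pixel input under each setting.
side_at_stride_32 = 512 // 32  # 16
side_at_stride_8 = 512 // 8    # 64
```

Each removed stride doubles the dilation of everything downstream, so the later layers still cover the same image region while the feature map stays four times larger per side.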

### DeepLab v2 (2016 / TPAMI 2017)

The TPAMI version added two important pieces. The backbone was upgraded from VGG-16 to ResNet-101, which gave a substantial accuracy boost on its own. More importantly, the paper introduced Atrous Spatial Pyramid Pooling (ASPP), in which the same feature map is processed in parallel by several 3x3 atrous convolutions at different dilation rates (typically 6, 12, 18, 24). The outputs are concatenated and fused, capturing context at multiple scales without a costly image pyramid. With ASPP, multi-scale inputs, and dense CRF, DeepLab v2 reached 79.7 percent mIoU on PASCAL VOC 2012 test and set new records on PASCAL-Context, PASCAL-Person-Part, and Cityscapes.

### DeepLab v3 (2017)

"Rethinking Atrous Convolution for Semantic Image Segmentation" simplified the system. The CRF post-processor was removed; the authors found that with a sufficiently strong backbone and a better ASPP, CRF refinement no longer made a meaningful difference. The new ASPP module added batch normalisation to all parallel branches and added a global pooling branch (a 1x1 convolution applied to global average pooled features, then upsampled) to encode image-level context. The paper also explored cascaded atrous convolutions, replacing strided blocks deep in the network with atrous blocks at progressively larger rates. With ResNet-101 pretrained on ImageNet plus JFT-300M, DeepLab v3 reached 86.9 percent mIoU on PASCAL VOC 2012 test.

### DeepLab v3+ (2018 / ECCV 2018)

DeepLab v3+ kept the v3 ASPP encoder but added a small decoder. The encoder output from ASPP (at output stride 16) is upsampled by 4x and concatenated with low-level features from an early backbone block (typically conv2 of ResNet or the entry-flow block of Xception). The fused features are then refined by a few convolutions and upsampled by another 4x to full resolution. This recovers sharper object boundaries that are otherwise lost when going straight from output stride 16 back to the full resolution.
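
To make the decoder's data flow concrete, the following sketch traces tensor shapes through a v3+-style decoder without doing any real computation. The helper name and the channel counts (256 for the ASPP output, 48 for the reduced low-level features, 21 classes) are assumptions chosen to resemble commonly reported settings; they are not stated in this article.

```python
def v3plus_decoder_shapes(input_hw, num_classes, encoder_stride=16, low_level_stride=4,
                          aspp_channels=256, low_level_channels=48):
    """Trace (channels, height, width) through a DeepLab v3+-style decoder."""
    height, width = input_hw
    encoder = (aspp_channels, height // encoder_stride, width // encoder_stride)

    # First upsample: bring the encoder output to the low-level resolution (4x here).
    factor = encoder_stride // low_level_stride
    upsampled = (aspp_channels, encoder[1] * factor, encoder[2] * factor)

    # Low-level skip features, after a 1x1 convolution has reduced their channels.
    low_level = (low_level_channels, height // low_level_stride, width // low_level_stride)
    if upsampled[1:] != low_level[1:]:
        raise ValueError("spatial sizes must match before concatenation")

    # Channel-wise concatenation, refinement to class scores, then the second upsample.
    fused = (aspp_channels + low_level_channels, low_level[1], low_level[2])
    logits = (num_classes, fused[1] * low_level_stride, fused[2] * low_level_stride)
    return {"encoder": encoder, "upsampled": upsampled, "low_level": low_level,
            "fused": fused, "logits": logits}


shapes = v3plus_decoder_shapes((512, 512), num_classes=21)
# shapes["encoder"] == (256, 32, 32)
# shapes["fused"]   == (304, 128, 128)
# shapes["logits"]  == (21, 512, 512)
```

Keeping the low-level branch narrow is a plausible way to stop the skip features from outweighing the semantically richer encoder output.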

The paper also proposed atrous separable convolutions: depthwise separable convolutions in which the depthwise convolution is itself atrous. This substantially reduces compute compared to a dense atrous 3x3 convolution while preserving accuracy. With an Xception-65 backbone pretrained on ImageNet plus JFT-300M, plus pretraining on COCO, DeepLab v3+ reached 89.0 percent mIoU on PASCAL VOC 2012 test; with the deeper Xception-71 backbone it reached 82.1 percent on Cityscapes test. Both results are without CRF post-processing.
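
The compute saving from separable convolutions can be estimated with simple arithmetic. The sketch below counts multiply-adds per output position (equivalently, weights, ignoring biases) for a dense 3x3 convolution and for its depthwise separable counterpart. The 256-channel width is an illustrative assumption, and the dilation rate does not appear because it changes neither count.

```python
def dense_conv_cost(kernel_size, in_channels, out_channels):
    """Multiply-adds per output position for a dense k x k convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_cost(kernel_size, in_channels, out_channels):
    """Depthwise k x k (one filter per channel) followed by a 1x1 pointwise mix."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


dense = dense_conv_cost(3, 256, 256)          # 589824
separable = separable_conv_cost(3, 256, 256)  # 2304 + 65536 = 67840
ratio = dense / separable                     # roughly 8.7x fewer multiply-adds
```

For wide layers the ratio approaches the kernel area (9 for a 3x3 kernel), since the pointwise term dominates the separable cost.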

### Successor models

After v3+, the DeepLab name continued at Google Research with extensions targeting panoptic segmentation (the joint task of [semantic](/wiki/semantic_segmentation) and [instance segmentation](/wiki/instance_segmentation)):

- **Panoptic-DeepLab** (Cheng et al., CVPR 2020) added two prediction heads to a DeepLab encoder: a semantic head and a class-agnostic instance head that predicts instance centers and per-pixel offsets. It was the first bottom-up panoptic system to match top-down ones in accuracy, reaching 84.2 percent mIoU, 39.0 percent AP, and 65.5 percent PQ on Cityscapes test.
- **Axial-DeepLab** (Wang et al., ECCV 2020) replaced the convolutional backbone with axial self-attention layers, an early transformer-flavoured backbone for dense prediction.
- **MaX-DeepLab** (Wang et al., CVPR 2021) was presented by its authors as the first end-to-end panoptic segmentation model, using a dual-path mask transformer and a panoptic-quality-inspired bipartite matching loss. It removed the need for box detection and post-processing heuristics.
- **kMaX-DeepLab** (Yu et al., ECCV 2022) reformulated the cross-attention in a mask transformer as k-means clustering, simplifying training and improving accuracy on COCO and Cityscapes panoptic benchmarks.
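
The center-and-offset grouping used by Panoptic-DeepLab's instance head can be illustrated with a toy example. This is a simplified sketch under stated assumptions: the centers are given directly rather than extracted from a predicted heatmap, every pixel is treated as belonging to some instance, and `group_pixels` is a hypothetical helper name.

```python
def group_pixels(centers, offsets):
    """Assign each pixel to the nearest center after applying its predicted offset.

    centers: list of (y, x) instance centers (e.g. peaks of a center heatmap).
    offsets: dict mapping pixel (y, x) -> predicted (dy, dx) toward its center.
    Returns a dict mapping pixel -> index into `centers`.
    """
    assignment = {}
    for (y, x), (dy, dx) in offsets.items():
        voted_y, voted_x = y + dy, x + dx
        distances = [(voted_y - cy) ** 2 + (voted_x - cx) ** 2 for cy, cx in centers]
        assignment[(y, x)] = distances.index(min(distances))
    return assignment


centers = [(1, 1), (1, 5)]
offsets = {
    (0, 0): (1, 1),      # votes for (1, 1)
    (2, 1): (-1, 0),     # votes for (1, 1)
    (1, 4): (0, 1),      # votes for (1, 5)
    (0, 6): (1, -1.2),   # noisy vote at (1, 4.8), still closest to (1, 5)
}
instance_ids = group_pixels(centers, offsets)
# instance_ids == {(0, 0): 0, (2, 1): 0, (1, 4): 1, (0, 6): 1}
```

Because the grouping is a nearest-center lookup, no box proposals are involved, which is what makes the approach bottom-up.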

## Key innovations

| Innovation | First introduced | What it does |
|---|---|---|
| Atrous (dilated) convolution | DeepLab v1 | Inserts "holes" between filter taps, expanding the receptive field by a factor of the dilation rate without adding parameters or reducing spatial resolution. Lets a classification backbone produce dense feature maps at output stride 8 or 16 instead of 32. |
| Fully connected CRF | DeepLab v1 | Post-processes the softmax output with a dense pairwise CRF (Krähenbühl and Koltun 2011) to sharpen boundaries by enforcing colour and position consistency between pixel pairs. |
| Atrous Spatial Pyramid Pooling (ASPP) | DeepLab v2 | Applies several parallel 3x3 atrous convolutions with different dilation rates to the same feature map, capturing multi-scale context without resampling the image. |
| Improved ASPP with image-level features | DeepLab v3 | Adds batch normalisation to ASPP branches and a global pooling branch that encodes image-level context, then concatenates and projects all branches together. |
| Encoder-decoder with low-level skip | DeepLab v3+ | Adds a lightweight decoder that fuses the ASPP encoder output with early-layer features, producing sharper boundaries than direct upsampling. |
| Atrous depthwise separable convolution | DeepLab v3+ | Replaces dense atrous convolutions with depthwise separable variants in both the backbone (Xception) and ASPP, cutting compute substantially. |
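
For the fully connected CRF row above, the colour and position consistency is expressed through a pairwise kernel in the style of Krähenbühl and Koltun: an appearance term that depends on both position and colour, plus a smoothness term that depends on position only. The sketch below evaluates that kernel for a single pixel pair. All weights and bandwidths are illustrative placeholder values, not the settings used in the DeepLab papers.

```python
import math


def pairwise_kernel(pos_i, pos_j, rgb_i, rgb_j,
                    w_appearance=4.0, w_smoothness=3.0,
                    sigma_alpha=50.0, sigma_beta=5.0, sigma_gamma=3.0):
    """Affinity between two pixels; larger values push them toward the same label."""
    pos_d2 = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    rgb_d2 = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    # Appearance term: nearby pixels with similar colour are strongly coupled.
    appearance = w_appearance * math.exp(
        -pos_d2 / (2 * sigma_alpha ** 2) - rgb_d2 / (2 * sigma_beta ** 2)
    )
    # Smoothness term: nearby pixels are coupled regardless of colour.
    smoothness = w_smoothness * math.exp(-pos_d2 / (2 * sigma_gamma ** 2))
    return appearance + smoothness


# Two horizontally adjacent pixels: once with similar colours, once across an edge.
similar = pairwise_kernel((0, 0), (0, 1), (200, 30, 30), (198, 32, 29))
different = pairwise_kernel((0, 0), (0, 1), (200, 30, 30), (20, 180, 40))
```

Across a strong colour edge the appearance term vanishes, so label disagreements there are penalised far less, which is how the CRF sharpens boundaries.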

### Atrous convolution explained

A standard 3x3 convolution looks at three contiguous taps in each spatial dimension. An atrous 3x3 convolution with rate r looks at three taps spaced r pixels apart, leaving holes between them. The receptive field grows from 3x3 to (2r+1)x(2r+1), but the number of parameters and the per-pixel multiply-add cost are unchanged. By removing the stride from a layer deep in a backbone and dilating the convolutions that follow it, the later layers keep the same receptive field as before while the feature map keeps its spatial resolution. This is the technical trick that lets DeepLab use a classification backbone at output stride 8 or 16 instead of 32.
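
A one-dimensional toy version shows the same behaviour. The sketch below is a minimal, unoptimised illustration (zero padding, cross-correlation form); the function names are assumptions made for this example.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded, same-length 1D atrous convolution (cross-correlation form)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate  # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def effective_kernel_size(kernel_size, rate):
    """Span covered by a dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
kernel = [1, 1, 1]  # the same three weights are used at every rate

standard = atrous_conv1d(impulse, kernel, rate=1)  # [0, 0, 0, 1, 1, 1, 0, 0, 0]
dilated = atrous_conv1d(impulse, kernel, rate=2)   # [0, 0, 1, 0, 1, 0, 1, 0, 0]
```

With three taps, `effective_kernel_size(3, r)` equals 2r+1, matching the receptive-field formula above, while the kernel still holds only three weights.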

### ASPP explained

Objects in natural images appear at many scales. A naive way to handle this is to run the network on the input at several scales and merge the predictions, which multiplies compute. ASPP instead runs several atrous convolutions in parallel on the same feature map, each with a different dilation rate (e.g. 6, 12, 18). Each branch sees the same features through a different receptive field, so collectively they capture small objects, mid-scale objects, and large-context regions in a single pass. The branches are concatenated and fused with a 1x1 convolution.
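
The parallel-branch structure can be sketched in one dimension. This is a toy illustration rather than a faithful ASPP: it uses a single channel, small rates (1, 2, 3) standing in for rates such as 6, 12, 18, and it omits batch normalisation, non-linearities, and any image-level branch. The names `aspp_1d` and `atrous_conv1d` are assumptions for this example.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded, same-length 1D atrous convolution (cross-correlation form)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def aspp_1d(signal, branch_kernels, rates, fuse_weights):
    """Run parallel atrous branches on one input, then fuse them per position."""
    branches = [atrous_conv1d(signal, k, r) for k, r in zip(branch_kernels, rates)]
    # "Concatenate" along the channel axis: one value per branch at each position.
    stacked = zip(*branches)
    # A 1x1 convolution is a per-position weighted sum over those channels.
    return [sum(w * v for w, v in zip(fuse_weights, channels)) for channels in stacked]


impulse = [0] * 13
impulse[6] = 1

fused = aspp_1d(
    impulse,
    branch_kernels=[[1, 1, 1]] * 3,
    rates=[1, 2, 3],
    fuse_weights=[1, 1, 1],
)
# fused == [0, 0, 0, 1, 1, 1, 3, 1, 1, 1, 0, 0, 0]
```

The single impulse influences positions up to three samples away in the fused output, showing how branches with different rates cover different context widths in one pass.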

## Backbone networks

DeepLab is backbone-agnostic in principle, and the official TensorFlow implementation supports several:

| Backbone | Used in DeepLab version | Notes |
|---|---|---|
| VGG-16 | v1, v2 (ablations) | The original backbone in DeepLab v1. |
| ResNet-101 | v2, v3, v3+ | The standard "server-side" backbone for v2 and v3; the modified ResNet-101 used in DeepLab keeps output stride 8 or 16 via atrous convolution. See [ResNet](/wiki/resnet). |
| ResNet-50 | v3, v3+ (light variants) | A lighter, faster alternative to ResNet-101. |
| Xception-65, Xception-71 | v3+ | Aligned Xception variants modified to be fully convolutional and to use atrous separable convolutions. |
| MobileNet-v2, MobileNet-v3 | v3, v3+ (mobile) | Used for mobile and on-device segmentation. See [MobileNet](/wiki/mobilenet). |
| PNASNet | v3+ (research variants) | Backbone discovered by neural architecture search. |
| Auto-DeepLab / HNASNet | Auto-DeepLab (Liu et al., CVPR 2019) | Backbone and meta-architecture jointly searched for segmentation. |

## Benchmark performance

DeepLab models were the standard top-of-leaderboard entries on PASCAL VOC and Cityscapes from 2015 through about 2019. The headline numbers, all on the official test servers and as reported in the source papers, are:

| Benchmark | DeepLab version | mIoU | Notes |
|---|---|---|---|
| PASCAL VOC 2012 test | v1 | 71.6 percent | VGG-16 backbone, dense CRF |
| PASCAL VOC 2012 test | v2 | 79.7 percent | ResNet-101 backbone, ASPP, dense CRF |
| PASCAL VOC 2012 test | v3 | 86.9 percent | ResNet-101, JFT-300M pretraining, no CRF |
| PASCAL VOC 2012 test | v3+ | 89.0 percent | Xception-65, COCO + JFT pretraining, no CRF |
| Cityscapes test | v3+ | 82.1 percent | Xception-71, no CRF |
| Cityscapes test | Panoptic-DeepLab | 84.2 percent (mIoU); 65.5 percent PQ | Xception-71 backbone |

For reference, FCN reported about 62 percent mIoU on PASCAL VOC 2012 test (2015), and PSPNet reached 85.4 percent on PASCAL VOC 2012 test and 80.2 percent on Cityscapes test (2017). DeepLab v3+ reached 89.0 percent on PASCAL VOC 2012 test in 2018.

## Implementations

| Implementation | Maintainer | Notes |
|---|---|---|
| tensorflow/models/research/deeplab | Google (TensorFlow team) | The official TensorFlow implementation, covering v1 through v3+, with backbones for MobileNet-v2/v3, Xception, ResNet-50/101, PNASNet, and Auto-DeepLab. Apache 2.0 license. |
| torchvision.models.segmentation | PyTorch | Built-in DeepLab v3 with `deeplabv3_resnet50`, `deeplabv3_resnet101`, and `deeplabv3_mobilenet_v3_large` model builders. The pretrained weights are trained on a subset of COCO restricted to the 20 PASCAL VOC categories. |
| kazuto1011/deeplab-pytorch | Community | Widely used PyTorch reimplementation of DeepLab v2 trained on COCO-Stuff and PASCAL VOC. |
| VainF/DeepLabV3Plus-Pytorch | Community | Popular PyTorch reimplementation of DeepLab v3 and v3+ for VOC and Cityscapes. |
| MMSegmentation | OpenMMLab | The OpenMMLab segmentation toolbox includes DeepLab v3 and v3+ as standard reference models. |
| Detectron2 (projects) | Meta AI | Includes Panoptic-DeepLab as a reference panoptic segmentation project. |
| Panoptic-DeepLab (bowenc0221) | Original authors | Official PyTorch reimplementation of Panoptic-DeepLab for the CVPR 2020 paper. |

## Use cases

DeepLab models are deployed in production across several domains where dense per-pixel prediction is needed:

- **Autonomous driving**: lane, road, vehicle, pedestrian, and free-space segmentation, typically benchmarked on Cityscapes, BDD100K, and KITTI.
- **Medical imaging**: organ, lesion, and tumour segmentation in CT, MRI, and pathology slides. DeepLab v3+ with a ResNet-101 or MobileNet backbone is a common drop-in baseline alongside [U-Net-style models](/wiki/image_segmentation).
- **Satellite and aerial imagery**: building footprints, road networks, agricultural plots, deforestation, and damage assessment.
- **Robotics**: scene understanding for manipulation and navigation, often coupled with depth or instance segmentation heads.
- **Photo and video editing**: portrait segmentation for background blur, virtual backgrounds, and effects. Portrait Mode on some Google Pixel generations and similar mobile features are based on DeepLab variants with MobileNet backbones.
- **Augmented reality**: real-time scene segmentation for occlusion handling and surface understanding.
- **Industrial inspection**: defect segmentation on manufactured parts, surface inspection, and quality control.

## Comparison with other semantic segmentation methods

| Model | Year | Authors | Core idea | Notes |
|---|---|---|---|---|
| FCN | 2015 | Long, Shelhamer, Darrell | Classification net with fully convolutional head and skip connections | The first deep, end-to-end semantic segmentation network. About 62 percent mIoU on PASCAL VOC 2012 test. |
| DeepLab v1-v3+ | 2014-2018 | Chen et al. | Atrous convolution + ASPP + (v3+) encoder-decoder | The dominant family of dense-prediction segmentation models from 2015 to about 2020. |
| U-Net | 2015 | Ronneberger, Fischer, Brox | Symmetric encoder-decoder with skip connections | Designed for biomedical segmentation; the standard baseline for medical imaging. |
| SegNet | 2015 / 2017 | Badrinarayanan, Kendall, Cipolla | Encoder-decoder with pooling-index unpooling | Aimed at memory-efficient road-scene segmentation. |
| PSPNet | 2017 | Zhao et al. | Pyramid pooling module on top of a deep backbone | 85.4 percent on PASCAL VOC 2012 and 80.2 percent on Cityscapes test. Direct contemporary of DeepLab v3. |
| RefineNet | 2017 | Lin et al. | Multi-path refinement with residual chained pooling | Strong PASCAL VOC and Cityscapes results. |
| HRNet | 2019 | Sun et al. | Maintains a high-resolution branch throughout the network | Avoids downsampling-then-upsampling; strong on segmentation, pose, and detection. |
| [Mask R-CNN](/wiki/mask_r_cnn) | 2017 | He et al. | Adds a mask head to Faster R-CNN | Solves [instance segmentation](/wiki/instance_segmentation), a different task. |
| SegFormer | 2021 | Xie et al. | Hierarchical transformer encoder, lightweight MLP decoder | Showed transformers could match DeepLab on Cityscapes with simpler decoders. |
| Mask2Former | 2022 | Cheng et al. | Mask classification with masked attention | A unified semantic, instance, and panoptic model that exceeded DeepLab and PSPNet on most leaderboards by 2022. |
| OneFormer | 2023 | Jain et al. | Single transformer trained jointly on all three segmentation tasks | Outperforms specialised Mask2Former on ADE20K, Cityscapes, and COCO. |

DeepLab v3+ remained competitive with these later models on many practical workloads through 2024, in part because it trains relatively quickly, runs efficiently on a single GPU, and ships with mature open-source code in both TensorFlow and PyTorch.

## Influence

The DeepLab papers are among the most cited works in computer vision. The v2 (TPAMI) paper alone has tens of thousands of citations on Google Scholar, and the v1, v3, and v3+ papers are each in the high thousands or more. A few specific influences are worth noting:

- Atrous (dilated) convolution became a default tool in segmentation, dense detection, and audio/sequence modelling. The same idea, under the name dilated convolution, is the core of WaveNet and many speech models.
- ASPP and close variants appear in dozens of follow-up segmentation models and were widely copied into instance and panoptic segmentation pipelines.
- The DeepLab decoder of v3+ became a template for adding low-level skip features to dense prediction heads.
- The official tensorflow/models/research/deeplab repository has been a starting point for many production segmentation systems.
- Liang-Chieh Chen's continued work on Panoptic-DeepLab, MaX-DeepLab, and kMaX-DeepLab carried the DeepLab name into the panoptic and transformer eras.

## Limitations

DeepLab has several practical limitations:

- **Compute cost at high resolution.** Producing dense predictions at output stride 8 with a ResNet-101 or Xception backbone is expensive. Real-time use on full HD images typically requires switching to a MobileNet backbone or sacrificing output stride.
- **Boundary precision.** The original DeepLab v1 and v2 needed a dense CRF post-processor specifically because the network output was too blurry on object boundaries. The v3+ decoder helps a lot but does not fully solve the problem; thin structures (poles, wires) remain difficult.
- **CRF was slow.** The dense CRF used in v1 and v2 added meaningful latency. Removing it in v3 was partly a recognition that the cost no longer justified the small accuracy gain.
- **Not directly an instance or panoptic model.** DeepLab v1 through v3+ produce semantic segmentations only. Solving [instance segmentation](/wiki/instance_segmentation) or panoptic segmentation requires the later Panoptic-DeepLab or MaX-DeepLab extensions, or pairing with [Mask R-CNN](/wiki/mask_r_cnn).
- **Surpassed by transformers on hard benchmarks.** Mask2Former, Mask DINO, OneFormer, and SAM-derived methods now hold the top entries on COCO panoptic, ADE20K, and Cityscapes panoptic. As of 2025, DeepLab v3+ is rarely state of the art; it serves instead as a strong, easy-to-train baseline.
- **Backbone choice matters a lot.** Most published DeepLab numbers rely on heavy ImageNet (and sometimes JFT-300M) pretraining. Reproducing the SOTA results without those weights is hard.

## Recent context (2024-2026)

By 2026 the public state of the art in semantic and panoptic segmentation has moved toward transformer-based mask classification (Mask2Former, OneFormer) and foundation segmentation models such as the Segment Anything Model (SAM and SAM 2). DeepLab is no longer the headline architecture in new segmentation papers.

That said, DeepLab v3 and v3+ remain widely used in production. They train quickly on a single GPU, they have well-tested open-source implementations in both TensorFlow and PyTorch (including a built-in `torchvision.models.segmentation.deeplabv3_*` API), they are easy to deploy on mobile via MobileNet backbones, and they ship with predictable, reproducible Cityscapes and PASCAL VOC numbers. For a team that needs a strong semantic segmentation baseline, DeepLab v3+ is still a common first choice.

## See also

- [Semantic segmentation](/wiki/semantic_segmentation)
- [Image segmentation](/wiki/image_segmentation)
- [Instance segmentation](/wiki/instance_segmentation)
- [Mask R-CNN](/wiki/mask_r_cnn)
- [ResNet](/wiki/resnet)
- [MobileNet](/wiki/mobilenet)

## References

1. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014). Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv:1412.7062. Presented at ICLR 2015. https://arxiv.org/abs/1412.7062
2. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834-848. arXiv:1606.00915. https://arxiv.org/abs/1606.00915
3. Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587. https://arxiv.org/abs/1706.05587
4. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. arXiv:1802.02611. https://arxiv.org/abs/1802.02611
5. Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S., Adam, H., and Chen, L.-C. (2020). Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation. CVPR 2020. arXiv:1911.10194. https://arxiv.org/abs/1911.10194
6. Wang, H., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C. (2021). MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers. CVPR 2021. arXiv:2012.00759. https://arxiv.org/abs/2012.00759
7. Yu, Q., Wang, H., Qiao, S., Collins, M., Zhu, Y., Adam, H., Yuille, A., and Chen, L.-C. (2022). k-means Mask Transformer (kMaX-DeepLab). ECCV 2022.
8. Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR 2015. arXiv:1411.4038. https://arxiv.org/abs/1411.4038
9. Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyramid Scene Parsing Network. CVPR 2017. arXiv:1612.01105. https://arxiv.org/abs/1612.01105
10. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. (2022). Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former). CVPR 2022. arXiv:2112.01527.
11. Official TensorFlow DeepLab implementation, tensorflow/models/research/deeplab. Apache 2.0 license. https://github.com/tensorflow/models/tree/master/research/deeplab
12. PyTorch torchvision DeepLabV3 documentation. https://docs.pytorch.org/vision/main/models/deeplabv3.html
13. Liang-Chieh Chen, DeepLab project page, http://liangchiehchen.com/projects/DeepLab.html
14. Krähenbühl, P., and Koltun, V. (2011). Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NeurIPS 2011.

