A depthwise separable convolution is a factorized form of convolution that decomposes a standard convolutional operation into two sequential steps: a depthwise convolution and a pointwise convolution. This factorization reduces both the number of parameters and the computational cost compared to standard convolutions, making it a core building block for efficient convolutional neural network architectures designed for mobile and embedded deployment. Depthwise separable convolutions were first developed by Laurent Sifre during an internship at Google Brain in 2013 and have since become central to widely used architectures including MobileNet, Xception, and EfficientNet.
Imagine you want to paint a picture using three colored pencils (red, blue, green). A regular convolution is like mixing all three pencils together in every stroke to create new colors. A depthwise separable convolution splits the job into two simpler steps. First, you draw with each pencil separately on its own layer (that is the depthwise step). Then, you stack those layers and blend them together to make new colors (that is the pointwise step). Because each step is simpler than doing everything at once, you finish the painting much faster and with far fewer strokes, but the result looks almost the same.
Standard convolutional layers apply filters that operate across both the spatial dimensions (height and width) and the channel dimension of the input simultaneously. For an input with many channels and a large number of output filters, the computation grows rapidly. Specifically, a standard convolution with a kernel of size D_K x D_K applied to an input with M channels to produce N output channels requires D_K x D_K x M x N multiply-accumulate operations per spatial position. As modern networks grew deeper and wider, this cost became prohibitive for deployment on devices with limited compute budgets such as smartphones, embedded systems, and IoT devices.
The core insight behind depthwise separable convolutions is that spatial filtering and channel mixing can be decoupled without significant loss in representational power. Instead of learning a single large filter that jointly captures spatial and cross-channel patterns, the operation is split into two lighter steps. As derived below, this reduces the total computation by a factor of 1/(1/N + 1/D_K^2), which approaches the square of the kernel size (about 9x for a 3x3 kernel) when the number of output channels N is large.
The following notation is used throughout this section. Let the input feature map have spatial dimensions D_F x D_F with M input channels. Let the convolutional kernel have spatial dimensions D_K x D_K, and let N denote the number of output channels.
A standard convolution applies a set of N filters, each of size D_K x D_K x M, to the input. The output feature map G at spatial position (i, j) for the k-th output channel is:
G(k, i, j) = sum over m, s, t of K(k, m, s, t) * F(m, i+s, j+t)
where K is the kernel tensor of shape N x M x D_K x D_K, and F is the input tensor of shape M x D_F x D_F.
The total computational cost (multiply-accumulate operations) is:
D_K x D_K x M x N x D_F x D_F
The total number of parameters (excluding bias) is:
D_K x D_K x M x N
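As a quick sanity check, both formulas can be evaluated in plain Python (the helper name is illustrative):

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-accumulates and parameter count for a standard convolution.

    d_k: kernel size, m: input channels, n: output channels,
    d_f: spatial size of the (square) output feature map.
    """
    macs = d_k * d_k * m * n * d_f * d_f
    params = d_k * d_k * m * n  # bias excluded
    return macs, params

# 3x3 kernel, 512 -> 512 channels, 14x14 feature map
macs, params = standard_conv_cost(3, 512, 512, 14)
print(macs, params)  # 462422016 2359296
```

These are the same figures that appear in the concrete comparison table later in this section.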
In the depthwise convolution step, a single 2D filter of size D_K x D_K is applied independently to each of the M input channels. No cross-channel mixing occurs at this stage. The output for channel m at spatial position (i, j) is:
G_dw(m, i, j) = sum over s, t of K_dw(m, s, t) * F(m, i+s, j+t)
where K_dw has shape M x D_K x D_K (one filter per channel).
Computational cost: D_K x D_K x M x D_F x D_F
Parameters: D_K x D_K x M
The pointwise convolution applies a 1x1 convolution that linearly combines the outputs of the depthwise stage across channels. This step produces N output channels:
G_pw(k, i, j) = sum over m of K_pw(k, m) * G_dw(m, i, j)
where K_pw has shape N x M.
Computational cost: M x N x D_F x D_F
Parameters: M x N
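In PyTorch terms, the two stages correspond to a grouped convolution with groups equal to the channel count, followed by a 1x1 convolution. A small sketch (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

M, N, D_K, D_F = 16, 32, 3, 8  # illustrative sizes

# Depthwise stage: groups=M gives one D_K x D_K filter per input channel.
depthwise = nn.Conv2d(M, M, D_K, padding=1, groups=M, bias=False)
# Pointwise stage: a 1x1 convolution mixing the M channels into N.
pointwise = nn.Conv2d(M, N, 1, bias=False)

x = torch.randn(1, M, D_F, D_F)
y = pointwise(depthwise(x))

print(tuple(y.shape))            # (1, 32, 8, 8)
print(depthwise.weight.numel())  # D_K * D_K * M = 144
print(pointwise.weight.numel())  # M * N = 512
```

The weight counts match the parameter formulas above: D_K x D_K x M for the depthwise stage and M x N for the pointwise stage.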
The combined cost is the sum of the depthwise and pointwise steps:
D_K x D_K x M x D_F x D_F + M x N x D_F x D_F
The total number of parameters (excluding bias) is:
D_K x D_K x M + M x N
The ratio of the depthwise separable cost to the standard convolution cost is:
(D_K^2 x M + M x N) / (D_K^2 x M x N) = 1/N + 1/D_K^2
For a typical configuration with N = 256 output channels and a 3x3 kernel (D_K = 3), this ratio is 1/256 + 1/9, which is approximately 0.115. This means the depthwise separable convolution uses roughly 8 to 9 times fewer operations than the standard convolution. For larger values of N, the savings become even greater.
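The ratio can be computed directly (the function name is illustrative):

```python
def separable_cost_ratio(d_k, n):
    """Depthwise separable cost divided by standard convolution cost."""
    return 1 / n + 1 / d_k ** 2

print(round(separable_cost_ratio(3, 256), 3))  # 0.115
print(round(separable_cost_ratio(3, 512), 3))  # 0.113
```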
The following table illustrates the difference for a concrete configuration: input spatial size 14 x 14, 512 input channels, 512 output channels, 3 x 3 kernel.
| Metric | Standard convolution | Depthwise separable convolution | Reduction factor |
|---|---|---|---|
| Parameters | 3 x 3 x 512 x 512 = 2,359,296 | (3 x 3 x 512) + (512 x 512) = 4,608 + 262,144 = 266,752 | ~8.8x fewer |
| Multiply-adds (per spatial map) | 3 x 3 x 512 x 512 x 14 x 14 = 462,422,016 | (3 x 3 x 512 x 14 x 14) + (512 x 512 x 14 x 14) = 903,168 + 51,380,224 = 52,283,392 | ~8.8x fewer |
As the table shows, depthwise separable convolutions achieve roughly 8.8x reduction in both parameters and computation for this configuration, closely matching the theoretical ratio of 1/N + 1/D_K^2 = 1/512 + 1/9 = 0.113.
The concept of separable convolutions has roots in classical signal processing, where separable filters decompose a 2D convolution into two 1D operations (spatial separability). Depthwise separable convolutions extend this idea to the channel dimension in neural networks.
Laurent Sifre developed depthwise separable convolutions during his internship at Google Brain in 2013. The work was inspired by research on transformation-invariant scattering by Sifre and Stéphane Mallat. Sifre applied the technique as an architectural modification to AlexNet, achieving a small improvement in accuracy, a large increase in convergence speed, and a notable reduction in model size. The approach was first presented publicly at ICLR 2014 by Vincent Vanhoucke.
The Inception architecture (GoogLeNet) introduced the idea of factorized convolutions through its Inception modules, which used parallel branches of 1x1, 3x3, and 5x5 convolutions. François Chollet observed that an Inception module could be interpreted as an intermediate step between a standard convolution and a depthwise separable convolution. Taking this reasoning to its extreme, Chollet proposed Xception ("Extreme Inception"), which replaced all Inception modules with depthwise separable convolutions. Xception was published at CVPR 2017 and demonstrated that it could slightly outperform Inception V3 on ImageNet while using the same number of parameters, indicating a more efficient use of model capacity.
The MobileNet family of architectures, developed at Google, made depthwise separable convolutions the standard building block for efficient mobile networks.
| Architecture | Year | Key innovation | Reference |
|---|---|---|---|
| MobileNet V1 | 2017 | Replaced standard convolutions with depthwise separable convolutions; introduced width multiplier and resolution multiplier for model scaling | Howard et al., 2017 |
| MobileNet V2 | 2018 | Introduced inverted residual blocks with linear bottlenecks; expansion layer before depthwise convolution | Sandler et al., 2018 |
| MobileNet V3 | 2019 | Combined neural architecture search (NAS) with squeeze-and-excitation modules and hard activation functions | Howard et al., 2019 |
EfficientNet, proposed by Mingxing Tan and Quoc V. Le in 2019, used depthwise separable convolutions as part of its Mobile Inverted Bottleneck Convolution (MBConv) blocks. EfficientNet introduced compound scaling, which uniformly scales network depth, width, and resolution using a single compound coefficient. The baseline architecture (EfficientNet-B0) was discovered through neural architecture search, and the compound scaling method was applied to generate a family of models (B0 through B7) that achieved state-of-the-art accuracy on ImageNet with fewer parameters and FLOPs than previous architectures.
In MobileNet V1, the first layer is a standard convolution, and all subsequent layers use depthwise separable convolutions. Each depthwise separable block consists of:
- a 3x3 depthwise convolution, followed by batch normalization and a ReLU nonlinearity;
- a 1x1 pointwise convolution, followed by batch normalization and a ReLU nonlinearity.
The full network contains 28 layers (13 depthwise convolutions and 13 pointwise convolutions, plus the initial standard convolution and a final fully connected layer). Replacing standard convolutions with depthwise separable convolutions yields an 8-9x reduction in computation with only approximately 1% reduction in classification accuracy on ImageNet.
MobileNet V2 introduced the inverted residual block, which differs from the standard residual block used in ResNet. In a standard residual block, the input is wide, compressed to a narrow bottleneck, and then expanded back. In the inverted residual block, the structure is reversed:
- a 1x1 pointwise convolution expands the narrow input to a wider representation (typically by a factor of 6);
- a 3x3 depthwise convolution performs spatial filtering in the expanded space;
- a 1x1 pointwise convolution projects back down to a narrow bottleneck.
The key insight is that ReLU activations in narrow (low-dimensional) layers can destroy information, so the projection layer uses a linear activation instead. The residual connections are placed between the thin bottleneck layers rather than the wide expansion layers.
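Under these design rules, an inverted residual block can be sketched in PyTorch as follows. This is a simplified illustration (fixed input/output channel count, hypothetical class name), not the exact MobileNet V2 reference code:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Narrow -> wide (expand) -> depthwise -> narrow (linear projection)."""
    def __init__(self, channels, expansion=6, stride=1):
        super().__init__()
        hidden = channels * expansion
        self.use_residual = stride == 1  # skip between thin bottlenecks
        self.block = nn.Sequential(
            # 1x1 expansion to a wider representation
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution in the expanded space
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 projection back to the narrow bottleneck
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),  # no ReLU here: linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 24, 16, 16)
print(tuple(InvertedResidual(24)(x).shape))  # (1, 24, 16, 16)
```

Note the absence of a nonlinearity after the final projection, reflecting the linear bottleneck design described above.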
MobileNet V3 further refined the block structure by incorporating:
- squeeze-and-excitation modules inside the bottleneck blocks;
- hard-swish and hard-sigmoid activation functions, which are cheaper to compute than their smooth counterparts;
- platform-aware neural architecture search to select layer configurations, along with a redesign of the computationally expensive first and last layers of the network.
The MBConv block in EfficientNet combines the inverted residual structure from MobileNet V2 with squeeze-and-excitation. The block structure is:
- a 1x1 expansion convolution;
- a depthwise convolution (3x3 or 5x5, depending on the stage);
- a squeeze-and-excitation module;
- a 1x1 projection convolution, with a residual connection when the input and output shapes match.
| Technique | Description | Typical use case |
|---|---|---|
| Standard convolution | Single filter operates across all spatial and channel dimensions jointly | General-purpose CNN architectures |
| Depthwise separable convolution | Factorized into depthwise (spatial) and pointwise (channel) steps | Mobile and efficient architectures |
| Grouped convolution | Input channels divided into groups; each group convolved independently | ResNeXt, ShuffleNet |
| Dilated (atrous) convolution | Inserts gaps between kernel elements to increase receptive field without increasing parameters | Semantic segmentation, DeepLab |
| Deformable convolution | Learns spatial offsets for sampling locations in the kernel | Object detection with geometric variations |
| 1x1 convolution | Pointwise convolution that mixes channels without spatial filtering | Channel reduction in Inception, bottleneck layers |
Grouped convolution generalizes both standard and depthwise convolution. When the number of groups equals 1, grouped convolution is the same as standard convolution. When the number of groups equals the number of input channels, grouped convolution is equivalent to depthwise convolution. Architectures like ShuffleNet combine grouped convolutions with channel shuffle operations to allow information flow between groups.
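In PyTorch, all three cases are instances of torch.nn.Conv2d with different values of the groups argument; the weight shapes make the relationship explicit:

```python
import torch.nn as nn

C = 16  # input (and output) channels; sizes are illustrative

standard  = nn.Conv2d(C, C, 3, groups=1, bias=False)  # one group
grouped   = nn.Conv2d(C, C, 3, groups=4, bias=False)  # four groups
depthwise = nn.Conv2d(C, C, 3, groups=C, bias=False)  # one group per channel

# Weight shape is (out_channels, in_channels // groups, 3, 3),
# so the parameter count shrinks by a factor of `groups`.
print(tuple(standard.weight.shape))   # (16, 16, 3, 3)
print(tuple(grouped.weight.shape))    # (16, 4, 3, 3)
print(tuple(depthwise.weight.shape))  # (16, 1, 3, 3)
```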
Chollet's Xception paper framed depthwise separable convolutions through what could be called the "Inception hypothesis." Standard Inception modules use multiple parallel branches (1x1, 3x3, 5x5 convolutions) that each operate on subsets of the input channels, then concatenate their outputs. This can be viewed as a sparse approximation of a full convolution.
The extreme version of this idea uses one branch per input channel, which is exactly a depthwise convolution followed by a pointwise convolution. There is one subtle difference: in Inception modules, the pointwise (1x1) convolution comes first (reducing channels), followed by spatial convolutions. In depthwise separable convolutions as used in MobileNet, the spatial convolution (depthwise) comes first, followed by the pointwise convolution. Chollet found that the order did not significantly affect performance.
Depthwise separable convolutions are used across a wide range of computer vision and deep learning tasks.
The MobileNet and EfficientNet families are among the most widely used architectures for image classification on resource-constrained devices. MobileNet V1 achieved 70.6% top-1 accuracy on ImageNet with only 4.2 million parameters and 569 million multiply-adds, compared to VGG-16 which requires 138 million parameters and 15.3 billion multiply-adds for 71.5% top-1 accuracy.
Depthwise separable convolutions are used in lightweight object detection frameworks. SSD (Single Shot MultiBox Detector) with a MobileNet backbone provides real-time object detection on mobile devices. The YOLO family has also adopted depthwise separable convolutions in some of its lightweight variants.
DeepLab V3+, proposed by Chen et al. in 2018, combined atrous (dilated) convolutions with depthwise separable convolutions in both its encoder and decoder modules. This variant, called "atrous separable convolution," achieved state-of-the-art results on the PASCAL VOC 2012 dataset (89.0% mIoU) and Cityscapes (82.1% mIoU) while significantly reducing computational cost compared to using standard convolutions.
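An atrous separable convolution in this spirit can be sketched by giving the depthwise stage a dilation rate (the sizes and rate below are illustrative, not the DeepLab configuration):

```python
import torch
import torch.nn as nn

C, N = 16, 32
rate = 2  # atrous/dilation rate

# Dilated depthwise stage: padding=rate preserves the spatial size for a
# 3x3 kernel while the effective receptive field grows from 3x3 to 5x5.
atrous_dw = nn.Conv2d(C, C, 3, padding=rate, dilation=rate,
                      groups=C, bias=False)
pw = nn.Conv2d(C, N, 1, bias=False)

x = torch.randn(1, C, 12, 12)
y = pw(atrous_dw(x))
print(tuple(y.shape))  # (1, 32, 12, 12)
```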
Although transformers have largely replaced convolutional approaches in NLP, depthwise separable convolutions have been used in some sequence modeling architectures. The paper "Depthwise Separable Convolutions for Neural Machine Translation" (Kaiser et al., 2018) applied the technique to machine translation, demonstrating that convolutional models with depthwise separable convolutions could achieve competitive performance with reduced computation.
Depthwise separable convolutions have been adopted in lightweight speech recognition and keyword spotting models designed for on-device deployment. These models need to run continuously on battery-powered devices, making computational efficiency essential.
While depthwise separable convolutions reduce the number of arithmetic operations (FLOPs), their actual speedup on hardware depends on several factors.
Depthwise convolutions have a low arithmetic intensity (ratio of compute operations to memory accesses). Each depthwise filter operates on a single channel, producing a small amount of output relative to the data that must be loaded from memory. On GPU hardware optimized for high-throughput parallel computation, this means the depthwise step is often memory-bound rather than compute-bound. The pointwise (1x1) convolution, while having more favorable compute characteristics, still involves accessing all input and output channels at every spatial position.
On GPUs, the depthwise convolution is typically mapped to a general matrix-vector multiplication (GEMV) rather than the more efficient general matrix-matrix multiplication (GEMM) used for standard convolutions. This results in lower hardware utilization. Several optimization strategies have been proposed to address this, including fusing the depthwise and pointwise operations into a single kernel to reduce intermediate memory accesses. Studies have shown that fusing both layers can achieve speedups of 1.7x to 2.0x compared to executing them separately.
Depthwise separable convolutions tend to perform better on mobile CPUs and specialized AI accelerators (such as TPUs and neural processing units) that are designed with memory-efficient data paths. Apple's Neural Engine, Qualcomm's Hexagon DSP, and Google's Edge TPU all include optimizations for depthwise separable operations.
Field-programmable gate arrays (FPGAs) have been used to build custom accelerators for depthwise separable convolutions. These implementations can exploit data reuse patterns specific to the factorized structure, achieving high energy efficiency for edge inference workloads.
Despite their advantages, depthwise separable convolutions have several known limitations.
By decoupling spatial filtering from channel mixing, depthwise separable convolutions lose the ability to learn joint spatial-channel features. This can reduce the model's representational capacity, particularly when the network is already small. In extremely compact configurations, the depthwise step may not have enough parameters to capture complex spatial patterns.
The depthwise step treats each input channel independently, which assumes that useful spatial features can be extracted from individual channels in isolation. In practice, some tasks benefit from cross-channel interactions during spatial filtering. For example, in audio processing tasks, depthwise separable convolutions have sometimes underperformed standard convolutions when fine-grained inter-channel dependencies are important.
Some practitioners have reported that replacing standard convolutions with depthwise separable convolutions can lead to training difficulties, including slower convergence or instability, particularly when the network architecture is not specifically designed for the factorized structure. Proper use of batch normalization, residual connections, and careful initialization can mitigate these issues.
While the accuracy gap between depthwise separable and standard convolutions is small in well-designed architectures (often less than 1-2% on ImageNet), it is not zero. For applications where maximum accuracy is required and compute budget is not a constraint, standard convolutions may still be preferred.
Several variants of the basic depthwise separable convolution have been proposed to address its limitations or adapt it to specific use cases.
| Variant | Description | Source |
|---|---|---|
| Atrous separable convolution | Combines dilated (atrous) convolution with depthwise separable structure for multi-scale feature extraction | Chen et al., 2018 (DeepLabV3+) |
| Channel shuffle | After grouped/depthwise convolution, channels are shuffled to enable cross-group information flow | Zhang et al., 2018 (ShuffleNet) |
| Inverted residual | Expands channels before depthwise convolution and projects to narrow bottleneck with linear activation | Sandler et al., 2018 (MobileNet V2) |
| Blueprint separable convolution | Rearranges the order and normalization of depthwise and pointwise steps for improved accuracy | Haase and Amthor, 2020 |
| Depth-multiplied depthwise convolution | Applies multiple filters per input channel in the depthwise step (depth multiplier > 1) | Howard et al., 2017 (MobileNet V1) |
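The depth multiplier in the last row maps onto torch.nn.Conv2d by setting the output channel count to a multiple of the input channel count while keeping groups equal to the input channel count (a sketch with illustrative sizes):

```python
import torch
import torch.nn as nn

M, depth_multiplier = 8, 2

# groups=M with out_channels = M * depth_multiplier applies
# `depth_multiplier` independent spatial filters to each input channel.
dw = nn.Conv2d(M, M * depth_multiplier, 3, padding=1,
               groups=M, bias=False)

x = torch.randn(1, M, 6, 6)
y = dw(x)
print(tuple(y.shape))          # (1, 16, 6, 6)
print(tuple(dw.weight.shape))  # (16, 1, 3, 3)
```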
Depthwise separable convolutions are supported by all major deep learning frameworks.
- TensorFlow/Keras: tf.nn.separable_conv2d and tf.keras.layers.SeparableConv2D implement the full depthwise separable operation. The depthwise step supports atrous (dilated) convolution as well.
- PyTorch: torch.nn.Conv2d with the groups parameter set equal to the number of input channels performs a depthwise convolution. A separate torch.nn.Conv2d with kernel size 1x1 performs the pointwise step. PyTorch does not have a single combined layer; both steps are typically composed in an nn.Sequential block or a small custom module.
- ONNX: the Conv operator supports grouped convolution, enabling depthwise convolution when the group count equals the input channel count.

A minimal PyTorch module composing the two steps:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3,
                 stride=1, padding=1):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels
        )
        # Pointwise: 1x1 convolution that mixes channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
```
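As a check against the comparison table earlier in this section (512 input channels, 512 output channels, 3x3 kernel), the parameter counts can be reproduced by composing the layers directly:

```python
import torch.nn as nn

M = N = 512

standard = nn.Conv2d(M, N, 3, padding=1, bias=False)
depthwise = nn.Conv2d(M, M, 3, padding=1, groups=M, bias=False)
pointwise = nn.Conv2d(M, N, 1, bias=False)

std_params = standard.weight.numel()
sep_params = depthwise.weight.numel() + pointwise.weight.numel()
print(std_params, sep_params)             # 2359296 266752
print(round(std_params / sep_params, 1))  # 8.8
```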