See also: Machine learning terms
A depthwise separable convolutional neural network (often abbreviated sepCNN) is a convolutional neural network that replaces standard convolution layers with depthwise separable convolutions. The depthwise separable convolution factors a single standard convolution into two smaller operations: a depthwise convolution that filters each input channel independently, followed by a pointwise convolution (a 1x1 convolution) that mixes the channels. This factorization keeps spatial filtering and channel mixing separate. It cuts both the parameter count and the number of multiply-add operations by roughly an order of magnitude for the kernel sizes used in typical computer vision networks, with only a small drop in accuracy on benchmarks such as ImageNet.
The most widely used sepCNN architectures are the MobileNet family, the Xception network, and the EfficientNet family. All three rely on depthwise separable convolutions as their main building block, and all three target settings where compute, memory, or energy is limited, such as edge AI and on-device inference.
A standard 2D convolution layer takes an input feature map of shape (D_F, D_F, M) and produces an output feature map of shape (D_F, D_F, N) using N filters of shape (D_K, D_K, M). Each output value is a sum across all M input channels and across the D_K by D_K spatial window. A depthwise separable convolution does the same job in two steps.
1. Depthwise convolution: applies one D_K by D_K filter per input channel. There is no mixing across channels at this stage, and the output has the same number of channels as the input.
2. Pointwise convolution: applies N filters of shape 1 by 1 by M to the depthwise output. This step mixes the M channels into the N desired output channels but does no spatial filtering.

The two steps together do the same kind of work as a full convolution. They learn spatial patterns and they learn cross-channel combinations. They just learn them with different parameters instead of with one shared parameter tensor.
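To make the two steps concrete, here is a small sketch using PyTorch's functional API. The shapes (M = 3 input channels, N = 8 output channels, a 3x3 kernel, a 32x32 feature map) are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)                  # (batch, M, D_F, D_F)

# Step 1: depthwise convolution -- one 3x3 filter per input channel.
# groups=M means no mixing across channels.
depthwise_w = torch.randn(3, 1, 3, 3)          # (M, 1, D_K, D_K)
y = F.conv2d(x, depthwise_w, padding=1, groups=3)
print(y.shape)                                  # (1, 3, 32, 32) -- still M channels

# Step 2: pointwise (1x1) convolution -- mixes the M channels into N outputs,
# with no spatial filtering.
pointwise_w = torch.randn(8, 3, 1, 1)           # (N, M, 1, 1)
z = F.conv2d(y, pointwise_w)
print(z.shape)                                  # (1, 8, 32, 32) -- N channels
```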
The table below uses the notation from the MobileNets paper, where D_K is the kernel size, D_F is the feature map spatial size, M is the number of input channels, and N is the number of output channels.
| Operation | Multiply-adds | Parameters |
|---|---|---|
| Standard convolution | D_K x D_K x M x N x D_F x D_F | D_K x D_K x M x N |
| Depthwise convolution | D_K x D_K x M x D_F x D_F | D_K x D_K x M |
| Pointwise convolution (1x1) | M x N x D_F x D_F | M x N |
| Depthwise separable (sum) | D_K x D_K x M x D_F x D_F + M x N x D_F x D_F | D_K x D_K x M + M x N |
Dividing the depthwise separable cost by the standard cost gives a clean reduction ratio:
(D_K^2 * M * D_F^2 + M * N * D_F^2) / (D_K^2 * M * N * D_F^2) = 1/N + 1/D_K^2
For a typical layer with a 3x3 kernel and N = 64 output channels, this works out to about 1/64 + 1/9 ~= 0.127, so the depthwise separable version uses roughly 1/8 of the multiply-adds of a standard 3x3 convolution. For larger output widths the savings grow toward 1/9 (about 89% fewer operations) for 3x3 kernels, or 1/25 (about 96% fewer) for 5x5 kernels (Howard et al., 2017).
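The ratio is easy to check numerically; the snippet below just evaluates the formula for a few kernel sizes and output widths:

```python
def reduction_ratio(d_k: int, n: int) -> float:
    """Multiply-add ratio of a depthwise separable conv to a standard conv:
    1/N + 1/D_K^2 (Howard et al., 2017)."""
    return 1 / n + 1 / d_k ** 2

print(reduction_ratio(3, 64))    # ~0.127, roughly 1/8 of the work
print(reduction_ratio(3, 1024))  # approaches 1/9 as N grows
print(reduction_ratio(5, 1024))  # approaches 1/25 for 5x5 kernels
```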
Consider a layer with 32 input channels, 64 output channels, a 3x3 kernel, and a 56x56 feature map.
| Quantity | Standard conv | Depthwise separable |
|---|---|---|
| Parameters | 18,432 | 2,336 |
| Multiply-adds | ~57.8M | ~7.3M |
| Reduction vs standard | 1.0x | ~7.9x |
The parameter saving comes mostly from the fact that the depthwise step does not pay the M x N channel-mixing cost, and the pointwise step does not pay the D_K x D_K spatial cost.
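The numbers in the table follow directly from the cost formulas; a short sanity check in Python:

```python
def conv_costs(d_k: int, m: int, n: int, d_f: int):
    """Parameter and multiply-add counts (bias ignored) in the MobileNets notation."""
    std_params = d_k * d_k * m * n
    sep_params = d_k * d_k * m + m * n
    std_macs = std_params * d_f * d_f
    sep_macs = sep_params * d_f * d_f
    return std_params, std_macs, sep_params, sep_macs

std_p, std_m, sep_p, sep_m = conv_costs(d_k=3, m=32, n=64, d_f=56)
print(std_p, sep_p)             # 18432 2336
print(std_m, sep_m)             # 57802752 7325696 (~57.8M vs ~7.3M)
print(round(std_m / sep_m, 1))  # 7.9
```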
The idea of separating spatial and channel mixing is older than its current name. Sifre and Mallat used separable convolutions in their 2013 work on rotation- and translation-invariant scattering for texture classification, published as Rigid-Motion Scattering for Texture Classification (Sifre and Mallat, 2014), and Sifre's PhD thesis at École Polytechnique generalized the construction further. Shortly afterwards, Inception V3 used factored 1xK and Kx1 convolutions, which are spatially separable in a different sense.
The deep learning community adopted the depthwise plus pointwise factorization in earnest with two papers in 2017. François Chollet's Xception paper (Chollet, 2017) framed the depthwise separable convolution as the extreme case of an Inception module, with one spatial tower per channel. Howard and colleagues at Google published MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications the same year (Howard et al., 2017), which made the design the standard choice for on-device computer vision.
| Architecture | Year | Key contribution | ImageNet top-1 (approx.) | Parameters (approx.) |
|---|---|---|---|---|
| MobileNet v1 | 2017 | First widely used sepCNN; width and resolution multipliers | 70.6% | 4.2M |
| MobileNet v2 | 2018 | Inverted residual blocks with linear bottlenecks | 72.0% | 3.4M |
| MobileNet v3 (Large) | 2019 | NAS-designed blocks, hard-swish activation, SE modules | 75.2% | 5.4M |
| Xception | 2017 | 36 depthwise separable conv layers with residual connections; replaces Inception V3 modules | 79.0% | 22.9M |
| EfficientNet-B0 | 2019 | MBConv blocks, compound scaling of depth, width, resolution | 77.1% | 5.3M |
| EfficientNet-B7 | 2019 | Same MBConv block scaled up | 84.3% | 66M |
| ShuffleNet v1 | 2018 | Channel shuffle plus group conv, related family | 67.6% | ~5M |
MobileNet v1 introduced two global hyperparameters that became standard practice: a width multiplier alpha that thins the channel count uniformly across all layers, and a resolution multiplier rho that shrinks the input image (and with it every internal feature map). Both let the same network template scale from server-class hardware down to small embedded devices without a redesign.
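A rough sketch of how the two multipliers scale the cost of a single depthwise separable layer, reusing the earlier example layer; the particular alpha and rho values here are arbitrary:

```python
def sep_conv_macs(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Multiply-adds of one depthwise separable layer under MobileNet v1's
    width multiplier alpha (thins the channels) and resolution multiplier rho
    (shrinks the feature map), following Howard et al. (2017)."""
    m, n = round(alpha * m), round(alpha * n)
    d_f = round(rho * d_f)
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

full = sep_conv_macs(3, 32, 64, 56)
thin = sep_conv_macs(3, 32, 64, 56, alpha=0.5, rho=160 / 224)  # half the width, 160px-style input
print(full, thin, round(full / thin, 1))  # 7325696 1049600 7.0
```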
MobileNet v2 wraps the depthwise separable block in an inverted residual with a linear bottleneck. The block first expands a thin tensor with a 1x1 convolution, runs a depthwise 3x3 in the expanded space, then projects back to a thin tensor with another 1x1 convolution. The final 1x1 has no ReLU on its output, which Sandler et al. argue is necessary to preserve information in low-dimensional manifolds (Sandler et al., 2018).
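A minimal PyTorch sketch of that block structure, simplified to stride 1 and equal input and output channel counts (the real block also handles stride 2 and channel changes, and then drops the residual):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet v2-style inverted residual (Sandler et al., 2018):
    1x1 expand -> depthwise 3x3 -> 1x1 linear projection. Simplified sketch."""
    def __init__(self, channels: int, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),    # expand to a wide tensor
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),           # depthwise 3x3 in the wide space
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),     # project back down
            nn.BatchNorm2d(channels),                        # deliberately no ReLU here
        )

    def forward(self, x):
        return x + self.block(x)                             # residual between the thin tensors
```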
MobileNet v3 was discovered in part through hardware-aware neural architecture search (NAS), then refined by hand. It adds Squeeze-and-Excitation layers in some blocks and uses the hard-swish activation, a piecewise approximation of swish that is cheaper to evaluate on mobile CPUs (Howard et al., 2019).
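The hard-swish activation is simple enough to state directly; PyTorch also ships it as nn.Hardswish:

```python
import torch
import torch.nn.functional as F

def hard_swish(x: torch.Tensor) -> torch.Tensor:
    """hard-swish(x) = x * ReLU6(x + 3) / 6, the cheap piecewise approximation
    of swish used in MobileNet v3 (Howard et al., 2019)."""
    return x * F.relu6(x + 3.0) / 6.0
```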
Xception keeps a more traditional shape with a stack of depthwise separable convolutions and residual connections. With essentially the same parameter budget as Inception V3, it slightly improves top-1 accuracy on ImageNet and significantly improves it on a 350-million-image internal Google dataset (Chollet, 2017).
EfficientNet uses the MBConv block, which is a direct descendant of the MobileNet v2 inverted residual. The contribution of EfficientNet is compound scaling: instead of growing only depth, only width, or only input resolution, it scales all three together with a single coefficient. The EfficientNet-B0 baseline has about 5.3M parameters and 77.1% top-1 accuracy on ImageNet, while EfficientNet-B7 reaches 84.3% (Tan and Le, 2019).
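The scaling rule itself fits in a few lines; the base coefficients below are the ones reported in the EfficientNet paper, and phi is the user-chosen budget knob:

```python
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """EfficientNet compound scaling (Tan and Le, 2019): depth, width, and
    resolution all grow with one coefficient phi, with the base coefficients
    found by grid search under the constraint alpha * beta**2 * gamma**2 ~= 2."""
    return alpha ** phi, beta ** phi, gamma ** phi  # depth, width, resolution multipliers

print(compound_scaling(phi=1.0))  # roughly the step from B0 to B1
```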
Most frameworks expose depthwise and pointwise convolutions as first-class operations.
| Framework | Depthwise convolution | Pointwise convolution | Combined op |
|---|---|---|---|
| TensorFlow / Keras | tf.keras.layers.DepthwiseConv2D | tf.keras.layers.Conv2D(kernel_size=1) | tf.keras.layers.SeparableConv2D |
| PyTorch | nn.Conv2d(..., groups=in_channels) | nn.Conv2d(..., kernel_size=1) | None as a single layer; usually composed |
| JAX / Flax | flax.linen.Conv(feature_group_count=in_channels) | flax.linen.Conv(kernel_size=(1,1)) | composed |
| ONNX | Conv op with group = C_in | Conv op with kernel_shape = [1,1] | composed |
In PyTorch the depthwise step is implemented by the standard Conv2d layer with the groups argument set equal to the number of input channels. Each input channel then has its own filter, and the output channel count must be a multiple of groups.
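A minimal composed version in PyTorch, with just the two convolutions and no normalization or activation:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution built from two standard Conv2d layers:
    groups=in_channels for the depthwise step, kernel_size=1 for the pointwise step."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

layer = DepthwiseSeparableConv(32, 64)
print(sum(p.numel() for p in layer.parameters()))  # 2336, matching the worked example above
```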
Depthwise separable convolutions show up in almost every domain that wants high-quality vision under a tight compute or memory budget.
Depthwise separable convolutions are one of several recurring tricks for shrinking CNNs. The table groups them by what they exploit.
| Technique | What it changes | Representative work |
|---|---|---|
| Depthwise separable conv | Factors spatial and channel mixing | MobileNet, Xception, EfficientNet |
| Group convolution | Splits channels into groups, each processed independently | AlexNet, ResNeXt |
| Channel shuffle | Mixes group-conv channels between layers | ShuffleNet |
| Squeeze-and-Excitation | Learned per-channel gating | SENet, MobileNet v3 |
| Bottleneck block | 1x1 down, 3x3, 1x1 up | ResNet, MobileNet v2 (inverted) |
| Quantization | Lower precision weights and activations | TFLite int8, quantization-aware training |
| Pruning | Removes redundant weights or channels | Han et al., NetAdapt |
| Knowledge distillation | Trains a small student model to match a larger teacher | Hinton et al. (2015) |
These methods compose well. EfficientNet, for example, layers depthwise separable MBConv blocks, Squeeze-and-Excitation, and compound scaling on top of one another. Production deployments often add quantization and sometimes pruning on top of an already efficient sepCNN backbone.
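As one example of how these pieces plug together, here is a sketch of the Squeeze-and-Excitation gate that MobileNet v3 and EfficientNet insert after the depthwise step of their blocks; the reduction ratio of 4 is a common choice, not a requirement:

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation gate (Hu et al., 2018): global average pool,
    a small bottleneck, then per-channel sigmoid scaling of the input."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(self.pool(x))
```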
The theoretical FLOP savings of depthwise separable convolutions do not always translate into proportional latency savings on real hardware. Depthwise convolutions do little arithmetic per byte of memory traffic, so on GPUs and many accelerators they tend to be memory-bound and run at a lower fraction of peak throughput than dense convolutions, which is why measured speedups usually fall short of the FLOP ratio.
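Counting FLOPs is therefore only a starting point; measuring on the target device is what matters. A crude CPU-only sketch of such a comparison (real measurements need the actual deployment hardware, a proper warmup, and a profiler):

```python
import time
import torch
import torch.nn as nn

def bench(layer: nn.Module, x: torch.Tensor, iters: int = 50) -> float:
    """Average wall-clock seconds per forward pass, with a short warmup."""
    with torch.no_grad():
        for _ in range(5):
            layer(x)
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 32, 56, 56)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(32, 32, 3, padding=1, groups=32, bias=False),  # depthwise
    nn.Conv2d(32, 64, 1, bias=False),                         # pointwise
)
print(bench(standard, x), bench(separable, x))  # the measured gap is usually well below ~7.9x
```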
Depthwise separable convolutions remain the workhorse of efficient computer vision, but the broader landscape has shifted since 2020. Vision Transformers (ViT) (Dosovitskiy et al., 2021) split the image into patches and apply self-attention rather than convolution. Pure ViTs scale impressively at the high end but are usually less efficient than sepCNN backbones in the small-model regime where MobileNet and EfficientNet live.
Hybrid models try to get both. MobileViT (Mehta and Rastegari, 2022) interleaves MobileNet-style depthwise separable blocks with small attention blocks. EfficientFormer and Mobile-Former follow the same pattern. ConvNeXt (Liu et al., 2022) returns to a pure-convolution design but uses depthwise convolutions with large 7x7 kernels in a transformer-like macro layout, taking inspiration from the Swin Transformer. EfficientNetV2 (Tan and Le, 2021) replaces the MBConv block with a Fused-MBConv in early stages, where the depthwise plus 1x1 expansion is collapsed back into a single 3x3 convolution, because on TPU and GPU the extra arithmetic intensity beats the FLOP savings.
In practice this means a vision model in 2025 is rarely only sepCNN, but it is also rarely no sepCNN. The factorization is a default tool, used where it pays off and dropped where it does not.
Imagine you have a stack of colored transparency sheets, one red, one green, one blue, and you want to make a new picture out of them.
The normal way is to look at all three sheets at once, in every little square, and decide what the new color should be. That is a lot of looking.
The depthwise separable way is sneakier. First, look at each colored sheet on its own and decide what shape patterns are on it. Just shapes, no mixing. Then, in a second step, decide how to mix the red, green, and blue answers into the final color, but only for one tiny square at a time, with no shape work.
You get a similar picture, but you did far less staring. That is why phones use this trick to recognize cats, faces, and street signs without their batteries dying.