MobileNet is a family of efficient convolutional neural network (CNN) architectures developed by Google for mobile and edge AI applications. First introduced in 2017 by Andrew G. Howard and colleagues, MobileNet was designed to deliver competitive accuracy on computer vision tasks while operating within the strict computational, memory, and power constraints of smartphones, embedded systems, and IoT devices. The architecture's core innovation, depthwise separable convolutions, dramatically reduces the number of parameters and floating-point operations compared to standard convolutions. Over the years, the MobileNet family has grown through four major versions, each introducing new architectural ideas and improving the accuracy-efficiency tradeoff.
As of 2025, the MobileNet family spans four generations: MobileNetV1 (2017), MobileNetV2 (2018), MobileNetV3 (2019), and MobileNetV4 (2024). These models have become foundational building blocks for on-device machine learning, powering applications from image classification and object detection to semantic segmentation and pose estimation on billions of devices worldwide.
The rise of deep learning in computer vision brought models that achieved remarkable accuracy on benchmarks such as ImageNet, but these models often required billions of floating-point operations and hundreds of millions of parameters. Networks like VGG, ResNet, and Inception were designed primarily for server-side inference with powerful GPUs. Deploying such models on mobile phones, drones, autonomous vehicles, or wearable devices posed serious challenges due to limited compute power, memory, battery life, and thermal constraints.
Before MobileNet, several approaches attempted to address model efficiency, including network pruning, quantization, and knowledge distillation. However, these methods were typically applied as post-processing steps to already-large models. MobileNet took a fundamentally different approach: designing an architecture from the ground up to be efficient, using depthwise separable convolutions as the primary building block.
The original idea behind depthwise separable convolutions can be traced back to Laurent Sifre's work during an internship at Google Brain in 2013, where he explored factorized convolutions as an architectural variation to improve convergence speed and reduce model size. This concept was later refined and formalized in the Xception architecture by François Chollet and ultimately in the MobileNet paper series.
MobileNetV1 was introduced in the paper "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, all affiliated with Google. The paper was published on arXiv in April 2017.
The central contribution of MobileNetV1 is the systematic use of depthwise separable convolutions to replace standard convolutions throughout the network. This single design choice reduces the computational cost by a factor of 8 to 9 while incurring only a small loss in accuracy (roughly 1% on ImageNet).
A standard convolution simultaneously filters and combines input features into new output features in a single step. For a standard convolutional layer with a kernel size of D_K x D_K, M input channels, N output channels, and a spatial feature map of size D_F x D_F, the computational cost is:
Standard convolution cost: D_K x D_K x M x N x D_F x D_F multiply-adds
Depthwise separable convolutions factorize this operation into two distinct steps:

1. A depthwise convolution applies a single D_K x D_K filter to each input channel independently, filtering spatial information without combining channels. Its cost is D_K x D_K x M x D_F x D_F multiply-adds.
2. A pointwise convolution then applies a 1x1 kernel to combine the M filtered channels into N output channels. Its cost is M x N x D_F x D_F multiply-adds.
The total cost of a depthwise separable convolution is:
Depthwise separable cost: D_K x D_K x M x D_F x D_F + M x N x D_F x D_F
The reduction ratio compared to standard convolution is 1/N + 1/D_K^2. For MobileNet's 3x3 kernels, this yields roughly an 8 to 9 times reduction in computation. Each depthwise separable convolution block in MobileNetV1 includes batch normalization and ReLU activation after both the depthwise and pointwise layers.
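The cost arithmetic above is easy to verify directly. The sketch below uses hypothetical layer sizes (not taken from a specific MobileNet layer) to show that the ratio comes out to 1/N + 1/D_K^2:

```python
def conv_cost(dk, m, n, df):
    """Multiply-adds for a standard dk x dk convolution."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Multiply-adds for a depthwise (dk x dk) + pointwise (1x1) pair."""
    return dk * dk * m * df * df + m * n * df * df

# Illustrative layer sizes: 3x3 kernel, 512 in/out channels, 14x14 feature map
dk, m, n, df = 3, 512, 512, 14
ratio = separable_cost(dk, m, n, df) / conv_cost(dk, m, n, df)
print(f"cost ratio: {ratio:.4f}")        # equals 1/n + 1/dk**2
print(f"reduction:  {1 / ratio:.1f}x")   # roughly 8.8x for 3x3 kernels
```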
The MobileNetV1 architecture consists of 28 layers. The first layer is a standard 3x3 convolution with 32 filters and stride 2, followed by 13 depthwise separable convolution blocks (each counting as two layers: one depthwise and one pointwise). The network concludes with a global average pooling layer (7x7) and a fully connected layer with 1,000 outputs for ImageNet classification.
Spatial downsampling is handled by depthwise convolutions with a stride of 2 at selected layers. Each time the spatial dimension is halved, the number of channels doubles to compensate for the loss of spatial information. The channel progression goes from 32 to 64, 128, 256, 512, and finally 1,024.
MobileNetV1 introduces two hyperparameters that allow practitioners to trade off latency and accuracy according to their deployment requirements:

1. The width multiplier (alpha) uniformly thins the network, scaling the number of input and output channels at each layer by alpha. Typical values are 1.0, 0.75, 0.5, and 0.25, and computational cost scales roughly with alpha^2.
2. The resolution multiplier (rho) scales the input image resolution and, consequently, every internal feature map. Cost scales with rho^2; in practice rho is set implicitly by choosing an input resolution such as 224, 192, 160, or 128.
These multipliers allow MobileNetV1 to cover a wide range of deployment scenarios, from high-accuracy server-class models to extremely compact models for the most resource-constrained devices.
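As a sketch of how the two multipliers interact (layer sizes here are hypothetical), the per-layer cost of a depthwise separable convolution with width multiplier alpha and resolution multiplier rho can be written out and compared:

```python
def separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Depthwise separable cost with width (alpha) and resolution (rho) multipliers."""
    m, n, df = alpha * m, alpha * n, rho * df
    return dk * dk * m * df * df + m * n * df * df

base = separable_cost(3, 512, 512, 14)
half_width = separable_cost(3, 512, 512, 14, alpha=0.5)
half_res = separable_cost(3, 512, 512, 14, rho=0.5)

# The pointwise term dominates, so alpha=0.5 cuts cost roughly 4x;
# rho=0.5 cuts it exactly 4x, since every term scales with rho**2.
print(half_width / base, half_res / base)
```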
| Configuration | Top-1 Accuracy (%) | Multiply-Adds (M) | Parameters (M) |
|---|---|---|---|
| MobileNetV1 1.0, 224 | 70.6 | 569 | 4.2 |
| MobileNetV1 0.75, 224 | 68.4 | 325 | 2.6 |
| MobileNetV1 0.50, 224 | 63.7 | 149 | 1.3 |
| MobileNetV1 0.25, 224 | 50.6 | 41 | 0.5 |
At full width and 224x224 resolution, MobileNetV1 achieved 70.6% top-1 accuracy on ImageNet with only 569 million multiply-add operations and 4.2 million parameters. For comparison, VGGNet-16 requires roughly 15.3 billion multiply-adds and 138 million parameters to achieve 71.5% top-1 accuracy, making MobileNetV1 roughly 32 times smaller and 27 times less computationally expensive.
MobileNetV2 was introduced in "MobileNetV2: Inverted Residuals and Linear Bottlenecks" by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. The paper was published at CVPR 2018 in Salt Lake City.
MobileNetV2 builds on the foundation of depthwise separable convolutions from V1 but introduces two key architectural innovations: inverted residual blocks and linear bottlenecks. These changes improve both accuracy and efficiency.
Traditional residual networks use a "wide-narrow-wide" bottleneck structure: the input is first projected to a lower-dimensional space, processed with a spatial convolution, and then projected back to a higher-dimensional space. The skip connection bridges the wide layers. MobileNetV2 inverts this pattern with a "narrow-wide-narrow" structure:

1. A 1x1 pointwise convolution expands the narrow input to a higher-dimensional space by an expansion factor t (typically 6), followed by ReLU6.
2. A 3x3 depthwise convolution filters the expanded representation spatially, followed by ReLU6.
3. A 1x1 pointwise convolution projects the features back down to a narrow bottleneck, with no nonlinearity.
Shortcut connections are placed between the thin bottleneck layers (the inputs and outputs of the block), not between the expanded layers. This is the opposite of traditional residual networks, hence the name "inverted residual." The inverted design is more memory-efficient because the skip connections carry low-dimensional tensors, reducing peak memory usage during inference.
A critical insight in MobileNetV2 is that applying a nonlinear activation (such as ReLU) in the narrow bottleneck layers destroys information. When the feature space is low-dimensional, ReLU can zero out a significant portion of the information, collapsing the learned manifold. To address this, MobileNetV2 removes the nonlinearity from the output of the bottleneck projection layer, using a linear activation instead. This preserves the representational power of the low-dimensional features. ReLU6 activation is used only after the expansion and depthwise convolution layers, where the feature space is high-dimensional enough to tolerate the information loss.
The MobileNetV2 architecture begins with a standard 32-filter convolution layer and is followed by 19 inverted residual bottleneck layers. The architecture table uses the notation t for expansion factor, c for output channels, n for number of repeated blocks, and s for stride of the first block in each stage.
| Input Resolution | Operator | t | c | n | s |
|---|---|---|---|---|---|
| 224 x 224 x 3 | conv2d | - | 32 | 1 | 2 |
| 112 x 112 x 32 | bottleneck | 1 | 16 | 1 | 1 |
| 112 x 112 x 16 | bottleneck | 6 | 24 | 2 | 2 |
| 56 x 56 x 24 | bottleneck | 6 | 32 | 3 | 2 |
| 28 x 28 x 32 | bottleneck | 6 | 64 | 4 | 2 |
| 14 x 14 x 64 | bottleneck | 6 | 96 | 3 | 1 |
| 14 x 14 x 96 | bottleneck | 6 | 160 | 3 | 2 |
| 7 x 7 x 160 | bottleneck | 6 | 320 | 1 | 1 |
| 7 x 7 x 320 | conv2d 1x1 | - | 1280 | 1 | 1 |
The network concludes with a 1x1 convolution expanding to 1,280 channels, followed by global average pooling and a classification layer.
| Configuration | Top-1 Accuracy (%) | Multiply-Adds (M) | Parameters (M) |
|---|---|---|---|
| MobileNetV2 1.4, 224 | 75.0 | 582 | 6.06 |
| MobileNetV2 1.0, 224 | 71.8 | 300 | 3.47 |
| MobileNetV2 0.75, 224 | 69.8 | 209 | 2.61 |
| MobileNetV2 0.50, 224 | 65.4 | 97 | 1.95 |
| MobileNetV2 0.35, 224 | 60.3 | 59 | 1.66 |
At 1.0 width multiplier and 224x224 resolution, MobileNetV2 achieves 71.8% top-1 accuracy on ImageNet with only 300 million multiply-adds and 3.47 million parameters. Compared to MobileNetV1 (70.6% accuracy, 569M MAdds, 4.2M parameters), V2 is both more accurate and requires roughly 47% fewer multiply-adds. On a Google Pixel phone, MobileNetV2 runs 30 to 40% faster than MobileNetV1.
The paper also introduced SSDLite (a lightweight version of SSD for object detection that replaces standard convolutions in the detection head with depthwise separable convolutions) and Mobile DeepLabv3 (a lightweight version of DeepLabv3 for semantic segmentation), demonstrating MobileNetV2's versatility as a feature extraction backbone for multiple tasks.
MobileNetV3 was introduced in "Searching for MobileNetV3" by Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. The paper was published at ICCV 2019.
MobileNetV3 represents a shift in methodology: instead of relying purely on manual architectural design, it combines hardware-aware neural architecture search (NAS) with hand-crafted refinements. The result is two model variants, MobileNetV3-Large and MobileNetV3-Small, targeting high and low resource use cases respectively.
MobileNetV3's architecture was discovered through a two-stage process:

1. Platform-aware neural architecture search (building on MnasNet) determines the global network structure, optimizing for accuracy under a measured-latency constraint on target hardware.
2. The NetAdapt algorithm then fine-tunes the number of filters in each layer individually, repeatedly proposing small per-layer reductions and keeping the ones that best preserve accuracy for a given latency saving.
This combination of global automated search and local layer-by-layer refinement produces architectures that are well-tuned to specific hardware targets.
MobileNetV3 replaces the sigmoid and swish activation functions with hard versions that are more computationally efficient:

1. Hard sigmoid: h-sigmoid(x) = ReLU6(x + 3) / 6, a piecewise linear approximation of the sigmoid.
2. Hard swish: h-swish(x) = x * ReLU6(x + 3) / 6, the swish function with its internal sigmoid replaced by the hard sigmoid.
The swish activation function (x * sigmoid(x)) improves accuracy compared to ReLU but is computationally expensive because of the sigmoid operation, which requires computing an exponential. The hard swish approximation replaces the sigmoid with a piecewise linear function (hard sigmoid), maintaining the accuracy benefits while being much cheaper to compute on mobile hardware. Hard swish is applied in the deeper layers of the network (those with 80 or more channels) where it has the greatest impact on accuracy, while ReLU is retained in the earlier layers where it is sufficient and less costly.
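The two hard activations reduce to a handful of comparisons and one division, which is what makes them cheap on mobile hardware. A scalar sketch:

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)

def hard_sigmoid(x):
    # Piecewise linear stand-in for sigmoid: 0 below -3, 1 above +3, linear between
    return relu6(x + 3.0) / 6.0

def hard_swish(x):
    # x * sigmoid(x) with the sigmoid replaced by its hard approximation
    return x * hard_sigmoid(x)
```

For |x| >= 3 the function is exactly 0 or the identity, so no exponential is ever computed.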
MobileNetV3 incorporates squeeze-and-excitation (SE) modules into its building blocks. Originally proposed by Hu et al. (2018), SE modules enable the network to adaptively recalibrate channel-wise feature responses by modeling interdependencies between channels:

1. Squeeze: global average pooling compresses each channel's spatial information into a single scalar.
2. Excitation: a small two-layer bottleneck network maps these scalars to a per-channel weight between 0 and 1 (MobileNetV3 uses the hard sigmoid here in place of the original sigmoid).
3. Scale: each channel of the feature map is multiplied by its learned weight.
In MobileNetV3, the SE bottleneck size is fixed at 1/4 of the number of channels in the expansion layer. This provides a meaningful accuracy improvement with a modest increase in parameter count and no discernible latency cost on most hardware.
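The squeeze-excite-scale computation can be sketched in NumPy. Weight shapes here are hypothetical, with the 1/4 bottleneck ratio that MobileNetV3 uses:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on an (H, W, C) feature map.

    w1: (C, C // 4) squeeze FC, w2: (C // 4, C) excite FC --
    the 1/4 bottleneck ratio used by MobileNetV3.
    """
    s = x.mean(axis=(0, 1))                              # squeeze: global avg pool -> (C,)
    h = np.maximum(s @ w1 + b1, 0.0)                     # bottleneck FC + ReLU
    gate = np.clip((h @ w2 + b2 + 3.0) / 6.0, 0.0, 1.0)  # hard sigmoid gate in [0, 1]
    return x * gate                                      # scale each channel
```

Because the gate lies in [0, 1], the block can only attenuate channels, never amplify them, which is why it acts as a lightweight channel-attention mechanism.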
Beyond NAS and NetAdapt, the MobileNetV3 paper introduces several manual architectural refinements:

1. The expensive last stage is redesigned: the final 1x1 expansion convolution is moved after global average pooling so that it operates on 1x1 rather than 7x7 feature maps, and the projection and depthwise layers of the preceding bottleneck are removed. This reduces latency noticeably with no loss in accuracy.
2. The initial convolution is reduced from 32 to 16 filters and paired with h-swish, matching the accuracy of the 32-filter ReLU baseline at lower cost.
| Model | Top-1 Accuracy (%) | Multiply-Adds (M) | Parameters (M) |
|---|---|---|---|
| MobileNetV3-Large 1.0 | 75.2 | 217 | 5.4 |
| MobileNetV3-Large 0.75 | 73.3 | 155 | 4.0 |
| MobileNetV3-Small 1.0 | 67.5 | 66 | 2.9 |
| MobileNetV3-Small 0.75 | 65.4 | 44 | 2.4 |
| MobileNetV2 1.0 (baseline) | 71.8 | 300 | 3.47 |
MobileNetV3-Large achieves 75.2% top-1 accuracy with only 217 million multiply-adds, a significant improvement over MobileNetV2's 71.8% at 300M MAdds. MobileNetV3-Large is 3.2% more accurate while reducing latency by about 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate than the comparable MobileNetV2 variant while reducing latency by about 5%.
Google also released "minimalistic" variants of MobileNetV3 that remove SE modules and h-swish for deployment on hardware where these operations are not well supported. MobileNetV3-Large minimalistic achieves 72.3% top-1 accuracy with 209M multiply-adds and 3.9M parameters.
For 8-bit quantized inference, MobileNetV3-Large retains 73.9% top-1 accuracy (a drop of 1.3 percentage points from floating-point), demonstrating good quantization friendliness.
MobileNetV4 was introduced in "MobileNetV4: Universal Models for the Mobile Ecosystem" by Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, and Andrew Howard at Google. The paper was published at ECCV 2024.
MobileNetV4 was designed with a focus on universal efficiency: performing well not just on a single hardware platform but across a wide range of mobile accelerators, including mobile CPUs, GPUs, DSPs, Apple Neural Engine, and Google Pixel EdgeTPU. Prior MobileNet generations were typically optimized for specific hardware targets, but V4 aims for Pareto optimality across the entire mobile hardware ecosystem.
The core building block of MobileNetV4 is the Universal Inverted Bottleneck (UIB), a flexible and unified structure that generalizes several existing block designs:

1. The inverted bottleneck (IB) of MobileNetV2
2. The ConvNext-style block, with a depthwise convolution before the expansion
3. The feed-forward network (FFN) block used in vision transformers
4. A new Extra Depthwise (ExtraDW) variant with depthwise convolutions in both positions
The UIB extends the inverted bottleneck by introducing two optional depthwise convolutions: one before the expansion layer and one between the expansion and projection layers. Whether these depthwise convolutions are present is determined through NAS optimization, allowing the search to select the most efficient block type for each position in the network.
MobileNetV4 introduces a hybrid architecture that combines convolutions with attention mechanisms. The Mobile MQA block is an attention mechanism tailored for mobile accelerators. It uses shared keys and values across attention heads (multi-query attention), which reduces memory bandwidth requirements and improves operational intensity. Mobile MQA delivers a 39% inference speedup compared to standard multi-head self-attention while preserving accuracy.
MobileNetV4 also introduces an improved NAS recipe that enhances search effectiveness. The refined search process, combined with UIB and Mobile MQA, produces models that are Pareto optimal across a wide range of hardware platforms.
The paper introduces a novel knowledge distillation recipe that mixes datasets with different augmentations and adds balanced in-class data. Using this technique, the MobileNetV4-Hybrid-Large model achieves 87.0% top-1 accuracy on ImageNet-1K with just 3.8ms latency on the Pixel 8 EdgeTPU. The distilled model is 39 times smaller in MACs than its teacher while being only 0.5% less accurate.
| Model | Top-1 Accuracy (%) | MACs (G) | Parameters (M) | Pixel 6 CPU (ms) | Pixel 8 EdgeTPU (ms) | Samsung S23 GPU (ms) |
|---|---|---|---|---|---|---|
| MNv4-Conv-S | 73.8 | 0.2 | 3.8 | 2.4 | 0.7 | 2.0 |
| MNv4-Conv-M | 79.9 | 1.0 | 9.2 | 11.4 | 1.1 | 4.1 |
| MNv4-Conv-L | 82.9 | 5.9 | 31.0 | 59.9 | 2.4 | 13.2 |
| MNv4-Hybrid-M | 80.7 | 1.2 | 10.5 | 14.3 | 1.5 | 5.9 |
| MNv4-Hybrid-L | 83.4 | 7.2 | 35.9 | 87.6 | 3.8 | 18.1 |
| MNv4-Hybrid-L (distilled) | 87.0 | 7.2 | 35.9 | 87.6 | 3.8 | 18.1 |
The following table summarizes the key specifications across all MobileNet versions at their default configurations on ImageNet-1K classification.
| Model | Year | Top-1 Accuracy (%) | Multiply-Adds / MACs | Parameters (M) | Key Innovation |
|---|---|---|---|---|---|
| MobileNetV1 1.0 | 2017 | 70.6 | 569M | 4.2 | Depthwise separable convolutions |
| MobileNetV2 1.0 | 2018 | 71.8 | 300M | 3.47 | Inverted residuals, linear bottlenecks |
| MobileNetV3-Large 1.0 | 2019 | 75.2 | 217M | 5.4 | NAS + NetAdapt, h-swish, SE modules |
| MobileNetV3-Small 1.0 | 2019 | 67.5 | 66M | 2.9 | Compact NAS-optimized variant |
| MNv4-Conv-S | 2024 | 73.8 | 200M | 3.8 | Universal Inverted Bottleneck |
| MNv4-Conv-M | 2024 | 79.9 | 1.0G | 9.2 | UIB + optimized NAS |
| MNv4-Hybrid-L | 2024 | 83.4 | 7.2G | 35.9 | UIB + Mobile MQA attention |
| MNv4-Hybrid-L (distilled) | 2024 | 87.0 | 7.2G | 35.9 | Enhanced distillation recipe |
From V1 to V4, the MobileNet family has evolved from a purely convolutional architecture to a hybrid convolutional-attention design. The accuracy on ImageNet has climbed from 70.6% (V1) to 87.0% (V4 with distillation), while the architectural innovations have progressively expanded from depthwise separable convolutions to inverted residuals, NAS-designed structures, and attention mechanisms.
EfficientNet, introduced by Mingxing Tan and Quoc V. Le in 2019, takes a different approach to model efficiency. Rather than designing novel convolution types, EfficientNet uses a compound scaling method that simultaneously scales network depth (number of layers) as alpha^phi, width (number of channels) as beta^phi, and input resolution as gamma^phi, where phi is the compound coefficient and the base coefficients, found by grid search, satisfy the constraint alpha * beta^2 * gamma^2 approximately equals 2, so that total FLOPs grow by roughly 2^phi. The baseline EfficientNet-B0 model was discovered through NAS and uses mobile inverted bottleneck convolution (MBConv) blocks, which are the same building blocks as MobileNetV2.
| Model | Top-1 Accuracy (%) | FLOPs (M) | Parameters (M) |
|---|---|---|---|
| MobileNetV2 1.0 | 71.8 | 300 | 3.47 |
| MobileNetV3-Large 1.0 | 75.2 | 217 | 5.4 |
| EfficientNet-B0 | 77.1 | 390 | 5.3 |
| EfficientNet-B1 | 79.8 | 700 | 7.8 |
| MNv4-Conv-M | 79.9 | 1,000 | 9.2 |
EfficientNet-B0 achieves higher accuracy than MobileNetV3-Large (77.1% vs. 75.2%) with a similar parameter count but at a higher computational cost (390M vs. 217M FLOPs). However, MobileNet models are generally faster in actual inference on mobile hardware because they are explicitly optimized for hardware-specific latency, whereas EfficientNet's compound scaling optimizes primarily for FLOPs. FLOPs alone do not perfectly predict on-device speed, because factors such as memory access patterns, operator support, and hardware-specific optimizations affect actual inference latency. MobileNetV4 closes this accuracy gap and surpasses EfficientNet through its UIB blocks and hybrid attention mechanisms.
ShuffleNet, developed by Xiangyu Zhang and colleagues at Megvii (Face++), uses two key operations to achieve efficiency: pointwise group convolutions and channel shuffling. ShuffleNet V1 (Zhang et al., CVPR 2018) demonstrated that group convolutions combined with channel shuffling could significantly reduce computational cost. In pointwise group convolution, the input channels are partitioned into groups, and convolutions are applied independently within each group. The channel shuffle operation then rearranges the output channels across groups, ensuring that downstream layers can access features from all groups.
ShuffleNet V2 (Ma et al., ECCV 2018) refined this approach based on four practical guidelines for efficient architecture design. The authors argued that FLOPs alone are an insufficient proxy for actual inference speed and proposed: (1) equal channel width minimizes memory access cost, (2) excessive group convolution increases memory access cost, (3) network fragmentation reduces parallelism, and (4) element-wise operations are not negligible. ShuffleNet V2 replaced group convolutions with channel split operations for better hardware efficiency.
| Model | Top-1 Accuracy (%) | FLOPs (M) | Parameters (M) |
|---|---|---|---|
| MobileNetV2 1.0 | 71.8 | 300 | 3.47 |
| ShuffleNetV2 1.5x | 72.6 | 299 | 3.5 |
| MobileNetV3-Large 1.0 | 75.2 | 217 | 5.4 |
| ShuffleNetV2 1.0x | 69.4 | 146 | 2.3 |
| MobileNetV1 0.50 | 63.7 | 149 | 1.3 |
At comparable FLOPs budgets (roughly 300M), ShuffleNetV2 1.5x achieves 72.6% top-1 accuracy, which is slightly better than MobileNetV2 1.0 (71.8%). ShuffleNet V2 tends to be faster than MobileNetV2 in actual inference on both GPU and ARM platforms because it follows hardware-friendly design principles. However, the differences are often small, and MobileNet models benefit from broader software ecosystem support in TensorFlow Lite and other deployment frameworks, plus more extensive hardware-specific optimizations.
The success of vision transformers (ViTs) in computer vision prompted researchers to explore hybrid architectures that combine the efficiency of MobileNet-style convolutions with the global context modeling of transformer self-attention.
MobileViT was introduced by Sachin Mehta and Mohammad Rastegari at Apple, published at ICLR 2022. MobileViT is a lightweight hybrid backbone that alternates MobileNetV2-style inverted residual blocks with transformer-based "MobileViT blocks." Each MobileViT block unfolds spatial features into a sequence of patches, applies multi-head self-attention to capture global relationships across the entire feature map, and then folds the features back into their spatial arrangement. This approach treats transformers as convolutions, combining local spatial processing (from convolutions) with global receptive fields (from self-attention).
| Model | Year | Top-1 Accuracy (%) | FLOPs (G) | Parameters (M) | Key Feature |
|---|---|---|---|---|---|
| MobileViT-XXS | 2022 | 69.0 | 0.4 | 1.3 | Smallest variant |
| MobileViT-XS | 2022 | 74.8 | 0.7 | 2.3 | Mid-range variant |
| MobileViT-S | 2022 | 78.4 | 2.0 | 5.6 | CNN + transformer hybrid |
| MobileViT v2 (1.0) | 2022 | 75.6 | - | ~3.0 | Separable self-attention, O(k) complexity |
| MobileViT v3-S | 2023 | ~79.3 | ~2.0 | ~6.0 | Improved fusion block |
MobileViT-S achieves 78.4% top-1 accuracy with about 5.6 million parameters, which is 3.2% more accurate than MobileNetV3-Large and 6.2% more accurate than DeiT for a similar parameter count. On the MS-COCO object detection benchmark, MobileViT is 5.7% more accurate than MobileNetV3 at a similar parameter budget.
MobileViT V2 (Mehta and Rastegari, 2022) replaces standard multi-head self-attention (which has O(k^2) complexity with respect to the number of tokens k) with separable self-attention, reducing the complexity to O(k). Each token attends to a single learnable latent token via element-wise operations rather than attending to every other token directly. MobileViT v2 achieves 75.6% top-1 accuracy with approximately 3 million parameters, outperforming MobileViT v1 by about 1 percentage point while running 3.2 times faster on a mobile device.
MobileViT v3 (Wadekar and Chaurasia, 2023) introduces improved fusion strategies for combining local, global, and input features. It replaces 3x3 convolutional layers with 1x1 convolutional layers in the fusion block and adds residual connections from the local representation block. The MobileViT v3 variants (XXS, XS, and S) achieve top-1 accuracies ranging from approximately 71% to 79% on ImageNet, boosting accuracy by roughly 2 percentage points over MobileViT v2 at comparable model sizes.
MobileNet models are among the most widely deployed neural networks in the world, running on over 4 billion devices through TensorFlow Lite (now LiteRT) and other on-device inference frameworks. Their compact size and low latency make them suitable for applications that require real-time inference without network connectivity.
Common on-device applications include:

1. Real-time image classification and object detection in camera and photo apps
2. Face detection and recognition
3. Semantic segmentation, such as portrait-mode background separation
4. Pose estimation for fitness and augmented reality applications
TensorFlow Lite (TFLite) is Google's open-source framework for running machine learning models on mobile, embedded, and IoT devices. MobileNet models have been first-class citizens in the TFLite ecosystem since its inception. TFLite provides optimized kernels for depthwise separable convolutions and supports various optimization techniques:

1. Post-training quantization (dynamic-range, full-integer, and float16)
2. Quantization-aware training via the TensorFlow Model Optimization Toolkit
3. Weight pruning and clustering
4. Hardware delegates that offload execution to GPUs, DSPs, and NPUs
In September 2024, Google rebranded TensorFlow Lite as LiteRT (Lite Runtime), reflecting the framework's expanded support for models authored in PyTorch, JAX, and Keras. The change is purely a rebrand; existing .tflite model files and conversion tools remain fully compatible with LiteRT.
Core ML is Apple's machine learning framework for iOS, iPadOS, macOS, watchOS, and tvOS. MobileNet models can be converted to the Core ML format (.mlmodel) using Apple's coremltools Python library. Core ML leverages the Apple Neural Engine (ANE) for hardware-accelerated inference, providing fast and energy-efficient execution. MobileViT, developed by Apple researchers, is natively supported in Core ML and optimized for Apple silicon.
Google's Edge TPU, available through the Coral hardware platform, is specifically designed to run quantized TFLite models such as MobileNet. In internal benchmarks, inference with the Edge TPU is 70 to 100 times faster than on a CPU for MobileNet-based models. Coral devices include the Dev Board (a single-board Linux computer with an Edge TPU), the USB Accelerator (which adds an Edge TPU to any Linux system, including Raspberry Pi), and the Coral SoM (System-on-Module) for production deployments.
Google has released MobileNet-EdgeTPU variants that are specifically optimized for Edge TPU hardware. For example, MobileNet-EdgeTPU at 1.0 width achieves 75.6% top-1 accuracy on ImageNet in 8-bit quantized mode with 990 million multiply-adds.
Beyond smartphones and dedicated accelerators, MobileNet models are deployed across a wide range of edge devices:

1. Single-board computers such as the Raspberry Pi
2. Drones and robotics platforms
3. Smart cameras and home-automation devices
4. Wearables and other battery-constrained hardware
MobileNet models are available in the ONNX (Open Neural Network Exchange) format, enabling deployment across frameworks and hardware platforms. ONNX Runtime provides optimized inference for MobileNet on a variety of hardware backends including CPUs, GPUs, and specialized accelerators from Intel, Qualcomm, and others. PyTorch Mobile provides another deployment path, with official MobileNetV3 implementations available in torchvision.
Deploying MobileNet on real hardware often involves additional optimization steps beyond the architecture itself.
Post-training quantization converts the 32-bit floating-point weights of a trained MobileNet model to 8-bit integers (INT8). This reduces the model size by roughly 4 times and speeds up inference on hardware that supports integer arithmetic. For MobileNetV3-Large, 8-bit quantization reduces the top-1 accuracy from 75.2% to 73.9%, a modest trade-off given the significant gains in size and speed. For MobileNetV3-Small, the drop is from 67.5% to 64.9%.
Quantization-aware training (QAT) simulates quantization effects during the training process, allowing the network to learn to compensate for the reduced precision. QAT typically recovers most of the accuracy lost through post-training quantization and is recommended for deployment scenarios where every fraction of a percentage point of accuracy matters.
Structured and unstructured pruning can further reduce MobileNet's parameter count by removing weights or entire channels that contribute least to the model's output. When combined with quantization, pruning can yield extremely compact models suitable for microcontroller-class devices.
The following table compares MobileNet models with other popular lightweight architectures on the ImageNet classification benchmark.
| Model | Top-1 (%) | FLOPs | Params (M) | Architecture Family |
|---|---|---|---|---|
| MobileNetV1 1.0 | 70.6 | 569M | 4.2 | MobileNet |
| MobileNetV2 1.0 | 71.8 | 300M | 3.47 | MobileNet |
| MobileNetV3-Large 1.0 | 75.2 | 217M | 5.4 | MobileNet |
| MNv4-Conv-M | 79.9 | 1.0G | 9.2 | MobileNet |
| MNv4-Hybrid-L (distilled) | 87.0 | 7.2G | 35.9 | MobileNet |
| ShuffleNetV2 1.0x | 69.4 | 146M | 2.3 | ShuffleNet |
| ShuffleNetV2 1.5x | 72.6 | 299M | 3.5 | ShuffleNet |
| EfficientNet-B0 | 77.1 | 390M | 5.3 | EfficientNet |
| EfficientNet-B1 | 79.8 | 700M | 7.8 | EfficientNet |
| MobileViT-S | 78.4 | 2.0G | 5.6 | MobileViT |
| NASNet-A Mobile | 74.0 | 564M | 5.3 | NASNet |
These numbers show that within the lightweight model space, accuracy and efficiency have improved dramatically over the span of seven years. MobileNetV4 with distillation achieves accuracy that was previously only possible with models orders of magnitude larger.
| Feature | MobileNetV1 | MobileNetV2 | MobileNetV3 | MobileNetV4 |
|---|---|---|---|---|
| Year | 2017 | 2018 | 2019 | 2024 |
| Core Block | Depthwise separable conv | Inverted residual with linear bottleneck | Inverted residual + SE + h-swish | Universal Inverted Bottleneck (UIB) |
| Residual Connections | No | Yes (narrow-to-narrow) | Yes (narrow-to-narrow) | Yes |
| Activation Function | ReLU | ReLU6 | h-swish (deep layers), ReLU (early layers) | Varies by block type |
| Attention Mechanism | None | None | Squeeze-and-Excitation | SE + Mobile MQA (hybrid variants) |
| Architecture Design | Manual | Manual | NAS + NetAdapt + manual | NAS with UIB + optimized recipe |
| Scaling Strategy | Width and resolution multipliers | Width multiplier | NAS-driven (Large and Small) | Conv and Hybrid model families |
| Hardware Target | General mobile | General mobile | Pixel phone-optimized | Universal (CPU, GPU, DSP, ANE, EdgeTPU) |
The MobileNet architecture has had a profound influence on the field of efficient deep learning:

1. Depthwise separable convolutions and inverted residuals became standard building blocks for efficient models; EfficientNet's MBConv block, for example, is the MobileNetV2 inverted residual.
2. The width and resolution multipliers established a simple, widely copied recipe for scaling a model family across latency budgets.
3. MobileNetV3's hardware-aware NAS methodology helped shift the field from optimizing FLOPs toward optimizing measured on-device latency.
MobileNet's open-source availability through TensorFlow and its integration into popular frameworks like PyTorch, Keras, and Hugging Face Transformers have made it one of the most widely used model families in production machine learning systems.