EfficientNet is a family of convolutional neural network architectures and a scaling method developed by Mingxing Tan and Quoc V. Le at Google Brain. Introduced in the 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," the work was presented at the International Conference on Machine Learning (ICML) in Long Beach, California. The central contribution of EfficientNet is its compound scaling method, which uniformly scales network depth, width, and input resolution using a single compound coefficient. This approach replaced the ad hoc scaling strategies that had dominated prior deep learning research and produced a family of models (EfficientNet-B0 through B7) that achieved state-of-the-art accuracy on ImageNet while using significantly fewer parameters and floating-point operations (FLOPs) than competing architectures. A follow-up paper in 2021 introduced EfficientNetV2, which further improved training speed through progressive learning and architectural refinements.
The original EfficientNet paper has become one of the most cited works in computer vision, accumulating over 21,000 citations. Its ideas have influenced the design of numerous subsequent architectures and remain widely used in both research and production systems.
Before EfficientNet, scaling up convolutional neural networks to improve accuracy was a common but inconsistent practice. Researchers would typically increase one of three dimensions of a network: depth (number of layers), width (number of channels per layer), or input resolution (the pixel dimensions of the input image). For example, ResNet demonstrated the benefits of increasing depth from 18 to 152 layers. Wide ResNet showed that wider networks could outperform very deep but narrow ones. Other work used higher-resolution inputs to capture finer details in images.
The problem with these single-dimension scaling approaches is that they quickly reach diminishing returns. Making a network deeper without correspondingly increasing its width or resolution leads to vanishing gradients and limited accuracy gains. Similarly, increasing resolution alone adds computational cost without proportionally improving performance. Practitioners had no principled way to decide how much to scale each dimension, and most scaling decisions were made by trial and error.
Tan and Le set out to answer a fundamental question: is there a principled method to scale up convolutional neural networks that balances all three dimensions simultaneously? Their answer was the compound scaling method.
The compound scaling method rests on a straightforward observation: network depth, width, and resolution are interdependent, and scaling them together in a balanced fashion yields better results than scaling any single dimension alone. Intuitively, a higher-resolution image requires more layers (greater depth) to capture larger receptive fields and more channels (greater width) to capture finer-grained patterns at the increased resolution.
The method defines three scaling coefficients that control how each dimension grows with the compound coefficient φ:

- depth: d = α^φ
- width: w = β^φ
- resolution: r = γ^φ

Here, α, β, and γ are constants determined by a small grid search on the baseline network, and φ (phi) is a user-specified compound coefficient that controls the total computational budget. The key constraint is:
α · β² · γ² ≈ 2
This constraint ensures that for each increment of φ, the total FLOPs roughly double: the FLOPs of a convolutional network scale proportionally with d, w², and r², which explains the squared terms for β and γ in the constraint.
For EfficientNet, the authors found the optimal base coefficients through a grid search on the B0 model: α = 1.2, β = 1.1, and γ = 1.15, which satisfy α · β² · γ² ≈ 2. With these values fixed, different values of φ produce the full EfficientNet family from B0 to B7.
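The arithmetic behind compound scaling is easy to sketch. The following snippet (an illustrative sketch, not the official implementation; function names are mine) applies the paper's base coefficients α = 1.2, β = 1.1, γ = 1.15 and checks that each increment of φ roughly doubles the FLOPs:

```python
# Base coefficients found by grid search in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    depth = ALPHA ** phi        # layer-count multiplier
    width = BETA ** phi         # channel-count multiplier
    resolution = GAMMA ** phi   # input-size multiplier
    return depth, width, resolution

def flops_multiplier(phi):
    """FLOPs grow with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

# alpha * beta^2 * gamma^2 ~= 1.92, so each increment of phi roughly
# doubles the compute budget:
print(flops_multiplier(1))  # ~1.92
print(flops_multiplier(2))  # ~3.69
```

Because the exponents multiply out, the FLOP multiplier at compound coefficient φ is simply (α · β² · γ²)^φ ≈ 2^φ.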
The paper includes ablation studies showing that scaling only one dimension (depth only, width only, or resolution only) provides diminishing accuracy gains beyond a certain point. For instance, scaling width alone with depth fixed at d = 1.0 and resolution fixed at r = 1.0 quickly saturates in accuracy. However, when all three dimensions are scaled together, the same FLOP budget yields substantially higher accuracy. The compound scaling method captures the intuition that a larger input image needs both more layers to increase the receptive field and more channels to capture the additional fine-grained detail.
Rather than designing the baseline architecture by hand, Tan and Le used neural architecture search (NAS) to discover EfficientNet-B0. The search procedure was adapted from MnasNet, an earlier NAS method also developed at Google.
The search space consisted of mobile inverted bottleneck convolution (MBConv) blocks with varying kernel sizes, expansion ratios, and numbers of layers per stage. The architecture was divided into multiple stages, and the search algorithm could select different block configurations for each stage.
The optimization objective was multi-objective, balancing accuracy and computational efficiency:
ACC(m) × [FLOPS(m) / T]^w
where ACC(m) is the accuracy of model m, FLOPS(m) is the model's floating-point operations, T is a target FLOP count, and w = -0.07 controls the trade-off between accuracy and efficiency. Unlike MnasNet, which optimized for inference latency on specific hardware, EfficientNet optimized for FLOPs to keep the architecture hardware-agnostic.
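As a sketch, the reward computation reduces to a one-liner (the function name is illustrative; T and w follow the values given above):

```python
# MnasNet-style multi-objective reward used in the EfficientNet search:
# ACC(m) * (FLOPS(m) / T) ** w
TARGET_FLOPS = 400e6  # T: ~400M FLOPs search target
W = -0.07             # trade-off exponent between accuracy and cost

def search_reward(accuracy, flops):
    """Score a candidate model: accuracy, discounted for exceeding T."""
    return accuracy * (flops / TARGET_FLOPS) ** W

# A model exactly at the target budget is scored by raw accuracy alone;
# doubling the FLOPs shaves ~4.7% off the reward (2**-0.07 ~ 0.953), so
# a small accuracy gain cannot justify a large cost increase.
print(search_reward(0.77, 400e6))  # 0.77
print(search_reward(0.78, 800e6))  # 0.78 * 2**-0.07 ~ 0.743
```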
The search used a reinforcement learning-based controller (an RNN) that sampled candidate architectures from the search space. Each candidate was trained on ImageNet, and its accuracy and FLOP count were measured. The controller was then updated to produce architectures that better balance accuracy and efficiency. The target FLOP budget was approximately 400 million FLOPs, which is slightly larger than MnasNet's target.
The resulting architecture, EfficientNet-B0, served as the foundation for the entire EfficientNet family. All larger models (B1 through B7) were derived from B0 by applying the compound scaling formula with increasing values of φ.
EfficientNet-B0 is organized into nine stages. The first stage is a standard 3x3 convolution, stages 2 through 8 use MBConv blocks with varying configurations, and the final stage consists of a 1x1 convolution followed by global average pooling and a fully connected classification layer.
| Stage | Operator | Resolution | Channels | Layers | Stride |
|---|---|---|---|---|---|
| 1 | Conv 3x3 | 224 x 224 | 32 | 1 | 2 |
| 2 | MBConv1, k3x3 | 112 x 112 | 16 | 1 | 1 |
| 3 | MBConv6, k3x3 | 112 x 112 | 24 | 2 | 2 |
| 4 | MBConv6, k5x5 | 56 x 56 | 40 | 2 | 2 |
| 5 | MBConv6, k3x3 | 28 x 28 | 80 | 3 | 2 |
| 6 | MBConv6, k5x5 | 14 x 14 | 112 | 3 | 1 |
| 7 | MBConv6, k5x5 | 14 x 14 | 192 | 4 | 2 |
| 8 | MBConv6, k3x3 | 7 x 7 | 320 | 1 | 1 |
| 9 | Conv 1x1, Pooling, FC | 7 x 7 | 1280 | 1 | - |
In this table, "MBConv1" refers to a mobile inverted bottleneck block with an expansion ratio of 1 (no expansion), and "MBConv6" refers to blocks with an expansion ratio of 6. The notation "k3x3" and "k5x5" indicates the kernel size of the depthwise convolution within each block.
The B0 architecture accepts 224 x 224 input images and contains 5.3 million parameters with 0.39 billion FLOPs. Despite its compact size, it achieves 77.1% top-1 accuracy on ImageNet, which already surpasses ResNet-50 (76.0% top-1) while using roughly five times fewer parameters.
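The stride column determines the resolution column: each stage applies its stride once at its first layer, and the remaining layers in the stage use stride 1. A small sketch (stage tuples transcribed from the table above; not official code) reproduces the downsampling schedule:

```python
# (operator, out_channels, layers, stride) for B0 stages 1-8,
# transcribed from the architecture table.
B0_STAGES = [
    ("Conv3x3",       32, 1, 2),
    ("MBConv1_k3x3",  16, 1, 1),
    ("MBConv6_k3x3",  24, 2, 2),
    ("MBConv6_k5x5",  40, 2, 2),
    ("MBConv6_k3x3",  80, 3, 2),
    ("MBConv6_k5x5", 112, 3, 1),
    ("MBConv6_k5x5", 192, 4, 2),
    ("MBConv6_k3x3", 320, 1, 1),
]

def stage_resolutions(input_size=224):
    """Return the input resolution seen by each stage."""
    sizes, size = [], input_size
    for _name, _channels, _layers, stride in B0_STAGES:
        sizes.append(size)     # resolution entering this stage
        size = size // stride  # downsampling applied by the stage
    return sizes

print(stage_resolutions())  # [224, 112, 112, 56, 28, 14, 14, 7]
```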
The Mobile Inverted Bottleneck Convolution (MBConv) block is the core building block of EfficientNet. It was originally introduced in MobileNetV2 and later refined for use in MnasNet and EfficientNet.
A standard residual bottleneck block (as in ResNet) uses a "wide-narrow-wide" pattern: it compresses the input through a bottleneck and then expands it back. The MBConv block inverts this pattern, using a "narrow-wide-narrow" structure:

1. A 1x1 convolution expands the number of channels by the expansion ratio (e.g., 6x for MBConv6).
2. A depthwise convolution (3x3 or 5x5) filters each expanded channel independently.
3. A 1x1 convolution projects the expanded channels back down to the output width.
When the input and output dimensions match, a skip (residual) connection adds the input directly to the output, similar to ResNet.
Each MBConv block in EfficientNet includes a squeeze-and-excitation (SE) module, which was introduced by Hu et al. in 2018. The SE module performs channel-wise attention in two steps:

1. **Squeeze:** global average pooling reduces each channel's feature map to a single scalar, producing one value per channel.
2. **Excitation:** two small fully connected layers (a dimensionality reduction followed by an expansion, ending in a sigmoid) transform this vector into a weight between 0 and 1 for each channel.
These weights are then multiplied element-wise with the original feature map, allowing the network to emphasize informative channels and suppress less useful ones. The SE module adds minimal computational overhead but consistently improves accuracy.
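A minimal NumPy sketch of the squeeze-and-excitation computation, assuming a (channels, height, width) feature map and illustrative random weight matrices `w1` and `w2` (a real SE module learns these during training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(features, w1, w2):
    """Reweight channels: squeeze (global pool), then excite (2 FC layers)."""
    squeezed = features.mean(axis=(1, 2))     # squeeze: (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # reduce + ReLU: (C // r,)
    weights = sigmoid(w2 @ hidden)            # expand + sigmoid: (C,)
    return features * weights[:, None, None]  # channel-wise rescale

rng = np.random.default_rng(0)
C, r = 16, 4  # channels and reduction ratio (illustrative values)
x = rng.standard_normal((C, 14, 14))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = squeeze_excite(x, w1, w2)
print(out.shape)  # (16, 14, 14): same shape, channels rescaled by (0, 1)
```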
EfficientNet uses the Swish activation function (also known as SiLU, or Sigmoid Linear Unit), defined as f(x) = x * sigmoid(x). Swish was discovered through automated search by Ramachandran et al. (2017) and tends to outperform ReLU in deep networks because it is smooth and non-monotonic, allowing small negative values to pass through.
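The definition is short enough to state directly (a sketch on scalars; frameworks apply it elementwise to tensors):

```python
import math

def swish(x):
    """Swish / SiLU: x * sigmoid(x), written as x / (1 + e^-x)."""
    return x / (1.0 + math.exp(-x))

print(swish(0.0))   # 0.0
print(swish(5.0))   # ~4.97: approaches the identity for large positive x
print(swish(-5.0))  # ~-0.03: small negative values pass through,
                    # unlike ReLU, which zeroes all negative inputs
```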
Once the B0 baseline architecture and the compound scaling coefficients (α, β, γ) are fixed, the family of models is generated by increasing the compound coefficient φ. Each increment of φ roughly doubles the computational budget.
| Model | Compound Coefficient (φ) | Input Resolution | Parameters | FLOPs | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|---|---|---|
| EfficientNet-B0 | 1.0 | 224 x 224 | 5.3M | 0.39B | 77.1% | 93.3% |
| EfficientNet-B1 | 1.1 | 240 x 240 | 7.8M | 0.70B | 79.1% | 94.4% |
| EfficientNet-B2 | 1.2 | 260 x 260 | 9.2M | 1.0B | 80.1% | 94.9% |
| EfficientNet-B3 | 1.3 | 300 x 300 | 12M | 1.8B | 81.6% | 95.7% |
| EfficientNet-B4 | 1.4 | 380 x 380 | 19M | 4.2B | 82.9% | 96.4% |
| EfficientNet-B5 | 1.6 | 456 x 456 | 30M | 9.9B | 83.6% | 96.7% |
| EfficientNet-B6 | 1.8 | 528 x 528 | 43M | 19B | 84.0% | 96.8% |
| EfficientNet-B7 | 2.0 | 600 x 600 | 66M | 37B | 84.3% | 97.0% |
Several trends stand out from this progression. First, each step up in φ increases the input resolution, depth, and width simultaneously, which is what gives the compound scaling method its advantage. Second, the accuracy improvements are roughly logarithmic with respect to FLOPs: going from B0 to B3 adds about 4.5 percentage points of top-1 accuracy at the cost of roughly 4.6x more FLOPs, while going from B3 to B7 adds another 2.7 percentage points but requires about 20x more FLOPs. Third, even the largest model (B7) remains substantially smaller than many competing architectures that achieve similar accuracy.
The efficiency gains of EfficientNet become particularly clear when compared with other popular convolutional neural network architectures of the same era.
| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|
| ResNet-50 | 26M | 4.1B | 76.0% |
| DenseNet-169 | 14M | 3.4B | 76.2% |
| Inception-v3 | 24M | 5.7B | 78.8% |
| NASNet-A (Large) | 89M | 24B | 82.7% |
| GPipe (AmoebaNet) | 557M | - | 84.3% |
| EfficientNet-B0 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B4 | 19M | 4.2B | 82.9% |
| EfficientNet-B7 | 66M | 37B | 84.3% |
Key observations from these comparisons:

- EfficientNet-B0 surpasses ResNet-50 (77.1% vs 76.0% top-1) with roughly one-fifth the parameters and one-tenth the FLOPs.
- EfficientNet-B4 uses about the same FLOPs as ResNet-50 (4.2B vs 4.1B) yet scores 6.9 percentage points higher in top-1 accuracy.
- EfficientNet-B7 matches GPipe's 84.3% top-1 accuracy with 8.4x fewer parameters (66M vs 557M).
The compound scaling method also demonstrated its generality by improving existing architectures. When applied to MobileNet, it added 1.4 percentage points of ImageNet accuracy. Applied to ResNet, it added 0.7 percentage points.
In 2021, Tan and Le published a follow-up paper titled "EfficientNetV2: Smaller Models and Faster Training," presented at ICML 2021. While the original EfficientNet focused on inference efficiency (fewer parameters and FLOPs), EfficientNetV2 addressed a critical practical concern: training speed.
Profiling the original EfficientNet models revealed several training bottlenecks:

1. Training with very large image sizes is slow and memory-intensive, forcing smaller batch sizes.
2. Depthwise convolutions are slow in the early layers, where they underutilize modern accelerators.
3. Scaling up every stage equally, as compound scaling does, is suboptimal.
A key architectural change in EfficientNetV2 is the introduction of Fused-MBConv blocks, which replace the separate depthwise convolution and 1x1 expansion convolution in a standard MBConv block with a single standard 3x3 (or 5x5) convolution. This fused operation has more parameters and FLOPs than a depthwise separable equivalent, but it runs significantly faster on modern hardware because standard convolutions have higher arithmetic intensity and are better optimized in GPU libraries like cuDNN.
The key design insight is that Fused-MBConv should only be used in the early stages of the network (stages 1 through 3), where feature maps are large and the hardware utilization benefits are greatest. Replacing all MBConv blocks with Fused-MBConv throughout the network would increase parameters and FLOPs excessively while actually slowing down training. Later stages continue to use standard MBConv blocks.
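A back-of-envelope sketch shows the parameter trade-off. The helpers below (illustrative names; biases, SE, and the 1x1 projection are omitted because they are identical in both designs) count only the weights in the expansion path:

```python
def mbconv_expand_params(c, e, k=3):
    """1x1 expansion conv followed by a KxK depthwise conv (C -> C*E)."""
    expand = c * (c * e)          # 1x1 conv: C -> C*E channels
    depthwise = (c * e) * k * k   # KxK depthwise on C*E channels
    return expand + depthwise

def fused_mbconv_expand_params(c, e, k=3):
    """Single dense KxK conv doing the same C -> C*E expansion."""
    return c * (c * e) * k * k

# For 24 input channels and expansion ratio 4 the fused version carries
# several times the weights (approaching k^2 = 9x as C grows), but dense
# convolutions have higher arithmetic intensity and are far better
# optimized in GPU libraries, so they train faster on large feature maps.
print(mbconv_expand_params(24, 4))        # 3168
print(fused_mbconv_expand_params(24, 4))  # 20736
```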
EfficientNetV2 introduced an adaptive progressive learning strategy. The idea of training with progressively increasing image sizes was not new, but previous implementations had a drawback: they applied the same regularization strength regardless of image size, which led to accuracy drops.
The V2 approach adaptively adjusts regularization strength alongside image size during training. Smaller images require weaker regularization because the network has less capacity to overfit at lower resolutions. As the image size increases throughout training, regularization is correspondingly strengthened.
For EfficientNetV2-M, the training schedule (approximately 350 epochs on ImageNet, divided into four stages of about 87 epochs each) works as follows:
| Training Stage | Image Size | RandAugment Magnitude | Mixup Alpha | Dropout Rate |
|---|---|---|---|---|
| Stage 1 | 128 | 5 | 0.0 | 0.1 |
| Stage 2 | 212 | 10 | 0.1 | 0.2 |
| Stage 3 | 296 | 15 | 0.15 | 0.3 |
| Stage 4 | 380 | 20 | 0.2 | 0.4 |
This progressive schedule dramatically reduces training time while maintaining or improving final accuracy.
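Most of the schedule can be generated by linearly interpolating between the first- and last-stage endpoints. The sketch below (illustrative names, and only for the image-size, RandAugment, and dropout columns, which follow a linear ramp in the table) shows the idea:

```python
N_STAGES = 4  # ~350 ImageNet epochs split into four stages

def lerp(start, end, stage):
    """Linearly interpolate a setting over stages 1..N_STAGES."""
    t = (stage - 1) / (N_STAGES - 1)
    return start + t * (end - start)

def stage_settings(stage):
    """Easy-to-hard settings: bigger images get stronger regularization."""
    return {
        "image_size": round(lerp(128, 380, stage)),
        "randaugment_magnitude": round(lerp(5, 20, stage)),
        "dropout_rate": round(lerp(0.1, 0.4, stage), 2),
    }

for s in range(1, N_STAGES + 1):
    print(s, stage_settings(s))
# stage 2 -> image_size 212, magnitude 10, dropout 0.2, as in the table
```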
Beyond Fused-MBConv, EfficientNetV2 incorporated several other architectural changes discovered through training-aware NAS:

- Smaller expansion ratios for MBConv blocks, which reduce memory access overhead.
- A preference for smaller 3x3 kernels, compensated by adding more layers to recover the receptive field.
- Removal of the last stride-1 stage present in the original EfficientNet, cutting parameters and memory use.
| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) | Training Time (32 TPUv3 cores) |
|---|---|---|---|---|
| EfficientNetV2-S | 22M | 8.8B | 83.9% | ~30h |
| EfficientNetV2-M | 54M | 24B | 85.1% | ~50h |
| EfficientNetV2-L | 120M | 53B | 85.7% | ~80h |
For comparison, EfficientNet-B7 (V1) achieves 84.3% accuracy but requires approximately 139 hours of training on the same hardware. EfficientNetV2-M exceeds B7's accuracy while training roughly 3x faster in this comparison and using fewer parameters; across its full set of comparisons, the paper reports training speedups of up to 11x.
When pretrained on ImageNet-21k (a larger dataset with 21,000 classes and approximately 14 million images), the V2 models achieve even higher accuracy: EfficientNetV2-L reaches 86.8% top-1 accuracy on ImageNet, and the larger EfficientNetV2-XL variant reaches 87.3%.
The EfficientNetV2-XL result outperformed the Vision Transformer (ViT-L/16) by 2.0 percentage points while training 5x to 11x faster using the same computing resources.
In late 2019, researchers at Google Brain published "Self-training with Noisy Student improves ImageNet classification" (Xie et al., CVPR 2020), which used EfficientNet as its backbone. This semi-supervised learning approach pushed EfficientNet's performance well beyond its supervised training results.
The Noisy Student method works as follows:

1. Train a teacher model (an EfficientNet) on labeled ImageNet.
2. Use the teacher to generate pseudo-labels for a large corpus of unlabeled images (300 million images from Google's internal JFT dataset).
3. Train an equal-sized or larger student model on the combined labeled and pseudo-labeled data, injecting noise into the student via dropout, stochastic depth, and RandAugment data augmentation.
4. Make the student the new teacher and repeat.
Using an EfficientNet-L2 architecture (a scaled-up variant larger than B7), Noisy Student training achieved 88.4% top-1 accuracy on ImageNet, a major improvement over the supervised-only result of 84.3% for B7. The approach also significantly improved robustness: top-1 accuracy on ImageNet-A (a dataset of adversarially filtered natural images) jumped from 61.0% to 83.7%.
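The teacher-student loop can be illustrated on toy data. The sketch below substitutes a nearest-centroid classifier and synthetic Gaussian blobs for EfficientNet and ImageNet/JFT; it demonstrates only the control flow of self-training, not the method at scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(x, y, n_classes=2):
    """'Train' a model: one centroid per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, x):
    """Classify each point by its nearest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Small labeled set, large unlabeled set, drawn from two 2-D blobs.
labeled_x = np.vstack([rng.normal(-2, 1, (5, 2)), rng.normal(2, 1, (5, 2))])
labeled_y = np.array([0] * 5 + [1] * 5)
unlabeled_x = np.vstack([rng.normal(-2, 1, (200, 2)),
                         rng.normal(2, 1, (200, 2))])

teacher = fit_centroids(labeled_x, labeled_y)
for _ in range(3):  # each student becomes the next teacher
    pseudo_y = predict(teacher, unlabeled_x)   # teacher pseudo-labels
    noisy_x = unlabeled_x + rng.normal(0, 0.3, unlabeled_x.shape)  # noise
    all_x = np.vstack([labeled_x, noisy_x])
    all_y = np.concatenate([labeled_y, pseudo_y])
    teacher = fit_centroids(all_x, all_y)      # train the "student"
```

In the real method the student is noised (and often enlarged) while the teacher's pseudo-labels are generated without noise, which is what pushes the student beyond the teacher.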
EfficientNet has become one of the most popular backbone networks for transfer learning across a wide range of computer vision tasks. Its combination of high accuracy and low computational cost makes it particularly attractive for applications where resources are constrained.
Pretrained EfficientNet models are widely used as feature extractors for custom image classification tasks. The standard approach involves replacing the final fully connected layer with a new classifier head and fine-tuning the network on a domain-specific dataset. EfficientNet models pretrained on ImageNet transfer well to medical imaging, satellite imagery, wildlife monitoring, and many other domains.
The B0 through B4 variants are especially popular for transfer learning because they offer strong accuracy at manageable computational costs. For resource-constrained settings such as mobile or edge deployment, B0 and B1 provide an excellent accuracy-to-cost ratio.
Mingxing Tan, Ruoming Pang, and Quoc Le extended the compound scaling concept to object detection with EfficientDet (Tan et al., CVPR 2020). EfficientDet uses EfficientNet as its backbone and introduces two additional innovations:

1. A weighted bi-directional feature pyramid network (BiFPN) that fuses multi-scale features through repeated top-down and bottom-up connections, learning a weight for each input feature.
2. A compound scaling rule for detection that jointly scales the input resolution, the backbone, the BiFPN, and the box/class prediction networks.
EfficientDet achieved state-of-the-art results on COCO object detection while using 28x fewer FLOPs than YOLOv3, 30x fewer FLOPs than RetinaNet, and 19x fewer FLOPs than NAS-FPN with a ResNet backbone.
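BiFPN's per-input weighting uses the EfficientDet paper's "fast normalized fusion": each input feature map gets a learnable weight, kept non-negative via ReLU and normalized by the sum of all weights. A NumPy sketch (illustrative function name; real weights are learned):

```python
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """O = sum_i(w_i * I_i) / (eps + sum_j w_j), with w_i >= 0 via ReLU."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU
    w = w / (eps + w.sum())                                # normalize
    return sum(wi * x for wi, x in zip(w, inputs))

a = np.full((4, 4), 1.0)  # feature map from one pyramid level
b = np.full((4, 4), 3.0)  # feature map from another level
fused = fast_normalized_fusion([a, b], [1.0, 1.0])
print(fused[0, 0])  # ~2.0: equal weights simply average the two inputs
```

The paper chose this form over a softmax over the weights because it behaves almost identically while avoiding the exponentials, which made it noticeably faster on GPU.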
EfficientNet backbones have been integrated into numerous segmentation frameworks, including U-Net and DeepLab variants. They have also been used for tasks such as action recognition, face recognition, medical image analysis, and generative models. The architecture's modularity and availability in major frameworks (TensorFlow, PyTorch, Keras) have contributed to its widespread adoption.
EfficientNet is available in all major deep learning frameworks:

- **Keras/TensorFlow:** `tf.keras.applications` ships pretrained weights for B0 through B7 and V2-S, V2-M, and V2-L.
- **PyTorch:** `torchvision.models` and the popular `timm` (PyTorch Image Models) library by Ross Wightman provide EfficientNet implementations with pretrained weights.
- **Official releases:** Google also published the official TensorFlow implementation and pretrained checkpoints through the TensorFlow Model Garden and the `automl` repository on GitHub.
The following table provides a consolidated overview of all major EfficientNet variants, including both V1 and V2 families.
| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|---|
| EfficientNet-B0 | 224 x 224 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B1 | 240 x 240 | 7.8M | 0.70B | 79.1% |
| EfficientNet-B2 | 260 x 260 | 9.2M | 1.0B | 80.1% |
| EfficientNet-B3 | 300 x 300 | 12M | 1.8B | 81.6% |
| EfficientNet-B4 | 380 x 380 | 19M | 4.2B | 82.9% |
| EfficientNet-B5 | 456 x 456 | 30M | 9.9B | 83.6% |
| EfficientNet-B6 | 528 x 528 | 43M | 19B | 84.0% |
| EfficientNet-B7 | 600 x 600 | 66M | 37B | 84.3% |
| EfficientNetV2-S | 384 x 384 | 22M | 8.8B | 83.9% |
| EfficientNetV2-M | 480 x 480 | 54M | 24B | 85.1% |
| EfficientNetV2-L | 480 x 480 | 120M | 53B | 85.7% |
EfficientNetV2-S achieves nearly the same accuracy as EfficientNet-B7 with roughly one-third the parameters and one-quarter the FLOPs, illustrating the substantial improvements in the V2 design.
EfficientNet's influence on the deep learning field extends well beyond its ImageNet benchmarks. The compound scaling method introduced a principled framework for thinking about model scaling that has been adopted and adapted by subsequent work in natural language processing, speech recognition, and other domains.
The architecture also played a key role in demonstrating the power of neural architecture search for producing practical, deployable models. While earlier NAS results like NASNet and AmoebaNet were often too large for practical use, EfficientNet showed that NAS could produce compact, efficient architectures that outperformed hand-designed models across a range of computational budgets.
In the context of the broader shift toward transformer-based architectures in computer vision, EfficientNet remains relevant in several ways. Many practitioners continue to prefer convolutional architectures for smaller datasets, mobile deployment, and real-time applications. Hybrid architectures such as CoAtNet combine convolutional stages (including MBConv blocks) with transformer stages, drawing directly on EfficientNet's design principles. The compound scaling idea has also been applied to transformer-based models.
As of 2025, EfficientNet models remain among the most downloaded and widely used pretrained models in the PyTorch and TensorFlow ecosystems, particularly for transfer learning applications where their combination of accuracy, efficiency, and ease of use continues to offer practical value.