EfficientNet is a family of convolutional neural network architectures and a scaling method developed by Mingxing Tan and Quoc V. Le at Google Brain. Introduced in the 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," the work was presented at the International Conference on Machine Learning (ICML) in Long Beach, California. The central contribution of EfficientNet is its compound scaling method, which uniformly scales network depth, width, and input resolution using a single compound coefficient. This approach replaced the ad hoc scaling strategies that had dominated prior deep learning research and produced a family of models (EfficientNet-B0 through B7) that achieved state-of-the-art accuracy on ImageNet while using significantly fewer parameters and floating-point operations (FLOPs) than competing architectures. A follow-up paper in 2021 introduced EfficientNetV2, which further improved training speed through progressive learning and architectural refinements.
The original EfficientNet paper has become one of the most cited works in computer vision, accumulating over 21,000 citations. Its ideas have influenced the design of numerous subsequent architectures and remain widely used in both research and production systems.
Before EfficientNet, scaling up convolutional neural networks to improve accuracy was a common but inconsistent practice. Researchers would typically increase one of three dimensions of a network: depth (number of layers), width (number of channels per layer), or input resolution (the pixel dimensions of the input image). For example, ResNet demonstrated the benefits of increasing depth from 18 to 152 layers. Wide ResNet showed that wider networks could outperform very deep but narrow ones. Other work used higher-resolution inputs to capture finer details in images.
The problem with these single-dimension scaling approaches is that they quickly reach diminishing returns. Making a network deeper without correspondingly increasing its width or resolution leads to vanishing gradients and limited accuracy gains. Similarly, increasing resolution alone adds computational cost without proportionally improving performance. Practitioners had no principled way to decide how much to scale each dimension, and most scaling decisions were made by trial and error.
Tan and Le set out to answer a fundamental question: is there a principled method to scale up convolutional neural networks that balances all three dimensions simultaneously? Their answer was the compound scaling method.
The compound scaling method rests on a straightforward observation: network depth, width, and resolution are interdependent, and scaling them together in a balanced fashion yields better results than scaling any single dimension alone. Intuitively, a higher-resolution image requires more layers (greater depth) to capture larger receptive fields and more channels (greater width) to capture finer-grained patterns at the increased resolution.
The method defines three scaling coefficients that control how each dimension grows with the compound coefficient φ:

- depth: d = α^φ
- width: w = β^φ
- resolution: r = γ^φ

Here, α, β, and γ are constants determined by a small grid search on the baseline network, and φ (phi) is a user-specified compound coefficient that controls the total computational budget. The key constraint is:
α · β² · γ² ≈ 2
This constraint ensures that for each increment of φ, the total FLOPs roughly double: the FLOPs of a convolutional network scale proportionally with d, w², and r², which explains the squared terms for β and γ in the constraint.
For EfficientNet, the authors found the optimal base coefficients through a grid search on the B0 model: α = 1.2, β = 1.1, and γ = 1.15, which satisfy α · β² · γ² ≈ 2. With these values fixed, different values of φ produce the full EfficientNet family from B0 to B7.
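The arithmetic behind compound scaling is easy to sketch. The following snippet (an illustrative sketch, not the official implementation; function names are mine) applies the paper's base coefficients α = 1.2, β = 1.1, γ = 1.15 and checks that each increment of φ roughly doubles the FLOPs:

```python
# Base coefficients found by grid search in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    depth = ALPHA ** phi        # layer-count multiplier
    width = BETA ** phi         # channel-count multiplier
    resolution = GAMMA ** phi   # input-size multiplier
    return depth, width, resolution

def flops_multiplier(phi):
    """FLOPs grow with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

# alpha * beta^2 * gamma^2 ~= 1.92, so each increment of phi roughly
# doubles the compute budget:
print(flops_multiplier(1))  # ~1.92
print(flops_multiplier(2))  # ~3.69
```

Because the exponents multiply out, the FLOP multiplier at compound coefficient φ is simply (α · β² · γ²)^φ ≈ 2^φ.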
The paper includes ablation studies showing that scaling only one dimension (depth only, width only, or resolution only) provides diminishing accuracy gains beyond a certain point. For instance, scaling width alone with depth fixed at d = 1.0 and resolution fixed at r = 1.0 quickly saturates in accuracy. However, when all three dimensions are scaled together, the same FLOP budget yields substantially higher accuracy. The compound scaling method captures the intuition that a larger input image needs both more layers to increase the receptive field and more channels to capture the additional fine-grained detail.
Rather than designing the baseline architecture by hand, Tan and Le used neural architecture search (NAS) to discover EfficientNet-B0. The search procedure was adapted from MnasNet, an earlier NAS method also developed at Google.
The search space consisted of mobile inverted bottleneck convolution (MBConv) blocks with varying kernel sizes, expansion ratios, and numbers of layers per stage. The architecture was divided into multiple stages, and the search algorithm could select different block configurations for each stage.
The optimization objective was multi-objective, balancing accuracy and computational efficiency:
ACC(m) × [FLOPS(m) / T]^w
where ACC(m) is the accuracy of model m, FLOPS(m) is the model's floating-point operations, T is a target FLOP count, and w = -0.07 controls the trade-off between accuracy and efficiency. Unlike MnasNet, which optimized for inference latency on specific hardware, EfficientNet optimized for FLOPs to keep the architecture hardware-agnostic.
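As a sketch, the reward computation reduces to a one-liner (the function name is illustrative; T and w follow the values given above):

```python
# MnasNet-style multi-objective reward used in the EfficientNet search:
# ACC(m) * (FLOPS(m) / T) ** w
TARGET_FLOPS = 400e6  # T: ~400M FLOPs search target
W = -0.07             # trade-off exponent between accuracy and cost

def search_reward(accuracy, flops):
    """Score a candidate model: accuracy, discounted for exceeding T."""
    return accuracy * (flops / TARGET_FLOPS) ** W

# A model exactly at the target budget is scored by raw accuracy alone;
# doubling the FLOPs shaves ~4.7% off the reward (2**-0.07 ~ 0.953), so
# a small accuracy gain cannot justify a large cost increase.
print(search_reward(0.77, 400e6))  # 0.77
print(search_reward(0.78, 800e6))  # 0.78 * 2**-0.07 ~ 0.743
```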
The search used a reinforcement learning-based controller (an RNN) that sampled candidate architectures from the search space. Each candidate was trained on ImageNet, and its accuracy and FLOP count were measured. The controller was then updated to produce architectures that better balance accuracy and efficiency. The target FLOP budget was approximately 400 million FLOPs, which is slightly larger than MnasNet's target.
The resulting architecture, EfficientNet-B0, served as the foundation for the entire EfficientNet family. All larger models (B1 through B7) were derived from B0 by applying the compound scaling formula with increasing values of φ.
EfficientNet-B0 is organized into nine stages. The first stage is a standard 3x3 convolution, stages 2 through 8 use MBConv blocks with varying configurations, and the final stage consists of a 1x1 convolution followed by global average pooling and a fully connected classification layer.
| Stage | Operator | Resolution | Channels | Layers | Stride |
|---|---|---|---|---|---|
| 1 | Conv 3x3 | 224 x 224 | 32 | 1 | 2 |
| 2 | MBConv1, k3x3 | 112 x 112 | 16 | 1 | 1 |
| 3 | MBConv6, k3x3 | 112 x 112 | 24 | 2 | 2 |
| 4 | MBConv6, k5x5 | 56 x 56 | 40 | 2 | 2 |
| 5 | MBConv6, k3x3 | 28 x 28 | 80 | 3 | 2 |
| 6 | MBConv6, k5x5 | 14 x 14 | 112 | 3 | 1 |
| 7 | MBConv6, k5x5 | 14 x 14 | 192 | 4 | 2 |
| 8 | MBConv6, k3x3 | 7 x 7 | 320 | 1 | 1 |
| 9 | Conv 1x1, Pooling, FC | 7 x 7 | 1280 | 1 | - |
In this table, "MBConv1" refers to a mobile inverted bottleneck block with an expansion ratio of 1 (no expansion), and "MBConv6" refers to blocks with an expansion ratio of 6. The notation "k3x3" and "k5x5" indicates the kernel size of the depthwise convolution within each block.
The B0 architecture accepts 224 x 224 input images and contains 5.3 million parameters with 0.39 billion FLOPs. Despite its compact size, it achieves 77.1% top-1 accuracy on ImageNet, which already surpasses ResNet-50 (76.0% top-1) while using roughly five times fewer parameters.
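The stride column determines the resolution column: each stage applies its stride once at its first layer, and the remaining layers in the stage use stride 1. A small sketch (stage tuples transcribed from the table above; not official code) reproduces the downsampling schedule:

```python
# (operator, out_channels, layers, stride) for B0 stages 1-8,
# transcribed from the architecture table.
B0_STAGES = [
    ("Conv3x3",       32, 1, 2),
    ("MBConv1_k3x3",  16, 1, 1),
    ("MBConv6_k3x3",  24, 2, 2),
    ("MBConv6_k5x5",  40, 2, 2),
    ("MBConv6_k3x3",  80, 3, 2),
    ("MBConv6_k5x5", 112, 3, 1),
    ("MBConv6_k5x5", 192, 4, 2),
    ("MBConv6_k3x3", 320, 1, 1),
]

def stage_resolutions(input_size=224):
    """Return the input resolution seen by each stage."""
    sizes, size = [], input_size
    for _name, _channels, _layers, stride in B0_STAGES:
        sizes.append(size)     # resolution entering this stage
        size = size // stride  # downsampling applied by the stage
    return sizes

print(stage_resolutions())  # [224, 112, 112, 56, 28, 14, 14, 7]
```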
The Mobile Inverted Bottleneck Convolution (MBConv) block is the core building block of EfficientNet. It was originally introduced in MobileNetV2 and later refined for use in MnasNet and EfficientNet.
A standard residual bottleneck block (as in ResNet) uses a "wide-narrow-wide" pattern: it compresses the input through a bottleneck and then expands it back. The MBConv block inverts this pattern, using a "narrow-wide-narrow" structure:

1. A 1x1 convolution expands the number of channels by the expansion ratio (e.g., 6x for MBConv6).
2. A depthwise convolution (3x3 or 5x5) filters each expanded channel independently.
3. A 1x1 convolution projects the expanded channels back down to the output width.
When the input and output dimensions match, a skip (residual) connection adds the input directly to the output, similar to ResNet.
Each MBConv block in EfficientNet includes a squeeze-and-excitation (SE) module, which was introduced by Hu et al. in 2018. The SE module performs channel-wise attention in two steps:

1. **Squeeze:** global average pooling reduces each channel's feature map to a single scalar, producing one value per channel.
2. **Excitation:** two small fully connected layers (a dimensionality reduction followed by an expansion, ending in a sigmoid) transform this vector into a weight between 0 and 1 for each channel.
These weights are then multiplied element-wise with the original feature map, allowing the network to emphasize informative channels and suppress less useful ones. The SE module adds minimal computational overhead but consistently improves accuracy.
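A minimal NumPy sketch of the squeeze-and-excitation computation, assuming a (channels, height, width) feature map and illustrative random weight matrices `w1` and `w2` (a real SE module learns these during training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(features, w1, w2):
    """Reweight channels: squeeze (global pool), then excite (2 FC layers)."""
    squeezed = features.mean(axis=(1, 2))     # squeeze: (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # reduce + ReLU: (C // r,)
    weights = sigmoid(w2 @ hidden)            # expand + sigmoid: (C,)
    return features * weights[:, None, None]  # channel-wise rescale

rng = np.random.default_rng(0)
C, r = 16, 4  # channels and reduction ratio (illustrative values)
x = rng.standard_normal((C, 14, 14))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = squeeze_excite(x, w1, w2)
print(out.shape)  # (16, 14, 14): same shape, channels rescaled by (0, 1)
```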
EfficientNet uses the Swish activation function (also known as SiLU, or Sigmoid Linear Unit), defined as f(x) = x * sigmoid(x). Swish was discovered through automated search by Ramachandran et al. (2017) and tends to outperform ReLU in deep networks because it is smooth and non-monotonic, allowing small negative values to pass through.
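The definition is short enough to state directly (a sketch on scalars; frameworks apply it elementwise to tensors):

```python
import math

def swish(x):
    """Swish / SiLU: x * sigmoid(x), written as x / (1 + e^-x)."""
    return x / (1.0 + math.exp(-x))

print(swish(0.0))   # 0.0
print(swish(5.0))   # ~4.97: approaches the identity for large positive x
print(swish(-5.0))  # ~-0.03: small negative values pass through,
                    # unlike ReLU, which zeroes all negative inputs
```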
Once the B0 baseline architecture and the compound scaling coefficients (α, β, γ) are fixed, the family of models is generated by increasing the compound coefficient φ. Each increment of φ roughly doubles the computational budget.
| Model | Compound Coefficient (φ) | Input Resolution | Parameters | FLOPs | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|---|---|---|
| EfficientNet-B0 | 1.0 | 224 x 224 | 5.3M | 0.39B | 77.1% | 93.3% |
| EfficientNet-B1 | 1.1 | 240 x 240 | 7.8M | 0.70B | 79.1% | 94.4% |
| EfficientNet-B2 | 1.2 | 260 x 260 | 9.2M | 1.0B | 80.1% | 94.9% |
| EfficientNet-B3 | 1.3 | 300 x 300 | 12M | 1.8B | 81.6% | 95.7% |
| EfficientNet-B4 | 1.4 | 380 x 380 | 19M | 4.2B | 82.9% | 96.4% |
| EfficientNet-B5 | 1.6 | 456 x 456 | 30M | 9.9B | 83.6% | 96.7% |
| EfficientNet-B6 | 1.8 | 528 x 528 | 43M | 19B | 84.0% | 96.8% |
| EfficientNet-B7 | 2.0 | 600 x 600 | 66M | 37B | 84.3% | 97.0% |
Several trends stand out from this progression. First, each step up in φ increases the input resolution, depth, and width simultaneously, which is what gives the compound scaling method its advantage. Second, the accuracy improvements are roughly logarithmic with respect to FLOPs: going from B0 to B3 adds about 4.5 percentage points of top-1 accuracy at the cost of roughly 4.6x more FLOPs, while going from B3 to B7 adds another 2.7 percentage points but requires about 20x more FLOPs. Third, even the largest model (B7) remains substantially smaller than many competing architectures that achieve similar accuracy.
The efficiency gains of EfficientNet become particularly clear when compared with other popular convolutional neural network architectures of the same era.
| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|
| ResNet-50 | 26M | 4.1B | 76.0% |
| DenseNet-169 | 14M | 3.4B | 76.2% |
| Inception-v3 | 24M | 5.7B | 78.8% |
| NASNet-A (Large) | 89M | 24B | 82.7% |
| GPipe (AmoebaNet) | 557M | - | 84.3% |
| EfficientNet-B0 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B4 | 19M | 4.2B | 82.9% |
| EfficientNet-B7 | 66M | 37B | 84.3% |
Key observations from these comparisons:

- EfficientNet-B0 surpasses ResNet-50 (77.1% vs 76.0% top-1) with roughly one-fifth the parameters and one-tenth the FLOPs.
- EfficientNet-B4 uses about the same FLOPs as ResNet-50 (4.2B vs 4.1B) yet scores 6.9 percentage points higher in top-1 accuracy.
- EfficientNet-B7 matches GPipe's 84.3% top-1 accuracy with 8.4x fewer parameters (66M vs 557M).
The compound scaling method also demonstrated its generality by improving existing architectures. When applied to MobileNet, it added 1.4 percentage points of ImageNet accuracy. Applied to ResNet, it added 0.7 percentage points.
In 2021, Tan and Le published a follow-up paper titled "EfficientNetV2: Smaller Models and Faster Training," presented at ICML 2021. While the original EfficientNet focused on inference efficiency (fewer parameters and FLOPs), EfficientNetV2 addressed a critical practical concern: training speed.
Profiling the original EfficientNet models revealed several training bottlenecks:

1. Training with very large image sizes is slow and memory-intensive, forcing smaller batch sizes.
2. Depthwise convolutions are slow in the early layers, where they underutilize modern accelerators.
3. Scaling up every stage equally, as compound scaling does, is suboptimal.
A key architectural change in EfficientNetV2 is the introduction of Fused-MBConv blocks, which replace the separate depthwise convolution and 1x1 expansion convolution in a standard MBConv block with a single standard 3x3 (or 5x5) convolution. This fused operation has more parameters and FLOPs than a depthwise separable equivalent, but it runs significantly faster on modern hardware because standard convolutions have higher arithmetic intensity and are better optimized in GPU libraries like cuDNN.
The key design insight is that Fused-MBConv should only be used in the early stages of the network (stages 1 through 3), where feature maps are large and the hardware utilization benefits are greatest. Replacing all MBConv blocks with Fused-MBConv throughout the network would increase parameters and FLOPs excessively while actually slowing down training. Later stages continue to use standard MBConv blocks.
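A back-of-envelope sketch shows the parameter trade-off. The helpers below (illustrative names; biases, SE, and the 1x1 projection are omitted because they are identical in both designs) count only the weights in the expansion path:

```python
def mbconv_expand_params(c, e, k=3):
    """1x1 expansion conv followed by a KxK depthwise conv (C -> C*E)."""
    expand = c * (c * e)          # 1x1 conv: C -> C*E channels
    depthwise = (c * e) * k * k   # KxK depthwise on C*E channels
    return expand + depthwise

def fused_mbconv_expand_params(c, e, k=3):
    """Single dense KxK conv doing the same C -> C*E expansion."""
    return c * (c * e) * k * k

# For 24 input channels and expansion ratio 4 the fused version carries
# several times the weights (approaching k^2 = 9x as C grows), but dense
# convolutions have higher arithmetic intensity and are far better
# optimized in GPU libraries, so they train faster on large feature maps.
print(mbconv_expand_params(24, 4))        # 3168
print(fused_mbconv_expand_params(24, 4))  # 20736
```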
EfficientNetV2 introduced an adaptive progressive learning strategy. The idea of training with progressively increasing image sizes was not new, but previous implementations had a drawback: they applied the same regularization strength regardless of image size, which led to accuracy drops.
The V2 approach adaptively adjusts regularization strength alongside image size during training. Smaller images require weaker regularization because the network has less capacity to overfit at lower resolutions. As the image size increases throughout training, regularization is correspondingly strengthened.
For EfficientNetV2-M, the training schedule (approximately 350 epochs on ImageNet, divided into four stages of about 87 epochs each) works as follows:
| Training Stage | Image Size | RandAugment Magnitude | Mixup Alpha | Dropout Rate |
|---|---|---|---|---|
| Stage 1 | 128 | 5 | 0.0 | 0.1 |
| Stage 2 | 212 | 10 | 0.1 | 0.2 |
| Stage 3 | 296 | 15 | 0.15 | 0.3 |
| Stage 4 | 380 | 20 | 0.2 | 0.4 |
This progressive schedule dramatically reduces training time while maintaining or improving final accuracy.
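Most of the schedule can be generated by linearly interpolating between the first- and last-stage endpoints. The sketch below (illustrative names, and only for the image-size, RandAugment, and dropout columns, which follow a linear ramp in the table) shows the idea:

```python
N_STAGES = 4  # ~350 ImageNet epochs split into four stages

def lerp(start, end, stage):
    """Linearly interpolate a setting over stages 1..N_STAGES."""
    t = (stage - 1) / (N_STAGES - 1)
    return start + t * (end - start)

def stage_settings(stage):
    """Easy-to-hard settings: bigger images get stronger regularization."""
    return {
        "image_size": round(lerp(128, 380, stage)),
        "randaugment_magnitude": round(lerp(5, 20, stage)),
        "dropout_rate": round(lerp(0.1, 0.4, stage), 2),
    }

for s in range(1, N_STAGES + 1):
    print(s, stage_settings(s))
# stage 2 -> image_size 212, magnitude 10, dropout 0.2, as in the table
```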
Beyond Fused-MBConv, EfficientNetV2 incorporated several other architectural changes discovered through training-aware NAS:

- Smaller expansion ratios for MBConv blocks, which reduce memory access overhead.
- A preference for smaller 3x3 kernels, compensated by adding more layers to recover the receptive field.
- Removal of the last stride-1 stage present in the original EfficientNet, cutting parameters and memory use.
| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) | Training Time (32 TPUv3 cores) |
|---|---|---|---|---|
| EfficientNetV2-S | 22M | 8.8B | 83.9% | ~30h |
| EfficientNetV2-M | 54M | 24B | 85.1% | ~50h |
| EfficientNetV2-L | 120M | 53B | 85.7% | ~80h |
For comparison, EfficientNet-B7 (V1) achieves 84.3% accuracy but requires approximately 139 hours of training on the same hardware. EfficientNetV2-M exceeds B7's accuracy while training roughly 3x faster in this comparison and using fewer parameters; across its full set of comparisons, the paper reports training speedups of up to 11x.
When pretrained on ImageNet-21k (a larger dataset with 21,000 classes and approximately 14 million images), the V2 models achieve even higher accuracy: EfficientNetV2-L reaches 86.8% top-1 accuracy on ImageNet, and the larger EfficientNetV2-XL variant reaches 87.3%.
The EfficientNetV2-XL result outperformed the Vision Transformer (ViT-L/16) by 2.0 percentage points while training 5x to 11x faster using the same computing resources.
In late 2019, researchers at Google Brain published "Self-training with Noisy Student improves ImageNet classification" (Xie et al., CVPR 2020), which used EfficientNet as its backbone. This semi-supervised learning approach pushed EfficientNet's performance well beyond its supervised training results.
The Noisy Student method works as follows:

1. Train a teacher model (an EfficientNet) on labeled ImageNet.
2. Use the teacher to generate pseudo-labels for a large corpus of unlabeled images (300 million images from Google's internal JFT dataset).
3. Train an equal-sized or larger student model on the combined labeled and pseudo-labeled data, injecting noise into the student via dropout, stochastic depth, and RandAugment data augmentation.
4. Make the student the new teacher and repeat.
Using an EfficientNet-L2 architecture (a scaled-up variant larger than B7), Noisy Student training achieved 88.4% top-1 accuracy on ImageNet, a major improvement over the supervised-only result of 84.3% for B7. The approach also significantly improved robustness: top-1 accuracy on ImageNet-A (a dataset of adversarially filtered natural images) jumped from 61.0% to 83.7%.
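The teacher-student loop can be illustrated on toy data. The sketch below substitutes a nearest-centroid classifier and synthetic Gaussian blobs for EfficientNet and ImageNet/JFT; it demonstrates only the control flow of self-training, not the method at scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(x, y, n_classes=2):
    """'Train' a model: one centroid per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, x):
    """Classify each point by its nearest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Small labeled set, large unlabeled set, drawn from two 2-D blobs.
labeled_x = np.vstack([rng.normal(-2, 1, (5, 2)), rng.normal(2, 1, (5, 2))])
labeled_y = np.array([0] * 5 + [1] * 5)
unlabeled_x = np.vstack([rng.normal(-2, 1, (200, 2)),
                         rng.normal(2, 1, (200, 2))])

teacher = fit_centroids(labeled_x, labeled_y)
for _ in range(3):  # each student becomes the next teacher
    pseudo_y = predict(teacher, unlabeled_x)   # teacher pseudo-labels
    noisy_x = unlabeled_x + rng.normal(0, 0.3, unlabeled_x.shape)  # noise
    all_x = np.vstack([labeled_x, noisy_x])
    all_y = np.concatenate([labeled_y, pseudo_y])
    teacher = fit_centroids(all_x, all_y)      # train the "student"
```

In the real method the student is noised (and often enlarged) while the teacher's pseudo-labels are generated without noise, which is what pushes the student beyond the teacher.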
EfficientNet has become one of the most popular backbone networks for transfer learning across a wide range of computer vision tasks. Its combination of high accuracy and low computational cost makes it particularly attractive for applications where resources are constrained.
Pretrained EfficientNet models are widely used as feature extractors for custom image classification tasks. The standard approach involves replacing the final fully connected layer with a new classifier head and fine-tuning the network on a domain-specific dataset. EfficientNet models pretrained on ImageNet transfer well to medical imaging, satellite imagery, wildlife monitoring, and many other domains.
The B0 through B4 variants are especially popular for transfer learning because they offer strong accuracy at manageable computational costs. For resource-constrained settings such as mobile or edge deployment, B0 and B1 provide an excellent accuracy-to-cost ratio.
Mingxing Tan, Ruoming Pang, and Quoc Le extended the compound scaling concept to object detection with EfficientDet (Tan et al., CVPR 2020). EfficientDet uses EfficientNet as its backbone and introduces two additional innovations:

1. A weighted bi-directional feature pyramid network (BiFPN) that fuses multi-scale features through repeated top-down and bottom-up connections, learning a weight for each input feature.
2. A compound scaling rule for detection that jointly scales the input resolution, the backbone, the BiFPN, and the box/class prediction networks.
EfficientDet achieved state-of-the-art results on COCO object detection while using 28x fewer FLOPs than YOLOv3, 30x fewer FLOPs than RetinaNet, and 19x fewer FLOPs than NAS-FPN with a ResNet backbone.
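BiFPN's per-input weighting uses the EfficientDet paper's "fast normalized fusion": each input feature map gets a learnable weight, kept non-negative via ReLU and normalized by the sum of all weights. A NumPy sketch (illustrative function name; real weights are learned):

```python
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """O = sum_i(w_i * I_i) / (eps + sum_j w_j), with w_i >= 0 via ReLU."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU
    w = w / (eps + w.sum())                                # normalize
    return sum(wi * x for wi, x in zip(w, inputs))

a = np.full((4, 4), 1.0)  # feature map from one pyramid level
b = np.full((4, 4), 3.0)  # feature map from another level
fused = fast_normalized_fusion([a, b], [1.0, 1.0])
print(fused[0, 0])  # ~2.0: equal weights simply average the two inputs
```

The paper chose this form over a softmax over the weights because it behaves almost identically while avoiding the exponentials, which made it noticeably faster on GPU.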
EfficientNet backbones have been integrated into numerous segmentation frameworks, including U-Net and DeepLab variants. They have also been used for tasks such as action recognition, face recognition, medical image analysis, and generative models. The architecture's modularity and availability in major frameworks (TensorFlow, PyTorch, Keras) have contributed to its widespread adoption.
EfficientNet is available in all major deep learning frameworks:

- **Keras/TensorFlow:** `tf.keras.applications` ships pretrained weights for B0 through B7 and V2-S, V2-M, and V2-L.
- **PyTorch:** `torchvision.models` and the popular `timm` (PyTorch Image Models) library by Ross Wightman provide EfficientNet implementations with pretrained weights.
- **Official releases:** Google also published the official TensorFlow implementation and pretrained checkpoints through the TensorFlow Model Garden and the `automl` repository on GitHub.
The following table provides a consolidated overview of all major EfficientNet variants, including both V1 and V2 families.
| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|---|
| EfficientNet-B0 | 224 x 224 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B1 | 240 x 240 | 7.8M | 0.70B | 79.1% |
| EfficientNet-B2 | 260 x 260 | 9.2M | 1.0B | 80.1% |
| EfficientNet-B3 | 300 x 300 | 12M | 1.8B | 81.6% |
| EfficientNet-B4 | 380 x 380 | 19M | 4.2B | 82.9% |
| EfficientNet-B5 | 456 x 456 | 30M | 9.9B | 83.6% |
| EfficientNet-B6 | 528 x 528 | 43M | 19B | 84.0% |
| EfficientNet-B7 | 600 x 600 | 66M | 37B | 84.3% |
| EfficientNetV2-S | 384 x 384 | 22M | 8.8B | 83.9% |
| EfficientNetV2-M | 480 x 480 | 54M | 24B | 85.1% |
| EfficientNetV2-L | 480 x 480 | 120M | 53B | 85.7% |
EfficientNetV2-S achieves nearly the same accuracy as EfficientNet-B7 with roughly one-third the parameters and one-quarter the FLOPs, illustrating the substantial improvements in the V2 design.
EfficientNet's influence on the deep learning field extends well beyond its ImageNet benchmarks. The compound scaling method introduced a principled framework for thinking about model scaling that has been adopted and adapted by subsequent work in natural language processing, speech recognition, and other domains.
The architecture also played a key role in demonstrating the power of neural architecture search for producing practical, deployable models. While earlier NAS results like NASNet and AmoebaNet were often too large for practical use, EfficientNet showed that NAS could produce compact, efficient architectures that outperformed hand-designed models across a range of computational budgets.
In the context of the broader shift toward transformer-based architectures in computer vision, EfficientNet remains relevant in several ways. Many practitioners continue to prefer convolutional architectures for smaller datasets, mobile deployment, and real-time applications. Hybrid architectures such as CoAtNet combine convolutional stages (including MBConv blocks) with transformer stages, drawing directly on EfficientNet's design principles. The compound scaling idea has also been applied to transformer-based models.
As of 2025, EfficientNet models remain among the most downloaded and widely used pretrained models in the PyTorch and TensorFlow ecosystems, particularly for transfer learning applications where their combination of accuracy, efficiency, and ease of use continues to offer practical value.