# EfficientNet

> Source: https://aiwiki.ai/wiki/efficientnet
> Updated: 2026-06-21
> Categories: Computer Vision, Deep Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**EfficientNet** is a family of [convolutional neural network](/wiki/convolutional_neural_network) architectures and a model-scaling method that uniformly scales network depth, width, and input resolution with a single compound coefficient, developed by Mingxing Tan and Quoc V. Le at [Google](/wiki/google) Brain and introduced in the 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."[1] Its flagship model, EfficientNet-B7, reached 84.3% top-1 accuracy on [ImageNet](/wiki/imagenet) while being 8.4x smaller and 6.1x faster on inference than the best existing convolutional network at the time, and the smallest model, EfficientNet-B0, matched [ResNet](/wiki/resnet)-50 accuracy with roughly one-fifth the parameters.[1] The paper was presented at the International Conference on Machine Learning (ICML) in Long Beach, California.[1] The central contribution of EfficientNet is its **compound scaling method**, which replaced the ad hoc, single-dimension scaling strategies that had dominated prior [deep learning](/wiki/deep_learning) research and produced a family of models (EfficientNet-B0 through B7) that achieved state-of-the-art accuracy while using significantly fewer parameters and floating-point operations (FLOPs) than competing architectures.[1] A follow-up paper in 2021 introduced **EfficientNetV2**, which further improved training speed through progressive learning and architectural refinements.[2]

The authors summarize the result directly: "our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet."[1]

The original EfficientNet paper has become one of the most cited works in [computer vision](/wiki/computer_vision), accumulating more than 24,000 citations as of 2025.[1][10] Its ideas have influenced the design of numerous subsequent architectures and remain widely used in both research and production systems.

## Background and Motivation

Before EfficientNet, scaling up convolutional neural networks to improve accuracy was a common but inconsistent practice. Researchers would typically increase one of three dimensions of a network: **depth** (number of layers), **width** (number of channels per layer), or **input resolution** (the pixel dimensions of the input image).[1] For example, [ResNet](/wiki/resnet) demonstrated the benefits of increasing depth from 18 to 152 layers. Wide ResNet showed that wider networks could outperform very deep but narrow ones. Other work used higher-resolution inputs to capture finer details in images.

The problem with these single-dimension scaling approaches is that they quickly reach diminishing returns.[1] Making a network deeper without correspondingly increasing its width or resolution leads to vanishing gradients and limited accuracy gains. Similarly, increasing resolution alone adds computational cost without proportionally improving performance. Practitioners had no principled way to decide how much to scale each dimension, and most scaling decisions were made by trial and error.

Tan and Le set out to answer a fundamental question: is there a principled method to scale up convolutional neural networks that balances all three dimensions simultaneously? Their answer was the compound scaling method.[1]

## What is the compound scaling method?

### The Core Idea

The compound scaling method rests on a straightforward observation: network depth, width, and resolution are interdependent, and scaling them together in a balanced fashion yields better results than scaling any single dimension alone.[1] Intuitively, a higher-resolution image requires more layers (greater depth) to capture larger receptive fields and more channels (greater width) to capture finer-grained patterns at the increased resolution.

### Mathematical Formulation

The method defines three scaling coefficients that control how each dimension grows:

- **Depth:** d = α^(φ)
- **Width:** w = β^(φ)
- **Resolution:** r = γ^(φ)

Here, α, β, and γ are constants determined by a small grid search on the baseline network, and φ (phi) is a user-specified compound coefficient that controls the total computational budget.[1] The key constraint is:

> α · β² · γ² ≈ 2

This constraint ensures that for each increment of φ, the total FLOPs roughly double.[1] The FLOPs of a convolutional network scale proportionally with d, w², and r², so scaling with coefficient φ multiplies the total FLOPs by approximately (α · β² · γ²)^φ ≈ 2^φ. This is why β and γ appear squared in the constraint.

For EfficientNet, the authors found the optimal base coefficients through a grid search on the B0 model:

- **α = 1.2** (depth multiplier)
- **β = 1.1** (width multiplier)
- **γ = 1.15** (resolution multiplier)

With these values fixed, different values of φ produce the full EfficientNet family from B0 to B7.[1]
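
The arithmetic behind these coefficients can be checked with a short sketch. The helper name `compound_scale` is illustrative and not taken from the paper's code; it only evaluates d = α^φ, w = β^φ, r = γ^φ and the FLOPs multiplier d · w² · r² that follows from them.

```python
# Base coefficients reported for the grid search on EfficientNet-B0.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for a given phi.

    Illustrative helper: evaluates d = alpha^phi, w = beta^phi, r = gamma^phi
    and the FLOPs multiplier d * w^2 * r^2 implied by those three factors.
    """
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


if __name__ == "__main__":
    for phi in range(4):
        d, w, r, f = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With φ = 1 the FLOPs multiplier comes out at about 1.92, close to the target of 2 set by the constraint.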

### Why Compound Scaling Works

The paper includes ablation studies showing that scaling only one dimension (depth only, width only, or resolution only) provides diminishing accuracy gains beyond a certain point.[1] For instance, scaling width alone with depth fixed at d = 1.0 and resolution fixed at r = 1.0 quickly saturates in accuracy. However, when all three dimensions are scaled together, the same FLOP budget yields substantially higher accuracy.[1] The compound scaling method captures the intuition that a larger input image needs both more layers to increase the receptive field and more channels to capture the additional fine-grained detail.

## Neural Architecture Search for EfficientNet-B0

Rather than designing the baseline architecture by hand, Tan and Le used [neural architecture search](/wiki/neural_architecture_search) (NAS) to discover EfficientNet-B0.[1] The search procedure was adapted from MnasNet, an earlier NAS method also developed at Google.[3]

### Search Space and Objective

The search space consisted of mobile inverted bottleneck convolution (MBConv) blocks with varying kernel sizes, expansion ratios, and numbers of layers per stage.[3] The architecture was divided into multiple stages, and the search algorithm could select different block configurations for each stage.

The optimization objective was multi-objective, balancing accuracy and computational efficiency:

> ACC(m) × [FLOPS(m) / T]^w

where ACC(m) is the accuracy of model m, FLOPS(m) is the model's floating-point operations, T is a target FLOP count, and w = -0.07 controls the trade-off between accuracy and efficiency.[1] Unlike MnasNet, which optimized for inference latency on specific hardware, EfficientNet optimized for FLOPs to keep the architecture hardware-agnostic.[1]
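
As a rough illustration of how this objective trades accuracy against cost, the sketch below evaluates it for two candidates. The function name `nas_reward` and both candidate models are hypothetical; the default target of 400 million FLOPs is the budget the article gives for the B0 search.

```python
def nas_reward(accuracy, flops, target_flops=400e6, w=-0.07):
    """Multi-objective reward ACC(m) * (FLOPS(m) / T) ** w.

    With w < 0, a model above the FLOP target is penalized and a model
    below it is rewarded, while accuracy scales the result directly.
    """
    return accuracy * (flops / target_flops) ** w


if __name__ == "__main__":
    # Two made-up candidates: (top-1 accuracy, FLOPs).
    candidates = {"on_budget": (0.770, 400e6), "double_budget": (0.780, 800e6)}
    for name, (acc, flops) in candidates.items():
        print(name, round(nas_reward(acc, flops), 4))
```

Doubling the FLOPs multiplies the reward by 2^(-0.07), roughly 0.95, so in this made-up case the one-point accuracy gain of the larger candidate does not pay for its extra cost.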

### Search Procedure

The search used a reinforcement learning-based controller (an RNN) that sampled candidate architectures from the search space.[3] Each candidate was trained on ImageNet, and its accuracy and FLOP count were measured. The controller was then updated to produce architectures that better balance accuracy and efficiency. The target FLOP budget was approximately 400 million FLOPs, which is slightly larger than MnasNet's target.[1]

The resulting architecture, EfficientNet-B0, served as the foundation for the entire EfficientNet family.[1] All larger models (B1 through B7) were derived from B0 by applying the compound scaling formula with increasing values of φ.

## EfficientNet-B0 Architecture

EfficientNet-B0 is organized into nine stages. The first stage is a standard 3x3 convolution, stages 2 through 8 use MBConv blocks with varying configurations, and the final stage consists of a 1x1 convolution followed by global average pooling and a fully connected classification layer.[1]

| Stage | Operator | Resolution | Channels | Layers | Stride |
|-------|----------|------------|----------|--------|--------|
| 1 | Conv 3x3 | 224 x 224 | 32 | 1 | 2 |
| 2 | MBConv1, k3x3 | 112 x 112 | 16 | 1 | 1 |
| 3 | MBConv6, k3x3 | 112 x 112 | 24 | 2 | 2 |
| 4 | MBConv6, k5x5 | 56 x 56 | 40 | 2 | 2 |
| 5 | MBConv6, k3x3 | 28 x 28 | 80 | 3 | 2 |
| 6 | MBConv6, k5x5 | 14 x 14 | 112 | 3 | 1 |
| 7 | MBConv6, k5x5 | 14 x 14 | 192 | 4 | 2 |
| 8 | MBConv6, k3x3 | 7 x 7 | 320 | 1 | 1 |
| 9 | Conv 1x1, Pooling, FC | 7 x 7 | 1280 | 1 | - |

In this table, "MBConv1" refers to a mobile inverted bottleneck block with an expansion ratio of 1 (no expansion), and "MBConv6" refers to blocks with an expansion ratio of 6. The notation "k3x3" and "k5x5" indicates the kernel size of the depthwise convolution within each block.

The B0 architecture accepts 224 x 224 input images and contains 5.3 million parameters with 0.39 billion FLOPs.[1] Despite its compact size, it achieves 77.1% top-1 accuracy on ImageNet, which already surpasses [ResNet](/wiki/resnet)-50 (76.0% top-1) while using roughly five times fewer parameters.[1]
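
The stage table can be written down as plain data and sanity-checked. The sketch below assumes the Resolution column gives each stage's input resolution and that every stride-2 stage halves it; the names `B0_STAGES`, `trace_resolutions`, and `count_mbconv_layers` are illustrative.

```python
# (operator, input resolution, channels, layers, stride) for the nine stages,
# copied from the EfficientNet-B0 table.
B0_STAGES = [
    ("Conv3x3", 224, 32, 1, 2),
    ("MBConv1 k3x3", 112, 16, 1, 1),
    ("MBConv6 k3x3", 112, 24, 2, 2),
    ("MBConv6 k5x5", 56, 40, 2, 2),
    ("MBConv6 k3x3", 28, 80, 3, 2),
    ("MBConv6 k5x5", 14, 112, 3, 1),
    ("MBConv6 k5x5", 14, 192, 4, 2),
    ("MBConv6 k3x3", 7, 320, 1, 1),
    ("Conv1x1+Pool+FC", 7, 1280, 1, None),
]


def trace_resolutions(stages, input_size=224):
    """Recompute each stage's input resolution from the strides alone."""
    sizes = []
    size = input_size
    for _name, _res, _channels, _layers, stride in stages:
        sizes.append(size)
        if stride is not None:
            size //= stride  # a stride-2 stage halves the spatial size
    return sizes


def count_mbconv_layers(stages):
    """Total number of MBConv blocks across all stages."""
    return sum(layers for name, _r, _c, layers, _s in stages
               if name.startswith("MBConv"))


if __name__ == "__main__":
    print(trace_resolutions(B0_STAGES))
    print(count_mbconv_layers(B0_STAGES))
```

Under these assumptions the five stride-2 stages reduce 224 to 7, matching the table, and the seven MBConv stages contain 16 blocks in total.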

## MBConv Blocks

The Mobile Inverted Bottleneck [Convolution](/wiki/convolution) (MBConv) block is the core building block of EfficientNet. It was originally introduced in [MobileNetV2](/wiki/mobilenet) and later refined for use in MnasNet and EfficientNet.[4]

### Inverted Residual Structure

A standard residual block (as in ResNet) uses a "wide-narrow-wide" pattern: it compresses the input through a bottleneck and then expands it back. The MBConv block inverts this pattern, using a "narrow-wide-narrow" structure:[4]

1. **Expansion:** A 1x1 pointwise convolution expands the input channels by an expansion ratio (typically 6x for most stages in EfficientNet-B0, except the first MBConv stage which uses an expansion ratio of 1).
2. **Depthwise convolution:** A depthwise convolution (the per-channel spatial step of a [depthwise separable convolution](/wiki/convolutional_neural_network)) with a 3x3 or 5x5 kernel processes the expanded feature map. Each channel is convolved independently, which drastically reduces the parameter count and computation compared to a standard convolution.
3. **Projection:** A second 1x1 pointwise convolution projects the expanded features back to a lower-dimensional output.

When the input and output dimensions match, a skip (residual) connection adds the input directly to the output, similar to ResNet.[4]
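
To make the savings concrete, the following sketch counts convolution weights for the three steps of an MBConv block. It is a simplification: biases, batch normalization, and the squeeze-and-excitation module are ignored, the expansion step is assumed to be skipped when the ratio is 1, and the function names are illustrative.

```python
def mbconv_weights(c_in, c_out, expansion, kernel):
    """Weight counts for the three convolutions of an MBConv block.

    Counts convolution weights only; biases, batch norm, and the
    squeeze-and-excitation module are left out for simplicity.
    """
    c_mid = c_in * expansion
    # 1x1 pointwise expansion (assumed absent when the ratio is 1).
    expand = c_in * c_mid if expansion != 1 else 0
    # Depthwise step: one k x k filter per expanded channel.
    depthwise = c_mid * kernel * kernel
    # 1x1 pointwise projection back to the output width.
    project = c_mid * c_out
    return {"expand": expand, "depthwise": depthwise, "project": project,
            "total": expand + depthwise + project}


def standard_conv_weights(c_in, c_out, kernel):
    """Weight count of an ordinary k x k convolution."""
    return c_in * c_out * kernel * kernel


if __name__ == "__main__":
    # An MBConv6, k5x5 block with 40 input and output channels.
    print(mbconv_weights(40, 40, expansion=6, kernel=5))
    # A standard 5x5 convolution over the same 240 expanded channels.
    print(standard_conv_weights(240, 240, kernel=5))
```

For this block the depthwise step needs 6,000 weights, whereas a standard 5x5 convolution over the same 240 expanded channels would need 1,440,000; the whole block totals 25,200.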

### Squeeze-and-Excitation Module

Each MBConv block in EfficientNet includes a [squeeze-and-excitation](/wiki/squeeze_and_excitation_networks) (SE) module, which was introduced by Hu et al. in 2018.[5] The SE module performs channel-wise attention in two steps:

1. **Squeeze:** Global average pooling reduces each channel's spatial dimensions (height and width) to a single scalar value, producing a channel descriptor vector.
2. **Excitation:** The channel descriptor is passed through two fully connected layers: the first reduces it to a bottleneck (a reduction ratio of 0.25 in EfficientNet) and the second restores the original number of channels, followed by a sigmoid activation that produces per-channel scaling weights.

These weights are then multiplied element-wise with the original feature map, allowing the network to emphasize informative channels and suppress less useful ones.[5] The SE module adds minimal computational overhead but consistently improves accuracy.[5]
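
The two steps and the final rescaling can be sketched in plain Python on nested lists. This is a minimal illustration, not EfficientNet's implementation: the function name, the weight layout, and the ReLU between the two fully connected layers are assumptions made for the example.

```python
import math


def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Apply squeeze-and-excitation to a small feature map.

    feature_map: list of C channels, each a list of rows (H x W floats).
    w1, b1: weights and biases of the reducing FC layer (R x C and R).
    w2, b2: weights and biases of the restoring FC layer (C x R and C).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]

    # Excitation, layer 1: reduce to R values, then ReLU (assumed choice).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)) + b)
              for row, b in zip(w1, b1)]

    # Excitation, layer 2: restore to C values, then sigmoid in (0, 1).
    scales = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
              for row, b in zip(w2, b2)]

    # Rescale every element of each channel by that channel's weight.
    return [[[v * s for v in row] for row in ch]
            for ch, s in zip(feature_map, scales)]


if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
    # All-zero weights make every sigmoid output 0.5, halving each channel.
    out = squeeze_excite(fmap, w1=[[0.0, 0.0]], b1=[0.0],
                         w2=[[0.0], [0.0]], b2=[0.0, 0.0])
    print(out)
```

In a trained network the weights are learned, so informative channels receive scales near 1 and uninformative ones are pushed toward 0.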

### Activation Function

EfficientNet uses the **Swish** activation function (also known as SiLU, or Sigmoid Linear Unit), defined as f(x) = x * sigmoid(x).[8] Swish was discovered through automated search by Ramachandran et al. (2017) and tends to outperform [ReLU](/wiki/relu) in deep networks because it is smooth and non-monotonic, allowing small negative values to pass through.[8]
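
A direct implementation of this definition shows the behaviour described above. The helper names are illustrative, and the sigmoid is written in a numerically stable form so that large negative inputs do not overflow.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish / SiLU: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def relu(x):
    """ReLU for comparison: negative inputs are clipped to zero."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, Swish returns small negative values for negative inputs, and it is non-monotonic: the output at x = -1 is lower than at x = -5.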

## Scaling from B0 to B7

Once the B0 baseline architecture and the compound scaling coefficients (α, β, γ) are fixed, the family of models is generated by increasing the compound coefficient φ. Each increment of φ roughly doubles the computational budget.[1]

| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy | Top-5 Accuracy |
|-------|------------------|------------|-------|----------------|----------------|
| EfficientNet-B0 | 224 x 224 | 5.3M | 0.39B | 77.1% | 93.3% |
| EfficientNet-B1 | 240 x 240 | 7.8M | 0.70B | 79.1% | 94.4% |
| EfficientNet-B2 | 260 x 260 | 9.2M | 1.0B | 80.1% | 94.9% |
| EfficientNet-B3 | 300 x 300 | 12M | 1.8B | 81.6% | 95.7% |
| EfficientNet-B4 | 380 x 380 | 19M | 4.2B | 82.9% | 96.4% |
| EfficientNet-B5 | 456 x 456 | 30M | 9.9B | 83.6% | 96.7% |
| EfficientNet-B6 | 528 x 528 | 43M | 19B | 84.0% | 96.8% |
| EfficientNet-B7 | 600 x 600 | 66M | 37B | 84.3% | 97.0% |

Several trends stand out from this progression. First, each step up in φ increases the input resolution, depth, and width simultaneously, which is what gives the compound scaling method its advantage. Second, the accuracy improvements are roughly logarithmic with respect to FLOPs: going from B0 to B3 adds about 4.5 percentage points of top-1 accuracy at the cost of roughly 4.6x more FLOPs, while going from B3 to B7 adds another 2.7 percentage points but requires about 20x more FLOPs.[1] Third, even the largest model (B7) remains substantially smaller than many competing architectures that achieve similar accuracy.
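
The diminishing returns can be expressed as accuracy gained per doubling of FLOPs. The sketch below uses the FLOPs and top-1 figures from the table; the function name `gain_per_doubling` is illustrative.

```python
import math

# (FLOPs in billions, top-1 accuracy in %) taken from the table.
MODELS = {
    "B0": (0.39, 77.1),
    "B3": (1.8, 81.6),
    "B7": (37.0, 84.3),
}


def gain_per_doubling(small, large):
    """Top-1 points gained per doubling of FLOPs between two models."""
    flops_s, acc_s = MODELS[small]
    flops_l, acc_l = MODELS[large]
    doublings = math.log2(flops_l / flops_s)
    return (acc_l - acc_s) / doublings


if __name__ == "__main__":
    print(f"B0 -> B3: {gain_per_doubling('B0', 'B3'):.2f} points per doubling")
    print(f"B3 -> B7: {gain_per_doubling('B3', 'B7'):.2f} points per doubling")
```

From B0 to B3 each doubling of FLOPs buys roughly 2 points of top-1 accuracy; from B3 to B7 it buys roughly 0.6.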

## How does EfficientNet compare with other architectures?

The efficiency gains of EfficientNet become particularly clear when compared with other popular convolutional neural network architectures of the same era.

| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|-------|------------|-------|---------------------------|
| [ResNet](/wiki/resnet)-50 | 26M | 4.1B | 76.0% |
| [DenseNet](/wiki/densenet)-169 | 14M | 3.4B | 76.2% |
| Inception-v3 | 24M | 5.7B | 78.8% |
| [NASNet](/wiki/nasnet)-A (Large) | 89M | 24B | 82.7% |
| GPipe (AmoebaNet) | 557M | - | 84.3% |
| EfficientNet-B0 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B4 | 19M | 4.2B | 82.9% |
| EfficientNet-B7 | 66M | 37B | 84.3% |

Key observations from these comparisons:

- **EfficientNet-B0 vs. ResNet-50:** B0 achieves 1.1 percentage points higher top-1 accuracy while using approximately 5x fewer parameters and 10x fewer FLOPs.[1]
- **EfficientNet-B4 vs. NASNet-A:** B4 surpasses NASNet-A by 0.2 percentage points while using 4.7x fewer parameters and 5.7x fewer FLOPs.[1]
- **EfficientNet-B7 vs. GPipe:** B7 matches GPipe's 84.3% top-1 accuracy while using 8.4x fewer parameters. On inference, B7 is 6.1x faster on CPU.[1]
- **EfficientNet-B1 vs. ResNet-152:** B1 achieves comparable accuracy to ResNet-152 while being 7.6x smaller and 5.7x faster on CPU inference.[1]
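
The parameter and FLOP ratios quoted in these observations follow directly from the comparison table, as the short check below shows. The dictionary and function names are illustrative.

```python
# (parameters in millions, FLOPs in billions) from the comparison table.
TABLE = {
    "ResNet-50": (26.0, 4.1),
    "NASNet-A (Large)": (89.0, 24.0),
    "EfficientNet-B0": (5.3, 0.39),
    "EfficientNet-B4": (19.0, 4.2),
}


def savings(baseline, efficientnet):
    """Return (parameter ratio, FLOPs ratio) of baseline over EfficientNet."""
    p_base, f_base = TABLE[baseline]
    p_eff, f_eff = TABLE[efficientnet]
    return p_base / p_eff, f_base / f_eff


if __name__ == "__main__":
    for base, eff in [("ResNet-50", "EfficientNet-B0"),
                      ("NASNet-A (Large)", "EfficientNet-B4")]:
        params, flops = savings(base, eff)
        print(f"{eff} vs. {base}: {params:.1f}x fewer parameters, "
              f"{flops:.1f}x fewer FLOPs")
```

The table values give about 4.9x and 10.5x for B0 against ResNet-50, and about 4.7x and 5.7x for B4 against NASNet-A.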

The compound scaling method also demonstrated its generality by improving existing architectures. When applied to [MobileNet](/wiki/mobilenet), it added 1.4 percentage points of ImageNet accuracy. Applied to ResNet, it added 0.7 percentage points.[1]

## EfficientNetV2

In 2021, Tan and Le published a follow-up paper titled "EfficientNetV2: Smaller Models and Faster Training," presented at ICML 2021.[2] The paper describes the family in its abstract as "a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models."[2] While the original EfficientNet focused on inference efficiency (fewer parameters and FLOPs), EfficientNetV2 addressed a critical practical concern: **training speed**.[2]

### Motivation for V2

Profiling the original EfficientNet models revealed several training bottlenecks:[2]

1. **Large image sizes slow training.** The larger EfficientNet variants (B5 through B7) use very high input resolutions (456 to 600 pixels), which consume enormous GPU memory and slow down training.
2. **Depthwise convolutions are slow on accelerators.** Although [depthwise separable convolutions](/wiki/convolutional_neural_network) reduce FLOPs, they often cannot fully utilize modern GPU and TPU hardware because of their low arithmetic intensity (the ratio of computation to memory access).
3. **Uniform scaling is suboptimal.** The original compound scaling method scales all stages equally, but in practice, the early and late stages of a network may benefit from different scaling strategies.

### Fused-MBConv Blocks

A key architectural change in EfficientNetV2 is the introduction of **Fused-MBConv** blocks, which replace the separate depthwise convolution and 1x1 expansion convolution in a standard MBConv block with a single standard 3x3 (or 5x5) convolution.[2] This fused operation has more parameters and FLOPs than a depthwise separable equivalent, but it runs significantly faster on modern hardware because standard convolutions have higher arithmetic intensity and are better optimized in GPU libraries like cuDNN.[2]

The key design insight is that Fused-MBConv should only be used in the early stages of the network (stages 1 through 3), where feature maps are large and the hardware utilization benefits are greatest.[2] Replacing all MBConv blocks with Fused-MBConv throughout the network would increase parameters and FLOPs excessively while actually slowing down training. Later stages continue to use standard MBConv blocks.
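
A weight count makes the trade-off visible. The sketch below compares the two block types under simplifying assumptions: only convolution weights are counted, the input and output widths are equal, and the channel counts (24 and 256) are hypothetical examples of an early and a late stage.

```python
def mbconv_weights(c_in, c_out, expansion, kernel=3):
    """1x1 expansion + k x k depthwise + 1x1 projection (weights only)."""
    c_mid = c_in * expansion
    return c_in * c_mid + c_mid * kernel * kernel + c_mid * c_out


def fused_mbconv_weights(c_in, c_out, expansion, kernel=3):
    """Single k x k standard convolution + 1x1 projection (weights only)."""
    c_mid = c_in * expansion
    return c_in * c_mid * kernel * kernel + c_mid * c_out


if __name__ == "__main__":
    for channels in (24, 256):  # hypothetical early-stage and late-stage widths
        plain = mbconv_weights(channels, channels, expansion=4)
        fused = fused_mbconv_weights(channels, channels, expansion=4)
        print(f"{channels} channels: MBConv {plain:,} vs. Fused-MBConv {fused:,}")
```

At 24 channels the fused block costs 23,040 weights against 5,472, an increase that stays small in absolute terms; at 256 channels it costs 2,621,440 against 533,504, which illustrates why fusing every stage would inflate the model.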

### Progressive Learning

EfficientNetV2 introduced an **adaptive progressive learning** strategy.[2] The idea of training with progressively increasing image sizes was not new, but previous implementations had a drawback: they applied the same regularization strength regardless of image size, which led to accuracy drops.[2]

The V2 approach adaptively adjusts regularization strength alongside image size during training.[2] Training on smaller images gives the network less effective capacity, so weaker regularization suffices. As the image size increases over the course of training, regularization is strengthened to match.

For EfficientNetV2-M, the training schedule (approximately 350 epochs on ImageNet, divided into four stages of about 87 epochs each) works as follows:[2]

| Training Stage | Image Size | RandAugment Magnitude | Mixup Alpha | Dropout Rate |
|---------------|------------|----------------------|-------------|-------------|
| Stage 1 | 128 | 5 | 0.0 | 0.1 |
| Stage 2 | 212 | 10 | 0.07 | 0.2 |
| Stage 3 | 296 | 15 | 0.13 | 0.3 |
| Stage 4 | 380 | 20 | 0.2 | 0.4 |

This progressive schedule dramatically reduces training time while maintaining or improving final accuracy.
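
The per-stage settings can be generated rather than hand-written. The sketch below assumes each value is linearly interpolated between its Stage 1 and Stage 4 setting; the function name `progressive_schedule` is illustrative, and only image size, RandAugment magnitude, and dropout are shown.

```python
def progressive_schedule(start, end, num_stages=4):
    """Linearly interpolate a hyperparameter from `start` to `end`."""
    return [start + (end - start) * i / (num_stages - 1)
            for i in range(num_stages)]


# Endpoints are the Stage 1 and Stage 4 values for EfficientNetV2-M.
image_sizes = [round(s) for s in progressive_schedule(128, 380)]
randaug = [round(m) for m in progressive_schedule(5, 20)]
dropout = [round(p, 2) for p in progressive_schedule(0.1, 0.4)]

if __name__ == "__main__":
    for stage, (size, mag, drop) in enumerate(
            zip(image_sizes, randaug, dropout), start=1):
        print(f"Stage {stage}: image size {size}, "
              f"RandAugment {mag}, dropout {drop}")
```

Under this assumption the image size grows by 84 pixels per stage, from 128 to 380, while the regularization settings rise in step with it.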

### Architecture Differences from V1

Beyond Fused-MBConv, EfficientNetV2 incorporated several other architectural changes discovered through training-aware NAS:[2]

- **Smaller expansion ratios:** V2 uses smaller expansion ratios in MBConv blocks (typically 4 instead of 6), which reduces memory access overhead.
- **Preference for 3x3 kernels:** V2 favors 3x3 kernels over the 5x5 kernels found in V1, compensating by adding more layers.
- **Non-uniform scaling:** Unlike V1's uniform compound scaling, V2 uses a non-uniform scaling strategy that adds more layers to later stages, which are more efficient to scale.

### EfficientNetV2 Variants and Performance

| Model | Parameters | FLOPs | Top-1 Accuracy (ImageNet) | Training Time (32 TPUv3 cores) |
|-------|------------|-------|---------------------------|-------------------------------|
| EfficientNetV2-S | 22M | 8.8B | 83.9% | ~7h |
| EfficientNetV2-M | 54M | 24B | 85.1% | ~13h |
| EfficientNetV2-L | 120M | 53B | 85.7% | ~24h |

For comparison, EfficientNet-B7 (V1) achieves 84.3% accuracy but requires approximately 139 hours of training on the same hardware. EfficientNetV2-M matches B7's accuracy while training approximately 11x faster and using fewer parameters.[2]

When pretrained on ImageNet-21K (a larger dataset with 21,000 classes and approximately 14 million images), the V2 models achieve even higher accuracy:[2]

- EfficientNetV2-L: 86.8% top-1 on ImageNet
- EfficientNetV2-XL: 87.3% top-1 on ImageNet

The EfficientNetV2-XL result of 87.3% top-1 accuracy on ImageNet ILSVRC2012 outperformed the [Vision Transformer](/wiki/vision_transformer) (ViT-L/16) by 2.0 percentage points. The authors report that it did so while training "5x-11x faster" than ViT "using the same computing resources."[2]

## Noisy Student Training

In late 2019, researchers at Google Brain published "Self-training with Noisy Student improves ImageNet classification" (Xie et al., CVPR 2020), which used EfficientNet as its backbone.[6] This semi-supervised learning approach pushed EfficientNet's performance well beyond its supervised training results.

The Noisy Student method works as follows:[6]

1. Train an EfficientNet model (the "teacher") on labeled ImageNet data.
2. Use the teacher to generate pseudo-labels for 300 million unlabeled images.
3. Train a larger EfficientNet (the "student") on the combined labeled and pseudo-labeled data, with added noise such as [dropout](/wiki/dropout), stochastic depth, and [data augmentation](/wiki/data_augmentation).
4. Iterate the process, using the student as the new teacher.
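
The control flow of these four steps can be sketched with a toy model. Everything here is a stand-in chosen for brevity: a 1-D nearest-centroid classifier replaces EfficientNet, Gaussian input jitter replaces dropout, stochastic depth, and data augmentation, and the student is the same size as the teacher rather than larger. All function names are hypothetical.

```python
import random


def fit_centroids(points, labels):
    """Train a toy 1-D nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def noisy_student(labeled_x, labeled_y, unlabeled_x,
                  rounds=3, noise=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on labeled data only.
    teacher = fit_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        # Step 2: the teacher pseudo-labels the unlabeled data.
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        # Step 3: train the student on labeled + pseudo-labeled data,
        # with noise injected into the inputs (toy stand-in for real noise).
        xs = [x + rng.gauss(0.0, noise) for x in labeled_x + unlabeled_x]
        student = fit_centroids(xs, labeled_y + pseudo_y)
        # Step 4: the student becomes the next teacher.
        teacher = student
    return teacher


if __name__ == "__main__":
    model = noisy_student([0.0, 1.0], [0, 1],
                          [0.1, 0.2, -0.1, 0.8, 0.9, 1.1])
    print(predict(model, -1.0), predict(model, 2.0))
```

The point of the sketch is the loop structure: pseudo-labels come from a noise-free teacher, while noise is applied only when the student is trained.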

Using an EfficientNet-L2 architecture (a scaled-up variant larger than B7), Noisy Student training achieved **88.4% top-1 accuracy** on ImageNet, which the authors note is "2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images," and a major improvement over the supervised-only result of 84.3% for B7.[6] The approach also significantly improved robustness: top-1 accuracy on ImageNet-A (a dataset of adversarially filtered natural images) jumped from 61.0% to 83.7%.[6]

## Transfer Learning and Use as a Backbone

EfficientNet has become one of the most popular backbone networks for [transfer learning](/wiki/transfer_learning) across a wide range of computer vision tasks.[1] Its combination of high accuracy and low computational cost makes it particularly attractive for applications where resources are constrained. On transfer-learning benchmarks reported in the original paper, EfficientNet reached 91.7% accuracy on CIFAR-100 and 98.8% on the Oxford Flowers dataset, matching or exceeding prior state-of-the-art results with far fewer parameters.[1]

### Image Classification

Pretrained EfficientNet models are widely used as feature extractors for custom image classification tasks. The standard approach involves replacing the final fully connected layer with a new classifier head and [fine-tuning](/wiki/fine_tuning) the network on a domain-specific dataset. EfficientNet models pretrained on ImageNet transfer well to medical imaging, satellite imagery, wildlife monitoring, and many other domains.[1]

The B0 through B4 variants are especially popular for transfer learning because they offer strong accuracy at manageable computational costs. For resource-constrained settings such as mobile or edge deployment, B0 and B1 provide an excellent accuracy-to-cost ratio.

### Object Detection: EfficientDet

Mingxing Tan, Ruoming Pang, and Quoc Le extended the compound scaling concept to [object detection](/wiki/object_detection) with EfficientDet (Tan et al., CVPR 2020).[7] EfficientDet uses EfficientNet as its backbone and introduces two additional innovations:[7]

- **BiFPN (Bidirectional Feature Pyramid Network):** A weighted bi-directional feature pyramid that performs efficient multi-scale feature fusion.
- **Compound scaling for detection:** The compound coefficient scales not only the backbone but also the feature pyramid, the prediction heads, and the input resolution.

EfficientDet achieved state-of-the-art results on COCO object detection. At comparable accuracy, the authors report that EfficientDet variants use 28x fewer FLOPs than YOLOv3, 30x fewer FLOPs than RetinaNet, and 19x fewer FLOPs than NAS-FPN with a ResNet backbone.[7]

### Semantic Segmentation and Other Tasks

EfficientNet backbones have been integrated into numerous segmentation frameworks, including [U-Net](/wiki/unet) and DeepLab variants. They have also been used for tasks such as action recognition, face recognition, medical image analysis, and generative models. The architecture's modularity and availability in major frameworks (TensorFlow, [PyTorch](/wiki/pytorch), Keras) have contributed to its widespread adoption.

## Implementation and Framework Support

EfficientNet is available in all major deep learning frameworks:

- **TensorFlow / Keras:** Official implementations are included in `tf.keras.applications` with pretrained weights for B0 through B7 and V2-S, V2-M, V2-L.
- **PyTorch:** Both `torchvision.models` and the popular `timm` (PyTorch Image Models) library by Ross Wightman provide EfficientNet implementations with pretrained weights.
- **[ONNX](/wiki/onnx):** Models can be exported to ONNX format for deployment across various inference engines.

Google also released the official [TensorFlow](/wiki/tensorflow) implementation and pretrained checkpoints through the TensorFlow Model Garden and the `automl` repository on GitHub.[9]

## Strengths and Limitations

### Strengths

- **Parameter efficiency:** EfficientNet models consistently achieve higher accuracy per parameter than competing architectures. B0 matches or exceeds ResNet-50's accuracy with roughly one-fifth the parameters.[1]
- **Scalability:** The compound scaling method provides a principled way to trade off accuracy and computational cost, making it straightforward to select the right model for a given hardware budget.
- **Transfer learning performance:** The features learned by EfficientNet transfer well to downstream tasks, often outperforming larger models when fine-tuned on small datasets.
- **Modular design:** The MBConv building blocks are well-understood and can be incorporated into other architectures.

### Limitations

- **Training speed (V1):** The original EfficientNet models, particularly the larger variants, are slow to train due to high input resolutions and the use of depthwise convolutions that underutilize GPU hardware. EfficientNetV2 addressed this issue.[2]
- **Hardware utilization:** Depthwise separable convolutions have low arithmetic intensity, which means they do not fully exploit the parallelism of modern accelerators. This is less of a concern for inference on edge devices but can be a bottleneck for GPU-based training.[2]
- **Competition from [transformers](/wiki/transformer):** Since 2020, [Vision Transformers](/wiki/vision_transformer) (ViT) and their variants have achieved competitive or superior results on many benchmarks. However, EfficientNet remains competitive at smaller scales and in transfer learning settings, and hybrid architectures that combine convolutional and transformer components have drawn from both traditions.

## Summary of All EfficientNet Variants

The following table provides a consolidated overview of all major EfficientNet variants, including both V1 and V2 families.

| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|-------|-----------------|------------|-------|---------------------------|
| EfficientNet-B0 | 224 x 224 | 5.3M | 0.39B | 77.1% |
| EfficientNet-B1 | 240 x 240 | 7.8M | 0.70B | 79.1% |
| EfficientNet-B2 | 260 x 260 | 9.2M | 1.0B | 80.1% |
| EfficientNet-B3 | 300 x 300 | 12M | 1.8B | 81.6% |
| EfficientNet-B4 | 380 x 380 | 19M | 4.2B | 82.9% |
| EfficientNet-B5 | 456 x 456 | 30M | 9.9B | 83.6% |
| EfficientNet-B6 | 528 x 528 | 43M | 19B | 84.0% |
| EfficientNet-B7 | 600 x 600 | 66M | 37B | 84.3% |
| EfficientNetV2-S | 384 x 384 | 22M | 8.8B | 83.9% |
| EfficientNetV2-M | 480 x 480 | 54M | 24B | 85.1% |
| EfficientNetV2-L | 480 x 480 | 120M | 53B | 85.7% |

EfficientNetV2-S achieves nearly the same accuracy as EfficientNet-B7 with roughly one-third the parameters and one-quarter the FLOPs, illustrating the substantial improvements in the V2 design.[2]

## Legacy and Influence

EfficientNet's influence on the deep learning field extends well beyond its ImageNet benchmarks. The compound scaling method introduced a principled framework for thinking about model scaling that has been adopted and adapted by subsequent work in [natural language processing](/wiki/natural_language_processing), speech recognition, and other domains.

The architecture also played a key role in demonstrating the power of [neural architecture search](/wiki/neural_architecture_search) for producing practical, deployable models. While earlier NAS results like NASNet and AmoebaNet were often too large for practical use, EfficientNet showed that NAS could produce compact, efficient architectures that outperformed hand-designed models across a range of computational budgets.[1]

In the context of the broader shift toward [transformer](/wiki/transformer)-based architectures in computer vision, EfficientNet remains relevant in several ways. Many practitioners continue to prefer convolutional architectures for smaller datasets, mobile deployment, and real-time applications. Hybrid architectures such as CoAtNet combine convolutional stages (including MBConv blocks) with transformer stages, drawing directly on EfficientNet's design principles. The compound scaling idea has also been applied to transformer-based models.

As of 2025, EfficientNet models remain among the most downloaded and widely used pretrained models in the [PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow) ecosystems, particularly for transfer learning applications where their combination of accuracy, efficiency, and ease of use continues to offer practical value.

## References

1. Tan, M., & Le, Q. V. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." *Proceedings of the 36th International Conference on Machine Learning (ICML)*, pp. 6105-6114. [arXiv:1905.11946](https://arxiv.org/abs/1905.11946)

2. Tan, M., & Le, Q. V. (2021). "EfficientNetV2: Smaller Models and Faster Training." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. [arXiv:2104.00298](https://arxiv.org/abs/2104.00298)

3. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). "MnasNet: Platform-Aware Neural Architecture Search for Mobile." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:1807.11626](https://arxiv.org/abs/1807.11626)

4. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). "MobileNetV2: Inverted Residuals and Linear Bottlenecks." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.

5. Hu, J., Shen, L., & Sun, G. (2018). "Squeeze-and-Excitation Networks." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.

6. Xie, Q., Luong, M. T., Hovy, E., & Le, Q. V. (2020). "Self-training with Noisy Student improves ImageNet classification." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:1911.04252](https://arxiv.org/abs/1911.04252)

7. Tan, M., Pang, R., & Le, Q. V. (2020). "EfficientDet: Scalable and Efficient Object Detection." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. [arXiv:1911.09070](https://arxiv.org/abs/1911.09070)

8. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). "Searching for Activation Functions." [arXiv:1710.05941](https://arxiv.org/abs/1710.05941)

9. Google Research Blog. "EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling." [https://research.google/blog/efficientnet-improving-accuracy-and-efficiency-through-automl-and-model-scaling/](https://research.google/blog/efficientnet-improving-accuracy-and-efficiency-through-automl-and-model-scaling/)

10. Semantic Scholar. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (citation count). [https://www.semanticscholar.org/paper/4f2eda8077dc7a69bb2b4e0a1a086cf054adb3f9](https://www.semanticscholar.org/paper/4f2eda8077dc7a69bb2b4e0a1a086cf054adb3f9)