Inception is a family of convolutional neural network (CNN) architectures developed by researchers at Google, first introduced in 2014. The original architecture, known as GoogLeNet (also called Inception v1), won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 with a top-5 error rate of 6.67%. The Inception family is recognized for its efficient use of computational resources, achieving high classification accuracy with far fewer parameters than competing architectures of the same era. Over multiple iterations (v1 through v4, plus Inception-ResNet variants), the architecture introduced several ideas that have had a lasting impact on deep learning and computer vision, including the inception module, aggressive dimensionality reduction with 1x1 convolutions, factorized convolutions, batch normalization, label smoothing, and the combination of inception modules with residual connections.
In the early 2010s, progress in image classification was driven largely by making neural networks deeper and wider. AlexNet (2012) demonstrated that deep CNNs could dramatically outperform traditional computer vision methods on large-scale benchmarks, and subsequent work such as VGGNet (2014) pushed accuracy further by stacking many layers of small 3x3 convolutions. However, simply increasing network size came with significant drawbacks: more parameters meant higher computational cost, greater memory usage, and a higher risk of overfitting, especially when labeled training data was limited.
The Inception architecture was developed to address this tension. Rather than choosing a single filter size for each layer, the authors proposed using multiple filter sizes in parallel and letting the network learn which combinations of features were most useful. The goal was to increase the depth and width of a network while keeping the computational budget roughly constant. This approach drew on theoretical work by Arora et al. (2014), which suggested that optimal sparse network structures could be approximated by dense components operating at different scales.
The name "GoogLeNet" is a tribute to Yann LeCun's pioneering LeNet-5 architecture (1998), one of the earliest successful CNNs. The capitalization pattern (the capital "L" in the middle) makes this homage explicit. The word "Inception" was chosen as a reference to the internet meme "we need to go deeper," derived from Christopher Nolan's 2010 film Inception. This fit the theme of building networks with greater depth, as the architecture stacked multiple inception modules to create a much deeper network than its predecessors.
Inception v1 was introduced in the paper "Going Deeper with Convolutions" by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. The paper was published in 2014 and presented at CVPR 2015.
The core building block of GoogLeNet is the inception module. Instead of applying a single convolution filter at each layer, the inception module processes the input through four parallel branches simultaneously:

- a 1x1 convolution;
- a 3x3 convolution;
- a 5x5 convolution;
- a 3x3 max pooling operation.
The outputs of all four branches are concatenated along the channel (depth) dimension, producing a single output tensor that is passed to the next layer. By operating at multiple scales in parallel, the network can capture both fine and coarse features from the same input.
The original (naive) version of the inception module simply applied all convolutions directly to the input. This was computationally expensive because 3x3 and 5x5 convolutions on high-dimensional inputs produce a very large number of operations. To address this, the authors introduced 1x1 convolutions as bottleneck layers before the more expensive 3x3 and 5x5 convolutions. These 1x1 convolutions reduce the number of input channels (dimensionality reduction), greatly decreasing computational cost without sacrificing representational power.
This idea was inspired by the "Network in Network" approach proposed by Min Lin, Qiang Chen, and Shuicheng Yan in 2013, which showed that 1x1 convolutions could serve as learned linear projections across channels.
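As an illustration, a dimension-reduced inception module can be sketched in PyTorch. The filter counts below follow the first inception module ("3a") of GoogLeNet; they are configurable, and any other counts work the same way.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Dimension-reduced inception module (sketch).

    Default filter counts follow GoogLeNet's "3a" module:
    192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
    """

    def __init__(self, in_ch=192, c1=64, c3_red=96, c3=128,
                 c5_red=16, c5=32, pool_proj=32):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 bottleneck, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 bottleneck, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)   # a batch of one 28x28 feature map
y = InceptionModule()(x)
print(y.shape)                    # torch.Size([1, 256, 28, 28])
```

Note that the spatial resolution is unchanged by every branch (the pooling branch uses stride 1), so the four outputs can be concatenated channel-wise.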
As a concrete example from the paper: for a 28x28 input with 192 channels fed into 96 filters of size 3x3, the naive approach requires 165,888 parameters. With a 1x1 reduction layer that first compresses the input to 64 channels, the total drops to 67,584 parameters, a reduction of roughly 59%.
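The arithmetic behind these figures can be checked directly (biases ignored; the 64-channel reduction width is inferred from the quoted totals):

```python
# Naive: 3x3 convolution applied directly to 192 input channels, 96 filters.
naive = 192 * 3 * 3 * 96            # 165,888 parameters

# Bottlenecked: 1x1 reduction to 64 channels, then the 3x3 convolution.
reduce = 192 * 1 * 1 * 64           # 12,288 parameters for the 1x1 layer
conv = 64 * 3 * 3 * 96              # 55,296 parameters for the 3x3 layer
bottlenecked = reduce + conv        # 67,584 parameters

print(naive, bottlenecked, round(1 - bottlenecked / naive, 3))  # 165888 67584 0.593
```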
GoogLeNet consists of 22 layers with parameters (27 layers if pooling layers are counted). The architecture includes:

- a stem of conventional convolution and max pooling layers that rapidly reduces the spatial resolution of the 224x224 input;
- nine inception modules stacked on top of one another, interspersed with max pooling layers;
- global average pooling followed by dropout (40%) and a linear layer with softmax output over the 1,000 ImageNet classes;
- two auxiliary classifiers attached at intermediate layers, used only during training.
The use of global average pooling instead of large fully connected layers was a significant design choice. By averaging each feature map from 7x7 down to 1x1, the architecture avoided the massive parameter overhead associated with fully connected layers. The authors reported that this change improved top-1 accuracy by approximately 0.6% compared to using fully connected layers.
Because GoogLeNet is deep, it was susceptible to the vanishing gradient problem, where gradients become very small as they propagate backward through many layers during training. To combat this, the authors added two auxiliary classifiers at intermediate points in the network (after inception module 4a and inception module 4d).
Each auxiliary classifier consists of:

- an average pooling layer with a 5x5 filter and stride 3;
- a 1x1 convolution with 128 filters (with ReLU activation);
- a fully connected layer with 1,024 units (with ReLU activation);
- dropout with a 70% drop rate;
- a linear layer with softmax output over the 1,000 classes.
During training, the loss from each auxiliary classifier was added to the total loss with a weight of 0.3. At inference time, the auxiliary classifiers were discarded entirely. Later work on Inception v3 found that auxiliary classifiers acted more as regularizers than as gradient conduits; the authors observed that networks with auxiliary branches did not converge faster early in training but eventually reached slightly higher accuracy.
GoogLeNet contained approximately 6.8 million parameters (6,998,552 to be exact, excluding auxiliary classifier parameters). This was roughly 9 times fewer than AlexNet (approximately 61 million parameters) and about 20 times fewer than VGG-16 (approximately 138 million parameters). Despite this dramatic reduction, GoogLeNet outperformed both architectures on ImageNet.
GoogLeNet won first place in the ILSVRC 2014 classification task with a top-5 error rate of 6.67%, a major improvement over the previous year's winner, ZFNet (11.7% error). VGGNet, the runner-up in the same competition, achieved 7.3% top-5 error but required roughly 20 times more parameters.
On standard ImageNet validation benchmarks, GoogLeNet achieved 68.7% top-1 accuracy and 88.9% top-5 accuracy in single-crop evaluation.
Inception v2 was introduced in the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" by Sergey Ioffe and Christian Szegedy, published in 2015 (ICML 2015). This version is sometimes called BN-Inception because its primary contribution was applying batch normalization to the Inception architecture.
The key innovation of Inception v2 was the addition of batch normalization layers throughout the network. Batch normalization normalizes the activations within each mini-batch during training, reducing what the authors termed "internal covariate shift." This technique provided several practical benefits:

- much faster training: the authors matched the accuracy of the baseline Inception network using roughly 14 times fewer training steps;
- tolerance of much higher learning rates;
- reduced sensitivity to weight initialization;
- a regularizing effect that reduced, and in some configurations eliminated, the need for dropout.
Beyond adding batch normalization, Inception v2 replaced the 5x5 convolutional layers in the inception modules with two consecutive 3x3 convolutional layers. This factorization reduces the parameter count (from 25 parameters per filter position for a 5x5 kernel to 18 parameters for two stacked 3x3 kernels) while maintaining the same effective receptive field.
BN-Inception contained approximately 13.6 million parameters and achieved a top-5 validation error of 4.9% on ImageNet (4.82% on the test set). This surpassed the estimated human-level performance of about 5.1% top-5 error, a milestone first claimed only days earlier by PReLU-Net (He et al., 2015); ResNet would lower the error further at the end of that year. On ImageNet validation, BN-Inception reached approximately 73.5% top-1 accuracy.
Inception v3 was introduced in the paper "Rethinking the Inception Architecture for Computer Vision" by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. The paper was submitted in December 2015 and published at CVPR 2016. Inception v3 placed as first runner-up in the ILSVRC 2015 image classification challenge, behind ResNet.
Inception v3 introduced a set of general design principles for factorizing convolutions to reduce computational cost:
Factorizing 5x5 convolutions into two 3x3 convolutions. A single 5x5 convolution has 25 multiply-add operations per output position. Two stacked 3x3 convolutions cover the same receptive field with only 18 operations, a 28% reduction in computation.
Factorizing nxn convolutions into asymmetric 1xn and nx1 convolutions. A 3x3 convolution can be decomposed into a 1x3 convolution followed by a 3x1 convolution (or vice versa), reducing computation by 33% compared to a single 3x3 convolution. The architecture also employs asymmetric convolutions at larger scales, such as 1x7 and 7x1, to capture elongated spatial patterns efficiently. The authors found that asymmetric factorization worked best on medium-sized feature maps (between 12x12 and 20x20 in spatial resolution).
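Both factorizations can be verified by counting weights per output position:

```python
# 5x5 -> two stacked 3x3 convolutions (same 5x5 effective receptive field)
full_5x5 = 5 * 5                          # 25 weights per output position
two_3x3 = 2 * (3 * 3)                     # 18 weights per output position
print(round(1 - two_3x3 / full_5x5, 2))   # 0.28 -> 28% cheaper

# 3x3 -> 1x3 followed by 3x1 (asymmetric factorization)
full_3x3 = 3 * 3                          # 9 weights
asym = 1 * 3 + 3 * 1                      # 6 weights
print(round(1 - asym / full_3x3, 2))      # 0.33 -> 33% cheaper
```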
Factorizing the initial 7x7 convolution. The first convolution layer from the original GoogLeNet (a 7x7 filter) was factorized into a sequence of three 3x3 convolutions, further reducing parameters and computation in the network stem.
Traditionally, pooling operations are used to reduce the spatial dimensions of feature maps. However, pooling before convolution can create a representational bottleneck, while convolving before pooling is computationally expensive. Inception v3 introduced an efficient grid size reduction technique that uses parallel stride-2 convolution and stride-2 pooling branches, concatenating their outputs. This reduces spatial dimensions without losing information or wasting computation.
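A PyTorch sketch of the idea, with illustrative channel counts (a 35x35 grid with 320 channels reduced to a 17x17 grid, as in one of the paper's reduction examples):

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Parallel stride-2 convolution and stride-2 pooling, concatenated (sketch)."""

    def __init__(self, in_ch, conv_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2)  # learned reduction
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)               # parameter-free reduction

    def forward(self, x):
        # Both branches halve the spatial grid; their channels are concatenated.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 320, 35, 35)
y = GridReduction(320, 320)(x)
print(y.shape)  # torch.Size([1, 640, 17, 17])
```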
Inception v3 introduced label smoothing, a regularization technique that prevents the model from becoming overconfident in its predictions. Instead of using hard one-hot labels (where the correct class has probability 1.0 and all others have 0.0), label smoothing distributes a small portion of the probability mass uniformly across all classes. With a smoothing parameter of 0.1 on ImageNet, a fraction 0.9 of the probability mass stays on the correct class and the remaining 0.1 is spread evenly across all 1,000 classes, so the correct class receives a target of 0.9001 and every other class 0.0001. This encourages the model to generalize better. Adding label smoothing reduced top-1 error from 23.1% to 22.8% in the authors' ablation study.
Inception v3 also adopted the RMSProp optimizer (instead of SGD with momentum used in earlier versions) and adjusted the training of auxiliary classifiers. The authors found that batch-normalizing the auxiliary classifier outputs helped improve the final model's accuracy.
Inception v3 takes 299x299 pixel input images (up from 224x224 in v1) and contains approximately 23.9 million parameters (23,885,392). It requires about 5 billion multiply-add operations per forward pass.
Single-model, single-crop evaluation yielded a top-1 error of 21.2% and top-5 error of 5.6% on the ILSVRC 2012 validation set. With an ensemble of four models and 144-crop evaluation, Inception v3 achieved 3.58% top-5 error on the validation set and 3.5% on the test set.
Inception v4 and the Inception-ResNet variants were introduced in the paper "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning" by Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. The paper was first released in February 2016 and published at the AAAI Conference on Artificial Intelligence in 2017.
Inception v4 is a pure (non-residual) Inception architecture that was designed with a simplified and more uniform structure compared to previous versions. Earlier Inception designs had been somewhat constrained by the need to maintain compatibility with the DistBelief training framework. With the move to TensorFlow, the authors were free to redesign the architecture without those constraints.
Inception v4 features a new, streamlined stem module and three types of inception blocks (Inception-A, Inception-B, and Inception-C), each tailored to different spatial resolutions within the network. The architecture also uses dedicated reduction blocks (Reduction-A and Reduction-B) to decrease spatial dimensions between stages.
Inception v4 contains approximately 42.7 million parameters.
The Inception-ResNet variants combine inception modules with residual connections (skip connections), a technique introduced by He et al. in ResNet (2015). In these hybrid architectures, the output of an inception block is added to its input through a shortcut connection, rather than simply being concatenated as in standard Inception modules. This allows the network to learn residual functions, which can ease training of very deep networks.
Two versions were introduced:
The authors discovered that when the number of filters in residual Inception blocks exceeded approximately 1,000, the training process could become unstable, with activations occasionally "dying" (collapsing to zero) and never recovering, even with very low learning rates. To solve this, they introduced residual scaling: the output of each residual branch is multiplied by a small constant (typically between 0.1 and 0.3) before being added back to the main path. This simple technique stabilized training and allowed the use of very wide residual Inception networks.
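In code, residual scaling is a one-line change to the residual update. Below is a numpy sketch with a hypothetical branch function standing in for the inception block:

```python
import numpy as np

def residual_block(x, branch, scale=0.2):
    """Residual update with activation scaling: out = x + scale * branch(x).

    scale in [0.1, 0.3] is the range reported to stabilize training;
    scale = 1.0 recovers the standard (unscaled) residual connection.
    """
    return x + scale * branch(x)

# A hypothetical branch whose activations are large relative to its input.
branch = lambda v: 10.0 * v

x = np.ones(4)
print(residual_block(x, branch, scale=0.2))  # [3. 3. 3. 3.]
print(residual_block(x, branch, scale=1.0))  # [11. 11. 11. 11.]  (unscaled)
```

The scaled update keeps the magnitude of the residual contribution small, which is what prevented the activation collapse the authors observed in very wide residual Inception networks.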
The following table summarizes single-crop, single-model evaluation results on the ILSVRC 2012 validation set:
| Network | Top-1 Error | Top-5 Error |
|---|---|---|
| Inception v3 | 21.2% | 5.6% |
| Inception-ResNet-v1 | 21.3% | 5.5% |
| Inception v4 | 20.0% | 5.0% |
| Inception-ResNet-v2 | 19.9% | 4.9% |
With 12-crop evaluation on the full 50,000 validation images:
| Network | Top-1 Error | Top-5 Error |
|---|---|---|
| Inception v3 | 19.8% | 4.6% |
| Inception-ResNet-v1 | 19.8% | 4.6% |
| Inception v4 | 18.7% | 4.2% |
| Inception-ResNet-v2 | 18.7% | 4.1% |
An ensemble of one Inception-v4 model and three Inception-ResNet-v2 models achieved 3.08% top-5 error on the ImageNet test set, which at the time represented one of the best published results.
The residual versions of Inception trained significantly faster than their non-residual counterparts, though their final accuracies were comparable. The authors took this as evidence that residual connections primarily accelerate convergence rather than enabling fundamentally better representations.
The table below summarizes the key characteristics and performance of each Inception variant:
| Version | Year | Paper | Parameters | Input Size | Top-1 Accuracy (approx.) | Top-5 Error (single crop) | Key Innovations |
|---|---|---|---|---|---|---|---|
| Inception v1 (GoogLeNet) | 2014 | Going Deeper with Convolutions | ~6.8M | 224x224 | 68.7% | 6.67% (competition) | Inception module, 1x1 bottleneck convolutions, auxiliary classifiers, global average pooling |
| Inception v2 (BN-Inception) | 2015 | Batch Normalization | ~13.6M | 224x224 | ~73.5% | ~4.9% | Batch normalization, 5x5 replaced by two 3x3 convolutions |
| Inception v3 | 2015 | Rethinking the Inception Architecture | ~23.9M | 299x299 | 78.8% | 5.6% | Factorized convolutions (asymmetric 1xn/nx1), label smoothing, RMSProp, efficient grid reduction |
| Inception v4 | 2016 | Inception-v4, Inception-ResNet | ~42.7M | 299x299 | 80.0% | 5.0% | Simplified uniform architecture, new stem, three dedicated inception block types |
| Inception-ResNet-v1 | 2016 | Inception-v4, Inception-ResNet | ~(similar to v3) | 299x299 | 78.7% | 5.5% | Residual connections added to inception modules, activation scaling |
| Inception-ResNet-v2 | 2016 | Inception-v4, Inception-ResNet | ~55.8M | 299x299 | 80.1% | 4.9% | Larger residual inception blocks, activation scaling, faster convergence |
The Inception family emerged alongside several other influential CNN architectures in the 2012 to 2016 period. The following table provides a comparison:
| Architecture | Year | Parameters | Layers (with params) | ILSVRC Top-5 Error | Key Design Philosophy |
|---|---|---|---|---|---|
| AlexNet | 2012 | ~61M | 8 | 15.3% (ILSVRC 2012) | First large-scale deep CNN; ReLU, dropout, data augmentation |
| VGG-16 | 2014 | ~138M | 16 | 7.3% (ILSVRC 2014) | Uniform 3x3 convolutions stacked deeply; simple but expensive |
| GoogLeNet (Inception v1) | 2014 | ~6.8M | 22 | 6.67% (ILSVRC 2014) | Multi-scale parallel convolutions; 1x1 bottlenecks; parameter efficiency |
| ResNet-50 | 2015 | ~25.6M | 50 | ~6.7% (single model) | Residual (skip) connections; enabled very deep training (up to 152 layers) |
| ResNet-152 | 2015 | ~60M | 152 | 3.57% (ILSVRC 2015 ensemble) | Deepest ResNet variant; ILSVRC 2015 winner |
| Inception v3 | 2015 | ~23.9M | ~48 conv layers | 5.6% (single crop) | Factorized convolutions; label smoothing; 1st runner-up ILSVRC 2015 |
| Inception v4 | 2016 | ~42.7M | ~(deeper) | 5.0% (single crop) | Streamlined Inception blocks; no residual constraints |
| Inception-ResNet-v2 | 2016 | ~55.8M | ~(deeper) | 4.9% (single crop) | Hybrid inception + residual; activation scaling |
One of the most striking aspects of GoogLeNet was its parameter efficiency. With only 6.8 million parameters, it significantly outperformed VGG-16 (138 million parameters) on the same benchmark, demonstrating that intelligent architectural design could be more effective than simply scaling up model size. VGG-16 used roughly 20 times more parameters while achieving a higher error rate. AlexNet, with about 61 million parameters, had an even higher error rate of 15.3%, which highlights the efficiency gains of the Inception approach.
Compared to ResNet, the Inception architectures offered comparable accuracy with a different design philosophy. While ResNet relied on residual connections to enable very deep (100+ layer) networks with relatively simple block designs, Inception focused on multi-scale feature extraction within each module. The Inception-ResNet variants eventually combined both approaches, showing that the two ideas were complementary rather than competing.
The use of 1x1 convolutions is one of the most important ideas in the Inception architecture. A 1x1 convolution operates across the channel dimension of a feature map without changing spatial dimensions. It functions as a learned linear combination of channels, allowing the network to:

- reduce (or expand) the number of channels before more expensive operations;
- add an extra nonlinearity, since each 1x1 convolution is followed by a ReLU activation;
- learn cross-channel interactions at negligible computational cost.
This concept was first introduced in the "Network in Network" paper by Lin et al. (2013) and was adopted extensively by the Inception architecture. In the naive inception module, applying 5x5 convolutions directly on high-dimensional feature maps would be prohibitively expensive. By first reducing channels with a 1x1 convolution, the computational cost drops dramatically while preserving the ability to learn complex features.
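Since a 1x1 convolution involves no spatial neighborhood, it is exactly a per-pixel matrix multiplication over the channel vector, as this numpy sketch demonstrates:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # HxWxC feature map
w = rng.standard_normal((64, 192))       # 64 output channels, 192 input channels

# 1x1 convolution: every pixel's channel vector is projected by the same matrix.
y = np.einsum('hwc,kc->hwk', x, w)

# Equivalent formulation: flatten pixels, apply one matrix multiply, reshape.
y2 = (x.reshape(-1, 192) @ w.T).reshape(28, 28, 64)
print(y.shape, np.allclose(y, y2))       # (28, 28, 64) True
```

This is also why the 1x1 bottleneck is so cheap: its cost is a single 64x192 matrix applied at each spatial position, with no spatial kernel.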
Instead of using one or more fully connected layers at the end of the network (as in AlexNet and VGG), GoogLeNet uses global average pooling. This operation takes the spatial average of each feature map, converting a tensor of shape HxWxC into a vector of length C. This approach has several advantages:

- it contributes no parameters, avoiding the tens of millions of weights that fully connected layers add in AlexNet and VGG;
- with fewer parameters, the network is less prone to overfitting;
- it is more robust to spatial translations of the input;
- it enforces a more direct correspondence between feature maps and output categories.
Global average pooling was originally proposed in the Network in Network paper and became standard practice in subsequent architectures including ResNet.
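As a sketch, global average pooling is a single reduction over the spatial axes; the final classifier then needs only C x 1000 weights rather than H x W x C x 1000. The 7x7x1024 shape below matches GoogLeNet's final feature maps:

```python
import numpy as np

feature_maps = np.arange(7 * 7 * 1024, dtype=np.float64).reshape(7, 7, 1024)

# Average each 7x7 feature map down to a single value: HxWxC -> C.
pooled = feature_maps.mean(axis=(0, 1))
print(pooled.shape)          # (1024,)

# A fully connected layer on the flattened tensor would instead need
# 7 * 7 * 1024 * 1000 weights for a 1000-class head.
print(7 * 7 * 1024 * 1000)   # 50176000
```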
Introduced alongside Inception v2, batch normalization normalizes the inputs to each layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation, then applying a learned scale and shift. This reduces the sensitivity of training to parameter initialization and learning rate choices. In practice, batch normalization made it possible to train Inception networks much faster and reach higher accuracy.
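A minimal numpy sketch of the training-time computation for a fully connected layer's activations (the convolutional variant, which shares statistics across spatial positions, and the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((64, 10))   # activations far from zero mean
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
# After normalization the per-feature statistics are ~(0, 1).
print(y.mean(axis=0).max(), y.std(axis=0).max())
```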
Label smoothing, introduced in Inception v3, is a regularization technique that softens the target probability distribution. Instead of training against hard 0/1 labels, the model trains against a mixture: a fraction (1 - epsilon) of the probability is assigned to the correct class, and epsilon is distributed uniformly across all classes. With epsilon = 0.1 and 1,000 classes, the correct class gets a target probability of 0.9001 and each incorrect class gets 0.0001. This discourages the model from producing extremely confident predictions and improves generalization.
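The smoothed target distribution can be computed directly (a plain-Python sketch; `smoothed_targets` is an illustrative helper, not a library function):

```python
def smoothed_targets(correct_class, num_classes=1000, eps=0.1):
    """Label-smoothed target: (1 - eps) on the true class plus eps spread uniformly."""
    uniform = eps / num_classes                     # 0.0001 for eps=0.1, 1000 classes
    targets = [uniform] * num_classes
    targets[correct_class] = (1.0 - eps) + uniform  # 0.9001 for the true class
    return targets

t = smoothed_targets(42)
print(round(t[42], 4), round(t[0], 6), round(sum(t), 6))  # 0.9001 0.0001 1.0
```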
The Inception family of architectures introduced several ideas that became standard practice in deep learning:
Multi-scale feature extraction. The concept of processing inputs at multiple scales in parallel (using different filter sizes) influenced subsequent architectures and contributed to the broader understanding that networks benefit from capturing features at different spatial resolutions.
Efficient architecture design. GoogLeNet demonstrated that careful architectural choices (1x1 bottlenecks, global average pooling) could achieve state-of-the-art accuracy with a fraction of the parameters used by simpler designs like VGG. This philosophy of parameter efficiency carried forward into architectures like MobileNet and EfficientNet.
Xception. In 2017, Francois Chollet (the creator of Keras) published the Xception architecture, which took the Inception hypothesis to its logical extreme. Chollet observed that the Inception module's parallel branches approximated a partial separation of cross-channel and spatial correlations. Xception replaced standard inception modules with depthwise separable convolutions, which fully separate cross-channel and spatial processing. With the same number of parameters as Inception v3, Xception achieved higher accuracy, validating the underlying intuition behind the Inception design.
Neural Architecture Search. The success of hand-designed Inception modules inspired automated methods for discovering optimal network architectures. NASNet (Zoph et al., 2018), which used reinforcement learning to search for optimal cell structures, produced modules that bore a resemblance to Inception-style multi-branch designs. EfficientNet (Tan and Le, 2019) built on this line of work, using compound scaling to balance network depth, width, and resolution.
Batch normalization. While batch normalization was introduced in the context of Inception v2, it quickly became one of the most widely adopted techniques in deep learning, used in virtually every modern architecture.
Label smoothing. Originally introduced as a minor regularization trick in Inception v3, label smoothing has been widely adopted in training modern large-scale models, including Transformer-based architectures for both vision and natural language processing.
Pre-trained Inception models are available in all major deep learning frameworks:
- Keras / TensorFlow: `keras.applications.InceptionV3` and `keras.applications.InceptionResNetV2`
- PyTorch: `torchvision.models.inception_v3` and `torchvision.models.googlenet`
- the `timm` library, which provides additional variants such as Inception v4 and Inception-ResNet-v2

These pre-trained models are commonly used for transfer learning, where the learned features from ImageNet classification are fine-tuned for domain-specific tasks such as medical imaging, satellite imagery analysis, and industrial quality inspection.