VGG, also known as VGGNet, is a convolutional neural network architecture developed by Karen Simonyan and Andrew Zisserman at the Visual Geometry Group (VGG) of the University of Oxford. The architecture was introduced in the 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" and presented at the 3rd International Conference on Learning Representations (ICLR) in 2015. VGG demonstrated that simply increasing the depth of a network by stacking small 3x3 convolutional filters could significantly improve image classification accuracy, a finding that influenced the entire trajectory of deep learning research.
VGG achieved second place in the classification task and first place in the localization task at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. Despite being surpassed in raw classification accuracy by GoogLeNet at the same competition, VGG became one of the most widely adopted architectures in computer vision due to its simplicity, uniform design, and effectiveness as a feature extractor for transfer learning.
The Visual Geometry Group (VGG) is a research group within the Department of Engineering Science at the University of Oxford. It was founded by Andrew Zisserman, who serves as Professor of Computer Vision Engineering and is a Royal Society Research Professor. Zisserman is known internationally for his work on multiple view geometry, visual recognition, and large-scale retrieval in images and video. He is the only person to have received the Marr Prize three times (1993, 1998, and 2003) and was elected a Fellow of the Royal Society in 2007.
Karen Simonyan, who co-authored the VGG paper, received his PhD in computer vision from the University of Oxford in 2013 under the supervision of Zisserman. Simonyan went on to co-found Vision Factory, an Oxford University spin-off focused on improving visual recognition systems using deep learning. Vision Factory was acquired by Google DeepMind in October 2014. Simonyan later contributed to several notable projects at DeepMind, including WaveNet, AlphaGo Zero, and AlphaZero.
Before VGG, the winning architecture at ILSVRC 2012, AlexNet, had demonstrated the power of deep convolutional networks for image classification with 8 layers. The ZFNet (Zeiler and Fergus), which won ILSVRC 2013, made modest architectural adjustments but did not fundamentally explore the question of how network depth affects accuracy. Simonyan and Zisserman set out to investigate this question systematically by constructing networks of increasing depth while keeping all other design decisions as simple and uniform as possible.
The resulting paper was submitted to arXiv on September 4, 2014, and went through six revisions before the final version was published on April 10, 2015. It has since become one of the most cited papers in the deep learning literature.
The core idea behind VGG is architectural simplicity combined with increased depth. Rather than experimenting with different filter sizes, complex module designs, or varying layer configurations, the VGG architects chose to use exclusively 3x3 convolutional filters throughout the entire network. This stands in contrast to AlexNet, which used 11x11 filters in its first layer and 5x5 filters in its second layer.
The key insight is that a stack of two 3x3 convolutional layers (without spatial pooling in between) has an effective receptive field of 5x5, while a stack of three 3x3 layers has an effective receptive field of 7x7. Using smaller filters stacked in sequence provides two advantages over a single large filter:

- More non-linearity: each 3x3 layer is followed by a ReLU activation, so a stack of three layers applies three non-linear rectifications instead of one, making the decision function more discriminative.
- Fewer parameters: for C input and C output channels, three stacked 3x3 layers use 3 x (3 x 3 x C x C) = 27C^2 weights, whereas a single 7x7 layer uses 7 x 7 x C x C = 49C^2 weights.
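The parameter comparison can be checked with a few lines of Python; this is a back-of-the-envelope sketch, and the channel count C = 512 is just an example:

```python
# Weights in a stack of three 3x3 conv layers vs. a single 7x7 layer,
# assuming C input and C output channels throughout (biases ignored).
C = 512
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2
single_7x7 = 7 * 7 * C * C        # 49 * C^2
print(three_3x3, single_7x7)      # 7,077,888 vs 12,845,056
```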
All VGG configurations share the same overall structure:

- The input is a fixed-size 224 x 224 RGB image, from which the mean RGB value (computed on the training set) is subtracted.
- The image passes through a stack of 3x3 convolutional layers (stride 1, padding 1), each followed by a ReLU activation.
- Spatial downsampling is performed by five 2x2 max-pooling layers with stride 2, placed after groups of convolutional layers.
- The convolutional stack is followed by three fully connected layers: two with 4,096 channels each and a final one with 1,000 channels (one per ImageNet class).
- A final softmax layer produces the class probabilities.
The number of feature map channels starts at 64 in the first convolutional block and doubles after each max-pooling layer, reaching a maximum of 512.
The original paper defines six configurations of increasing depth, labeled A, A-LRN, B, C, D, and E, containing 11, 11, 13, 16, 16, and 19 weight layers respectively. The table below shows the layer structure of each configuration.
| Layer Block | A (VGG-11) | A-LRN | B (VGG-13) | C (VGG-16) | D (VGG-16) | E (VGG-19) |
|---|---|---|---|---|---|---|
| Block 1 | conv3-64 | conv3-64, LRN | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 2 | conv3-128 | conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 3 | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256, conv1-256 | conv3-256, conv3-256, conv3-256 | conv3-256, conv3-256, conv3-256, conv3-256 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 4 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 5 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| FC Layers | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 |
| Softmax | softmax | softmax | softmax | softmax | softmax | softmax |
| Weight layers | 11 | 11 | 13 | 16 | 16 | 19 |
All convolutional layers use 3x3 filters except for the 1x1 filters in Configuration C. Configuration A-LRN is identical to Configuration A but includes a Local Response Normalization layer after the first convolutional layer. Configurations D and E (commonly called VGG-16 and VGG-19) are the most widely used variants.
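Because the design is so uniform, an entire configuration can be generated from a short description. The sketch below builds the convolutional part of Configuration D (VGG-16) in the spirit of, but not identical to, the torchvision implementation; the configuration list and the `make_features` helper are illustrative names, not part of the original code:

```python
import torch
import torch.nn as nn

# "D" configuration: numbers are output channels of 3x3 convs, "M" is max-pooling.
cfg_d = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]

def make_features(cfg):
    layers, in_channels = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg_d)
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)   # torch.Size([1, 512, 7, 7])
```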
Despite the large depth, the number of parameters in VGG networks is dominated by the fully connected layers rather than the convolutional layers. The first fully connected layer alone, which takes the 7x7x512 output of the last convolutional block and maps it to 4,096 channels, contains 7 x 7 x 512 x 4,096 = 102,760,448 weights (102,764,544 parameters including biases). The three fully connected layers together account for approximately 123.6 million of the total parameters.
| Configuration | Common Name | Weight Layers | Parameters (Millions) |
|---|---|---|---|
| A | VGG-11 | 11 | ~133M |
| A-LRN | VGG-11 (LRN) | 11 | ~133M |
| B | VGG-13 | 13 | ~133M |
| C | VGG-16 (1x1 conv) | 16 | ~134M |
| D | VGG-16 | 16 | ~138M |
| E | VGG-19 | 19 | ~144M |
The relatively small difference in parameter count between configurations reflects the fact that most parameters reside in the fully connected layers, which are identical across all configurations. The convolutional layers, despite being the defining feature of each variant, contribute only a modest number of additional parameters as depth increases.
VGG networks are computationally expensive. VGG-16 requires approximately 15.5 billion floating point operations (FLOPs) for a single forward pass on a 224x224 input image, and VGG-19 requires approximately 19.6 billion FLOPs. The model weights for VGG-16 occupy about 528 MB of storage, and VGG-19 occupies about 549 MB.
By comparison, GoogLeNet achieves comparable or better accuracy with only about 1.5 billion FLOPs and 6.8 million parameters. This disparity in efficiency is one of the main limitations of the VGG architecture.
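The order of magnitude of the ~15.5 billion figure can be reproduced with a rough per-layer multiply-accumulate (MAC) count; the sketch below counts one multiply-add per operation, which is the convention behind the commonly quoted number:

```python
# Rough MAC count for VGG-16 on a 224x224 input.
convs = [  # (feature map side, in_channels, out_channels) for each 3x3 conv layer
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]
macs = sum(side * side * c_out * 3 * 3 * c_in for side, c_in, c_out in convs)
macs += 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000   # fully connected layers
print(macs)   # 15,470,264,320 -> ~15.5 billion multiply-adds
```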
The VGG networks were trained using mini-batch stochastic gradient descent (SGD) with the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Momentum | 0.9 |
| Weight decay | 5 x 10^-4 |
| Initial learning rate | 0.01 |
| Learning rate schedule | Decreased by factor of 10 when validation accuracy stopped improving |
| Number of LR decreases | 3 (during training) |
| Dropout | 0.5 (applied to first two FC layers) |
| Epochs | ~74 (370K iterations) |
Training was conducted on a system with four NVIDIA Titan Black GPUs and took approximately two to three weeks per network, depending on the configuration.
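For reference, the hyperparameters in the table map directly onto a modern framework. The sketch below uses PyTorch, which post-dates the original implementation (the paper's code was derived from the C++ Caffe toolbox), so it is an approximation of the setup rather than a reproduction:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vgg16

model = vgg16(weights=None)           # train from scratch, as in the paper
criterion = nn.CrossEntropyLoss()     # multinomial logistic regression objective
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)

# The paper lowered the learning rate by a factor of 10 whenever validation
# accuracy stopped improving (three times over roughly 74 epochs).
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)
# after each epoch: scheduler.step(validation_accuracy)
```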
Initializing very deep networks is challenging because poor initialization can stall learning due to instability of gradients. To address this, the authors used a staged training approach. They first trained Configuration A (the shallowest network, with 11 layers), which was shallow enough to be initialized with random weights drawn from a normal distribution with zero mean and 0.01 variance. The biases were initialized to zero.
Once Configuration A converged, its learned weights were used to initialize the first four convolutional layers and the three fully connected layers of the deeper configurations. The remaining layers, which had no corresponding layer in Configuration A, were initialized randomly. This approach allowed the deeper networks to begin training from a reasonable starting point rather than from scratch.
The authors later noted that it was possible to initialize weights using the procedure of Glorot and Bengio (2010) without pre-training, though they used the staged approach in their experiments.
The training images were augmented using the following techniques:

- Random cropping: a 224 x 224 crop was sampled from each rescaled training image at every SGD iteration.
- Random horizontal flipping of the crops.
- Random RGB colour shift, following the colour augmentation introduced with AlexNet.
A key aspect of VGG training was the use of scale jittering. Training images were rescaled so that the shorter side equaled a value S, and then 224 x 224 crops were extracted. Two approaches to setting S were evaluated:

- Single-scale training: S was fixed, with models trained at S = 256 and S = 384 (the S = 384 models were initialized from the S = 256 weights and trained with a reduced learning rate).
- Multi-scale training (scale jittering): S was sampled uniformly at random from the range [256, 512] for each training image, exposing the network to objects at a range of scales.
Multi-scale training consistently outperformed fixed-scale training across all configurations.
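A possible implementation of the scale-jittering augmentation is sketched below, assuming PyTorch/torchvision; the custom `RandomShorterSideResize` transform is a hypothetical helper, and the paper's RGB colour shift is omitted for brevity:

```python
import torch
from torchvision import transforms

class RandomShorterSideResize:
    """Resize so the shorter image side equals a random S in [s_min, s_max]."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = torch.randint(self.s_min, self.s_max + 1, (1,)).item()
        return transforms.functional.resize(img, s)   # keeps aspect ratio

train_transform = transforms.Compose([
    RandomShorterSideResize(256, 512),   # scale jittering
    transforms.RandomCrop(224),          # 224x224 training crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```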
At test time, the fully connected layers were converted into convolutional layers (the first FC layer to a 7x7 convolutional layer, and the remaining two to 1x1 convolutional layers). This converted the classification network into a fully convolutional network that could accept inputs of any size. The resulting "class score map" was spatially averaged (sum-pooled) to produce a fixed-size vector of class scores. The image was also horizontally flipped, and the scores from the original and flipped versions were averaged to produce the final prediction.
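The FC-to-convolution conversion can be made concrete. The sketch below assumes torchvision's VGG-16 layout: it reshapes the fully connected weights into convolutional kernels and averages the resulting class score map (torchvision's adaptive average pooling is bypassed to mimic the dense-evaluation setup):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)   # pass pre-trained weights here in practice
fc6, fc7, fc8 = model.classifier[0], model.classifier[3], model.classifier[6]

conv6 = nn.Conv2d(512, 4096, kernel_size=7)             # first FC -> 7x7 conv
conv6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
conv6.bias.data = fc6.bias.data

conv7 = nn.Conv2d(4096, 4096, kernel_size=1)             # second FC -> 1x1 conv
conv7.weight.data = fc7.weight.data.view(4096, 4096, 1, 1)
conv7.bias.data = fc7.bias.data

conv8 = nn.Conv2d(4096, 1000, kernel_size=1)             # third FC -> 1x1 conv
conv8.weight.data = fc8.weight.data.view(1000, 4096, 1, 1)
conv8.bias.data = fc8.bias.data

fully_conv = nn.Sequential(model.features, conv6, nn.ReLU(inplace=True),
                           conv7, nn.ReLU(inplace=True), conv8)

# A larger test image yields a spatial map of class scores,
# which is averaged over the spatial dimensions.
scores = fully_conv(torch.randn(1, 3, 384, 384))   # (1, 1000, H', W')
class_scores = scores.mean(dim=(2, 3))              # (1, 1000)
```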
As an alternative to dense evaluation, the authors also tested multi-crop evaluation similar to the approach used by GoogLeNet. Each test image was resized to three scales, and 50 crops were extracted per scale (a 5 x 5 regular grid with horizontal flips), giving 150 crops per image. The softmax class posteriors were averaged across all crops.
The authors found that combining dense and multi-crop evaluation by averaging their softmax outputs produced the best results, as the two methods are complementary due to their different convolution boundary conditions: with crops, the feature maps are zero-padded at the borders, whereas with dense evaluation the padding for the same region comes from neighbouring parts of the image, which increases the effective receptive field and captures more context.
The single-scale evaluation results on the ImageNet validation set demonstrated the consistent benefit of increasing depth. The table below shows results for each configuration with Q (test scale) set equal to S (training scale) for fixed S, or Q = 384 for multi-scale trained models (S in [256, 512]).
| Configuration | Train Scale (S) | Test Scale (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| A (VGG-11) | 256 | 256 | 29.6 | 10.4 |
| A-LRN | 256 | 256 | 29.7 | 10.5 |
| B (VGG-13) | 256 | 256 | 28.7 | 9.9 |
| C | 256 | 256 | 28.1 | 9.4 |
| C | 384 | 384 | 28.1 | 9.3 |
| C | [256; 512] | 384 | 27.3 | 8.8 |
| D (VGG-16) | 256 | 256 | 27.0 | 8.8 |
| D (VGG-16) | 384 | 384 | 26.8 | 8.7 |
| D (VGG-16) | [256; 512] | 384 | 25.6 | 8.1 |
| E (VGG-19) | 256 | 256 | 27.3 | 9.0 |
| E (VGG-19) | 384 | 384 | 26.9 | 8.7 |
| E (VGG-19) | [256; 512] | 384 | 25.5 | 8.0 |
Several observations stand out from these results:

- Local Response Normalization did not help: Configuration A-LRN performed no better than Configuration A, so normalization layers were not used in the deeper configurations.
- Classification error decreased consistently as depth increased from 11 layers (A) to 16 layers (D).
- Configuration C (with the extra 1x1 layers) outperformed Configuration B, showing that additional non-linearity helps, but Configuration D (all 3x3 filters) outperformed C, showing that capturing spatial context with 3x3 filters matters as well.
- Error saturated at 19 layers: Configuration E did not meaningfully improve over Configuration D on this dataset.
- Scale jittering at training time (S in [256; 512]) gave notably better results than training at a fixed scale.
When testing at multiple scales and averaging the results, performance improved across all configurations.
| Configuration | Train Scale (S) | Test Scales (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| B (VGG-13) | 256 | 224, 256, 288 | 28.2 | 9.6 |
| C | 256 | 224, 256, 288 | 27.7 | 9.2 |
| C | 384 | 352, 384, 416 | 27.8 | 9.2 |
| C | [256; 512] | 256, 384, 512 | 26.3 | 8.2 |
| D (VGG-16) | 256 | 224, 256, 288 | 26.6 | 8.6 |
| D (VGG-16) | 384 | 352, 384, 416 | 26.5 | 8.6 |
| D (VGG-16) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
| E (VGG-19) | 256 | 224, 256, 288 | 26.9 | 8.7 |
| E (VGG-19) | 384 | 352, 384, 416 | 26.7 | 8.6 |
| E (VGG-19) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
The combination of dense and multi-crop evaluation methods yielded the best single-model results.
| Configuration | Evaluation Method | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|
| D (VGG-16) | Dense | 24.8 | 7.5 |
| D (VGG-16) | Multi-crop | 24.6 | 7.5 |
| D (VGG-16) | Multi-crop + Dense | 24.4 | 7.2 |
| E (VGG-19) | Dense | 24.8 | 7.5 |
| E (VGG-19) | Multi-crop | 24.6 | 7.4 |
| E (VGG-19) | Multi-crop + Dense | 24.4 | 7.1 |
At the ILSVRC 2014 competition, the VGG team submitted an ensemble of models and achieved the following results:
| Task | VGG Result | Placement |
|---|---|---|
| Classification (top-5 error) | 7.3% | 2nd place |
| Localization (error) | 25.3% | 1st place |
For classification, the winning entry was GoogLeNet with a top-5 error of 6.7%. However, the VGG team noted that a single VGG-16 model achieved 7.0% top-5 test error, outperforming a single GoogLeNet model (7.9% top-5 error). GoogLeNet's advantage came from its ensemble of seven networks and more sophisticated multi-crop evaluation.
The table below compares VGG with contemporary and subsequent architectures.
| Architecture | Year | Top-5 Error (%) | Parameters | Depth (Layers) | FLOPs |
|---|---|---|---|---|---|
| AlexNet | 2012 | 16.4 | ~60M | 8 | ~720M |
| ZFNet | 2013 | 11.7 | ~60M | 8 | ~720M |
| VGG-16 (single model) | 2014 | 7.0 | ~138M | 16 | ~15.5B |
| VGG-19 (single model) | 2014 | 7.1 | ~144M | 19 | ~19.6B |
| GoogLeNet (single model) | 2014 | 7.9 | ~6.8M | 22 | ~1.5B |
| GoogLeNet (ensemble) | 2014 | 6.7 | ~6.8M | 22 | ~1.5B |
| ResNet-152 | 2015 | 3.6 | ~60M | 152 | ~11.3B |
This comparison highlights VGG's position in the evolution of deep learning architectures. It significantly outperformed AlexNet and ZFNet in accuracy but at the cost of much higher parameter counts and computational requirements. GoogLeNet achieved similar accuracy with far fewer parameters through its Inception modules. ResNet, introduced the following year, surpassed all previous architectures by using skip connections to enable training of much deeper networks.
VGG-16 (Configuration D) is the most commonly used variant. The following table provides a layer-by-layer breakdown.
| Layer | Type | Filter Size | Stride | Output Size | Parameters |
|---|---|---|---|---|---|
| Input | - | - | - | 224 x 224 x 3 | 0 |
| conv1_1 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 1,792 |
| conv1_2 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 36,928 |
| pool1 | Max Pooling | 2 x 2 | 2 | 112 x 112 x 64 | 0 |
| conv2_1 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 73,856 |
| conv2_2 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 147,584 |
| pool2 | Max Pooling | 2 x 2 | 2 | 56 x 56 x 128 | 0 |
| conv3_1 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 295,168 |
| conv3_2 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| conv3_3 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| pool3 | Max Pooling | 2 x 2 | 2 | 28 x 28 x 256 | 0 |
| conv4_1 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 1,180,160 |
| conv4_2 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| conv4_3 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| pool4 | Max Pooling | 2 x 2 | 2 | 14 x 14 x 512 | 0 |
| conv5_1 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_2 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_3 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| pool5 | Max Pooling | 2 x 2 | 2 | 7 x 7 x 512 | 0 |
| fc6 | Fully Connected | - | - | 4096 | 102,764,544 |
| fc7 | Fully Connected | - | - | 4096 | 16,781,312 |
| fc8 | Fully Connected | - | - | 1000 | 4,097,000 |
| softmax | Softmax | - | - | 1000 | 0 |
| Total | - | - | - | - | 138,357,544 |
The convolutional layers account for approximately 14.7 million parameters, while the fully connected layers account for approximately 123.6 million parameters. This means that roughly 89% of VGG-16's parameters are concentrated in the fully connected layers.
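This split can be verified directly against torchvision's VGG-16 implementation; the following quick check needs no pre-trained weights:

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)
conv_params = sum(p.numel() for m in model.modules()
                  if isinstance(m, nn.Conv2d) for p in m.parameters())
fc_params = sum(p.numel() for m in model.modules()
                if isinstance(m, nn.Linear) for p in m.parameters())
print(conv_params)                               # 14,714,688
print(fc_params)                                 # 123,642,856
print(fc_params / (conv_params + fc_params))     # ~0.893
```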
While VGG was a significant advancement when it was introduced, the architecture has several well-known limitations.
VGG-16's 138 million parameters require approximately 528 MB of storage. During training with a batch size of 128, the model can require upward of 14 GB of GPU memory. This made VGG difficult to train on the hardware available at the time and remains a concern even with modern GPUs when working with larger batch sizes or higher-resolution inputs.
With approximately 15.5 billion FLOPs per forward pass, VGG-16 is significantly more expensive to run than architectures that achieve comparable or better accuracy. For example, GoogLeNet uses roughly 10x fewer FLOPs while matching VGG in accuracy. This makes VGG impractical for many real-time and edge computing applications without model compression techniques.
The vast majority of VGG's parameters reside in the three fully connected layers, which contribute relatively little to the network's representational power compared to the convolutional layers. Later architectures like GoogLeNet addressed this by replacing fully connected layers with global average pooling, which dramatically reduced parameter counts.
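For a sense of scale, replacing VGG's fully connected head with global average pooling followed by a single linear classifier, as later architectures did, would shrink the head from roughly 123.6 million to about half a million parameters. This is a hypothetical modification for illustration, not part of the original VGG:

```python
import torch.nn as nn

# Global-average-pooling head over the 7x7x512 feature map, then one linear layer.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1000))
print(sum(p.numel() for p in gap_head.parameters()))   # 513,000
```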
Although VGG successfully trained networks up to 19 layers deep, the authors found that going deeper did not yield substantial improvements. Configuration E (19 layers) only marginally outperformed Configuration D (16 layers). Deeper variants would have suffered from the vanishing gradient problem, where gradients diminish as they are backpropagated through many layers, making learning difficult. This limitation was later addressed by ResNet's skip connections.
Training a single VGG network required two to three weeks on four NVIDIA Titan Black GPUs. The staged initialization approach (training shallow networks first to initialize deeper ones) added even more time to the overall process.
Despite its limitations, VGG has had a lasting influence on the field of deep learning and computer vision.
VGG-16 and VGG-19 became standard feature extraction backbones in the years following their release. Pre-trained VGG models, available through frameworks like PyTorch, TensorFlow, and Keras, were widely used for transfer learning on tasks with limited labeled data. The features learned by VGG's convolutional layers proved highly transferable to domains including medical image analysis, satellite imagery classification, and fine-grained visual recognition.
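A typical transfer-learning recipe with a pre-trained VGG-16 looks like the following sketch, assuming torchvision; the 10-class output head is an arbitrary example:

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Load ImageNet-pre-trained weights, freeze the convolutional features,
# and replace the final classification layer for the new task.
model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)   # new 10-class head, trained from scratch
```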
VGG-19 became the standard network for neural style transfer, as popularized by Gatys et al. in their 2015 paper "A Neural Algorithm of Artistic Style." The hierarchical features captured by different layers of VGG, from low-level textures in early layers to high-level content in deeper layers, make it well-suited for separating and recombining content and style information.
VGG-16 served as the backbone for several prominent object detection frameworks, including Faster R-CNN (Ren et al., 2015) and the Single Shot MultiBox Detector (SSD, Liu et al., 2016). Its rich feature representations provided a strong foundation for detecting and localizing objects in images.
Pre-trained VGG networks are widely used to define perceptual loss functions for image generation tasks, including super-resolution, image inpainting, and generative adversarial network (GAN) training. Instead of measuring pixel-level differences between images, perceptual loss computes the distance between feature representations extracted by VGG, producing results that are more perceptually similar to human vision.
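A minimal VGG-based perceptual loss might look like the sketch below, assuming torchvision; the cut-off layer (relu3_3) and the plain MSE criterion are illustrative choices, and in practice inputs are usually normalized with ImageNet statistics first:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGGPerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # features[:16] ends at relu3_3 in torchvision's VGG-16.
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16]
        for p in features.parameters():
            p.requires_grad = False          # the loss network stays frozen
        self.features = features.eval()
        self.criterion = nn.MSELoss()

    def forward(self, generated, target):
        # Distance between feature representations rather than raw pixels.
        return self.criterion(self.features(generated), self.features(target))

loss_fn = VGGPerceptualLoss()
loss = loss_fn(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```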
VGG's demonstration that depth matters directly influenced the development of subsequent architectures. ResNet (2015) built on this insight by introducing skip connections that enabled training networks with over 100 layers. The principle of using small 3x3 filters has been adopted by nearly all modern CNN architectures.
The RepVGG architecture (Ding et al., 2021) revisited the VGG-style plain architecture, using structural reparameterization to achieve competitive performance with modern architectures while maintaining VGG's simple, inference-efficient design.
VGG's uniform, straightforward design makes it one of the most commonly used architectures for teaching deep learning and convolutional neural networks. Its simplicity allows students and practitioners to understand the fundamental building blocks of CNNs without being overwhelmed by the complexity of later architectures like Inception or Transformer-based models.
Pre-trained VGG models are widely available in major deep learning frameworks:
| Framework | Models Available | Pre-trained Weights |
|---|---|---|
| PyTorch (torchvision) | VGG-11, VGG-13, VGG-16, VGG-19 (with and without batch normalization) | ImageNet-1K |
| TensorFlow / Keras | VGG-16, VGG-19 | ImageNet-1K |
| ONNX Model Zoo | VGG-16, VGG-19 | ImageNet-1K |
These pre-trained models enable researchers and practitioners to use VGG as a starting point for new tasks without training from scratch.