VGG, also known as VGGNet, is a convolutional neural network architecture developed by Karen Simonyan and Andrew Zisserman at the Visual Geometry Group (VGG) of the University of Oxford. The architecture was introduced in the 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" and presented at the 3rd International Conference on Learning Representations (ICLR) in 2015. VGG demonstrated that simply increasing the depth of a network by stacking small 3x3 convolutional filters could significantly improve image classification accuracy, a finding that influenced the entire trajectory of deep learning research.
VGG achieved second place in the classification task and first place in the localization task at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014. Despite being surpassed in raw classification accuracy by GoogLeNet at the same competition, VGG became one of the most widely adopted architectures in computer vision due to its simplicity, uniform design, and effectiveness as a feature extractor for transfer learning.
The Visual Geometry Group (VGG) is a research group within the Department of Engineering Science at the University of Oxford. It was founded by Andrew Zisserman, who serves as Professor of Computer Vision Engineering and is a Royal Society Research Professor. Zisserman is known internationally for his work on multiple view geometry, visual recognition, and large-scale retrieval in images and video. He is the only person to have received the Marr Prize three times (1993, 1998, and 2003) and was elected a Fellow of the Royal Society in 2007.
Karen Simonyan, who co-authored the VGG paper, received his PhD in computer vision from the University of Oxford in 2013 under the supervision of Zisserman. Simonyan went on to co-found Vision Factory, an Oxford University spin-off focused on improving visual recognition systems using deep learning. Vision Factory was acquired by Google DeepMind in October 2014. Simonyan later contributed to several notable projects at DeepMind, including WaveNet, AlphaGo Zero, and AlphaZero.
Before VGG, the winning architecture at ILSVRC 2012, AlexNet, had demonstrated the power of deep convolutional networks for image classification with 8 layers. The ZFNet (Zeiler and Fergus), which won ILSVRC 2013, made modest architectural adjustments but did not fundamentally explore the question of how network depth affects accuracy. Simonyan and Zisserman set out to investigate this question systematically by constructing networks of increasing depth while keeping all other design decisions as simple and uniform as possible.
The resulting paper was submitted to arXiv on September 4, 2014, and went through six revisions before the final version was published on April 10, 2015. It has since become one of the most cited papers in the deep learning literature.
The core idea behind VGG is architectural simplicity combined with increased depth. Rather than experimenting with different filter sizes, complex module designs, or varying layer configurations, the VGG architects chose to use exclusively 3x3 convolutional filters throughout the entire network. This stands in contrast to AlexNet, which used 11x11 filters in its first layer and 5x5 filters in its second layer.
The key insight is that a stack of two 3x3 convolutional layers (without spatial pooling in between) has an effective receptive field of 5x5, while a stack of three 3x3 layers has an effective receptive field of 7x7. Using smaller filters stacked in sequence provides two advantages over a single large filter:

- More non-linearity: each 3x3 layer is followed by a ReLU activation, so a stack of three layers applies three non-linear rectifications instead of one, making the decision function more discriminative.
- Fewer parameters: for C input and C output channels, three stacked 3x3 layers use 3 x (3 x 3 x C x C) = 27C^2 weights, whereas a single 7x7 layer uses 7 x 7 x C x C = 49C^2 weights.
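The parameter comparison can be checked with a few lines of Python; this is a back-of-the-envelope sketch, and the channel count C = 512 is just an example:

```python
# Weights in a stack of three 3x3 conv layers vs. a single 7x7 layer,
# assuming C input and C output channels throughout (biases ignored).
C = 512
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2
single_7x7 = 7 * 7 * C * C        # 49 * C^2
print(three_3x3, single_7x7)      # 7,077,888 vs 12,845,056
```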
All VGG configurations share the same overall structure:

- The input is a fixed-size 224 x 224 RGB image, from which the mean RGB value (computed on the training set) is subtracted.
- The image passes through a stack of 3x3 convolutional layers (stride 1, padding 1), each followed by a ReLU activation.
- Spatial downsampling is performed by five 2x2 max-pooling layers with stride 2, placed after groups of convolutional layers.
- The convolutional stack is followed by three fully connected layers: two with 4,096 channels each and a final one with 1,000 channels (one per ImageNet class).
- A final softmax layer produces the class probabilities.
The number of feature map channels starts at 64 in the first convolutional block and doubles after each max-pooling layer, reaching a maximum of 512.
The original paper defines six configurations of increasing depth, labeled A, A-LRN, B, C, D, and E, containing 11, 11, 13, 16, 16, and 19 weight layers respectively. The table below shows the layer structure of each configuration.
| Layer Block | A (VGG-11) | A-LRN | B (VGG-13) | C (VGG-16) | D (VGG-16) | E (VGG-19) |
|---|---|---|---|---|---|---|
| Block 1 | conv3-64 | conv3-64, LRN | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 2 | conv3-128 | conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 3 | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256, conv1-256 | conv3-256, conv3-256, conv3-256 | conv3-256, conv3-256, conv3-256, conv3-256 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 4 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| Block 5 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| Pool | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| FC Layers | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 |
| Softmax | softmax | softmax | softmax | softmax | softmax | softmax |
| Weight layers | 11 | 11 | 13 | 16 | 16 | 19 |
All convolutional layers use 3x3 filters except for the 1x1 filters in Configuration C. Configuration A-LRN is identical to Configuration A but includes a Local Response Normalization layer after the first convolutional layer. Configurations D and E (commonly called VGG-16 and VGG-19) are the most widely used variants.
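Because the design is so uniform, an entire configuration can be generated from a short description. The sketch below builds the convolutional part of Configuration D (VGG-16) in the spirit of, but not identical to, the torchvision implementation; the configuration list and the `make_features` helper are illustrative names, not part of the original code:

```python
import torch
import torch.nn as nn

# "D" configuration: numbers are output channels of 3x3 convs, "M" is max-pooling.
cfg_d = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]

def make_features(cfg):
    layers, in_channels = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg_d)
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)   # torch.Size([1, 512, 7, 7])
```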
Despite the large depth, the number of parameters in VGG networks is dominated by the fully connected layers rather than the convolutional layers. The first fully connected layer alone, which takes the 7x7x512 output of the last convolutional block and maps it to 4,096 channels, contains 7 x 7 x 512 x 4,096 = 102,760,448 weights (102,764,544 parameters including biases). The three fully connected layers together account for approximately 123.6 million of the total parameters.
| Configuration | Common Name | Weight Layers | Parameters (Millions) |
|---|---|---|---|
| A | VGG-11 | 11 | ~133M |
| A-LRN | VGG-11 (LRN) | 11 | ~133M |
| B | VGG-13 | 13 | ~133M |
| C | VGG-16 (1x1 conv) | 16 | ~134M |
| D | VGG-16 | 16 | ~138M |
| E | VGG-19 | 19 | ~144M |
The relatively small difference in parameter count between configurations reflects the fact that most parameters reside in the fully connected layers, which are identical across all configurations. The convolutional layers, despite being the defining feature of each variant, contribute only a modest number of additional parameters as depth increases.
VGG networks are computationally expensive. VGG-16 requires approximately 15.5 billion floating point operations (FLOPs) for a single forward pass on a 224x224 input image, and VGG-19 requires approximately 19.6 billion FLOPs. The model weights for VGG-16 occupy about 528 MB of storage, and VGG-19 occupies about 549 MB.
By comparison, GoogLeNet achieves comparable or better accuracy with only about 1.5 billion FLOPs and 6.8 million parameters. This disparity in efficiency is one of the main limitations of the VGG architecture.
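The order of magnitude of the ~15.5 billion figure can be reproduced with a rough per-layer multiply-accumulate (MAC) count; the sketch below counts one multiply-add per operation, which is the convention behind the commonly quoted number:

```python
# Rough MAC count for VGG-16 on a 224x224 input.
convs = [  # (feature map side, in_channels, out_channels) for each 3x3 conv layer
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]
macs = sum(side * side * c_out * 3 * 3 * c_in for side, c_in, c_out in convs)
macs += 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000   # fully connected layers
print(macs)   # 15,470,264,320 -> ~15.5 billion multiply-adds
```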
The VGG networks were trained using mini-batch stochastic gradient descent (SGD) with the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Momentum | 0.9 |
| Weight decay | 5 x 10^-4 |
| Initial learning rate | 0.01 |
| Learning rate schedule | Decreased by factor of 10 when validation accuracy stopped improving |
| Number of LR decreases | 3 (during training) |
| Dropout | 0.5 (applied to first two FC layers) |
| Epochs | ~74 (370K iterations) |
Training was conducted on a system with four NVIDIA Titan Black GPUs and took approximately two to three weeks per network, depending on the configuration.
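For reference, the hyperparameters in the table map directly onto a modern framework. The sketch below uses PyTorch, which post-dates the original implementation (the paper's code was derived from the C++ Caffe toolbox), so it is an approximation of the setup rather than a reproduction:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vgg16

model = vgg16(weights=None)           # train from scratch, as in the paper
criterion = nn.CrossEntropyLoss()     # multinomial logistic regression objective
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)

# The paper lowered the learning rate by a factor of 10 whenever validation
# accuracy stopped improving (three times over roughly 74 epochs).
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)
# after each epoch: scheduler.step(validation_accuracy)
```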
Initializing very deep networks is challenging because poor initialization can stall learning due to instability of gradients. To address this, the authors used a staged training approach. They first trained Configuration A (the shallowest network, with 11 layers), which was shallow enough to be initialized with random weights drawn from a normal distribution with zero mean and 0.01 variance. The biases were initialized to zero.
Once Configuration A converged, its learned weights were used to initialize the first four convolutional layers and the three fully connected layers of the deeper configurations. The remaining layers, which had no corresponding layer in Configuration A, were initialized randomly. This approach allowed the deeper networks to begin training from a reasonable starting point rather than from scratch.
The authors later noted that it was possible to initialize weights using the procedure of Glorot and Bengio (2010) without pre-training, though they used the staged approach in their experiments.
The training images were augmented using the following techniques:

- Random cropping: a 224 x 224 crop was sampled from each rescaled training image at every SGD iteration.
- Random horizontal flipping of the crops.
- Random RGB colour shift, following the colour augmentation introduced with AlexNet.
A key aspect of VGG training was the use of scale jittering. Training images were rescaled so that the shorter side equaled a value S, and then 224 x 224 crops were extracted. Two approaches to setting S were evaluated:

- Single-scale training: S was fixed, with models trained at S = 256 and S = 384 (the S = 384 models were initialized from the S = 256 weights and trained with a reduced learning rate).
- Multi-scale training (scale jittering): S was sampled uniformly at random from the range [256, 512] for each training image, exposing the network to objects at a range of scales.
Multi-scale training consistently outperformed fixed-scale training across all configurations.
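A possible implementation of the scale-jittering augmentation is sketched below, assuming PyTorch/torchvision; the custom `RandomShorterSideResize` transform is a hypothetical helper, and the paper's RGB colour shift is omitted for brevity:

```python
import torch
from torchvision import transforms

class RandomShorterSideResize:
    """Resize so the shorter image side equals a random S in [s_min, s_max]."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = torch.randint(self.s_min, self.s_max + 1, (1,)).item()
        return transforms.functional.resize(img, s)   # keeps aspect ratio

train_transform = transforms.Compose([
    RandomShorterSideResize(256, 512),   # scale jittering
    transforms.RandomCrop(224),          # 224x224 training crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```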
At test time, the fully connected layers were converted into convolutional layers (the first FC layer to a 7x7 convolutional layer, and the remaining two to 1x1 convolutional layers). This converted the classification network into a fully convolutional network that could accept inputs of any size. The resulting "class score map" was spatially averaged (sum-pooled) to produce a fixed-size vector of class scores. The image was also horizontally flipped, and the scores from the original and flipped versions were averaged to produce the final prediction.
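The FC-to-convolution conversion can be made concrete. The sketch below assumes torchvision's VGG-16 layout: it reshapes the fully connected weights into convolutional kernels and averages the resulting class score map (torchvision's adaptive average pooling is bypassed to mimic the dense-evaluation setup):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)   # pass pre-trained weights here in practice
fc6, fc7, fc8 = model.classifier[0], model.classifier[3], model.classifier[6]

conv6 = nn.Conv2d(512, 4096, kernel_size=7)             # first FC -> 7x7 conv
conv6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
conv6.bias.data = fc6.bias.data

conv7 = nn.Conv2d(4096, 4096, kernel_size=1)             # second FC -> 1x1 conv
conv7.weight.data = fc7.weight.data.view(4096, 4096, 1, 1)
conv7.bias.data = fc7.bias.data

conv8 = nn.Conv2d(4096, 1000, kernel_size=1)             # third FC -> 1x1 conv
conv8.weight.data = fc8.weight.data.view(1000, 4096, 1, 1)
conv8.bias.data = fc8.bias.data

fully_conv = nn.Sequential(model.features, conv6, nn.ReLU(inplace=True),
                           conv7, nn.ReLU(inplace=True), conv8)

# A larger test image yields a spatial map of class scores,
# which is averaged over the spatial dimensions.
scores = fully_conv(torch.randn(1, 3, 384, 384))   # (1, 1000, H', W')
class_scores = scores.mean(dim=(2, 3))              # (1, 1000)
```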
As an alternative to dense evaluation, the authors also tested multi-crop evaluation similar to the approach used by GoogLeNet. Each test image was resized to three scales, and 50 crops were extracted per scale (a 5 x 5 regular grid with horizontal flips), giving 150 crops per image. The softmax class posteriors were averaged across all crops.
The authors found that combining dense and multi-crop evaluation by averaging their softmax outputs produced the best results, as the two methods are complementary due to their different convolution boundary conditions: with crops, the feature maps are zero-padded at the borders, whereas with dense evaluation the padding for the same region comes from neighbouring parts of the image, which increases the effective receptive field and captures more context.
The single-scale evaluation results on the ImageNet validation set demonstrated the consistent benefit of increasing depth. The table below shows results for each configuration with Q (test scale) set equal to S (training scale) for fixed S, or Q = 384 for multi-scale trained models (S in [256, 512]).
| Configuration | Train Scale (S) | Test Scale (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| A (VGG-11) | 256 | 256 | 29.6 | 10.4 |
| A-LRN | 256 | 256 | 29.7 | 10.5 |
| B (VGG-13) | 256 | 256 | 28.7 | 9.9 |
| C | 256 | 256 | 28.1 | 9.4 |
| C | 384 | 384 | 28.1 | 9.3 |
| C | [256; 512] | 384 | 27.3 | 8.8 |
| D (VGG-16) | 256 | 256 | 27.0 | 8.8 |
| D (VGG-16) | 384 | 384 | 26.8 | 8.7 |
| D (VGG-16) | [256; 512] | 384 | 25.6 | 8.1 |
| E (VGG-19) | 256 | 256 | 27.3 | 9.0 |
| E (VGG-19) | 384 | 384 | 26.9 | 8.7 |
| E (VGG-19) | [256; 512] | 384 | 25.5 | 8.0 |
Several observations stand out from these results:

- Local Response Normalization did not help: Configuration A-LRN performed no better than Configuration A, so normalization layers were not used in the deeper configurations.
- Classification error decreased consistently as depth increased from 11 layers (A) to 16 layers (D).
- Configuration C (with the extra 1x1 layers) outperformed Configuration B, showing that additional non-linearity helps, but Configuration D (all 3x3 filters) outperformed C, showing that capturing spatial context with 3x3 filters matters as well.
- Error saturated at 19 layers: Configuration E did not meaningfully improve over Configuration D on this dataset.
- Scale jittering at training time (S in [256; 512]) gave notably better results than training at a fixed scale.
When testing at multiple scales and averaging the results, performance improved across all configurations.
| Configuration | Train Scale (S) | Test Scales (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| B (VGG-13) | 256 | 224, 256, 288 | 28.2 | 9.6 |
| C | 256 | 224, 256, 288 | 27.7 | 9.2 |
| C | 384 | 352, 384, 416 | 27.8 | 9.2 |
| C | [256; 512] | 256, 384, 512 | 26.3 | 8.2 |
| D (VGG-16) | 256 | 224, 256, 288 | 26.6 | 8.6 |
| D (VGG-16) | 384 | 352, 384, 416 | 26.5 | 8.6 |
| D (VGG-16) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
| E (VGG-19) | 256 | 224, 256, 288 | 26.9 | 8.7 |
| E (VGG-19) | 384 | 352, 384, 416 | 26.7 | 8.6 |
| E (VGG-19) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
The combination of dense and multi-crop evaluation methods yielded the best single-model results.
| Configuration | Evaluation Method | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|
| D (VGG-16) | Dense | 24.8 | 7.5 |
| D (VGG-16) | Multi-crop | 24.6 | 7.5 |
| D (VGG-16) | Multi-crop + Dense | 24.4 | 7.2 |
| E (VGG-19) | Dense | 24.8 | 7.5 |
| E (VGG-19) | Multi-crop | 24.6 | 7.4 |
| E (VGG-19) | Multi-crop + Dense | 24.4 | 7.1 |
At the ILSVRC 2014 competition, the VGG team submitted an ensemble of models and achieved the following results:
| Task | VGG Result | Placement |
|---|---|---|
| Classification (top-5 error) | 7.3% | 2nd place |
| Localization (error) | 25.3% | 1st place |
For classification, the winning entry was GoogLeNet with a top-5 error of 6.7%. However, the VGG team noted that a single VGG-16 model achieved 7.0% top-5 test error, outperforming a single GoogLeNet model (7.9% top-5 error). GoogLeNet's advantage came from its ensemble of seven networks and more sophisticated multi-crop evaluation.
The table below compares VGG with contemporary and subsequent architectures.
| Architecture | Year | Top-5 Error (%) | Parameters | Depth (Layers) | FLOPs |
|---|---|---|---|---|---|
| AlexNet | 2012 | 16.4 | ~60M | 8 | ~720M |
| ZFNet | 2013 | 11.7 | ~60M | 8 | ~720M |
| VGG-16 (single model) | 2014 | 7.0 | ~138M | 16 | ~15.5B |
| VGG-19 (single model) | 2014 | 7.1 | ~144M | 19 | ~19.6B |
| GoogLeNet (single model) | 2014 | 7.9 | ~6.8M | 22 | ~1.5B |
| GoogLeNet (ensemble) | 2014 | 6.7 | ~6.8M | 22 | ~1.5B |
| ResNet-152 | 2015 | 3.6 | ~60M | 152 | ~11.3B |
This comparison highlights VGG's position in the evolution of deep learning architectures. It significantly outperformed AlexNet and ZFNet in accuracy but at the cost of much higher parameter counts and computational requirements. GoogLeNet achieved similar accuracy with far fewer parameters through its Inception modules. ResNet, introduced the following year, surpassed all previous architectures by using skip connections to enable training of much deeper networks.
VGG-16 (Configuration D) is the most commonly used variant. The following table provides a layer-by-layer breakdown.
| Layer | Type | Filter Size | Stride | Output Size | Parameters |
|---|---|---|---|---|---|
| Input | - | - | - | 224 x 224 x 3 | 0 |
| conv1_1 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 1,792 |
| conv1_2 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 36,928 |
| pool1 | Max Pooling | 2 x 2 | 2 | 112 x 112 x 64 | 0 |
| conv2_1 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 73,856 |
| conv2_2 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 147,584 |
| pool2 | Max Pooling | 2 x 2 | 2 | 56 x 56 x 128 | 0 |
| conv3_1 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 295,168 |
| conv3_2 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| conv3_3 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| pool3 | Max Pooling | 2 x 2 | 2 | 28 x 28 x 256 | 0 |
| conv4_1 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 1,180,160 |
| conv4_2 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| conv4_3 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| pool4 | Max Pooling | 2 x 2 | 2 | 14 x 14 x 512 | 0 |
| conv5_1 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_2 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_3 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| pool5 | Max Pooling | 2 x 2 | 2 | 7 x 7 x 512 | 0 |
| fc6 | Fully Connected | - | - | 4096 | 102,764,544 |
| fc7 | Fully Connected | - | - | 4096 | 16,781,312 |
| fc8 | Fully Connected | - | - | 1000 | 4,097,000 |
| softmax | Softmax | - | - | 1000 | 0 |
| Total | - | - | - | - | 138,357,544 |
The convolutional layers account for approximately 14.7 million parameters, while the fully connected layers account for approximately 123.6 million parameters. This means that roughly 89% of VGG-16's parameters are concentrated in the fully connected layers.
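This split can be verified directly against torchvision's VGG-16 implementation; the following quick check needs no pre-trained weights:

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)
conv_params = sum(p.numel() for m in model.modules()
                  if isinstance(m, nn.Conv2d) for p in m.parameters())
fc_params = sum(p.numel() for m in model.modules()
                if isinstance(m, nn.Linear) for p in m.parameters())
print(conv_params)                               # 14,714,688
print(fc_params)                                 # 123,642,856
print(fc_params / (conv_params + fc_params))     # ~0.893
```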
While VGG was a significant advancement when it was introduced, the architecture has several well-known limitations.
VGG-16's 138 million parameters require approximately 528 MB of storage. During training with a batch size of 128, the model can require upward of 14 GB of GPU memory. This made VGG difficult to train on the hardware available at the time and remains a concern even with modern GPUs when working with larger batch sizes or higher-resolution inputs.
With approximately 15.5 billion FLOPs per forward pass, VGG-16 is significantly more expensive to run than architectures that achieve comparable or better accuracy. For example, GoogLeNet uses roughly 10x fewer FLOPs while matching VGG in accuracy. This makes VGG impractical for many real-time and edge computing applications without model compression techniques.
The vast majority of VGG's parameters reside in the three fully connected layers, which contribute relatively little to the network's representational power compared to the convolutional layers. Later architectures like GoogLeNet addressed this by replacing fully connected layers with global average pooling, which dramatically reduced parameter counts.
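For a sense of scale, replacing VGG's fully connected head with global average pooling followed by a single linear classifier, as later architectures did, would shrink the head from roughly 123.6 million to about half a million parameters. This is a hypothetical modification for illustration, not part of the original VGG:

```python
import torch.nn as nn

# Global-average-pooling head over the 7x7x512 feature map, then one linear layer.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1000))
print(sum(p.numel() for p in gap_head.parameters()))   # 513,000
```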
Although VGG successfully trained networks up to 19 layers deep, the authors found that going deeper did not yield substantial improvements. Configuration E (19 layers) only marginally outperformed Configuration D (16 layers). Deeper variants would have suffered from the vanishing gradient problem, where gradients diminish as they are backpropagated through many layers, making learning difficult. This limitation was later addressed by ResNet's skip connections.
Training a single VGG network required two to three weeks on four NVIDIA Titan Black GPUs. The staged initialization approach (training shallow networks first to initialize deeper ones) added even more time to the overall process.
Despite its limitations, VGG has had a lasting influence on the field of deep learning and computer vision.
VGG-16 and VGG-19 became standard feature extraction backbones in the years following their release. Pre-trained VGG models, available through frameworks like PyTorch, TensorFlow, and Keras, were widely used for transfer learning on tasks with limited labeled data. The features learned by VGG's convolutional layers proved highly transferable to domains including medical image analysis, satellite imagery classification, and fine-grained visual recognition.
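A typical transfer-learning recipe with a pre-trained VGG-16 looks like the following sketch, assuming torchvision; the 10-class output head is an arbitrary example:

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Load ImageNet-pre-trained weights, freeze the convolutional features,
# and replace the final classification layer for the new task.
model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)   # new 10-class head, trained from scratch
```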
VGG-19 became the standard network for neural style transfer, as popularized by Gatys et al. in their 2015 paper "A Neural Algorithm of Artistic Style." The hierarchical features captured by different layers of VGG, from low-level textures in early layers to high-level content in deeper layers, make it well-suited for separating and recombining content and style information.
VGG-16 served as the backbone for several prominent object detection frameworks, including Faster R-CNN (Ren et al., 2015) and the Single Shot MultiBox Detector (SSD, Liu et al., 2016). Its rich feature representations provided a strong foundation for detecting and localizing objects in images.
Pre-trained VGG networks are widely used to define perceptual loss functions for image generation tasks, including super-resolution, image inpainting, and generative adversarial network (GAN) training. Instead of measuring pixel-level differences between images, perceptual loss computes the distance between feature representations extracted by VGG, producing results that are more perceptually similar to human vision.
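A minimal VGG-based perceptual loss might look like the sketch below, assuming torchvision; the cut-off layer (relu3_3) and the plain MSE criterion are illustrative choices, and in practice inputs are usually normalized with ImageNet statistics first:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGGPerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # features[:16] ends at relu3_3 in torchvision's VGG-16.
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16]
        for p in features.parameters():
            p.requires_grad = False          # the loss network stays frozen
        self.features = features.eval()
        self.criterion = nn.MSELoss()

    def forward(self, generated, target):
        # Distance between feature representations rather than raw pixels.
        return self.criterion(self.features(generated), self.features(target))

loss_fn = VGGPerceptualLoss()
loss = loss_fn(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```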
VGG's demonstration that depth matters directly influenced the development of subsequent architectures. ResNet (2015) built on this insight by introducing skip connections that enabled training networks with over 100 layers. The principle of using small 3x3 filters has been adopted by nearly all modern CNN architectures.
The RepVGG architecture (Ding et al., 2021) revisited the VGG-style plain architecture, using structural reparameterization to achieve competitive performance with modern architectures while maintaining VGG's simple, inference-efficient design.
VGG's uniform, straightforward design makes it one of the most commonly used architectures for teaching deep learning and convolutional neural networks. Its simplicity allows students and practitioners to understand the fundamental building blocks of CNNs without being overwhelmed by the complexity of later architectures like Inception or Transformer-based models.
Pre-trained VGG models are widely available in major deep learning frameworks:
| Framework | Models Available | Pre-trained Weights |
|---|---|---|
| PyTorch (torchvision) | VGG-11, VGG-13, VGG-16, VGG-19 (with and without batch normalization) | ImageNet-1K |
| TensorFlow / Keras | VGG-16, VGG-19 | ImageNet-1K |
| ONNX Model Zoo | VGG-16, VGG-19 | ImageNet-1K |
These pre-trained models enable researchers and practitioners to use VGG as a starting point for new tasks without training from scratch.