# VGG

> Source: https://aiwiki.ai/wiki/vgg
> Updated: 2026-06-21
> Categories: Computer Vision, Deep Learning, Neural Networks

**VGG** (also called **VGGNet**) is a deep [convolutional neural network](/wiki/convolutional_neural_network) architecture, introduced in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford, that showed image classification accuracy improves substantially when network depth is pushed to 16-19 weight layers built entirely from small 3x3 convolution filters.[1] Its two best-known variants, VGG-16 and VGG-19, take a fixed 224x224 RGB image as input and stack 3x3 convolutions into 16 and 19 weight layers respectively, with roughly 138 million and 144 million parameters.[1] The architecture was presented in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" at the 3rd International Conference on Learning Representations ([ICLR](/wiki/iclr)) in 2015.[1]

VGG achieved second place in the classification task and first place in the localization task at the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge (ILSVRC) 2014, with a 7.3% top-5 classification error and a 25.3% localization error.[1][8] The paper's abstract summarizes the central finding directly: its "main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers."[1] Despite being surpassed in raw classification accuracy by [GoogLeNet](/wiki/googlenet) at the same competition, VGG became one of the most widely adopted architectures in [computer vision](/wiki/computer_vision) due to its simplicity, uniform design, and effectiveness as a feature extractor for [transfer learning](/wiki/transfer_learning).[3]

## History and Background

### The Visual Geometry Group

The Visual Geometry Group (VGG) is a research group within the Department of Engineering Science at the University of Oxford.[10] It was founded by Andrew Zisserman, who serves as Professor of Computer Vision Engineering and is a Royal Society Research Professor. Zisserman is known internationally for his work on multiple view geometry, visual recognition, and large-scale retrieval in images and video. He is the only person to have received the Marr Prize three times (1993, 1998, and 2003) and was elected a Fellow of the Royal Society in 2007.

Karen Simonyan, who co-authored the VGG paper, received his PhD in computer vision from the University of Oxford in 2013 under the supervision of Zisserman. Simonyan went on to co-found Vision Factory, an Oxford University spin-off focused on improving visual recognition systems using deep learning, together with Zisserman and Max Jaderberg.[11] Vision Factory was acquired by [Google DeepMind](/wiki/deepmind) in October 2014 as part of a pair of Oxford acqui-hires announced on October 23, 2014.[11] Simonyan later contributed to several notable projects at DeepMind, including WaveNet, [AlphaGo](/wiki/alphago) Zero, and [AlphaZero](/wiki/alphazero).

### Context: The Depth Question

Before VGG, the winning architecture at ILSVRC 2012, [AlexNet](/wiki/alexnet), had demonstrated the power of deep convolutional networks for image classification with 8 layers.[2] ZFNet (Zeiler and Fergus), which won ILSVRC 2013, made modest architectural adjustments but did not systematically explore how network depth affects accuracy. Simonyan and Zisserman set out to investigate this question by constructing networks of increasing depth while keeping all other design decisions as simple and uniform as possible.[1]

The resulting paper was submitted to arXiv on September 4, 2014, and went through six versions, the last of which (v6) was posted on April 10, 2015.[1] It has since become one of the most cited papers in the deep learning literature.

## Architecture

### Design Philosophy

The core idea behind VGG is architectural simplicity combined with increased depth.[1] Rather than experimenting with different filter sizes, complex module designs, or varying layer configurations, the VGG architects chose to use exclusively 3x3 convolutional filters throughout the entire network.[1] This stands in contrast to AlexNet, which used 11x11 filters in its first layer and 5x5 filters in its second layer.[2]

The key insight is that a stack of two 3x3 convolutional layers (without spatial pooling in between) has an effective receptive field of 5x5, while a stack of three 3x3 layers has an effective receptive field of 7x7.[1] Using smaller filters stacked in sequence provides two advantages over a single large filter:

1. **More non-linearity**: Each convolutional layer is followed by a [ReLU](/wiki/relu) activation function, so stacking three 3x3 layers introduces three non-linear transformations instead of one. This makes the decision function more discriminative.[1]
2. **Fewer parameters**: Three 3x3 convolution layers with C channels each have 3 x (3 x 3 x C x C) = 27C^2 parameters, while a single 7x7 layer has 7 x 7 x C x C = 49C^2 parameters. This represents a 45% reduction in parameters.[1]
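
The receptive-field and parameter arithmetic above can be checked with a short script. This is a minimal sketch: the helper names `stacked_receptive_field` and `conv_weights` are illustrative rather than taken from the paper, and biases are ignored, as in the text.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    # Each stride-1 layer widens the field by (kernel - 1) pixels.
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (no biases) for layers with `channels` inputs and outputs."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # any channel width gives the same ratio
stacked = conv_weights(3, C, num_layers=3)   # 27 * C^2
single = conv_weights(7, C)                  # 49 * C^2

print(stacked_receptive_field(2))            # 5
print(stacked_receptive_field(3))            # 7
print(f"{1 - stacked / single:.1%}")         # 44.9%
```

The ratio is independent of C, which is why the saving is usually quoted as a single percentage.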

### General Architecture

All VGG configurations share the same overall structure:

- **Input**: A fixed-size 224 x 224 RGB image.[1]
- **Preprocessing**: Subtraction of the mean RGB value computed on the training set from each pixel.[1]
- **Convolutional layers**: Filters are 3x3 with stride 1 and 1 pixel of padding, which preserves spatial resolution. The one exception is Configuration C, which also uses some 1x1 convolutional filters; these act as a linear transformation of the input channels followed by a non-linearity.[1]
- **Pooling**: Max-pooling is performed over 2x2 windows with stride 2. Five max-pooling layers are used throughout the network, placed after certain convolutional blocks.[1]
- **Fully connected layers**: Three fully connected layers follow the convolutional stack. The first two have 4,096 channels each, and the third has 1,000 channels (one per [ImageNet](/wiki/imagenet) class).[1]
- **Output**: A [softmax](/wiki/softmax) layer for classification.[1]
- **Activation**: All hidden layers use [ReLU](/wiki/relu) non-linearity.[1]
- **No Local Response Normalization (LRN)**: The authors found that LRN, as used in AlexNet, did not improve performance on VGG and only increased memory consumption and computation time.[1]

The number of feature map channels starts at 64 in the first convolutional block and doubles after each max-pooling layer, reaching a maximum of 512.[1]
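
To make the shared structure concrete, the sketch below traces the feature-map shape through the five blocks. It assumes a square input and that each block ends in one 2x2, stride-2 max-pooling layer; the function name `trace_shapes` is illustrative.

```python
def trace_shapes(size=224, widths=(64, 128, 256, 512, 512)):
    """Return (block, size before pooling, channels, size after pooling)."""
    shapes = []
    for block, channels in enumerate(widths, start=1):
        # 3x3 convs with stride 1 and padding 1 keep the spatial size unchanged.
        pooled = size // 2  # 2x2 max-pooling with stride 2 halves it
        shapes.append((block, size, channels, pooled))
        size = pooled
    return shapes


for block, before, channels, after in trace_shapes():
    print(f"block {block}: {before}x{before}x{channels} "
          f"-> pool -> {after}x{after}x{channels}")
```

The last line of output shows the 7x7x512 tensor that feeds the first fully connected layer.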

### Configurations (A through E)

The original paper defines six configurations of increasing depth: A, A-LRN, B, C, D, and E.[1] They have 11, 11, 13, 16, 16, and 19 weight layers respectively.[1] The table below shows the layer structure of each configuration.

| Layer Block | A (VGG-11) | A-LRN | B (VGG-13) | C (VGG-16) | D (VGG-16) | E (VGG-19) |
|---|---|---|---|---|---|---|
| **Block 1** | conv3-64 | conv3-64, LRN | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 | conv3-64, conv3-64 |
| **Pool** | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| **Block 2** | conv3-128 | conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 | conv3-128, conv3-128 |
| **Pool** | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| **Block 3** | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256 | conv3-256, conv3-256, conv1-256 | conv3-256, conv3-256, conv3-256 | conv3-256, conv3-256, conv3-256, conv3-256 |
| **Pool** | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| **Block 4** | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| **Pool** | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| **Block 5** | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512 | conv3-512, conv3-512, conv1-512 | conv3-512, conv3-512, conv3-512 | conv3-512, conv3-512, conv3-512, conv3-512 |
| **Pool** | maxpool | maxpool | maxpool | maxpool | maxpool | maxpool |
| **FC Layers** | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 | FC-4096, FC-4096, FC-1000 |
| **Softmax** | softmax | softmax | softmax | softmax | softmax | softmax |
| **Weight layers** | 11 | 11 | 13 | 16 | 16 | 19 |

All convolutional layers use 3x3 filters except for the 1x1 filters in Configuration C.[1] Configuration A-LRN is identical to Configuration A but includes a Local Response Normalization layer after the first convolutional layer.[1] Configurations D and E (commonly called VGG-16 and VGG-19) are the most widely used variants.

### Parameter Counts

Despite the large depth, the number of parameters in VGG networks is dominated by the fully connected layers rather than the convolutional layers.[1] The first fully connected layer alone, which takes the 7x7x512 output of the last convolutional block and maps it to 4,096 channels, contains 7 x 7 x 512 x 4,096 = 102,760,448 weights (102,764,544 parameters once its 4,096 biases are included). The three fully connected layers together account for roughly 124 million of the total parameters.

| Configuration | Common Name | Weight Layers | Parameters (Millions) |
|---|---|---|---|
| A | VGG-11 | 11 | ~133M |
| A-LRN | VGG-11 (LRN) | 11 | ~133M |
| B | VGG-13 | 13 | ~133M |
| C | VGG-16 (1x1 conv) | 16 | ~134M |
| D | VGG-16 | 16 | ~138M |
| E | VGG-19 | 19 | ~144M |

The relatively small difference in parameter count between configurations reflects the fact that most parameters reside in the fully connected layers, which are identical across all configurations.[1] The convolutional layers, despite being the defining feature of each variant, contribute only a modest number of additional parameters as depth increases.
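
The totals in the table can be reproduced from the layer layouts. The following is a minimal sketch in which the layouts are transcribed from the configuration table, biases are counted, and A-LRN is omitted because LRN adds no learnable parameters. The names `block`, `CONFIGS`, and `count` are illustrative.

```python
def block(channels, n3, n1=0):
    """One block: n3 3x3 convs, n1 1x1 convs, then max-pooling ("M")."""
    return [(3, channels)] * n3 + [(1, channels)] * n1 + ["M"]


CONFIGS = {
    "A": block(64, 1) + block(128, 1) + block(256, 2) + block(512, 2) + block(512, 2),
    "B": block(64, 2) + block(128, 2) + block(256, 2) + block(512, 2) + block(512, 2),
    "C": block(64, 2) + block(128, 2) + block(256, 2, 1) + block(512, 2, 1) + block(512, 2, 1),
    "D": block(64, 2) + block(128, 2) + block(256, 3) + block(512, 3) + block(512, 3),
    "E": block(64, 2) + block(128, 2) + block(256, 4) + block(512, 4) + block(512, 4),
}

# (inputs, outputs) of the three fully connected layers.
FC_SIZES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def count(config):
    """Return (weight layers, conv parameters, FC parameters)."""
    in_ch, conv_params, conv_layers = 3, 0, 0
    for layer in config:
        if layer == "M":
            continue  # pooling has no parameters
        kernel, out_ch = layer
        conv_params += kernel * kernel * in_ch * out_ch + out_ch
        conv_layers += 1
        in_ch = out_ch
    fc_params = sum(n_in * n_out + n_out for n_in, n_out in FC_SIZES)
    return conv_layers + len(FC_SIZES), conv_params, fc_params


for name, config in CONFIGS.items():
    layers, conv, fc = count(config)
    total = conv + fc
    print(f"{name}: {layers} weight layers, {total / 1e6:.1f}M parameters, "
          f"{fc / total:.0%} in FC layers")
```

Because the fully connected term is the same constant for every configuration, the totals differ only by the comparatively small convolutional term.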

### Computational Cost

VGG networks are computationally expensive. VGG-16 requires approximately 15.5 billion multiply-accumulate operations for a single forward pass on a 224x224 input image (a figure usually quoted as 15.5 billion FLOPs), and VGG-19 requires approximately 19.6 billion. The model weights for VGG-16 occupy about 528 MB of storage, and VGG-19 occupies about 549 MB.

By comparison, [GoogLeNet](/wiki/googlenet) achieves comparable or better accuracy with only about 1.5 billion FLOPs and 6.8 million parameters.[3] This disparity in efficiency is one of the main limitations of the VGG architecture.
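
The VGG figures can be reproduced by counting one multiply-accumulate per weight per output position. This is a sketch under simplifying assumptions: only convolutional and fully connected layers are counted (ReLU, pooling, and softmax are ignored), and the function name `vgg_macs` is illustrative.

```python
def vgg_macs(convs_per_block, size=224):
    """Multiply-accumulate count for one forward pass of a VGG-style network."""
    widths = (64, 128, 256, 512, 512)
    in_ch, macs = 3, 0
    for n_convs, out_ch in zip(convs_per_block, widths):
        for _ in range(n_convs):
            # One 3x3 conv: every output value needs 3*3*in_ch multiply-adds.
            macs += size * size * out_ch * 3 * 3 * in_ch
            in_ch = out_ch
        size //= 2  # 2x2 max-pooling, stride 2
    # Three fully connected layers: 25088 -> 4096 -> 4096 -> 1000.
    macs += size * size * in_ch * 4096 + 4096 * 4096 + 4096 * 1000
    return macs


print(f"VGG-16: {vgg_macs((2, 2, 3, 3, 3)) / 1e9:.1f} billion")  # 15.5
print(f"VGG-19: {vgg_macs((2, 2, 4, 4, 4)) / 1e9:.1f} billion")  # 19.6
```

Almost all of this cost comes from the convolutional layers, the opposite of the parameter distribution, where the fully connected layers dominate.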

## Training

### Optimization

The VGG networks were trained using mini-batch [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) with the following hyperparameters:[1]

| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Momentum | 0.9 |
| Weight decay | 5 x 10^-4 |
| Initial learning rate | 0.01 |
| Learning rate schedule | Decreased by factor of 10 when validation accuracy stopped improving |
| Number of LR decreases | 3 (during training) |
| [Dropout](/wiki/dropout) | 0.5 (applied to first two FC layers) |
| Epochs | ~74 (370K iterations) |

Training was conducted on a system with four NVIDIA Titan Black GPUs and took approximately two to three weeks per network, depending on the configuration.[1]
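
The epoch count in the table follows from the batch size and iteration count. The sketch below assumes the ILSVRC-2012 training set size of 1,281,167 images, a number that is not stated in this article.

```python
BATCH_SIZE = 256
ITERATIONS = 370_000
TRAIN_IMAGES = 1_281_167  # assumption: ILSVRC-2012 training set size

epochs = ITERATIONS * BATCH_SIZE / TRAIN_IMAGES
print(f"{epochs:.1f} epochs")  # 73.9

# Learning rate at the start and after each of the three tenfold decreases.
lr_steps = [0.01 / 10 ** k for k in range(4)]
print(lr_steps)
```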

### Weight Initialization

Initializing very deep networks is challenging because poor initialization can stall learning due to instability of gradients.[1] To address this, the authors used a staged training approach. They first trained Configuration A (the shallowest network, with 11 layers), which was shallow enough to be initialized with random weights drawn from a normal distribution with zero mean and 0.01 variance. The biases were initialized to zero.[1]

Once Configuration A converged, its learned weights were used to initialize the first four convolutional layers and the last three fully connected layers of the deeper configurations. The intermediate layers, which had no counterpart in Configuration A, were initialized randomly. This approach allowed the deeper networks to begin training from a reasonable starting point rather than from scratch.[1]

The authors later noted that it was possible to initialize weights using the procedure of Glorot and Bengio (2010) without pre-training, though they used the staged approach in their experiments.[1]
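
The two initialization schemes mentioned above can be sketched with the standard library. This is an illustration rather than the authors' code: it takes the stated variance of 0.01 at face value (a standard deviation of 0.1) and uses the uniform variant of the Glorot and Bengio bound. The function names are hypothetical.

```python
import math
import random
import statistics


def normal_init(n_weights, variance=0.01, seed=0):
    """Sample weights from N(0, variance); note that std = sqrt(variance)."""
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform range proposed by Glorot and Bengio (2010)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


# First layer of Configuration A: 3x3 kernel, 3 input and 64 output channels.
weights = normal_init(3 * 3 * 3 * 64)
print(round(statistics.pstdev(weights), 2))
print(glorot_uniform_limit(fan_in=3 * 3 * 3, fan_out=3 * 3 * 64))
```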

### Data Augmentation

The training images were augmented using the following techniques:

- **Random cropping**: 224 x 224 patches were randomly cropped from the rescaled training images.[1]
- **Random horizontal flipping**: Each crop was randomly flipped horizontally with a 50% probability.[1]
- **Random RGB color shift**: The PCA-based color augmentation scheme from the AlexNet paper was applied, where principal components of the RGB pixel values across the training set were computed and random multiples of the principal components were added to each training image.[2]

### Multi-Scale Training

A key aspect of VGG training was the use of scale jittering.[1] Training images were rescaled so that the shorter side equaled a value S, and then 224 x 224 crops were extracted. Two approaches to setting S were evaluated:

1. **Fixed-scale training**: S was fixed to either 256 or 384 pixels. When training at S = 384, the weights were initialized from the model trained at S = 256 to speed up convergence.[1]
2. **Multi-scale training (scale jittering)**: S was randomly sampled from the range [256, 512] for each training image, so the model saw objects at different scales during training. For speed, the multi-scale model was fine-tuned from the fixed-scale model trained at S = 384.[1]

In the reported results, multi-scale training outperformed fixed-scale training for each configuration on which it was evaluated (C, D, and E).[1]
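
The sampling steps for one training example (choose S, rescale the shorter side to S, take a random 224 x 224 crop, and flip with 50% probability) can be sketched as follows. Only the geometry is computed, no pixels are touched, and the function name `sample_crop` and its return layout are illustrative.

```python
import random


def sample_crop(width, height, s_range=(256, 512), crop=224, rng=random):
    """Pick a training scale S, the rescaled image size, and a crop box."""
    s = rng.randint(*s_range)             # scale jittering: S in [S_min, S_max]
    ratio = s / min(width, height)        # the shorter side becomes S
    new_w, new_h = round(width * ratio), round(height * ratio)
    left = rng.randint(0, new_w - crop)   # random 224x224 crop position
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5             # horizontal flip with 50% probability
    return s, (new_w, new_h), (left, top, left + crop, top + crop), flip


rng = random.Random(0)
print(sample_crop(500, 375, rng=rng))
```

Passing `s_range=(256, 256)` or `s_range=(384, 384)` gives the fixed-scale setting.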

## Evaluation

### Dense Evaluation

At test time, the fully connected layers were converted into convolutional layers (the first FC layer to a 7x7 convolutional layer, and the remaining two to 1x1 convolutional layers).[1] This converted the classification network into a fully convolutional network that could accept inputs of any size. The resulting "class score map" was spatially averaged (sum-pooled) to produce a fixed-size vector of class scores. The image was also horizontally flipped, and the scores from the original and flipped versions were averaged to produce the final prediction.[1]
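
The size of the class score map follows from the five pooling layers and the 7x7 convolution that replaces the first fully connected layer. The sketch below assumes a square Q x Q input for simplicity (in practice only the shorter side is rescaled to Q), and the function names are illustrative.

```python
def class_score_map_size(q):
    """Side length of the class score map for a q x q test image."""
    size = q
    for _ in range(5):
        size //= 2          # five 2x2, stride-2 max-pooling layers
    return size - 7 + 1     # first FC layer applied as an unpadded 7x7 conv


def spatial_average(score_map):
    """Average a list of per-location class score vectors into one vector."""
    n = len(score_map)
    return [sum(scores[c] for scores in score_map) / n
            for c in range(len(score_map[0]))]


for q in (224, 256, 384, 512):
    side = class_score_map_size(q)
    print(f"Q={q}: {side}x{side} score map")

print(spatial_average([[0.2, 0.8], [0.6, 0.4]]))
```

At Q = 224 the map collapses to a single location, which recovers the original fixed-size classifier.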

### Multi-Crop Evaluation

As an alternative to dense evaluation, the authors also tested multi-crop evaluation following the approach from GoogLeNet.[3] Each test image was resized to three scales, and 50 crops were extracted per scale (a 5x5 regular grid with two flips), for a total of 150 crops. The softmax class posteriors were averaged across all crops.[1]

### Combining Dense and Multi-Crop

The authors found that combining dense and multi-crop evaluation by averaging their softmax outputs produced the best results, as the two methods are complementary.[1] Multi-crop evaluation samples the input image more finely, and the two methods differ in their convolution boundary conditions: a crop's feature maps are padded with zeros, whereas under dense evaluation the padding for the same region comes from neighbouring parts of the image, which increases the effective receptive field and captures more context.[1]
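
Averaging softmax outputs is a simple operation, sketched below for a three-class toy problem. The score values are hypothetical and chosen only to show the two methods disagreeing; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_posteriors(*posteriors):
    """Element-wise mean of several class-posterior vectors."""
    n = len(posteriors)
    return [sum(p[c] for p in posteriors) / n
            for c in range(len(posteriors[0]))]


dense = softmax([2.0, 1.0, 0.1])       # hypothetical dense-evaluation scores
multi_crop = softmax([1.5, 1.8, 0.2])  # hypothetical multi-crop scores
combined = average_posteriors(dense, multi_crop)
print(combined.index(max(combined)))   # index of the predicted class
```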

## Results

### Single-Scale Evaluation

The single-scale evaluation results on the ImageNet validation set demonstrated the consistent benefit of increasing depth.[1] The table below shows results for each configuration with Q (test scale) set equal to S (training scale) for fixed S, or Q = 384 for multi-scale trained models (S in [256, 512]).

| Configuration | Train Scale (S) | Test Scale (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| A (VGG-11) | 256 | 256 | 29.6 | 10.4 |
| A-LRN | 256 | 256 | 29.7 | 10.5 |
| B (VGG-13) | 256 | 256 | 28.7 | 9.9 |
| C | 256 | 256 | 28.1 | 9.4 |
| C | 384 | 384 | 28.1 | 9.3 |
| C | [256; 512] | 384 | 27.3 | 8.8 |
| D (VGG-16) | 256 | 256 | 27.0 | 8.8 |
| D (VGG-16) | 384 | 384 | 26.8 | 8.7 |
| D (VGG-16) | [256; 512] | 384 | 25.6 | 8.1 |
| E (VGG-19) | 256 | 256 | 27.3 | 9.0 |
| E (VGG-19) | 384 | 384 | 26.9 | 8.7 |
| E (VGG-19) | [256; 512] | 384 | 25.5 | 8.0 |

Several observations stand out from these results:

- **LRN does not help**: Configuration A-LRN performed slightly worse than A without LRN, confirming that Local Response Normalization is unnecessary.[1]
- **Deeper is better**: Error decreased consistently from A (11 layers) to D (16 layers). Configuration E (19 layers) performed on par with D, edging ahead only with scale jittering, which suggests that the error rate saturates at around this depth.[1]
- **1x1 convolutions help, but 3x3 is better**: Configuration C (with 1x1 convolutions) outperformed B, but D (which replaces those 1x1 layers with 3x3 layers) performed even better, confirming the value of spatial context captured by 3x3 filters.[1]
- **Multi-scale training helps**: Scale jittering (S in [256; 512]) consistently improved performance over fixed-scale training.[1]

### Multi-Scale Evaluation

When testing at multiple scales and averaging the results, performance improved across all configurations.[1]

| Configuration | Train Scale (S) | Test Scales (Q) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|
| B (VGG-13) | 256 | 224, 256, 288 | 28.2 | 9.6 |
| C | 256 | 224, 256, 288 | 27.7 | 9.2 |
| C | 384 | 352, 384, 416 | 27.8 | 9.2 |
| C | [256; 512] | 256, 384, 512 | 26.3 | 8.2 |
| D (VGG-16) | 256 | 224, 256, 288 | 26.6 | 8.6 |
| D (VGG-16) | 384 | 352, 384, 416 | 26.5 | 8.6 |
| D (VGG-16) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |
| E (VGG-19) | 256 | 224, 256, 288 | 26.9 | 8.7 |
| E (VGG-19) | 384 | 352, 384, 416 | 26.7 | 8.6 |
| E (VGG-19) | [256; 512] | 256, 384, 512 | 24.8 | 7.5 |

### Dense vs. Multi-Crop Evaluation

The combination of dense and multi-crop evaluation methods yielded the best single-model results.[1]

| Configuration | Evaluation Method | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|
| D (VGG-16) | Dense | 24.8 | 7.5 |
| D (VGG-16) | Multi-crop | 24.6 | 7.5 |
| D (VGG-16) | Multi-crop + Dense | 24.4 | 7.2 |
| E (VGG-19) | Dense | 24.8 | 7.5 |
| E (VGG-19) | Multi-crop | 24.6 | 7.4 |
| E (VGG-19) | Multi-crop + Dense | 24.4 | 7.1 |

### ILSVRC 2014 Competition Results

At the ILSVRC 2014 competition, the VGG team submitted an ensemble of models and achieved the following results:[1][8]

| Task | VGG Result | Placement |
|---|---|---|
| Classification (top-5 error) | 7.3% | 2nd place |
| Localization (error) | 25.3% | 1st place |

For classification, the winning entry was [GoogLeNet](/wiki/googlenet) with a top-5 error of 6.7%.[3] However, the VGG paper reports that its best single network reached 7.0% top-5 test error, outperforming a single GoogLeNet model (7.9% top-5 error) by 0.9 percentage points.[1] GoogLeNet's advantage came from its ensemble of seven networks and more sophisticated multi-crop evaluation.[3]

### Comparison with State of the Art

The table below compares VGG with contemporary and subsequent architectures.

| Architecture | Year | Top-5 Error (%) | Parameters | Depth (Layers) | FLOPs |
|---|---|---|---|---|---|
| [AlexNet](/wiki/alexnet) | 2012 | 16.4 | ~60M | 8 | ~720M |
| ZFNet | 2013 | 11.7 | ~60M | 8 | ~720M |
| VGG-16 (single model) | 2014 | 7.0 | ~138M | 16 | ~15.5B |
| VGG-19 (single model) | 2014 | 7.1 | ~144M | 19 | ~19.6B |
| [GoogLeNet](/wiki/googlenet) (single model) | 2014 | 7.9 | ~6.8M | 22 | ~1.5B |
| [GoogLeNet](/wiki/googlenet) (ensemble) | 2014 | 6.7 | ~6.8M | 22 | ~1.5B |
| [ResNet](/wiki/resnet)-152 | 2015 | 3.6 | ~60M | 152 | ~11.3B |

This comparison highlights VGG's position in the evolution of deep learning architectures. It significantly outperformed AlexNet and ZFNet in accuracy but at the cost of much higher parameter counts and computational requirements.[2] GoogLeNet achieved similar accuracy with far fewer parameters through its [Inception](/wiki/inception) modules.[3] [ResNet](/wiki/resnet), introduced the following year, surpassed all previous architectures by using skip connections to enable training of much deeper networks.[4]

## Detailed Layer-by-Layer Analysis of VGG-16

VGG-16 (Configuration D) is the most commonly used variant. The following table provides a layer-by-layer breakdown.

| Layer | Type | Filter Size | Stride | Output Size | Parameters |
|---|---|---|---|---|---|
| Input | - | - | - | 224 x 224 x 3 | 0 |
| conv1_1 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 1,792 |
| conv1_2 | Convolution | 3 x 3 | 1 | 224 x 224 x 64 | 36,928 |
| pool1 | Max Pooling | 2 x 2 | 2 | 112 x 112 x 64 | 0 |
| conv2_1 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 73,856 |
| conv2_2 | Convolution | 3 x 3 | 1 | 112 x 112 x 128 | 147,584 |
| pool2 | Max Pooling | 2 x 2 | 2 | 56 x 56 x 128 | 0 |
| conv3_1 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 295,168 |
| conv3_2 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| conv3_3 | Convolution | 3 x 3 | 1 | 56 x 56 x 256 | 590,080 |
| pool3 | Max Pooling | 2 x 2 | 2 | 28 x 28 x 256 | 0 |
| conv4_1 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 1,180,160 |
| conv4_2 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| conv4_3 | Convolution | 3 x 3 | 1 | 28 x 28 x 512 | 2,359,808 |
| pool4 | Max Pooling | 2 x 2 | 2 | 14 x 14 x 512 | 0 |
| conv5_1 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_2 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| conv5_3 | Convolution | 3 x 3 | 1 | 14 x 14 x 512 | 2,359,808 |
| pool5 | Max Pooling | 2 x 2 | 2 | 7 x 7 x 512 | 0 |
| fc6 | Fully Connected | - | - | 4096 | 102,764,544 |
| fc7 | Fully Connected | - | - | 4096 | 16,781,312 |
| fc8 | Fully Connected | - | - | 1000 | 4,097,000 |
| softmax | Softmax | - | - | 1000 | 0 |
| **Total** | | | | | **138,357,544** |

The convolutional layers account for approximately 14.7 million parameters, while the fully connected layers account for approximately 123.6 million parameters. This means that roughly 89% of VGG-16's parameters are concentrated in the fully connected layers.

## What are the limitations of VGG?

While VGG was a significant advancement when it was introduced, the architecture has several well-known limitations.

### Large Memory Footprint

VGG-16's 138 million parameters require approximately 528 MB of storage. During training with a batch size of 128, the model can require upward of 14 GB of GPU memory. This made VGG difficult to train on the hardware available at the time and remains a concern even with modern GPUs when working with larger batch sizes or higher-resolution inputs.
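
The storage figure follows directly from the parameter count. The sketch below assumes 32-bit floating point weights; it also shows that "528 MB" corresponds to binary megabytes (MiB).

```python
PARAMS = 138_357_544          # VGG-16 total from the layer table
BYTES_PER_PARAM = 4           # assumption: 32-bit floating point weights

size_bytes = PARAMS * BYTES_PER_PARAM
print(f"{size_bytes / 2**20:.0f} MiB")   # 528
print(f"{size_bytes / 1e6:.0f} MB")      # 553
```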

### High Computational Cost

With approximately 15.5 billion FLOPs per forward pass, VGG-16 is significantly more expensive to run than architectures that achieve comparable or better accuracy. For example, GoogLeNet uses roughly 10x fewer FLOPs while matching VGG in accuracy.[3] This makes VGG impractical for many real-time and edge computing applications without model compression techniques.

### Parameter Inefficiency

The vast majority of VGG's parameters reside in the three fully connected layers, which contribute relatively little to the network's representational power compared to the convolutional layers. Later architectures like GoogLeNet addressed this by replacing fully connected layers with global average pooling, which dramatically reduced parameter counts.[3]

### Vanishing Gradient Problem

Although VGG successfully trained networks up to 19 layers deep, the authors found that going deeper did not yield substantial improvements.[1] Configuration E (19 layers) performed about on par with Configuration D (16 layers).[1] Deeper plain variants would likely have been harder to train, in part because of the [vanishing gradient problem](/wiki/vanishing_gradient_problem), where gradients diminish as they are backpropagated through many layers. [ResNet](/wiki/resnet)'s skip connections later made much deeper networks trainable.[4]

### Slow Training

Training a single VGG network required two to three weeks on four NVIDIA Titan Black GPUs.[1] The staged initialization approach (training shallow networks first to initialize deeper ones) added even more time to the overall process.

## Why was VGG so influential?

Despite its limitations, VGG has had a lasting influence on the field of deep learning and computer vision.

### Transfer Learning Backbone

VGG-16 and VGG-19 became standard feature extraction backbones in the years following their release. Pre-trained VGG models, available through frameworks like [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [Keras](/wiki/keras), were widely used for transfer learning on tasks with limited labeled data. The features learned by VGG's convolutional layers proved highly transferable to domains including medical image analysis, satellite imagery classification, and fine-grained visual recognition.

### Neural Style Transfer

VGG-19 became the standard network for [neural style transfer](/wiki/neural_style_transfer), as popularized by Gatys et al. in their 2015 paper "A Neural Algorithm of Artistic Style."[5] The hierarchical features captured by different layers of VGG, from low-level textures in early layers to high-level content in deeper layers, make it well-suited for separating and recombining content and style information.[5]

### Object Detection

VGG-16 served as the backbone for several prominent object detection frameworks, including Faster R-CNN (Ren et al., 2015) and the Single Shot MultiBox Detector (SSD, Liu et al., 2016).[6][7] Its rich feature representations provided a strong foundation for detecting and localizing objects in images.

### Perceptual Loss Functions

Pre-trained VGG networks are widely used to define perceptual loss functions for image generation tasks, including super-resolution, image inpainting, and generative adversarial network ([GAN](/wiki/generative_adversarial_network)) training. Instead of measuring pixel-level differences between images, a perceptual loss computes the distance between feature representations extracted by VGG, which tends to yield outputs that align better with human judgments of visual similarity.

### Influence on Later Architectures

VGG's demonstration that depth matters directly influenced the development of subsequent architectures.[1] [ResNet](/wiki/resnet) (2015) built on this insight by introducing skip connections that enabled training networks with over 100 layers.[4] The principle of using small 3x3 filters has been adopted by nearly all modern [CNN](/wiki/convolutional_neural_network) architectures.

The RepVGG architecture (Ding et al., 2021) revisited the VGG-style plain architecture, using structural reparameterization to achieve competitive performance with modern architectures while maintaining VGG's simple, inference-efficient design.[9] RepVGG reported reaching "over 80% top-1 accuracy" on ImageNet, which its authors described as "the first time for a plain model," and ran 83% faster than ResNet-50 on an NVIDIA 1080Ti GPU.[9]

### Educational Value

VGG's uniform, straightforward design makes it one of the most commonly used architectures for teaching deep learning and convolutional neural networks. Its simplicity allows students and practitioners to understand the fundamental building blocks of CNNs without being overwhelmed by the complexity of later architectures like Inception or [Transformer](/wiki/transformer)-based models.

## Is VGG available pre-trained?

Pre-trained VGG models are widely available in major deep learning frameworks:

| Framework | Models Available | Pre-trained Weights |
|---|---|---|
| [PyTorch](/wiki/pytorch) (torchvision) | VGG-11, VGG-13, VGG-16, VGG-19 (with and without batch normalization) | ImageNet-1K |
| [TensorFlow](/wiki/tensorflow) / [Keras](/wiki/keras) | VGG-16, VGG-19 | ImageNet-1K |
| ONNX Model Zoo | VGG-16, VGG-19 | ImageNet-1K |

These pre-trained models enable researchers and practitioners to use VGG as a starting point for new tasks without training from scratch.

## See Also

- [Convolutional Neural Network](/wiki/convolutional_neural_network)
- [AlexNet](/wiki/alexnet)
- [GoogLeNet](/wiki/googlenet)
- [ResNet](/wiki/resnet)
- [ImageNet](/wiki/imagenet)
- [Transfer Learning](/wiki/transfer_learning)
- [Neural Style Transfer](/wiki/neural_style_transfer)

## References

1. Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, May 7-9, 2015. arXiv:1409.1556.
2. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).
3. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). "Going Deeper with Convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015).
4. He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).
5. Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). "Image Style Transfer Using Convolutional Neural Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). Originally released as "A Neural Algorithm of Artistic Style," arXiv:1508.06576 (2015).
6. Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems 28 (NIPS 2015).
7. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision (ECCV 2016).
8. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision, 115(3), 211-252.
9. Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., and Sun, J. (2021). "RepVGG: Making VGG-style ConvNets Great Again." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021). arXiv:2101.03697.
10. Visual Geometry Group, University of Oxford. https://www.robots.ox.ac.uk/~vgg/
11. Lardinois, F. (2014). "Google's DeepMind Acqui-Hires Two AI Teams In The UK, Partners With Oxford." TechCrunch, October 23, 2014.

