BigGAN
Last reviewed
May 1, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 3,702 words
BigGAN is a class-conditional generative adversarial network introduced by researchers at DeepMind in September 2018. It was the first GAN to generate convincingly photorealistic, class-labeled images from ImageNet at resolutions up to 512x512 pixels, and it improved on the previous best Inception Score by roughly a factor of three. The work demonstrated that generative adversarial networks scale well with model size, batch size, and compute, anticipating the same lesson that large language models would teach a few years later.
The paper, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," was written by Andrew Brock, Jeff Donahue, and Karen Simonyan. It appeared on arXiv as 1809.11096 in September 2018 and was accepted as an oral presentation at ICLR 2019. The released models, hosted on TensorFlow Hub, also produced one of the most viral Colab notebook demos of the era and pushed conditional image synthesis from a research curiosity to a tool that hobbyists could play with in a browser.
Andrew Brock was the lead author. At the time he was a PhD student at Heriot-Watt University in Edinburgh and had carried out the work as an intern at DeepMind. Jeff Donahue and Karen Simonyan, both then at DeepMind in London, were the co-authors. Simonyan is also known for the VGG family of convolutional networks; Donahue had previously worked on Caffe and on the original BiGAN (Bidirectional GAN) framework, which would later resurface as the encoder-equipped BigBiGAN follow-up.
The paper was framed not as a single new architecture but as an empirical study of what happens when you take an already-strong GAN, the self-attention GAN (SAGAN) of Zhang et al., and push it as hard as available compute would allow. The team trained on Google TPU v3 pods, used batches eight times larger than SAGAN had used, widened the channels by 50 percent, and added a small set of stabilization tricks. The combination produced a leap in image quality that took the field by surprise.
Before BigGAN, class-conditional generation on the full 1,000-class ImageNet was widely considered an open problem. The state of the art at 128x128 resolution was SAGAN, with an Inception Score (IS) of about 52.5 and a Fréchet Inception Distance (FID) of about 18.7. Earlier conditional GANs produced recognizable but blurry samples whose FIDs sat above 25, and many failure modes (mode collapse, blob-like textures, semantic confusion) were treated as fundamental limits of adversarial training.
BigGAN reported an Inception Score of 166.5 and an FID of 7.4 on ImageNet 128x128. The IS jump from roughly 52 to 166 is more than a 3x improvement over the previous state of the art. Sample fidelity at 512x512 was also unprecedented for any GAN and arguably for any generative model at the time. The takeaway, summarized informally afterward as "GANs scale," reframed the research agenda. It was no longer obvious that better losses or cleverer architectures were needed; many gains seemed to come from simply using larger batches and more parameters with the right stabilization. In retrospect, the same scaling logic would dominate diffusion and language modelling work over the next several years.
The release also normalized publishing pretrained image generators that could be queried interactively. The official Colab notebook, backed by a free GPU, let anyone pick an ImageNet class, drag a slider for a truncation parameter, and watch goldfish, cheeseburgers, and church towers materialize.
BigGAN's generator and discriminator are residual networks built on the SAGAN backbone. The generator takes a 128-dimensional latent noise vector z and a one-hot class label y as input, and it produces an RGB image at the target resolution. The discriminator is a class-conditional projection discriminator: it takes an image plus a class label and outputs a single scalar.
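The discriminator's scalar output and the hinge objective it is trained with can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names, the pooled `features` input, and the toy shapes are assumptions made for the example, and the real networks compute `features` with a deep residual backbone.

```python
import numpy as np

def projection_logit(features, class_ids, w_out, b_out, class_embed):
    """Scalar discriminator output: linear head plus class projection.

    features:    (batch, d) pooled image features (assumed given here)
    class_ids:   (batch,) integer class labels
    w_out:       (d,) weights of the unconditional linear head
    b_out:       scalar bias
    class_embed: (num_classes, d) learned class embeddings
    """
    unconditional = features @ w_out + b_out
    # Dot product between each image's features and its class embedding.
    projection = np.sum(class_embed[class_ids] * features, axis=1)
    return unconditional + projection

def d_hinge_loss(real_logits, fake_logits):
    """Hinge loss for D: push real logits above +1 and fake logits below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_logits))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return real_term + fake_term

def g_hinge_loss(fake_logits):
    """Hinge loss for G: raise the discriminator's score on generated images."""
    return -np.mean(fake_logits)
```

The class label enters only through the dot product, so the discriminator stays a single-output network no matter how many classes there are.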
| Component | Details |
|---|---|
| Backbone | Residual generator and discriminator built from the SAGAN architecture, with spectral normalization on every weight matrix in both networks |
| Adversarial loss | Hinge loss for both G and D, the same loss used by SAGAN |
| Conditioning in G | Class-conditional batch normalization where the gain (gamma) and bias (beta) are linear projections of a shared class embedding combined with a slice of the latent vector |
| Conditioning in D | Projection discriminator (Miyato and Koyama 2018), which dot-products an embedding of y with the discriminator's image features |
| Self-attention | A non-local self-attention block at the 64x64 feature-map resolution in both G and D, inherited from SAGAN |
| Latent code | 128-dimensional z drawn from a standard normal at training time |
| Hierarchical latents ("skip-z") | The latent vector is split into chunks; each chunk is concatenated with the shared class embedding and fed to a different residual block, so noise is injected at every resolution rather than only at the bottom |
| Shared class embedding | A single learned embedding per class is reused across all conditional batch-norm layers instead of a separate embedding per layer, which the authors report reduces compute and memory costs and improves training speed (iterations needed to reach a given performance) by 37 percent |
| Orthogonal regularization | A modified orthogonality penalty applied to generator weights, which the paper reports makes a model much more likely to respond well to the truncation trick |
| Orthogonal initialization | Weight matrices are initialized with orthogonal matrices, which helps stabilize early training at the chosen scale |
| Channel multiplier | 96 in the standard BigGAN configuration; BigGAN-deep uses 128 with a deeper residual stack |
The choice to keep the SAGAN backbone instead of inventing a new one is part of the paper's argument: the authors wanted to isolate the effect of scale and stabilization rather than confound it with a new architecture.
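The conditioning path described in the table (hierarchical latents feeding class-conditional batch normalization through a shared embedding) can be sketched as follows. This is a toy NumPy version under stated assumptions: the sizes are illustrative and chosen so that z splits evenly, one chunk is reserved for the generator's first layer as in common implementations, the `1 + projection` form of the gain is an implementation convention rather than something the text specifies, and random matrices stand in for the learned per-layer projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, chosen so z divides evenly).
NUM_CLASSES, EMBED_DIM, Z_DIM, NUM_BLOCKS = 1000, 128, 120, 5

def split_latent(z, num_blocks):
    """Split z into num_blocks + 1 equal chunks along the feature axis."""
    return np.split(z, num_blocks + 1, axis=1)

def conditional_batch_norm(h, cond, w_gain, w_bias, eps=1e-5):
    """Normalize h per channel, then scale and shift per sample.

    h:    (batch, channels, height, width) feature maps
    cond: (batch, cond_dim) class embedding concatenated with a z chunk
    """
    mean = h.mean(axis=(0, 2, 3), keepdims=True)
    var = h.var(axis=(0, 2, 3), keepdims=True)
    h_norm = (h - mean) / np.sqrt(var + eps)
    gain = 1.0 + cond @ w_gain          # (batch, channels)
    bias = cond @ w_bias                # (batch, channels)
    return gain[:, :, None, None] * h_norm + bias[:, :, None, None]

batch, channels = 4, 16
z = rng.standard_normal((batch, Z_DIM))
labels = rng.integers(0, NUM_CLASSES, size=batch)
shared_embedding = 0.02 * rng.standard_normal((NUM_CLASSES, EMBED_DIM))

chunks = split_latent(z, NUM_BLOCKS)
z_first, z_blocks = chunks[0], chunks[1:]   # z_first would feed the first layer
class_vec = shared_embedding[labels]        # one lookup, reused by every block

h = rng.standard_normal((batch, channels, 8, 8))
for z_chunk in z_blocks:
    cond = np.concatenate([class_vec, z_chunk], axis=1)
    # Random stand-ins for each layer's learned linear projections.
    w_gain = 0.02 * rng.standard_normal((cond.shape[1], channels))
    w_bias = 0.02 * rng.standard_normal((cond.shape[1], channels))
    h = conditional_batch_norm(h, cond, w_gain, w_bias)
```

Only the small projection matrices are per-layer; the class embedding table is stored once, which is the source of the savings attributed to the shared embedding.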
The paper's most quoted contribution is the truncation trick. At training time z is sampled from a standard normal. At sampling time, BigGAN samples z from a truncated normal: any value with absolute magnitude greater than a threshold psi is resampled until it falls inside the truncation range. Some public implementations instead draw from a normal truncated to a fixed range and multiply the result by psi, which shrinks the latent in a similar way.
Lowering psi (more aggressive truncation) concentrates the latent distribution around zero and produces sharper, more class-typical samples at the cost of diversity. Raising psi toward 1.0 restores diversity but also reintroduces noisier and stranger outputs. Plotting IS against FID across a sweep of truncation values traces out a fidelity-variety curve reminiscent of a precision-recall curve, and that curve allowed the BigGAN team to report IS and FID at chosen operating points. The headline IS of 166.5 at 128x128 and IS of 232.5 at 256x256 come from truncated sampling, while the untruncated numbers have a noticeably lower IS.
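A minimal NumPy sketch of truncated sampling is given below. The first function resamples out-of-range entries, as the paper describes. The second is a variant used by some public implementations, where a fixed truncation range is combined with scaling by psi; the range of 2.0 and both function names are assumptions made for illustration.

```python
import numpy as np

def truncated_normal_resample(shape, threshold, rng):
    """Draw N(0, 1) values, resampling any entry with |value| > threshold."""
    z = rng.standard_normal(shape)
    mask = np.abs(z) > threshold
    while mask.any():
        # Redraw only the entries that fell outside the allowed range.
        z[mask] = rng.standard_normal(int(mask.sum()))
        mask = np.abs(z) > threshold
    return z

def scaled_truncated_sample(shape, psi, rng, base_range=2.0):
    """Variant: truncate to [-base_range, base_range], then scale by psi."""
    return psi * truncated_normal_resample(shape, base_range, rng)

rng = np.random.default_rng(0)
z_sharp = truncated_normal_resample((8, 128), threshold=0.5, rng=rng)
z_scaled = scaled_truncated_sample((8, 128), psi=0.4, rng=rng)
```

Either way, smaller settings keep every latent coordinate closer to zero, which is what trades diversity for fidelity. Very small thresholds make the resampling loop slow, because most draws are rejected.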
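The authors observed that truncation is much more effective for models trained with their orthogonal regularization: the paper reports that about 60 percent of such models are amenable to truncation, compared with 16 percent without it. Models that are not amenable produce saturated or distorted outputs at low psi values rather than higher-quality ones, which the authors attribute to the generator not being smooth enough over the full latent space.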
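The modified penalty can be sketched as below, assuming the form used in the paper: the squared Frobenius norm of the off-diagonal entries of the filters' Gram matrix, scaled by a small coefficient. Treating each row as one flattened output filter and using a default coefficient of 1e-4 are assumptions of this example.

```python
import numpy as np

def ortho_penalty(weight, beta=1e-4):
    """Penalize pairwise filter similarity without constraining filter norms.

    weight: array whose first axis indexes output filters; the remaining
            axes are flattened so that each row is one filter.
    """
    w = weight.reshape(weight.shape[0], -1)
    gram = w @ w.T
    # Zero the diagonal: only inner products between different filters count.
    off_diag = gram * (1.0 - np.eye(gram.shape[0]))
    return beta * np.sum(off_diag ** 2)
```

Because the diagonal is masked out, filters that are mutually orthogonal incur no penalty whatever their norms, unlike a standard orthogonality penalty that also pushes every filter toward unit length.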
The "big" in BigGAN refers above all to scale. The paper studied what happens as batch size, channel width, and depth grew. The headline numbers:
| Setting | Standard BigGAN | BigGAN-deep |
|---|---|---|
| Channel multiplier | 96 | 128 |
| Latent dimension | 128 | 128 |
| Parameters (128x128) | ~158 million for generator and discriminator combined (the generator alone is roughly 70 million) | ~50 million in the generator (deeper but narrower per-block) |
| Batch size | 2,048 | 2,048 |
| Hardware | 128 to 512 cores of Google TPU v3 | 128 to 512 TPU v3 cores |
| Training time | Roughly 24 to 48 hours per model | Roughly 24 to 48 hours per model |
| Resolutions trained | 128x128, 256x256, 512x512 | 128x128, 256x256, 512x512 |
The batch size of 2,048 is eight times what SAGAN used. The paper reports that this single change improved IS by roughly 46 percent over the SAGAN baseline before any other modifications were applied. Widening every layer by 50 percent, which roughly doubles the parameter count, produced a further ~21 percent IS bump. The combination of bigger batches, wider networks, and the new stabilization tricks compounded into the final result.
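The way these percentages compound can be checked with a few lines of arithmetic. The inputs are the figures quoted above; the intermediate values are derived from them and are not measurements from the paper.

```python
# Figures quoted in the text; everything else is derived from them.
sagan_is = 52.5          # SAGAN Inception Score at 128x128
biggan_batch = 2048
batch_ratio = 8          # BigGAN batch is eight times SAGAN's
biggan_ch = 96
width_gain = 0.50        # channels widened by 50 percent

sagan_batch = biggan_batch // batch_ratio        # implied SAGAN batch size
baseline_ch = biggan_ch / (1.0 + width_gain)     # implied baseline multiplier

is_after_batch = sagan_is * 1.46                 # +46 percent from batch size
is_after_width = is_after_batch * 1.21           # further +21 percent from width
combined_factor = 1.46 * 1.21                    # relative gains multiply

print(sagan_batch, baseline_ch)
print(round(is_after_batch), round(is_after_width), round(combined_factor, 2))
```

The two scale changes alone take IS from about 52.5 to roughly 93, a factor of about 1.77. The rest of the gap to the headline 166.5 comes from the remaining modifications and from truncated sampling.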
BigGAN-deep is the variant that uses a deeper residual stack with bottleneck blocks. Its channel multiplier is 128, but each residual block carries fewer features inside, so the parameter counts of the released BigGAN-deep checkpoints (50.4M, 55.9M, and 56.2M for 128, 256, and 512) are smaller than the standard BigGAN configuration. Despite the smaller parameter count, BigGAN-deep typically reaches better FID, suggesting that depth and bottleneck structure matter more than raw width past a certain point.
All numbers below are class-conditional ImageNet results, reported by the original BigGAN paper or by widely used reimplementations (the OpenMMLab MMagic conversions of the official DeepMind weights).
| Model | Resolution | Inception Score | FID | Source |
|---|---|---|---|---|
| Best prior to BigGAN (SAGAN) | 128x128 | 52.5 | 18.65 | Zhang et al. 2018 (SAGAN) |
| BigGAN | 128x128 | 166.5 | 7.4 | Brock et al. 2018 (with truncation) |
| BigGAN | 256x256 | 232.5 | 8.1 | Brock et al. 2018 (with truncation) |
| BigGAN | 512x512 | 241.5 | 11.5 | Brock et al. 2018 (with truncation) |
| BigGAN-deep (converted weights) | 128x128 | ~107 | ~5.95 | OpenMMLab MMagic |
| BigGAN-deep (converted weights) | 256x256 | ~135 | ~11.3 | OpenMMLab MMagic |
| BigGAN-deep (converted weights) | 512x512 | ~124 | ~16.9 | OpenMMLab MMagic |
A few caveats are worth flagging. Inception Score and FID depend heavily on the exact evaluation pipeline (which Inception network, what preprocessing, how many samples), and reported values vary noticeably between sources. The 166.5 IS at 128x128 is the canonical headline number from the paper; the OpenReview abstract version of the paper at one point listed an IS of 166.3 and FID of 9.6 for the same resolution. Differences of one or two IS points or one FID point should be read as evaluation noise, not as model changes.
The BigGAN-deep numbers above come from the publicly converted weights as evaluated by MMagic; the paper itself reports somewhat better numbers under its own evaluation protocol.
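To make the dependence on the evaluation pipeline concrete, here is a sketch of how the two metrics are computed once features and class probabilities have been extracted. The Inception network itself is out of scope: `feats_real`, `feats_fake`, and `probs` are assumed to be its outputs, and the function names are illustrative.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (num_samples, feature_dim) arrays of network activations.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_fake, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) is the sum of square roots of the eigenvalues
    # of cov_r @ cov_g, which are real and non-negative up to rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: (num_samples, num_classes) softmax outputs for generated images.
    """
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Both functions depend only on the statistics of whatever features are passed in, so changing the feature extractor, the preprocessing, or the number of samples changes the reported number even for an identical generator.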
Because truncation gives a knob that trades quality for diversity, the paper devotes substantial space to analyzing it.
| Truncation psi | Behavior |
|---|---|
| 1.0 (no truncation) | Maximum diversity; FID is at or near its best (lowest); samples include odd colors, distorted textures, and rare class members |
| ~0.5 | Sweet spot for many classes; sharp images with reasonable variety |
| ~0.2 to 0.4 | Headline IS numbers in the paper; very sharp, very class-typical samples; many classes lose diversity |
| Very low (~0.04) | Samples collapse toward a single prototype per class, often saturated and unnatural |
The authors also show that some classes benefit from truncation more than others. Classes with simple, dominant visual structure (single-object centered photographs, like a goldfish on a black background) tolerate aggressive truncation well. Classes with cluttered scenes or multiple objects (groceries, indoor scenes with many participants) degrade more quickly: the generator produces saturated single-color blobs rather than a recognizable scene at low psi.
BigGAN was good enough that its specific failure modes became a topic of study in their own right.
| Failure mode | Description |
|---|---|
| Class leakage | Some samples mix concepts from related classes. A "monarch butterfly" sample might pick up flower textures from co-occurring training images, or a dog breed might bleed in features of a visually similar breed. |
| Training collapse | At sufficiently long training, BigGAN runs spontaneously go bad: a single discriminator update produces a huge gradient, the generator's weights spike, and sample quality drops sharply within a few thousand iterations. The paper documents that singular values of weight matrices grow unboundedly through training and recommends checkpoint reloading and early stopping. |
| Multi-object scenes | Dense scenes with many small objects (grocery stores, computer keyboards, crowds) remain hard, even at 512x512. |
| Faces and humans | Faces in unusual poses, hands, and full human bodies are noticeably worse than animals, foods, and landscapes. |
| Unbalanced fidelity-diversity | At low truncation, individual classes can collapse to one or two prototypes; at high truncation, FID looks fine but individual samples are noisier and less appealing. |
The "training collapse" finding was particularly important for the field. The paper showed empirically that scaling up does not by itself stabilize training, and that the regularizers known at the time (gradient penalties such as R1 on the discriminator) could prevent collapse only at strengths that substantially reduced sample quality. The pragmatic recommendation was to monitor weight singular values and stop or roll back before collapse hits.
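A monitoring loop of this kind can be sketched as follows. The paper tracks the top singular values of each weight matrix; the rollback rule shown here, including the growth factor and window size, is a hypothetical heuristic for illustration and not a criterion taken from the paper.

```python
import numpy as np

def top_singular_values(weight, k=3):
    """Return the k largest singular values of a (flattened) weight tensor."""
    w = weight.reshape(weight.shape[0], -1)
    return np.linalg.svd(w, compute_uv=False)[:k]

def should_roll_back(sigma_history, growth_factor=2.0, window=3):
    """Flag a spike in the leading singular value.

    sigma_history: leading singular value logged at successive checkpoints.
    Returns True when the latest reading exceeds growth_factor times the
    mean of the previous `window` readings (hypothetical rule).
    """
    if len(sigma_history) <= window:
        return False
    baseline = np.mean(sigma_history[-window - 1:-1])
    return bool(sigma_history[-1] > growth_factor * baseline)
```

In practice such a check would run at each checkpoint for every generator layer, with training restored from the last checkpoint saved before the flag was raised.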
BigGAN spawned a small ecosystem of derivatives and motivated several major follow-ups.
| Name | Year | Authors | Relationship to BigGAN |
|---|---|---|---|
| BigGAN-deep | 2018 | Brock, Donahue, Simonyan | Deeper bottlenecked variant introduced in the same paper |
| BigBiGAN | 2019 | Donahue, Simonyan (DeepMind) | Adds an encoder to BigGAN to do bidirectional adversarial learning, used for unsupervised representation learning on ImageNet |
| LOGAN | 2019 | Wu et al. (DeepMind) | Latent optimization on top of BigGAN, improves IS / FID further |
| StyleGAN-XL | 2022 | Sauer, Schwarz, Geiger | Scales StyleGAN3 to ImageNet using projected GAN tricks; first GAN to push past BigGAN at higher resolutions on the same dataset |
| GigaGAN | 2023 | Kang et al. (Adobe, CMU, POSTECH) | A 1B-parameter scaled GAN extended to text-to-image generation, the spiritual successor to BigGAN's "GANs scale" thesis |
BigBiGAN deserves a special mention. By pairing a BigGAN generator with an encoder and a joint discriminator over images and latents, Donahue and Simonyan showed that the same scale that drove sample quality also produced strong unsupervised image features: representations from the BigBiGAN encoder reached competitive ImageNet linear-probe accuracy, briefly putting GANs back in the conversation for self-supervised representation learning before contrastive methods (SimCLR, MoCo) and later masked autoencoders pulled ahead.
The table below compares BigGAN to other notable generative models evaluated on class-conditional ImageNet. Numbers are taken from each paper or from widely cited reimplementations; FID is the standard 50K-sample FID. Class-conditional ImageNet at 256x256 is the most directly comparable setting across these models.
| Model | Type | Year | Parameters | Resolution | FID |
|---|---|---|---|---|---|
| SAGAN | GAN | 2018 | ~80M | 128x128 | 18.7 |
| BigGAN | GAN | 2018 | ~158M (G and D, 128) | 128x128 | 7.4 |
| BigGAN | GAN | 2018 | (256 config) | 256x256 | 8.1 |
| BigGAN-deep | GAN | 2018 | ~56M (G, 256) | 256x256 | ~6.95 (paper) |
| ADM (Dhariwal and Nichol) | Diffusion | 2021 | ~552M | 256x256 | 10.94 |
| ADM-G (with classifier guidance) | Diffusion | 2021 | ~552M | 256x256 | 4.59 |
| StyleGAN-XL | GAN | 2022 | ~166M | 256x256 | 2.30 |
| MaskGIT | Masked transformer (non-autoregressive) | 2022 | ~227M | 256x256 | 6.18 |
| DiT-XL/2 | Diffusion Transformer | 2022 | ~675M | 256x256 | 2.27 |
| GigaGAN | GAN | 2023 | ~570M (class-conditional model) | 256x256 | 3.45 |
| VAR | Visual autoregressive | 2024 | ~2B | 256x256 | 1.80 |
A few observations stand out. BigGAN held the class-conditional ImageNet FID record for almost three years, until ADM and classifier guidance dethroned it in 2021. StyleGAN-XL was the first GAN to surpass BigGAN convincingly on ImageNet at high resolution. GigaGAN is the largest scaled GAN to date and the closest direct continuation of BigGAN's scaling lineage. Modern diffusion transformers and visual autoregressive models now hold the FID frontier, but they spend much more compute per generated image than BigGAN does.
| Resource | Where | Notes |
|---|---|---|
| Original TF Hub models | tfhub.dev/deepmind/biggan-128, biggan-256, biggan-512 | Released by DeepMind in 2018; standard BigGAN generators |
| BigGAN-deep TF Hub models | tfhub.dev/deepmind/biggan-deep-128, biggan-deep-256, biggan-deep-512 | Higher quality variants released alongside |
| Colab demo | The TF Hub BigGAN generation notebook | The viral 2018 demo that introduced many people to GAN-generated images |
| Author's PyTorch port | github.com/ajbrock/BigGAN-PyTorch | "Officially unofficial" port by the lead author; supports up to 8 GPUs and gradient accumulation to reach a 2,048 batch size |
| Hugging Face PyTorch port | github.com/huggingface/pytorch-pretrained-BigGAN | Inference-only; converts the DeepMind TF weights and ships pre-computed batch-norm statistics for 51 truncation values |
| MMagic / OpenMMLab | github.com/open-mmlab/mmagic | Reference implementations and converted weights for both BigGAN and BigGAN-deep |
| BigBiGAN models | tfhub.dev/deepmind/bigbigan-resnet50 etc. | The encoder-equipped follow-up; useful for unsupervised representation learning |
The combination of free Colab access and the truncation slider made BigGAN one of the first widely shared neural-image-generation experiences. Artists like Memo Akten and Mario Klingemann built early generative-art series on top of BigGAN, and many of the techniques that later spread through diffusion-based art (latent walks, class interpolation, controlled sampling) were first popularized in this period.
BigGAN sits at an inflection point. Before it, GAN research was dominated by debates over losses and regularizers (Wasserstein GAN, R1, hinge), tricks for mode collapse, and architecture searches that often shaved a couple of FID points off the previous best. After BigGAN, the dominant story was scale: bigger batches, more parameters, more compute, with a small stabilization toolkit. That story carried through to GPT-3 in language modelling and to Stable Diffusion in image generation.
The specific architectural choices BigGAN inherited or popularized (spectral normalization in both networks, class-conditional batch normalization with shared embeddings, hinge loss, self-attention at intermediate resolutions, the truncation trick for sampling) became default ingredients in subsequent class-conditional GAN work. Even work that ultimately moved past GANs, including the diffusion-model literature, has treated BigGAN as evidence that conditional generation at ImageNet scale was tractable in the first place.
The paper has been cited thousands of times since publication. It is regularly used as a reference baseline for new conditional image-generation work, even though direct comparisons are now made primarily against diffusion and visual autoregressive models. For practitioners building production image-generation systems, BigGAN itself is rarely deployed today (its complexity and training cost made it impractical, and StyleGAN-family models proved easier to deploy for unconditional or text-driven use). Its conceptual influence remains everywhere.