BigGAN

Updated Jun 24, 2026

BigGAN is a class-conditional generative adversarial network that, when introduced by DeepMind researchers Andrew Brock, Jeff Donahue, and Karen Simonyan in 2018, set a new state of the art for AI image generation by scaling GANs to large batch sizes and parameter counts. Trained on ImageNet at 128x128 resolution, BigGAN reached an Inception Score (IS) of 166.5 and a Fréchet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.65, roughly a threefold jump in Inception Score ^[1]. Its core lesson, that adversarial generation scales with model size, batch size, and compute, anticipated the scaling story that large language models would later make famous.

The paper, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," appeared on arXiv as 1809.11096 on September 28, 2018, and was accepted as an oral presentation at ICLR 2019 ^[1]. The released models, hosted on TensorFlow Hub, also powered one of the most widely shared Colab notebook demos of the era and helped move conditional image synthesis from a research curiosity to a tool that hobbyists could try in a browser ^[11].

Who created BigGAN?

Andrew Brock was the lead author. At the time he was a PhD student at Heriot-Watt University in Edinburgh and carried out the work during an internship at DeepMind. Jeff Donahue and Karen Simonyan, both then at DeepMind in London, were the co-authors ^[1]. Simonyan is also known for the VGG family of convolutional networks; Donahue had previously worked on Caffe and on the original BiGAN (Bidirectional GAN) framework, which would later resurface as the encoder-equipped BigBiGAN follow-up ^[5].

The paper was framed not as a single new architecture but as an empirical study of what happens when you take an already-strong GAN, the self-attention GAN (SAGAN) of Zhang et al., and push it as hard as available compute would allow ^[3]. The team trained on Google TPU v3 pods, used batches eight times larger than SAGAN had used, widened the channels by 50 percent, and added a small set of stabilization tricks ^[1]. The combination produced a leap in image quality that took the field by surprise.

Why does BigGAN matter?

Before BigGAN, class-conditional generation on the full 1,000-class ImageNet was widely considered an open problem. The state of the art at 128x128 resolution was SAGAN, with an Inception Score (IS) of about 52.5 and a Fréchet Inception Distance (FID) of about 18.7 ^[3]. Earlier conditional GANs produced recognizable but blurry samples whose FIDs sat above 25, and many failure modes (mode collapse, blob-like textures, semantic confusion) were treated as fundamental limits of adversarial training.

BigGAN reported an Inception Score of 166.5 and an FID of 7.4 on ImageNet 128x128 ^[1]. The IS jump from roughly 52 to 166 is a 3x improvement on a metric that many had assumed was approaching saturation. Sample fidelity at 512x512 was also unprecedented for any GAN and arguably for any generative model at the time. The takeaway, summarized informally afterward as "GANs scale," reframed the research agenda. It was no longer obvious that better losses or more clever architectures were needed; many gains seemed to come from simply using more compute, larger batches, and more parameters with the right stabilization. In retrospect, the same scaling logic would dominate diffusion and language modelling work over the next several years.

The release also normalized publishing pretrained image generators that could be queried interactively. The official Colab notebook, backed by a free GPU, let anyone pick an ImageNet class, drag a slider for a truncation parameter, and watch goldfish, cheeseburgers, and church towers materialize ^[11].

Architecture

BigGAN's generator and discriminator are residual networks built on the SAGAN backbone. The generator takes a 128-dimensional latent noise vector z and a one-hot class label y as input, and it produces an RGB image at the target resolution. The discriminator is a class-conditional projection discriminator: it takes an image plus a class label and outputs a single scalar ^[1]^[4].
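The projection discriminator's scalar output can be sketched as an unconditional score plus a dot product between a class embedding and the image features. The snippet below is a minimal NumPy illustration, not the released implementation: the function name, the feature size of 16, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def projection_logit(features, labels, class_embeddings, w_out, b_out=0.0):
    """Scalar discriminator output per image: unconditional term plus class projection.

    features:         (batch, d) pooled image features from the discriminator body
    labels:           (batch,) integer class ids
    class_embeddings: (num_classes, d) one learned vector per class
    w_out:            (d,) weights of the final linear layer
    """
    unconditional = features @ w_out + b_out          # label-independent realism score
    embedded = class_embeddings[labels]               # (batch, d) embedding of each label
    projection = np.sum(embedded * features, axis=1)  # dot product of embedding and features
    return unconditional + projection

# Hypothetical toy inputs, for illustration only.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))
labels = np.array([1, 7, 7, 3])
class_embeddings = rng.normal(size=(10, 16))
w_out = rng.normal(size=16)

logits = projection_logit(features, labels, class_embeddings, w_out)
print(logits.shape)  # (4,)
```

The projection term is large only when the image features align with the embedding of the claimed class, which is how the label influences the real-versus-fake decision.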

Key design choices

Component	Details
Backbone	Residual generator and discriminator built from the SAGAN architecture, with spectral normalization on every weight matrix in both networks
Adversarial loss	Hinge loss for both G and D, the same loss used by SAGAN
Conditioning in G	Class-conditional batch normalization where the gain (gamma) and bias (beta) are linear projections of a shared class embedding combined with a slice of the latent vector
Conditioning in D	Projection discriminator (Miyato and Koyama 2018), which dot-products an embedding of y with the discriminator's image features
Self-attention	A non-local self-attention block at the 64x64 feature-map resolution in both G and D, inherited from SAGAN
Latent code	128-dimensional z drawn from a standard normal at training time
Hierarchical latents ("skip-z")	The latent vector is split into chunks; each chunk is concatenated with the shared class embedding and fed to a different residual block, so noise is injected at every resolution rather than only at the bottom
Shared class embedding	A single learned embedding per class is reused across all conditional batch-norm layers, which the authors report reduces compute and memory costs and speeds up training by roughly 37 percent compared to per-layer embeddings
Orthogonal regularization	A modified orthogonality penalty applied to generator weights, which the paper shows is what makes the truncation trick work cleanly
Orthogonal initialization	Weight matrices are initialized with orthogonal matrices, which helps stabilize early training at the chosen scale
Channel multiplier	96 in the standard BigGAN configuration; BigGAN-deep uses 128 with a deeper residual stack
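The conditioning rows in the table (class-conditional batch norm, hierarchical latents, shared embedding) fit together roughly as in the sketch below. It is a simplified NumPy illustration under stated assumptions: four blocks with equal latent chunks, a 128-dimensional shared embedding, random projection weights, and a gain centered at 1 are choices made for the example, not details confirmed by the text above.

```python
import numpy as np

def conditional_batch_norm(x, cond, w_gain, w_bias, eps=1e-5):
    """Normalize x per channel, then apply a gain and bias predicted from cond.

    x:      (batch, channels, height, width) activations
    cond:   (batch, cond_dim) shared class embedding concatenated with a z chunk
    w_gain: (cond_dim, channels) linear projection producing per-channel gains
    w_bias: (cond_dim, channels) linear projection producing per-channel biases
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    gain = 1.0 + cond @ w_gain  # assumption: centered at 1 so zero weights give plain batch norm
    bias = cond @ w_bias
    return x_hat * gain[:, :, None, None] + bias[:, :, None, None]

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
batch, z_dim, embed_dim, num_blocks, channels = 8, 128, 128, 4, 16

z = rng.normal(size=(batch, z_dim))
shared_embedding = rng.normal(size=(1000, embed_dim))  # one vector per class, reused everywhere
labels = rng.integers(0, 1000, size=batch)
class_vec = shared_embedding[labels]

z_chunks = np.split(z, num_blocks, axis=1)  # "skip-z": one chunk of the latent per block

x = rng.normal(size=(batch, channels, 8, 8))
for chunk in z_chunks:
    cond = np.concatenate([class_vec, chunk], axis=1)  # same embedding, different noise chunk
    w_gain = 0.01 * rng.normal(size=(cond.shape[1], channels))
    w_bias = 0.01 * rng.normal(size=(cond.shape[1], channels))
    x = conditional_batch_norm(x, cond, w_gain, w_bias)

print(x.shape)  # (8, 16, 8, 8)
```

Each block sees the same class vector but a different slice of z, so noise can influence features at every resolution while only one embedding table has to be learned.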

The choice to keep the SAGAN backbone instead of inventing a new one is part of the paper's argument: the authors wanted to isolate the effect of scale and stabilization rather than confound it with a new architecture ^[1].
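For reference, the hinge loss inherited from SAGAN can be written in a few lines. This is a minimal NumPy sketch of the standard hinge formulation for GANs; the function names and the toy logits are illustrative assumptions.

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push real logits above +1 and fake logits below -1."""
    loss_real = np.mean(np.maximum(0.0, 1.0 - real_logits))
    loss_fake = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return loss_real + loss_fake

def g_hinge_loss(fake_logits):
    """Generator loss: raise the discriminator's score on generated samples."""
    return -np.mean(fake_logits)

# Toy discriminator outputs, for illustration only.
real_logits = np.array([2.0, 0.5])
fake_logits = np.array([-3.0, 0.0])

print(d_hinge_loss(real_logits, fake_logits))  # 0.75
print(g_hinge_loss(fake_logits))               # 1.5
```

Samples that are already classified with a margin of at least 1 contribute nothing to the discriminator loss, so its gradient concentrates on the examples it still gets wrong or only barely right.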

What is the truncation trick?

The paper's most quoted contribution is the truncation trick. The authors state that "applying orthogonal regularization to the generator renders it amenable to a simple 'truncation trick,' allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input" ^[1]. At training time z is sampled from a standard normal. At sampling time, z is instead drawn from a truncated normal: any value whose magnitude exceeds a chosen threshold is resampled until it falls inside the truncation range. The released sampling code expresses the same idea through a single truncation parameter psi, drawing from a normal truncated to a fixed range and then scaling the result by psi, so a smaller psi means a narrower latent distribution.

Lowering psi (more aggressive truncation) concentrates the latent distribution around zero and produces sharper, more class-typical samples at the cost of diversity. Raising psi toward 1.0 restores diversity but also reintroduces noisier and weirder outputs. Plotting IS against FID across a sweep of truncation values traces out a fidelity-variety curve, analogous to a precision-recall curve, and that curve allowed the BigGAN team to report IS or FID at whatever operating point they preferred. The headline IS of 166.5 at 128x128 and IS of 232.5 at 256x256 sit at low truncation values, while the untruncated (psi = 1) Inception Scores are noticeably lower ^[1].
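Truncated sampling is simple enough to sketch directly. The NumPy version below resamples out-of-range values and then scales by psi; the fixed bound of 2.0 follows a convention used in common reimplementations and is an assumption here, as are the function and parameter names.

```python
import numpy as np

def truncated_z(batch_size, dim_z=128, psi=0.5, bound=2.0, seed=None):
    """Draw z from a standard normal truncated to [-bound, bound], then scale by psi.

    Values outside the range are resampled (not clipped) until they fall inside.
    The bound of 2.0 is an assumed convention, not a value taken from the text above.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch_size, dim_z))
    out_of_range = np.abs(z) > bound
    while out_of_range.any():
        z[out_of_range] = rng.standard_normal(int(out_of_range.sum()))
        out_of_range = np.abs(z) > bound
    return psi * z

# Smaller psi gives a narrower latent distribution.
for psi in (1.0, 0.5, 0.2):
    z = truncated_z(1000, psi=psi, seed=0)
    print(psi, float(np.abs(z).max()) <= psi * 2.0, round(float(z.std()), 3))
```

Resampling rather than clipping avoids piling probability mass at the boundary, so the truncated latents still look like a (narrower) bell curve.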

The authors observed that truncation is much more effective for models trained with their orthogonal regularization. Without it, many models simply produce saturated or distorted outputs at low psi values rather than higher-quality ones, because the generator's output manifold is not smooth enough ^[1].
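A modified orthogonality penalty of the kind described here can be sketched as a penalty on only the off-diagonal entries of a weight matrix's Gram matrix, which discourages filters from being correlated without forcing their norms to 1. The NumPy code below is an illustration under assumptions: the flattening of convolution kernels to 2-D, the row-wise Gram matrix, and the default strength `beta=1e-4` are choices made for the example.

```python
import numpy as np

def ortho_penalty(w, beta=1e-4):
    """beta * squared Frobenius norm of the off-diagonal part of W W^T.

    w may be a dense matrix or a conv kernel; it is flattened to (out_features, -1).
    Only pairwise similarity between filters is penalized, not each filter's norm.
    """
    w2d = w.reshape(w.shape[0], -1)
    gram = w2d @ w2d.T
    off_diagonal = gram * (1.0 - np.eye(gram.shape[0]))
    return beta * float(np.sum(off_diagonal ** 2))

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # q has orthonormal rows and columns

print(ortho_penalty(q) < 1e-12)              # orthogonal filters: essentially no penalty
print(ortho_penalty(3.0 * q) < 1e-10)        # rescaled filters are still unpenalized
print(ortho_penalty(rng.normal(size=(8, 8))) > 0)  # correlated filters are penalized
```

Leaving the diagonal out is the relaxation: a strict orthogonality penalty would also pin every filter's norm, which is a stronger constraint than decorrelation alone.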

How big is BigGAN?

The "big" in BigGAN refers above all to scale. The paper studied what happens as batch size, channel width, and depth grew. The headline numbers:

Setting	Standard BigGAN	BigGAN-deep
Channel multiplier	96	128
Latent dimension	128	128
Generator parameters (128x128)	~158 million	~50 million (deeper but narrower per-block)
Batch size	2,048	2,048
Hardware	128 to 512 cores of Google TPU v3	128 to 512 TPU v3 cores
Training time	Roughly 24 to 48 hours per model	Roughly 24 to 48 hours per model
Resolutions trained	128x128, 256x256, 512x512	128x128, 256x256, 512x512

The batch size of 2,048 is eight times what SAGAN used. The paper reports that this single change improved IS by roughly 46 percent over the SAGAN baseline before any other modifications were applied ^[1]. Increasing the channel width by another 50 percent produced a further ~21 percent IS bump ^[1]. The combination of bigger batches, wider networks, and the new stabilization tricks compounded into the final result.
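As a rough sanity check on how these gains compound, the quoted percentages can be multiplied together. This is back-of-the-envelope arithmetic on the rounded figures quoted above, not a number reported by the paper; the variable names are illustrative.

```python
# Rounded figures quoted in the text above.
biggan_batch = 2048
sagan_batch = biggan_batch // 8      # implied SAGAN batch size
gain_from_batch = 1.46               # ~46 percent IS improvement from the larger batch
gain_from_width = 1.21               # ~21 percent further improvement from wider channels
sagan_is = 52.52                     # SAGAN baseline Inception Score

combined = gain_from_batch * gain_from_width

print(sagan_batch)                   # 256
print(round(combined, 2))            # 1.77
print(round(sagan_is * combined, 1)) # 92.8
```

Scale alone therefore accounts for a large share of the improvement over the baseline; the remaining gap to the headline Inception Score comes from the architectural and stabilization changes and from truncated sampling.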

BigGAN-deep is the variant that uses a deeper residual stack with bottleneck blocks. Its channel multiplier is 128, but each residual block carries fewer features inside, so the parameter counts of the released BigGAN-deep checkpoints (50.4M, 55.9M, and 56.2M for 128, 256, and 512) are smaller than the standard BigGAN configuration ^[1]^[14]. Despite the smaller parameter count, BigGAN-deep typically reaches better FID, suggesting that depth and bottleneck structure matter more than raw width past a certain point.

How good are BigGAN's results?

All numbers below are class-conditional ImageNet results, reported by the original BigGAN paper or by widely used reimplementations (the OpenMMLab MMagic conversions of the official DeepMind weights).

Model	Resolution	Inception Score	FID	Source
Best prior to BigGAN (SAGAN)	128x128	52.5	18.65	Zhang et al. 2018 (SAGAN) ^[3]
BigGAN	128x128	166.5	7.4	Brock et al. 2018, Table 1 ^[1]
BigGAN	256x256	232.5	8.1	Brock et al. 2018 (with truncation) ^[1]
BigGAN	512x512	241.5	11.5	Brock et al. 2018 (with truncation) ^[1]
BigGAN-deep (converted weights)	128x128	~107	~5.95	OpenMMLab MMagic ^[14]
BigGAN-deep (converted weights)	256x256	~135	~11.3	OpenMMLab MMagic ^[14]
BigGAN-deep (converted weights)	512x512	~124	~16.9	OpenMMLab MMagic ^[14]

A few caveats are worth flagging. Inception Score and FID depend heavily on the exact evaluation pipeline (which Inception network, what preprocessing, how many samples), and reported values vary noticeably between sources ^[9]^[10]. The 166.5 IS at 128x128 is the canonical headline number from the arXiv version of the paper's main table; the OpenReview camera-ready abstract instead lists an IS of 166.3 and FID of 9.6 for the same resolution, computed with a different protocol ^[1]. Differences of one or two IS points or one FID point should be read as evaluation noise, not as model changes.

The BigGAN-deep numbers above come from the publicly converted weights as evaluated by MMagic; the paper itself reports somewhat better numbers under its own evaluation protocol ^[14].
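Part of the reason FID values differ between pipelines is that the metric is computed from feature statistics, so any change to the feature extractor, preprocessing, or sample count changes the inputs. The sketch below shows the Fréchet distance between two Gaussians fitted to feature sets. It is a minimal NumPy illustration: real evaluations use Inception network features rather than the toy inputs here, and the function names are assumptions.

```python
import numpy as np

def feature_stats(features):
    """Mean vector and covariance matrix of a (num_samples, dim) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2)).

    The trace of the matrix square root is computed from the eigenvalues of
    sigma1 @ sigma2, which are real and non-negative for valid covariances.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

# Toy check: identical covariances, means 5 units apart, so the distance is 25.
mu_a = np.zeros(4)
mu_b = np.array([3.0, 4.0, 0.0, 0.0])
identity = np.eye(4)
print(frechet_distance(mu_a, identity, mu_b, identity))  # 25.0

# Toy "real" and "generated" features, for illustration only.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.3, size=(500, 8))
print(frechet_distance(*feature_stats(real), *feature_stats(fake)) > 0)
```

Because the covariance estimate depends on how many samples are used, comparing an FID computed on 10K samples with one computed on 50K samples is not meaningful even with the same model.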

How does the truncation trick trade quality for diversity?

Because truncation gives a knob that trades quality for diversity, the paper devotes substantial space to analyzing it ^[1].

Truncation psi	Behavior
1.0 (no truncation)	Maximum diversity; FID is best (lowest); samples include odd colors, distorted textures, and rare class members
~0.5	Sweet spot for many classes; sharp images with reasonable variety
~0.2 to 0.4	Headline IS numbers in the paper; very sharp, very class-typical samples; many classes lose diversity
Very low (~0.04)	Samples collapse toward a single prototype per class, often saturated and unnatural

The authors also show that some classes benefit from truncation more than others. Classes with simple, dominant visual structure (single-object centered photographs, like a goldfish on a black background) tolerate aggressive truncation well. Classes with cluttered scenes or multiple objects (groceries, indoor scenes with many participants) degrade more quickly: the generator produces saturated single-color blobs rather than a recognizable scene at low psi ^[1].

What are BigGAN's failure modes?

BigGAN was good enough that its specific failure modes became a topic of study in their own right.

Failure mode	Description
Class leakage	Some samples mix concepts from related classes. A "monarch butterfly" sample might pick up flower textures from co-occurring training images, or a dog breed might bleed in features of a visually similar breed.
Training collapse	At sufficiently long training, BigGAN runs spontaneously go bad: a single discriminator update produces a huge gradient, the generator's weights spike, and sample quality drops sharply within a few thousand iterations. The paper documents that singular values of weight matrices grow unboundedly through training and recommends checkpoint reloading and early stopping.
Multi-object scenes	Dense scenes with many small objects (grocery stores, computer keyboards, crowds) remain hard, even at 512x512.
Faces and humans	Faces in unusual poses, hands, and full human bodies are noticeably worse than animals, foods, and landscapes.
Unbalanced fidelity-diversity	At low psi (aggressive truncation), individual classes can collapse to one or two prototypes; at psi near 1 (little or no truncation), FID looks fine but individual samples are noisier and less appealing.

The "training collapse" finding was particularly important for the field. The paper showed empirically that scaling up does not magically stabilize training, and that the standard discriminator regularizers known at the time (gradient penalties such as R1) could hold off collapse only when applied strongly enough to cost a substantial amount of sample quality. The pragmatic recommendation was to monitor weight singular values and stop or roll back before collapse hits ^[1].
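Monitoring weight singular values, as recommended above, can be done cheaply with power iteration. The sketch below estimates the largest singular value of a layer and flags a sudden jump relative to recent history. The spike rule (latest value more than twice the recent median) is a hypothetical heuristic invented for this example, not a criterion from the paper, and all names are illustrative.

```python
import numpy as np

def top_singular_value(w, n_iters=200, seed=0):
    """Estimate the largest singular value of w by power iteration."""
    w2d = w.reshape(w.shape[0], -1)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w2d.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = w2d @ v
        u /= np.linalg.norm(u)
        v = w2d.T @ u
        v /= np.linalg.norm(v)
    return float(u @ w2d @ v)

def spiked(history, window=5, factor=2.0):
    """Hypothetical rule: is the latest value more than factor x the recent median?"""
    if len(history) <= window:
        return False
    baseline = np.median(history[-window - 1:-1])
    return bool(history[-1] > factor * baseline)

rng = np.random.default_rng(1)
w = rng.normal(size=(32, 64))
estimate = top_singular_value(w)
exact = np.linalg.svd(w, compute_uv=False)[0]
print(abs(estimate - exact) / exact < 1e-2)  # power iteration agrees with a full SVD

print(spiked([1.0, 1.0, 1.0, 1.0, 1.0, 1.1]))  # False
print(spiked([1.0, 1.0, 1.0, 1.0, 1.0, 5.0]))  # True
```

In a training loop, one would log the estimate for each layer every few hundred steps and keep recent checkpoints so that a flagged spike can be answered with a rollback.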

Variants and follow-up work

BigGAN spawned a small ecosystem of derivatives and motivated several major follow-ups.

Name	Year	Authors	Relationship to BigGAN
BigGAN-deep	2018	Brock, Donahue, Simonyan	Deeper bottlenecked variant introduced in the same paper ^[1]
BigBiGAN	2019	Donahue, Simonyan (DeepMind)	Adds an encoder to BigGAN to do bidirectional adversarial learning, used for unsupervised representation learning on ImageNet ^[5]
LOGAN	2019	Wu et al. (DeepMind)	Latent optimization on top of BigGAN, improves IS / FID further
StyleGAN-XL	2022	Sauer, Schwarz, Geiger	Scales StyleGAN3 to ImageNet using projected GAN tricks; first model to push past BigGAN at higher resolutions on the same dataset ^[6]
GigaGAN	2023	Kang et al. (Adobe, CMU, POSTECH)	A 1B-parameter scaled GAN extended to text-to-image generation, the spiritual successor to BigGAN's "GANs scale" thesis ^[7]

BigBiGAN deserves a special mention. By pairing a BigGAN generator with an image encoder and a joint discriminator over image-latent pairs, Donahue and Simonyan showed that the same scale that drove sample quality also produced strong unsupervised image features: representations from the BigBiGAN encoder reached competitive ImageNet linear-probe accuracy, briefly putting GANs back in the conversation for self-supervised representation learning before contrastive methods (SimCLR, MoCo) and later masked autoencoders pulled ahead ^[5].

How does BigGAN compare with other generative models on ImageNet?

The table below compares BigGAN to other notable generative models evaluated on class-conditional ImageNet. Numbers are taken from each paper or from widely cited reimplementations; FID is the standard 50K-sample FID. Class-conditional ImageNet at 256x256 is the most directly comparable setting across these models.

Model	Type	Year	Parameters	Resolution	FID
SAGAN	GAN	2018	~80M	128x128	18.7
BigGAN	GAN	2018	~158M (G, 128)	128x128	7.4
BigGAN	GAN	2018	(256 config)	256x256	8.1
BigGAN-deep	GAN	2018	~56M (G, 256)	256x256	~6.95 (paper)
ADM-G (Dhariwal and Nichol, classifier guidance)	Diffusion	2021	~552M	256x256	4.59
ADM-G + ADM-U (guidance plus upsampling)	Diffusion	2021	~552M	256x256	3.94
StyleGAN-XL	GAN	2022	~166M	256x256	2.30
MaskGIT	Masked AR	2022	~227M	256x256	6.18
DiT-XL/2	Diffusion Transformer	2022	~675M	256x256	2.27
GigaGAN	GAN	2023	~1B	512x512	3.45
VAR	Visual autoregressive	2024	~2B	256x256	1.80

A few observations stand out. BigGAN held the class-conditional ImageNet FID record for almost three years, until ADM and classifier guidance dethroned it in 2021 ^[8]. StyleGAN-XL was the first GAN to surpass BigGAN convincingly on ImageNet at high resolution ^[6]. GigaGAN is the largest scaled GAN to date and the closest direct continuation of BigGAN's scaling lineage; its 1B-parameter model reported lower FID than Stable Diffusion v1.5 and DALL-E 2 while generating a 512px image in about 0.13 seconds ^[7]. Modern diffusion transformers and visual autoregressive models now hold the FID frontier ^[15]^[16], but they spend much more compute per generated image than BigGAN does.

Strengths

Sample quality at scale. BigGAN is still the cleanest demonstration that GANs can produce convincing photorealistic images conditioned on a thousand ImageNet classes.
Class control. The class label is a clean, discrete conditioning signal. Picking a target class is just selecting an integer, which makes evaluation and analysis straightforward.
Fast inference. A BigGAN forward pass costs a single network evaluation. On a modern GPU, generating a 512x512 image takes tens of milliseconds, orders of magnitude faster than a 50-step diffusion model.
Smooth latent space. Linear interpolations in z space and class space produce smooth visual transitions, which made BigGAN a popular tool for art and exploration. The truncation knob also gives a natural quality-diversity dial.
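The latent and class interpolations mentioned under "Smooth latent space" amount to linear blends of two endpoints. The sketch below builds such a path in NumPy; feeding each pair to a generator is left out, and the function names, dimensions, and random endpoints are assumptions made for the example.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return (1.0 - t) * a + t * b

def interpolation_path(z0, z1, e0, e1, steps=8):
    """List of (latent, class_embedding) pairs moving from (z0, e0) to (z1, e1)."""
    return [(lerp(z0, z1, t), lerp(e0, e1, t)) for t in np.linspace(0.0, 1.0, steps)]

# Hypothetical endpoints: two latent vectors and two class embeddings.
rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=128), rng.normal(size=128)
e0, e1 = rng.normal(size=128), rng.normal(size=128)

path = interpolation_path(z0, z1, e0, e1, steps=5)
print(len(path))                                   # 5
print(np.allclose(path[2][0], 0.5 * (z0 + z1)))    # True: midpoint of the latent path
```

Holding the class embedding fixed and varying only z gives a walk within one class; holding z fixed and varying the embedding morphs one class into another with the same pose and layout cues.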

Limitations

Compute intensity. Training a single 512x512 BigGAN required hundreds of TPU cores and a day or two of wall time ^[1]. Mario Klingemann famously estimated the public-cloud cost at around US$59,000 per training run, which put reproduction out of reach for most academic labs.
Training instability. Even with the paper's stabilization tricks, BigGAN runs collapse at long training horizons. Practitioners had to babysit checkpoints and roll back when collapse hit ^[1].
Class-conditional only. BigGAN takes a 1-of-1000 ImageNet label, not free text. Adapting it to text-to-image required entirely new architectures (eventually GigaGAN) ^[7].
Diversity loss under truncation. The headline IS numbers depend on aggressive truncation, which sacrifices diversity. Comparing BigGAN samples to diffusion samples at the same FID, BigGAN often shows narrower per-class variety.
Surpassed by diffusion. Since 2021, diffusion models (ADM, Imagen, Stable Diffusion, Flux) have set the state of the art for high-fidelity image generation, and class-conditional ImageNet FIDs are now dominated by diffusion transformers and visual autoregressive models ^[8]^[15]^[16]. BigGAN remains a milestone, not a current SOTA system.

Code and resources

Resource	Where	Notes
Original TF Hub models	tfhub.dev/deepmind/biggan-128, biggan-256, biggan-512	Released by DeepMind in 2018; standard BigGAN generators
BigGAN-deep TF Hub models	tfhub.dev/deepmind/biggan-deep-128, biggan-deep-256, biggan-deep-512	Higher quality variants released alongside
Colab demo	The TF Hub BigGAN generation notebook	The viral 2018 demo that introduced many people to GAN-generated images ^[11]
Author's PyTorch port	github.com/ajbrock/BigGAN-PyTorch	"Officially unofficial" port by the lead author; supports up to 8 GPUs and gradient accumulation to reach a 2,048 batch size ^[12]
Hugging Face PyTorch port	github.com/huggingface/pytorch-pretrained-BigGAN	Inference-only; converts the DeepMind TF weights and ships pre-computed batch-norm statistics for 51 truncation values ^[13]
MMagic / OpenMMLab	github.com/open-mmlab/mmagic	Reference implementations and converted weights for both BigGAN and BigGAN-deep ^[14]
BigBiGAN models	tfhub.dev/deepmind/bigbigan-resnet50 etc.	The encoder-equipped follow-up; useful for unsupervised representation learning ^[5]
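As a starting point, sampling from the Hugging Face port might look like the sketch below. It follows the usage shown in that repository's README as best recalled, so the exact function signatures should be checked against the installed version; it needs `torch` and `pytorch-pretrained-biggan` installed and downloads weights on first use. The wrapper function `generate_samples` and its defaults are hypothetical.

```python
def generate_samples(class_ids=(1,), model_name="biggan-deep-128", truncation=0.4, seed=0):
    """Generate one image per ImageNet class id; returns a tensor of shape (N, 3, H, W).

    Imports are done lazily so that defining this function does not require the
    optional dependencies. Pixel values are expected in roughly [-1, 1].
    """
    import torch
    from pytorch_pretrained_biggan import BigGAN, one_hot_from_int, truncated_noise_sample

    class_ids = list(class_ids)
    model = BigGAN.from_pretrained(model_name)  # downloads weights on first call
    model.eval()

    # Truncated latent vectors and one-hot class vectors, both as float32 arrays.
    noise = truncated_noise_sample(truncation=truncation, batch_size=len(class_ids), seed=seed)
    classes = one_hot_from_int(class_ids, batch_size=len(class_ids))

    with torch.no_grad():
        images = model(torch.from_numpy(noise), torch.from_numpy(classes), truncation)
    return images

# Example (not run here because it downloads a checkpoint):
# images = generate_samples(class_ids=[1, 1, 1], truncation=0.4)
```

The truncation value is passed to the model as well as to the noise sampler because the port selects pre-computed batch-norm statistics that match it.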

The combination of free Colab access and the truncation slider made BigGAN one of the first widely shared neural-image-generation experiences. Artists like Memo Akten and Mario Klingemann built early generative-art series on top of BigGAN, and many of the techniques that later spread through diffusion-based art (latent walks, class interpolation, controlled sampling) were first popularized in this period.

Where does BigGAN sit in generative-modelling history?

BigGAN sits at an inflection point. Before it, GAN research was dominated by loss-function debates (Wasserstein GAN, R1, hinge), tricks for mode collapse, and architecture searches that often shaved a couple of FID points off the previous best. After BigGAN, the dominant story was scale: bigger batches, more parameters, more compute, with a small stabilization toolkit. That story carried through to GPT-3 in language modelling and to Stable Diffusion in image generation.

The specific architectural choices BigGAN inherited or popularized (spectral normalization in both networks, class-conditional batch normalization with shared embeddings, hinge loss, self-attention at intermediate resolutions, the truncation trick for sampling) became default ingredients in subsequent class-conditional GAN work ^[2]^[3]^[4]. Even projects that ultimately moved past GANs, such as DALL-E and Stable Diffusion, often cite BigGAN as the proof point that conditional generation at ImageNet scale was tractable in the first place.

The paper has been cited thousands of times since publication. It is regularly used as a reference baseline for new conditional image-generation work, even though direct comparisons are now made primarily against diffusion and visual autoregressive models. For practitioners building production image-generation systems, BigGAN itself is rarely deployed today (its complexity and training cost made it impractical, and StyleGAN-family models proved easier to deploy for unconditional or text-driven use). Its conceptual influence remains everywhere.

References

1. Brock, A., Donahue, J., Simonyan, K. (2019). "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ICLR 2019. arXiv:1809.11096. https://arxiv.org/abs/1809.11096 ; https://openreview.net/forum?id=B1xsqj09Fm
2. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y. (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR 2018. arXiv:1802.05957. https://arxiv.org/abs/1802.05957
3. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A. (2019). "Self-Attention Generative Adversarial Networks." ICML 2019. https://proceedings.mlr.press/v97/zhang19d.html
4. Miyato, T., Koyama, M. (2018). "cGANs with Projection Discriminator." ICLR 2018. https://openreview.net/forum?id=ByS1VpgRZ
5. Donahue, J., Simonyan, K. (2019). "Large Scale Adversarial Representation Learning" (BigBiGAN). NeurIPS 2019. arXiv:1907.02544. https://arxiv.org/abs/1907.02544
6. Sauer, A., Schwarz, K., Geiger, A. (2022). "StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets." SIGGRAPH 2022. arXiv:2202.00273. https://arxiv.org/abs/2202.00273
7. Kang, M., Zhu, J.-Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T. (2023). "Scaling up GANs for Text-to-Image Synthesis" (GigaGAN). CVPR 2023. arXiv:2303.05511. https://arxiv.org/abs/2303.05511
8. Dhariwal, P., Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. arXiv:2105.05233. https://arxiv.org/abs/2105.05233
9. Lucic, M., Kurach, K., Michalski, M., Gelly, S., Bousquet, O. (2018). "Are GANs Created Equal? A Large-Scale Study." NeurIPS 2018. https://arxiv.org/abs/1711.10337
10. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S. (2019). "A Large-Scale Study on Regularization and Normalization in GANs." ICML 2019. https://arxiv.org/abs/1807.04720
11. TensorFlow Hub. "Generating Images with BigGAN." https://www.tensorflow.org/hub/tutorials/biggan_generation_with_tf_hub
12. Brock, A. "BigGAN-PyTorch (officially unofficial)." https://github.com/ajbrock/BigGAN-PyTorch
13. Hugging Face. "pytorch-pretrained-BigGAN." https://github.com/huggingface/pytorch-pretrained-BigGAN
14. OpenMMLab. "BigGAN config in MMagic." https://github.com/open-mmlab/mmagic/blob/main/configs/biggan/README.md
15. Peebles, W., Xie, S. (2023). "Scalable Diffusion Models with Transformers" (DiT). ICCV 2023. arXiv:2212.09748. https://arxiv.org/abs/2212.09748
16. Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L. (2024). "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (VAR). NeurIPS 2024. arXiv:2404.02905. https://arxiv.org/abs/2404.02905
