CycleGAN
Last reviewed
May 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,637 words
CycleGAN (Cycle-Consistent Generative Adversarial Network) is a deep learning architecture for unpaired image-to-image translation. It learns a mapping G: X to Y between two image domains X and Y from training samples that are not aligned in pairs, by coupling the forward generator G with an inverse generator F: Y to X and enforcing a cycle-consistency loss so that F(G(x)) is approximately equal to x and G(F(y)) is approximately equal to y. The method was introduced in March 2017 in the arXiv preprint "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks" by Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros at the Berkeley AI Research (BAIR) Lab, University of California, Berkeley, and was published at the IEEE International Conference on Computer Vision (ICCV) 2017.
CycleGAN is one of the most cited works in the generative adversarial network (GAN) literature and is widely regarded as a foundational paper for unpaired image-to-image translation. It allowed practitioners to learn translation tasks such as horse to zebra, summer to winter, and photograph to painting, where collecting aligned input-output pairs is either impossible or prohibitively expensive.
Before CycleGAN, supervised image translation relied on aligned image pairs. The pix2pix framework by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, presented at CVPR 2017, used a conditional GAN on paired data to learn mappings such as edge maps to photographs, semantic labels to street scenes, and grayscale to color (Isola et al. 2017, arXiv:1611.07004). pix2pix worked well, but it required datasets where every input image came with a corresponding ground-truth output. For most interesting translation tasks, such pairs simply do not exist. Nobody can photograph the same scene as both a Monet painting and a real landscape, and there is no aligned dataset of horses standing in the exact pose of zebras in the same field.
The CycleGAN authors framed the problem as learning the joint distribution of two domains given only the marginals. That problem is highly under-constrained. An infinite family of joint distributions match any two marginals, so adversarial loss alone is not enough; the generator can permute outputs arbitrarily and still satisfy the discriminator. To narrow the search, the authors borrowed the idea of cycle consistency from work in language translation, visual tracking, and 3D shape matching, and turned it into a differentiable loss. If a horse is mapped to a zebra and then back to a horse, the result should resemble the original horse. This single constraint, together with adversarial losses in both directions, was enough to produce convincing translations on a wide range of unpaired tasks.
Two other groups proposed essentially the same idea at almost the same time. Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong introduced DualGAN ("DualGAN: Unsupervised Dual Learning for Image-to-Image Translation," arXiv:1704.02510, ICCV 2017), and Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim introduced DiscoGAN ("Learning to Discover Cross-Domain Relations with Generative Adversarial Networks," arXiv:1703.05192, ICML 2017). All three works rely on the same principle: a pair of mappings in opposite directions, tied together by a reconstruction loss. CycleGAN became the most cited of the three, partly because of the strength of the experiments and partly because of the open-source code released alongside the paper (originally in Lua/Torch, followed by a widely used PyTorch implementation).
The paper was written by four researchers associated with UC Berkeley: Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros.
The project was supported in part by NSF grants, a Google Research Award, Intel Corp, and hardware donations from NVIDIA. The first arXiv version of the paper appeared on 30 March 2017, and the work was presented at ICCV in Venice in October 2017.
CycleGAN trains four neural networks at once: two generators and two discriminators.
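Let X and Y be two image domains, with training samples drawn independently from each. The goal is to learn two generators, G: X to Y and F: Y to X, together with two discriminators: D_Y, which distinguishes real images y from translated images G(x), and D_X, which distinguishes real images x from translated images F(y).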
The full objective combines three terms.
Adversarial losses. Each generator-discriminator pair has its own adversarial loss in the style of GAN training. CycleGAN uses the least-squares formulation from Mao et al. 2017 (LSGAN, "Least Squares Generative Adversarial Networks," ICCV 2017) instead of the original log-loss from Goodfellow et al. 2014, because least-squares loss is more stable and produces higher quality images. The two adversarial terms are written L_GAN(G, D_Y, X, Y) and L_GAN(F, D_X, Y, X).
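The least-squares adversarial terms for one direction (G paired with D_Y) can be sketched as follows. This is a minimal illustration, assuming the usual LSGAN targets of 1 for real and 0 for fake. The function names, the factor of 0.5 on the discriminator loss, and the example scores are illustrative assumptions, and discriminator outputs are represented as plain lists of numbers.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def lsgan_discriminator_loss(scores_real, scores_fake):
    """Push D_Y(y) toward 1 on real images and D_Y(G(x)) toward 0 on translated ones."""
    real_term = mean([(s - 1.0) ** 2 for s in scores_real])
    fake_term = mean([s ** 2 for s in scores_fake])
    # The 0.5 factor is an assumed convention that slows D relative to G.
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(scores_fake):
    """Push D_Y(G(x)) toward 1, i.e. make translated images look real to D_Y."""
    return mean([(s - 1.0) ** 2 for s in scores_fake])


# Hypothetical discriminator scores for two real and two translated images.
d_loss = lsgan_discriminator_loss([0.9, 0.8], [0.2, 0.1])
g_loss = lsgan_generator_loss([0.2, 0.1])
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

The same two functions apply unchanged to the other direction, with F and D_X in place of G and D_Y.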
Cycle-consistency loss. This is the central contribution. Both compositions G then F and F then G must approximately reconstruct the input:
L_cyc(G, F) = E_x [ || F(G(x)) - x ||_1 ] + E_y [ || G(F(y)) - y ||_1 ]
The loss uses the L1 norm rather than L2, in line with the pix2pix paper's observation that L1 encourages less blurring than L2.
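A toy computation can make the cycle-consistency term concrete. The sketch below treats images as flat lists of floats and uses hypothetical stand-in generators (simple arithmetic maps, not neural networks). It also averages the absolute differences per pixel, a normalization assumed here for readability; the formula above is written with the plain L1 norm.

```python
def l1_distance(a, b):
    """Mean absolute difference between two same-sized images (flat lists)."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, batch_x, batch_y):
    """E_x[|F(G(x)) - x|] + E_y[|G(F(y)) - y|], estimated over two batches."""
    forward = sum(l1_distance(F(G(x)), x) for x in batch_x) / len(batch_x)
    backward = sum(l1_distance(G(F(y)), y) for y in batch_y) / len(batch_y)
    return forward + backward


# Hypothetical stand-ins for the two generators.
def G(x):
    return [2.0 * v + 1.0 for v in x]


def F_exact(y):
    # Exact inverse of G, so both cycles reconstruct their input.
    return [(v - 1.0) / 2.0 for v in y]


def F_lossy(y):
    # Discards its input, so neither cycle can reconstruct anything.
    return [0.0 for _ in y]


batch_x = [[0.0, 0.5], [1.0, 0.25]]
batch_y = [[1.0, 2.0], [3.0, 1.5]]

loss_exact = cycle_consistency_loss(G, F_exact, batch_x, batch_y)
loss_lossy = cycle_consistency_loss(G, F_lossy, batch_x, batch_y)
print(loss_exact, loss_lossy)
```

The loss is zero only when each generator undoes the other on the training samples, which is the constraint that rules out most of the arbitrary mappings an adversarial loss alone would accept.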
Identity loss. An optional identity term encourages the generators to behave like the identity function when they are fed an image that is already in the target domain:
L_id(G, F) = E_y [ || G(y) - y ||_1 ] + E_x [ || F(x) - x ||_1 ]
Identity loss helps preserve color composition and prevents tinting artifacts. It was used for the painting-to-photograph experiments and switched off elsewhere.
Total objective. The full loss is a weighted sum:
L = L_GAN(G, D_Y, X, Y)
+ L_GAN(F, D_X, Y, X)
+ lambda_cyc * L_cyc(G, F)
+ lambda_id * L_id(G, F)
In the paper, lambda_cyc is set to 10. When the identity loss is used, it is weighted at 0.5 times lambda_cyc, which corresponds to lambda_id = 5 in the formula above.
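The weighted sum can be written as a small helper. This is a sketch with hypothetical component values, not measurements from a real training run. It takes the identity weight as an absolute value; since the paper scales that weight relative to the cycle weight, the example passes 0.5 * lambda_cyc.

```python
def cyclegan_objective(gan_g, gan_f, cyc, idt, lambda_cyc=10.0, lambda_id=0.0):
    """Combine the four loss terms into the scalar that the generators minimize."""
    return gan_g + gan_f + lambda_cyc * cyc + lambda_id * idt


# Hypothetical per-batch values for the individual terms.
terms = {"gan_g": 0.5, "gan_f": 0.25, "cyc": 0.125, "idt": 0.0625}

lambda_cyc = 10.0
without_identity = cyclegan_objective(**terms, lambda_cyc=lambda_cyc)
with_identity = cyclegan_objective(
    **terms, lambda_cyc=lambda_cyc, lambda_id=0.5 * lambda_cyc
)
print(without_identity, with_identity)
```

With these weights the cycle term dominates the objective unless reconstruction is already very good, which reflects how strongly the method leans on cycle consistency to constrain the mapping.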
Generators. The generator network is adapted from the architecture used by Justin Johnson, Alexandre Alahi, and Li Fei-Fei in "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (ECCV 2016). It consists of an initial 7 by 7 convolution, two stride-2 downsampling convolutions, a stack of residual blocks, two fractionally-strided convolutions (stride 1/2) for upsampling, and a final 7 by 7 convolution mapping back to RGB. For 128 by 128 inputs the network uses 6 residual blocks; for 256 by 256 and larger inputs it uses 9 residual blocks. Instance normalization (Ulyanov et al. 2016) is used throughout instead of batch normalization, which suits the batch size of 1 used during training.
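The way spatial resolution changes through the generator can be traced with the standard convolution size formulas. The sketch below assumes the layer layout from the paper's appendix (c7s1-64, d128, d256, residual blocks at 256 channels, u128, u64, c7s1-3). The padding and output-padding values are assumptions chosen so that the output resolution matches the input, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Output size of a transposed (fractionally-strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_generator(size, n_res_blocks):
    """Return (layer name, channels, spatial size) for each stage."""
    trace = []
    size = conv_out(size, 7, 1, 3)
    trace.append(("c7s1-64", 64, size))
    size = conv_out(size, 3, 2, 1)
    trace.append(("d128", 128, size))
    size = conv_out(size, 3, 2, 1)
    trace.append(("d256", 256, size))
    for i in range(n_res_blocks):
        # Residual blocks keep both channel count and resolution unchanged.
        trace.append((f"R256 #{i + 1}", 256, size))
    size = deconv_out(size, 3, 2, 1, 1)
    trace.append(("u128", 128, size))
    size = deconv_out(size, 3, 2, 1, 1)
    trace.append(("u64", 64, size))
    size = conv_out(size, 7, 1, 3)
    trace.append(("c7s1-3", 3, size))
    return trace


for name, channels, size in trace_generator(256, n_res_blocks=9):
    print(f"{name:10s} {channels:4d} channels  {size} x {size}")
```

For a 256 by 256 input, the residual blocks operate at 64 by 64 resolution, which is where most of the network's depth sits.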
Discriminators. Both D_X and D_Y are PatchGAN discriminators in the style introduced by Isola et al. for pix2pix. The PatchGAN classifies overlapping 70 by 70 patches of the image as real or fake and averages the responses across the image, producing a single scalar output for the loss. PatchGANs have far fewer parameters than full-image discriminators, run faster, and focus the model on local texture and structure rather than global layout, which is exactly what is needed for translation.
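The 70 by 70 patch size follows from the discriminator's layer stack. The sketch below assumes the layout commonly used for this PatchGAN: three 4 by 4 stride-2 convolutions followed by two 4 by 4 stride-1 convolutions, all with padding 1. That layout and the helper names are assumptions not stated in the text above.

```python
# (kernel, stride, padding) for each convolution, in forward order (assumed layout).
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def receptive_field(layers):
    """Input pixels seen by one output unit, computed from the last layer backward."""
    rf = 1
    for kernel, stride, _padding in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(size, layers):
    """Side length of the score map produced for a square input of the given size."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def patch_average(score_map):
    """Average a 2D grid of per-patch scores into a single scalar."""
    flat = [score for row in score_map for score in row]
    return sum(flat) / len(flat)


print(receptive_field(PATCHGAN_LAYERS))   # 70
print(output_size(256, PATCHGAN_LAYERS))  # 30
```

Under these assumptions a 256 by 256 image yields a 30 by 30 grid of scores, each judging one overlapping 70 by 70 patch, and `patch_average` reduces that grid to the scalar used in the loss.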
The full training of a single CycleGAN model takes on the order of one to two days on a single GPU for the standard 256 by 256 datasets reported in the paper.
The authors evaluated CycleGAN on a wide spectrum of tasks. The most widely circulated images come from a small set of domain pairs.
| Task | Source domain | Target domain | Notes |
|---|---|---|---|
| Object transfiguration | Horse | Zebra | The single most reproduced CycleGAN demo |
| Object transfiguration | Apple | Orange | Both directions |
| Season transfer | Yosemite summer | Yosemite winter | Snowfall and color shifts |
| Collection style transfer | Photograph | Monet painting | Also Cezanne, Van Gogh, Ukiyo-e |
| Photo enhancement | iPhone snapshot | DSLR-quality bokeh | Shallow depth of field |
| Map translation | Aerial photo | Google Maps style | And the reverse |
| Cityscapes | Semantic labels | Street photographs | Compared head-to-head with pix2pix |
On the Cityscapes label-to-photo task, where paired data is available, the authors used pix2pix as a paired baseline. pix2pix produced sharper and more accurate results, as expected, but CycleGAN closed much of the gap without ever seeing aligned pairs. Quantitative evaluation used human perceptual studies on Amazon Mechanical Turk (AMT), segmentation-based FCN-scores on the Cityscapes label-to-photo task, and standard semantic segmentation metrics (per-pixel accuracy, per-class accuracy, and class IoU) on the photo-to-label direction.
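Segmentation-based scores of this kind compare two label maps: for the FCN-score, the labels that an off-the-shelf segmentation network predicts on a generated photo, and the label map the photo was generated from. A minimal sketch of two such metrics follows. It assumes label maps flattened to lists of integer class ids; the helper names and the choice to skip classes absent from both maps are assumptions.

```python
def per_pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)


def mean_class_iou(pred, truth, num_classes):
    """Intersection over union per class, averaged over classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 6-pixel label maps with three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

accuracy = per_pixel_accuracy(pred, truth)
miou = mean_class_iou(pred, truth, num_classes=3)
print(f"per-pixel accuracy: {accuracy:.3f}, mean class IoU: {miou:.3f}")
```

A translation that preserves scene structure lets the segmentation network recover the original labels, so higher scores indicate more faithful generated photos.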
The authors were unusually candid about the failure modes of their method, devoting an entire section of the paper to limitations. The main ones are:
CycleGAN spawned a large family of follow-up architectures. The table below summarizes the most influential variants and how they differ from the original.
| Method | Year | Venue | Authors | Key idea relative to CycleGAN |
|---|---|---|---|---|
| DualGAN | 2017 | ICCV | Yi et al. | Concurrent work with the same dual-generator and reconstruction-loss design |
| DiscoGAN | 2017 | ICML | Kim et al. | Concurrent work, also uses cross-domain reconstruction |
| UNIT | 2017 | NeurIPS | Liu, Breuel, Kautz | Adds a shared latent-space assumption between domains using weight-tied encoders |
| BicycleGAN | 2017 | NeurIPS | Zhu et al. | Multimodal output for the paired setting |
| pix2pixHD | 2018 | CVPR | Wang et al. | High-resolution paired translation |
| MUNIT | 2018 | ECCV | Huang, Liu, Belongie, Kautz | Disentangles content and style codes for multimodal unpaired translation |
| DRIT | 2018 | ECCV | Lee et al. | Disentangled representation for diverse outputs |
| StarGAN | 2018 | CVPR | Choi et al. | Single generator covers many domains using a domain label |
| StarGAN v2 | 2020 | CVPR | Choi et al. | Multi-domain plus multimodal |
| FUNIT | 2019 | ICCV | Liu et al. | Few-shot unsupervised translation |
| U-GAT-IT | 2020 | ICLR | Kim et al. | Attention modules for selfie-to-anime style change |
| CUT | 2020 | ECCV | Park, Efros, Zhang, Zhu | Replaces cycle consistency with a patch-level contrastive loss; one-sided translation, faster training |
| CycleGAN-VC | 2018 | EUSIPCO | Kaneko, Kameoka | Voice conversion with the same cycle-consistency principle |
| CycleGAN-VC2 | 2019 | ICASSP | Kaneko et al. | Improved generator and two-step adversarial loss |
More recent work has begun to replace GANs with diffusion models for unpaired translation (for example UNIT-DDPM and SDEdit-style methods), but the cycle-consistency principle continues to appear as a regularizer in many of these models.
The table below compares CycleGAN to other major GAN architectures. The pairing column indicates whether aligned (input, output) pairs are required at training time. The multi-domain column indicates whether a single trained model handles many target domains. The multimodal column indicates whether the model can produce diverse outputs for the same input.
| Architecture | Year | Pairs required | Multi-domain | Multimodal | Key paper |
|---|---|---|---|---|---|
| Vanilla GAN | 2014 | n/a | n/a | n/a | Goodfellow et al., NeurIPS 2014 |
| DCGAN | 2015 | n/a | n/a | n/a | Radford, Metz, Chintala, ICLR 2016 |
| Conditional GAN | 2014 | depends | yes via label | no | Mirza and Osindero, arXiv 1411.1784 |
| pix2pix | 2017 | yes | no | no | Isola et al., CVPR 2017 |
| CycleGAN | 2017 | no | no | no | Zhu et al., ICCV 2017 |
| UNIT | 2017 | no | no | no | Liu et al., NeurIPS 2017 |
| StarGAN | 2018 | no | yes | no | Choi et al., CVPR 2018 |
| MUNIT | 2018 | no | no | yes | Huang et al., ECCV 2018 |
| BicycleGAN | 2017 | yes | no | yes | Zhu et al., NeurIPS 2017 |
| CUT | 2020 | no | no | no | Park et al., ECCV 2020 |
| StyleGAN | 2018 | n/a (unconditional) | n/a | yes | Karras, Laine, Aila, CVPR 2019 |
| BigGAN | 2018 | n/a (class-conditional) | yes | yes | Brock, Donahue, Simonyan, ICLR 2019 |
The Wasserstein GAN (WGAN) loss can be substituted for the LSGAN loss in CycleGAN, and several follow-up papers have done so to gain training stability on harder datasets. CycleGAN sits firmly in the family of conditional generative models for image-to-image translation.
The table below organizes the most common deployment areas for CycleGAN and the cycle-consistency idea.
| Application area | Description | Representative work |
|---|---|---|
| Artistic style transfer | Photograph to Monet, Van Gogh, Ukiyo-e, Cezanne and back | Original CycleGAN paper, 2017 |
| Domain adaptation for self-driving | Synthetic GTA-V renders translated to real Cityscapes-style images for training perception models | Hoffman et al., CyCADA, ICML 2018 |
| Medical image modality transfer | CT to MR and MR to CT for treatment planning, segmentation, and dose calculation | Wolterink et al. 2017; Hiasa et al. 2018; many follow-ups |
| Sim-to-real for robotics | Translating rendered camera images to photorealistic ones, or vice versa | Various Berkeley and Google Brain papers, 2018 onward |
| Voice conversion | Speaker identity transfer without parallel utterance pairs | CycleGAN-VC, CycleGAN-VC2, MaskCycleGAN-VC by Kaneko and Kameoka |
| Aerial and satellite imagery | Map style transfer, day-to-night, season change, cross-sensor adaptation | Multiple remote sensing papers |
| Data augmentation | Synthesizing extra training images in the minority class to balance datasets, especially in medical imaging | Multiple medical AI papers |
| Privacy and de-identification | Translating face images to anonymized but realistic substitutes | Various face anonymization papers |
| Text style transfer | Cycle consistency adapted to sequence models for politeness, formality, sentiment changes | Shen et al. 2017; many follow-ups |
| Art and design tooling | Powering creative tools like Runway ML, Replicate.com demos, and many web apps | Community projects |
CycleGAN remains in active production use for stylization tasks, and pretrained CycleGAN models are still distributed on Hugging Face, Replicate, and the official PyTorch repository, more than eight years after the original paper.
The CycleGAN paper has been cited tens of thousands of times. According to Google Scholar, the citation count passed 30,000 in 2024 and continues to climb. The cycle-consistency principle has been adapted well beyond images: language pairs in unsupervised machine translation (Lample et al. 2018), graph-to-graph translation, and even cross-modal embedding alignment all use variants of the same idea.
More broadly, CycleGAN demonstrated that adversarial training plus a self-supervised consistency constraint can solve problems that previously seemed to require strong supervision. It established "unpaired image-to-image translation" as a recognized task with its own benchmarks and evaluation protocols. The horse-to-zebra demo became a canonical example used in textbooks and courses to introduce GANs.
From an engineering point of view, the open-sourcing of the official PyTorch repository (junyanz/pytorch-CycleGAN-and-pix2pix) had an outsized effect. The repo combines the pix2pix and CycleGAN codebases, ships pretrained models for the most common domain pairs, and remains one of the most starred computer vision repositories on GitHub. It has been ported to TensorFlow, Keras, MXNet, JAX, and most other frameworks. Many later projects, including pix2pixHD, BicycleGAN, MUNIT, and CUT, were built directly on the same code structure.
Despite the rise of diffusion models and large pretrained image models for general-purpose translation tasks, CycleGAN and its descendants are still widely used in practice. They are smaller, faster, and easier to train than diffusion alternatives, and for narrow tasks with limited data they often remain the most cost-effective choice.
The official implementation lives at github.com/junyanz/pytorch-CycleGAN-and-pix2pix. The repository ships:
The repository was originally released alongside the paper in 2017 and has been actively maintained since. As of 2024 it supports Python 3.11 and PyTorch 2.4. A separate Lua/Torch repository (junyanz/CycleGAN) preserves the original implementation used in the ICCV submission.
Unofficial ports and pretrained checkpoints are available on Hugging Face Hub, Replicate, and many community GitHub repositories. The model weights are small enough to run on a single consumer GPU at inference time, and runtime is dominated by the residual-block stack rather than memory bandwidth, so even older hardware can run translation in real time at 256 by 256 resolution.