# CycleGAN

> Source: https://aiwiki.ai/wiki/cyclegan
> Updated: 2026-05-01
> Categories: Computer Vision, Generative AI, Image Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**CycleGAN** (Cycle-Consistent Generative Adversarial Network) is a deep learning architecture for **unpaired image-to-image translation**. It learns a mapping G: X to Y between two image domains X and Y from training samples that are not aligned in pairs, by coupling the forward generator G with an inverse generator F: Y to X and enforcing a *cycle-consistency loss* so that F(G(x)) is approximately equal to x and G(F(y)) is approximately equal to y. The method was introduced in March 2017 in the arXiv preprint "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks" by Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros at the Berkeley AI Research (BAIR) Lab, University of California, Berkeley, and was published at the IEEE International Conference on Computer Vision (ICCV) 2017.

CycleGAN is one of the most cited works in the [generative adversarial network (GAN)](generative_adversarial_network_gan) literature and one of the foundational papers of unpaired image-to-image translation. It allowed practitioners to learn translation tasks such as horse to zebra, summer to winter, and photograph to painting, where collecting aligned input-output pairs is either impossible or prohibitively expensive.

## Background and motivation

Before CycleGAN, supervised image translation relied on aligned image pairs. The pix2pix framework by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, presented at CVPR 2017, used a conditional [GAN](gan) on paired data to learn mappings such as edge maps to photographs, semantic labels to street scenes, and grayscale to color (Isola et al. 2017, arXiv:1611.07004). pix2pix worked well, but it required datasets where every input image came with a corresponding ground-truth output. For most interesting translation tasks, such pairs simply do not exist. Nobody can photograph the same scene as both a Monet painting and a real landscape, and there is no aligned dataset of horses standing in the exact pose of zebras in the same field.

Unpaired translation can be viewed as recovering a joint distribution over two domains from samples of the marginals alone, and that problem is highly under-constrained. Infinitely many joint distributions match any two marginals, so an adversarial loss alone is not enough: the generator can pair inputs with outputs arbitrarily, as long as the outputs look like they belong to the target domain, and still satisfy the discriminator. To narrow the search, the authors borrowed the idea of *cycle consistency* from work in language translation, visual tracking, and 3D shape matching, and turned it into a differentiable loss. If a horse is mapped to a zebra and then back to a horse, the result should resemble the original horse. This single constraint, together with adversarial losses in both directions, was enough to produce convincing translations on a wide range of unpaired tasks.

Two other groups proposed essentially the same idea at almost the same time. Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong introduced **DualGAN** ("DualGAN: Unsupervised Dual Learning for Image-to-Image Translation," arXiv:1704.02510, ICCV 2017), and Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim introduced **DiscoGAN** ("Learning to Discover Cross-Domain Relations with Generative Adversarial Networks," arXiv:1703.05192, ICML 2017). All three works rely on the same dual-mapping plus reconstruction-loss principle. CycleGAN became the most cited of the three, partly because of the strength of the experiments and partly because of the open-source PyTorch repository released alongside the paper.

## Authors and origin

The paper was written by four researchers associated with UC Berkeley:

- **Jun-Yan Zhu**, lead author, then a PhD student at Berkeley advised by Alexei A. Efros. Zhu later went to MIT CSAIL as a postdoc, worked at Adobe Research, and joined Carnegie Mellon University as an assistant professor in 2020. He has continued to work on generative models and image-to-image translation, including pix2pixHD, BicycleGAN, GauGAN, and CUT.
- **Taesung Park**, then a PhD student at Berkeley, also advised by Efros. Park went on to first-author the SPADE/GauGAN paper (CVPR 2019) and the Contrastive Unpaired Translation (CUT) paper (ECCV 2020).
- **Phillip Isola**, then at Berkeley working with Efros, later an assistant professor at MIT EECS. Isola was the first author of pix2pix (CVPR 2017), the paired counterpart to CycleGAN.
- **Alexei A. Efros**, professor at UC Berkeley, the senior author and advisor on the project.

The paper acknowledges support from NSF grants, a Google Research Award, Intel, and hardware donations from NVIDIA. The first arXiv version of the paper appeared on 30 March 2017, and the work was presented at ICCV in Venice in October 2017.

## Method

CycleGAN trains four neural networks at once: two generators and two discriminators.

### Notation

Let X and Y be two image domains, with training samples drawn independently from each. The goal is to learn:

- A generator G: X to Y that turns images from domain X into images that look like they came from domain Y.
- An inverse generator F: Y to X.
- A discriminator D_Y that tries to distinguish real samples from Y from translated samples G(x).
- A discriminator D_X that tries to distinguish real samples from X from translated samples F(y).

### Loss function

The full objective combines three terms.

**Adversarial losses.** Each generator-discriminator pair has its own adversarial loss in the style of [GAN](gan) training. CycleGAN uses the least-squares formulation from Mao et al. 2017 (LSGAN, "Least Squares Generative Adversarial Networks," ICCV 2017) instead of the original log-loss from Goodfellow et al. 2014, because least-squares loss is more stable and produces higher quality images. The two adversarial terms are written L_GAN(G, D_Y, X, Y) and L_GAN(F, D_X, Y, X).
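As a rough illustration of the least-squares formulation, the sketch below computes the discriminator and generator objectives from lists of discriminator scores. The function names and the sample scores are hypothetical, and the scores are plain floats rather than network outputs. It assumes the common LSGAN convention of target 1 for real images and 0 for generated ones.

```python
def lsgan_discriminator_loss(scores_real, scores_fake):
    """Least-squares loss for D: push scores on real images to 1, on fakes to 0."""
    real_term = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    fake_term = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return real_term + fake_term


def lsgan_generator_loss(scores_fake):
    """Least-squares loss for G: push D's scores on translated images toward 1."""
    return sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)


# Hypothetical scores D_Y assigns to real y samples and to translations G(x).
scores_on_real_y = [0.9, 1.1, 0.8]
scores_on_fake_y = [0.2, 0.1, 0.4]

print("D_Y loss:", lsgan_discriminator_loss(scores_on_real_y, scores_on_fake_y))
print("G loss:  ", lsgan_generator_loss(scores_on_fake_y))
```

The same two functions would be applied a second time with D_X and F to obtain the adversarial term for the reverse direction.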

**Cycle-consistency loss.** This is the central contribution. Both compositions G then F and F then G must approximately reconstruct the input:

```
L_cyc(G, F) = E_x [ || F(G(x)) - x ||_1 ] + E_y [ || G(F(y)) - y ||_1 ]
```

The loss uses the L1 norm, the same reconstruction penalty that pix2pix adopted on the grounds that L1 encourages less blurring than L2.
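A minimal sketch of the cycle-consistency term, using toy "generators" that only shift pixel values, may make the formula concrete. Images are flat lists of floats, all names are hypothetical, and the L1 norm is averaged over pixels (as most implementations do), which differs from the written norm only by a constant factor.

```python
def l1_distance(a, b):
    """Mean absolute difference between two images given as flat pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, xs, ys):
    """L_cyc(G, F): reconstruction error of x -> G -> F and of y -> F -> G."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


# Toy stand-ins for the two generators: G brightens, F darkens.
def G(img):
    return [p + 0.5 for p in img]

def F_exact(img):          # exact inverse of G
    return [p - 0.5 for p in img]

def F_lossy(img):          # undoes only half of G's shift
    return [p - 0.25 for p in img]


xs = [[0.25, 0.5, 0.75], [0.5, 0.5, 0.25]]
ys = [[1.0, 0.75, 0.5]]

print(cycle_consistency_loss(G, F_exact, xs, ys))  # 0.0
print(cycle_consistency_loss(G, F_lossy, xs, ys))  # 0.5
```

The loss is zero only when the two mappings invert each other on the training samples, which is the constraint that rules out arbitrary pairings.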

**Identity loss.** An optional identity term encourages the generators to behave like the identity function when they are fed an image that is already in the target domain:

```
L_id(G, F) = E_y [ || G(y) - y ||_1 ] + E_x [ || F(x) - x ||_1 ]
```

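Identity loss helps preserve color composition and stops the generators from changing the tint of an image when there is no need to, for example turning daytime paintings into sunset-like photographs. The paper applies it in the painting-to-photograph experiments, and its appendix also lists it for flower photo enhancement; the other tasks are trained without it.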
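The effect of the identity term can be seen with a toy example. The sketch below, with hypothetical names and flat lists standing in for images, compares a generator that tints everything it sees with one that leaves target-domain images alone.

```python
def l1_distance(a, b):
    """Mean absolute difference between two images given as flat pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def identity_loss(G, F, xs, ys):
    """L_id(G, F): G is fed real y samples, F is fed real x samples."""
    g_term = sum(l1_distance(G(y), y) for y in ys) / len(ys)
    f_term = sum(l1_distance(F(x), x) for x in xs) / len(xs)
    return g_term + f_term


def tinting_G(img):        # shifts every pixel, even on target-domain input
    return [p + 0.25 for p in img]

def faithful_G(img):       # leaves target-domain input unchanged
    return list(img)

def faithful_F(img):
    return list(img)


xs = [[0.25, 0.5], [0.75, 0.5]]
ys = [[0.5, 0.5], [0.25, 0.75]]

print(identity_loss(tinting_G, faithful_F, xs, ys))   # 0.25
print(identity_loss(faithful_G, faithful_F, xs, ys))  # 0.0
```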

**Total objective.** The full loss is a weighted sum:

```
L = L_GAN(G, D_Y, X, Y)
  + L_GAN(F, D_X, Y, X)
  + lambda_cyc * L_cyc(G, F)
  + lambda_id  * L_id(G, F)
```

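In the paper, lambda_cyc is set to 10. When the identity loss is used, its weight is half the cycle weight, so lambda_id = 0.5 * lambda_cyc = 5. The official PyTorch code exposes the 0.5 factor as the `lambda_identity` option, which multiplies the cycle weight rather than acting as an absolute weight.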
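The weighted sum is simple enough to write out directly. In the sketch below the function name and the loss values are hypothetical. `lambda_id` defaults to 0, which corresponds to training with the identity term switched off, and the weight must be passed explicitly when the term is used.

```python
def generator_objective(l_gan_g, l_gan_f, l_cyc, l_id=0.0,
                        lambda_cyc=10.0, lambda_id=0.0):
    """Weighted sum of the two adversarial terms, the cycle term, and the identity term."""
    return l_gan_g + l_gan_f + lambda_cyc * l_cyc + lambda_id * l_id


# Hypothetical per-batch loss values.
without_identity = generator_objective(0.3, 0.4, l_cyc=0.05)
with_identity = generator_objective(0.3, 0.4, l_cyc=0.05, l_id=0.02, lambda_id=5.0)

print(round(without_identity, 6))  # 1.2
print(round(with_identity, 6))     # 1.3
```

With lambda_cyc at 10, even a small reconstruction error contributes as much to the objective as the adversarial terms, which is why the cycle term has such a strong influence on the learned mapping.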

### Architecture

**Generators.** The generator network is adapted from the architecture used by Justin Johnson, Alexandre Alahi, and Li Fei-Fei in "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (ECCV 2016). It consists of an initial 7 by 7 convolution, two stride-2 downsampling convolutions, a stack of residual blocks, two fractionally-strided convolutions that upsample back to the input resolution, and a final convolution mapping back to RGB. For 128 by 128 inputs the network uses 6 residual blocks; for 256 by 256 and larger inputs it uses 9 residual blocks. Instance normalization (Ulyanov et al. 2016) is used throughout instead of batch normalization, which suits the batch size of 1 used during training.
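To see how the spatial resolution changes through the generator, the sketch below traces feature-map sizes using the layer listing from the paper's appendix (c7s1-64, d128, d256, a stack of R256 residual blocks, u128, u64, c7s1-3), in which only the two d layers downsample. The kernel sizes for the d and u layers follow that listing; the padding and output-padding values are assumptions taken from common implementations, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Output width of a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def generator_shapes(size=256, n_blocks=9):
    """Return (layer name, channels, spatial width) for each generator stage."""
    shapes = [("input", 3, size)]
    size = conv_out(size, kernel=7, stride=1, padding=3)
    shapes.append(("c7s1-64", 64, size))
    for name, channels in [("d128", 128), ("d256", 256)]:
        size = conv_out(size, kernel=3, stride=2, padding=1)
        shapes.append((name, channels, size))
    for i in range(n_blocks):
        # Residual blocks keep both channel count and resolution.
        shapes.append((f"R256 #{i + 1}", 256, size))
    for name, channels in [("u128", 128), ("u64", 64)]:
        size = deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1)
        shapes.append((name, channels, size))
    size = conv_out(size, kernel=7, stride=1, padding=3)
    shapes.append(("c7s1-3", 3, size))
    return shapes


for name, channels, width in generator_shapes(256, 9):
    print(f"{name:10s} {channels:4d} channels  {width} x {width}")
```

Under these assumptions a 256 by 256 input is processed by the residual blocks at 64 by 64 and is restored to 256 by 256 at the output.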

**Discriminators.** Both D_X and D_Y are PatchGAN discriminators in the style introduced by Isola et al. for pix2pix. The PatchGAN is fully convolutional: it classifies overlapping 70 by 70 patches of the image as real or fake, and the per-patch responses are averaged into a single scalar for the loss. PatchGANs have far fewer parameters than full-image discriminators, run faster, can be applied to images of arbitrary size, and focus the model on local texture and structure rather than global layout, which suits translations that mainly change local appearance.
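The 70 by 70 figure is the receptive field of one output unit, and it can be checked with a few lines of arithmetic. The layer layout below (five 4 by 4 convolutions, the first three with stride 2, all with padding 1) is an assumption based on common PatchGAN implementations rather than something stated in this article, and the helper names are hypothetical.

```python
# (kernel, stride, padding) for each convolution in the assumed 70x70 PatchGAN.
PATCHGAN_LAYERS = [
    (4, 2, 1),  # C64
    (4, 2, 1),  # C128
    (4, 2, 1),  # C256
    (4, 1, 1),  # C512
    (4, 1, 1),  # final 1-channel score map
]


def receptive_field(layers):
    """Width of the input patch that influences a single output unit."""
    rf, jump = 1, 1
    for kernel, stride, _ in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # distance in input pixels between adjacent units
    return rf


def score_map_size(size, layers):
    """Width of the real/fake score map for a square input of the given width."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


print(receptive_field(PATCHGAN_LAYERS))      # 70
print(score_map_size(256, PATCHGAN_LAYERS))  # 30
```

With this layout a 256 by 256 image yields a 30 by 30 grid of patch scores, and the loss averages over that grid.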

### Training

- **Optimizer**: Adam, with beta_1 = 0.5 and beta_2 = 0.999.
- **Learning rate**: 2e-4 for the first 100 epochs, then linearly decayed to zero over the next 100 epochs, for 200 epochs total.
- **Batch size**: 1 (instance normalization is used because batch normalization with batch size 1 is degenerate).
- **Image buffer**: a history pool of 50 generated images is kept and used for discriminator updates, following Shrivastava, Pfister, Tuzel, Susskind, Wang, and Webb's "Learning from Simulated and Unsupervised Images through Adversarial Training" (CVPR 2017). Sampling old generated images stabilizes the discriminator and reduces oscillation.
- **Initialization**: weights drawn from a Gaussian distribution with mean 0 and standard deviation 0.02.
- **Discriminator updates**: the loss for D is divided by 2, slowing the discriminator relative to the generator.
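The learning-rate schedule and the history pool from the list above can be sketched in plain Python. The names are hypothetical, and two details are assumptions rather than facts from this article: epochs are counted from 0, and once the pool is full it returns a stored image with probability 0.5 (replacing it with the new one), a rule used by common implementations.

```python
import random


def learning_rate(epoch, base_lr=2e-4, n_constant=100, n_decay=100):
    """Constant for n_constant epochs, then linear decay to zero over n_decay epochs."""
    if epoch < n_constant:
        return base_lr
    remaining = 1.0 - (epoch - n_constant) / n_decay
    return base_lr * max(0.0, remaining)


class ImagePool:
    """History of generated images shown to the discriminator instead of only the latest."""

    def __init__(self, capacity=50, rng=None):
        self.capacity = capacity
        self.images = []
        self.rng = rng or random.Random()

    def query(self, image):
        """Store the new image and return either it or an older generated image."""
        if len(self.images) < self.capacity:
            self.images.append(image)   # pool not full yet: keep and return as-is
            return image
        if self.rng.random() < 0.5:
            index = self.rng.randrange(self.capacity)
            older = self.images[index]
            self.images[index] = image  # swap the new image into the pool
            return older
        return image


for epoch in (0, 99, 100, 150, 199, 200):
    print(epoch, learning_rate(epoch))

pool = ImagePool(capacity=50, rng=random.Random(0))
returned = [pool.query(f"fake_{step}") for step in range(200)]
print(len(pool.images), returned[-5:])
```

Because the discriminator sometimes sees translations from many steps earlier, it cannot overfit to the generator's most recent behavior, which is the stabilizing effect described above.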

The full training of a single CycleGAN model takes on the order of one to two days on a single GPU for the standard 256 by 256 datasets reported in the paper.

## Results from the paper

The authors evaluated CycleGAN on a wide spectrum of tasks. The most widely circulated images come from a small set of domain pairs.

| Task | Source domain | Target domain | Notes |
|---|---|---|---|
| Object transfiguration | Horse | Zebra | The single most reproduced CycleGAN demo |
| Object transfiguration | Apple | Orange | Both directions |
| Season transfer | Yosemite summer | Yosemite winter | Snowfall and color shifts |
| Collection style transfer | Photograph | Monet painting | Also Cezanne, Van Gogh, Ukiyo-e |
| Photo enhancement | iPhone snapshot | DSLR-quality bokeh | Shallow depth of field |
| Map translation | Aerial photo | Google Maps style | And the reverse |
| Cityscapes | Semantic labels | Street photographs | Compared head-to-head with pix2pix |

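On the Cityscapes label-to-photo task, where paired data is available, the authors treated pix2pix, trained on the aligned pairs, as an upper bound and compared CycleGAN against unpaired baselines including CoGAN, SimGAN, and BiGAN/ALI. pix2pix produced sharper and more accurate results, as expected, but CycleGAN came closest to it among the unpaired methods without ever seeing aligned pairs. Quantitative evaluation used human perceptual studies on Amazon Mechanical Turk (AMT) for the map and aerial photo task, and segmentation-based metrics on Cityscapes: the FCN-score for labels to photo, and per-pixel accuracy, per-class accuracy, and class IoU for photo to labels. The Frechet Inception Distance (FID), now a common metric for this task, was introduced later in 2017 and was not part of the paper's evaluation.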
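The segmentation-based scores reduce to three standard metrics computed between a predicted label map and the ground-truth label map. In the FCN-score setting the prediction comes from a pretrained segmentation network run on the generated photo. The sketch below uses flat lists of class indices and hypothetical names, and it skips classes that never appear when averaging, which is one common convention.

```python
def segmentation_scores(pred, truth, num_classes):
    """Return (per-pixel accuracy, mean per-class accuracy, mean class IoU)."""
    pixel_acc = sum(p == t for p, t in zip(pred, truth)) / len(truth)
    class_accs, ious = [], []
    for c in range(num_classes):
        tp = sum(p == c and t == c for p, t in zip(pred, truth))
        fp = sum(p == c and t != c for p, t in zip(pred, truth))
        fn = sum(p != c and t == c for p, t in zip(pred, truth))
        if tp + fn > 0:                 # class present in the ground truth
            class_accs.append(tp / (tp + fn))
        if tp + fp + fn > 0:            # class present in prediction or truth
            ious.append(tp / (tp + fp + fn))
    return pixel_acc, sum(class_accs) / len(class_accs), sum(ious) / len(ious)


# Six "pixels" with three classes; one pixel of class 1 is mislabeled as class 2.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]

pixel_acc, class_acc, mean_iou = segmentation_scores(pred, truth, num_classes=3)
print(round(pixel_acc, 4), round(class_acc, 4), round(mean_iou, 4))
# 0.8333 0.8333 0.7222
```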

## Limitations

The authors were candid about the failure modes of their method and devoted a full section of the paper to limitations and discussion. The main limitations, drawn from the paper and from later practice, are:

- **Geometric and shape changes are weak.** CycleGAN learns texture and color mappings well but struggles with translations that require large shape changes. The classic failure mode is dog to cat: the body and pose of the source dog are preserved while the network only changes fur color and texture, producing an obviously wrong cat shape.
- **Cycle consistency assumes a one-to-one mapping.** Many real translation tasks are one-to-many. There are many possible winter versions of a given summer photograph (different snowfall, different lighting). The L1 cycle loss collapses to a deterministic mapping and the model cannot capture this multimodality. MUNIT and BicycleGAN later addressed this.
- **Distribution shift at test time.** When the test image is far from the training distribution, for example a horse shown from an unusual angle or a photograph with a person on the horse's back, CycleGAN often produces unrealistic outputs. The horse-to-zebra model was trained on ImageNet wild horse and zebra images that contain no riders, so in the paper's well-known failure case, a photograph of Vladimir Putin on horseback, it paints zebra stripes over the rider as well as the horse.
- **Mode collapse can still occur** despite the cycle loss, especially without the identity term.
- **Quality lags behind paired methods.** When paired data exists, pix2pix and its successors produce sharper, more accurate results.
- **Slow inference relative to feed-forward style transfer networks.** A forward pass through the ResNet-style generator at 256 by 256 takes longer than the small networks used for fixed-style stylization.

## Variants and extensions

CycleGAN spawned a large family of follow-up architectures. The table below summarizes the most influential variants and how they differ from the original.

| Method | Year | Venue | Authors | Key idea relative to CycleGAN |
|---|---|---|---|---|
| DualGAN | 2017 | ICCV | Yi et al. | Concurrent work with the same dual-generator and reconstruction-loss design |
| DiscoGAN | 2017 | ICML | Kim et al. | Concurrent work, also uses cross-domain reconstruction |
| UNIT | 2017 | NeurIPS | Liu, Breuel, Kautz | Adds a shared latent-space assumption between domains using weight-tied encoders |
| BicycleGAN | 2017 | NeurIPS | Zhu et al. | Multimodal output for the paired setting |
| pix2pixHD | 2018 | CVPR | Wang et al. | High-resolution paired translation |
| MUNIT | 2018 | ECCV | Huang, Liu, Belongie, Kautz | Disentangles content and style codes for multimodal unpaired translation |
| DRIT | 2018 | ECCV | Lee et al. | Disentangled representation for diverse outputs |
| StarGAN | 2018 | CVPR | Choi et al. | Single generator covers many domains using a domain label |
| StarGAN v2 | 2020 | CVPR | Choi et al. | Multi-domain plus multimodal |
| FUNIT | 2019 | ICCV | Liu et al. | Few-shot unsupervised translation |
| U-GAT-IT | 2020 | ICLR | Kim et al. | Attention modules for selfie-to-anime style change |
| CUT | 2020 | ECCV | Park, Efros, Zhang, Zhu | Replaces cycle consistency with a patch-level contrastive loss; one-sided translation, faster training |
| CycleGAN-VC | 2018 | EUSIPCO | Kaneko, Kameoka | Voice conversion with the same cycle-consistency principle |
| CycleGAN-VC2 | 2019 | ICASSP | Kaneko et al. | Improved generator and two-step adversarial loss |

More recent work has begun to replace GANs with diffusion models for unpaired translation (for example UNIT-DDPM and SDEdit-style methods), but the cycle-consistency principle continues to appear as a regularizer in many of these models.

## Relation to other GAN architectures

The table below compares CycleGAN to other major GAN architectures. The pairing column indicates whether aligned (input, output) pairs are required at training time. The multi-domain column indicates whether a single trained model handles many target domains. The multimodal column indicates whether the model can produce diverse outputs for the same input.

| Architecture | Year | Pairs required | Multi-domain | Multimodal | Key paper |
|---|---|---|---|---|---|
| Vanilla GAN | 2014 | n/a | n/a | n/a | Goodfellow et al., NeurIPS 2014 |
| DCGAN | 2015 | n/a | n/a | n/a | Radford, Metz, Chintala, ICLR 2016 |
| Conditional GAN | 2014 | depends | yes via label | no | Mirza and Osindero, arXiv 1411.1784 |
| pix2pix | 2017 | yes | no | no | Isola et al., CVPR 2017 |
| CycleGAN | 2017 | no | no | no | Zhu et al., ICCV 2017 |
| UNIT | 2017 | no | no | no | Liu et al., NeurIPS 2017 |
| StarGAN | 2018 | no | yes | no | Choi et al., CVPR 2018 |
| MUNIT | 2018 | no | no | yes | Huang et al., ECCV 2018 |
| BicycleGAN | 2017 | yes | no | yes | Zhu et al., NeurIPS 2017 |
| CUT | 2020 | no | no | no | Park et al., ECCV 2020 |
| StyleGAN | 2018 | n/a (unconditional) | n/a | yes | Karras, Laine, Aila, CVPR 2019 |
| BigGAN | 2018 | n/a (class-conditional) | yes | yes | Brock, Donahue, Simonyan, ICLR 2019 |

The [Wasserstein GAN (WGAN)](wgan) loss can be substituted for the LSGAN loss in CycleGAN, and several follow-up papers have done so to gain training stability on harder datasets. Because its output is conditioned on an input image, CycleGAN belongs to the family of conditional generative [image-to-image models](image-to-image_models).

## Applications

The table below organizes the most common deployment areas for CycleGAN and the cycle-consistency idea.

| Application area | Description | Representative work |
|---|---|---|
| Artistic style transfer | Photograph to Monet, Van Gogh, Ukiyo-e, Cezanne and back | Original CycleGAN paper, 2017 |
| Domain adaptation for self-driving | Synthetic GTA-V renders translated to real Cityscapes-style images for training perception models | Hoffman et al., CyCADA, ICML 2018 |
| Medical image modality transfer | CT to MR and MR to CT for treatment planning, segmentation, and dose calculation | Wolterink et al. 2017; Hiasa et al. 2018; many follow-ups |
| Sim-to-real for robotics | Translating rendered camera images to photorealistic ones, or vice versa | Various Berkeley and Google Brain papers, 2018 onward |
| Voice conversion | Speaker identity transfer without parallel utterance pairs | CycleGAN-VC, CycleGAN-VC2, MaskCycleGAN-VC by Kaneko and Kameoka |
| Aerial and satellite imagery | Map style transfer, day-to-night, season change, cross-sensor adaptation | Multiple remote sensing papers |
| Data augmentation | Synthesizing extra training images in the minority class to balance datasets, especially in medical imaging | Multiple medical AI papers |
| Privacy and de-identification | Translating face images to anonymized but realistic substitutes | Various face anonymization papers |
| Text style transfer | Cycle consistency adapted to sequence models for politeness, formality, sentiment changes | Shen et al. 2017; many follow-ups |
| Art and design tooling | Powering creative tools like Runway ML, Replicate.com demos, and many web apps | Community projects |

CycleGAN remains in active production use for stylization tasks, and pretrained CycleGAN models are still distributed on Hugging Face, Replicate, and the official PyTorch repository, more than eight years after the original paper.

## Influence and legacy

The CycleGAN paper has been cited tens of thousands of times. According to Google Scholar, the citation count passed 30,000 in 2024 and continues to climb. The cycle-consistency principle has been adapted well beyond images: language pairs in unsupervised machine translation (Lample et al. 2018), graph-to-graph translation, and even cross-modal embedding alignment all use variants of the same idea.

More broadly, CycleGAN demonstrated that adversarial training plus a self-supervised consistency constraint can solve problems that previously seemed to require strong supervision. It established "unpaired image-to-image translation" as a recognized task with its own benchmarks and evaluation protocols. The horse-to-zebra demo became a canonical example used in textbooks and courses to introduce GANs.

From an engineering point of view, the open-sourcing of the official PyTorch repository (junyanz/pytorch-CycleGAN-and-pix2pix) had an outsized effect. The repo combines the pix2pix and CycleGAN codebases, ships pretrained models for the most common domain pairs, and remains one of the most starred computer vision repositories on GitHub. It has been ported to TensorFlow, Keras, MXNet, JAX, and most other frameworks. Many later projects, including pix2pixHD, BicycleGAN, MUNIT, and CUT, were built directly on the same code structure.

Despite the rise of diffusion models and large pretrained image models for general-purpose translation tasks, CycleGAN and its descendants are still widely used in practice. They are smaller, faster, and easier to train than diffusion alternatives, and for narrow tasks with limited data they often remain the most cost-effective choice.

## Code and reproducibility

The official implementation lives at github.com/junyanz/pytorch-CycleGAN-and-pix2pix. The repository ships:

- Training scripts for both CycleGAN and pix2pix.
- Dataset download scripts for the standard benchmarks (horse2zebra, summer2winter_yosemite, monet2photo, vangogh2photo, ukiyoe2photo, cezanne2photo, cityscapes, facades, maps, apple2orange).
- Pretrained model weights.
- Docker image, Conda environment file, Jupyter notebooks, and Google Colab notebooks.
- Multi-GPU training via torchrun and W&B logging integration.

The repository was originally released alongside the paper in 2017 and has been actively maintained since. As of 2024 it supports Python 3.11 and PyTorch 2.4. A separate Lua/Torch repository (junyanz/CycleGAN) preserves the original implementation used in the ICCV submission.

Unofficial ports and pretrained checkpoints are available on Hugging Face Hub, Replicate, and many community GitHub repositories. The generators are small enough to run on a single consumer GPU at inference time. Most of the computation sits in the residual-block stack, and even older GPUs can translate 256 by 256 images at interactive rates.

## References

- Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks." *Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017*. arXiv:1703.10593.
- Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017*. arXiv:1611.07004.
- Yi, Z., Zhang, H., Tan, P., and Gong, M. (2017). "DualGAN: Unsupervised Dual Learning for Image-to-Image Translation." *ICCV 2017*. arXiv:1704.02510.
- Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. (2017). "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks (DiscoGAN)." *ICML 2017*. arXiv:1703.05192.
- Liu, M.-Y., Breuel, T., and Kautz, J. (2017). "Unsupervised Image-to-Image Translation Networks (UNIT)." *NeurIPS 2017*. arXiv:1703.00848.
- Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. (2018). "Multimodal Unsupervised Image-to-Image Translation (MUNIT)." *ECCV 2018*. arXiv:1804.04732.
- Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. (2018). "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." *CVPR 2018*. arXiv:1711.09020.
- Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y. (2020). "Contrastive Learning for Unpaired Image-to-Image Translation (CUT)." *ECCV 2020*. arXiv:2007.15651.
- Goodfellow, I. et al. (2014). "Generative Adversarial Nets." *NeurIPS 2014*. arXiv:1406.2661.
- Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z., and Smolley, S. P. (2017). "Least Squares Generative Adversarial Networks (LSGAN)." *ICCV 2017*. arXiv:1611.04076.
- Johnson, J., Alahi, A., and Fei-Fei, L. (2016). "Perceptual Losses for Real-Time Style Transfer and Super-Resolution." *ECCV 2016*. arXiv:1603.08155.
- Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., and Webb, R. (2017). "Learning from Simulated and Unsupervised Images through Adversarial Training." *CVPR 2017*. arXiv:1612.07828.
- Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). "Instance Normalization: The Missing Ingredient for Fast Stylization." arXiv:1607.08022.
- Kaneko, T. and Kameoka, H. (2018). "CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks." *EUSIPCO 2018*.
- Kaneko, T., Kameoka, H., Tanaka, K., and Hojo, N. (2019). "CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion." *ICASSP 2019*. arXiv:1904.04631.
- Hoffman, J. et al. (2018). "CyCADA: Cycle-Consistent Adversarial Domain Adaptation." *ICML 2018*. arXiv:1711.03213.
- CycleGAN project page: https://junyanz.github.io/CycleGAN/
- Official PyTorch implementation: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
- Jun-Yan Zhu personal homepage, Carnegie Mellon University: https://www.cs.cmu.edu/~junyanz/

