# AI Image Generation

> Source: https://aiwiki.ai/wiki/ai_image_generation
> Updated: 2026-06-21
> Categories: Artificial Intelligence, Computer Vision, Generative AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

AI image generation is the use of [artificial intelligence](/wiki/artificial_intelligence) systems to create visual content, including photographs, illustrations, paintings, concept art, and graphic designs, from text descriptions, reference images, or other inputs. Most modern AI image generators are built on [diffusion models](/wiki/diffusion_model), which learn to create images by reversing a noise-addition process, and are conditioned on text through encoders such as [CLIP](/wiki/clip)'s text encoder or T5, so that users can generate an image simply by typing what they want to see. The technology has advanced from producing blurry, incoherent outputs to generating photorealistic images that are often indistinguishable from real photographs. Since 2022, AI image generation has become one of the most widely used and most controversial applications of [generative AI](/wiki/generative_ai), reshaping creative industries while raising fundamental questions about copyright, artistic authorship, and the nature of creativity itself.

The scale of adoption is large. In the two weeks after Google launched its Gemini 2.5 Flash Image model ("Nano Banana") on August 26, 2025, the Gemini app gained more than 23 million new users and people generated more than 500 million images [18]. The results from frontier models can be stunning: coherent scenes with accurate lighting, realistic textures, legible text, and complex compositions involving multiple objects and characters.

## History

The history of AI image generation spans over a decade, progressing through several distinct technological eras before reaching the current state of the art.

### DeepDream (2015)

One of the earliest demonstrations of neural networks producing visual art came in June 2015, when Google engineer Alexander Mordvintsev published DeepDream [1]. The technique, introduced under the name "Inceptionism," repurposed a [convolutional neural network](/wiki/convolutional_neural_network) trained for image classification. Instead of using the network to label an image, Mordvintsev adjusted the input image itself (by gradient ascent on the network's activations) so that it amplified whatever patterns the network detected. A network trained to recognize dogs would enhance dog-like features; one trained on architectural features would produce building-like hallucinations.

The results were surreal, psychedelic images filled with eyes, animal faces, and fractal patterns layered over ordinary photographs. While DeepDream was more of an artistic curiosity than a practical generation tool, it captured public imagination and demonstrated that neural networks contained latent visual knowledge that could be extracted and amplified.

### Generative Adversarial Networks (2014-2021)

The first serious approach to AI image generation came through [generative adversarial networks](/wiki/generative_adversarial_network) (GANs), introduced by Ian Goodfellow and colleagues in 2014 [2]. A GAN consists of two neural networks trained in opposition: a generator that creates images and a discriminator that tries to distinguish generated images from real ones. The original paper framed this as a game: "We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G" [2]. Through this adversarial training, the generator progressively improves until its outputs become difficult to tell apart from real data.
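The two opposing objectives can be written out directly. The following is a minimal sketch rather than code from the paper: it assumes the discriminator's outputs are already available as probabilities strictly between 0 and 1, and the function names are illustrative.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Loss D minimizes: -(mean log D(x) + mean log(1 - D(G(z))))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Minimax form: G minimizes mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# d_real: D's probabilities on real images; d_fake: on generated images.
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

A discriminator that separates real from generated images well gets a low loss (`confident_d`), while one that can only guess gets a higher loss (`fooled_d`). The generator's loss falls as the discriminator assigns higher "real" probabilities to generated images.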

GANs evolved rapidly through several major variants:

| Model | Year | Key Innovation |
|-------|------|---------------|
| Original GAN | 2014 | Adversarial training framework |
| DCGAN | 2015 | Deep convolutional architecture for stable training |
| Progressive GAN | 2017 | Gradually increasing resolution during training |
| StyleGAN | 2018 | Style-based generator with fine-grained control |
| StyleGAN2 | 2019 | Eliminated artifacts, improved quality |
| StyleGAN3 | 2021 | Alias-free generation |

StyleGAN, published in December 2018 by NVIDIA researchers Tero Karras, Samuli Laine, and Timo Aila, could produce photorealistic human faces at 1024x1024 resolution that were virtually indistinguishable from real photographs, trained on the Flickr-Faces-HQ (FFHQ) dataset of 70,000 high-quality face images [19]. In February 2019, software engineer Phillip Wang used the model to build the website "This Person Does Not Exist," which displayed a new StyleGAN-generated face on every page reload and became a viral sensation that brought public awareness to the capabilities and risks of AI-generated imagery.

However, GANs had significant limitations. They were difficult to train (suffering from mode collapse and training instability), struggled with complex multi-object scenes, and could not be easily conditioned on text descriptions. Generating a specific scene described in natural language was not feasible with GAN architectures.

### CLIP and Text-Image Understanding (2021)

A critical enabler of modern text-to-image generation was CLIP (Contrastive Language-Image [Pre-training](/wiki/pre-training)), introduced by [OpenAI](/wiki/openai) in January 2021 [3]. CLIP was trained on 400 million image-text pairs scraped from the internet, learning to map images and text descriptions into a shared embedding space [3]. Given an image and a set of text descriptions, CLIP could determine which description best matched the image, and vice versa.
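Matching a description to an image in a shared embedding space reduces to comparing vectors. The sketch below illustrates the idea with made-up three-dimensional embeddings; real CLIP embeddings have hundreds of dimensions and come from trained encoders, and the temperature value here is an assumed constant rather than a learned one.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def caption_probabilities(image_embedding, caption_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities."""
    logits = [cosine_similarity(image_embedding, c) / temperature
              for c in caption_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings, chosen by hand for illustration only.
image = [0.9, 0.1, 0.2]
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}
probs = caption_probabilities(image, list(captions.values()))
best_caption = max(zip(probs, captions))[1]
```

The caption whose embedding points in nearly the same direction as the image embedding receives almost all of the probability mass.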

CLIP's significance for image generation was that it provided a bridge between language and vision. By combining CLIP with a generative model, researchers could guide image generation using natural language. Early CLIP-guided methods like CLIP+VQGAN (2021) achieved basic text-to-image generation, though the results were often abstract and lacked coherence.

### DALL-E (January 2021)

OpenAI announced DALL-E in January 2021, the first major text-to-image model to demonstrate convincing generation from complex text prompts [4]. Named as a portmanteau of Salvador Dali and the Pixar character WALL-E, the original DALL-E used a modified version of [GPT-3](/wiki/gpt-3) to generate image tokens from text descriptions. It could produce creative compositions like "an armchair in the shape of an avocado" with reasonable quality, demonstrating that large language models could learn visual generation.

### The Diffusion Revolution (2022)

The year 2022 was the inflection point for AI image generation, driven by the adoption of [diffusion models](/wiki/diffusion_model) as the dominant generation paradigm.

**DALL-E 2 (April 2022)** was announced by OpenAI on April 6, 2022, replacing the original's token-based approach with a diffusion model conditioned on CLIP image embeddings [20]. The quality improvement was dramatic: DALL-E 2 produced photorealistic images at up to 1024x1024 resolution, four times the 256x256 side length of the original DALL-E (sixteen times the pixel count), with accurate composition and lighting [20]. OpenAI initially restricted access through a waitlist before gradually expanding availability.

**Midjourney (July 2022)** entered open beta and quickly gained a reputation for producing highly artistic, aesthetically refined images [5]. Operating through a Discord bot interface, Midjourney focused on visual beauty rather than photorealism, attracting artists, designers, and creatives. Its distinctive aesthetic style set it apart from competing tools.

**Stable Diffusion (August 2022)** was released by [Stability AI](/wiki/stability_ai) on August 22, 2022, in collaboration with the CompVis research group at LMU Munich, Runway, and the LAION dataset effort, under the permissive CreativeML Open RAIL-M license [21]. This was a watershed moment: Stable Diffusion was the first capable text-to-image model with publicly downloadable weights, in contrast to the closed DALL-E 2 and Imagen. The base model carried roughly 860 million parameters in its U-Net denoiser and could be downloaded and run on consumer-grade GPUs with as little as 8GB of VRAM. The open release democratized AI image generation, spawning a massive ecosystem of fine-tuned models, community extensions, custom UIs ([ComfyUI](/wiki/comfyui), Automatic1111), and specialized applications.

### GPT-4o Native Image Generation (March 2025)

On March 25, 2025, OpenAI enabled native image generation in GPT-4o, its multimodal model, for [ChatGPT](/wiki/chatgpt) users [7]. Unlike previous systems where image generation was handled by a separate model (DALL-E), GPT-4o generated images as a native capability of the language model itself. This architectural integration meant the model could leverage its full knowledge base and conversation context when creating images, resulting in superior prompt adherence, especially for complex multi-object scenes. According to OpenAI, the model can generate images with "up to 10-20 different objects" while maintaining tight binding between objects and their described attributes, though it may struggle to accurately render more [7]. The release triggered the viral "Ghiblification" trend and brought AI image generation to an unprecedented number of users.

## How does AI image generation work?

Most modern AI image generators are built on diffusion models, with text conditioning provided by a text encoder such as CLIP's or T5. Understanding the pipeline requires examining several components.

### Diffusion Models

Diffusion models generate images by learning to reverse a noise-addition process [8]. During training, the model takes a real image, progressively adds Gaussian noise to it over many steps until it becomes pure random noise, and then learns to reverse this process. The model is trained as a denoiser: given a noisy image at any step in the process, it predicts and removes the noise to recover a slightly cleaner version. The foundational Denoising Diffusion Probabilistic Models paper by Ho, Jain, and Abbeel (2020) described the approach as "a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time" [8].
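The noise-addition half of the process has a closed form, which makes training examples cheap to produce. The sketch below assumes the linear noise schedule reported in the DDPM paper (1,000 steps, with per-step variance rising from 0.0001 to 0.02) and treats an image as a flat list of pixel values. The function names are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta_t): the fraction of signal variance left at step t."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise  # the noise is the target the denoiser learns to predict

alpha_bars = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)
pixels = [0.5, -0.25, 1.0]
slightly_noisy, _ = add_noise(pixels, 10, alpha_bars, rng)
almost_pure_noise, target_noise = add_noise(pixels, 999, alpha_bars, rng)
```

At early steps nearly all of the original signal survives. By the final step the signal scale is close to zero and the sample is essentially Gaussian noise.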

During generation, the model starts with pure random noise and iteratively denoises it, gradually transforming the noise into a coherent image. The original DDPM formulation used 1,000 denoising steps; with faster samplers, current tools typically use 20 to 50. Each denoising step moves the image slightly closer to the learned distribution of real images.
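The generation loop can be sketched in a few lines. This example uses the deterministic DDIM-style update (one of several samplers, chosen here because it works with a small number of steps) and, in place of a trained network, a hypothetical "oracle" noise predictor that already knows the target image. The oracle exists only to make the loop runnable and checkable; a real system would call its denoising network at that point.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def sample(predict_noise, alpha_bars, size, num_inference_steps=25, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    last = len(alpha_bars) - 1
    # Evenly spaced subset of training timesteps, from noisiest to cleanest.
    timesteps = [round(last * i / (num_inference_steps - 1))
                 for i in reversed(range(num_inference_steps))]
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        # Estimate the clean image, then re-noise it to the next (lower) noise level.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_pred, eps)]
    return x

target = [0.5, -0.25, 1.0]
alpha_bars = alpha_bar_schedule()

def oracle_noise(x, t):
    """Stand-in for a trained denoiser: the exact noise separating x from the target."""
    a_t = alpha_bars[t]
    return [(xi - math.sqrt(a_t) * ti) / math.sqrt(1.0 - a_t)
            for xi, ti in zip(x, target)]

result = sample(oracle_noise, alpha_bars, size=len(target))
```

With a perfect noise predictor the loop lands on the target. A trained network only approximates that prediction, which is why it needs multiple steps to refine the image.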

### Latent Diffusion

A key practical innovation, introduced in the Latent Diffusion Models paper by Rombach et al. (2022), is performing the diffusion process in a compressed latent space rather than in pixel space [6]. An encoder compresses the image into a lower-dimensional latent representation, the diffusion process operates in this latent space (which is computationally much cheaper), and a decoder reconstructs the final image from the denoised latent representation. This approach, used by Stable Diffusion and most subsequent models, dramatically reduced the computational cost of image generation while maintaining high quality.
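The savings are easy to quantify. The figures below are assumptions for illustration, not taken from this article: they use the commonly cited Stable Diffusion v1 configuration, in which the autoencoder downsamples each side by a factor of 8 and produces 4 latent channels.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * channels) / (h * w * c)

shape = latent_shape(512, 512)       # (64, 64, 4)
ratio = compression_ratio(512, 512)  # 48.0
```

Under these assumptions a 512x512 RGB image (786,432 values) becomes a 64x64x4 latent (16,384 values), so each denoising step processes 48 times fewer values than it would in pixel space.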

### Text Conditioning

Text-to-image models are conditioned on text descriptions through one of several approaches:

**CLIP conditioning.** DALL-E 2 and early Stable Diffusion versions use CLIP text embeddings to guide the diffusion process. The text prompt is encoded into a vector by CLIP's text encoder, and this vector is injected into the denoising network via cross-attention layers, steering the generation toward images that match the text description.

**T5 conditioning.** Newer models like [Imagen](/wiki/imagen) (Google) and Stable Diffusion 3 use the T5 text encoder, a large language model that provides richer semantic understanding of text prompts. T5's deeper language understanding improves the model's ability to follow complex, multi-clause prompts.

**Dual encoder conditioning.** State-of-the-art models like Stable Diffusion 3 and Flux use both CLIP and T5 encoders simultaneously, combining CLIP's visual-semantic alignment with T5's linguistic depth for superior prompt adherence.
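Whichever encoder supplies the text embeddings, cross-attention is a common way to inject them: each image position issues a query, and the text tokens supply keys and values. The sketch below is a single-head version that omits the learned projection matrices a real layer applies to queries, keys, and values. All vectors are made-up toy values.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_queries, text_keys, text_values):
    """For each image position, return a weighted mix of the text value vectors."""
    d = len(text_keys[0])
    outputs = []
    for q in image_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
    return outputs

# Two image positions, two text tokens (hypothetical values).
queries = [[10.0, 0.0], [0.0, 10.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
attended = cross_attention(queries, keys, values)
```

Each image position ends up dominated by the text token whose key best matches its query, which is how different words come to influence different regions of the picture.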

### Classifier-Free Guidance

Classifier-free guidance (CFG) is a technique that controls how strongly the text prompt influences the generated image. During training, the text condition is randomly replaced with a null (empty) prompt for a fraction of examples, so a single network learns both conditional and unconditional denoising. During generation, the model produces both a conditioned prediction (guided by the text) and an unconditioned prediction (without text) at each step. The final prediction extrapolates from the unconditioned one toward the conditioned one, amplifying the difference between the two and strengthening the influence of the text prompt. Higher CFG values produce images that more closely match the prompt but may sacrifice diversity and naturalness.
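The weighted combination is a single line of arithmetic. This sketch uses the convention common in open-source tools, where a guidance scale of 1 reproduces the conditioned prediction unchanged. Other write-ups parameterize the same idea slightly differently, and the function name is illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditioned prediction toward the conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

uncond = [0.0, 1.0]
cond = [1.0, 1.0]
guided = cfg_combine(uncond, cond, guidance_scale=7.5)  # [7.5, 1.0]
```

Only the components where the two predictions disagree are amplified. Where the text makes no difference, the output is unchanged.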

## Major Models (2025-2026)

The following table summarizes the leading AI image generation models as of early 2026.

| Model | Developer | Release | Architecture | Key Strengths |
|-------|-----------|---------|-------------|---------------|
| [Midjourney](/wiki/midjourney) v8 | Midjourney | March 2026 | Proprietary diffusion | 5x faster, native 2K images, improved text rendering |
| [Midjourney](/wiki/midjourney) v7 | Midjourney | April 2025 | Proprietary diffusion | Artistic quality, aesthetic refinement |
| GPT Image 1 / GPT-4o | [OpenAI](/wiki/openai) | March 2025 | Native multimodal transformer | Complex prompts, text rendering, conversation context |
| Gemini 2.5 Flash Image (Nano Banana) | [Google DeepMind](/wiki/google_deepmind) | August 2025 | Native multimodal transformer | Character consistency, conversational editing, $0.039/image |
| [Stable Diffusion](/wiki/stable_diffusion) 3.5 | Stability AI | October 2024 | MMDiT (multimodal diffusion transformer) | Open weights, customizable, strong text rendering |
| Flux 1.1 Pro | Black Forest Labs | 2024-2025 | 12B parameter transformer | Photorealism, commercial quality, fast (4.5s) |
| Imagen 4 | [Google DeepMind](/wiki/google_deepmind) | GA February 2026 | Diffusion with T5-XXL | Photorealism, 2K resolution, improved faces |
| Adobe Firefly Image 5 | Adobe | 2025 | Proprietary diffusion | Commercial safety, Photoshop integration, 4MP native |
| DALL-E 3 | OpenAI | October 2023 | Diffusion with improved captioning | Prompt following, safety features |

### Midjourney v8

Released in alpha on March 17, 2026, Midjourney v8 improves substantially on v7, with five times faster generation, native 2K resolution output, and markedly better text rendering within images [5][17]. A follow-up v8.1 Alpha launched April 14, 2026, adding further speed improvements and making HD mode three times faster and cheaper. The model handles complex multi-element prompts with higher fidelity to specified color palettes, spatial arrangements, and material textures.

### Midjourney v7

Released in April 2025, Midjourney v7 brought major aesthetic improvements over prior versions, producing images with distinctive lighting, composition, and stylistic coherence that appeal to artists and designers. The service continued to expand its web application alongside its original Discord-based interface.

### GPT-4o Native Generation

OpenAI's approach of embedding image generation directly within its multimodal language model represents an architectural departure from dedicated image generation systems. Because GPT-4o integrates visual generation with its language understanding, it excels at complex prompts involving specific spatial relationships, accurate text rendering, and scene compositions with many distinct objects [7]. The model's ability to reference conversation history and uploaded images for visual inspiration makes it particularly versatile.

### Nano Banana (Gemini 2.5 Flash Image)

Gemini 2.5 Flash Image, widely known by its codename "Nano Banana," is Google's native image generation and editing model, launched on August 26, 2025 [18]. Like GPT-4o, it integrates image generation directly into a multimodal model rather than relying on a separate diffusion system, and Google positioned it as a "state-of-the-art image model" specializing in blending multiple input images into one, maintaining character consistency across a series of edits, and making targeted natural-language transformations [18]. It is available through the Gemini API, Google AI Studio, and Vertex AI, priced at $30 per million output tokens, with each image consuming 1,290 output tokens, or about $0.039 per image [18]. Adoption was rapid: between launch on August 26 and September 9, 2025, the Gemini app added more than 23 million new users and generated more than 500 million images [18].
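The per-image figure follows directly from the token pricing quoted above. A small sketch of the arithmetic, with an illustrative function name:

```python
def cost_per_image(price_per_million_tokens, tokens_per_image):
    return price_per_million_tokens * tokens_per_image / 1_000_000

nano_banana_cost = cost_per_image(price_per_million_tokens=30, tokens_per_image=1290)
rounded = round(nano_banana_cost, 3)  # 0.039
```

The exact value is $0.0387, which rounds to the quoted $0.039 per image.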

### Flux

Flux, developed by [Black Forest Labs](/wiki/black_forest_labs) (founded by former Stability AI researchers, including key architects of the original Stable Diffusion), emerged as a leading model for photorealism and commercial use. Built on a 12-billion-parameter transformer architecture, Flux 1.1 Pro generates high-quality images in approximately 4.5 seconds and competes directly with proprietary models while offering open-weight variants for local deployment [9].

### Stable Diffusion 3.5

Released in October 2024, Stable Diffusion 3.5 brought major improvements in image quality, prompt adherence, and text rendering, significantly closing the gap with proprietary models [6]. It uses a Multimodal Diffusion [Transformer](/wiki/transformer) (MMDiT) architecture with joint attention over image and text tokens. The open-weight release maintains Stable Diffusion's position as the foundation of the community-driven image generation ecosystem.

### Imagen 4

Google DeepMind's Imagen 4 was announced at Google I/O in May 2025 and reached general availability in February 2026 through the Gemini API, Google AI Studio, and Vertex AI [10b]. The model comes in three tiers: Imagen 4 Ultra (maximum prompt fidelity), Imagen 4 Flagship ($0.04 per image), and Imagen 4 Fast ($0.02 per image, speed-optimized). Imagen 4 and Imagen 4 Ultra support generation at up to 2K resolution. The release brought notable improvements in human face generation and natural scene photorealism compared to Imagen 3.

### Adobe Firefly

[Adobe Firefly](/wiki/adobe_firefly) distinguishes itself through its focus on commercial safety. Adobe trains Firefly exclusively on licensed content (Adobe Stock, public domain works, and content Adobe has rights to use), providing clear intellectual property provenance. Firefly Image Model 5, released in 2025, generates photorealistic images at native 4-megapixel resolution [10]. Adobe has also integrated third-party models including Google Imagen, Flux, and GPT image generation into its Firefly platform, allowing creators to switch between models.

## Techniques

Beyond basic text-to-image generation, a rich ecosystem of techniques provides fine-grained control over the generation process.

### Image-to-Image (img2img)

Image-to-image generation takes an existing image as input along with a text prompt, and generates a new image that preserves the overall structure of the input while applying the changes described in the prompt. The technique works by adding noise to the input image (partially destroying it) and then denoising it with text guidance. A "denoising strength" parameter controls how much the output differs from the input: low values produce subtle modifications, while high values allow dramatic transformation.
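One common way to implement the strength parameter, assumed here for illustration rather than specified by any particular tool, is to skip the early part of the denoising schedule: the input image is noised to an intermediate level and only the remaining steps are run.

```python
def img2img_steps(num_inference_steps, strength):
    """Return the indices of the denoising steps that actually run."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step = num_inference_steps - steps_to_run
    return list(range(first_step, num_inference_steps))

subtle = img2img_steps(50, 0.5)    # runs the last 25 of 50 steps
full = img2img_steps(50, 1.0)      # runs all 50: the input is fully replaced by noise
untouched = img2img_steps(50, 0.0) # runs none: the input is returned as-is
```

Because fewer steps run at low strength, there is less opportunity for the output to drift away from the input's structure.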

### Inpainting

Inpainting allows selective editing of specific regions within an image. The user provides a mask indicating which part of the image to regenerate and a text prompt describing what should appear in the masked region. The model generates new content for the masked area while preserving the unmasked portions and maintaining visual coherence at the boundaries. Inpainting is used for removing unwanted objects, adding new elements, and fixing imperfections.
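The preservation of unmasked regions can be sketched as a per-pixel blend. This is a simplified illustration with hypothetical names: in diffusion pipelines a blend of this kind is often applied in latent space at each denoising step, and some models are instead trained to take the mask as an extra input.

```python
def blend_with_mask(original, generated, mask):
    """mask = 1 where new content is wanted, 0 where the original must be kept."""
    return [m * g + (1.0 - m) * o for o, g, m in zip(original, generated, mask)]

original = [0.2, 0.4, 0.6, 0.8]
generated = [1.0, 1.0, 1.0, 1.0]
mask = [0.0, 0.0, 1.0, 0.5]  # 0.5 models a feathered edge
result = blend_with_mask(original, generated, mask)  # [0.2, 0.4, 1.0, 0.9]
```

Fractional mask values at the boundary mix old and new content, which is one way to avoid a visible seam.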

### ControlNet

ControlNet, introduced by Lvmin Zhang and Maneesh Agrawala in 2023, adds structural control to diffusion models [11]. It accepts additional conditioning inputs such as edge maps (Canny edges), depth maps, human pose skeletons (OpenPose), segmentation maps, or normal maps. These inputs guide the spatial composition of the generated image while the text prompt controls the style and content. For example, a developer can provide a pose skeleton of a person sitting and a text prompt describing a businessman in a suit, and the model will generate an image matching both the pose and the description. ControlNet has become essential for professional workflows that require precise compositional control.
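The ControlNet paper attaches its control branch to the frozen base model through "zero convolutions," 1x1 convolution layers whose weights start at zero, so that training begins from the base model's unmodified behavior. The sketch below reduces that connection to a single scalar weight purely for illustration. The names and values are hypothetical.

```python
def add_control(base_features, control_features, zero_conv_weight=0.0):
    """Add the control branch's output through a scalar stand-in for a zero convolution."""
    return [b + zero_conv_weight * c
            for b, c in zip(base_features, control_features)]

base = [0.5, -1.0, 2.0]
control = [1.0, 1.0, -1.0]  # e.g. features derived from an edge or pose map

at_initialization = add_control(base, control)                     # identical to base
after_training = add_control(base, control, zero_conv_weight=0.5)  # [1.0, -0.5, 1.5]
```

Starting from a zero weight means the added branch cannot degrade the pretrained model at the outset. Its influence grows only as training moves the weight away from zero.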

### LoRA Fine-Tuning

[LoRA](/wiki/lora) (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that allows users to train a diffusion model on a small set of images to learn specific styles, subjects, or concepts [12]. Rather than retraining the entire model (which requires enormous compute), LoRA freezes the original weights and trains only a pair of small low-rank matrices for each adapted layer, usually in the model's attention layers. The resulting file is typically a few megabytes to a few hundred megabytes, far smaller than the multi-gigabyte base model. A user can train a LoRA on 10 to 20 images of a specific person, art style, or product, then use it to generate new images featuring that subject or style. LoRA models are lightweight, stackable (multiple LoRAs can be combined), and have become the primary method for personalizing image generation.
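The parameter savings come from replacing a full weight update with the product of two thin matrices. The sketch below follows the formulation in the LoRA paper, where the adapted weight is W + (alpha / r) * B * A. The 768-wide layer and the rank of 4 are illustrative assumptions, not values from this article.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Trainable parameters for a full update versus a rank-r update."""
    full = d_out * d_in
    low_rank = rank * (d_in + d_out)
    return full, low_rank

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, with A of shape (r, d_in) and B of shape (d_out, r)."""
    rank = len(A)
    scale = alpha / rank
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

full, low_rank = lora_parameter_counts(d_out=768, d_in=768, rank=4)  # 589824 vs 6144

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 2.0]]              # rank-1 factors (toy values)
B = [[1.0], [0.0]]
merged = merge_lora(W, A, B, alpha=1.0)  # [[2.0, 2.0], [0.0, 1.0]]
```

For this layer the low-rank update trains 96 times fewer parameters than a full update. Because merging is a simple addition, several LoRAs can be applied to the same base weight.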

### IP-Adapter

IP-Adapter (Image [Prompt](/wiki/prompt) Adapter), developed by [Tencent AI](/wiki/tencent_ai) Lab, enables image-based prompting for diffusion models [13]. Instead of describing a desired style or subject in text, the user provides a reference image, and IP-Adapter extracts visual features that guide generation. The adapter uses an image encoder to extract features from the reference image and injects them into the diffusion model's cross-attention layers. This approach is particularly effective for style transfer (generating new images in the style of a reference), face consistency (maintaining a character's appearance across multiple images), and combining visual references with text prompts.

### Technique Comparison

| Technique | Input | Control Type | Use Case |
|-----------|-------|-------------|----------|
| Text-to-image | Text prompt | Semantic content and style | General image creation |
| img2img | Image + text | Structure preservation with modifications | Variations, style transfer |
| Inpainting | Image + mask + text | Selective region editing | Object removal, addition, fixing |
| ControlNet | Structural map + text | Spatial composition (pose, edges, depth) | Professional workflows, precise layouts |
| LoRA | Small training dataset | Subject or style specialization | Character consistency, brand assets |
| IP-Adapter | Reference image + text | Visual style and subject transfer | Style matching, face consistency |

## What is AI image generation used for?

AI image generation has found applications across numerous industries and creative disciplines.

### Graphic Design and Marketing

Designers use AI image generation for rapid prototyping, concept exploration, and producing final assets for marketing campaigns, social media, and advertising. The ability to generate dozens of variations in minutes has accelerated creative workflows that previously took hours or days.

### Concept Art and Entertainment

Game studios, film production companies, and publishers use AI image generation for concept art, character design, environment visualization, and storyboarding. The technology enables rapid exploration of visual ideas in the early stages of production.

### E-Commerce and Product Visualization

Online retailers use AI to generate product images, lifestyle shots, and marketing materials. Products can be placed in different settings, shown from multiple angles, and combined with various backgrounds without physical photography.

### Architecture and Interior Design

Architects and designers use AI to visualize spaces, generate design variations, and present concepts to clients. Techniques like img2img and ControlNet are particularly useful for transforming sketches and floor plans into rendered visualizations.

### Education and Research

AI image generation supports educational content creation, scientific visualization, and research illustration. Researchers use the technology to generate synthetic training data for computer vision systems, reducing reliance on expensive manual data collection and annotation.

## Controversies

AI image generation has sparked intense debate about copyright, artistic rights, and the economic impact on creative professionals.

### Artist Lawsuits

In January 2023, three visual artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a class-action lawsuit against Stability AI, [Midjourney](/wiki/midjourney), and DeviantArt, alleging that these companies used their copyrighted artworks without permission to train image generation models [14]. The lawsuit, which seeks to represent a class of millions of artists, argues that AI-generated images in the style of specific artists constitute derivative works that infringe on the original artists' copyrights. The case remains actively litigated as of 2026.

### Getty Images v. Stability AI

Getty Images filed lawsuits against Stability AI in both the United States and the United Kingdom, alleging that Stable Diffusion was trained on over 12 million Getty images scraped from its websites without authorization [15]. The UK case produced a landmark ruling on November 4, 2025, the first UK judgment addressing copyright infringement in the context of generative AI training [15].

The UK High Court largely rejected Getty's claims. Getty accepted there was no evidence that the training of Stable Diffusion took place in the UK, and shortly before closing submissions it abandoned its primary copyright infringement and database right claims [15]. The court held that the model's weights were not an "infringing copy" of Getty's works because the model did not store copies of the underlying images. However, the court found limited trademark infringements under the Trade Marks Act 1994 where certain Stable Diffusion versions (1.x through 2.1) had reproduced watermarks resembling Getty's and iStock's registered marks, while finding no such infringement for SD XL or v1.6 [15]. The ruling established that territoriality matters profoundly: if AI training occurs outside the UK, a UK court may not consider primary copyright infringement claims related to that training.

### The Ghiblification Controversy

When GPT-4o's native image generation launched in March 2025, one of the first viral trends was "Ghiblification," the practice of transforming ordinary photos into images resembling the hand-drawn animation style of Studio Ghibli [16]. The trend ignited intense debate. Many artists viewed it as an attack on Studio Ghibli's painstaking craft and hand-drawn traditions. Studio Ghibli co-founder Hayao Miyazaki has long opposed AI-generated art. Shown an AI animation demo in a 2016 NHK documentary, he said, "I am utterly disgusted," adding, "Whoever creates this stuff has no idea what pain is whatsoever," and calling the technology "an insult to life itself" [22].

In November 2025, Studio Ghibli joined other Japanese publishers through the Content Overseas Distribution Association (CODA) to formally request that OpenAI refrain from using their content for machine learning without permission [16].

### Copyright Status of AI-Generated Images

The legal status of AI-generated images remains largely unresolved across jurisdictions. The U.S. Copyright Office has generally held that copyright requires human authorship, and purely AI-generated images without meaningful human creative input cannot be copyrighted. However, images that involve substantial human creative direction (selecting prompts, curating outputs, making modifications) exist in a legal gray area. The question of whether using copyrighted works for AI training constitutes fair use remains unsettled in U.S. courts.

### Impact on Artists and Illustrators

Commercial illustrators, stock photographers, and concept artists have reported significant declines in work as clients increasingly turn to AI-generated alternatives. Stock photography platforms have seen an influx of [AI-generated content](/wiki/ai_generated_content). Some platforms, including Getty Images, initially banned AI-generated uploads, while others have created separate categories for AI content. Adobe's approach of training on licensed content and offering built-in AI tools within Creative Cloud represents an attempt to balance AI capabilities with respect for creators' rights.

## Current State (2025-2026)

As of early 2026, AI image generation has entered a phase of maturation and consolidation.

### Quality Convergence

The gap between the top models has narrowed considerably. While Midjourney v7 leads in artistic aesthetics and GPT-4o excels at complex prompt following, most leading models now produce high-quality, photorealistic results. The days of obvious AI artifacts (mangled hands, incoherent backgrounds, distorted text) are largely past for frontier models, though lower-tier and older models still exhibit these issues.

### Integration Into Professional Tools

AI image generation has moved from standalone tools into integrated features within existing creative software. Adobe has embedded Firefly capabilities throughout Photoshop, Illustrator, and other Creative Cloud applications. Canva, Figma, and other design platforms have added AI generation features. This integration is normalizing AI as part of the creative toolkit rather than a separate, disruptive technology.

### Video and 3D Extension

Image generation models are increasingly serving as foundations for video and 3D generation. Techniques developed for image diffusion, including ControlNet, LoRA, and classifier-free guidance, have been adapted for [AI video generation](/wiki/ai_video_generation) and 3D asset creation. The boundaries between still image generation and other visual media continue to blur.

### Regulatory Developments

The [EU AI Act](/wiki/eu_ai_act), which reaches full application in August 2026, includes requirements for labeling AI-generated content and transparency about training data. The EU's Code of Practice on marking and labeling of AI-generated content is expected to be finalized in May to June 2026, establishing shared standards including secured metadata, watermarking protocols, and detection verification APIs. California's AI Transparency Act (effective January 1, 2026) also mandates disclosure requirements for AI-generated content from covered providers. The UK government was required to publish a full report on the use of copyright works in AI development by March 2026. These regulatory developments are beginning to shape how image generation models are trained, deployed, and used commercially.

## See Also

- [Diffusion Model](/wiki/diffusion_model)
- [Generative Adversarial Network](/wiki/generative_adversarial_network)
- [Generative AI](/wiki/generative_ai)
- [Midjourney](/wiki/midjourney)
- [Stable Diffusion](/wiki/stable_diffusion)
- [DALL-E](/wiki/dall-e)
- [AI Video Generation](/wiki/ai_video_generation)

## References

[1] Mordvintsev, A., Olah, C., and Tyka, M. (2015). "Inceptionism: Going Deeper into Neural Networks." Google Research Blog. https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

[2] Goodfellow, I., et al. (2014). "Generative Adversarial Nets." Proceedings of [NeurIPS](/wiki/neurips) 2014. https://arxiv.org/abs/1406.2661

[3] Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." arXiv:2103.00020. https://arxiv.org/abs/2103.00020

[4] Ramesh, A., et al. (2021). "Zero-Shot Text-to-Image Generation." arXiv:2102.12092. https://arxiv.org/abs/2102.12092

[5] "Midjourney vs DALL-E vs Stable Diffusion vs Flux 2026: Complete AI Image Generator Comparison." Free Academy, 2026. https://freeacademy.ai/blog/midjourney-vs-dalle-vs-stable-diffusion-vs-flux-comparison-2026

[6] Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. https://arxiv.org/abs/2112.10752

[7] "Introducing 4o Image Generation." OpenAI, March 2025. https://openai.com/index/introducing-4o-image-generation/

[8] Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." arXiv:2006.11239. https://arxiv.org/abs/2006.11239

[9] "The 9 Best AI Image Generation Models in 2026." Gradually.ai. https://www.gradually.ai/en/ai-image-models/

[10] "Adobe Firefly: The next evolution of creative AI is here." Adobe Blog, April 2025. https://blog.adobe.com/en/publish/2025/04/24/adobe-firefly-next-evolution-creative-ai-is-here

[11] Zhang, L. and Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models." arXiv:2302.05543. https://arxiv.org/abs/2302.05543

[12] Hu, E.J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685. https://arxiv.org/abs/2106.09685

[13] Ye, H., et al. (2023). "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." Tencent AI Lab. https://github.com/tencent-ailab/IP-Adapter

[14] Andersen et al. v. Stability AI et al. (2023). U.S. District Court, Northern District of California. https://www.saverilawfirm.com

[15] "Getty Images v Stability AI: English High Court Rejects Secondary Copyright Claim." Latham & Watkins, November 2025. https://www.lw.com/en/insights/getty-images-v-stability-ai-english-high-court-rejects-secondary-copyright-claim

[16] "The Ghiblification Controversy." The Science Survey, July 2025. https://thesciencesurvey.com/editorial/2025/07/03/the-ghiblification-controversy/

[10b] Google Developers Blog. "Announcing Imagen 4 Fast and the general availability of the Imagen 4 family in the Gemini API." 2026. https://developers.googleblog.com/announcing-imagen-4-fast-and-imagen-4-family-generally-available-in-the-gemini-api/

[17] Midjourney. "V8 Alpha." Midjourney Updates, March 2026. https://updates.midjourney.com/v8-alpha/

[18] "Introducing Gemini 2.5 Flash Image, our state-of-the-art image model." Google Developers Blog, August 26, 2025. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/

[19] Karras, T., Laine, S., and Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR 2019. arXiv:1812.04948. https://arxiv.org/abs/1812.04948

[20] "DALL-E 2." OpenAI, April 6, 2022. https://openai.com/index/dall-e-2/

[21] "Stable Diffusion Public Release." Stability AI, August 22, 2022. https://stability.ai/news/stable-diffusion-public-release

[22] "Hayao Miyazaki Calls AI Animation an Insult to Life Itself." IndieWire, December 2016. https://www.indiewire.com/features/general/hayao-miyazaki-artificial-intelligence-animation-insult-to-life-studio-ghibli-1201757617/

