Text-to-Image Models
Last reviewed
May 13, 2026
Sources
36 citations
Review status
Source-backed
Revision
v2 · 5,788 words
See also: Multimodal Models and Tasks
Text-to-image models are a class of generative artificial intelligence systems that synthesize an image from a natural-language description, often called a prompt. The task is a form of conditional generation: given a string of text, the model produces a 2D raster image whose visual content depicts what the text describes. Modern text-to-image systems sit at the intersection of computer vision, natural language processing, and generative AI, and they are among the most visible consumer applications of deep learning.
The field moved through three broad technical eras in roughly a decade. Early systems used recurrent networks with attention and produced 32x32 thumbnails of birds and flowers. A second wave used generative adversarial networks (GANs) to push resolution up to 256x256 on narrow datasets such as CUB and Oxford-102. The current era began in 2021 with the release of OpenAI's DALL-E and shifted decisively toward diffusion in 2022 with GLIDE, DALL-E 2, Imagen, and Stable Diffusion. In 2024 it consolidated around Diffusion Transformers (DiT) as the dominant backbone and around Rectified Flow and Flow Matching as the dominant training objectives, exemplified by Stable Diffusion 3, Flux.1, and Imagen 3, while OpenAI's gpt-image-1 brought autoregressive generation back to prominence in 2025.
A text-to-image model learns a conditional distribution p(image | text). At inference time the user supplies a prompt; the model samples from this distribution and returns one or more images. The pipeline almost always has three components: a text encoder that maps the prompt to a sequence of embeddings, a generator that produces image content conditioned on those embeddings, and (for latent diffusion systems) a decoder that maps latent representations back to pixel space. Text encoders are usually pretrained language models, most commonly CLIP from OpenAI (Radford et al., 2021), T5 (Raffel et al., 2020), or Gemma/Llama variants in more recent systems.
Models differ on where the image is generated. Pixel-space diffusion models such as Imagen and GLIDE iteratively denoise an image directly in pixel space, typically at a low base resolution followed by super-resolution stages. Latent diffusion compresses the image with a variational autoencoder first, then runs diffusion in the compressed latent space, which cuts compute by roughly an order of magnitude; this is the approach popularized by Stable Diffusion. Token-based models such as Parti and Muse quantize the image into discrete tokens and predict them autoregressively or with masked-token objectives, much as large language models predict text.
Text-to-image is closely related to image-to-image translation, text-to-video, and text-to-3D. Many image models double as image editors when paired with an inpainting mask or a reference image, and the same diffusion backbones power Sora, Stable Video Diffusion, and other text-to-video models.
The first widely cited text-to-image neural network was AlignDRAW, introduced by Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov in the 2016 ICLR paper "Generating Images from Captions with Attention" (arXiv:1511.02793, November 2015). AlignDRAW extended the DRAW recurrent attention model to take a caption as input. The samples were 32x32 or 64x64 and the model could compose objects it had never seen together, such as "a stop sign is flying in blue skies," but the outputs were closer to blurry suggestions than recognizable photographs.
Reed et al. followed in 2016 with "Generative Adversarial Text to Image Synthesis" (Reed, Akata, Yan, Logeswaran, Schiele, Lee, ICML 2016), which trained a conditional GAN on the Caltech-UCSD Birds (CUB) and Oxford-102 flower datasets and produced 64x64 images of birds matching written descriptions.
The most influential GAN-based system was StackGAN by Han Zhang and collaborators ("StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks," ICCV 2017). StackGAN used a two-stage pipeline that first generated a 64x64 sketch from the caption and then refined it to 256x256. StackGAN++ extended the approach with multiple tree-structured generators.
AttnGAN (Tao Xu et al., "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks," CVPR 2018) introduced word-level cross-attention between the caption and image feature maps, allowing the generator to focus on specific words when drawing specific regions. Other GAN-era systems included MirrorGAN (Qiao et al., CVPR 2019), which added a caption-regeneration consistency loss, DM-GAN (Zhu et al., CVPR 2019) with a dynamic memory module, and ControlGAN (Li et al., NeurIPS 2019). These systems achieved impressive results on narrow domains such as birds, flowers, or COCO captions, but failed to generalize to open-ended prompts and rarely exceeded 256x256.
The modern era opened in January 2021 with OpenAI's DALL-E (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, "Zero-Shot Text-to-Image Generation," arXiv:2102.12092). DALL-E used a discrete variational autoencoder (dVAE) to compress 256x256 images into a 32x32 grid of image tokens (1,024 per image), each drawn from a vocabulary of 8,192 codes, then trained a 12-billion-parameter autoregressive Transformer to predict the joint sequence of text tokens followed by image tokens. DALL-E demonstrated zero-shot composition such as an armchair shaped like an avocado, and its release sparked widespread public interest in text-to-image generation. Generated samples were re-ranked with CLIP.
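The token arithmetic behind the dVAE description can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; the 8-bit RGB input is an assumption made purely to illustrate the compression ratio.

```python
import math

# Figures quoted in the paragraph above.
image_side = 256     # input resolution in pixels per side
grid_side = 32       # dVAE token grid per side
vocab_size = 8192    # number of distinct image-token codes

tokens_per_image = grid_side * grid_side    # image tokens per picture
downsample = image_side // grid_side        # spatial downsampling factor
bits_per_token = math.log2(vocab_size)      # information carried by one token

# Assumption: 8-bit RGB input, used only to illustrate the compression ratio.
raw_bits = image_side * image_side * 3 * 8
token_bits = tokens_per_image * bits_per_token
compression = raw_bits / token_bits

print(tokens_per_image, downsample, bits_per_token, round(compression))
```

Under these assumptions each image becomes 1,024 tokens of 13 bits each, which is why a Transformer can treat the image as a sequence of manageable length.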
Google's Parti (Pathways Autoregressive Text-to-Image), introduced by Jiahui Yu and colleagues in "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation" (arXiv:2206.10789, June 2022), pushed autoregressive scaling to 20 billion parameters and showed clean scaling curves on text fidelity. Parti used a ViT-VQGAN image tokenizer.
Meta released Make-A-Scene in 2022 (Gafni et al., "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors," arXiv:2203.13131), which added an optional segmentation map conditioning input alongside the text prompt.
Muse, introduced by Huiwen Chang and collaborators at Google in "Muse: Text-To-Image Generation via Masked Generative Transformers" (arXiv:2301.00704, January 2023), replaced the autoregressive ordering with a masked-token objective borrowed from MaskGIT, making generation much faster than left-to-right autoregression while preserving the discrete-token architecture.
Diffusion models had been studied since Sohl-Dickstein et al. (2015) and Ho, Jain, Abbeel ("Denoising Diffusion Probabilistic Models," NeurIPS 2020), but they entered text-to-image with GLIDE (Alex Nichol et al., "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models," arXiv:2112.10741, December 2021). GLIDE compared CLIP guidance with classifier-free guidance (Ho and Salimans, first presented at a NeurIPS 2021 workshop and posted to arXiv in 2022) and showed that classifier-free guidance produced more photorealistic results.
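Classifier-free guidance can be sketched as a simple extrapolation between two noise predictions from the same network, one made with the prompt and one without it. The function name and the toy vectors below are illustrative assumptions, not code from GLIDE; the formula is the commonly used "unconditional plus scaled difference" form.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and text-conditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further in the direction of the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy three-element "noise predictions" (hypothetical values).
eps_uncond = [0.10, -0.20, 0.30]
eps_cond = [0.30, -0.10, 0.10]
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=3.0)
```

Larger scales generally trade sample diversity for closer adherence to the prompt, which is why the scale is usually exposed as a user-facing setting.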
DALL-E 2 (Ramesh, Dhariwal, Nichol, Chu, Chen, "Hierarchical Text-Conditional Image Generation with CLIP Latents," arXiv:2204.06125, April 2022) replaced DALL-E's autoregressive backbone with a two-stage diffusion model: a prior that mapped CLIP text embeddings to CLIP image embeddings, and a decoder that produced 64x64 images from those embeddings, plus two super-resolution stages to reach 1024x1024.
Google's Imagen (Chitwan Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding," arXiv:2205.11487, May 2022) showed that a frozen T5-XXL text encoder gave substantially better text alignment than CLIP, and that scaling the text encoder helped more than scaling the U-Net. Imagen also generated images in pixel space rather than latent space.
Latent Diffusion Models from Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer ("High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022, arXiv:2112.10752) introduced the latent-space approach that made Stable Diffusion practical on consumer GPUs. Stability AI released Stable Diffusion 1.4 in August 2022 under the Creative ML OpenRAIL-M license, and the open weights triggered an explosion of community fine-tunes, web UIs, and downstream applications. Stable Diffusion 1.5 followed in October 2022 with a slight quality bump from continued training.
Stable Diffusion 2.0 (November 2022) and 2.1 (December 2022) retrained the model with OpenCLIP ViT-H/14 and a new VAE, but lacked the celebrity and artist coverage of the original LAION-5B training set and never achieved the same community traction as 1.5.
Stable Diffusion XL (SDXL), introduced in the paper "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis" (Podell et al., July 2023), tripled the U-Net parameter count to 2.6 billion, paired two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and trained natively at 1024x1024.
Google released Imagen 2 in December 2023 as part of the Vertex AI Imagen API and the Bard image generation feature.
William Peebles and Saining Xie introduced the Diffusion Transformer ("Scalable Diffusion Models with Transformers," arXiv:2212.09748, December 2022). DiT replaced the U-Net backbone with a pure Transformer that operated on patchified noisy latents, similar to a Vision Transformer. The architecture scaled cleanly and underpinned OpenAI's Sora video model announced in February 2024.
Stability AI applied the DiT idea to Stable Diffusion 3, introduced in the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach (arXiv:2403.03206, March 2024). SD3 introduced MM-DiT (multimodal Diffusion Transformer), which used separate weight matrices for image and text tokens inside each Transformer block while letting attention flow between them. SD3 was trained with rectified flow rather than the standard DDPM objective, eliminating the explicit noise schedule. SD3 Medium (2 billion parameters) shipped in June 2024.
Black Forest Labs, founded by Robin Rombach, Patrick Esser, Andreas Blattmann, Dominik Lorenz, and other former core members of the Stable Diffusion team, released Flux.1 on August 1, 2024. The Flux.1 family used a 12-billion-parameter MM-DiT trained with rectified flow and shipped in three variants: Flux.1 Pro (closed API), Flux.1 Dev (open weights, non-commercial), and Flux.1 Schnell (4-step distilled, Apache 2.0). Flux.1.1 Pro followed in October 2024 with faster generation and the Ultra mode for 4-megapixel output. Flux.1 Kontext, focused on context-aware image editing, was announced in December 2024 and released in 2025.
Stable Diffusion 3.5, released in October 2024, kept the MM-DiT architecture and shipped Large (8B), Large Turbo, and Medium variants. Google released Imagen 3 in August 2024 (generally available December 2024).
Recraft V3 launched in October 2024 and briefly topped the Artificial Analysis text-to-image arena. Ideogram released V1 in August 2023, V2 in August 2024, and V3 in 2025, with a focus on accurate in-image text rendering. NVIDIA's Sana (Xie et al., 2024) introduced an efficient linear-attention DiT.
DeepSeek released Janus-Pro in January 2025, a unified multimodal model that handles text-to-image generation and image understanding in a single autoregressive backbone. HiDream-I1, released in May 2025 by HiDream AI, used a 17-billion-parameter sparse mixture-of-experts DiT and topped several leaderboards on release. Other 2024 to 2025 releases included Hunyuan-DiT from Tencent (May 2024, bilingual Chinese-English DiT), Kolors from Kuaishou (July 2024, bilingual DiT), Lumina-mGPT (Alpha-VLLM), and Stable Cascade (Stability AI, February 2024, based on the Wurstchen architecture).
On the closed-source side, Midjourney shipped v6 (December 2023), v6.1 (July 2024), and v7 (March 2025). OpenAI's ChatGPT image generation, branded internally as gpt-image-1, launched on March 25, 2025 as an autoregressive image model integrated into GPT-4o; it was notable for handling long, structured prompts and producing legible text. Reve Image launched in 2024 as another autoregressive contender.
| Year | Model | Organization | Notes |
|---|---|---|---|
| 2015 | AlignDRAW | University of Toronto | First neural caption-to-image attention model |
| 2016 | Reed et al. GAN | Univ. Michigan, MPI | First conditional GAN on captions |
| 2017 | StackGAN | Rutgers, Lehigh, Microsoft | Two-stage GAN to 256x256 |
| 2018 | AttnGAN | Microsoft Research | Word-level cross-attention |
| Jan 2021 | DALL-E | OpenAI | 12B autoregressive Transformer with dVAE |
| Dec 2021 | GLIDE | OpenAI | Classifier-free guidance for diffusion |
| Apr 2022 | DALL-E 2 | OpenAI | CLIP-latent diffusion, 1024x1024 |
| May 2022 | Imagen | Google Research | T5-XXL text encoder, pixel-space diffusion |
| Jun 2022 | Parti | Google Research | 20B autoregressive Transformer |
| Aug 2022 | Stable Diffusion 1.4 | Stability AI, LMU Munich | First open-weights latent diffusion |
| Nov 2022 | Stable Diffusion 2.0 | Stability AI | OpenCLIP ViT-H/14 |
| Jul 2023 | SDXL | Stability AI | 2.6B U-Net, 1024x1024 native |
| Dec 2023 | Midjourney v6 | Midjourney | Major quality jump |
| Dec 2023 | Imagen 2 | Google | Vertex AI release |
| Feb 2024 | Stable Cascade | Stability AI | Wurstchen-style cascaded latent diffusion |
| Feb 2024 | Sora (DiT) | OpenAI | DiT applied to text-to-video |
| Mar 2024 | Stable Diffusion 3 paper | Stability AI | MM-DiT plus rectified flow |
| May 2024 | Hunyuan-DiT | Tencent | Bilingual DiT |
| Jun 2024 | SD3 Medium | Stability AI | 2B open weights |
| Jul 2024 | Midjourney v6.1 | Midjourney | Improved detail and coherence |
| Jul 2024 | Kolors | Kuaishou | Bilingual DiT |
| Aug 2024 | Flux.1 | Black Forest Labs | 12B MM-DiT, Schnell distilled |
| Aug 2024 | Imagen 3 | Google DeepMind | T5-XXL plus diffusion transformer |
| Aug 2024 | Ideogram 2.0 | Ideogram | Strong in-image typography |
| Oct 2024 | Recraft V3 | Recraft | Briefly top of Artificial Analysis arena |
| Oct 2024 | SD 3.5 Large | Stability AI | 8B MM-DiT |
| Oct 2024 | Flux.1.1 Pro | Black Forest Labs | Ultra 4MP mode |
| Jan 2025 | Janus-Pro | DeepSeek | Unified autoregressive understanding plus generation |
| Mar 2025 | Midjourney v7 | Midjourney | New personalization defaults |
| Mar 2025 | gpt-image-1 | OpenAI | Autoregressive image generation in ChatGPT |
| May 2025 | HiDream-I1 | HiDream AI | 17B sparse MoE DiT |
| Model | First release | Organization | Backbone | Open weights |
|---|---|---|---|---|
| DALL-E | Jan 2021 | OpenAI | Autoregressive Transformer plus dVAE | No |
| GLIDE | Dec 2021 | OpenAI | U-Net pixel diffusion | Partial (filtered 300M) |
| DALL-E 2 | Apr 2022 | OpenAI | U-Net diffusion with CLIP latents | No |
| Imagen | May 2022 | Google Research | U-Net pixel diffusion, T5-XXL | No |
| Parti | Jun 2022 | Google Research | Autoregressive Transformer (ViT-VQGAN) | No |
| Stable Diffusion 1.5 | Oct 2022 | Stability AI plus RunwayML | U-Net latent diffusion | Yes (CreativeML OpenRAIL-M) |
| SDXL | Jul 2023 | Stability AI | U-Net latent diffusion, dual text encoder | Yes |
| DALL-E 3 | Oct 2023 | OpenAI | Diffusion plus GPT prompt rewriter | No |
| Midjourney v6 | Dec 2023 | Midjourney | Closed (rumored DiT) | No |
| Imagen 2 | Dec 2023 | Google | Pixel diffusion | No |
| Stable Cascade | Feb 2024 | Stability AI | Wurstchen v3 cascade | Yes |
| Hunyuan-DiT | May 2024 | Tencent | MM-DiT (1.5B), bilingual | Yes |
| SD3 Medium | Jun 2024 | Stability AI | MM-DiT plus rectified flow | Yes |
| Flux.1 Dev | Aug 2024 | Black Forest Labs | MM-DiT plus rectified flow (12B) | Yes (non-commercial) |
| Flux.1 Schnell | Aug 2024 | Black Forest Labs | 4-step distilled MM-DiT | Yes (Apache 2.0) |
| Imagen 3 | Aug 2024 | Google DeepMind | Diffusion plus T5 | No |
| Recraft V3 | Oct 2024 | Recraft | Closed | No |
| SD 3.5 Large | Oct 2024 | Stability AI | MM-DiT (8B) | Yes |
| Janus-Pro | Jan 2025 | DeepSeek | Unified autoregressive | Yes |
| gpt-image-1 | Mar 2025 | OpenAI | Autoregressive | No |
| HiDream-I1 | May 2025 | HiDream AI | Sparse MoE DiT (17B) | Yes |
The U-Net was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 for biomedical segmentation. Ho, Jain, and Abbeel adapted it as the standard denoising backbone for DDPMs in 2020. A diffusion U-Net takes a noisy latent (or pixel) image plus a timestep embedding and a text conditioning vector, and predicts the noise. The U-Net has down-sampling, middle, and up-sampling blocks with skip connections, and text conditioning enters through cross-attention layers inserted at multiple resolutions. Stable Diffusion 1.x, 2.x, and SDXL all use U-Net backbones, as did GLIDE, DALL-E 2, Imagen, and Imagen 2.
Latent diffusion (Rombach et al., 2022) compresses the image with a VAE before running diffusion. For Stable Diffusion the VAE has an 8x downsampling factor, so a 512x512 image becomes a 64x64x4 latent. Diffusion runs on the latent, and the VAE decoder maps the final latent back to pixels. This cut compute roughly tenfold compared with pixel-space diffusion and was the key engineering insight that made high-resolution open-weights image generation feasible on a single consumer GPU.
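The shape arithmetic in this paragraph is easy to reproduce. The helper below is a minimal sketch (the function name is hypothetical) that applies the 8x downsampling factor and 4 latent channels quoted above; it counts values only and does not model actual compute cost.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return the (H, W, C) shape of the VAE latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)


pixel_values = 512 * 512 * 3            # RGB values in the input image
h, w, c = latent_shape(512, 512)
latent_values = h * w * c               # values the diffusion model works on
reduction = pixel_values / latent_values

print((h, w, c), reduction)
```

The diffusion model therefore operates on roughly 48 times fewer values than a pixel-space model at the same resolution; the realized compute saving depends on the architecture and is smaller than this raw ratio.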
William Peebles and Saining Xie's Diffusion Transformer ("Scalable Diffusion Models with Transformers," arXiv:2212.09748) replaced the U-Net with a pure Transformer. The noisy latent is split into patches (similar to ViT), each patch becomes a token, and the Transformer processes the full sequence with self-attention. Conditioning enters through adaptive layer norm (adaLN) modulated by timestep and class or text embeddings. DiT showed clean scaling: bigger models, more compute, and lower FID, with no architectural specialization for images beyond the patchification. DiT-XL/2 became the canonical reference model.
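Patchification can be sketched in plain Python. The example assumes a 32x32x4 latent (what an 8x VAE would give for a 256x256 image) and a patch size of 2, as the "/2" in DiT-XL/2 suggests; the nested-list representation and function name are illustrative, and a real implementation would use tensor reshapes and add a learned linear projection plus position embeddings.

```python
def patchify(latent, patch_size):
    """Split an H x W x C latent (nested lists) into flattened patch tokens."""
    height, width = len(latent), len(latent[0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])   # append the C channel values
            tokens.append(token)
    return tokens


# Assumed latent size: 32 x 32 spatial positions with 4 channels each.
latent = [[[0.0] * 4 for _ in range(32)] for _ in range(32)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))
```

Halving the patch size quadruples the sequence length, so patch size is one of the main levers trading compute against quality in this family of models.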
DiT was adopted by OpenAI's Sora (announced February 2024) and influenced the move to pure Transformer backbones in image generation. PixArt-Alpha (Chen et al., 2023) was the first widely-used open-source DiT for text-to-image.
MM-DiT, introduced in the Stable Diffusion 3 paper (Esser et al., 2024), keeps separate weight matrices for image tokens and text tokens within each Transformer block, but lets self-attention operate over the concatenated sequence. This avoids the cross-attention bottleneck of U-Net designs and gives text tokens equal status with image tokens. Both SD3 and Flux.1 use MM-DiT.
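The core idea, modality-specific projections with attention over the joint sequence, can be illustrated with a deliberately tiny single-head sketch. Everything here is an assumption made for illustration (function names, the dict layout of the weights, the 2-dimensional toy tokens); a real MM-DiT block also includes multi-head attention, adaLN modulation, output projections, and per-modality MLPs.

```python
import math


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Single-head attention over concatenated text and image tokens.

    text_w and image_w hold separate "q", "k", "v" matrices, so each modality
    is projected with its own weights before attending over the joint sequence.
    """
    projected = []
    for tokens, w in ((text_tokens, text_w), (image_tokens, image_w)):
        for tok in tokens:
            projected.append((matvec(w["q"], tok), matvec(w["k"], tok), matvec(w["v"], tok)))
    dim = len(projected[0][0])
    outputs = []
    for q, _, _ in projected:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for _, k, _ in projected]
        weights = softmax(scores)
        out = [sum(wt * v[i] for wt, (_, _, v) in zip(weights, projected)) for i in range(dim)]
        outputs.append(out)
    n_text = len(text_tokens)
    return outputs[:n_text], outputs[n_text:]


# Toy weights: identity for text, a scaled identity for image tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_w = {"q": identity, "k": identity, "v": identity}
image_w = {"q": doubled, "k": doubled, "v": doubled}
text_out, image_out = joint_attention([[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]], text_w, image_w)
```

Because every token attends to every other token, text representations are updated by image content as well as the reverse, which is the contrast with one-directional cross-attention drawn in the paragraph above.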
Autoregressive image models tokenize the image with a VQ-VAE, VQ-GAN, or similar discrete autoencoder, then predict the image-token sequence left-to-right after the text-token sequence. DALL-E (Jan 2021) was the first large autoregressive text-to-image model. Parti scaled this approach to 20 billion parameters. Muse swapped the autoregressive objective for masked-token prediction, which allowed parallel decoding. Janus-Pro and gpt-image-1 returned to autoregressive image generation in 2025 but with multimodal Transformers that share a single backbone with the text model.
Flow Matching was introduced by Yaron Lipman, Ricky Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le in "Flow Matching for Generative Modeling" (arXiv:2210.02747, October 2022). Rather than learning a denoising step at each noise level, flow matching learns a vector field that transports a simple base distribution (such as Gaussian noise) to the data distribution along straight or near-straight paths. This is a simulation-free training objective that can be trained with a regression loss.
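The regression target can be made concrete for the simplest case of a straight path between a noise sample and a data sample. The sketch below assumes the convention that t = 0 is noise and t = 1 is data (some papers use the reverse) and uses hypothetical function names; it builds one training pair and the squared-error loss a model would minimize on it.

```python
import random


def flow_matching_pair(noise, data, t):
    """Return the point on the straight path at time t and its target velocity.

    x_t = (1 - t) * noise + t * data, so d x_t / dt = data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def regression_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
data = [1.0, -2.0, 0.5]                          # stand-in for an image latent
noise = [rng.gauss(0.0, 1.0) for _ in data]      # sample from the base distribution
t = rng.random()                                 # random time in [0, 1)
x_t, target = flow_matching_pair(noise, data, t)

# A model v(x_t, t, text) would be trained to make this loss small.
loss_if_perfect = regression_loss(target, target)
```

No ODE has to be simulated during training: each step only needs a noise sample, a data sample, and a random time, which is what "simulation-free" refers to.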
Rectified Flow (Liu, Gong, Liu, ICLR 2023, "Flow Straight and Fast") is a specific flow-matching variant that explicitly straightens the trajectories. The 2024 SD3 paper showed that rectified flow on Transformer backbones beat standard DDPM on text-to-image tasks at scale, and the result is now the default training objective for SD3, SD 3.5, and Flux.1.
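Straight trajectories matter at sampling time because they can be integrated with very few steps. The sketch below is an idealized illustration: the velocity function is a hypothetical oracle that is perfectly constant along the path, which a trained network only approximates, and the names are not from any library.

```python
def euler_sample(noise, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical oracle: a perfectly straight flow from `noise` to `target`.
target = [1.0, -2.0]
noise = [0.3, 0.7]
constant_velocity = [d - n for n, d in zip(noise, target)]


def oracle(x, t):
    return constant_velocity


one_step = euler_sample(noise, oracle, steps=1)
four_step = euler_sample(noise, oracle, steps=4)
```

With a perfectly straight path, one Euler step lands on the same point as four; learned flows are only approximately straight, so practical samplers still use several steps, and few-step variants typically rely on additional distillation.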
Measuring text-to-image quality is harder than measuring classifier accuracy because the output is a continuous image whose quality depends on aesthetics, prompt alignment, factual correctness, and absence of artifacts.
| Benchmark | Year | Measures | Notes |
|---|---|---|---|
| FID | 2017 | Distributional distance | Fréchet Inception Distance: the Fréchet distance between Inception-v3 feature statistics of real and generated images |
| CLIPScore | 2021 | Prompt alignment | Cosine similarity between CLIP text and image embeddings |
| Inception Score (IS) | 2016 | Diversity and class confidence | Now considered weak |
| DrawBench | 2022 | Prompt alignment | 200 hand-crafted prompts from the Imagen team |
| PartiPrompts (P2) | 2022 | Prompt alignment | 1,632 prompts spanning 11 categories from the Parti team |
| HPS | 2023 | Human preference | Human Preference Score, Wu et al., trained on 98K human-labeled image pairs |
| HPSv2 | 2023 | Human preference | Wu, Sun, Wang, Liu, Tao, Liu, Liu, Tang, ICCV 2023, 98K to 798K labeled pairs |
| ImageReward | 2023 | Human preference | Xu, Liu, Wu et al., NeurIPS 2023, BLIP-based reward model trained on 137K human comparisons |
| PickScore | 2023 | Human preference | Kirstain et al., trained on Pick-a-Pic dataset of 500K user preferences |
| GenEval | 2023 | Compositional skills | Ghosh, Hajishirzi, Schmidt, NeurIPS 2023, object grounding, counting, color, position, attribute binding |
| T2I-CompBench | 2023 | Compositional skills | Huang et al., NeurIPS 2023, attribute binding, object relationships, complex compositions |
| DPG-Bench | 2024 | Dense prompt fidelity | ELLA paper, evaluates long, paragraph-length prompts |
| PaintSkills | 2022 | Visual reasoning | Cho, Zala, Bansal, EMNLP 2022, object, count, spatial, attribute skills |
| Artificial Analysis Image Arena | 2024 | Human Elo | Pairwise crowdsourced votes across many models |
| LMSys Image Arena | 2024 | Human Elo | Side-by-side blind voting |
| Hugging Face Open T2I Arena | 2024 | Human Elo | Community-run arena focused on open-weights models |
FID remains the most cited automated metric, but it is sensitive to dataset choice (typically MS-COCO 30K for text-to-image FID) and does not correlate well with human preference at the top of the leaderboard. Most newer papers report a combination of CLIPScore for prompt alignment, GenEval for compositional skills, and a human-preference score such as HPSv2 or PickScore, alongside Elo from a public arena.
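The two automated metrics mentioned here reduce to short formulas once embeddings or feature statistics are available. The sketch below is a simplified illustration with hypothetical function names: the Fréchet computation assumes diagonal covariances (real FID uses full covariance matrices and a matrix square root), and the default weight of 2.5 with clipping at zero is taken to follow the original CLIPScore formulation and should be treated as an assumption.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(text_embedding, image_embedding, weight=2.5):
    """CLIPScore-style value: rescaled cosine similarity, clipped at zero."""
    return weight * max(cosine_similarity(text_embedding, image_embedding), 0.0)


def frechet_distance_diagonal(mean_a, var_a, mean_b, var_b):
    """Squared Fréchet distance between Gaussians with diagonal covariances.

    Simplification of ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)),
    which for diagonal S reduces to a sum of (sigma_a - sigma_b)^2 terms.
    """
    mean_term = sum((m - n) ** 2 for m, n in zip(mean_a, mean_b))
    cov_term = sum((math.sqrt(v) - math.sqrt(w)) ** 2 for v, w in zip(var_a, var_b))
    return mean_term + cov_term


aligned = clip_score([1.0, 0.0], [1.0, 0.0])
unrelated = clip_score([1.0, 0.0], [0.0, 1.0])
same_stats = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diagonal([0.0], [1.0], [3.0], [4.0])
```

In practice the embeddings come from a CLIP model and the statistics from Inception-v3 features over thousands of images, so the numbers are only comparable when the reference set and feature extractor are held fixed.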
Hugging Face hosts several public leaderboards: the Open Text-to-Image Leaderboard (now archived), the GenAI-Arena, and the Open Image Arena. Artificial Analysis runs its own Image Arena with frequent reranking; HiDream-I1, Flux.1.1 Pro, Imagen 3, and Recraft V3 have each held the top human-Elo slot at various points between 2024 and 2025.
The release of Stable Diffusion's weights in August 2022 spawned an ecosystem of open-source tools that turned text-to-image generation into a hobbyist and small-team activity rather than something gated behind cloud APIs.
Hugging Face Diffusers is the standard Python library for running text-to-image models. It provides a unified API for SD 1.x, SD 2.x, SDXL, SD3, Flux, Stable Cascade, Kandinsky, PixArt, and dozens of community fine-tunes, along with implementations of all common schedulers (DDIM, DPM-Solver, Euler, UniPC, LCM) and pipelines for inpainting, ControlNet, IP-Adapter, and image-to-image.
ONNX Runtime, TensorRT, and Apple Core ML provide hardware-accelerated inference paths for production deployment. Stability AI ships SDXL in Core ML format for on-device generation on Apple Silicon. Flux.1 Schnell is distilled to sample in 4 steps, which makes it one of the fastest open-weights models whose output quality is competitive with closed APIs.
ComfyUI, created by comfyanonymous in early 2023, presents a node-graph editor for diffusion pipelines and is the most flexible of the open frontends. It supports custom nodes, complex pipelines, multi-pass workflows, and ControlNet stacking, and it has become the de facto reference UI for non-commercial Flux deployments.
Automatic1111's stable-diffusion-webui (released August 2022) was the first widely-adopted web frontend and remains popular for its plugin ecosystem and one-click installers. Forge, a fork of A1111 by lllyasviel (the author of ControlNet), focuses on memory efficiency and faster generation. InvokeAI ships a polished interface aimed at professional artists. Fooocus simplifies the UI down to a single prompt box with sensible defaults. Pinokio packages many of these tools as one-click installable apps.
Civitai (launched late 2022) hosts community fine-tunes and LoRAs for Stable Diffusion variants, with millions of downloads per month at peak.
Text prompts alone are a coarse interface. Several extensions add fine-grained spatial, identity, or style conditioning on top of an existing text-to-image model without retraining the base weights.
ControlNet was introduced by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala in "Adding Conditional Control to Text-to-Image Diffusion Models" (arXiv:2302.05543, February 2023). ControlNet trains a copy of the encoder half of the diffusion U-Net to accept an additional conditioning image (Canny edges, depth maps, OpenPose skeletons, segmentation masks, scribbles, normals, line art, MLSD lines, and others). The trainable copy is connected to the frozen base model with zero-initialized convolutions, so the addition starts as a no-op and learns the spatial control without disturbing the base weights. ControlNet enabled accurate pose and layout control and was a key building block for the SD-based animation and design workflows that followed.
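The effect of the zero-initialized convolutions can be shown on a single pixel's feature vector. This is a toy sketch with hypothetical names: the "blocks" are simple stand-in functions rather than real U-Net layers, and the 1x1 convolution is written as a per-pixel linear map.

```python
def conv1x1(weight, bias, features):
    """Apply a 1x1 convolution to one pixel's feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weight, bias)]


def controlled_block(features, condition, frozen_block, trainable_copy, zero_in, zero_out):
    """Frozen block output plus a control branch gated by zero convolutions."""
    base = frozen_block(features)
    injected = [f + c for f, c in zip(features, conv1x1(*zero_in, condition))]
    control = conv1x1(*zero_out, trainable_copy(injected))
    return [b + c for b, c in zip(base, control)]


def frozen_block(x):
    # Hypothetical stand-in for a frozen U-Net encoder block.
    return [2.0 * v + 1.0 for v in x]


trainable_copy = frozen_block   # the copy starts from the same weights

channels = 3
zeros = ([[0.0] * channels for _ in range(channels)], [0.0] * channels)
features = [0.5, -1.0, 2.0]
condition = [9.0, 9.0, 9.0]     # e.g. edge-map features at this pixel
output = controlled_block(features, condition, frozen_block, trainable_copy, zeros, zeros)
```

At initialization the control branch contributes exactly zero, so `output` equals the frozen block's output; training then moves the zero convolutions away from zero and the conditioning image gradually takes effect.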
T2I-Adapter (Mou et al., 2023) is a lightweight alternative to ControlNet with similar functionality but smaller adapter modules.
LoRA (Low-Rank Adaptation), originally proposed by Hu et al. for language models in 2021 ("LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685), became the standard way to fine-tune diffusion models on small datasets. A LoRA inserts low-rank trainable matrices into the attention layers of the U-Net or DiT (including cross-attention layers where the model has them), so a fine-tune for a new character, style, or concept stores only tens of megabytes rather than the full multi-gigabyte checkpoint. Civitai and Hugging Face host tens of thousands of public LoRAs for Stable Diffusion variants and Flux.
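The storage saving follows directly from the low-rank factorization. The sketch below uses hypothetical function names and an illustrative 768x768 projection with rank 8; actual layer sizes and ranks vary by model and by fine-tune.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters in a full weight matrix versus its low-rank update B @ A."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)
    return full, low_rank


def lora_forward(weight, lora_a, lora_b, x, scale=1.0):
    """Compute (W + scale * B A) x without materializing the merged matrix."""
    base = [sum(w * v for w, v in zip(row, x)) for row in weight]
    down = [sum(a * v for a, v in zip(row, x)) for row in lora_a]    # rank-sized vector
    up = [sum(b * d for b, d in zip(row, down)) for row in lora_b]   # back to d_out
    return [b + scale * u for b, u in zip(base, up)]


# Illustrative sizes (assumptions, not taken from a specific model).
full, low_rank = lora_param_counts(d_out=768, d_in=768, rank=8)

# Tiny worked example: 2x2 identity base weight with a rank-1 update.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]        # shape (rank, d_in)
lora_b = [[1.0], [0.0]]      # shape (d_out, rank)
adapted = lora_forward(weight, lora_a, lora_b, [1.0, 2.0])
```

For this layer the update holds 12,288 parameters against 589,824 in the full matrix, a 48-fold reduction; summed over all adapted layers, this is why LoRA files are measured in megabytes. Initializing B to zero makes the adapted layer start out identical to the base layer.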
IP-Adapter (Ye et al., "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models," arXiv:2308.06721, August 2023) lets a user supply a reference image as a second conditioning signal alongside the text prompt, allowing identity, style, or composition transfer with a single forward pass. The InstantID and PuLID variants specialize IP-Adapter for face identity preservation.
AnimateDiff (Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning," arXiv:2307.04725, July 2023) attaches a temporal motion module to a frozen SD checkpoint, turning any image checkpoint into a short-clip video model. It paved the way for many of the open-source text-to-video and image-to-video pipelines built on Stable Diffusion and SDXL.
DreamBooth (Ruiz et al., 2022) is an earlier fine-tuning method, mostly superseded by LoRA, that personalized a diffusion model to a specific subject from 3 to 5 example images.
In 2025 the major closed-source APIs price image generation roughly as follows (per image, standard resolution):
| Service | Standard price | Notes |
|---|---|---|
| OpenAI gpt-image-1 (low) | about 1.1 cents | 1024x1024 low quality |
| OpenAI gpt-image-1 (medium) | about 4.2 cents | Default consumer tier |
| OpenAI gpt-image-1 (high) | about 16.7 cents | Best text rendering |
| Midjourney Basic | about 5.0 cents | 200 images per month at 10 USD |
| Stability AI SD3 Large API | about 6.5 cents | Pay-per-credit |
| Black Forest Labs Flux.1.1 Pro | about 4.0 cents | Replicate or fal.ai |
| Black Forest Labs Flux.1.1 Ultra | about 6.0 cents | 4 megapixel |
| Google Imagen 3 | about 3.0 cents | Vertex AI |
| Ideogram V3 | about 8.0 cents | High in-image text quality |
| Recraft V3 | about 4.0 cents | Vector and brand presets |
Open-weights models such as Flux.1 Schnell or SDXL on a self-hosted GPU cost only the electricity and amortized hardware, often below 0.1 cents per image once a workflow is set up.
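A rough way to sanity-check a self-hosted figure is to add energy and amortized hardware cost per hour and divide by throughput. Every number in the example call is a hypothetical placeholder (generation time, power draw, electricity price, hardware price, and lifetime), and the function name is illustrative; the estimate also assumes the GPU is kept busy, which hobbyist setups rarely achieve.

```python
def cost_per_image_cents(seconds_per_image, gpu_watts, usd_per_kwh, hardware_usd, lifetime_hours):
    """Estimate the cost of one image in US cents on a fully utilized GPU."""
    energy_usd_per_hour = (gpu_watts / 1000.0) * usd_per_kwh
    hardware_usd_per_hour = hardware_usd / lifetime_hours
    usd_per_image = (energy_usd_per_hour + hardware_usd_per_hour) * seconds_per_image / 3600.0
    return usd_per_image * 100.0


# All inputs below are hypothetical placeholders, not measured values.
estimate = cost_per_image_cents(
    seconds_per_image=3.0,
    gpu_watts=350.0,
    usd_per_kwh=0.15,
    hardware_usd=1600.0,
    lifetime_hours=10000.0,
)
print(f"{estimate:.4f} cents per image")
```

With these placeholder inputs the estimate comes out near 0.02 cents per image; lower utilization or slower models raise it proportionally.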
Text-to-image systems are deployed across consumer creativity tools, marketing and advertising production, concept art and pre-visualization, stock-photo replacement, e-commerce product imagery, packaging design, game-asset generation, educational illustration, and personal photo editing. Adobe Firefly integrates an in-house diffusion model into Photoshop's Generative Fill. Canva, Figma, and Microsoft Designer use commercial APIs for text-to-image inside their editors. Game studios use SD or Flux fine-tunes for environment and character ideation. Architecture and product-design teams use ControlNet plus a sketch or 3D render as the conditioning input.
Separately, text-to-image is the conditioning backbone for many text-to-video, text-to-3D, and personalization systems, and synthetic images from these models are used to augment training data for vision classifiers and detection models.
Most large text-to-image models were trained on web-scraped datasets such as LAION-5B (Schuhmann et al., 2022), which contains 5.85 billion image-text pairs scraped from Common Crawl. Several artists filed lawsuits against Stability AI, Midjourney, and DeviantArt in January 2023, arguing that training on their work without consent violated copyright. The UK Getty Images v. Stability AI case began in 2023 and proceeded to trial in 2025. The US Copyright Office has issued several decisions (including Zarya of the Dawn in February 2023 and the Allen v. Perlmutter ruling in 2025) holding that outputs of generative AI without sufficient human creative input are not copyrightable.
LAION-5B was temporarily taken offline in December 2023 after a Stanford Internet Observatory report by David Thiel identified child sexual abuse material in the dataset. A re-released version with the offending content removed was published in mid 2024.
The ability to generate photorealistic images of real people from a prompt has produced documented harms in the form of non-consensual intimate imagery and political deepfakes. The January 2024 incident in which non-consensual deepfake images of Taylor Swift spread on X led to platform-level changes and renewed legislative attention, including the US NO FAKES and DEFIANCE Acts. Many commercial models block named celebrities and apply automated nudity filters; most open-weights models can be fine-tuned to bypass those filters.
The concerns documented in Buolamwini and Gebru's earlier work on facial-recognition bias ("Gender Shades," 2018) carry over to image generation: text-to-image systems trained on web data inherit demographic, occupational, and stereotype biases from the training set. Bloomberg's 2023 analysis of Stable Diffusion outputs by Leonardo Nicoletti and Dina Bass found that prompts for "CEO" or "doctor" disproportionately produced light-skinned men, while "social worker" or "fast-food worker" leaned the other way. Google's 2024 Gemini image-generation feature briefly produced historically inaccurate outputs (such as racially diverse Nazi soldiers) after a fairness adjustment, and Google suspended the feature for several weeks.
Training a large text-to-image model takes tens to hundreds of thousands of GPU-hours; Stable Diffusion 1.5 used roughly 150,000 A100-hours according to the model card. Inference at scale is also non-trivial: a 2023 paper by Luccioni, Jernite, and Strubell estimated that generating 1,000 images with a large model uses about as much energy as fully charging a smartphone several times.