Text-to-Image Models

Updated Jun 21, 2026

See also: Multimodal Models and Tasks

Text-to-image models are generative artificial intelligence systems that synthesize a new image from a natural-language description, called a prompt. Given a string of text, the model produces a 2D raster image whose visual content depicts what the text describes, a form of conditional generation that learns the distribution p(image | text). The first widely used system was OpenAI's DALL-E in January 2021; by 2026 the leading models include OpenAI's GPT Image line, Google's Gemini image models (marketed as "Nano Banana" and "Nano Banana Pro"), Midjourney, Google Imagen 4, the Flux family, Stable Diffusion, and Alibaba's Qwen-Image. Modern text-to-image systems sit at the intersection of computer vision, natural language processing, and generative AI, and they are among the most visible consumer applications of deep learning. When OpenAI integrated image generation natively into GPT-4o in March 2025, ChatGPT added more than one million users in under an hour, a sign-up rate that company CEO Sam Altman said outpaced the original 2022 ChatGPT launch.^[1]^[2]

The field moved through four broad technical eras in roughly a decade. Early systems used recurrent networks with attention and produced blurry 32x32 thumbnails. A second wave used generative adversarial networks (GANs) to push resolution up to 256x256 on narrow datasets such as CUB and Oxford-102. The third era began in 2021 with the release of OpenAI's DALL-E, shifted decisively toward the diffusion model between late 2021 and 2022 with GLIDE, DALL-E 2, Imagen, and Stable Diffusion, and consolidated in 2024 around Diffusion Transformers (DiT), Rectified Flow, and Flow Matching as the dominant design choices, exemplified by Stable Diffusion 3, Flux.1, and Imagen 3. In 2025 and 2026 the frontier shifted again toward native multimodal generation, in which a single autoregressive or hybrid model produces both language and image output; examples include OpenAI's gpt-image-1 and its GPT Image successors, Google's Gemini image models, and Alibaba's Qwen-Image.

The most widely used systems as of 2026 include OpenAI's GPT Image line (the successor to DALL-E 3, which OpenAI deprecated in 2026), Midjourney, Google's Imagen 4 and Gemini image models, Stable Diffusion and its open derivatives, the Flux family from Black Forest Labs, Adobe Firefly, Ideogram, Recraft, and Alibaba's Qwen-Image. This article catalogs notable text-to-image models grouped by family and era, alongside the architectures, benchmarks, and ecosystem that surround them.

What is a text-to-image model? (Overview)

A text-to-image model learns a conditional distribution p(image | text). At inference time the user supplies a prompt; the model samples from this distribution and returns one or more images. The pipeline almost always has three components: a text encoder that maps the prompt to a sequence of embeddings, a generator that produces image content conditioned on those embeddings, and (for latent diffusion systems) a decoder that maps latent representations back to pixel space. Text encoders are usually pretrained language models, most commonly CLIP from OpenAI (Radford et al., 2021), T5 (Raffel et al., 2020), or Gemma/Llama variants in more recent systems.

OpenAI framed the 2025 shift to native generation as a long-held design goal, writing in its 4o image generation announcement: "At OpenAI, we have long believed image generation should be a primary capability of our language models. That's why we've built our most advanced image generator yet into GPT-4o."^[1]

Models differ in where the image is generated. Pixel-space diffusion models such as Imagen and GLIDE iteratively denoise a noisy image directly in pixel space, usually at a low base resolution followed by super-resolution stages. Latent diffusion first compresses the image with a variational autoencoder, then runs diffusion in the compressed latent space, which cuts compute by roughly an order of magnitude; Stable Diffusion popularized this approach. Token-based models such as Parti and Muse quantize the image into discrete tokens and predict them autoregressively or with a masked-token objective, much as language models predict text tokens.

Text-to-image is closely related to image-to-image translation, text-to-video, and text-to-3D. Many image models double as image editors when paired with an inpainting mask or a reference image, and the same diffusion backbones power Sora, Stable Video Diffusion, and other text-to-video models.

History: how did text-to-image models develop?

Early attention-based generation (2015 to 2016)

The first widely cited text-to-image neural network was AlignDRAW, introduced by Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov in the 2016 ICLR paper "Generating Images from Captions with Attention" (arXiv:1511.02793, November 2015). AlignDRAW extended the DRAW recurrent attention model to take a caption as input. The samples were 32x32 or 64x64 and the model could compose objects it had never seen together, such as "a stop sign is flying in blue skies," but the outputs were closer to blurry suggestions than recognizable photographs.[^align]

Reed et al. followed in 2016 with "Generative Adversarial Text to Image Synthesis" (Reed, Akata, Yan, Logeswaran, Schiele, Lee, ICML 2016), which trained a conditional GAN on the Caltech-UCSD Birds (CUB) and Oxford-102 flower datasets and produced 64x64 images of birds matching written descriptions.[^reedgan]

GAN era (2017 to 2020)

The most influential GAN-based system was StackGAN by Han Zhang and collaborators ("StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks," ICCV 2017). StackGAN used a two-stage pipeline that first generated a 64x64 sketch from the caption and then refined it to 256x256. StackGAN++ extended the approach with multiple tree-structured generators.[^stackgan]

AttnGAN (Tao Xu et al., "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks," CVPR 2018) introduced word-level cross-attention between the caption and image feature maps, allowing the generator to focus on specific words when drawing specific regions.[^attngan] Other GAN-era systems included MirrorGAN (Qiao et al., CVPR 2019), which added a caption-regeneration consistency loss, DM-GAN (Zhu et al., CVPR 2019) with a dynamic memory module, and ControlGAN (Li et al., NeurIPS 2019). These systems achieved impressive results on narrow domains such as birds, flowers, or COCO captions, but failed to generalize to open-ended prompts and rarely exceeded 256x256.

Autoregressive era (2021 to 2022)

The modern era opened in January 2021 with OpenAI's DALL-E (Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, "Zero-Shot Text-to-Image Generation," arXiv:2102.12092). DALL-E used a discrete variational autoencoder (dVAE) to compress each 256x256 image into a 32x32 grid of 1,024 tokens drawn from a codebook of 8,192 entries, then trained a 12-billion-parameter autoregressive Transformer to predict the joint sequence of text tokens followed by image tokens. DALL-E demonstrated zero-shot composition such as an armchair shaped like an avocado, and its release sparked widespread public interest in text-to-image generation. CLIP was used to rerank the sampled candidates.[^dalle]

Google's Parti (Pathways Autoregressive Text-to-Image), introduced by Jiahui Yu and colleagues in "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation" (arXiv:2206.10789, June 2022), pushed autoregressive scaling to 20 billion parameters and showed clean scaling curves on text fidelity. Parti used a ViT-VQGAN image tokenizer.[^parti]

Meta released Make-A-Scene in 2022 (Gafni et al., "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors," arXiv:2203.13131), which added an optional segmentation map conditioning input alongside the text prompt.

Muse, introduced by Huiwen Chang and collaborators at Google in "Muse: Text-To-Image Generation via Masked Generative Transformers" (arXiv:2301.00704, January 2023), replaced the autoregressive ordering with a masked-token objective borrowed from MaskGIT, making generation much faster than left-to-right autoregression while preserving the discrete-token architecture.

Diffusion era (2021 to 2023)

Diffusion models had been studied since Sohl-Dickstein et al. (2015) and Ho, Jain, and Abbeel ("Denoising Diffusion Probabilistic Models," NeurIPS 2020),[^ddpm] but they entered text-to-image with GLIDE (Alex Nichol et al., "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models," arXiv:2112.10741, December 2021). GLIDE compared CLIP guidance with classifier-free guidance (introduced by Ho and Salimans in a 2021 workshop paper, posted to arXiv in 2022)[^cfg] and showed that classifier-free guidance produced more photorealistic results.[^glide]
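Classifier-free guidance itself fits in one line: at each sampling step the denoiser is evaluated twice, once with the prompt and once with an empty prompt, and the sampler extrapolates from the unconditional prediction toward the conditional one. The following is a minimal sketch in plain Python; the function name, the toy vectors, and the guidance scale of 3.0 are illustrative assumptions, not values from the GLIDE code.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: eps = eps_u + s * (eps_c - eps_u)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions standing in for the outputs of a real denoiser.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
```

A scale of 1.0 reproduces the conditional prediction, 0.0 reproduces the unconditional one, and larger values trade sample diversity for closer prompt adherence.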

DALL-E 2 (Ramesh, Dhariwal, Nichol, Chu, Chen, "Hierarchical Text-Conditional Image Generation with CLIP Latents," arXiv:2204.06125, April 2022) replaced DALL-E's autoregressive backbone with a two-stage diffusion model: a prior that mapped CLIP text embeddings to CLIP image embeddings, and a decoder that produced 64x64 images from those embeddings, plus two super-resolution stages to reach 1024x1024.[^dalle2]

Google's Imagen (Chitwan Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding," arXiv:2205.11487, May 2022) showed that a frozen T5-XXL text encoder gave substantially better text alignment than CLIP, and that scaling the text encoder helped more than scaling the U-Net. Imagen also generated images in pixel space rather than latent space. The Imagen paper reported a state-of-the-art zero-shot FID of 7.27 on the MS-COCO dataset "without ever training on COCO," and stated that "human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment," comparing against VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2.[^imagen]^[3]

Latent diffusion models from Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer ("High-Resolution Image Synthesis with Latent Diffusion Models," CVPR 2022, arXiv:2112.10752) introduced the latent-space approach that made Stable Diffusion practical on consumer GPUs.[^ldm] Stability AI released Stable Diffusion 1.4 in August 2022 under the CreativeML OpenRAIL-M license, and the open weights triggered an explosion of community fine-tunes, web UIs, and downstream applications. Stable Diffusion 1.5 followed in October 2022 with a slight quality bump from continued training.

Stable Diffusion 2.0 (November 2022) and 2.1 (December 2022) retrained the model with OpenCLIP ViT-H/14 and a new VAE, but they lost much of the celebrity and artist coverage that the 1.x models had learned and never achieved the same community traction as 1.5.

Stable Diffusion XL (SDXL), introduced in the paper "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis" (Podell et al., July 2023), tripled the U-Net parameter count to 2.6 billion, added a second text encoder (OpenCLIP ViT-bigG and CLIP ViT-L), and trained natively at 1024x1024.[^sdxl]

Google released Imagen 2 in December 2023 as part of the Vertex AI Imagen API and the Bard image generation feature.

Diffusion Transformer era (2023 to 2026)

William Peebles and Saining Xie introduced the Diffusion Transformer ("Scalable Diffusion Models with Transformers," arXiv:2212.09748, December 2022). DiT replaced the U-Net backbone with a pure Transformer that operated on patchified noisy latents, similar to a Vision Transformer. The architecture scaled cleanly and underpinned OpenAI's Sora video model announced in February 2024.

Stability AI applied the DiT idea to Stable Diffusion 3, introduced in the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach (arXiv:2403.03206, March 2024). SD3 introduced MM-DiT (multimodal Diffusion Transformer), which used separate weight matrices for image and text tokens inside each Transformer block while letting attention flow between them. SD3 was trained with rectified flow rather than the standard DDPM objective, eliminating the explicit noise schedule. The SD3 authors reported that their largest model "outperforms current proprietary and open state-of-the-art generative image models" in human-preference evaluation on PartiPrompts for visual aesthetics, prompt following, and typography generation. SD3 Medium (2 billion parameters) shipped in June 2024.[^sd3]^[4]

Black Forest Labs, founded by Robin Rombach, Patrick Esser, Andreas Blattmann, Dominik Lorenz, and other former core members of the Stable Diffusion team, released Flux.1 on August 1, 2024. The Flux.1 family used a 12-billion-parameter MM-DiT trained with rectified flow and shipped in three variants: Flux.1 Pro (closed API), Flux.1 Dev (open weights, non-commercial), and Flux.1 Schnell (4-step distilled, Apache 2.0).[^bfl] Black Forest Labs reported that its largest Flux.1 model outperformed Stable Diffusion 3 Ultra, Midjourney v6.0, and DALL-E 3 HD in internal qualitative tests.^[5] Flux.1.1 Pro followed in October 2024 with faster generation and the Ultra mode for 4-megapixel output. Flux.1 Kontext, focused on context-aware image editing, followed in May 2025.

Stable Diffusion 3.5, released in October 2024, kept the MM-DiT architecture and shipped Large (8B), Large Turbo, and Medium variants.[^sd35] Google released Imagen 3 in August 2024 (generally available December 2024) and Imagen 4 at Google I/O on May 20, 2025; the Imagen 4 family (Fast, standard, and Ultra) reached general availability in the Gemini API on August 15, 2025.[^imagen4]

Recraft V3 launched in October 2024 and briefly topped the Artificial Analysis text-to-image arena. Ideogram released V1 in August 2023, V2 in August 2024, and Ideogram 3.0 on March 26, 2025 (refined May 1, 2025), with a focus on accurate in-image text rendering.[^ideogram3] NVIDIA's Sana (Xie et al., 2024) introduced an efficient linear-attention DiT.

DeepSeek released Janus-Pro in January 2025, a unified multimodal model that handles text-to-image generation and image understanding in a single autoregressive backbone. HiDream-I1, released in May 2025 by HiDream AI, used a 17-billion-parameter sparse mixture-of-experts DiT and topped several leaderboards on release. Other 2024 to 2025 releases included Hunyuan-DiT from Tencent (May 2024, bilingual Chinese-English DiT), Kolors from Kuaishou (July 2024, bilingual DiT), Lumina-mGPT (Alpha-VLLM), and Stable Cascade (Stability AI, February 2024, based on the Wurstchen architecture).

On the closed-source side, Midjourney shipped v6 (December 2023), v6.1 (July 2024), and v7 (April 3, 2025, default June 17, 2025); Midjourney moved its V8 series to alpha and public release in 2026, with V8.1 published on midjourney.com on April 30, 2026 as its fastest model with native 2K output. V8.1 renders standard jobs roughly 4 to 5 times faster than earlier versions and generates native 2048x2048 images without a separate upscale step.[^mjver]^[6] OpenAI's ChatGPT image generation, branded internally as gpt-image-1, launched on March 25, 2025 as an autoregressive image model integrated into GPT-4o; it was notable for handling long, structured prompts and producing legible text.[^gpt4o] OpenAI later made GPT Image the default image generator in ChatGPT in place of DALL-E 3 and announced the deprecation of the dall-e-2 and dall-e-3 API snapshots, which were removed from the API on May 12, 2026.[^dalledep]^[7] Reve Image launched in 2024 as another autoregressive contender.

Native multimodal and frontier era (2025 to 2026)

A defining shift in 2025 and 2026 was the move from dedicated text-to-image pipelines toward native multimodal models that generate images directly from a large language or vision-language backbone, often with stronger instruction following, in-image text, and conversational editing.

OpenAI's gpt-image-1 (March 2025) was the first such model to reach mass consumer use through ChatGPT, driving the one-million-users-in-under-an-hour sign-up surge noted above.^[2] Google followed with Gemini image generation. Gemini 2.5 Flash Image, marketed as "Nano Banana," was announced on August 26, 2025 as a low-latency image generation and editing model with strong character consistency and an invisible SynthID watermark; in the two weeks after launch, Google's Gemini app added more than 23 million new users and users generated over 500 million images, lifting the app to number one on the App Store and Google Play.[^nanobanana]^[8] Gemini 3 Pro Image, marketed as "Nano Banana Pro," followed on November 20, 2025, built on Gemini 3 Pro reasoning, with class-leading legible text rendering, support for up to 14 input images, resemblance preservation for up to 5 people, and upscaling to 1K, 2K, and 4K.[^nanobananapro]^[9]

ByteDance entered with Seedream 4.0 (September 2025), a unified generation-and-editing Diffusion Transformer producing up to 4K output with roughly tenfold faster inference than Seedream 3.0, followed by Seedream 4.5 in December 2025.[^seedream] Alibaba released Qwen-Image on August 4, 2025, a 20-billion-parameter MM-DiT under an Apache 2.0 license with a particular emphasis on rendering text, including Chinese logographic characters; Qwen-Image 2.0 followed on February 10, 2026 as a leaner 7-billion-parameter unified model (an 8B Qwen3-VL encoder plus a 7B diffusion decoder) supporting native 2K resolution.[^qwenimage]^[10]

Black Forest Labs extended the Flux line with Flux.1 Kontext, announced on May 29, 2025 for in-context image editing in Max, Pro, and Dev variants, and then Flux.2, announced on November 25, 2025 as a production-grade family (Pro, Flex, Dev, and the Apache 2.0 Klein, plus a Max tier) generating up to 4-megapixel images with multi-reference control; Flux.2 reportedly pairs latent flow matching with a Mistral-3 vision-language model, and Flux.2 [klein] shipped on January 15, 2026.[^flux2]

Timeline of major releases

Year	Model	Organization	Notes
2015	AlignDRAW	University of Toronto	First neural caption-to-image attention model
2016	Reed et al. GAN	Univ. Michigan, MPI	First conditional GAN on captions
2017	StackGAN	Rutgers, Lehigh, Microsoft	Two-stage GAN to 256x256
2018	AttnGAN	Microsoft Research	Word-level cross-attention
Jan 2021	DALL-E	OpenAI	12B autoregressive Transformer with dVAE
Dec 2021	GLIDE	OpenAI	Classifier-free guidance for diffusion
Apr 2022	DALL-E 2	OpenAI	CLIP-latent diffusion, 1024x1024
May 2022	Imagen	Google Research	T5-XXL text encoder, pixel-space diffusion
Jun 2022	Parti	Google Research	20B autoregressive Transformer
Aug 2022	Stable Diffusion 1.4	Stability AI, LMU Munich	First open-weights latent diffusion
Nov 2022	Stable Diffusion 2.0	Stability AI	OpenCLIP ViT-H/14
Jul 2023	SDXL	Stability AI	2.6B U-Net, 1024x1024 native
Dec 2023	Midjourney v6	Midjourney	Major quality jump
Dec 2023	Imagen 2	Google	Vertex AI release
Feb 2024	Stable Cascade	Stability AI	Wurstchen-style cascaded latent diffusion
Feb 2024	Sora (DiT)	OpenAI	DiT applied to text-to-video
Mar 2024	Stable Diffusion 3 paper	Stability AI	MM-DiT plus rectified flow
May 2024	Hunyuan-DiT	Tencent	Bilingual DiT
Jun 2024	SD3 Medium	Stability AI	2B open weights
Jul 2024	Midjourney v6.1	Midjourney	Improved detail and coherence
Jul 2024	Kolors	Kuaishou	Bilingual DiT
Aug 2024	Flux.1	Black Forest Labs	12B MM-DiT, Schnell distilled
Aug 2024	Imagen 3	Google DeepMind	T5-XXL plus diffusion transformer
Aug 2024	Ideogram 2.0	Ideogram	Strong in-image typography
Oct 2024	Recraft V3	Recraft	Briefly top of Artificial Analysis arena
Oct 2024	SD 3.5 Large	Stability AI	8B MM-DiT
Oct 2024	Flux.1.1 Pro	Black Forest Labs	Ultra 4MP mode
Jan 2025	Janus-Pro	DeepSeek	Unified autoregressive understanding plus generation
Mar 2025	gpt-image-1	OpenAI	Autoregressive image generation in ChatGPT
Apr 2025	Midjourney v7	Midjourney	New personalization defaults
May 2025	HiDream-I1	HiDream AI	17B sparse MoE DiT
May 2025	Imagen 4	Google DeepMind	Announced at I/O; GA in Gemini API August 2025
May 2025	Flux.1 Kontext	Black Forest Labs	In-context editing (Max, Pro, Dev)
Aug 2025	Gemini 2.5 Flash Image	Google DeepMind	"Nano Banana"; native multimodal, SynthID
Aug 2025	Qwen-Image	Alibaba	20B MM-DiT, Apache 2.0, strong text rendering
Sep 2025	Seedream 4.0	ByteDance	Unified DiT, up to 4K
Nov 2025	Gemini 3 Pro Image	Google DeepMind	"Nano Banana Pro"; up to 4K, 14 input images
Nov 2025	Flux.2	Black Forest Labs	Pro/Flex/Dev/Klein; up to 4MP
Dec 2025	Seedream 4.5	ByteDance	Improved typography and multi-image editing
Feb 2026	Qwen-Image 2.0	Alibaba	7B unified generation and editing, native 2K
Apr 2026	Midjourney V8.1	Midjourney	Fastest version, native 2K output

Which text-to-image models are notable?

Model	First release	Organization	Backbone	Open weights
DALL-E	Jan 2021	OpenAI	Autoregressive Transformer plus dVAE	No
GLIDE	Dec 2021	OpenAI	U-Net pixel diffusion	Partial (filtered 300M)
DALL-E 2	Apr 2022	OpenAI	U-Net diffusion with CLIP latents	No
Imagen	May 2022	Google	U-Net pixel diffusion, T5-XXL	No
Parti	Jun 2022	Google	Autoregressive Transformer (ViT-VQGAN)	No
Stable Diffusion 1.5	Oct 2022	Stability AI plus RunwayML	U-Net latent diffusion	Yes (CreativeML OpenRAIL-M)
SDXL	Jul 2023	Stability AI	U-Net latent diffusion, dual text encoder	Yes
DALL-E 3	Oct 2023	OpenAI	Diffusion plus GPT prompt rewriter	No (API removed May 2026)
Midjourney v6	Dec 2023	Midjourney	Closed (rumored DiT)	No
Imagen 2	Dec 2023	Google	Pixel diffusion	No
Stable Cascade	Feb 2024	Stability AI	Wurstchen v3 cascade	Yes
Hunyuan-DiT	May 2024	Tencent	DiT (1.5B), bilingual	Yes
SD3 Medium	Jun 2024	Stability AI	MM-DiT plus rectified flow	Yes
Flux.1 Dev	Aug 2024	Black Forest Labs	MM-DiT plus rectified flow (12B)	Yes (non-commercial)
Flux.1 Schnell	Aug 2024	Black Forest Labs	4-step distilled MM-DiT	Yes (Apache 2.0)
Imagen 3	Aug 2024	Google DeepMind	Diffusion plus T5	No
Recraft V3	Oct 2024	Recraft	Closed (raster plus vector)	No
SD 3.5 Large	Oct 2024	Stability AI	MM-DiT (8B)	Yes
Janus-Pro	Jan 2025	DeepSeek	Unified autoregressive	Yes
gpt-image-1	Mar 2025	OpenAI	Autoregressive	No
Ideogram 3.0	Mar 2025	Ideogram	Closed, strong typography	No
HiDream-I1	May 2025	HiDream AI	Sparse MoE DiT (17B)	Yes
Imagen 4	May 2025	Google DeepMind	Diffusion (Fast, standard, Ultra)	No
Flux.1 Kontext	May 2025	Black Forest Labs	MM-DiT, in-context editing	Dev variant only
Gemini 2.5 Flash Image	Aug 2025	Google DeepMind	Native multimodal ("Nano Banana")	No
Qwen-Image	Aug 2025	Alibaba	MM-DiT (20B)	Yes (Apache 2.0)
Seedream 4.0	Sep 2025	ByteDance	Unified DiT, up to 4K	No
Gemini 3 Pro Image	Nov 2025	Google DeepMind	Native multimodal ("Nano Banana Pro")	No
Flux.2	Nov 2025	Black Forest Labs	Latent flow matching (Pro/Flex/Dev/Klein)	Klein (Apache 2.0), Dev
Qwen-Image 2.0	Feb 2026	Alibaba	Unified 7B (Qwen3-VL plus diffusion)	Yes

How do text-to-image models work? (Architectures)

U-Net diffusion

The U-Net was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 for biomedical segmentation. Ho, Jain, and Abbeel adapted it as the standard denoising backbone for DDPMs in 2020. A diffusion U-Net takes a noisy latent (or pixel) image plus a timestep embedding and a text conditioning vector, and predicts the noise. The U-Net has down-sampling, middle, and up-sampling blocks with skip connections, and text conditioning enters through cross-attention layers inserted at multiple resolutions. Stable Diffusion 1.x, 2.x, and SDXL all use U-Net backbones, as did GLIDE, DALL-E 2, Imagen, and Imagen 2.

Latent diffusion (Rombach et al., 2022) compresses the image with a VAE before running diffusion. For Stable Diffusion the VAE has an 8x downsampling factor, so a 512x512 image becomes a 64x64x4 latent. Diffusion runs on the latent, and the VAE decoder maps the final latent back to pixels. This cut compute roughly tenfold compared with pixel-space diffusion and was the key engineering insight that made high-resolution open-weights image generation feasible on a single consumer GPU.
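The shape arithmetic behind that example can be checked directly. This is a small sketch assuming the 8x downsampling factor and 4 latent channels stated above; the function names are hypothetical and do not come from any library.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return the (H, W, C) shape of the VAE latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)


def element_ratio(height, width, rgb_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * rgb_channels) / (h * w * c)


shape_512 = latent_shape(512, 512)    # (64, 64, 4)
ratio_512 = element_ratio(512, 512)   # 786,432 pixel values vs 16,384 latent values
```

The element count shrinks by a factor of 48 here. That is not the same quantity as the roughly tenfold compute saving, which also depends on the cost of the VAE and on how the denoiser scales with input size.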

Diffusion Transformer (DiT)

William Peebles and Saining Xie's Diffusion Transformer ("Scalable Diffusion Models with Transformers," arXiv:2212.09748) replaced the U-Net with a pure Transformer. The noisy latent is split into patches (similar to ViT), each patch becomes a token, and the Transformer processes the full sequence with self-attention. Conditioning enters through adaptive layer norm (adaLN) modulated by timestep and class or text embeddings. DiT showed clean scaling: bigger models, more compute, and lower FID, with no architectural specialization for images beyond the patchification. DiT-XL/2 became the canonical reference model.[^dit]
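Patchification is easy to illustrate without a deep learning framework. The sketch below splits an H x W x C latent, stored as nested Python lists, into non-overlapping p x p patches and flattens each into one token. The 32x32x4 latent is an assumption (it corresponds to a 256x256 image under an 8x VAE), and a real DiT would follow this step with a learned linear projection and positional embeddings, which are omitted here.

```python
def patchify(latent, patch_size):
    """Split an H x W x C latent (nested lists) into flattened patch tokens.

    Returns a list of (H / p) * (W / p) tokens, each of length p * p * C,
    ordered row-major over the patch grid.
    """
    height, width = len(latent), len(latent[0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])  # append the C channel values
            tokens.append(token)
    return tokens


# Toy latent: every channel of position (r, c) holds the value r * 32 + c.
latent = [[[float(r * 32 + c)] * 4 for c in range(32)] for r in range(32)]
tokens = patchify(latent, patch_size=2)
```

With a patch size of 2, as in the "/2" of DiT-XL/2, the 32x32 latent yields 256 tokens of 16 values each. Halving the patch size quadruples the sequence length, which is the main lever on compute in this design.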

DiT was adopted by OpenAI's Sora (announced February 2024) and influenced the move to pure Transformer backbones in image generation. PixArt-Alpha (Chen et al., 2023) was the first widely used open-source DiT for text-to-image.

MM-DiT (multimodal DiT)

MM-DiT, introduced in the Stable Diffusion 3 paper (Esser et al., 2024), keeps separate weight matrices for image tokens and text tokens within each Transformer block, but lets self-attention operate over the concatenated sequence. This avoids the cross-attention bottleneck of U-Net designs and gives text tokens equal status with image tokens. Both SD3 and Flux.1 use MM-DiT.
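The core idea, separate projections per modality with one attention over the joined sequence, can be sketched in plain Python. This is a single-head toy under stated assumptions: the function names, the dictionary layout of the weights, and the 2-dimensional tokens are all hypothetical, and the real block also includes per-modality MLPs, adaLN modulation, output projections, and multiple heads.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, image_tokens, text_weights, image_weights):
    """Project each modality with its own q/k/v matrices, then attend jointly."""
    queries, keys, values = [], [], []
    for tokens, weights in ((text_tokens, text_weights), (image_tokens, image_weights)):
        for token in tokens:
            queries.append(matvec(weights["q"], token))
            keys.append(matvec(weights["k"], token))
            values.append(matvec(weights["v"], token))
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Every token, text or image, attends over the full concatenated sequence.
        attn = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(a * v[i] for a, v in zip(attn, values))
                        for i in range(len(values[0]))])
    n_text = len(text_tokens)
    return outputs[:n_text], outputs[n_text:]


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_weights = {"q": identity, "k": identity, "v": identity}
image_weights = {"q": doubled, "k": doubled, "v": doubled}
text_out, image_out = joint_attention(
    [[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]], text_weights, image_weights
)
```

Because text and image tokens sit in the same attention matrix, information flows in both directions within a block, unlike U-Net cross-attention where image features query a fixed text sequence.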

Token-based and autoregressive

Autoregressive image models tokenize the image with a VQ-VAE, VQ-GAN, or similar discrete autoencoder, then predict the image-token sequence left-to-right after the text-token sequence. DALL-E (Jan 2021) was the first large autoregressive text-to-image model. Parti scaled this approach to 20 billion parameters. Muse swapped the autoregressive objective for masked-token prediction, which allowed parallel decoding.[^muse] Janus-Pro and gpt-image-1 returned to autoregressive image generation in 2025 but with multimodal Transformers that share a single backbone with the text model.

Flow Matching and Rectified Flow

Flow Matching was introduced by Yaron Lipman, Ricky Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le in "Flow Matching for Generative Modeling" (arXiv:2210.02747, October 2022). Rather than learning a denoising step at each noise level, flow matching learns a vector field that transports a simple base distribution (such as Gaussian noise) to the data distribution along straight or near-straight paths. This is a simulation-free training objective that can be trained with a regression loss.[^flowmatch]

Rectified Flow (Liu, Gong, Liu, ICLR 2023, "Flow Straight and Fast") is a specific flow-matching variant that explicitly straightens the trajectories. The 2024 SD3 paper showed that rectified flow on Transformer backbones beat standard DDPM on text-to-image tasks at scale, and rectified flow is now the default training objective for SD3, SD 3.5, and Flux.1.[^rectflow]
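A minimal sketch of the straight-path objective and of Euler sampling follows, in plain Python. It assumes the convention that t = 0 is noise and t = 1 is data (the SD3 paper uses the reverse direction), and the function names are hypothetical. A trained network would replace the `velocity_fn` argument; here the exact target velocity is used to show that a perfectly straight flow is integrated without error.

```python
def interpolate(noise, data, t):
    """Point on the straight path x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Regression target: the constant velocity data - noise along the path."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


def euler_sample(velocity_fn, start, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(start)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 2.0]
data = [1.0, -2.0]
midpoint = interpolate(noise, data, 0.5)
sample = euler_sample(lambda x, t: target_velocity(noise, data), noise, num_steps=4)
```

When the learned paths are close to straight, few Euler steps are needed, which is one reason distilled models in this family can sample in a handful of steps.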

How are text-to-image models evaluated? (Benchmarks)

Measuring text-to-image quality is harder than measuring classifier accuracy because the output is a continuous image whose quality depends on aesthetics, prompt alignment, factual correctness, and absence of artifacts.

Benchmark	Year	Measures	Notes
FID	2017	Distributional distance	Frechet Inception Distance, computes Frechet distance between Inception-v3 feature statistics of real and generated images[^fid]
CLIPScore	2021	Prompt alignment	Cosine similarity between CLIP text and image embeddings[^clipscore]
Inception Score (IS)	2016	Diversity and class confidence	Now considered weak
DrawBench	2022	Prompt alignment	200 hand-crafted prompts from the Imagen team
PartiPrompts (P2)	2022	Prompt alignment	1,632 prompts spanning 11 categories from the Parti team
HPS	2023	Human preference	Human Preference Score, Wu et al., trained on 98K human-labeled image pairs[^hpsv2]
HPSv2	2023	Human preference	Wu, Sun, Wang, Liu, Tao, Liu, Liu, Tang, ICCV 2023, 98K to 798K labeled pairs
ImageReward	2023	Human preference	Xu, Liu, Wu et al., NeurIPS 2023, BLIP-based reward model trained on 137K human comparisons[^imagereward]
PickScore	2023	Human preference	Kirstain et al., trained on Pick-a-Pic dataset of 500K user preferences
GenEval	2023	Compositional skills	Ghosh, Hajishirzi, Schmidt, NeurIPS 2023, object grounding, counting, color, position, attribute binding[^geneval]
T2I-CompBench	2023	Compositional skills	Huang et al., NeurIPS 2023, attribute binding, object relationships, complex compositions[^t2icomp]
DPG-Bench	2024	Dense prompt fidelity	ELLA paper, evaluates long, paragraph-length prompts[^dpgbench]
PaintSkills	2022	Visual reasoning	Cho, Zala, Bansal, EMNLP 2022, object, count, spatial, attribute skills[^paintskills]
Artificial Analysis Image Arena	2024	Human Elo	Pairwise crowdsourced votes across many models
LMSys Image Arena	2024	Human Elo	Side-by-side blind voting
Hugging Face Open T2I Arena	2024	Human Elo	Community-run arena focused on open-weights models

FID remains the most cited automated metric, but it is sensitive to dataset choice (typically MS-COCO 30K for text-to-image FID) and does not correlate well with human preference at the top of the leaderboard. Most newer papers report a combination of CLIPScore for prompt alignment, GenEval for compositional skills, and a human-preference score such as HPSv2 or PickScore, alongside Elo from a public arena.
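As an illustration of the simplest of these metrics, CLIPScore reduces to a rescaled, clipped cosine similarity between two embedding vectors. The sketch below uses the weight of 2.5 from the CLIPScore definition and plain Python lists in place of real embeddings; producing the embeddings requires a CLIP text and image encoder, which is outside this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(text_embedding, image_embedding, weight=2.5):
    """CLIPScore = weight * max(cos(text, image), 0)."""
    return weight * max(cosine_similarity(text_embedding, image_embedding), 0.0)


# Toy 2-dimensional embeddings standing in for CLIP outputs.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])
unrelated = clip_score([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on the CLIP embedding space, it inherits CLIP's blind spots, which is one reason papers pair it with compositional and human-preference benchmarks.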

Hugging Face hosts several public leaderboards: the Open Text-to-Image Leaderboard (now archived), the GenAI-Arena, and the Open Image Arena. Artificial Analysis runs its own Image Arena with frequent reranking; HiDream-I1, Flux.1.1 Pro, Imagen 3, and Recraft V3 have each held the top human-Elo slot at various points between 2024 and 2025.
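For readers unfamiliar with Elo, the classic online update from a single pairwise vote looks like this. It is a generic sketch: the K-factor of 32 and the starting ratings are assumptions, and individual arenas may instead fit ratings in batch with a Bradley-Terry model rather than update them vote by vote.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


new_a, new_b = update_elo(1000.0, 1000.0, a_won=True)
```

An upset against a much higher-rated model moves both ratings further than an expected win does, so a new entrant can climb quickly with relatively few votes.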

Are text-to-image models open source? (Open-source ecosystem)

The release of Stable Diffusion's weights in August 2022 spawned an ecosystem of open-source tools that turned text-to-image generation into a hobbyist and small-team activity rather than something gated behind cloud APIs. Open-weights releases continue to anchor the field: SDXL, SD 3.5 Large, Flux.1 Dev and Schnell, Qwen-Image (Apache 2.0), Janus-Pro, HiDream-I1, and Flux.2 Klein (Apache 2.0) all ship downloadable checkpoints, while OpenAI's GPT Image models, Imagen, Midjourney, and the Gemini image models remain closed.

Inference libraries and runtimes

Hugging Face Diffusers is the standard Python library for running text-to-image models. It provides a unified API for SD 1.x, SD 2.x, SDXL, SD3, Flux, Stable Cascade, Kandinsky, PixArt, and dozens of community fine-tunes, along with implementations of all common schedulers (DDIM, DPM-Solver, Euler, UniPC, LCM) and pipelines for inpainting, ControlNet, IP-Adapter, and image-to-image.
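A typical Diffusers call is short. The sketch below wraps one in a function; the SDXL model ID, step count, guidance scale, and seed are illustrative assumptions rather than recommended settings. It requires the `torch` and `diffusers` packages and downloads several gigabytes of weights on first use, so the heavy imports are deferred until the function is called.

```python
def generate_image(prompt,
                   model_id="stabilityai/stable-diffusion-xl-base-1.0",
                   steps=30, guidance_scale=7.0, seed=0):
    """Generate one PIL image from a text prompt with a Diffusers pipeline."""
    # Imported lazily so that defining this function needs no heavy dependencies.
    import torch
    from diffusers import AutoPipelineForText2Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A seeded generator makes the sampled noise, and so the image, reproducible.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
        generator=generator,
    )
    return result.images[0]
```

Calling `generate_image("a stop sign flying in blue skies").save("stop_sign.png")` would write the result to disk. Swapping the model ID for another supported checkpoint generally works unchanged, though step counts and guidance values that suit one model family may not suit another.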

ONNX Runtime, TensorRT, and Apple Core ML provide hardware-accelerated inference paths for production deployment. Stability AI ships SDXL in Core ML format for on-device generation on Apple Silicon. The Flux.1 Schnell distillation runs in 4 steps, which makes it one of the fastest open models whose output quality is competitive with closed APIs.

Graphical interfaces

ComfyUI, created by comfyanonymous in early 2023, presents a node-graph editor for diffusion pipelines and is the most flexible of the open frontends. It supports custom nodes, complex pipelines, multi-pass workflows, and ControlNet stacking, and it has become the de facto reference UI for non-commercial Flux deployments.

Automatic1111's stable-diffusion-webui (released August 2022) was the first widely adopted web frontend and remains popular for its plugin ecosystem and one-click installers. Forge, a fork of A1111 by lllyasviel (the author of ControlNet), focuses on memory efficiency and faster generation. InvokeAI ships a polished interface aimed at professional artists. Fooocus simplifies the UI down to a single prompt box with sensible defaults. Pinokio packages many of these tools as one-click installable apps.

Civitai (launched late 2022) hosts community fine-tunes and LoRAs for Stable Diffusion variants, with millions of downloads per month at peak.

Conditioning extensions

Text prompts alone are a coarse interface. Several extensions add fine-grained spatial, identity, or style conditioning on top of an existing text-to-image model without retraining the base weights.

ControlNet

ControlNet was introduced by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala in "Adding Conditional Control to Text-to-Image Diffusion Models" (arXiv:2302.05543, February 2023). ControlNet trains a copy of the encoder half of the diffusion U-Net to accept an additional conditioning image (Canny edges, depth maps, OpenPose skeletons, segmentation masks, scribbles, normals, line art, MLSD lines, and others). The trainable copy is connected to the frozen base model with zero-initialized convolutions, so the addition starts as a no-op and learns the spatial control without disturbing the base weights.[^controlnet] ControlNet enabled accurate pose and layout control and was a key building block for the SD-based animation and design workflows that followed.
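
The zero-initialization idea can be illustrated with a toy, dependency-free sketch. It models a 1x1 convolution at a single spatial position as a linear map over channels and shows that, with all-zero weights and bias, adding the control branch leaves the base features unchanged. The channel count, random features, and helper names are assumptions for illustration only; this is not the ControlNet reference implementation.

```python
import random


def conv1x1(x, weight, bias):
    """1x1 convolution at one spatial position: a linear map over channels."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def combine(base, control, weight, bias):
    """Add the control branch, passed through a 1x1 conv, to the frozen base features."""
    residual = conv1x1(control, weight, bias)
    return [b + r for b, r in zip(base, residual)]


channels = 4  # toy size, chosen for the example
random.seed(0)
base_features = [random.gauss(0, 1) for _ in range(channels)]     # frozen base block output
control_features = [random.gauss(0, 1) for _ in range(channels)]  # trainable copy output

# At initialization the connecting convolution has all-zero weights and bias.
zero_w = [[0.0] * channels for _ in range(channels)]
zero_b = [0.0] * channels
at_init = combine(base_features, control_features, zero_w, zero_b)

# After training, the weights are no longer zero (here: a hand-picked scaled identity).
trained_w = [[0.1 if i == j else 0.0 for j in range(channels)] for i in range(channels)]
after_training = combine(base_features, control_features, trained_w, zero_b)

print(at_init == base_features)         # the control branch is a no-op at initialization
print(after_training == base_features)  # it contributes once the weights move away from zero
```

Starting from an exact no-op means the first training steps cannot degrade the base model's output; the control signal is blended in only as the connecting weights are learned.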

T2I-Adapter (Mou et al., 2023) is a lightweight alternative to ControlNet with similar functionality but smaller adapter modules.

LoRA

LoRA (Low-Rank Adaptation), originally proposed by Hu et al. for language models in 2021 ("LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685), became the standard way to fine-tune diffusion models on small datasets. A LoRA adds a pair of trainable low-rank matrices alongside frozen weight matrices, most commonly the self-attention and cross-attention projections in the U-Net or DiT, so a fine-tune for a new character, style, or concept stores only tens of megabytes rather than the full multi-gigabyte checkpoint.[^lora] Civitai and Hugging Face host tens of thousands of public LoRAs for Stable Diffusion variants and Flux.
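
The storage saving follows from simple parameter counting: a full weight matrix of shape d_out x d_in has d_out * d_in entries, while a rank-r LoRA pair has r * (d_in + d_out). The sketch below works through one case. The layer width of 1280, the count of 256 adapted matrices, the rank of 16, and 2-byte (fp16) storage are illustrative assumptions, not the configuration of any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA pair: B is (d_out, rank) and A is (rank, d_in)."""
    return rank * (d_in + d_out)


def size_mb(param_count: int, bytes_per_param: int = 2) -> float:
    """Storage in megabytes, assuming fp16 (2 bytes per parameter) by default."""
    return param_count * bytes_per_param / 1e6


# Hypothetical model: 256 adapted projection matrices, each 1280 x 1280, rank 16.
layer_shapes = [(1280, 1280)] * 256
rank = 16

lora_total = sum(lora_params(o, i, rank) for o, i in layer_shapes)
full_total = sum(full_params(o, i) for o, i in layer_shapes)

print(f"LoRA adapter: {size_mb(lora_total):.1f} MB")
print(f"Same layers, full weights: {size_mb(full_total):.1f} MB")
print(f"Ratio: {lora_total / full_total:.3f}")
```

Under these assumptions the adapter comes to about 21 MB against roughly 839 MB for the same layers stored in full, which is consistent with the "tens of megabytes" figure for typical LoRA files.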

IP-Adapter

IP-Adapter (Ye et al., "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models," arXiv:2308.06721, August 2023) lets a user supply a reference image as a second conditioning signal alongside the text prompt, allowing identity, style, or composition transfer at inference time without per-subject fine-tuning.[^ipadapter] Related methods such as InstantID and PuLID specialize in face identity preservation.

AnimateDiff and motion modules

AnimateDiff (Guo et al., "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning," arXiv:2307.04725, July 2023) attaches a temporal motion module to a frozen SD checkpoint, turning any image checkpoint into a short-clip video model. It paved the way for many of the open-source text-to-video and image-to-video pipelines built on Stable Diffusion and SDXL.

DreamBooth (Ruiz et al., 2022) is an earlier fine-tuning method, now mostly superseded by LoRA, that personalizes a diffusion model to a specific subject from 3 to 5 example images.

How much do text-to-image APIs cost? (Commercial pricing)

In 2025 the major closed-source APIs priced image generation roughly as follows (approximate price per image at standard resolution):

Service	Standard price	Notes
OpenAI gpt-image-1 (low)	about 1.1 cents	1024x1024 low quality
OpenAI gpt-image-1 (medium)	about 4.2 cents	Default consumer tier
OpenAI gpt-image-1 (high)	about 16.7 cents	Best text rendering
Midjourney Basic	about 5.0 cents	200 images per month at 10 USD
Stability AI SD3 Large API	about 6.5 cents	Pay-per-credit
Black Forest Labs Flux 1.1 Pro	about 4.0 cents	Replicate or fal.ai
Black Forest Labs Flux 1.1 Pro Ultra	about 6.0 cents	4 megapixel
Google Imagen 3	about 3.0 cents	Vertex AI
Google Imagen 4	about 4.0 cents	Gemini API; Fast 2 cents, Ultra 6 cents
Google Gemini 2.5 Flash Image	about 3.9 cents	"Nano Banana"; 1290 output tokens per image
Ideogram V3	about 8.0 cents	High in-image text quality
Recraft V3	about 4.0 cents	Vector and brand presets

Google prices Gemini 2.5 Flash Image at 30 USD per 1 million output tokens, and because each image consumes 1,290 output tokens the effective cost is about 0.039 USD (3.9 cents) per image.^[8] Open-weights models such as Flux.1 Schnell or SDXL on a self-hosted GPU cost only the electricity and amortized hardware, often below 0.1 cents per image once a workflow is set up.
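
The token-to-image conversion above is a one-line calculation; the sketch below reproduces it and extends it to a budget estimate. The function names and the 10 USD budget are assumptions for illustration, and the price and token count are the figures quoted in the paragraph above, which may change.

```python
def cost_per_image_usd(price_per_million_tokens: float, tokens_per_image: int) -> float:
    """Effective per-image cost when billing is by output token."""
    return price_per_million_tokens * tokens_per_image / 1_000_000


def images_per_budget(budget_usd: float, per_image_usd: float) -> int:
    """Whole images that fit in a budget at a fixed per-image cost."""
    return int(budget_usd // per_image_usd)


# Figures quoted in the text: 30 USD per 1M output tokens, 1,290 tokens per image.
gemini_cost = cost_per_image_usd(30.0, 1290)

print(f"{gemini_cost:.4f} USD per image")              # 0.0387 USD per image
print(images_per_budget(10.0, gemini_cost), "images")  # 258 images for 10 USD
```

The same helper applies to any token-billed image model, as long as the per-image token count is fixed or can be averaged.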

What are text-to-image models used for? (Applications)

Text-to-image systems are deployed across consumer creativity tools, marketing and advertising production, concept art and pre-visualization, stock-photo replacement, e-commerce product imagery, packaging design, game-asset generation, educational illustration, and personal photo editing. Adobe Firefly integrates an in-house diffusion model into Photoshop's Generative Fill. Canva, Figma, and Microsoft Designer use commercial APIs for text-to-image inside their editors. Game studios use SD or Flux fine-tunes for environment and character ideation. Architecture and product-design teams use ControlNet plus a sketch or 3D render as the conditioning input.

Separately, text-to-image is the conditioning backbone for many text-to-video, text-to-3D, and personalization systems, and synthetic images from these models are used to augment training data for vision classifiers and detection models.

What are the concerns with text-to-image models?

Copyright and training data

Most large text-to-image models were trained on web-scraped datasets such as LAION-5B (Schuhmann et al., 2022), which contains 5.85 billion image-text pairs scraped from Common Crawl.[^laion] A group of artists filed a class-action lawsuit against Stability AI, Midjourney, and DeviantArt in January 2023, arguing that training on their work without consent infringed their copyright. In the UK, Getty Images v. Stability AI was filed in 2023 and went to trial in June 2025. The US Copyright Office has issued several decisions (including Zarya of the Dawn in February 2023 and the refusal to register Théâtre D'opéra Spatial, which its creator challenged in court as Allen v. Perlmutter) holding that generative-AI outputs without sufficient human creative input are not copyrightable.

LAION-5B was temporarily taken offline in December 2023 after a Stanford Internet Observatory report by David Thiel identified child sexual abuse material in the dataset.[^csam] A cleaned version with the offending content removed, Re-LAION-5B, was published in August 2024.

Deepfakes and non-consensual imagery

The ability to generate photorealistic images of real people from a prompt has produced documented harms in the form of non-consensual intimate imagery and political deepfakes. The January 2024 incident in which non-consensual deepfake images of Taylor Swift spread on X led to platform-level changes and renewed legislative attention, including the US NO FAKES and DEFIANCE Acts. Many commercial models block named celebrities and apply automated nudity filters; most open-weights models can be fine-tuned to bypass those filters.

Bias and representation

The concerns raised by Buolamwini and Gebru's earlier work on facial-recognition bias ("Gender Shades," 2018) carry over to image generation: text-to-image systems trained on web data inherit demographic, occupational, and stereotype biases from the training set. Bloomberg's 2023 analysis of Stable Diffusion outputs by Leonardo Nicoletti and Dina Bass found that prompts for "CEO" or "doctor" disproportionately produced light-skinned men, while "social worker" or "fast-food worker" more often produced darker-skinned subjects.[^bias] In February 2024 Google's Gemini image-generation feature produced historically inaccurate outputs (such as racially diverse Nazi soldiers) after an adjustment intended to diversify results, and Google paused the generation of images of people for several months.

Energy and environmental cost

Training a large text-to-image model takes tens to hundreds of thousands of GPU-hours; Stable Diffusion 1.5 used roughly 150,000 A100-hours according to the model card. Inference at scale is also non-trivial: a 2023 paper by Luccioni, Jernite, and Strubell estimated that generating 1,000 images with the most energy-intensive model they tested used about as much energy as several hundred full smartphone charges, or roughly half a charge per image.[^energy]

References

1. OpenAI (2025). "Introducing 4o Image Generation." Company blog, March 25, 2025. https://openai.com/index/introducing-4o-image-generation/ Accessed 2026-06-21.
2. Axios (2025). "Altman: ChatGPT adds 1M users in 1 hour." March 31, 2025. https://www.axios.com/2025/03/31/chatgpt-image-altman-openai-users Accessed 2026-06-21.
3. Saharia, C., Chan, W., Saxena, S. et al. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." arXiv:2205.11487. https://arxiv.org/abs/2205.11487 Accessed 2026-06-21.
4. Esser, P., Kulal, S., Blattmann, A. et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." arXiv:2403.03206. https://arxiv.org/abs/2403.03206 Accessed 2026-06-21.
5. Black Forest Labs (2024). "Announcing Black Forest Labs." Company blog, August 1, 2024. https://bfl.ai/announcements/24-08-01-bfl Accessed 2026-06-21.
6. Midjourney (2026). "Version." Product documentation (V8.1 published April 30, 2026; native 2K, fastest model). https://docs.midjourney.com/hc/en-us/articles/32199405667853-Version Accessed 2026-06-21.
7. OpenAI (2025-2026). "Deprecations." OpenAI API documentation (dall-e-2 and dall-e-3 removed from the API on May 12, 2026). https://platform.openai.com/docs/deprecations Accessed 2026-06-21.
8. Google Developers Blog (2025). "Introducing Gemini 2.5 Flash Image, our state-of-the-art image model." August 26, 2025. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/ Accessed 2026-06-21.
9. Google (2025). "Nano Banana Pro: Gemini 3 Pro Image model from Google DeepMind." November 20, 2025. https://blog.google/innovation-and-ai/products/nano-banana-pro/ Accessed 2026-06-21.
10. Qwen Team, Alibaba (2025). "Qwen-Image: Crafting with Native Text Rendering." August 4, 2025. https://qwenlm.github.io/blog/qwen-image/ Accessed 2026-06-21.

Related Articles

Pika (video generation)

Sora 2

GPT Image 1

Nano Banana Pro

Seedream 4.0

Document Question Answering Models
