DALL-E 2
Last reviewed
Jun 3, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,563 words
DALL-E 2 (stylized DALL·E 2) is a text-to-image generation system developed by OpenAI and announced on April 6, 2022. It is the successor to the original DALL·E, which OpenAI unveiled in January 2021. DALL-E 2 produces more photorealistic and higher-resolution images than its predecessor, and it introduced editing capabilities such as inpainting, outpainting, and image variations. The system is built on a method that OpenAI's researchers called "unCLIP," which combines the CLIP contrastive image-text model with diffusion models.[1][2][3] DALL-E 2 was deprecated in November 2025 and is scheduled to be removed from OpenAI's API on May 12, 2026, having been superseded by DALL·E 3 and gpt-image-1.[4]
The original DALL·E, announced on January 5, 2021, was a 12-billion-parameter version of the GPT-3 language model that generated 256 by 256 pixel images autoregressively from text, using a discrete variational autoencoder to tokenize images and a CLIP model to rank candidate outputs.[5] In the months that followed, OpenAI shifted its image-generation research toward diffusion models. In December 2021 the company published GLIDE ("Guided Language to Image Diffusion for Generation and Editing"), a 3.5-billion-parameter text-conditional diffusion model whose samples human evaluators preferred over those of the original DALL·E. GLIDE also demonstrated text-driven inpainting.[6] DALL-E 2 built directly on this work, reusing the GLIDE decoder architecture while changing how the model is conditioned.[2]
Despite the shared "DALL·E" branding, DALL-E 2 is architecturally distinct from DALL·E 1. The original used an autoregressive transformer over discrete image tokens, whereas DALL-E 2 uses a diffusion-based pipeline conditioned on CLIP embeddings.[2][5]
DALL-E 2 is described in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents," submitted to arXiv on April 13, 2022 by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.[1] The paper names the approach "unCLIP" because it effectively inverts the CLIP image encoder: rather than mapping an image to an embedding, the system starts from a CLIP embedding and generates a corresponding image.[1][3]
The pipeline is a two-stage model that operates on CLIP's joint text-and-image embedding space:[1][2]
| Stage | Component | Function |
|---|---|---|
| CLIP encoding | Frozen CLIP text encoder | Converts the text prompt into a CLIP text embedding |
| Stage 1 | Prior | Maps the CLIP text embedding to a CLIP image embedding |
| Stage 2 | Decoder | Generates an image conditioned on the CLIP image embedding |
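The two stages in the table above correspond to a factorization of the text-conditional image distribution that the unCLIP paper presents. As a brief restatement, write $x$ for an image, $y$ for its caption, and $z_i$ for the CLIP image embedding of $x$ (the symbols follow the paper's notation and do not appear elsewhere in this article):

$$
P(x \mid y) \;=\; P(x, z_i \mid y) \;=\; P(x \mid z_i, y)\, P(z_i \mid y)
$$

The first equality holds because $z_i$ is a deterministic function of $x$, and the second is the chain rule. The factor $P(z_i \mid y)$ is the prior, and $P(x \mid z_i, y)$ is the decoder, which may optionally also see the caption.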
For the prior, the authors experimented with both an autoregressive model and a diffusion model, and reported that the diffusion prior was more computationally efficient and produced higher-quality samples. The diffusion prior is a Transformer with a width of 2048 across 24 blocks.[1]
The decoder is a diffusion model that generates a 64 by 64 pixel image. The paper states that the decoder uses "the 3.5 billion parameter GLIDE model, with the same architecture and diffusion hyperparameters," modified to condition on CLIP image embeddings rather than directly on text.[1] Two cascaded diffusion upsampler models then increase the resolution, first from 64 by 64 to 256 by 256 and then from 256 by 256 to 1024 by 1024. Neither upsampler uses attention.[1] The final 1024 by 1024 output has four times the linear resolution of the original DALL·E's 256 by 256 images (and 16 times as many pixels), a comparison OpenAI summarized as "4x greater resolution."[3][5]
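The data flow described above (text embedding, prior, 64 by 64 decoder, then two 4x upsamplers) can be sketched with toy stand-ins. This is only an illustration of the stage ordering and the resolutions involved, not OpenAI's implementation: every function below is a hypothetical placeholder, the 8-dimensional embedding size is an arbitrary assumption, and nearest-neighbour resizing stands in for the diffusion upsamplers.

```python
import math
import random

EMBED_DIM = 8   # toy embedding size (assumption, chosen only for illustration)
BASE_RES = 64   # decoder output resolution, as stated in the text
UPSAMPLE = 4    # each upsampler multiplies the side length by 4


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def encode_text(prompt):
    """Hypothetical stand-in for the frozen CLIP text encoder."""
    r = random.Random(prompt)  # deterministic per prompt
    return normalize([r.gauss(0, 1) for _ in range(EMBED_DIM)])


def prior(text_emb, rng):
    """Hypothetical stand-in for Stage 1: text embedding -> image embedding."""
    return normalize([x + 0.1 * rng.gauss(0, 1) for x in text_emb])


def decoder(image_emb, rng):
    """Hypothetical stand-in for Stage 2: embedding -> 64x64 grayscale grid."""
    tone = 0.5 + 0.5 * image_emb[0]  # map a value in [-1, 1] to [0, 1]
    return [
        [min(1.0, max(0.0, tone + 0.05 * rng.gauss(0, 1))) for _ in range(BASE_RES)]
        for _ in range(BASE_RES)
    ]


def upsample(image, factor=UPSAMPLE):
    """Nearest-neighbour resize standing in for a diffusion upsampler."""
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def generate(prompt, seed=0):
    """Run the toy pipeline and return the image at each resolution."""
    rng = random.Random(seed)
    image_emb = prior(encode_text(prompt), rng)
    base = decoder(image_emb, rng)  # 64 x 64
    mid = upsample(base)            # 256 x 256
    full = upsample(mid)            # 1024 x 1024
    return base, mid, full


base, mid, full = generate("a red cube on a blue cube")
print([(len(img), len(img[0])) for img in (base, mid, full)])
# [(64, 64), (256, 256), (1024, 1024)]
```

Calling `generate` again with a different `seed` reuses the same text embedding but draws a different image embedding and different decoder noise, which mirrors how the real system can return several distinct images for one prompt.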
Because the decoder is conditioned on a CLIP image embedding rather than on the pixels of a particular image, the same embedding can be decoded multiple times to produce different but semantically consistent outputs. This mechanism underlies the model's image-variation feature, in which DALL-E 2 generates new images that preserve the content and style of an input while varying incidental details. Operating in CLIP's embedding space also lets the model interpolate between two images or apply text-guided edits in a zero-shot manner.[1][2]
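For interpolation between two images, the unCLIP paper describes rotating between their CLIP image embeddings with spherical interpolation and decoding the intermediate embeddings. A minimal sketch of that interpolation step follows, using two-dimensional toy vectors in place of real CLIP embeddings. The function name and the example vectors are illustrative assumptions, and the decoding step is omitted.

```python
import math


def slerp(a, b, t):
    """Spherically interpolate between unit vectors a and b at fraction t.

    Assumes a and b are unit length and not antipodal.
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)  # angle between the two embeddings
    if theta < 1e-8:        # nearly identical directions: nothing to rotate
        return list(a)
    wa = math.sin((1 - t) * theta) / math.sin(theta)
    wb = math.sin(t * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]


z1 = [1.0, 0.0]  # toy "embedding" of the first image
z2 = [0.0, 1.0]  # toy "embedding" of the second image

# Intermediate embeddings stay on the unit sphere, unlike a straight-line blend.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    z = slerp(z1, z2, t)
    print(t, [round(c, 4) for c in z], round(math.hypot(*z), 4))
```

Each intermediate vector would then be passed to the decoder, so the sequence of decoded images moves gradually from the content of the first image to that of the second.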
The unCLIP design placed DALL-E 2 within a broader 2022 wave of diffusion-based text-to-image systems, alongside Google's Imagen and the open-source Stable Diffusion, which emerged later the same year.[2]
DALL-E 2's core capability is generating images from natural-language descriptions, including photorealistic photographs, paintings, illustrations, and other styles, and combining concepts, attributes, and styles that may not co-occur in its training data.[3] Beyond text-to-image generation, the system offered several editing tools:
| Feature | Description | Availability |
|---|---|---|
| Text-to-image | Generates images from a written prompt | April 2022 (research); July 2022 (beta) |
| Inpainting (Edit) | Edits a region of an existing image using a prompt, matching style, lighting, and shadows | At launch |
| Image variations | Produces alternative versions of an uploaded or generated image | At launch |
| Outpainting | Extends an image beyond its original borders, adding new content in the same style | August 31, 2022 |
Outpainting, announced on August 31, 2022, let users expand an image past its frame, for example extending a painting into a larger scene while maintaining its visual style and direction.[7]
The system also had well-documented limitations. It struggled with binding attributes to the correct objects (for instance, reliably distinguishing "a red cube on a blue cube" from the reverse), with rendering coherent text within images, and with prompts involving many objects, negations, or complex spatial relationships.[2]
OpenAI followed a staged rollout. At the April 6, 2022 research announcement, access was limited to a small group of trusted testers.[3] On July 20, 2022, OpenAI moved DALL-E 2 into a paid beta and began expanding access toward roughly one million people on its waitlist. Alongside the beta it introduced a credit-based pricing model: users received 50 free credits in their first month and 15 free credits each subsequent month, with additional credits sold in batches (115 credits for US$15). A single credit generated four images from a prompt, or three images for an edit or variation. The beta also granted users full commercial rights to the images they created, including the right to reprint, sell, and merchandise them.[8]
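The beta pricing figures above imply a per-image cost that is easy to work out. The short calculation below uses only the numbers quoted in this section; the constant and function names are illustrative.

```python
PACK_PRICE_USD = 15.0             # price of one paid credit pack
PACK_CREDITS = 115                # credits in one paid pack
IMAGES_PER_GENERATION_CREDIT = 4  # images per credit for a text prompt
IMAGES_PER_EDIT_CREDIT = 3        # images per credit for an edit or variation
FREE_CREDITS_FIRST_MONTH = 50
FREE_CREDITS_LATER_MONTHS = 15


def cost_per_image(images_per_credit):
    """Return the paid cost in US dollars of a single image."""
    return PACK_PRICE_USD / PACK_CREDITS / images_per_credit


# Images obtainable from one paid pack when used only for text prompts.
print(PACK_CREDITS * IMAGES_PER_GENERATION_CREDIT)                # 460

# Approximate paid cost per image, rounded to four decimal places.
print(round(cost_per_image(IMAGES_PER_GENERATION_CREDIT), 4))     # 0.0326
print(round(cost_per_image(IMAGES_PER_EDIT_CREDIT), 4))           # 0.0435

# Free prompt-generated images in the first month and in later months.
print(FREE_CREDITS_FIRST_MONTH * IMAGES_PER_GENERATION_CREDIT)    # 200
print(FREE_CREDITS_LATER_MONTHS * IMAGES_PER_GENERATION_CREDIT)   # 60
```

In other words, a US$15 pack covered up to 460 prompt-generated images at roughly 3.3 cents each, while edits and variations cost about 4.3 cents per image.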
On September 28, 2022, OpenAI removed the waitlist entirely, opening DALL-E 2 to anyone who wished to sign up. The company said that more than 1.5 million users were then actively creating with the system, generating over 2 million images per day, with about 100,000 users sharing work and feedback in OpenAI's Discord community.[9] On November 3, 2022, OpenAI released the DALL·E API in public beta, allowing developers to integrate image generation into their own applications and products.[10]
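As a rough illustration of what integrating the API involved, the sketch below builds (but deliberately does not send) an HTTP request for the image-generation endpoint. It reflects the endpoint as it was later documented for the `dall-e-2` model, which is an assumption about a service the source describes as deprecated. The endpoint path, the `model`, `prompt`, `n`, and `size` fields, the three size options, and the 1 to 10 range for `n` come from OpenAI's API reference rather than from this article, and the key shown is a placeholder.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/images/generations"
DALLE2_SIZES = {"256x256", "512x512", "1024x1024"}  # sizes documented for dall-e-2


def build_request(prompt, api_key, n=1, size="1024x1024"):
    """Build, without sending, a POST request asking for n images of a prompt."""
    if size not in DALLE2_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    payload = {"model": "dall-e-2", "prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("an astronaut riding a horse", api_key="sk-PLACEHOLDER")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending the request would require a valid key and a model that is still served. Given the removal date stated in this article, the example is best read as a record of the request shape rather than as working integration code.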
OpenAI published a system card for the DALL-E 2 preview in April 2022 documenting risks including bias and representation, mis- and disinformation, explicit content, economic effects, harassment and hate, and copyright and memorization.[11] To reduce harms, OpenAI filtered the training data to remove images with obvious violent, sexual, or hateful content, which it said reduced the model's ability to produce such material.[3][9]
The content policy prohibited generating sexual, violent, hateful, and other disallowed imagery. OpenAI also rejected uploads containing realistic human faces and blocked attempts to generate the likenesses of public figures, including political leaders and celebrities.[9] Each generated image carried a visible signature in the bottom-right corner consisting of five colored squares (in muted yellow, cyan, green, red, and blue) to mark it as DALL-E output, and OpenAI said it was continuing to explore watermarking and other image-provenance techniques.[11]
Like other systems trained on internet data, DALL-E 2 reflected social biases: neutral occupational prompts often produced outputs skewed by gender, race, and other attributes. In July 2022 OpenAI announced a technique that, for prompts not specifying demographics, would steer outputs toward greater diversity by adjusting how prompts were applied internally.[11]
DALL-E 2 was succeeded by DALL·E 3, which OpenAI announced in 2023, and later by the gpt-image-1 family of image models. On November 14, 2025, OpenAI notified developers that the DALL·E model snapshots were being deprecated. According to OpenAI's API deprecation documentation, both dall-e-2 and dall-e-3 are scheduled to be removed from the API on May 12, 2026, with the gpt-image-2, gpt-image-1, and gpt-image-1-mini models recommended as replacements.[4]
DALL-E 2 was one of the systems that brought text-to-image generation to mainstream public attention in 2022. Its unCLIP architecture demonstrated that conditioning a diffusion decoder on CLIP image embeddings could improve the diversity of generated images while retaining photorealism and adherence to the prompt, and its staged public rollout, content policies, and bias-mitigation efforts influenced how subsequent generative-image systems were released and governed.[1][2][9]