DALL-E
Last reviewed
Jun 1, 2026
Sources
22 citations
Review status
Source-backed
Revision
v6 · 4,722 words
DALL-E is a family of artificial intelligence (AI) image generation models developed by OpenAI that create images from text descriptions.[2][13] The name is a portmanteau of the surrealist artist Salvador Dalí and WALL-E, the Pixar character.[13] Since the original model's introduction in January 2021, DALL-E has gone through three major versions and has played a central role in popularizing text-to-image synthesis as a consumer and developer tool.[6]
The original DALL-E used a transformer-based architecture with 12 billion parameters, drawing heavily on GPT-3.[6] DALL-E 2, released in April 2022, replaced the autoregressive approach with a diffusion model conditioned on CLIP embeddings.[1][9] DALL-E 3, launched in October 2023, focused on improved prompt following through a caption-improvement training methodology and was natively integrated into ChatGPT.[14] In March 2025, OpenAI began transitioning away from the DALL-E brand with the release of GPT-4o native image generation, and the DALL-E 2 and DALL-E 3 APIs are scheduled for deprecation on May 12, 2026.[15]
The table below summarizes the three major DALL-E versions along with their successor models.
| Version | Release Date | Architecture | Parameters | Max Resolution | Key Features |
|---|---|---|---|---|---|
| DALL-E 1 | January 2021 | Autoregressive transformer + discrete VAE | 12 billion | 256 x 256 | Zero-shot text-to-image generation; first large-scale text-to-image model |
| DALL-E 2 | April 2022 | CLIP + diffusion model (unCLIP) | 3.5 billion (+ 1.5B upsampler) | 1024 x 1024 | Inpainting; outpainting; variations; 4x higher resolution than DALL-E 1 |
| DALL-E 3 | October 2023 | Diffusion model + improved caption training | Not disclosed | 1024 x 1024, 1024 x 1792, 1792 x 1024 | ChatGPT integration; prompt rewriting; improved text rendering; HD quality option |
| GPT Image 1 | March/April 2025 | Autoregressive (native multimodal) | Not disclosed | Variable | Native ChatGPT integration; image-to-image transformation; precise text rendering |
| GPT Image 1.5 | December 2025 | Autoregressive (native multimodal) | Not disclosed | Variable | 4x faster generation; precision editing; improved small text handling |
OpenAI announced DALL-E on January 5, 2021, alongside CLIP (Contrastive Language-Image Pre-training).[13] The model was described in the paper "Zero-Shot Text-to-Image Generation" by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.[6] DALL-E built on the success of GPT-2 and GPT-3, applying autoregressive text generation techniques to image synthesis. Where GPT-3 predicted the next token in a text sequence, DALL-E predicted the next token in a combined text-and-image sequence.[6]
DALL-E 1's architecture consisted of two primary components working together:
Discrete Variational Autoencoder (dVAE). The dVAE compressed each 256 x 256 pixel image into a 32 x 32 grid of discrete tokens, yielding 1,024 image tokens drawn from a codebook vocabulary of 8,192 entries.[6] This compression was essential because modeling raw pixels directly would have meant a sequence of 196,608 values per image (256 x 256 pixels x 3 color channels), making autoregressive training computationally infeasible.
Autoregressive Transformer. The core of DALL-E was a 12-billion-parameter decoder-only transformer similar in architecture to GPT-3.[6] It had 64 self-attention layers, each with 62 attention heads. Text captions were encoded using Byte Pair Encoding (BPE) with a vocabulary size of 16,384, producing up to 256 text tokens. These text tokens were concatenated with the 1,024 image tokens from the dVAE, forming a single sequence of up to 1,280 tokens.[6] The transformer was then trained to model this sequence autoregressively, predicting each token based on all preceding tokens.
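The token counts quoted for the two components can be checked with a few lines of arithmetic. The following is a minimal sketch, not OpenAI's code: the constants come from the figures above, while `build_sequence`, the `pad_id` value, and the choice to offset image token ids past the text vocabulary are illustrative assumptions.

```python
# Figures quoted in the text for DALL-E 1.
IMAGE_SIZE = 256          # input image is 256 x 256 pixels
GRID_SIZE = 32            # dVAE output grid is 32 x 32 tokens
CODEBOOK_SIZE = 8192      # image-token vocabulary
TEXT_VOCAB_SIZE = 16384   # BPE text vocabulary
MAX_TEXT_TOKENS = 256     # maximum caption length in tokens


def sequence_layout():
    """Derive the sizes of the combined text-and-image sequence."""
    image_tokens = GRID_SIZE * GRID_SIZE
    raw_values = IMAGE_SIZE * IMAGE_SIZE * 3      # RGB values per image
    return {
        "downsample_factor": IMAGE_SIZE // GRID_SIZE,
        "image_tokens": image_tokens,
        "total_tokens": MAX_TEXT_TOKENS + image_tokens,
        "compression_ratio": raw_values / image_tokens,
    }


def build_sequence(text_tokens, image_tokens, pad_id=0):
    """Concatenate padded text tokens with image tokens.

    Assumptions (not from the paper): a single pad_id for unused text
    positions, and image ids shifted past the text vocabulary so the
    two vocabularies do not collide.
    """
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds the text-token budget")
    if len(image_tokens) != GRID_SIZE * GRID_SIZE:
        raise ValueError("expected one token per grid cell")
    padded_text = list(text_tokens) + [pad_id] * (MAX_TEXT_TOKENS - len(text_tokens))
    shifted_image = [TEXT_VOCAB_SIZE + tok for tok in image_tokens]
    return padded_text + shifted_image
```

Calling `sequence_layout()` reproduces the 1,024 image tokens and the 1,280-token total, and shows that each image token stands in for 192 raw color values.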
The model was pre-trained on approximately 250 million text-image pairs sourced from the internet.[6] During inference, DALL-E generated candidate images by sampling token sequences conditioned on a text prompt, and CLIP was used to rank and select the best results from a batch of generated candidates.[6]
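The CLIP reranking step can be illustrated with plain cosine similarity. This is a sketch under stated assumptions: the embeddings are taken as already computed (no CLIP model is loaded here), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, candidate_embeddings, top_k=1):
    """Return indices of the top_k candidates most similar to the text.

    candidate_embeddings holds one image embedding per generated sample;
    higher similarity to the prompt embedding ranks first.
    """
    scored = [
        (cosine_similarity(text_embedding, emb), idx)
        for idx, emb in enumerate(candidate_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

For example, `rerank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]], top_k=2)` places the second candidate first, because it points in nearly the same direction as the prompt embedding.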
DALL-E 1 demonstrated a striking ability for zero-shot image generation. Given novel text prompts such as "an armchair in the shape of an avocado" or "a professional high-quality illustration of a baby daikon radish in a tutu walking a dog," the model produced plausible images without having seen these specific combinations during training.[6][13] As Singh et al. (2021) observed, this compositional generalization, combining familiar concepts in new ways, represents a form of imagination that is central to human intelligence.[5]
However, the model had clear limitations. Resolution was capped at 256 x 256 pixels. It struggled with compositional prompts involving attribute binding (distinguishing "a yellow book and a red vase" from "a red book and a yellow vase"), negation, precise counts of more than three objects, and complex spatial relationships.[6][5]
OpenAI introduced DALL-E 2 on April 6, 2022.[1] The system entered public beta in July 2022, and in September 2022, the waitlist was removed, making the service available to anyone.[4] By November 2022, when the API launched, more than 3 million users were generating over 4 million images per day.[8]
DALL-E 2 represented a fundamental shift in architecture. Rather than using an autoregressive transformer, it employed a two-stage diffusion process conditioned on CLIP embeddings, an approach that OpenAI internally called "unCLIP."[9][3]
The unCLIP architecture has three main components:
1. CLIP Encoder. CLIP consists of two neural network branches, a text encoder and an image encoder, trained jointly on hundreds of millions of image-caption pairs using a contrastive objective.[9] The training maximizes the cosine similarity between correctly paired text and image embeddings while minimizing similarity for incorrect pairs. This produces a shared representation space where semantically related text and images are close together.[3]
2. Prior Model. The prior translates a CLIP text embedding into a corresponding CLIP image embedding. DALL-E 2 used a diffusion-based prior implemented as a decoder-only transformer with a causal attention mask. It accepted tokenized text, CLIP text encodings, a diffusion timestep encoding, and noised CLIP image embeddings as input, and it output a predicted unnoised CLIP image embedding.[9]
3. Decoder (Modified GLIDE). The image decoder was a modified version of GLIDE, an earlier OpenAI diffusion model. It took the CLIP image embedding produced by the prior and iteratively denoised a sample of Gaussian noise into a 64 x 64 pixel image. Two cascaded upsampling diffusion models then increased resolution first to 256 x 256 and then to 1,024 x 1,024 pixels.[9][3]
The full generation pipeline thus ran as follows: a text prompt was encoded by CLIP's text encoder into a text embedding; the prior model mapped this text embedding to a CLIP image embedding; and the decoder generated the final image from this image embedding through iterative denoising.
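The contrastive objective described for the CLIP encoder can be sketched as a symmetric cross-entropy over a matrix of scaled cosine similarities. This is an illustrative pure-Python version, not CLIP's implementation; the fixed `temperature` of 0.07 is an assumption, and the function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched (text, image) pairs.

    Row k of text_embs is assumed to be paired with row k of image_embs.
    """
    texts = [normalize(t) for t in text_embs]
    images = [normalize(i) for i in image_embs]
    n = len(texts)
    logits = [
        [sum(a * b for a, b in zip(t, i)) / temperature for i in images]
        for t in texts
    ]
    # Text-to-image direction: each row should peak on its own column.
    loss_t2i = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Image-to-text direction: each column should peak on its own row.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_i2t = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_t2i + loss_i2t) / 2
```

The loss is near zero when each text embedding lines up with its own image embedding and grows when the pairs are mismatched, which is what pulls related text and images together in the shared space.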
DALL-E 2 used approximately 3.5 billion parameters for its primary model, with an additional 1.5 billion parameters for the resolution-enhancing upsamplers.[3] Despite having fewer parameters than DALL-E 1's 12 billion, DALL-E 2 produced images at four times the resolution per side (1,024 versus 256 pixels, or 16 times as many pixels) with significantly improved realism and accuracy.[1][11]
DALL-E 2 introduced two important editing features:
Inpainting allowed users to select a region within an existing image and fill it with new AI-generated content guided by a text prompt. The model adapted new objects to match the style, lighting, shadows, and textures present in the original image.[11]
Outpainting extended an image beyond its original borders, generating new content that was consistent with the existing scene's perspective, shadows, reflections, and textures. This enabled creation of larger images and different aspect ratios from a starting composition.[10]
Marcus et al. (2022) conducted a systematic evaluation of DALL-E 2 and reported several observations:[12]
On July 20, 2022, OpenAI announced that users would receive full usage rights to commercialize images they created with DALL-E 2, including the right to reprint, sell, and merchandise them.[7] OpenAI retained the right to commercialize user-created images as well.[7]
OpenAI announced DALL-E 3 in September 2023 and began rolling it out in October 2023.[14] The most significant change was native integration with ChatGPT. Unlike previous versions where users typed prompts directly into an image generation interface, DALL-E 3 was accessed through ChatGPT.[14] Users described what they wanted in natural conversation, and ChatGPT automatically expanded brief requests into detailed prompts optimized for image generation. This approach effectively eliminated the need for users to learn prompt engineering techniques specific to image models.[14]
When given a request, ChatGPT typically generated multiple detailed prompt variations, each producing a different image. Users could then refine results through continued conversation, asking ChatGPT to adjust colors, compositions, styles, or specific elements.
The technical paper behind DALL-E 3, titled "Improving Image Generation with Better Captions" by James Betker, Gabriel Goh, Li Jing, and colleagues, identified a root cause of poor prompt following in earlier text-to-image models: the low quality of text-image pair captions in training datasets.[14] Most internet-sourced captions are short, vague, or inaccurate descriptions of the images they accompany.
To address this, OpenAI trained a custom image captioner jointly with a CLIP and language modeling objective. This captioner produced long, highly descriptive captions covering the main subject, surroundings, background, visible text, artistic style, and coloration of each training image. The training dataset was then recaptioned using this model.[14]
During DALL-E 3 training, a regularization technique randomly selected between the synthetic caption (95% of the time) and the original ground-truth caption (5% of the time) for each sample. This hybrid approach prevented overfitting to the captioner's specific patterns while still delivering the benefits of more descriptive training data.[14]
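The 95/5 caption mixing can be sketched as a per-sample random choice. The function names and the seeded generator are illustrative assumptions; only the mixing ratio comes from the text.

```python
import random


def choose_caption(synthetic, original, synthetic_prob=0.95, rng=random):
    """Pick the synthetic caption with probability synthetic_prob."""
    return synthetic if rng.random() < synthetic_prob else original


def sample_captions(pairs, synthetic_prob=0.95, seed=0):
    """Choose one caption for each (synthetic, original) pair.

    A seeded generator keeps the sketch reproducible; a real training
    loop would draw fresh randomness every time a sample is visited.
    """
    rng = random.Random(seed)
    return [choose_caption(s, o, synthetic_prob, rng) for s, o in pairs]
```

Over many samples, roughly one caption in twenty is the original, so the model still sees the style of human-written captions it will meet at inference time.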
One of DALL-E 3's most notable improvements was its ability to render readable text within generated images.[14] Earlier models, including DALL-E 2 and Midjourney, frequently produced garbled or illegible text in signs, labels, and other contexts. DALL-E 3 achieved substantially better text rendering, with accuracy estimated at approximately 95% for common text-in-image scenarios. This improvement came partly from using larger text encoders with character-level understanding.[14]
DALL-E 3 supported three resolution options: 1024 x 1024 (square), 1024 x 1792 (portrait), and 1792 x 1024 (landscape). It also offered two quality tiers, standard and HD.
A style parameter allowed users to choose between "vivid" (more dramatic, hyper-real images) and "natural" (less stylized, more photographic results).[20]
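The resolution, quality, and style options can be collected into a single request. The helper below is a hypothetical sketch that only builds and validates a parameter dictionary; it does not call any API. The field names and values (`size`, `quality`, `style`, `"hd"`, `"vivid"`) are assumed to match OpenAI's Images API and should be checked against the current API reference.

```python
# Option values as described in the text; treated as assumptions about the API.
DALLE3_SIZES = {"1024x1024", "1024x1792", "1792x1024"}
DALLE3_QUALITIES = {"standard", "hd"}
DALLE3_STYLES = {"vivid", "natural"}


def build_dalle3_request(prompt, size="1024x1024", quality="standard", style="vivid"):
    """Return a parameter dict for a DALL-E 3 image request.

    Raises ValueError for options outside the sets listed above.
    """
    if size not in DALLE3_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in DALLE3_QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    if style not in DALLE3_STYLES:
        raise ValueError(f"unsupported style: {style}")
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "style": style,
        "n": 1,  # one image per request in this sketch
    }
```

With the OpenAI Python SDK, a dictionary like this would typically be passed as keyword arguments to the image generation call, for as long as the model remains available.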
The ChatGPT integration included an automatic prompt rewriting system that served both quality and safety purposes. ChatGPT transformed user requests into detailed prompts that improved generation quality while simultaneously checking for potential content policy violations.[14] If a request appeared to violate OpenAI's guidelines, the prompt transformation could modify it to remove the problematic elements. This system was tested against 500 synthetic prompts and reportedly reduced generation of identifiable public figures to zero when explicitly requested by name.[14]
DALL-E 3 also refused to generate images in the style of living artists, addressing concerns from the creative community about unauthorized style replication.[14]
DALL-E was developed and announced alongside CLIP, a Contrastive Language-Image Pre-training model. While these two models serve different purposes, they are deeply interconnected.
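CLIP was trained on 400 million image-text pairs scraped from the internet. It learned to predict which caption best matches a given image from a list of thousands of candidates.[13] In the context of the DALL-E family, CLIP served two roles: in DALL-E 1 it ranked generated candidates so that the best matches to the prompt could be selected,[6] and in DALL-E 2 its text and image embeddings conditioned the prior and the decoder.[9]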
This relationship is what gave the DALL-E 2 architecture its "unCLIP" name: CLIP's image encoder maps an image to an embedding, while the DALL-E 2 decoder generates an image from such an embedding, inverting the CLIP encoding process.
On March 25, 2025, OpenAI released native image generation capabilities in GPT-4o, marking a departure from the DALL-E approach.[15] Unlike DALL-E 2 and 3, which were separate diffusion models invoked by ChatGPT as external tools, GPT-4o's image generation is built directly into the language model. The model is natively multimodal, processing text and images within the same neural network rather than delegating to a specialized image generation system.[15]
Key improvements over DALL-E 3 included image-to-image transformation and more precise text rendering.
The underlying model was made available to developers as "gpt-image-1" via the API on April 23, 2025.[16]
OpenAI released GPT Image 1.5 on December 16, 2025, as the next iteration of its image generation capabilities. It was simultaneously rolled out in ChatGPT (branded as "ChatGPT Images") and made available through the API.[17]
Notable improvements included generation up to four times faster, precision editing, and better handling of small text.
On November 14, 2025, OpenAI announced that DALL-E 2 and DALL-E 3 model snapshots would be deprecated and removed from the API on May 12, 2026.[17] Developers were directed to migrate to GPT Image 1 or GPT Image 1.5. The DALL-E brand has effectively been retired in favor of the GPT Image product line, though the DALL-E models continue to function via the API through the deprecation date.[17]
OpenAI has offered API access for image generation across multiple model generations. The following table summarizes pricing as of early 2026.
| Model | Quality | Resolution | Price per Image |
|---|---|---|---|
| DALL-E 2 (legacy) | Standard | 1024 x 1024 | $0.020 |
| DALL-E 2 (legacy) | Standard | 512 x 512 | $0.018 |
| DALL-E 2 (legacy) | Standard | 256 x 256 | $0.016 |
| DALL-E 3 (deprecated) | Standard | 1024 x 1024 | $0.040 |
| DALL-E 3 (deprecated) | Standard | 1024 x 1792 or 1792 x 1024 | $0.080 |
| DALL-E 3 (deprecated) | HD | 1024 x 1024 | $0.080 |
| DALL-E 3 (deprecated) | HD | 1024 x 1792 or 1792 x 1024 | $0.120 |
| GPT Image 1 | Low | 1024 x 1024 | $0.011 |
| GPT Image 1 | Medium | 1024 x 1024 | $0.042 |
| GPT Image 1 | High | 1024 x 1024 | $0.167 |
GPT Image 1.5 uses token-based pricing ($8.00 per million input tokens, $32.00 per million output tokens) rather than per-image pricing, and its image inputs and outputs are 20% cheaper than GPT Image 1.[21][17]
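The two pricing schemes can be compared with a small cost estimator. This sketch copies a subset of the per-image prices from the table above (square resolutions only) and the token rates quoted for GPT Image 1.5. The lowercase model identifiers and function names are illustrative, and the article does not state how many tokens one image consumes, so any token counts passed in are hypothetical.

```python
# (model, quality, resolution) -> USD per image, from the table above.
PER_IMAGE_PRICE_USD = {
    ("dall-e-2", "standard", "1024x1024"): 0.020,
    ("dall-e-3", "standard", "1024x1024"): 0.040,
    ("dall-e-3", "hd", "1024x1024"): 0.080,
    ("gpt-image-1", "low", "1024x1024"): 0.011,
    ("gpt-image-1", "medium", "1024x1024"): 0.042,
    ("gpt-image-1", "high", "1024x1024"): 0.167,
}


def per_image_cost(model, quality, resolution, count=1):
    """Cost in USD for `count` images under per-image pricing."""
    key = (model, quality, resolution)
    if key not in PER_IMAGE_PRICE_USD:
        raise KeyError(f"no price listed for {key}")
    return PER_IMAGE_PRICE_USD[key] * count


def token_cost(input_tokens, output_tokens, input_rate=8.00, output_rate=32.00):
    """Cost in USD under token pricing; rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```

For instance, `per_image_cost("dall-e-3", "hd", "1024x1024", count=10)` comes to about $0.80, while `token_cost(1000, 5000)` comes to about $0.168 for a hypothetical request with 1,000 input and 5,000 output tokens.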
For ChatGPT subscribers, image generation is included in their subscription at no additional per-image cost, subject to usage limits that vary by plan tier (Free, Plus, Pro, Team, Enterprise, and Edu).
OpenAI maintains a comprehensive content policy for all DALL-E and GPT Image models. Prohibited content categories include:
The policy specifically blocks the generation of photorealistic images of identifiable real people, including public figures, to prevent deepfake creation.[14] DALL-E 3 and its successors also refuse to generate images in the style of living artists.[14]
OpenAI has removed violent, sexual, and otherwise harmful content from the training datasets used for DALL-E models. Text prompt filters block requests that violate the content policy before they reach the model, and output classifiers scan generated images for policy violations before they are returned to users.
Beginning with DALL-E 3, OpenAI implemented C2PA (Coalition for Content Provenance and Authenticity) metadata in all generated images.[18] This metadata includes the Content Credentials logo (CR) and embedded information about the time and date of creation and the AI-generated nature of the image. Users can verify whether an image was generated by ChatGPT or the DALL-E API using tools such as Content Credentials Verify.[18]
In May 2024, OpenAI joined the C2PA as a steering committee member, alongside Adobe, BBC, Intel, Microsoft, Google, Publicis Groupe, Sony, and Truepic.[19] All GPT-4o-generated images also include C2PA metadata.[15]
However, C2PA metadata has limitations. It is stored as file metadata rather than being embedded directly into the image pixels, meaning it can be stripped by a simple file conversion or metadata removal tool. This makes it an imperfect solution for tracking the provenance of AI-generated images shared across the internet.
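The fragility of file-level metadata can be shown with a tiny PNG. The sketch below builds a 1 x 1 image that carries a text chunk, then rewrites the file keeping only the critical chunks. The `tEXt` chunk merely stands in for provenance data; how C2PA manifests are actually embedded is not modelled here, and all function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Encode one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for each chunk after the 8-byte signature."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def strip_ancillary_chunks(png):
    """Keep only critical chunks, whose type starts with an uppercase letter."""
    kept = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type[:1].isupper():
            kept.append(make_chunk(chunk_type, data))
    return b"".join(kept)


def make_demo_png():
    """A 1 x 1 red RGB image with a stand-in metadata chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)   # 8-bit RGB, no interlace
    pixels = zlib.compress(b"\x00\xff\x00\x00")           # filter byte + one pixel
    return b"".join([
        PNG_SIGNATURE,
        make_chunk(b"IHDR", ihdr),
        make_chunk(b"tEXt", b"Provenance\x00example-generator"),
        make_chunk(b"IDAT", pixels),
        make_chunk(b"IEND", b""),
    ])
```

After `strip_ancillary_chunks(make_demo_png())`, the compressed pixel data in the `IDAT` chunk is byte-for-byte identical, but the metadata chunk is gone, which is the weakness the paragraph above describes.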
DALL-E and its successors have found use across a wide range of professional and creative domains.
Businesses use DALL-E to produce custom images for advertising campaigns, social media content, websites, and email marketing without relying on stock photography. Marketing teams can input specific creative briefs and receive tailored visuals in seconds, reducing the cost and turnaround time of visual content production.
Designers use DALL-E for early-stage concept visualization. Product mockups can be generated quickly to explore design directions before investing in physical prototypes or detailed 3D models. Fashion design platform CALA integrated the DALL-E 2 API to let its users improve and iterate on design ideas through text prompts.[8]
The film, game, and animation industries use DALL-E for concept art, storyboarding, and visual development. Artists use it to rapidly explore artistic directions, generate reference material, and communicate visual ideas to collaborators before committing to detailed production work.
Educators use DALL-E to generate visual aids for teaching, including diagrams of scientific phenomena, reconstructions of historical scenes, language-learning flashcards, and illustrations of abstract concepts. The tool makes it possible to create customized educational imagery that precisely matches lesson content.
Publishers use DALL-E for book covers, magazine illustrations, article headers, and other editorial imagery. The ability to generate custom illustrations on demand reduces dependence on stock photography and commissioned artwork for routine publishing needs.
DALL-E is increasingly used in combination with other AI tools. For example, an image generated by DALL-E can be animated and given voice using D-ID's AI-generated text-to-video technology. A landscape created in one tool can become an opening shot of a video, accompanied by music composed by an AI music model. Gil Perry, CEO of D-ID, has noted that "people are layering different AI tools to produce even more creative content."
DALL-E competes with several other AI image generation systems. The following table compares the major platforms as of early 2026.
| Platform | Developer | Model Type | Open Source | Key Strengths | Pricing Model |
|---|---|---|---|---|---|
| DALL-E 3 / GPT Image | OpenAI | Diffusion (DALL-E) / Autoregressive (GPT Image) | No | ChatGPT integration; text rendering; ease of use | API per-image/token; subscription |
| Midjourney | Midjourney Inc. | Proprietary diffusion | No | Artistic quality; aesthetic coherence; color harmony | Subscription ($10-$120/month) |
| Stable Diffusion | Stability AI | Latent diffusion | Yes | Full customization; local deployment; fine-tuning; no subscription needed | Free (self-hosted); API varies |
| Imagen | Google DeepMind | Diffusion (Imagen 4) | No | Photorealistic output; text rendering; integrated in Gemini | API per-image ($0.02-$0.06) |
| Flux | Black Forest Labs | Rectified flow transformer | Partially | Photorealism; high detail; fast inference | API and open-weight options |
Midjourney is widely regarded as the leader in artistic quality and aesthetic impact. Its V7 model, released in April 2025, is optimized for visual coherence, color harmony, and compositional balance. Midjourney is particularly strong for concept art, fantasy landscapes, and stylized portraits. It originally operated exclusively through Discord but has since launched a dedicated web interface.
Stable Diffusion, developed by Stability AI, is the most customizable option. As an open-source model, it can run on local hardware, be fine-tuned for specific styles or domains, and be integrated into custom workflows without subscription fees. Stable Diffusion 3.5, released in October 2024, brought significant improvements in image quality, prompt adherence, and text rendering, narrowing the gap with proprietary models.
Google's Imagen models are integrated into the Gemini platform. Imagen 4, available through Google Cloud, offers competitive photorealism and text rendering at lower per-image costs than OpenAI's highest-quality tiers. Google's Gemini 3 Pro (late 2025) includes enhanced image generation capabilities.
Flux, developed by Black Forest Labs (founded by former Stability AI researchers), uses a rectified flow transformer architecture. Flux Pro produces some of the most photorealistic AI-generated images available, with exceptional detail and lighting. Flux offers both open-weight models for self-hosting and commercial API access.
The rise of AI image generation has prompted debate about the future of human artistry. Critics argue that when anyone can generate compelling visuals from a text prompt, the value of traditional artistic skill is diminished. Proponents counter that these tools democratize visual creation, enabling people who lack formal art training, time, or physical ability to express visual ideas.
Copyright issues surrounding AI-generated images remain legally unsettled. OpenAI grants users commercial rights over images they create, but since users contribute only a text prompt and the images are machine-generated, the copyrightability of the output is unclear under current law.
In August 2022, Getty Images banned the upload and sale of images generated by DALL-E 2, Stable Diffusion, and other AI tools, citing "unaddressed rights issues" because training datasets contained copyrighted images.[22] Other platforms including Newgrounds, PurplePort, and FurAffinity enacted similar bans.[22] In contrast, Shutterstock announced a partnership with OpenAI to integrate DALL-E 2 for content generation.
The use of copyrighted material in training datasets remains the subject of ongoing litigation, with artists and rights holders arguing that training on their work without permission constitutes infringement.
AI image generators can be misused to create deceptive imagery, from deepfake portraits of public figures to fabricated photographic "evidence" of events that never occurred. OpenAI addresses this through content filters, the prohibition on generating identifiable real people, and C2PA metadata.[18] However, other image generation tools, particularly open-source ones like Stable Diffusion, have fewer restrictions and have already been used to create deepfakes of celebrities.
Legislative responses have emerged worldwide. In May 2025, the TAKE IT DOWN Act was signed into law in the United States, the first major federal statute targeting non-consensual intimate imagery, including AI-generated deepfakes. The EU AI Act (Regulation (EU) 2024/1689) established mandatory labeling requirements for AI-generated content.
The following timeline traces the key milestones in DALL-E's development and the broader evolution of OpenAI's image generation capabilities.
| Date | Event |
|---|---|
| 2019 | GPT-2 released, demonstrating large-scale autoregressive text generation with 1.5 billion parameters |
| June 2020 | GPT-3 released with 175 billion parameters, providing the architectural foundation for DALL-E |
| January 2021 | DALL-E and CLIP announced simultaneously |
| April 2022 | DALL-E 2 introduced with unCLIP architecture and 1024 x 1024 resolution |
| July 2022 | DALL-E 2 enters public beta; commercial usage rights granted to users |
| August 2022 | Getty Images bans AI-generated content uploads |
| September 2022 | DALL-E 2 waitlist removed; open access begins |
| November 2022 | DALL-E 2 API launched; 3 million users generating 4 million images daily |
| September 2023 | DALL-E 3 announced with ChatGPT integration |
| October 2023 | DALL-E 3 rolls out to ChatGPT Plus and Enterprise users |
| November 2023 | DALL-E 3 API released alongside text-to-speech models |
| February 2024 | C2PA metadata added to DALL-E 3 API-generated images |
| May 2024 | OpenAI joins C2PA steering committee |
| March 2025 | GPT-4o native image generation replaces DALL-E 3 in ChatGPT |
| April 2025 | gpt-image-1 API released for developers |
| November 2025 | DALL-E 2 and DALL-E 3 API deprecation announced (sunset: May 12, 2026) |
| December 2025 | GPT Image 1.5 released in ChatGPT and API |
Across its versions, the DALL-E family has used two fundamentally different approaches to image generation.
DALL-E 1 treated image generation as a sequence prediction task. Images were tokenized into discrete codes using a variational autoencoder, then concatenated with text tokens and fed into a transformer that predicted the next token in the sequence.[6] This is the same principle behind GPT-3's text generation, applied to a mixed text-image sequence.
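The next-token loop at the heart of this approach can be sketched with a stand-in model. Here `next_token_logits` is a hypothetical callable that plays the role of the transformer; everything else is ordinary sampling from a softmax distribution.

```python
import math
import random


def sample_next(logits, rng):
    """Sample one token id from the softmax distribution over logits."""
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate_image_tokens(text_tokens, next_token_logits, num_image_tokens, seed=0):
    """Autoregressively extend a text prefix with sampled image tokens.

    next_token_logits(sequence) must return one logit per vocabulary
    entry, conditioned on every token generated so far.
    """
    rng = random.Random(seed)
    sequence = list(text_tokens)
    for _ in range(num_image_tokens):
        sequence.append(sample_next(next_token_logits(sequence), rng))
    return sequence[len(text_tokens):]


def toy_logits(sequence, vocab_size=4):
    """Toy stand-in model: strongly prefers (last token + 1) mod vocab_size."""
    logits = [-1e9] * vocab_size
    logits[(sequence[-1] + 1) % vocab_size] = 0.0
    return logits
```

With the toy model, `generate_image_tokens([0], toy_logits, 4)` walks through the vocabulary in order. In DALL-E 1 the same loop would run for 1,024 steps, and the resulting token grid would then be decoded back to pixels by the dVAE.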
GPT Image 1 and 1.5 returned to an autoregressive approach but within a natively multimodal architecture. Rather than treating images and text as separate modalities that must be bridged, these models process both within a single neural network.[15]
DALL-E 2 and DALL-E 3 used diffusion models, which generate images by learning to reverse a gradual noising process. Training proceeds by adding noise to an image step by step until it becomes pure Gaussian noise, then training the model to reverse each step. At inference time, the model starts with random noise and iteratively denoises it into a coherent image, guided by the text prompt's embedding.
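The noising process and its reversal can be sketched in a few lines. This follows the standard denoising-diffusion formulation rather than anything specific to DALL-E: the linear beta schedule, its endpoints, and the 1,000-step count are common defaults assumed for illustration, and the data is a flat list of numbers rather than an image.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance left."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def predict_x0(xt, predicted_noise, t, alpha_bars):
    """Invert add_noise given a noise estimate (the network's job in practice)."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, predicted_noise)]
```

By the final step almost none of the original signal remains, which is why sampling can start from pure Gaussian noise. A trained model supplies the noise estimate that `predict_x0` takes as input, conditioned on the text prompt.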
In DALL-E 2, this guidance came through CLIP embeddings (the unCLIP architecture).[9] In DALL-E 3, the primary innovation was not in the diffusion architecture itself but in the quality of the training data captions that conditioned the diffusion process.[14]