DALL-E

Updated Jun 20, 2026

DALL-E is a family of artificial intelligence (AI) image generation models developed by OpenAI that create images from natural-language text descriptions.^[2]^[13] OpenAI introduced the original model on January 5, 2021, describing it as "a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs."^[6]^[13]^[23] The name is a portmanteau of the surrealist artist Salvador Dalí and WALL-E, the Pixar character.^[13] Since that introduction, DALL-E has gone through three major versions and has played a central role in popularizing text-to-image synthesis as a consumer and developer tool.^[6] When OpenAI released DALL-E 2 in April 2022, human evaluators preferred its output over the original DALL-E 88.8% of the time for photorealism and 71.7% of the time for caption matching, and the new model produced images at 4x greater resolution.^[1]^[24]

The original DALL-E used a transformer-based architecture with 12 billion parameters, drawing heavily on GPT-3.^[6] DALL-E 2, released in April 2022, replaced the autoregressive approach with a diffusion model conditioned on CLIP embeddings.^[1]^[9] DALL-E 3, launched in October 2023, focused on improved prompt following through a caption-improvement training methodology and was natively integrated into ChatGPT.^[14] In March 2025, OpenAI began transitioning away from the DALL-E brand with the release of GPT-4o native image generation, and the DALL-E 2 and DALL-E 3 APIs are scheduled for deprecation on May 12, 2026.^[15]

What are the DALL-E versions?

The table below summarizes the three major DALL-E versions along with their successor models.

Version	Release Date	Architecture	Parameters	Max Resolution	Key Features
DALL-E 1	January 2021	Autoregressive transformer + discrete VAE	12 billion	256 x 256	Zero-shot text-to-image generation; first large-scale text-to-image model
DALL-E 2	April 2022	CLIP + diffusion model (unCLIP)	3.5 billion (+ 1.5B upsampler)	1024 x 1024	Inpainting; outpainting; variations; 4x higher resolution than DALL-E 1
DALL-E 3	October 2023	Diffusion model + improved caption training	Not disclosed	1024 x 1024, 1024 x 1792, 1792 x 1024	ChatGPT integration; prompt rewriting; improved text rendering; HD quality option
GPT Image 1	March/April 2025	Autoregressive (native multimodal)	Not disclosed	Variable	Native ChatGPT integration; image-to-image transformation; precise text rendering
GPT Image 1.5	December 2025	Autoregressive (native multimodal)	Not disclosed	Variable	4x faster generation; precision editing; improved small text handling

DALL-E 1

When was DALL-E released?

OpenAI announced DALL-E on January 5, 2021, alongside CLIP (Contrastive Language-Image Pre-training).^[13]^[23] The model was described in the paper "Zero-Shot Text-to-Image Generation" by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.^[6] DALL-E built on the success of GPT-2 and GPT-3, applying autoregressive text generation techniques to image synthesis. Where GPT-3 predicted the next token in a text sequence, DALL-E predicted the next token in a combined text-and-image sequence.^[6]

Technical Architecture

DALL-E 1's architecture consisted of two primary components working together:

Discrete Variational Autoencoder (dVAE). The dVAE compressed each 256 x 256 pixel image into a 32 x 32 grid of discrete tokens, yielding 1,024 image tokens drawn from a codebook vocabulary of 8,192 entries.^[6] This compression was essential because modeling raw pixels would have required millions of tokens, making autoregressive training computationally infeasible.

Autoregressive Transformer. The core of DALL-E was a 12-billion-parameter decoder-only transformer similar in architecture to GPT-3.^[6] It had 64 self-attention layers, each with 62 attention heads. Text captions were encoded using Byte Pair Encoding (BPE) with a vocabulary size of 16,384, producing up to 256 text tokens. These text tokens were concatenated with the 1,024 image tokens from the dVAE, forming a single sequence of up to 1,280 tokens.^[6] The transformer was then trained to model this sequence autoregressively, predicting each token based on all preceding tokens.
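The token arithmetic in the two components above can be checked with a short sketch. The constants come from the figures in the text; the `build_sequence` helper and its id-offset scheme are hypothetical and only illustrate how text and image tokens could share one stream.

```python
# Figures taken from the description above.
IMAGE_SIDE = 256         # input image is 256 x 256 pixels
GRID_SIDE = 32           # dVAE output grid is 32 x 32 tokens
IMAGE_VOCAB = 8192       # dVAE codebook entries
TEXT_VOCAB = 16384       # BPE vocabulary size
MAX_TEXT_TOKENS = 256    # upper bound on caption length in tokens

IMAGE_TOKENS = GRID_SIDE * GRID_SIDE               # 1,024 image tokens
MAX_SEQUENCE = MAX_TEXT_TOKENS + IMAGE_TOKENS      # 1,280 tokens in total
PIXELS_PER_TOKEN = (IMAGE_SIDE // GRID_SIDE) ** 2  # each token covers an 8 x 8 patch


def build_sequence(text_ids, image_ids):
    """Concatenate caption tokens and image tokens into one stream.

    Assumption for illustration: image ids are shifted past the text
    vocabulary so the two token types never share an id.
    """
    if len(text_ids) > MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds 256 text tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("expected a full 32 x 32 grid of image tokens")
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB for i in image_ids):
        raise ValueError("image token id out of range")
    return list(text_ids) + [TEXT_VOCAB + i for i in image_ids]
```

Each image token stands in for 64 pixels, which is why the sequence stays short enough for autoregressive training.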

Training

The model was pre-trained on approximately 250 million text-image pairs sourced from the internet.^[6] During inference, DALL-E generated candidate images by sampling token sequences conditioned on a text prompt, and CLIP was used to rank and select the best results from a batch of generated candidates.^[6]
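The reranking step can be sketched as follows. This is a minimal illustration, assuming the text and each candidate image have already been embedded into a shared space; the function names and the use of plain cosine similarity as the score are assumptions, not OpenAI's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, candidates, top_k):
    """Return the ids of the top_k candidates that best match the text.

    candidates is a list of (image_id, image_embedding) pairs.
    """
    scored = sorted(
        candidates,
        key=lambda pair: cosine_similarity(text_embedding, pair[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:top_k]]


# Toy two-dimensional embeddings, invented for the example.
best = rerank(
    [1.0, 0.0],
    [("a", [0.0, 1.0]), ("b", [1.0, 0.1]), ("c", [0.5, 0.5])],
    top_k=2,
)
```

Here `best` keeps the two candidates whose embeddings point most nearly in the same direction as the text embedding.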

What could DALL-E 1 do, and what were its limits?

DALL-E 1 demonstrated a striking ability for zero-shot image generation. Given novel text prompts such as "an armchair in the shape of an avocado" or "a professional high-quality illustration of a baby daikon radish in a tutu walking a dog," the model produced plausible images without having seen these specific combinations during training.^[6]^[13] The avocado armchair in particular became the iconic example from the launch, widely reproduced by media outlets.^[23] Lead author Aditya Ramesh described the model's most surprising behavior this way: "The thing that surprised me the most is that the model can take two unrelated concepts and put them together in a way that results in something kind of functional."^[23] As Singh et al. (2021) observed, this compositional generalization, combining familiar concepts in new ways, represents a form of imagination that is central to human intelligence.^[5]

However, the model had clear limitations. Resolution was capped at 256 x 256 pixels. It struggled with compositional prompts involving attribute binding (distinguishing "a yellow book and a red vase" from "a red book and a yellow vase"), negation, precise counts of more than three objects, and complex spatial relationships.^[6]^[5]

DALL-E 2

Release and Overview

OpenAI introduced DALL-E 2 on April 6, 2022.^[1] The system entered public beta in July 2022, and in September 2022, the waitlist was removed, making the service available to anyone.^[4] By November 2022, when the API launched, more than 3 million users were generating over 4 million images per day.^[8]

DALL-E 2 represented a fundamental shift in architecture. Rather than using an autoregressive transformer, it employed a two-stage diffusion process conditioned on CLIP embeddings, an approach that the accompanying research paper calls "unCLIP."^[9]^[3]

How much better was DALL-E 2 than DALL-E 1?

In OpenAI's human-evaluation study, raters compared image generations from both models and preferred DALL-E 2 over the original DALL-E 88.8% of the time for photorealism and 71.7% of the time for caption matching, while DALL-E 2 generated images with up to four times the resolution of DALL-E 1.^[1]^[24] The accompanying research paper summarized the design goal directly: the authors proposed "a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding," and showed that "explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity."^[9]^[25]

Technical Architecture (unCLIP)

The unCLIP architecture has three main components:

1. CLIP Encoder. CLIP consists of two neural network branches, a text encoder and an image encoder, trained jointly on hundreds of millions of image-caption pairs using a contrastive objective.^[9] The training maximizes the cosine similarity between correctly paired text and image embeddings while minimizing similarity for incorrect pairs. This produces a shared representation space where semantically related text and images are close together.^[3]

2. Prior Model. The prior translates a CLIP text embedding into a corresponding CLIP image embedding. DALL-E 2 used a diffusion-based prior implemented as a decoder-only transformer with a causal attention mask. It accepted tokenized text, CLIP text encodings, a diffusion timestep encoding, and noised CLIP image embeddings as input, and it output a predicted unnoised CLIP image embedding.^[9]

3. Decoder (Modified GLIDE). The image decoder was a modified version of GLIDE, an earlier OpenAI diffusion model. It took the CLIP image embedding produced by the prior and iteratively denoised a sample of Gaussian noise into a 64 x 64 pixel image. Two cascaded upsampling diffusion models then increased resolution first to 256 x 256 and then to 1,024 x 1,024 pixels.^[9]^[3]
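The contrastive objective described for the CLIP encoder in item 1 can be sketched as a symmetric cross-entropy over scaled cosine similarities. This is a pure-Python illustration rather than CLIP's training code; the fixed temperature of 0.07 and all function names are assumptions made for the example.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


def contrastive_loss(text_embeddings, image_embeddings, temperature=0.07):
    """Average of text-to-image and image-to-text cross-entropy.

    Pair i is (text_embeddings[i], image_embeddings[i]); every other
    combination in the batch is treated as an incorrect pair.
    """
    texts = [normalize(t) for t in text_embeddings]
    images = [normalize(i) for i in image_embeddings]
    n = len(texts)
    # logits[i][j]: scaled cosine similarity between text i and image j
    logits = [
        [sum(a * b for a, b in zip(t, i)) / temperature for i in images]
        for t in texts
    ]
    text_to_image = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    image_to_text = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (text_to_image + image_to_text) / 2


aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is near zero when each caption is closest to its own image (`aligned`) and large when the pairing is reversed (`swapped`), which is the pressure that pulls matching text and images together in the shared space.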

The full generation pipeline thus ran as follows: a text prompt was encoded by CLIP's text encoder into a text embedding; the prior model mapped this text embedding to a CLIP image embedding; and the decoder generated the final image from this image embedding through iterative denoising.
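As a structural sketch, the pipeline is a chain of four stages. The classes and stand-in callables below are hypothetical placeholders that only track data flow and image size; they are not the real models.

```python
from dataclasses import dataclass
from typing import Callable, List

Embedding = List[float]


@dataclass
class Image:
    side: int  # stand-in for pixel data: only the side length is tracked


@dataclass
class UnclipPipeline:
    encode_text: Callable[[str], Embedding]       # CLIP text encoder
    prior: Callable[[Embedding], Embedding]       # text embedding -> image embedding
    decode: Callable[[Embedding], Image]          # diffusion decoder, 64 x 64 output
    upsamplers: List[Callable[[Image], Image]]    # cascaded upsampling models

    def generate(self, prompt: str) -> Image:
        text_embedding = self.encode_text(prompt)
        image_embedding = self.prior(text_embedding)
        image = self.decode(image_embedding)
        for upsample in self.upsamplers:
            image = upsample(image)
        return image


# Trivial stand-ins so the data flow can be run end to end.
pipeline = UnclipPipeline(
    encode_text=lambda prompt: [float(len(prompt))],
    prior=lambda embedding: embedding,
    decode=lambda embedding: Image(side=64),
    upsamplers=[
        lambda image: Image(side=image.side * 4),  # 64 -> 256
        lambda image: Image(side=image.side * 4),  # 256 -> 1,024
    ],
)
result = pipeline.generate("an armchair in the shape of an avocado")
```

Separating the prior from the decoder is what lets the system sample several different image embeddings for one caption, which the paper credits for the gain in diversity.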

Parameter Count

DALL-E 2 used approximately 3.5 billion parameters for its primary model, with an additional 1.5 billion parameters for the resolution-enhancing upsamplers.^[3] Despite having fewer parameters than DALL-E 1's 12 billion, DALL-E 2 produced images at four times the linear resolution (1,024 x 1,024 versus 256 x 256 pixels) with significantly improved realism and caption accuracy.^[1]^[11]

Inpainting and Outpainting

DALL-E 2 introduced two important editing features:

Inpainting allowed users to select a region within an existing image and fill it with new AI-generated content guided by a text prompt. The model adapted new objects to match the style, lighting, shadows, and textures present in the original image.^[11]

Outpainting extended an image beyond its original borders, generating new content that was consistent with the existing scene's perspective, shadows, reflections, and textures. This enabled creation of larger images and different aspect ratios from a starting composition.^[10]

Characteristics and Evaluation

Marcus et al. (2022) conducted a systematic evaluation of DALL-E 2 and reported several observations:^[12]

Images exhibited high visual quality overall
Language comprehension was reliable for straightforward prompts
Compositionality remained problematic: relationships between entities were often confused, and anaphora posed challenges
Numerical concepts were poorly understood (e.g., generating the wrong number of objects)
Negation was not handled reliably
Common-sense reasoning failures were frequent
Content filters occasionally blocked benign prompts while missing edge cases

Commercial Rights

On July 20, 2022, OpenAI announced that users would receive full usage rights to commercialize images they created with DALL-E 2, including the right to reprint, sell, and merchandise them.^[7] OpenAI retained the right to commercialize user-created images as well.^[7]

DALL-E 3

Release and ChatGPT Integration

OpenAI announced DALL-E 3 in September 2023 and began rolling it out in October 2023.^[14] The most significant change was native integration with ChatGPT. Unlike previous versions where users typed prompts directly into an image generation interface, DALL-E 3 was accessed through ChatGPT.^[14] Users described what they wanted in natural conversation, and ChatGPT automatically expanded brief requests into detailed prompts optimized for image generation. This approach effectively eliminated the need for users to learn prompt engineering techniques specific to image models.^[14]

When given a request, ChatGPT typically generated multiple detailed prompt variations, each producing a different image. Users could then refine results through continued conversation, asking ChatGPT to adjust colors, compositions, styles, or specific elements.

How does DALL-E 3 improve prompt following?

The technical paper behind DALL-E 3, titled "Improving Image Generation with Better Captions" by James Betker, Gabriel Goh, Li Jing, and colleagues, identified a root cause of poor prompt following in earlier text-to-image models: the low quality of text-image pair captions in training datasets.^[14] Most internet-sourced captions are short, vague, or inaccurate descriptions of the images they accompany.

To address this, OpenAI trained a custom image captioner jointly with a CLIP and language modeling objective. This captioner produced long, highly descriptive captions covering the main subject, surroundings, background, visible text, artistic style, and coloration of each training image. The training dataset was then recaptioned using this model.^[14]

During DALL-E 3 training, a regularization technique randomly selected between the synthetic caption (95% of the time) and the original ground-truth caption (5% of the time) for each sample. This hybrid approach prevented overfitting to the captioner's specific patterns while still delivering the benefits of more descriptive training data.^[14]
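The caption-mixing rule amounts to a biased coin flip per training sample. A minimal sketch, assuming each sample carries both captions; the function name and the example captions are made up for illustration.

```python
import random

SYNTHETIC_PROBABILITY = 0.95  # share of samples that use the synthetic caption


def pick_caption(synthetic, original, rng):
    """Choose the synthetic caption 95% of the time, else the original."""
    return synthetic if rng.random() < SYNTHETIC_PROBABILITY else original


# Empirical check with a seeded generator so the run is repeatable.
rng = random.Random(0)
draws = [
    pick_caption("a long, detailed synthetic caption", "original alt text", rng)
    for _ in range(10_000)
]
synthetic_share = draws.count("a long, detailed synthetic caption") / len(draws)
```

Over many samples `synthetic_share` settles close to 0.95, leaving a small stream of original captions in the mix.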

Improved Text Rendering

One of DALL-E 3's most notable improvements was its ability to render readable text within generated images.^[14] Earlier models, including DALL-E 2 and Midjourney, frequently produced garbled or illegible text in signs, labels, and other contexts. DALL-E 3 achieved substantially better text rendering, with accuracy estimated at approximately 95% for common text-in-image scenarios. This improvement came partly from using larger text encoders with character-level understanding.^[14]

Resolution and Quality Options

DALL-E 3 supported three resolution options: 1024 x 1024 (square), 1024 x 1792 (portrait), and 1792 x 1024 (landscape). It also offered two quality tiers:

Standard quality: faster generation at lower cost
HD quality: additional processing time for finer details and greater consistency

A style parameter allowed users to choose between "vivid" (more dramatic, hyper-real images) and "natural" (less stylized, more photographic results).^[20]
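The options above map onto a small set of request parameters. The sketch below only builds and validates a request body and makes no network call; the parameter names and values (`size`, `quality`, `style`, `n`) follow the OpenAI Images API for `dall-e-3` as best understood here, and the helper itself is hypothetical. Given the API deprecation described later in this article, the endpoint may no longer accept this model.

```python
ALLOWED_SIZES = {"1024x1024", "1024x1792", "1792x1024"}
ALLOWED_QUALITIES = {"standard", "hd"}
ALLOWED_STYLES = {"vivid", "natural"}


def build_dalle3_request(prompt, size="1024x1024", quality="standard", style="vivid"):
    """Return a request body for one DALL-E 3 image, or raise ValueError."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    if style not in ALLOWED_STYLES:
        raise ValueError(f"unsupported style: {style}")
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "style": style,
        "n": 1,  # assumption: one image per request for this model
    }


payload = build_dalle3_request(
    "a lighthouse at dusk", size="1792x1024", quality="hd", style="natural"
)
```

Validating locally before sending keeps unsupported combinations, such as a 512 x 512 size, from costing a round trip.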

Prompt Rewriting and Safety

The ChatGPT integration included an automatic prompt rewriting system that served both quality and safety purposes. ChatGPT transformed user requests into detailed prompts that improved generation quality while simultaneously checking for potential content policy violations.^[14] If a request appeared to violate OpenAI's guidelines, the prompt transformation could modify it to remove the problematic elements. This system was tested against 500 synthetic prompts and reportedly reduced generation of identifiable public figures to zero when explicitly requested by name.^[14]

DALL-E 3 also refused to generate images in the style of living artists, addressing concerns from the creative community about unauthorized style replication.^[14]

DALL-E and CLIP

DALL-E was developed and announced alongside CLIP, a Contrastive Language-Image Pre-training model. While these two models serve different purposes, they are deeply interconnected.

CLIP was trained on 400 million image-text pairs scraped from the internet. It learned to predict which caption best matches a given image from a list of thousands of candidates.^[13] In the context of the DALL-E family, CLIP served two roles:

Ranking (DALL-E 1): After DALL-E 1 generated a batch of candidate images for a given prompt, CLIP scored each candidate based on how well it matched the text description, and the highest-scoring images were selected for display.^[6]
Conditioning (DALL-E 2): In the unCLIP architecture, CLIP embeddings served as the intermediate representation between text and image. The prior model translated CLIP text embeddings into CLIP image embeddings, and the decoder generated images conditioned on these embeddings.^[9]

This relationship is what gave the DALL-E 2 architecture its "unCLIP" name: CLIP's image encoder maps an image to an embedding, and the DALL-E 2 decoder inverts that mapping by generating an image from a CLIP image embedding.^[9]

GPT Image Models: The DALL-E Successors

GPT-4o Native Image Generation (March 2025)

On March 25, 2025, OpenAI released native image generation capabilities in GPT-4o, marking a departure from the DALL-E approach.^[15] DALL-E 2 and DALL-E 3 were separate diffusion models, and ChatGPT invoked DALL-E 3 as an external tool. GPT-4o's image generation, by contrast, is built directly into the language model. The model is natively multimodal, processing text and images within the same neural network rather than delegating to a specialized image generation system.^[15]

Key improvements over DALL-E 3 included:

Accurate text rendering in images (words, labels, signs, and logos)
Ability to handle 10 to 20 distinct objects in a single scene
Use of chat history to maintain consistency across multiple generations in a conversation
Image-to-image transformation capabilities
Interactive refinement where users could ask for specific edits to generated images^[15]

The underlying model was made available to developers as "gpt-image-1" via the API on April 23, 2025.^[16]

GPT Image 1.5 (December 2025)

OpenAI released GPT Image 1.5 on December 16, 2025, as the next iteration of its image generation capabilities. It was simultaneously rolled out in ChatGPT (branded as "ChatGPT Images") and made available through the API.^[17]

Notable improvements:

Up to 4x faster image generation compared to GPT Image 1
Precision editing that changes only the requested elements while preserving lighting, composition, and appearance consistency
Better handling of dense and small text within images
Improved instruction following and prompt adherence
20% lower API costs compared to GPT Image 1
LM Arena Elo score of 1,264, the highest among tested image generation models as of early 2026^[17]

DALL-E API Deprecation

On November 14, 2025, OpenAI announced that DALL-E 2 and DALL-E 3 model snapshots would be deprecated and removed from the API on May 12, 2026.^[17] Developers were directed to migrate to GPT Image 1 or GPT Image 1.5. The DALL-E brand has effectively been retired in favor of the GPT Image product line, though the DALL-E models continue to function via the API through the deprecation date.^[17]

API Access and Pricing

OpenAI has offered API access for image generation across multiple model generations. The following table summarizes pricing as of early 2026.

Model	Quality	Resolution	Price per Image
DALL-E 2 (legacy)	Standard	1024 x 1024	$0.020
DALL-E 2 (legacy)	Standard	512 x 512	$0.018
DALL-E 2 (legacy)	Standard	256 x 256	$0.016
DALL-E 3 (deprecated)	Standard	1024 x 1024	$0.040
DALL-E 3 (deprecated)	Standard	1024 x 1792 or 1792 x 1024	$0.080
DALL-E 3 (deprecated)	HD	1024 x 1024	$0.080
DALL-E 3 (deprecated)	HD	1024 x 1792 or 1792 x 1024	$0.120
GPT Image 1	Low	1024 x 1024	$0.011
GPT Image 1	Medium	1024 x 1024	$0.042
GPT Image 1	High	1024 x 1024	$0.167

GPT Image 1.5 uses token-based pricing ($8.00 per million input tokens, $32.00 per million output tokens) rather than per-image pricing, and its image inputs and outputs are 20% cheaper than GPT Image 1.^[21]^[17]
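The two pricing schemes can be compared with a small cost estimator. The prices are copied from the table and paragraph above (square sizes only); the function names and dictionary layout are hypothetical, and `Decimal` is used so that the currency arithmetic stays exact.

```python
from decimal import Decimal

# Per-image prices in US dollars for 1024 x 1024 output, from the table above.
PRICE_PER_IMAGE = {
    ("dall-e-2", "standard"): Decimal("0.020"),
    ("dall-e-3", "standard"): Decimal("0.040"),
    ("dall-e-3", "hd"): Decimal("0.080"),
    ("gpt-image-1", "low"): Decimal("0.011"),
    ("gpt-image-1", "medium"): Decimal("0.042"),
    ("gpt-image-1", "high"): Decimal("0.167"),
}

# Token prices in US dollars per million tokens for GPT Image 1.5.
INPUT_PRICE_PER_MILLION = Decimal("8.00")
OUTPUT_PRICE_PER_MILLION = Decimal("32.00")


def per_image_cost(model, quality, count):
    """Cost of `count` square images under per-image pricing."""
    return PRICE_PER_IMAGE[(model, quality)] * count


def token_cost(input_tokens, output_tokens):
    """Cost of a workload under token-based pricing."""
    total = (
        Decimal(input_tokens) * INPUT_PRICE_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_MILLION
    )
    return total / Decimal(1_000_000)


batch_cost = per_image_cost("dall-e-3", "hd", 100)
token_based_cost = token_cost(1_000_000, 500_000)
```

Comparing the two schemes for a real workload needs the number of tokens consumed per image, which the text above does not state, so the estimator leaves that as an input.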

For ChatGPT subscribers, image generation is included in their subscription at no additional per-image cost, subject to usage limits that vary by plan tier (Free, Plus, Pro, Team, Enterprise, and Edu).

Safety Measures

Content Policy

OpenAI maintains a comprehensive content policy for all DALL-E and GPT Image models. Prohibited content categories include:

Violence, gore, and graphic content
Sexual or explicit imagery, including in stylized or fictional form
Hateful, threatening, or harassing content
Political content designed to mislead
Content depicting real people without consent
Child sexual abuse material (reported to the National Center for Missing and Exploited Children)

The policy specifically blocks the generation of photorealistic images of identifiable real people, including public figures, to prevent deepfake creation.^[14] DALL-E 3 and its successors also refuse to generate images in the style of living artists.^[14]

Training Data Safety

OpenAI has removed violent, sexual, and otherwise harmful content from the training datasets used for DALL-E models. Text prompt filters block requests that violate the content policy before they reach the model, and output classifiers scan generated images for policy violations before they are returned to users.

C2PA Watermarking and Content Credentials

Beginning with DALL-E 3, OpenAI implemented C2PA (Coalition for Content Provenance and Authenticity) metadata in all generated images.^[18] This metadata includes the Content Credentials logo (CR) and embedded information about the time and date of creation and the AI-generated nature of the image. Users can verify whether an image was generated by ChatGPT or the DALL-E API using tools such as Content Credentials Verify.^[18]

In May 2024, OpenAI joined the C2PA as a steering committee member, alongside Adobe, BBC, Intel, Microsoft, Google, Publicis Groupe, Sony, and Truepic.^[19] All GPT-4o-generated images also include C2PA metadata.^[15]

However, C2PA metadata has limitations. It is stored as file metadata rather than being embedded directly into the image pixels, meaning it can be stripped by a simple file conversion or metadata removal tool. This makes it an imperfect solution for tracking the provenance of AI-generated images shared across the internet.

Applications

What is DALL-E used for?

DALL-E and its successors have found use across a wide range of professional and creative domains.

Marketing and Advertising

Businesses use DALL-E to produce custom images for advertising campaigns, social media content, websites, and email marketing without relying on stock photography. Marketing teams can input specific creative briefs and receive tailored visuals in seconds, reducing the cost and turnaround time of visual content production.

Product Design and Prototyping

Designers use DALL-E for early-stage concept visualization. Product mockups can be generated quickly to explore design directions before investing in physical prototypes or detailed 3D models. Fashion design platform CALA integrated the DALL-E 2 API to let its users improve and iterate on design ideas through text prompts.^[8]

Concept Art and Entertainment

The film, game, and animation industries use DALL-E for concept art, storyboarding, and visual development. Artists use it to rapidly explore artistic directions, generate reference material, and communicate visual ideas to collaborators before committing to detailed production work.

Education

Educators use DALL-E to generate visual aids for teaching, including diagrams of scientific phenomena, reconstructions of historical scenes, language-learning flashcards, and illustrations of abstract concepts. The tool makes it possible to create customized educational imagery that precisely matches lesson content.

Publishing and Media

Publishers use DALL-E for book covers, magazine illustrations, article headers, and other editorial imagery. The ability to generate custom illustrations on demand reduces dependence on stock photography and commissioned artwork for routine publishing needs.

Combined AI Workflows

DALL-E is increasingly used in combination with other AI tools. For example, an image generated by DALL-E can be animated and given voice using D-ID's AI-generated text-to-video technology. A landscape created in one tool can become an opening shot of a video, accompanied by music composed by an AI music model. Gil Perry, CEO of D-ID, has noted that "people are layering different AI tools to produce even more creative content."

Competition

How does DALL-E compare to other image generators?

DALL-E competes with several other AI image generation systems. The following table compares the major platforms as of early 2026.

Platform	Developer	Model Type	Open Source	Key Strengths	Pricing Model
DALL-E 3 / GPT Image	OpenAI	Diffusion (DALL-E) / Autoregressive (GPT Image)	No	ChatGPT integration; text rendering; ease of use	API per-image/token; subscription
Midjourney	Midjourney Inc.	Proprietary diffusion	No	Artistic quality; aesthetic coherence; color harmony	Subscription ($10-$120/month)
Stable Diffusion	Stability AI	Latent diffusion	Yes	Full customization; local deployment; fine-tuning; no subscription needed	Free (self-hosted); API varies
Imagen	Google DeepMind	Diffusion (Imagen 4)	No	Photorealistic output; text rendering; integrated in Gemini	API per-image ($0.02-$0.06)
Flux	Black Forest Labs	Rectified flow transformer	Partially	Photorealism; high detail; fast inference	API and open-weight options

Midjourney

Midjourney is widely regarded as the leader in artistic quality and aesthetic impact. Its V7 model, released in April 2025, is optimized for visual coherence, color harmony, and compositional balance. Midjourney is particularly strong for concept art, fantasy landscapes, and stylized portraits. It originally operated exclusively through Discord but has since launched a dedicated web interface.

Stable Diffusion

Stable Diffusion, developed by Stability AI, is the most customizable option. As an open-source model, it can run on local hardware, be fine-tuned for specific styles or domains, and be integrated into custom workflows without subscription fees. Stable Diffusion 3.5, released in late 2024, brought significant improvements in image quality, prompt adherence, and text rendering, narrowing the gap with proprietary models.

Imagen

Google's Imagen models are integrated into the Gemini platform. Imagen 4, available through Google Cloud, offers competitive photorealism and text rendering at lower per-image costs than OpenAI's highest-quality tiers. Google's Gemini 3 Pro (late 2025) includes enhanced image generation capabilities.

Flux

Flux, developed by Black Forest Labs (founded by former Stability AI researchers), uses a rectified flow transformer architecture. Flux Pro produces some of the most photorealistic AI-generated images available, with exceptional detail and lighting. Flux offers both open-weight models for self-hosting and commercial API access.

Concerns and Controversies

Artistic Integrity

The rise of AI image generation has prompted debate about the future of human artistry. Critics argue that when anyone can generate compelling visuals from a text prompt, the value of traditional artistic skill is diminished. Proponents counter that these tools democratize visual creation, enabling people who lack formal art training, time, or physical ability to express visual ideas.

Copyright and Intellectual Property

Copyright issues surrounding AI-generated images remain legally unsettled. OpenAI grants users commercial rights over images they create, but since users contribute only a text prompt and the images are machine-generated, the copyrightability of the output is unclear under current law.

In September 2022, Getty Images banned the upload and sale of images generated by DALL-E 2, Stable Diffusion, and other AI tools, citing "unaddressed rights issues" because training datasets contained copyrighted images.^[22] Other platforms including Newgrounds, PurplePort, and FurAffinity enacted similar bans.^[22] In contrast, Shutterstock announced a partnership with OpenAI to integrate DALL-E 2 for content generation.

The use of copyrighted material in training datasets remains the subject of ongoing litigation, with artists and rights holders arguing that training on their work without permission constitutes infringement.

Deepfakes and Misinformation

AI image generators can be misused to create deceptive imagery, from deepfake portraits of public figures to fabricated photographic "evidence" of events that never occurred. OpenAI addresses this through content filters, the prohibition on generating identifiable real people, and C2PA metadata.^[18] However, other image generation tools, particularly open-source ones like Stable Diffusion, have fewer restrictions and have already been used to create deepfakes of celebrities.

Legislative responses have emerged worldwide. In May 2025, the TAKE IT DOWN Act was signed into law in the United States, the first major federal statute targeting non-consensual intimate imagery, including AI-generated deepfakes. The EU AI Act (Regulation (EU) 2024/1689) established transparency and labeling requirements for AI-generated content.

Development Timeline

The following timeline traces the key milestones in DALL-E's development and the broader evolution of OpenAI's image generation capabilities.

Date	Event
2019	GPT-2 released, demonstrating large-scale autoregressive text generation with 1.5 billion parameters
June 2020	GPT-3 released with 175 billion parameters, providing the architectural foundation for DALL-E
January 2021	DALL-E and CLIP announced simultaneously
April 2022	DALL-E 2 introduced with unCLIP architecture and 1024 x 1024 resolution
July 2022	DALL-E 2 enters public beta; commercial usage rights granted to users
September 2022	DALL-E 2 waitlist removed; open access begins
September 2022	Getty Images bans AI-generated content uploads
November 2022	DALL-E 2 API launched; 3 million users generating 4 million images daily
September 2023	DALL-E 3 announced with ChatGPT integration
October 2023	DALL-E 3 rolls out to ChatGPT Plus and Enterprise users
November 2023	DALL-E 3 API released alongside text-to-speech models
February 2024	C2PA metadata added to DALL-E 3 API-generated images
May 2024	OpenAI joins C2PA steering committee
March 2025	GPT-4o native image generation replaces DALL-E 3 in ChatGPT
April 2025	gpt-image-1 API released for developers
November 2025	DALL-E 2 and DALL-E 3 API deprecation announced (sunset: May 12, 2026)
December 2025	GPT Image 1.5 released in ChatGPT and API

How It Works: Technical Summary

Across its versions, the DALL-E family has used two fundamentally different approaches to image generation.

Autoregressive Generation (DALL-E 1 and GPT Image)

DALL-E 1 treated image generation as a sequence prediction task. Images were tokenized into discrete codes using a variational autoencoder, then concatenated with text tokens and fed into a transformer that predicted the next token in the sequence.^[6] This is the same principle behind GPT-3's text generation, applied to a mixed text-image sequence.

GPT Image 1 and 1.5 returned to an autoregressive approach but within a natively multimodal architecture. Rather than treating images and text as separate modalities that must be bridged, these models process both within a single neural network.^[15]

Diffusion-Based Generation (DALL-E 2 and DALL-E 3)

DALL-E 2 and DALL-E 3 used diffusion models, which generate images by learning to reverse a gradual noising process. Training proceeds by adding noise to an image step by step until it becomes pure Gaussian noise, then training the model to reverse each step. At inference time, the model starts with random noise and iteratively denoises it into a coherent image, guided by the text prompt's embedding.
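The forward (noising) half of this process has a simple closed form, sketched below on a toy vector instead of an image. The linear noise schedule and its endpoints follow the common DDPM convention and are an assumption here; the source does not state which schedule the DALL-E models used. The learned denoising network that performs the reverse steps is omitted.

```python
import math
import random


def linear_betas(steps, start=1e-4, end=0.02):
    """Per-step noise variances rising linearly from start to end."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: the fraction of the original signal variance left at step t."""
    products = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def noise_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is the target a denoising network learns to predict


alpha_bars = cumulative_alphas(linear_betas(1000))
x_t, eps = noise_sample([1.0, -1.0], 999, alpha_bars, random.Random(0))
```

By the last step almost none of the original signal remains, so `x_t` is effectively pure Gaussian noise, which is the starting point for generation at inference time.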

In DALL-E 2, this guidance came through CLIP embeddings (the unCLIP architecture).^[9] In DALL-E 3, the primary innovation was not in the diffusion architecture itself but in the quality of the training data captions that conditioned the diffusion process.^[14]

References

1. OpenAI. "DALL-E 2." openai.com. April 2022.
2. Huang, K. "DALL-E: A Brief Overview." 2022.
3. Bastian, M. "DALL-E 2 Explained." The Decoder. 2022.
4. OpenAI. "DALL-E Now Available Without Waitlist." September 2022.
5. Singh, G. et al. "Illiterate DALL-E Learns to Compose." arXiv preprint arXiv:2110.11405. 2021.
6. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. "Zero-Shot Text-to-Image Generation." Proceedings of the 38th International Conference on Machine Learning. 2021.
7. OpenAI. "DALL-E Usage Policy Update." July 2022.
8. OpenAI. "DALL-E API." November 2022.
9. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv preprint arXiv:2204.06125. April 2022.
10. OpenAI. "DALL-E 2: Outpainting." August 2022.
11. OpenAI. "DALL-E 2 Research." openai.com. 2022.
12. Marcus, G., Davis, E., and Aaronson, S. "A very preliminary analysis of DALL-E 2." arXiv preprint arXiv:2204.13807. 2022.
13. Heikkila, M. "DALL-E, the AI Art Generator, Explained." MIT Technology Review. 2022.
14. Betker, J., Goh, G., Jing, L., et al. "Improving Image Generation with Better Captions." OpenAI Research Paper. 2023.
15. OpenAI. "Introducing 4o Image Generation." openai.com. March 2025.
16. OpenAI. "Image Generation API: Introducing gpt-image-1." openai.com. April 2025.
17. OpenAI. "The New ChatGPT Images is Here." openai.com. December 2025.
18. OpenAI. "C2PA in ChatGPT Images." OpenAI Help Center. 2024.
19. C2PA. "OpenAI Joins C2PA Steering Committee." May 2024.
20. OpenAI. "DALL-E 3 API." OpenAI Developer Documentation. 2023.
21. OpenAI. "GPT Image 1 Model." OpenAI API Documentation. 2025.
22. Schwartz, E. H. "Getty Images Removes and Bans AI-Generated Art." Voicebot.ai. September 23, 2022. https://voicebot.ai/2022/09/23/getty-images-removes-and-bans-ai-generated-art/
23. Heaven, W. D. "This avocado armchair could be the future of AI." MIT Technology Review. January 5, 2021. https://www.technologyreview.com/2021/01/05/1015754/avocado-armchair-future-ai-openai-deep-learning-nlp-gpt3-computer-vision-common-sense/
24. OpenAI. "DALL-E 2." openai.com. April 6, 2022. (Human evaluators preferred DALL-E 2 over DALL-E 1 88.8% of the time for photorealism and 71.7% for caption matching; 4x greater resolution.)
25. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. "Hierarchical Text-Conditional Image Generation with CLIP Latents" (abstract). arXiv:2204.06125. April 13, 2022. https://arxiv.org/abs/2204.06125
