DALL-E 3
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,999 words
DALL-E 3 (stylized by OpenAI as DALL·E 3) is the third generation of OpenAI's text-to-image system, announced on September 20, 2023 and released to ChatGPT Plus and Enterprise subscribers in October 2023, followed by a standalone API endpoint named dall-e-3.[^1][^2] Compared with its predecessor DALL-E 2, the model produces images that adhere much more closely to long, detailed prompts, renders legible text inside images far more reliably, and is paired with a server-side prompt-rewriting pipeline driven by ChatGPT.[^1][^3] The training recipe is documented in the OpenAI technical report Improving Image Generation with Better Captions by James Betker, Gabriel Goh, Li Jing and colleagues, which credits a bespoke image captioner and synthetic-caption training as the key intervention.[^4] DALL-E 3 was integrated into Microsoft's Bing Image Creator and Copilot products in late 2023[^5][^6] and remained OpenAI's primary image generator until it was superseded in 2025 by GPT-4o native image generation (the API model gpt-image-1) and, by December 2025, by gpt-image-1.5 as the default in ChatGPT.[^7][^8][^9]
| Item | Value |
|---|---|
| Developer | OpenAI |
| Predecessor | DALL-E 2 (April 2022) |
| Announcement | September 20, 2023[^1][^2] |
| ChatGPT availability | October 2023 (Plus / Enterprise)[^10] |
| Bing availability | October 3, 2023 (free, via Bing Chat and Bing.com/create)[^23] |
| API model id | dall-e-3[^11] |
| API launch | November 6, 2023 (OpenAI DevDay)[^24] |
| Output sizes | 1024x1024, 1024x1792, 1792x1024[^11] |
| Quality settings | standard, hd[^11] |
| Style settings | vivid, natural[^11] |
| Provenance | C2PA metadata (added Feb 2024)[^12][^13] |
| Free ChatGPT tier | August 8, 2024 (two images per day)[^25] |
| Successor | gpt-image-1 (GPT-4o image generation, March-April 2025)[^7][^8] |
| API deprecation | Announced Nov 14, 2025; removal scheduled May 12, 2026[^9] |
OpenAI's first DALL-E, released as a research preview in January 2021, was an autoregressive transformer that produced images by predicting discrete image tokens, conditioned on text.[^1][^26] The original DALL-E model was led by Aditya Ramesh, with co-inventors Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, and was named as a portmanteau of Salvador Dalí and Pixar's WALL-E.[^26] DALL-E 2, announced on April 6, 2022, replaced this with a diffusion model conditioned on CLIP image embeddings (the approach OpenAI called unCLIP), dramatically improving image fidelity but exhibiting weak prompt following: it tended to combine attributes incorrectly, struggled with multi-object scenes, ignored words in long prompts, and was notoriously poor at rendering legible text and detailed hands.[^1][^4] By mid-2023, Stable Diffusion derivatives and Midjourney v5 had narrowed or closed the aesthetic gap with DALL-E 2 while exposing the same compositional weaknesses, motivating OpenAI's next-generation effort.[^4][^21]
Aditya Ramesh continued to lead the DALL-E project through this period, working alongside a team that grew to include researchers focused on safety, evaluation, and data curation.[^26] OpenAI gradually wound down the DALL-E 2 research preview during 2022 and 2023 in preparation for DALL-E 3, although DALL-E 2 remained available as an API model (dall-e-2) until the announced 2026 deprecation.[^9][^26]
OpenAI publicly announced DALL-E 3 on September 20, 2023 with a blog post on its corporate website and a research preview campaign showcasing improved prompt following and text rendering.[^1] In the announcement OpenAI emphasized that DALL-E 3 had been built "natively on ChatGPT", meaning the chat model would automatically rewrite short user requests into more detailed image prompts before they reached the image model.[^1] Coverage on launch day from TechCrunch summarized the same three-pronged story: tighter ChatGPT integration, refusals for living-artist styles and named public figures, and an opt-out form for artists who do not want their work used to train future OpenAI image models.[^2]
Sam Altman, then CEO of OpenAI, demonstrated DALL-E 3 in a personal social-media post the same day that depicted a t-shirt with the exact text "I exist" rendered cleanly, an output that was widely circulated as evidence of the improvement in text-in-image rendering relative to DALL-E 2.[^2] The announcement post also included sample outputs of multi-object compositions with explicit spatial relationships ("an old man in a coat sitting on a park bench reading a newspaper, with a pigeon eating breadcrumbs at his feet") that previous OpenAI models had been unable to render coherently.[^1] At launch, OpenAI stated that DALL-E 3 would not generate images in the style of named living artists, would refuse requests depicting named public figures (politicians and celebrities), and would offer artists a form through which they could opt out their works from future training sets.[^1][^2]
DALL-E 3 began rolling out to ChatGPT Plus and Enterprise customers in October 2023, replacing DALL-E 2 as the default image tool in the chat interface.[^10] In an October 19, 2023 blog post, OpenAI confirmed broader availability and disclosed additional safety work: the company said it had trained a provenance classifier (an internal tool) that, in early internal evaluations, was over 99% accurate at identifying unmodified DALL-E 3 images and over 95% accurate after common modifications such as cropping, resizing, JPEG compression, or partial overlay with real imagery.[^10][^27]
The October release arrived as part of a broader expansion of multimodal capabilities in ChatGPT that also included GPT-4 with vision (image input) and a voice mode driven by Whisper for transcription and a new text-to-speech model for output, marking the moment when ChatGPT shifted from a text-only product to a multimodal assistant.[^10][^28] OpenAI rolled out the image tool gradually to Plus and Enterprise subscribers throughout October 2023.[^10]
In parallel with the ChatGPT release, Microsoft integrated the model into Bing Image Creator (originally called "Image Creator from Microsoft Designer") and into the Copilot consumer surface around the September 21, 2023 Copilot rebrand, making DALL-E 3 available to non-paying users through Bing for the first time.[^5][^6][^29] Microsoft initially offered DALL-E 3 to Bing Chat Enterprise subscribers and rolled out the free Bing consumer integration broadly on October 3, 2023.[^5][^23] At that point, Microsoft said Bing Image Creator had already generated more than one billion images since its launch (using DALL-E 2.5 derivatives) and that the new DALL-E 3 upgrade delivered "more beautiful creations and better renderings for details like fingers and eyes."[^23]
The Bing/Copilot route was strategically important because, unlike OpenAI's paid distribution path, it was free at point of use, financed by Microsoft's existing search advertising business; Microsoft framed the integration as a way to drive Bing search adoption and to differentiate Copilot from Google's then-nascent Gemini-based image tools.[^5][^6] On September 21, 2023 Microsoft also rebranded all variants of its Copilot family (including Bing Chat and Microsoft 365 Copilot) under the unified Microsoft Copilot name, with DALL-E 3 listed as one of the headline new capabilities.[^29]
OpenAI exposed DALL-E 3 as a standalone API endpoint (dall-e-3) on November 6, 2023, at the company's first developer conference (DevDay), with parameters for size, quality (standard versus hd), and style (vivid versus natural).[^11][^14][^24] The API release shipped alongside the introduction of the Assistants API and the GPT-4 Turbo model family.[^14][^24] Pricing started at $0.04 per generated image for the standard 1024x1024 size.[^18][^24]
Unlike the DALL-E 2 API, dall-e-3 did not expose endpoints for image editing (inpainting) or image variations: only the generation endpoint POST /v1/images/generations was supported.[^11][^24] OpenAI also stated at launch that the API would automatically rewrite user prompts "for safety reasons and to add more detail," a behavior that became one of the most-discussed practical differences between the API and the ChatGPT-facing version.[^24]
For nearly a year after launch, DALL-E 3 in ChatGPT remained restricted to paying users. On August 8, 2024, OpenAI extended access to free-tier ChatGPT accounts, allowing each free user to generate up to two images per day; Plus subscribers kept a higher quota, roughly one image per minute at the time.[^25] The expansion was described by OpenAI as part of a phased democratization of the multimodal capabilities introduced in 2023, and it coincided with rising competitive pressure from Midjourney v6 and Google's Imagen 2.[^25]
On February 6, 2024 OpenAI announced that images generated by DALL-E 3, both through ChatGPT and through the API, would carry content-credentials metadata following the C2PA (Coalition for Content Provenance and Authenticity) standard, including a visible "CR" logo and embedded cryptographic provenance information.[^12][^13] OpenAI joined the C2PA Steering Committee in May 2024.[^30] OpenAI later extended its multi-layered approach by adopting Google DeepMind's SynthID for invisible pixel-level watermarking on later image generators.[^15][^31] Separately, in November 2023 OpenAI had released the consistency decoder used to convert DALL-E 3 latents to pixels as open source under an MIT license; the decoder was distilled from a diffusion-based decoder down to two sampling steps using consistency distillation and could be used as a drop-in replacement for the Stable Diffusion v1 VAE decoder.[^32][^33]
OpenAI's GPT-4o image generation system, announced on March 25, 2025 and exposed in the API as the model gpt-image-1 on April 23, 2025, gradually replaced DALL-E 3 as the default image generator in ChatGPT.[^7][^8] OpenAI stated that the new model used the GPT-4o architecture rather than the diffusion stack underlying DALL-E 3, allowing it to take both text and reference images as input within the same autoregressive image-generation process and to produce edited outputs that preserved scene identity across turns.[^7] The image-generation rollout drew unusually heavy traffic; OpenAI reported that more than 130 million users created over 700 million images in the first week, and Sam Altman publicly asked users to slow down because the company's GPUs were "melting".[^7][^34] A viral trend in which users transformed personal photos into Studio Ghibli-style images contributed to the load and drew renewed attention to copyright questions surrounding the training data.[^34]
On November 14, 2025 OpenAI announced that the dall-e-2 and dall-e-3 API model snapshots would be deprecated and removed from the API on May 12, 2026, directing developers to gpt-image-1 and successors such as gpt-image-1-mini (October 2025) and gpt-image-1.5 (December 2025).[^9][^16] In December 2025 OpenAI silently changed the default image model behind ChatGPT's /image command from DALL-E 3 to gpt-image-1.5, completing the consumer-side transition.[^9][^16][^35] Developers in OpenAI's community forums responded with petitions to keep DALL-E 3 alive, arguing that the diffusion-based output had a distinct aesthetic that the newer autoregressive model did not reproduce.[^36]
The publicly released OpenAI technical report associated with DALL-E 3, Improving Image Generation with Better Captions, by James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh, focuses almost exclusively on the training-data side of the pipeline rather than on the diffusion model architecture itself.[^4] The report's central claim is that prompt-following ability is bottlenecked by the noisy, short, and often inaccurate alt-text and web captions that have historically been used to train text-to-image systems, and that this can be addressed by training a bespoke image captioner and using it to recaption the training set with synthetic captions.[^4]
Concretely, the authors train an image captioner (described in the report as a CLIP-conditioned language model) and fine-tune it through two stages, producing "short synthetic captions" (SSC) and "descriptive synthetic captions" (DSC). The SSC pass is trained to produce single-sentence captions resembling typical web alt-text describing the main subject only, while the DSC pass is trained on a smaller set of human-written long captions that describe foregrounds, backgrounds, lighting, object counts, spatial relations, and on-image text.[^4][^37] They then train three text-to-image models on different mixtures of ground-truth and synthetic captions and use a CLIP-based evaluation to compare prompt following.[^4] The report shows monotonic gains in CLIP score as the synthetic-caption ratio increases, with the best results obtained at a 95% synthetic / 5% ground-truth mix; the DALL-E 3 production model is then trained on this mixture.[^4][^37] The ground-truth fraction is included as a regularization device: it exposes the model to the surface statistics of real-world prompts (typical lengths, capitalization, and punctuation patterns) so the trained system does not silently expect every input to look like a DSC paragraph.[^4][^37]
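The 95% synthetic / 5% ground-truth mix described above can be illustrated with a minimal sketch. It assumes the blend is applied independently per training sample, which is an assumption of this example. The function names and the record layout are hypothetical and do not come from the report.

```python
import random
from typing import Iterable, List, Optional, Tuple


def pick_caption(ground_truth: str, synthetic: str,
                 synthetic_ratio: float = 0.95,
                 rng: Optional[random.Random] = None) -> str:
    """Return the synthetic caption with probability `synthetic_ratio`,
    otherwise fall back to the ground-truth (web) caption."""
    rng = rng or random.Random()
    return synthetic if rng.random() < synthetic_ratio else ground_truth


def blend_captions(records: Iterable[Tuple[str, str, str]],
                   synthetic_ratio: float = 0.95,
                   seed: int = 0) -> List[Tuple[str, str]]:
    """Map (image_id, ground_truth, synthetic) rows to (image_id, chosen_caption).

    The record layout is an assumption made for illustration only.
    """
    rng = random.Random(seed)  # seeded so the mix is reproducible
    return [(image_id, pick_caption(gt, syn, synthetic_ratio, rng))
            for image_id, gt, syn in records]
```

Setting `synthetic_ratio` to 0.0 or 1.0 reproduces the two extremes of the ablation (ground-truth only and synthetic only).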
A key ablation in the paper compares image generation quality at the same training compute when only the captions are changed: models trained on the DSC mix score substantially higher on both prompt-following (CLIP-based) and human-preference evaluations than identical models trained on the same images with ground-truth web captions.[^4] The authors argue this strongly suggests that, at frontier scale, caption quality rather than image quality is now the bottleneck for prompt following, an observation that has shaped subsequent text-to-image research.[^4][^22] The paper deliberately does not disclose model architecture details, dataset composition, training compute, or model size, noting that this information has been withheld for competitive and safety reasons.[^4]
The paper itself acknowledges several limitations that survive into the production model. First, prompt following degrades when user prompts are shorter than typical DSCs, motivating the upstream ChatGPT rewriter described below.[^4] Second, the model has a documented deficit on spatial reasoning ("the cup is to the left of the plate, below the painting"), with the authors attributing this to a known weakness in the underlying CLIP-based captioner: it tends to hallucinate plausible spatial relations rather than to faithfully record the ones in the image.[^4] Third, although text rendering is dramatically better than DALL-E 2 for short strings, longer text and unusual fonts still fail.[^4] Fourth, the captioner exhibits domain-specific hallucinations (in the paper's own words, inventing botanical or ornithological details when training images are ambiguous), which can be passed forward into the trained generator.[^4][^37]
OpenAI has not published the architecture of the DALL-E 3 image generator. Public materials describe it as a latent diffusion model with a transformer-based text encoder rather than the autoregressive token model used in the original DALL-E (2021); this contrasts with DALL-E 2, whose architecture was disclosed in a 2022 OpenAI paper.[^1][^4][^26] The Betker et al. report includes ablation studies on the captioner and on the text-to-image training mix but stops short of describing the denoising network, sampler, latent space, or text encoder of the production model.[^4]
Third-party reverse-engineering and inference about the architecture has converged on a high-level picture: a variational autoencoder (VAE) that performs 8x spatial downsampling, mapping 256-pixel image regions to 32x32 latent grids; a text encoder (commonly believed to be a T5-class transformer based on resemblance to Google's Imagen architecture); and a U-Net or transformer denoiser conditioned on the encoded text via cross-attention or GroupNorm injection.[^38] None of these details are confirmed by OpenAI. The only model component released in full is the consistency decoder, which OpenAI open-sourced on November 6, 2023; the decoder reuses the Stable Diffusion v1 latent space and was trained as a diffusion-based decoder, then distilled to two sampling steps via consistency distillation.[^32][^33]
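The downsampling arithmetic in that third-party picture is easy to check. The sketch below assumes the unconfirmed 8x factor. The helper name `latent_grid` is hypothetical, and applying it to the API output sizes is an extrapolation, not something OpenAI has documented.

```python
from typing import Tuple


def latent_grid(height_px: int, width_px: int, downsample: int = 8) -> Tuple[int, int]:
    """Return the (rows, cols) of the latent grid for an image of the given size.

    `downsample` is the spatial reduction factor of the autoencoder; 8 is the
    value reported by third parties, not a figure confirmed by OpenAI.
    """
    if height_px % downsample or width_px % downsample:
        raise ValueError("image dimensions must be multiples of the downsample factor")
    return height_px // downsample, width_px // downsample


print(latent_grid(256, 256))    # (32, 32), matching the figure quoted above
print(latent_grid(1024, 1792))  # (128, 224) under the same assumption
```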
Because the synthetic captions used to train DALL-E 3 are long and descriptive, short user prompts often underperform. OpenAI's solution is a server-side prompt rewriting layer: when a user asks ChatGPT to "draw" or "create" an image, the chat model is instructed to expand the request into a much more detailed image prompt (typically a paragraph) before invoking the image tool, and to return both the rewritten prompt and the generated image to the user.[^1][^17] The same rewriting also runs by default through the API: developers calling images.generate with model="dall-e-3" receive a revised_prompt field in the response describing what was actually sent to the model.[^11][^17]
The system prompt that ChatGPT uses to perform this rewrite was reverse-engineered by users and published on community sites in late 2023.[^39][^40] According to those reconstructions, ChatGPT is instructed to generate exactly four images per request when no explicit count is provided, to limit the rewritten prompt to roughly 4000 characters, to avoid copyrighted artist names from the last 100 years (substituting three style adjectives instead), to avoid named living people including politicians and celebrities, to phrase prompts as a "detailed scene description" rather than a list of tags, and to translate any non-English input into English before rewriting.[^39][^40] The system prompt also imposes an explicit diversity requirement: when generating images of people, the rewriter is told to specify both descent and gender for each person using direct terms.[^39][^40] OpenAI never officially released the system prompt, but the existence of such instructions is consistent with the observable behavior of the ChatGPT image tool and with the contents of the revised_prompt returned by the API.[^11][^17]
Rewriting cannot be globally disabled on the API; OpenAI's developer documentation recommends adding a literal instruction such as "Do NOT add any detail; just use the prompt AS-IS" to opt out, although in practice users have reported only partial success.[^17] An academic study by Jahani et al. at UC Berkeley, conducted as an online experiment with 1,891 participants randomly assigned to use DALL-E 2, DALL-E 3, or DALL-E 3 with automatic prompt revision, found that LLM-based rewriting reduced the prompt-following gap between DALL-E 3 and DALL-E 2 by approximately 58%, indicating that much of the qualitative improvement is attributable to the rewriting layer rather than to the image model alone.[^17] A separate analysis from The Decoder characterized the rewriting layer as serving a covert moderation role: by transforming prompts that might violate policy into compliant variants, ChatGPT effectively functioned as a safety filter in front of the image model.[^41]
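A minimal sketch of the opt-out workaround follows. It uses the instruction text quoted above. The helper names and the separator between the instruction and the prompt are assumptions of this example, and, as noted, the workaround is reported to succeed only partially.

```python
from typing import Optional

# Instruction text as quoted in this article; treat the exact wording as an assumption.
AS_IS_INSTRUCTION = "Do NOT add any detail; just use the prompt AS-IS"


def as_is_prompt(user_prompt: str) -> str:
    """Prefix the user's prompt with the opt-out instruction (hypothetical helper)."""
    return f"{AS_IS_INSTRUCTION}: {user_prompt.strip()}"


def was_rewritten(user_prompt: str, revised_prompt: Optional[str]) -> bool:
    """Compare the original prompt with the API's revised_prompt field,
    ignoring case and whitespace differences."""
    if revised_prompt is None:
        return False

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(user_prompt) != normalize(revised_prompt)
```

Checking `was_rewritten` against the returned `revised_prompt` lets an application detect when the instruction was ignored and decide whether to surface the rewritten text.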
In paired human evaluations reported in the OpenAI blog and the Betker et al. technical report, DALL-E 3 outperforms DALL-E 2 and Stable Diffusion XL on prompt following: raters preferred DALL-E 3 outputs in roughly 70% of comparisons on a curated prompt set, and DALL-E 3 was rated most realistic on a random sample of 250 MSCOCO captions.[^1][^4][^37] The paper benchmarks against Midjourney v5.2 and SDXL 1.0 and reports DALL-E 3 winning on prompt adherence, style consistency, and overall coherence across human raters.[^4][^37] The report also documents qualitative improvements on long compositional prompts ("a stop sign in front of a building, with the words 'STOP HALLUCINATING' written on it"), where prior models routinely dropped objects or mangled spelled-out text.[^4]
For automated evaluation, the report uses CLIP-based prompt-image alignment scores and an internally developed DrawBench-style benchmark. Although the absolute numbers reported in the paper are not directly comparable to subsequent third-party studies (because OpenAI did not release the eval prompts or the captioner), independent reviewers using their own prompt sets in 2024 consistently confirmed the relative ordering: DALL-E 3 placed first on prompt-following metrics, with Midjourney v6 and later open-source models such as FLUX.1 closing the gap by mid-2024.[^21][^22] Empirical reviews placed DALL-E 3's in-image text accuracy near 95% for short strings, well above SDXL 1.0 and ahead of Midjourney v6 at release, although gpt-image-1 surpassed all of them on the same metric by April 2025.[^42][^7]
dall-e-3 is invoked through the OpenAI Images API at POST /v1/images/generations. Required and supported parameters at launch included:[^11]
| Parameter | Allowed values | Notes |
|---|---|---|
| model | dall-e-3 | |
| prompt | up to ~4000 characters | |
| n | 1 | Multi-image generation not supported in a single call.[^11] |
| size | 1024x1024, 1024x1792, 1792x1024 | |
| quality | standard, hd | hd costs more per image: 2x standard at 1024x1024, 1.5x at the larger sizes.[^11][^18] |
| style | vivid (default), natural | vivid produces hyper-real images; natural is less stylized.[^11] |
| response_format | url, b64_json | |
Unlike dall-e-2, the dall-e-3 endpoint did not support image editing (inpainting), image variations, or batched multi-image returns; the /v1/images/edits and /v1/images/variations endpoints remained DALL-E 2 only.[^24][^43] OpenAI's ChatGPT web interface did expose an inpainting "select an area" tool for DALL-E 3 outputs, but that capability was not exposed in the API.[^43][^44]
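The parameter constraints listed above can be checked on the client before a request is sent. The sketch below is a hypothetical helper, not part of the OpenAI SDK. It uses the ~4000-character limit from the table as a hard cutoff, which is an approximation.

```python
from typing import Dict, List

# Allowed values taken from the parameter table above.
ALLOWED_VALUES = {
    "size": {"1024x1024", "1024x1792", "1792x1024"},
    "quality": {"standard", "hd"},
    "style": {"vivid", "natural"},
    "response_format": {"url", "b64_json"},
}
MAX_PROMPT_CHARS = 4000  # approximate limit; treated as exact here for simplicity


def validate_dalle3_request(params: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the request looks valid."""
    problems = []
    if params.get("model") != "dall-e-3":
        problems.append("model must be 'dall-e-3'")
    prompt = params.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        problems.append("prompt is required")
    elif len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if params.get("n", 1) != 1:
        problems.append("n must be 1 for dall-e-3")
    for key, allowed in ALLOWED_VALUES.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems
```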
OpenAI listed per-image pricing for dall-e-3 on its API pricing page. The published rates were:[^18]
| Quality | 1024x1024 | 1024x1792 or 1792x1024 |
|---|---|---|
| Standard | $0.040 | $0.080 |
| HD | $0.080 | $0.120 |
Image generation is billed per request rather than per token; the prompt and revised prompt are not separately metered.[^18]
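Because billing is per image, a job's cost follows directly from the price table. The following sketch hard-codes the rates listed above. The helper name `estimate_cost` is hypothetical, and the rates should be checked against the current pricing page before use.

```python
from decimal import Decimal

# USD per image, copied from the pricing table above.
PRICE_PER_IMAGE = {
    ("standard", "1024x1024"): Decimal("0.040"),
    ("standard", "1024x1792"): Decimal("0.080"),
    ("standard", "1792x1024"): Decimal("0.080"),
    ("hd", "1024x1024"): Decimal("0.080"),
    ("hd", "1024x1792"): Decimal("0.120"),
    ("hd", "1792x1024"): Decimal("0.120"),
}


def estimate_cost(images: int, quality: str = "standard", size: str = "1024x1024") -> Decimal:
    """Return the estimated cost in USD for `images` generations at one quality/size."""
    if images < 0:
        raise ValueError("images must be non-negative")
    try:
        unit_price = PRICE_PER_IMAGE[(quality, size)]
    except KeyError:
        raise ValueError(f"no published price for quality={quality!r}, size={size!r}")
    return unit_price * images


print(estimate_cost(250, "hd", "1024x1792"))  # 30.000
```

`Decimal` is used instead of `float` so that totals do not pick up binary rounding errors.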
OpenAI's help-center article on image rate limits specifies that the default dall-e-3 rate limit on the standard usage tier is 7 images per minute, with tier-based increases for accounts in higher usage tiers and the ability to request quota increases by contacting OpenAI support.[^19] The 7-image-per-minute throttle, combined with the inability to batch multiple images in a single call (n is fixed at 1 for dall-e-3), made the model unsuitable for high-throughput batch generation pipelines and pushed many developers towards Stability AI's hosted SDXL endpoints or, after 2024, Black Forest Labs' FLUX models, both of which allowed batched calls and substantially higher per-account throughput at comparable price points.[^18][^21] Free-tier ChatGPT users were further limited to two images per day after the August 2024 expansion, while ChatGPT Plus subscribers received roughly one image per minute through the chat interface.[^25]
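A client can stay under a per-minute cap like this with a small sliding-window limiter. The sketch below assumes the 7-images-per-minute default quoted above and a fixed 60-second window. The class is hypothetical and independent of the OpenAI SDK, and the server's actual accounting may differ.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Track recent calls and report how long to wait before the next one."""

    def __init__(self, max_calls: int = 7, period_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period_s = period_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait until another call fits inside the window (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period_s:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period_s - (now - self._stamps[0])

    def record(self) -> None:
        """Register that a call was just made."""
        self._stamps.append(self.clock())
```

In use, a loop would call `time.sleep(limiter.wait_time())` before each generation request and `limiter.record()` immediately after it. At 7 images per minute and one image per call, 100 images take at least about 14 minutes.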
A typical Python call using the OpenAI Python SDK looked like the following (paraphrased from the OpenAI cookbook):[^14]
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.images.generate(
    model="dall-e-3",
    prompt="A poster for a chess tournament in a 1960s Saul Bass style",
    size="1024x1792",
    quality="hd",
    style="vivid",
    n=1,  # dall-e-3 accepts only n=1
)

image_url = response.data[0].url  # temporary URL of the generated image
revised = response.data[0].revised_prompt  # prompt after server-side rewriting
Developers were responsible for downloading the URL within the response TTL (1 hour) and for surfacing or suppressing the revised_prompt field as appropriate for their application.[^11][^14]
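One way to avoid depending on an expiring URL is to request `response_format="b64_json"` and write the decoded bytes to disk immediately. The sketch below shows both paths using only the standard library. The helper names are hypothetical, and the download helper assumes the URL is still within its validity window.

```python
import base64
import urllib.request
from pathlib import Path
from typing import Union


def save_b64_image(b64_json: str, destination: Union[str, Path]) -> Path:
    """Decode a base64 image payload (the b64_json response field) and write it to disk."""
    path = Path(destination)
    path.write_bytes(base64.b64decode(b64_json))
    return path


def download_image(url: str, destination: Union[str, Path], timeout_s: float = 30.0) -> Path:
    """Fetch a generated image from its temporary URL before the link expires."""
    path = Path(destination)
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        path.write_bytes(response.read())
    return path
```

Either helper would be called right after the generation request returns, so that the application stores its own copy of the image and does not keep the temporary link.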
OpenAI published a separate DALL-E 3 system card on October 3, 2023, documenting the safety stack and the external red-teaming process used to evaluate the model before release.[^45] The system card describes a multi-tiered safety system covering training-data filtration, prompt classifiers, output classifiers, and policy-driven refusals. Red teaming was carried out by both internal and external experts and probed four "dual-use" risk domains aligned with CBRN (chemical, biological, radiological, nuclear) categories, focusing on whether image generation could meaningfully accelerate harmful workflows; OpenAI concluded that DALL-E 3 did not provide a substantive uplift in these domains.[^45] The red team work also targeted graphic content, sexual imagery, and the model's ability to be coaxed into producing misleading or impersonation images of named individuals.[^45]
The system card also discloses that DALL-E 3 inherits known biases from web-trained image data, including a default tendency to produce images of "individuals who appear White, female, and youthful" in the absence of explicit demographic prompting, and notes that the ChatGPT prompt rewriter (which forces gender and descent attributes) was added in part to counteract this default.[^45][^46]
The DALL-E 3 system card and OpenAI's launch documentation describe a layered safety stack. Training data was filtered to remove explicit content and to reduce the prevalence of violent and graphic imagery; prompt-side classifiers refuse requests that target named living artists or that depict public figures by name; and output classifiers screen generated images before they are returned to the user.[^1][^2][^45] The classifier architecture combines a frozen CLIP image encoder for feature extraction with a small auxiliary model for safety score prediction, and the same classifier family is used for both pre-generation prompt screening and post-generation image screening.[^45]
Beginning February 6, 2024, OpenAI added C2PA "Content Credentials" metadata to every image returned by DALL-E 3 through both ChatGPT and the API.[^12][^13] The metadata is signed and lists the originating model and the generation timestamp; OpenAI also adds a small visible "CR" badge in the upper left corner of ChatGPT-served images.[^12][^13] OpenAI has cautioned that C2PA metadata is "not a silver bullet": the data is preserved by compliant tools but can be stripped by simple operations such as screenshotting or re-saving an image.[^13][^30]
OpenAI complemented C2PA with two additional measures. First, in May 2024 the company announced it had joined the C2PA Steering Committee and would adopt DeepMind's SynthID watermarking for invisible pixel-level marks; SynthID is designed to survive compression, resizing, cropping, and other common transformations that strip C2PA metadata.[^15][^31] Second, in May 2024 OpenAI announced a DALL-E 3 detection classifier, offered through a research access program, that in internal tests correctly identified roughly 98% of DALL-E 3 images and produced false positives on under 0.5% of human-created images, but misclassified roughly 5-10% of images generated by non-OpenAI models.[^15][^27]
When Microsoft integrated DALL-E 3 into Bing Image Creator and Copilot in late 2023, it added an additional safety layer of prompt and output filters that, in some categories, are stricter than the OpenAI defaults.[^5][^6] In late January 2024, sexually explicit non-consensual deepfake images of the musician Taylor Swift circulated widely on X and 4chan; investigative reporting by 404 Media traced the origin to a Telegram community using Microsoft Designer (and indirectly DALL-E 3 through Designer's image generator).[^20][^47] The bypass relied on a prompt-rewriting loophole in which users inserted descriptors between first and last names (for example "taylor 'singer' swift" or "jennifer 'actor' aniston") to evade the name-aware filter.[^47] Microsoft responded by tightening Designer's filters on January 29, 2024, closing the loophole; CEO Satya Nadella publicly committed to "move fast" on guardrails in a January 26 interview with NBC News.[^20][^47]
Separately, a Microsoft AI engineering leader named Shane Jones disclosed in late January 2024 that he had identified safety vulnerabilities in DALL-E 3 in early December 2023, including methods to bypass content filters to produce violent or explicit images, and that Microsoft's legal department had instructed him to delete a December 14, 2023 LinkedIn open letter calling on OpenAI to suspend the model.[^20][^48] Jones subsequently wrote letters to U.S. Senators Patty Murray and Maria Cantwell, Rep. Adam Smith, and Washington State Attorney General Bob Ferguson asking that DALL-E 3 be removed from public use until the issues were addressed.[^48] OpenAI responded that the techniques described did not in fact bypass its safety systems, and Microsoft tightened Designer's filters in the wake of the Taylor Swift incident.[^20][^48]
DALL-E 3 was released into a crowded text-to-image market. The most direct contemporaries were Midjourney v6 (released as an alpha in December 2023), Stability AI's SDXL (released July 2023), and, during 2024, Stable Diffusion 3 and Google's Imagen 3.[^21][^49] Independent comparisons in 2024 and 2025 generally agreed on the following picture:
| Capability | DALL-E 3 | Midjourney v6 | SDXL 1.0 |
|---|---|---|---|
| Prompt following on long prompts | Strongest of the three[^21] | Improved over v5 but weaker than DALL-E 3[^21] | Weak without prompt-engineering scaffolding[^21] |
| Text-in-image rendering | Reliable for short strings; ~95% accuracy[^1][^42] | Improved in v6; less reliable than DALL-E 3[^21] | Generally poor without LoRAs[^21] |
| Aesthetic / photorealism | Strong but often visibly "AI" smoothed[^21] | Strongest aesthetic in 2024 community reviews[^21] | Tunable via fine-tunes and ControlNet[^21] |
| Customization | None (closed model)[^11] | None (closed) but supports --sref style references[^21] | Open weights; supports LoRA / diffusion tooling[^21] |
| Interface | ChatGPT, Bing/Copilot, API[^10][^5] | Discord and web app[^21] | Local, third-party UIs, hosted APIs[^21] |
| Cost per image | $0.04-$0.12 via API[^18] | Subscription ($10-$120/month)[^21] | Free if self-hosted; metered on hosts[^21] |
Google's Imagen 3, released in 2024, surpassed DALL-E 3 on several public benchmarks, achieving a +114 Elo gap over the second-best model on prompt-adherence ratings on Google's internal DrawBench-style evaluation and a 63% win rate against the runner-up; it lost to Midjourney v6 on raw aesthetic appeal in the same study.[^49][^50] On the DOCCI benchmark (photographs with detailed 136-word captions), Imagen 3 outperformed both Midjourney v6 and DALL-E 3.[^49][^50] Stable Diffusion 3, introduced in February 2024 and released in June 2024 with the MMDiT (Multimodal Diffusion Transformer) architecture, also closed the gap on text rendering and prompt adherence and was available with open weights.[^51]
Industry reviews in 2024-2025 frequently recommended using the systems together: DALL-E 3 for tasks demanding faithful prompt adherence (signage, layouts with specified text), Midjourney v6 for aesthetic-led illustration, and SDXL or Stable Diffusion 3 when controllability and custom fine-tunes mattered.[^21][^42]
DALL-E 3 mattered for three reasons that go beyond image quality. First, it normalized the pattern of pairing a frontier text-to-image model with an LLM-driven prompt rewriter, which has since become standard practice in Imagen 3, FLUX.1, and most commercial image APIs.[^7][^21] Second, the Better Captions paper popularized the use of LLM-generated synthetic captions as a training-data lever, an approach that has been adopted (and credited) in subsequent open-source image-model papers; Google's Imagen 3 paper, for example, explicitly cites Betker et al. as motivation for its own use of synthetic captions.[^4][^22][^49] Third, by routing image generation through ChatGPT, DALL-E 3 was the first widely used image generator whose primary interface was a conversation rather than a prompt box, foreshadowing the multi-turn, refinement-style interface that GPT-4o native image generation made canonical in 2025.[^7]
Beyond its technical impact, DALL-E 3's release accelerated industry-wide consolidation around two strategic patterns. The first was bundling image generation into general-purpose chat assistants rather than offering it as a standalone product: in the year after DALL-E 3, Google added image generation to Gemini, Anthropic shipped image input (but not output) in Claude, and xAI added image generation to Grok.[^21] The second was the adoption of C2PA Content Credentials as the de facto provenance standard, which Adobe, Microsoft, Google, and OpenAI all adopted within 18 months of DALL-E 3's launch.[^12][^15][^30] DALL-E 3's role as the image generator behind Microsoft Copilot also made it the first AI image system embedded into a major operating system, with Microsoft announcing 150 new Copilot features in Windows 11 at its September 21, 2023 rebrand event.[^29]
Because DALL-E 3 ran behind ChatGPT's rewriter, users frequently complained that they could not get the exact prompt they wrote into the image model. The rewriter often added unrequested details (lighting, mood, ethnicity) and sometimes dropped key constraints, particularly negation ("no humans in the image").[^17][^41] OpenAI added partial workarounds in API documentation (the literal "use the prompt AS-IS" preamble) but never exposed a true disable flag.[^17][^14] The system prompt's hard-coded diversity instructions also produced unexpected outputs (for example, racially diverse historical figures when historical accuracy was the user's intent), echoing a similar controversy that affected Google's Gemini image generator in February 2024.[^39][^46]
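The preamble workaround can be sketched as a small request builder. This is only an illustration under stated assumptions: `build_request` and `AS_IS_PREAMBLE` are hypothetical names, the preamble wording is recalled from the documented workaround and may not match it exactly, and the function only assembles parameters (model id, sizes, quality and style values as listed in this article) without calling any API.

```python
# Hypothetical helper: assembles parameters for a dall-e-3 image request.
# It does not contact any service; passing the result to a client is left
# to the caller.

# Assumed wording of the "use it AS-IS" preamble; treat as illustrative.
AS_IS_PREAMBLE = (
    "I NEED to test how the tool works with extremely simple prompts. "
    "DO NOT add any detail, just use it AS-IS: "
)

# Option values as listed in this article's summary table.
ALLOWED_SIZES = {"1024x1024", "1024x1792", "1792x1024"}
ALLOWED_QUALITY = {"standard", "hd"}
ALLOWED_STYLE = {"vivid", "natural"}


def build_request(prompt, size="1024x1024", quality="standard",
                  style="natural", as_is=True):
    """Return a dict of request parameters, optionally prefixing the preamble."""
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"unsupported quality: {quality}")
    if style not in ALLOWED_STYLE:
        raise ValueError(f"unsupported style: {style}")
    # The preamble is a request to the rewriter, not a disable flag,
    # so the prompt may still be modified downstream.
    final_prompt = AS_IS_PREAMBLE + prompt if as_is else prompt
    return {
        "model": "dall-e-3",
        "prompt": final_prompt,
        "size": size,
        "quality": quality,
        "style": style,
        "n": 1,  # one image per request
    }


if __name__ == "__main__":
    params = build_request("a red cube on a white table, no humans")
    print(params["prompt"])
```

Because the preamble only asks the rewriter to leave the prompt alone, a caller would still want to compare the submitted prompt with whatever revised prompt the service reports back.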
Neither DALL-E 3 weights nor the bespoke captioner used for training were released. The Betker et al. report withholds dataset composition, model size, training cost, sampler configuration, and most architectural details, citing competitive and safety concerns.[^4] This contrasts with SDXL and FLUX.1, whose weights were openly released, and complicates third-party safety auditing.[^4][^21] The consistency decoder is the only component of the DALL-E 3 stack that OpenAI open-sourced.[^32][^33]
C2PA Content Credentials embedded in DALL-E 3 images survive only round-trips through C2PA-aware tools; screenshots, common social-media uploads, and most re-encoders strip the metadata, limiting its real-world utility as an authentication signal.[^13][^15][^30] OpenAI's own DALL-E 3 detection classifier reportedly degrades on cross-model images (with 5-10% misclassification when shown outputs from other generators) and remained available only through an access program rather than as a public tool.[^15][^27]
The Taylor Swift Designer/Copilot deepfake episode in January 2024 exposed that even policy-compliant generators, when chained behind a permissive front-end with name-aware prompt rewriting, can be coaxed into producing non-consensual sexual imagery of named individuals through prompt obfuscation (for example, by inserting a fictional profession between first and last name).[^20][^47] Shane Jones's December 2023 disclosure indicated that Microsoft had been warned about such vulnerabilities months earlier, and his subsequent letters to U.S. senators and the Washington Attorney General called for DALL-E 3 to be pulled from public use.[^20][^48] OpenAI maintained that the specific bypass paths described did not exceed its policy thresholds, and Microsoft updated Designer's filters in late January 2024.[^20][^47]
OpenAI was not named in the original Andersen v. Stability AI class action filed in January 2023, which targeted Stability AI, Midjourney, and DeviantArt over the use of the LAION-5B dataset, in part because OpenAI had not disclosed the contents of its DALL-E training data.[^52][^53] The opt-out form announced alongside DALL-E 3 in September 2023 was the first concrete artist-side mechanism that OpenAI offered for image training, although it covered only future models and required artists to identify their work in submitted images.[^2][^53] Following the GPT-4o image generation launch in March 2025 and the viral Studio Ghibli style trend that produced 700 million images in a week, copyright lawyers noted that DALL-E 3 and gpt-image-1 had both clearly been trained on Ghibli, Pixar, and Disney imagery without licenses, but the legal status of style imitation under U.S. copyright law remained unsettled.[^34][^54] In the broader New York Times v. OpenAI lawsuit filed December 27, 2023, the Times specifically cited DALL-E 3's ability to reproduce or paraphrase Times-style content as part of its complaint against OpenAI and Microsoft.[^55]
Reviewers consistently flagged a recognizable "DALL-E 3 look": soft, painterly, heavily saturated, with smoothed faces. This was attributed to the dominance of descriptive synthetic captions in the training set and to the default vivid style; switching to natural partially mitigated it but never matched Midjourney v6's aesthetic variety.[^21] The constraint that DALL-E 3 would refuse to generate images in the style of named living artists, while reasonable as a policy choice, also narrowed the achievable style space relative to Stable Diffusion fine-tunes that had no such restriction.[^2][^21]
Although DALL-E 3 markedly improved on DALL-E 2 in rendering hands and faces, third-party testing in 2024 and 2025 documented continuing failure modes.[^42][^56] Hand-object interactions remained error-prone in busy scenes, the model occasionally produced bodies with mismatched orientations (head pointing one way, torso another), and on-image text longer than a short phrase still frequently degenerated into pseudo-typography or duplicated characters.[^42][^56] One forensic review attributed the residual hand problem to a training-data imbalance: hands occupy fewer pixels per image than faces on average, so the model received less effective supervision on hand geometry.[^56]
Per-image latency for DALL-E 3 in hd mode was widely reported in the 10-30 second range, slower per image than Midjourney v6 (typically 30-60 seconds for a 4-image grid, or roughly 8-15 seconds per image) and far slower than local SDXL inference on consumer GPUs (1-3 seconds per image with optimized samplers).[^21][^42] Combined with the 7-image-per-minute API rate limit, this made DALL-E 3 a poor fit for use cases requiring large image volumes, such as synthetic dataset construction for training other models, programmatic ad-creative generation, or game-asset pipelines.[^19][^21]
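The scale of the problem can be illustrated with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted in this article (a 7-image-per-minute limit, 10-30 seconds per image, and $0.04-$0.12 per image); `estimate_batch` is a hypothetical helper, and actual limits and prices may have varied by account tier and over time.

```python
def estimate_batch(n_images, rate_limit_per_min=7.0,
                   latency_s=(10.0, 30.0), price_usd=(0.04, 0.12)):
    """Rough wall-clock and cost bounds for generating n_images.

    Defaults are the figures quoted in the article, not measured values.
    """
    # Best case with unlimited concurrency: the rate limit is the ceiling.
    rate_limited_hours = n_images / rate_limit_per_min / 60.0
    # One request at a time: per-image latency dominates instead.
    sequential_hours = tuple(n_images * s / 3600.0 for s in latency_s)
    # Total spend at the low and high per-image prices.
    cost_usd = tuple(n_images * p for p in price_usd)
    return {
        "rate_limited_hours": rate_limited_hours,
        "sequential_hours": sequential_hours,
        "cost_usd": cost_usd,
    }


if __name__ == "__main__":
    est = estimate_batch(100_000)
    print(f"rate-limited: {est['rate_limited_hours']:.1f} h")
    lo, hi = est["sequential_hours"]
    print(f"sequential:   {lo:.1f}-{hi:.1f} h")
    lo, hi = est["cost_usd"]
    print(f"cost:         ${lo:,.0f}-${hi:,.0f}")
```

Under these assumptions, a 100,000-image dataset would take roughly 238 hours (about ten days) even when requests saturate the rate limit, at a cost of about $4,000 to $12,000.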
DALL-E 3 sits in a lineage that began with OpenAI's original DALL-E (2021), an autoregressive transformer over discrete image tokens, and continued with DALL-E 2 (2022), a CLIP-conditioned diffusion model with the unCLIP architecture.[^1][^26] Adjacent commercial systems of the same era include Google's Imagen family and later Imagen 3 (which adopted similar caption-augmentation tactics), Stability AI's Stable Diffusion line including SDXL and Stable Diffusion 3, and Black Forest Labs' FLUX.1 models.[^4][^21][^49][^51] On the OpenAI roadmap, DALL-E 3 was eventually superseded by GPT-4o native image generation in 2025, exposed in the API as gpt-image-1.[^7][^8] Related research strands include caption augmentation, prompt engineering for image generation, and image provenance via the C2PA / content credentials specification.[^22][^12][^15]