GLIDE (OpenAI)
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,257 words
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a text-conditional diffusion model for image synthesis and editing developed by OpenAI. Introduced in December 2021, GLIDE generates photorealistic images from natural-language captions and can edit existing images through text-guided inpainting. Its base model contains roughly 3.5 billion parameters, and in human evaluations its outputs were preferred to those of OpenAI's earlier DALL-E system. GLIDE is widely regarded as a direct technical precursor to DALL-E 2, whose image decoder was adapted from it. [1][2]
GLIDE was described in the paper "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models," first posted to arXiv (arXiv:2112.10741) on 20 December 2021, with a revised version released on 8 March 2022. The authors are Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen, all affiliated with OpenAI. [1]
The work built on a line of OpenAI research connecting diffusion-based generative modeling and language-conditioned image synthesis. It followed earlier diffusion papers by Nichol and Dhariwal, which had shown that diffusion models could rival generative adversarial networks on image quality, and the contrastive vision-language model CLIP. It drew on the guidance techniques developed for diffusion models and on the text-conditioned generation that had underpinned the original DALL-E. GLIDE's central contribution was applying guided diffusion to free-form text prompts and demonstrating that the resulting samples could surpass the autoregressive DALL-E model in human preference tests, even when DALL-E used CLIP-based reranking. [1][2]
GLIDE is built on a diffusion model, a class of generative model that learns to reverse a gradual noising process: starting from pure Gaussian noise, the network iteratively denoises the sample over many steps until a coherent image emerges. To condition this process on text, GLIDE encodes the caption with a Transformer and feeds the resulting token representations into the diffusion model's denoising network. [1]
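The noising process that the network learns to reverse can be illustrated with a toy sketch. It assumes the closed-form forward step common in the diffusion literature, `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps`, with a linear variance schedule. The schedule values, the function names, and the three-element "image" are illustrative assumptions, not GLIDE's actual settings, and the trained denoising network is replaced here by the true noise.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from the forward process in closed form; also return the noise."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * v + sigma * e for v, e in zip(x0, eps)]
    return xt, eps

def recover_x0(xt, eps, t, alpha_bars):
    """Invert the forward step given the noise.
    In a real model, a network predicts eps from x_t, t, and the caption."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(v - sigma * e) / signal for v, e in zip(xt, eps)]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # tiny stand-in for image pixels
xt, eps = noise_sample(x0, 999, alpha_bars, rng)
x0_hat = recover_x0(xt, eps, 999, alpha_bars)
```

At the last step almost no signal remains in `xt`, which is why sampling can start from pure Gaussian noise; recovering the image then depends entirely on how well the noise is predicted.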
The full system is composed of two stages. A base text-conditional diffusion model generates images at 64 x 64 resolution and accounts for about 3.5 billion parameters: roughly 2.3 billion in the visual (image) component and roughly 1.2 billion in the text-encoding Transformer, which uses 24 residual blocks at a width of 2048. A separate 1.5 billion parameter upsampling diffusion model then increases the resolution from 64 x 64 to 256 x 256, conditioned on the same caption but using a smaller text encoder (width 1024). The base model was trained for about 2.5 million iterations at a batch size of 2048. [1]
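The figures above can be cross-checked with a few lines of arithmetic. The inputs are the rounded values quoted in the text, so the results are approximate; the variable names are illustrative.

```python
# Rounded parameter counts quoted in the text, in billions.
visual_params_b = 2.3        # image (visual) part of the 64 x 64 base model
text_encoder_params_b = 1.2  # text-encoding Transformer of the base model
upsampler_params_b = 1.5     # separate 64 -> 256 upsampling model

base_params_b = round(visual_params_b + text_encoder_params_b, 1)
pipeline_params_b = round(base_params_b + upsampler_params_b, 1)

# Resolution change handled by the upsampler.
base_res, final_res = 64, 256
side_factor = final_res // base_res  # growth per side
pixel_factor = side_factor ** 2      # growth in pixel count

# Training examples processed by the base model: iterations x batch size.
base_samples_seen = 2_500_000 * 2048

print(base_params_b, pipeline_params_b, side_factor, pixel_factor, base_samples_seen)
```

On these rounded numbers the two stages together hold about 5 billion parameters, the upsampler produces 16 times as many pixels as the base model, and the base model processed roughly 5.1 billion training examples (counting repeats).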
The paper's main experimental focus was comparing two strategies for steering generation toward the text prompt:
| Guidance method | Mechanism | Reported result |
|---|---|---|
| CLIP guidance | Uses a noise-aware ("noised") CLIP model to push samples toward higher image-text similarity during sampling | Effective, but less preferred by evaluators |
| Classifier-free guidance | Trains the diffusion model both with and without the caption, then at sampling time extrapolates from the unconditional prediction toward, and past, the caption-conditioned one | Preferred for both photorealism and caption similarity |
Classifier-free guidance avoids needing a separate classifier or CLIP model at inference and, in GLIDE's evaluations, produced the more photorealistic and caption-faithful images. Using classifier-free guidance, GLIDE reported a zero-shot Fréchet Inception Distance (FID) of 12.24 on MS-COCO at 256 x 256 resolution. In side-by-side human comparisons against DALL-E, GLIDE's samples were preferred about 87% of the time for photorealism and about 69% of the time for caption similarity. [1][2]
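The classifier-free guidance step combines two noise predictions from the same network as `eps_hat = eps_uncond + s * (eps_cond - eps_uncond)`, where `s` is the guidance scale. The sketch below assumes both predictions are already available as plain lists of numbers; `guided_eps` and the sample values are illustrative, not taken from the GLIDE codebase.

```python
def guided_eps(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward the conditional one, overshooting it when scale > 1."""
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [0.5, -1.0, 0.25]    # prediction given the caption (made-up values)
eps_uncond = [0.0, -0.5, 0.25]  # prediction given an empty caption (made-up values)

no_guidance = guided_eps(eps_cond, eps_uncond, 1.0)    # equals eps_cond
unconditional = guided_eps(eps_cond, eps_uncond, 0.0)  # equals eps_uncond
strong = guided_eps(eps_cond, eps_uncond, 3.0)         # pushed past eps_cond
```

A scale of 1 reproduces ordinary conditional sampling. Larger scales amplify whatever the caption changes in the prediction, which tends to improve caption fidelity at the cost of sample diversity.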
GLIDE supports two primary capabilities. The first is text-conditional image generation: given a caption such as a description of an object, scene, or artistic style, the model synthesizes a corresponding image. Because guidance is applied at sampling time, users can trade off diversity against fidelity to the prompt by adjusting the guidance strength. [1]
The second capability is text-driven image editing through inpainting. By fine-tuning the model for inpainting, GLIDE can take an existing image, a masked region, and a text instruction, then fill the masked area with content consistent with both the surrounding image and the prompt. This enables iterative, natural-language editing, for example inserting an object, changing an element's appearance, or extending a scene, while preserving the rest of the image. The paper highlighted this as "powerful text-driven image editing" and presented it as a key practical advantage of the diffusion-based approach. [1]
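A generic way to guarantee that pixels outside the masked region are preserved is to composite the model's prediction with the known image, keeping the prediction only where content is being regenerated. The sketch below shows that compositing step alone, on flat lists of pixel values. The function name and the mask convention (1 = keep original, 0 = regenerate) are assumptions for illustration, and the sketch does not model the fine-tuned network, which also receives the mask and the masked image as inputs.

```python
def composite_known_region(predicted, original, keep_mask):
    """Keep original pixels where keep_mask is 1; use the model's
    prediction where keep_mask is 0 (the region being inpainted)."""
    if not (len(predicted) == len(original) == len(keep_mask)):
        raise ValueError("all inputs must have the same length")
    return [p * (1 - m) + o * m for p, o, m in zip(predicted, original, keep_mask)]

original = [0.25, 0.5, 0.75, 1.0]    # known image pixels (made-up values)
predicted = [-1.0, -0.5, 0.0, 0.5]   # model output for the same pixels
keep_mask = [1, 1, 0, 0]             # first two pixels are known, last two are masked
result = composite_known_region(predicted, original, keep_mask)
```

Here the first two pixels come from `original` and the last two from `predicted`, so an edit can change the masked area without disturbing the rest of the image.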
Citing safety concerns about a powerful, openly available text-to-image model, OpenAI did not release the full 3.5 billion parameter GLIDE model or its weights. Instead, it published a smaller model, commonly called GLIDE (filtered), on GitHub at openai/glide-text2im. This released model has roughly 300 million parameters and was trained on a heavily filtered dataset. [2][3]
The original training corpus comprised several hundred million text-image pairs collected from the internet. For the public release, OpenAI applied filters intended to remove images of people, violent objects, and hate symbols; the model card notes the filtered set contained approximately 67 million text-image pairs. The accompanying noised CLIP model distributed for CLIP-guided sampling was trained on a larger augmented set of about 137 million pairs. OpenAI stated that the filtered model was explicitly not intended to generate images of people or the other categories it had filtered out, and that the data still exhibited biases toward Western-centric concepts. [3]
The public repository provides three example notebooks demonstrating the released model:
| Notebook | Function |
|---|---|
| text2im | Text-conditional generation using classifier-free guidance |
| inpaint | Text-conditional inpainting of masked image regions |
| clip_guided | Text-conditional generation using the filtered noised CLIP model for guidance |
By releasing a deliberately restricted model rather than the full system, OpenAI took an early position on staged, safety-oriented release for generative image models, a stance it would carry into subsequent products. [2][3]
GLIDE was an influential step in the development of modern text-to-image generation. It provided strong evidence that diffusion models, rather than the autoregressive transformer approach used in the original DALL-E, could produce high-quality, caption-faithful images, and it popularized classifier-free guidance as the dominant conditioning technique for such models. The diffusion-plus-upsampler architecture it used became a common template for later text-to-image systems. [1][2]
GLIDE's most direct legacy is its role as the foundation for DALL-E 2, released by OpenAI in April 2022. DALL-E 2's architecture (called unCLIP in its paper) pairs a prior that maps a CLIP text embedding to a CLIP image embedding with a decoder that turns that image embedding into a picture. That decoder is a modified version of GLIDE: it retains GLIDE's diffusion-with-classifier-free-guidance design but is also conditioned on CLIP image embeddings, which are projected and added to the timestep embedding and also projected into extra context tokens appended to the text-encoder outputs, instead of conditioning on text alone. In this sense GLIDE supplied the generative backbone that DALL-E 2 adapted, marking the transition of OpenAI's flagship image system from autoregressive modeling to diffusion. [4][5]