GLIDE (OpenAI)

Updated Jun 25, 2026

GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a text-conditional diffusion model for text-to-image synthesis and editing released by OpenAI in December 2021. GLIDE generates photorealistic images from natural-language captions and can edit existing images through text-guided inpainting. Its 3.5 billion parameter model, using classifier-free guidance, was preferred by human evaluators over OpenAI's earlier 12 billion parameter DALL-E system, and GLIDE is widely regarded as the direct technical precursor to DALL-E 2, whose image decoder was adapted from it. ^[1]^[2]

GLIDE helped establish diffusion models, rather than autoregressive transformers, as the dominant approach to high-fidelity text-to-image generation, and it popularized classifier-free guidance as a standard conditioning technique. ^[1]^[2]

What is GLIDE?

GLIDE was described in the paper "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models," first posted to arXiv (arXiv:2112.10741) on 20 December 2021, with a revised version released on 8 March 2022. The authors are Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen, all affiliated with OpenAI. ^[1]

The paper's abstract states its central result directly: "Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking." ^[1]

The work built on a line of OpenAI research connecting diffusion-based generative modeling and language-conditioned image synthesis. It followed Nichol and Dhariwal's earlier diffusion papers, which had shown that diffusion models could rival generative adversarial networks on image quality, and it drew on the contrastive vision-language model CLIP, on classifier guidance, and on the text-conditioning concept that had underpinned the original DALL-E. GLIDE's central contribution was applying guided diffusion to free-form text prompts and demonstrating that the resulting samples could surpass the autoregressive DALL-E model in human preference tests, even when DALL-E used CLIP-based reranking, despite GLIDE having fewer than one third as many parameters (3.5 billion versus DALL-E's 12 billion). ^[1]^[2]

How does GLIDE work?

GLIDE is built on a diffusion model, a class of generative model that learns to reverse a gradual noising process: starting from pure Gaussian noise, the network iteratively denoises the sample over many steps until a coherent image emerges. To condition this process on text, GLIDE encodes the caption with a Transformer and feeds the resulting token representations into the diffusion model's denoising network. ^[1]
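
As a rough illustration of the noising process described above, the following sketch applies a DDPM-style linear noise schedule to a toy one-dimensional "image". The schedule values, step count, and function names are assumptions chosen for illustration; GLIDE's actual schedule and its learned denoising network are not reproduced here. The sketch shows that a sample noised to step t can be recovered when the added noise is known, which is the quantity a denoising network is trained to estimate.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (illustrative values)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, eps, a_bar):
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

def predict_x0(xt, eps_pred, a_bar):
    """Invert the forward process given a noise estimate."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar) for x, e in zip(xt, eps_pred)]

random.seed(0)
alpha_bar = make_alpha_bar()
x0 = [0.5, -0.25, 1.0, 0.0]                  # toy "image" with four pixels
eps = [random.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise
t = 500
xt = add_noise(x0, eps, alpha_bar[t])

# Here the true noise is passed in; a trained network would supply an estimate instead.
x0_hat = predict_x0(xt, eps, alpha_bar[t])
```

In a real sampler the noise estimate is imperfect, so the model takes many small denoising steps from pure noise rather than jumping to the clean image in one step.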

The full system is composed of two stages. A base text-conditional diffusion model generates images at 64 x 64 resolution and accounts for about 3.5 billion parameters: roughly 2.3 billion in the visual (image) component and roughly 1.2 billion in the text-encoding Transformer, which uses 24 residual blocks at a width of 2048. A separate 1.5 billion parameter upsampling diffusion model then increases the resolution from 64 x 64 to 256 x 256, conditioned on the same caption but using a smaller text encoder (width 1024). The base model was trained for about 2.5 million iterations at a batch size of 2048. ^[1]

Component	Parameters	Resolution	Notes
Base text-conditional diffusion model	~3.5 billion (≈2.3B visual + ≈1.2B text)	64 x 64	Text encoder: 24 residual blocks, width 2048
Upsampling diffusion model	~1.5 billion	64 x 64 to 256 x 256	Smaller text encoder, width 1024
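
The parameter figures above can be sanity-checked with a short calculation. The sketch below uses a common rule of thumb of about 12 × width² weights per Transformer block (attention plus a 4x-wide feed-forward layer, ignoring embeddings, biases, and normalization). That rule is an assumption about the encoder's layout, not something the paper states; the paper reports only the rounded totals.

```python
def transformer_params(num_blocks, width):
    """Rule-of-thumb estimate: ~12 * width^2 weights per block (assumed layout)."""
    return 12 * num_blocks * width ** 2

text_encoder = transformer_params(num_blocks=24, width=2048)  # block count and width from the paper
visual_part = 2.3e9        # reported figure, not derived here
upsampler = 1.5e9          # reported figure, not derived here

base_total = visual_part + text_encoder
pixel_ratio = (256 * 256) // (64 * 64)   # how many more pixels the upsampler outputs

print(f"text encoder estimate: {text_encoder / 1e9:.2f}B")
print(f"base model estimate:   {base_total / 1e9:.2f}B")
print(f"base + upsampler:      {(base_total + upsampler) / 1e9:.2f}B")
print(f"pixel ratio:           {pixel_ratio}x")
```

The estimate for the text encoder comes out close to the reported 1.2 billion, and adding the 2.3 billion visual parameters gives roughly the 3.5 billion quoted for the base model.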

How does classifier-free guidance compare to CLIP guidance in GLIDE?

The paper's main experimental focus was comparing two strategies for steering generation toward the text prompt:

Guidance method	Mechanism	Reported result
CLIP guidance	Uses a noise-aware ("noised") CLIP model to push samples toward higher image-text similarity during sampling	Effective, but less preferred by evaluators
Classifier-free guidance	Trains a single diffusion model both with and without the caption, then at sampling time extrapolates from the unconditional prediction toward the caption-conditioned one	Preferred for both photorealism and caption similarity
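
For orientation, the two mechanisms in the table can be written roughly as follows, in the notation of the GLIDE paper. Here x_t is the noisy sample at step t, c the caption, ∅ the empty caption, s the guidance scale, and f and g the CLIP image and text encoders. This is a summary of the update rules, not a full derivation.

```latex
% CLIP guidance: shift the predicted mean along the gradient of image-text similarity
\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c)
  + s \, \Sigma_\theta(x_t \mid c) \, \nabla_{x_t} \big( f(x_t) \cdot g(c) \big)

% Classifier-free guidance: extrapolate from the unconditional noise prediction
\hat{\epsilon}_\theta(x_t \mid c) = \epsilon_\theta(x_t \mid \emptyset)
  + s \, \big( \epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \emptyset) \big)
```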

The paper reports that classifier-free guidance "is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples." ^[1] It also avoids the need for a separate classifier or CLIP model at inference. Using classifier-free guidance, GLIDE reported a zero-shot Fréchet Inception Distance (FID) of 12.24 on MS-COCO at 256 x 256 resolution. In side-by-side human comparisons against DALL-E, GLIDE's samples were preferred about 87% of the time for photorealism and about 69% of the time for caption similarity. ^[1]^[2]

What can GLIDE do?

GLIDE supports two primary capabilities. The first is text-conditional image generation: given a caption such as a description of an object, scene, or artistic style, the model synthesizes a corresponding image. Because guidance is applied at sampling time, users can trade off diversity against fidelity to the prompt by adjusting the guidance strength. ^[1]
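
The effect of the guidance strength can be illustrated with a minimal sketch of the classifier-free guidance combination, in which the guided noise prediction is the unconditional prediction plus a scaled difference between the conditional and unconditional predictions. The function name and the numeric predictions below are hypothetical; in practice both predictions come from the same network, evaluated with the caption and with an empty caption.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Combine two noise predictions: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical noise predictions for one three-pixel sample.
eps_with_caption = [0.20, -0.10, 0.40]
eps_empty_caption = [0.10, 0.00, 0.10]

for scale in (0.0, 1.0, 3.0):
    guided = classifier_free_guidance(eps_with_caption, eps_empty_caption, scale)
    print(scale, [round(v, 2) for v in guided])
```

A scale of 0 ignores the caption, a scale of 1 reproduces the caption-conditioned prediction, and larger values push the sample further in the direction the caption suggests, which typically trades diversity for fidelity to the prompt.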

The second capability is text-driven image editing through inpainting. By fine-tuning the model for inpainting, GLIDE can take an existing image, a masked region, and a text instruction, then fill the masked area with content consistent with both the surrounding image and the prompt. This enables iterative, natural-language editing, for example inserting an object, changing an element's appearance, or extending a scene, while preserving the rest of the image. The paper highlighted this as "powerful text-driven image editing" and presented it as a key practical advantage of the diffusion-based approach. ^[1]
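
The data flow of mask-based inpainting can be sketched on a toy image stored as a flat list of RGB pixels. The sketch assumes the model receives, for each pixel, the noisy sample plus the surviving context pixels and a mask channel, and that the known pixels are pasted back after generation. The function names and the compositing step are illustrative assumptions, not a reproduction of GLIDE's code.

```python
def build_inpaint_input(noisy_rgb, image_rgb, keep_mask):
    """Return one 7-channel vector per pixel: noisy RGB, context RGB, mask."""
    rows = []
    for noisy, pixel, keep in zip(noisy_rgb, image_rgb, keep_mask):
        context = [channel * keep for channel in pixel]  # zero out the region to be filled
        rows.append(list(noisy) + context + [keep])
    return rows

def composite(image_rgb, generated_rgb, keep_mask):
    """Keep original pixels where keep == 1, generated pixels where keep == 0."""
    return [list(pixel) if keep else list(gen)
            for pixel, gen, keep in zip(image_rgb, generated_rgb, keep_mask)]

# Three-pixel toy image; the middle pixel is masked out for editing.
image = [[0.9, 0.1, 0.1], [0.2, 0.8, 0.2], [0.1, 0.1, 0.9]]
noisy = [[0.0, 0.3, -0.2], [1.1, -0.4, 0.5], [0.2, 0.2, 0.0]]
generated = [[0.5, 0.5, 0.5], [0.7, 0.7, 0.0], [0.5, 0.5, 0.5]]  # stand-in for model output
keep = [1, 0, 1]

model_input = build_inpaint_input(noisy, image, keep)
edited = composite(image, generated, keep)
```

Only the masked pixel changes in `edited`, which mirrors the behavior described above: the prompt-driven content fills the masked area while the rest of the image is preserved.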

Is GLIDE open source?

Citing safety concerns about a powerful, openly available text-to-image model, OpenAI did not release the full 3.5 billion parameter GLIDE model or its weights. Instead, it published a smaller model, commonly called GLIDE (filtered), on GitHub at openai/glide-text2im. This released model has roughly 300 million parameters and was trained on a heavily filtered dataset. ^[2]^[3]

The original training corpus comprised several hundred million text-image pairs collected from the internet. For the public release, OpenAI applied filters intended to remove images of people, violent objects, and hate symbols (with specific attention to swastikas and Confederate flags); the model card notes the filtered set contained approximately 67 million text-image pairs. The accompanying noised CLIP model distributed for CLIP-guided sampling was trained on a larger augmented set of about 137 million pairs. The model card states that "these models are explicitly not intended to generate images of people or other subjects we filtered for," notes that the data still exhibited biases toward Western-centric concepts, and adds that OpenAI does "not currently recommend it for commercial use." ^[3]

The public repository provides three example notebooks demonstrating the released model:

Notebook	Function
text2im	Text-conditional generation using classifier-free guidance
inpaint	Text-conditional inpainting of masked image regions
clip_guided	Text-conditional generation using the filtered noised CLIP model for guidance

By releasing a deliberately restricted model rather than the full system, OpenAI took an early position on staged, safety-oriented release for generative image models, a stance it would carry into subsequent products. ^[2]^[3]

How is GLIDE related to DALL-E 2?

GLIDE was an influential step in the development of modern text-to-image generation. It provided strong evidence that diffusion models, rather than the autoregressive transformer approach used in the original DALL-E, could produce high-quality, caption-faithful images, and it popularized classifier-free guidance as the dominant conditioning technique for such models. The diffusion-plus-upsampler architecture it used became a common template for later text-to-image systems. ^[1]^[2]

GLIDE's most direct legacy is its role as the foundation for DALL-E 2, released by OpenAI in April 2022. DALL-E 2's architecture (called unCLIP in its paper) pairs a prior that maps a CLIP text embedding to a CLIP image embedding with a decoder that turns that image embedding into a picture. That decoder is a modified version of GLIDE: it retains GLIDE's diffusion-with-classifier-free-guidance design but is conditioned on CLIP image embeddings rather than on text alone. The embeddings are projected and added to the timestep embedding, and also projected into four extra context tokens that are concatenated to the text-encoder outputs. In this sense GLIDE supplied the generative backbone that DALL-E 2 adapted, marking the transition of OpenAI's flagship image system from autoregressive modeling to diffusion. ^[4]^[5]
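
The way the DALL-E 2 decoder injects a CLIP image embedding into a GLIDE-style network can be sketched with toy dimensions. Following the description in the DALL-E 2 paper ^[4], the embedding is projected and added to the timestep embedding, and separately projected into four extra tokens appended to the text-encoder output sequence. All names, sizes, and weight values below are hypothetical and far smaller than in the real model.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def condition_on_clip(timestep_emb, text_tokens, clip_emb, time_proj, token_projs):
    """Return the shifted timestep embedding and the extended context sequence."""
    shift = matvec(time_proj, clip_emb)
    new_timestep_emb = [t + s for t, s in zip(timestep_emb, shift)]
    extra_tokens = [matvec(proj, clip_emb) for proj in token_projs]
    return new_timestep_emb, text_tokens + extra_tokens

width, clip_dim, num_text_tokens = 3, 2, 5        # toy sizes
clip_emb = [1.0, -1.0]
timestep_emb = [0.0, 0.5, 1.0]
text_tokens = [[0.1] * width for _ in range(num_text_tokens)]
time_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical weights, shape width x clip_dim
token_projs = [time_proj for _ in range(4)]        # one projection per extra token

new_t, context = condition_on_clip(timestep_emb, text_tokens, clip_emb, time_proj, token_projs)
```

The denoising network then attends over nine context tokens instead of five, so the image embedding influences generation through both the timestep pathway and the attention context.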

References

1. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741, December 2021 (revised March 2022). https://arxiv.org/abs/2112.10741
2. OpenAI. "glide-text2im: GLIDE: a diffusion-based text-conditional image synthesis model" (GitHub repository). https://github.com/openai/glide-text2im
3. OpenAI. "GLIDE (filtered) Model Card." https://github.com/openai/glide-text2im/blob/main/model-card.md
4. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv:2204.06125, April 2022. https://arxiv.org/abs/2204.06125
5. Synced. "OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance." December 2021. https://syncedreview.com/2021/12/24/openai-releases-glide-a-scaled-down-text-to-image-model-that-rivals-dall-e-performance/
