Gemini Diffusion


Updated Jun 27, 2026


Gemini Diffusion is an experimental text generation model from Google DeepMind that produces text and code using a diffusion process rather than the autoregressive, token-by-token approach used by most large language models.^[1]^[3] It was unveiled on May 20, 2025, at Google I/O, the company's annual developer conference, and was made available as a limited research demo behind a waitlist.^[1]^[2]^[3] Instead of predicting the next token in a sequence, the model starts from noise and refines a whole block of text over repeated denoising steps. Google reports a sampling speed of 1,479 tokens per second, and press coverage cites an end-to-end range of roughly 1,000 to 2,000 tokens per second.^[1]^[2]^[4]

Part of the broader Gemini family, it was framed not as a product but as a research model, an effort to show that the diffusion techniques that power modern image and video generators could be adapted to language while matching the quality of a comparable conventional model at several times the speed.^[1]^[4] Google describes it as "currently available as an experimental demo to help develop and refine future models."^[8]

The headline result was speed, with press coverage placing coding workloads at the upper end of the roughly 1,000 to 2,000 tokens-per-second range.^[2]^[4] Crucially, the company positioned this against quality rather than as raw throughput alone: on coding benchmarks in particular, Gemini Diffusion performed about on par with Gemini 2.0 Flash-Lite, one of Google's smaller and cheaper autoregressive models, while running roughly five times faster.^[4]^[5]

What is Gemini Diffusion?

Gemini Diffusion is a diffusion-based large language model: a text generation system that creates output by iteratively refining noise into coherent tokens rather than emitting one token after another from left to right. Google DeepMind unveiled it at Google I/O on May 20, 2025 and released it as an experimental demo gated by a waitlist, not as a shipping product.^[1]^[2]^[3] Its defining characteristic is throughput: Google reported a sampling speed of 1,479 tokens per second, roughly five times faster than the comparable autoregressive model Gemini 2.0 Flash-Lite, at broadly similar quality on coding benchmarks.^[1]^[4]^[5]

Background

Diffusion models first became prominent in image generation, where systems such as Stable Diffusion and Google's own Imagen learn to turn random noise into a coherent picture through many small denoising steps. Applying the same idea to discrete text is harder, because language is a sequence of distinct tokens rather than continuous pixel values, and for years text diffusion models lagged well behind autoregressive transformers on quality. Gemini Diffusion is significant partly because it is one of the first instances of a diffusion language model from a major lab reaching rough parity with a shipped autoregressive model on standard benchmarks.^[5]^[6]

The work sits within a wider research current. Other groups, including the team behind the Mercury models from Inception Labs and academics such as Stanford's Stefano Ermon, had been pushing diffusion approaches to language; Ermon told Fortune that Google's entry "validates the direction we've been pursuing."^[2] At I/O, Google paired Gemini Diffusion's reveal with separate latency work on its mainline models, noting a faster version of Gemini 2.5 Flash-Lite was on the way, which underlined that inference speed had become a competitive front across the field.^[7]

How does diffusion-based text generation work?

A standard autoregressive language model generates one token at a time, left to right, with each new token conditioned on everything written so far. Google notes that "traditional autoregressive language models generate text one word, or token, at a time," and that "this sequential process can be slow, and limit the quality and coherence of the output."^[8] That sequential dependency is what makes generation slow: producing the thousandth token requires the previous 999 to exist first, and the model cannot revise what it has already emitted.

A diffusion language model takes a different route. During training the model learns to reverse a corruption process: clean text is progressively masked or noised, and the model is taught to recover the original. At generation time it starts from a fully noised or masked sequence and refines the whole block over a series of denoising steps, gradually committing tokens to their final values until a complete passage emerges.^[3]^[6] Google describes this as learning "to generate outputs by refining noise, step-by-step."^[1]

Two properties follow from this design. First, because the model works on a whole block at once rather than strictly one token after another, much of the computation can happen in parallel, which is the main source of the speed gains.^[6]^[8] Second, the denoiser attends across the full sequence in both directions rather than only to earlier tokens, so the model can reconsider and self-correct parts of its output across steps instead of being locked into each token the moment it is produced. Google highlighted this as an advantage for tasks like editing, mathematics, and code, where coherence across a span of text matters and a late realization should be able to fix an earlier mistake.^[1]^[3] The trade-off is that quality depends on the number of refinement steps, so the technique balances speed against the compute spent denoising.^[6]
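The denoising loop described above can be sketched in a few lines of Python. This is a toy illustration only: Google has released few technical details, so the confidence-based unmasking schedule, the names `MASK`, `toy_denoiser` and `generate`, and the denoiser itself (which simply reads from a fixed target instead of running a neural network) are all assumptions made for the example, not the model's actual method.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a not-yet-committed token


def toy_denoiser(tokens, target, rng):
    """Stand-in for a trained denoiser (assumption, not Google's model).

    For every masked position, return a (predicted_token, confidence) pair.
    A real denoiser would be a network attending over the whole sequence;
    here the prediction is read from a fixed target and the confidence is random.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def generate(target, num_steps, seed=0):
    """Start fully masked and commit tokens over at most num_steps passes."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(num_steps):
        preds = toy_denoiser(tokens, target, rng)
        if not preds:
            break  # nothing left to denoise
        remaining_steps = num_steps - step
        # Commit enough positions per pass to finish within the step budget
        # (ceiling division of masked positions by remaining steps).
        k = -(-len(preds) // remaining_steps)
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    target = "the quick brown fox jumps over the lazy dog".split()
    final, history = generate(target, num_steps=3)
    for step, snapshot in enumerate(history):
        print(step, " ".join(snapshot))
```

Each pass fills several positions at once and in no particular left-to-right order, which is the parallelism the text describes. Lowering `num_steps` commits more tokens per pass, which is where the speed-versus-quality trade-off would appear in a real model. The sketch never re-masks a committed token, so it does not show the self-correction behavior.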

How fast is Gemini Diffusion?

Google reported the model's sampling speed and its scores on a set of standard benchmarks, comparing it against Gemini 2.0 Flash-Lite. The speed figures were as follows.^[1]^[3]

Metric	Reported figure
Sampling speed	1,479 tokens/second
Latency overhead	0.84 seconds
End-to-end (press-reported range)	~1,000 to 2,000 tokens/second
Relative speed vs. Gemini 2.0 Flash-Lite	About 5x faster

Independent hands-on testing reported lower but still high real-world rates. Developer Simon Willison, after gaining waitlist access, measured about 857 tokens per second while having the model build a small chat application, and likened the feel to fast inference hardware he had used before.^[3]
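As a rough illustration of how the reported sampling speed and overhead might combine, the sketch below assumes a simple additive model: total time equals the fixed overhead plus tokens divided by sampling speed. That model, and the function names `estimated_seconds` and `effective_rate`, are assumptions for the example. Google did not publish such a formula.

```python
SAMPLING_TOKENS_PER_SEC = 1479.0  # reported sampling speed
OVERHEAD_SEC = 0.84               # reported latency overhead


def estimated_seconds(num_tokens, rate=SAMPLING_TOKENS_PER_SEC, overhead=OVERHEAD_SEC):
    """Assumed additive model: fixed overhead plus time spent sampling."""
    return overhead + num_tokens / rate


def effective_rate(num_tokens, rate=SAMPLING_TOKENS_PER_SEC, overhead=OVERHEAD_SEC):
    """End-to-end tokens per second implied by the assumed model."""
    return num_tokens / estimated_seconds(num_tokens, rate, overhead)


if __name__ == "__main__":
    for n in (250, 1000, 4000):
        print(f"{n} tokens: {estimated_seconds(n):.2f} s, "
              f"{effective_rate(n):.0f} tokens/s end to end")
```

Under this assumed model, the fixed overhead weighs more heavily on short responses, so the effective rate rises with output length and approaches the sampling speed without reaching it. That is consistent with hands-on rates falling below the headline figure, though the sources do not attribute the difference to any specific cause.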

How does Gemini Diffusion perform on benchmarks?

On benchmarks, the picture is that of a small, fast model: competitive on code, weaker on knowledge and harder reasoning. The published comparison against Gemini 2.0 Flash-Lite is below.^[5]^[9]

Benchmark	Category	Gemini Diffusion	Gemini 2.0 Flash-Lite
HumanEval	Code	89.6%	90.2%
MBPP	Code	76.0%	75.8%
LiveCodeBench (v6)	Code	30.9%	28.5%
BigCodeBench	Code	45.4%	45.8%
LBPP (v2)	Code	56.8%	56.0%
SWE-Bench Verified	Code	22.9%	28.5%
AIME 2025	Mathematics	23.3%	20.0%
GPQA Diamond	Science	40.4%	56.5%
BIG-Bench Extra Hard	Reasoning	15.0%	21.0%
Global MMLU (Lite)	Multilingual	69.1%	79.0%

The pattern is consistent with the claim Google actually made, and with what reviewers concluded. On programming the two models are close on most tests, and Gemini Diffusion edges ahead on MBPP, LiveCodeBench, LBPP and AIME 2025, though it trails by 5.6 points on SWE-Bench Verified. It falls clearly behind on broad knowledge and the hardest reasoning: the gap is about 16 points on GPQA Diamond, about 10 on Global MMLU (Lite) and 6 on BIG-Bench Extra Hard.^[5]^[9] Jack Rae, a principal scientist at Google DeepMind, called the result "a fascinating and powerful model that is also lightning fast," adding that it had not been clear the quality gap with autoregressive models "would ever be closed."^[2]
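The per-benchmark differences can be checked directly from the published table. The short script below copies the scores from the table above and computes Gemini Diffusion's score minus Gemini 2.0 Flash-Lite's in percentage points. The names `SCORES` and `score_gaps` are illustrative only.

```python
# (Gemini Diffusion, Gemini 2.0 Flash-Lite) scores in percent, from the table above.
SCORES = {
    "HumanEval": (89.6, 90.2),
    "MBPP": (76.0, 75.8),
    "LiveCodeBench (v6)": (30.9, 28.5),
    "BigCodeBench": (45.4, 45.8),
    "LBPP (v2)": (56.8, 56.0),
    "SWE-Bench Verified": (22.9, 28.5),
    "AIME 2025": (23.3, 20.0),
    "GPQA Diamond": (40.4, 56.5),
    "BIG-Bench Extra Hard": (15.0, 21.0),
    "Global MMLU (Lite)": (69.1, 79.0),
}


def score_gaps(scores):
    """Return Diffusion minus Flash-Lite, in percentage points, per benchmark."""
    return {
        name: round(diffusion - flash_lite, 1)
        for name, (diffusion, flash_lite) in scores.items()
    }


if __name__ == "__main__":
    # Print from the largest lead to the largest deficit.
    for name, gap in sorted(score_gaps(SCORES).items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<22} {gap:+.1f}")
```

Positive values mean Gemini Diffusion scored higher. Sorting by gap puts the small leads and near-ties on coding and math at the top and the knowledge and reasoning deficits at the bottom.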

Is Gemini Diffusion publicly available?

At launch Gemini Diffusion was not a general release. Google offered it as an experimental demo with access gated by a waitlist, inviting developers and researchers to sign up through a form linked from the DeepMind announcement.^[1]^[2]^[3] The stated purpose of the limited rollout was to collect feedback and continue refining the model rather than to ship it broadly. It was not added to the public Gemini API or apps at announcement, and Google characterized it explicitly as a research model.^[1]^[4]

Why does Gemini Diffusion matter?

Gemini Diffusion drew attention less for what it could do than for what it represented. It was, by several accounts, one of the quieter announcements at I/O 2025 yet potentially one of the more consequential, because it offered concrete evidence that diffusion could rival autoregressive generation for language and not just images.^[4]^[6] AI researcher Nathan Lambert described it as the "biggest endorsement yet" for the approach while cautioning that Google had released few technical details, so rigorous comparison was difficult.^[2]

The practical appeal is straightforward. If a diffusion model can match a small autoregressive model's quality at several times the speed, it points toward cheaper, lower-latency inference, which matters as the cost of serving models has become a central concern. Analyst Dave Nicholson of the Futurum Group made exactly that point, noting that competing approaches "will eventually be measured against each model's running costs."^[2] Whether the approach scales to the quality of the largest frontier models such as Gemini 2.5 Pro remained an open question at the time of release, and one the experimental demo was meant to help answer. What Gemini Diffusion established is that the question is now worth asking.

References

[1] Google. "Gemini Diffusion: Google DeepMind's experimental research model." The Keyword (blog.google), 20 May 2025. https://blog.google/technology/google-deepmind/gemini-diffusion/
[2] Goldman, S. "At Google I/O, Gemini Diffusion's speed and coding skills hint at the next phase of the AI model wars." Fortune, 21 May 2025. https://fortune.com/2025/05/21/gemini-diffusion-google-io-sleeper-hit-blazing-speed-ai-model-wars/
[3] Willison, S. "Gemini Diffusion." simonwillison.net, 21 May 2025. https://simonwillison.net/2025/May/21/gemini-diffusion/
[4] TechCrunch. "Gemini Diffusion is an ultra-fast new AI model." TechCrunch, 20 May 2025. https://techcrunch.com/snippet/3009698/gemini-diffusion-is-an-ultra-fast-new-ai-model/
[5] Bastian, M. "Gemini Diffusion could be Google's most important I/O news that slipped under the radar." The Decoder, 22 May 2025. https://the-decoder.com/gemini-diffusion-could-be-googles-most-important-i-o-news-that-slipped-under-the-radar/
[6] Sharma, S. "Beyond GPT architecture: Why Google's Diffusion approach could reshape LLM deployment." VentureBeat, 22 May 2025. https://venturebeat.com/ai/beyond-gpt-architecture-why-googles-diffusion-approach-could-reshape-llm-deployment
[7] Google. "Gemini 2.5: Our most intelligent models are getting even better." The Keyword (blog.google), 20 May 2025. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-updates-io-2025/
[8] Google DeepMind. "Gemini Diffusion." deepmind.google, 2025. https://deepmind.google/models/gemini-diffusion/
[9] CometAPI. "What is Gemini Diffusion? All You Need to Know." cometapi.com, 22 May 2025. https://www.cometapi.com/what-is-gemini-diffusion/
