Gemini Diffusion
Last reviewed
Jun 9, 2026
Sources
9 citations
Review status
Source-backed
Revision
v2 · 1,453 words
Gemini Diffusion is an experimental text generation model from Google DeepMind that produces text and code using a diffusion process rather than the autoregressive, token-by-token approach used by most large language models. It was unveiled at Google I/O 2025, the company's annual developer conference whose keynotes ran on May 20, 2025, and was made available as a limited demo behind a waitlist.[1][2][3] Part of the broader Gemini family, it was framed not as a product but as a research model, an effort to show that the diffusion techniques that power modern image and video generators could be adapted to language while matching the quality of a comparable conventional model at several times the speed.[1][4]
The headline result was speed. Google reported a sampling rate of 1,479 tokens per second, and press coverage cited an end-to-end range of roughly 1,000 to 2,000 tokens per second depending on the task, with coding workloads at the upper end.[1][4][2] Crucially, the company positioned this against quality rather than as raw throughput alone: on a battery of coding and reasoning benchmarks, Gemini Diffusion performed about on par with Gemini 2.0 Flash-Lite, one of Google's smaller and cheaper autoregressive models, while running roughly five times faster.[4][5]
Diffusion models first became prominent in image generation, where systems such as Stable Diffusion and Google's own Imagen learn to turn random noise into a coherent picture through many small denoising steps. Applying the same idea to discrete text is harder, because language is a sequence of distinct tokens rather than continuous pixel values, and for years text diffusion models lagged well behind autoregressive transformers on quality. Gemini Diffusion is significant partly because it is one of the first instances of a diffusion language model from a major lab reaching rough parity with a shipped autoregressive model on standard benchmarks.[5][6]
The work sits within a wider research current. Other groups, including the team behind the Mercury models from Inception Labs and academics such as Stanford's Stefano Ermon, had been pushing diffusion approaches to language; Ermon told Fortune that Google's entry "validates the direction we've been pursuing."[2] At I/O, Google paired Gemini Diffusion's reveal with separate latency work on its mainline models, noting a faster version of Gemini 2.5 Flash-Lite was on the way, which underlined that inference speed had become a competitive front across the field.[7]
A standard autoregressive language model generates one token at a time, left to right, with each new token conditioned on everything written so far. That sequential dependency is what makes generation slow: producing the thousandth token requires the previous 999 to exist first, and the model cannot revise what it has already emitted.
A diffusion language model takes a different route. During training the model learns to reverse a corruption process: clean text is progressively masked or noised, and the model is taught to recover the original. At generation time it starts from a fully noised or masked sequence and refines the whole block over a series of denoising steps, gradually committing tokens to their final values until a complete passage emerges.[3][6] Google describes this as learning "to generate outputs by refining noise, step-by-step."[1]
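
The refinement loop described above can be illustrated with a toy sketch. It is not Google's implementation, whose details have not been published. The `toy_denoiser` function is a hypothetical stand-in for a trained model: it simply looks up the known target text and attaches a random confidence score, and the unmasking schedule (an even share of the remaining masked positions per step) is likewise an assumption chosen for simplicity.

```python
import random

MASK = "<mask>"

def toy_denoiser(sequence, target, rng):
    """Hypothetical stand-in for a trained model.

    For every still-masked position it proposes a token (here copied from
    the known target) together with a random confidence score.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(sequence)
        if tok == MASK
    }

def denoise(target, steps, seed=0):
    """Start fully masked and commit tokens over a fixed number of steps."""
    rng = random.Random(seed)
    sequence = [MASK] * len(target)
    history = [list(sequence)]
    for step in range(steps):
        proposals = toy_denoiser(sequence, target, rng)
        if not proposals:
            break  # nothing left to unmask
        remaining_steps = steps - step
        # Commit an even share of the remaining masked positions (ceiling division).
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            sequence[i] = proposals[i][0]
        history.append(list(sequence))
    return sequence, history

if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, history = denoise(words, steps=3)
    for snapshot in history:
        print(" ".join(snapshot))
```

Each printed line shows the whole sequence at one step, with tokens filled in at scattered positions rather than strictly left to right.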
Two properties follow from this design. First, because the model works on a whole block at once rather than strictly one token after another, much of the computation can happen in parallel, which is the main source of the speed gains.[6][8] Second, the denoiser attends across the full sequence in both directions rather than only to earlier tokens, so the model can reconsider and self-correct parts of its output across steps instead of being locked into each token the moment it is produced. Google highlighted this as an advantage for tasks like editing, mathematics, and code, where coherence across a span of text matters and a late realization should be able to fix an earlier mistake.[1][3] The trade-off is that quality depends on the number of refinement steps, so the technique balances speed against the compute spent denoising.[6]
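
A rough way to see the speed and quality trade-off is to count sequential model calls. The sketch below assumes an autoregressive model needs one call per token, while a block-based diffusion model needs a fixed number of denoising steps per block. The block size and step count are hypothetical parameters, since Google has not published the values Gemini Diffusion uses, and the count ignores that a single diffusion step over a whole block may cost more compute than a single autoregressive step.

```python
import math

def autoregressive_passes(num_tokens):
    """One sequential model call per generated token."""
    return num_tokens

def diffusion_passes(num_tokens, block_size, steps_per_block):
    """Sequential model calls when each block is refined for a fixed number of steps.

    block_size and steps_per_block are illustrative assumptions, not
    published Gemini Diffusion settings.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block

if __name__ == "__main__":
    n = 1000
    print("autoregressive:", autoregressive_passes(n))
    for steps in (8, 16, 32):
        print(f"diffusion, {steps} steps/block:", diffusion_passes(n, 256, steps))
```

Under these assumptions, 1,000 tokens with a block size of 256 and 16 steps per block take 64 sequential calls instead of 1,000. Raising the step count narrows that advantage, which is the balance between speed and denoising compute described above.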
Google reported the model's sampling speed and its scores on a set of standard benchmarks, comparing it against Gemini 2.0 Flash-Lite. The speed figures were as follows.[1][3]
| Metric | Reported figure |
|---|---|
| Sampling speed | 1,479 tokens/second |
| Latency overhead | 0.84 seconds |
| End-to-end (press-reported range) | ~1,000 to 2,000 tokens/second |
| Relative speed vs. Gemini 2.0 Flash-Lite | About 5x faster |
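
The relationship between the sampling speed and the overhead can be worked through numerically. The sketch below assumes the 0.84-second overhead is a fixed per-request cost added on top of the time spent sampling at 1,479 tokens per second; that additive model is an assumption made for illustration, not a formula published by Google.

```python
SAMPLING_TOKENS_PER_SECOND = 1479  # reported sampling speed
OVERHEAD_SECONDS = 0.84            # reported latency overhead

def total_seconds(num_tokens):
    """Estimated wall-clock time, assuming a fixed additive overhead."""
    return OVERHEAD_SECONDS + num_tokens / SAMPLING_TOKENS_PER_SECOND

def effective_tokens_per_second(num_tokens):
    """End-to-end rate once the overhead is included."""
    return num_tokens / total_seconds(num_tokens)

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(total_seconds(n), 2), round(effective_tokens_per_second(n)))
```

Under this model the overhead dominates short responses, so the effective rate climbs toward the sampling speed only as outputs get longer. That is one possible reason why measured end-to-end rates can differ from the headline figure.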
Independent hands-on testing reported lower but still high real-world rates. Developer Simon Willison, after gaining waitlist access, measured about 857 tokens per second while having the model build a small chat application, and likened the feel to fast inference hardware he had used before.[3]
On benchmarks, the picture is that of a small, fast model: competitive on code, weaker on knowledge and harder reasoning. The published comparison against Gemini 2.0 Flash-Lite is below.[5][9]
| Benchmark | Category | Gemini Diffusion | Gemini 2.0 Flash-Lite |
|---|---|---|---|
| HumanEval | Code | 89.6% | 90.2% |
| MBPP | Code | 76.0% | 75.8% |
| LiveCodeBench (v6) | Code | 30.9% | 28.5% |
| BigCodeBench | Code | 45.4% | 45.8% |
| LBPP (v2) | Code | 56.8% | 56.0% |
| SWE-Bench Verified | Code | 22.9% | 28.5% |
| AIME 2025 | Mathematics | 23.3% | 20.0% |
| GPQA Diamond | Science | 40.4% | 56.5% |
| BIG-Bench Extra Hard | Reasoning | 15.0% | 21.0% |
| Global MMLU (Lite) | Multilingual | 69.1% | 79.0% |
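
The gaps in the table can be computed directly. The short script below uses only the scores listed above and reports the difference in percentage points, with a positive value meaning Gemini Diffusion scored higher.

```python
# (Gemini Diffusion, Gemini 2.0 Flash-Lite) scores in percent, from the table above
SCORES = {
    "HumanEval": (89.6, 90.2),
    "MBPP": (76.0, 75.8),
    "LiveCodeBench (v6)": (30.9, 28.5),
    "BigCodeBench": (45.4, 45.8),
    "LBPP (v2)": (56.8, 56.0),
    "SWE-Bench Verified": (22.9, 28.5),
    "AIME 2025": (23.3, 20.0),
    "GPQA Diamond": (40.4, 56.5),
    "BIG-Bench Extra Hard": (15.0, 21.0),
    "Global MMLU (Lite)": (69.1, 79.0),
}

def score_deltas(scores):
    """Gemini Diffusion minus Flash-Lite, in percentage points."""
    return {
        name: round(diffusion - flash_lite, 1)
        for name, (diffusion, flash_lite) in scores.items()
    }

if __name__ == "__main__":
    deltas = score_deltas(SCORES)
    for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
        print(f"{name:22s} {delta:+.1f}")
```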
The pattern is consistent with the claim Google actually made, and with what reviewers concluded. On most programming benchmarks the two models are within about two and a half points of each other, and Gemini Diffusion edges ahead on MBPP, LiveCodeBench, LBPP, and AIME 2025, although it trails by 5.6 points on SWE-Bench Verified. It falls clearly behind on broad knowledge and the hardest reasoning, where the gap is 16.1 points on GPQA Diamond, 9.9 on Global MMLU (Lite), and 6.0 on BIG-Bench Extra Hard.[5][9] Jack Rae, a principal scientist at Google DeepMind, called the result "a fascinating and powerful model that is also lightning fast," adding that it had not been clear the quality gap with autoregressive models "would ever be closed."[2]
At launch Gemini Diffusion was not a general release. Google offered it as an experimental demo with access gated by a waitlist, inviting developers and researchers to sign up through a form linked from the DeepMind announcement.[1][2][3] The stated purpose of the limited rollout was to collect feedback and continue refining the model rather than to ship it broadly. It was not added to the public Gemini API or apps at announcement, and Google characterized it explicitly as a research model.[1][4]
Gemini Diffusion drew attention less for what it could do than for what it represented. It was, by several accounts, one of the quieter announcements at I/O 2025 yet potentially one of the more consequential, because it offered concrete evidence that diffusion could rival autoregressive generation for language and not just images.[4][6] AI researcher Nathan Lambert described it as the "biggest endorsement yet" for the approach while cautioning that Google had released few technical details, so rigorous comparison was difficult.[2]
The practical appeal is straightforward. If a diffusion model can match a small autoregressive model's quality at several times the speed, it points toward cheaper, lower-latency inference, which matters as the cost of serving models has become a central concern. Analyst Dave Nicholson of the Futurum Group made exactly that point, noting that competing approaches "will eventually be measured against each model's running costs."[2] Whether the approach scales to the quality of the largest frontier models such as Gemini 2.5 Pro remained an open question at the time of release, and one the experimental demo was meant to help answer. What Gemini Diffusion established is that the question is now worth asking.