Diffusion Language Models

Updated Jun 28, 2026

Diffusion language models (DLMs, sometimes written dLLMs at frontier scale) are text generators that synthesize a sequence by iteratively denoising or unmasking many tokens in parallel, rather than predicting one token at a time from left to right as conventional autoregressive models do. They form a family of generative models that learn to reverse a stochastic corruption process. Large language models such as the GPT family or LLaMA 3 are causal systems that factor the joint probability of a sequence into a product of left-to-right conditionals; a diffusion language model instead starts from a fully corrupted or masked sequence and refines many positions in parallel at each step. The approach inherits its mathematical scaffolding from the continuous-space diffusion models used for images, such as DDPM, adapted either to continuous word-embedding spaces or to discrete token spaces with absorbing or uniform corruption.^[1]^[2]^[3]

The motivation for studying diffusion language models has shifted over time. Early work in 2021 and 2022 was driven by the desire for controllable generation and for an alternative to the strict left-to-right factorization of autoregressive models. By 2024 the focus had moved to closing the likelihood gap with autoregressive baselines and to demonstrating that masked discrete diffusion is a competitive paradigm at scale. In 2025 the field reached a frontier-scale milestone with LLaDA, an 8 billion parameter masked diffusion model trained from scratch that matches LLaMA 3 8B on standard benchmarks, and a commercial milestone with Mercury from Inception Labs, the first widely deployed diffusion-based language model API. Additional 2025 releases, including Block Diffusion (BD3-LM) from Cornell, Dream 7B from the University of Hong Kong and Huawei Noah's Ark Lab, ByteDance's high-speed Seed Diffusion, and Google DeepMind's experimental Gemini Diffusion, consolidated the case that diffusion can serve as a credible substitute for next-token prediction in many production settings. By late 2025 and early 2026 the line had reached the mixture-of-experts regime with LLaDA-MoE and the 100-billion-parameter LLaDA 2.0 family from Ant Group, and a second commercial generation with Mercury 2, billed by Inception Labs as the first reasoning-capable diffusion model.^[1]^[4]^[5]^[6]

This article surveys the technical foundations of diffusion language models, traces the principal research lines from D3PM (2021) through Diffusion-LM (2022), SEDD (2024), MDLM and MD4 (2024), the autoregressive-to-diffusion adaptation line (DiffuLLaMA, 2024), LLaDA and LLaDA 1.5 (2025), Block Diffusion (2025), Mercury and Mercury 2 (2025 to 2026), Dream 7B (2025), Seed Diffusion (2025), Gemini Diffusion (2025), and the mixture-of-experts scaling work of LLaDA-MoE and LLaDA 2.0 (2025), and discusses the speed and quality trade-offs that distinguish them from autoregressive systems built on the transformer architecture.

The principal systems surveyed below are summarized in the following catalog. Speed figures are vendor- or author-reported and are measured on different hardware and workloads, so they are not directly comparable; they are reproduced here with their attributions.

System	Origin	Year	Type	Open weights	Headline claim (as reported)
Diffusion-LM	Stanford	2022	Continuous (embedding)	Yes	Early continuous diffusion text model; controllable generation^[2]
Plaid	Stanford	2023	Continuous (embedding)	Yes	First diffusion likelihood beating a 124M GPT-2^[8]
SEDD	Stanford	2024	Discrete (score entropy)	Yes	ICML 2024 Best Paper; near GPT-2 perplexity^[3]
MDLM / MD4	Cornell / Google DeepMind	2024	Discrete (masked)	Yes	Simplified masked-diffusion recipe at GPT-2 scale^[9]^[10]
DiffuGPT / DiffuLLaMA	HKU et al.	2024	Discrete (masked, AR-adapted)	Yes	Converts GPT-2/LLaMA to diffusion with under 200B tokens^[13]
LLaDA 8B	Renmin University / Ant Group	2025	Discrete (masked)	Yes	Matches LLaMA 3 8B on academic suite^[4]
LLaDA 1.5	Renmin University / Ant Group	2025	Discrete (masked) + VRPO	Yes	Adds preference-optimization alignment to LLaDA^[14]
LLaDA-MoE	Ant Group / Renmin University	2025	Discrete (masked) MoE	Yes	First open MoE dLLM; 7B total / 1.4B active^[15]
LLaDA 2.0	Ant Group (InclusionAI)	2025	Discrete (masked) MoE	Yes	Scales dLLMs to 100B total parameters^[16]
Block Diffusion (BD3-LM)	Cornell	2025	Hybrid (block)	Yes	ICLR 2025 Oral; interpolates AR and diffusion^[6]
Dream 7B	HKU / Huawei Noah's Ark	2025	Discrete (masked, AR-init)	Yes	Strongest open dLLM at release^[12]
Seed Diffusion	ByteDance Seed	2025	Discrete (masked)	No	~2,146 tok/s on H20 for code (author-reported)^[17]
Mercury Coder	Inception Labs	2025	Commercial (block/masked)	No	~1,109 tok/s on H100 (vendor-reported)^[5]^[7]
Gemini Diffusion	Google DeepMind	2025	Commercial (experimental)	No	~1,479 tok/s; Gemini 2.0 Flash-Lite quality (vendor-reported)^[18]
Mercury 2	Inception Labs	2026	Commercial, reasoning	No	~1,009 tok/s on Blackwell; reasoning dLLM (vendor-reported)^[19]

How do diffusion language models work?

A diffusion language model is defined by a forward corruption process that gradually destroys a clean text sequence and a learned reverse process that undoes the corruption. In discrete masked diffusion, the dominant modern form, the forward process progressively replaces real tokens with a special mask symbol until the sequence is fully masked, and a transformer is trained to predict the original tokens at masked positions given the partially masked sequence. Generation runs the reverse process: starting from an all-mask sequence, the model fills in tokens over a chosen number of denoising steps, unmasking many positions per step rather than emitting a single token per forward pass.^[1]^[9] Because the network attends to the full sequence at every step without a causal mask, the same model can in principle revise any position at any step, which is what gives diffusion language models their parallel, order-agnostic character. The number of denoising steps is a tunable knob that trades latency against quality, a control surface that left-to-right decoding does not naturally expose.
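The reverse process described above can be sketched as a short loop. This is a minimal illustration, not any particular system's sampler: `toy_denoiser` is a hypothetical stand-in for the trained transformer, `TARGET` is a made-up sentence, and committing the most confident guesses first is only one possible unmasking rule, assumed here for concreteness. Unlike a full sampler, the sketch never revises a token once it has been committed.

```python
import random

MASK = "<mask>"
TARGET = ["diffusion", "models", "unmask", "tokens", "in", "parallel"]

def toy_denoiser(seq, rng):
    """Hypothetical stand-in for the trained transformer.

    For every masked position it returns a (token, confidence) guess.
    A real model would score the whole vocabulary at each position.
    """
    return {i: (TARGET[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def sample(length, num_steps, denoiser, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * length                      # reverse process starts fully masked
    for step in range(num_steps):
        guesses = denoiser(seq, rng)           # one pass over the whole sequence
        if not guesses:                        # nothing left to unmask
            break
        # Spread the remaining masks evenly over the remaining steps (ceiling division).
        k = -(-len(guesses) // (num_steps - step))
        # Commit the k most confident guesses; everything else stays masked.
        for i in sorted(guesses, key=lambda j: guesses[j][1], reverse=True)[:k]:
            seq[i] = guesses[i][0]
    return seq

print(sample(length=6, num_steps=3, denoiser=toy_denoiser))
```

With `num_steps=3` the loop makes three model calls to fill six positions; setting `num_steps=6` would make six calls for the same output, which is the latency-versus-quality knob in miniature.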

Why use diffusion for text generation?

Two structural motivations drive interest in diffusion for language. The first is parallel decoding. An autoregressive transformer must produce tokens one at a time, paying the cost of a forward pass through the model for each emitted token; this is the central bottleneck of long-form decoding and the reason why techniques such as FlashAttention and speculative decoding receive so much engineering attention. A diffusion language model can in principle update every position in a sequence at every denoising step. The number of forward passes is therefore set by the number of denoising steps rather than by the number of tokens generated, which allows trade-offs between latency and quality that the autoregressive paradigm does not naturally expose. Mercury Coder Mini, for example, reports throughput of approximately 1,109 tokens per second on NVIDIA H100 GPUs, roughly an order of magnitude faster than speed-optimized autoregressive models of similar quality.^[5]^[7]
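The arithmetic behind the parallel-decoding argument is simple enough to write down. The sketch below counts sequential forward passes and derives a break-even point; the function names are hypothetical, and the token and step counts in the usage lines are illustrative inputs, not measurements of any system.

```python
def decoding_passes(num_tokens, num_steps=None):
    """Sequential forward passes needed to produce num_tokens.

    num_steps=None models autoregressive decoding (one pass per token);
    otherwise the count equals the number of denoising steps.
    """
    return num_tokens if num_steps is None else num_steps

def break_even_cost_ratio(num_tokens, num_steps):
    """How many times more expensive a diffusion pass may be than an
    autoregressive pass before the diffusion sampler stops being faster."""
    return num_tokens / num_steps

ar_passes = decoding_passes(512)
dlm_passes = decoding_passes(512, num_steps=64)
print(ar_passes, dlm_passes, break_even_cost_ratio(512, 64))
```

The break-even ratio matters because a diffusion pass processes the full sequence without a causal mask and is typically costlier than a cached autoregressive step, so the saving in pass count is an upper bound on the speedup rather than the speedup itself.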

The second motivation is controllability. Because a diffusion sampler maintains a representation of the entire sequence at every step, classifier-guided or gradient-based control can be applied to that representation in much the way it is applied to images in continuous diffusion. Diffusion-LM, the first widely cited diffusion model for text, was introduced specifically to demonstrate fine-grained controllable generation, and subsequent work has explored infilling, length control, and global constraints that are awkward to express in left-to-right decoding.^[2]

A third, more recent motivation is bidirectional reasoning. Autoregressive models are notoriously susceptible to the so-called reversal curse: a model that learns "A is B" during pretraining does not reliably answer the reverse question "B is A?" without explicit training data for that direction. Because a masked diffusion model does not assume a fixed token ordering during training, it can produce predictions in arbitrary orders. LLaDA reports that on a reversal poem completion task its 8B model outperforms GPT-4o (45.6 vs 34.3), although it underperforms in the forward direction (51.8 vs 82.7). LLaDA's forward and reverse scores are thus close to each other while GPT-4o's diverge sharply, a pattern that the authors attribute to the order-agnostic training objective.^[4]

Continuous diffusion: Diffusion-LM and Plaid

The first widely cited diffusion model built specifically for text was Diffusion-LM by Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto, released in May 2022.^[2] Diffusion-LM adapts the Gaussian diffusion machinery developed for images to text by embedding discrete tokens into a continuous space, running the forward Gaussian corruption process on those embeddings, and learning a denoising network that reverses it. A learned rounding step maps the final denoised vectors back to tokens. The principal experimental result was on six fine-grained controllable generation tasks, where Diffusion-LM significantly outperformed GPT-2-based plug-and-play and fine-tuning baselines by applying classifier guidance directly to the continuous latent variables. It also functioned as a proof of concept that continuous diffusion in embedding space can produce fluent, fixed-length text.^[2]

The continuous embedding approach was scaled and refined by Plaid (Likelihood-Based Diffusion Language Models) from Ishaan Gulrajani and Tatsunori B. Hashimoto, presented at NeurIPS 2023. Plaid 1B reported the first diffusion-based likelihood numbers that exceeded a 124M GPT-2 baseline on standard benchmarks and introduced training recipes oriented around maximum likelihood rather than denoising quality alone, providing a foundation for later scaling studies. Even so, by late 2024 the continuous embedding line was largely overtaken by discrete masked diffusion, which is closer to how text is actually represented in tokenizers and which exhibited better empirical perplexity at large scale.^[8]

Discrete diffusion: D3PM and the absorbing-state objective

The discrete diffusion line begins with D3PM (Discrete Denoising Diffusion Probabilistic Models), introduced in "Structured Denoising Diffusion Models in Discrete State-Spaces," published at NeurIPS 2021 by Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg.^[1] D3PM generalizes the DDPM forward process from Gaussian noise on continuous tensors to Markov transition matrices on discrete tokens. Instead of adding noise to coordinates, the forward chain probabilistically swaps tokens for other tokens, including a designated "mask" token. By selecting different transition matrices, D3PM unifies several corruption strategies: uniform random replacement, swaps to nearest-neighbor tokens in an embedding space, and absorbing-state replacement in which any token can be replaced by a special mask symbol but the mask itself never reverts.^[1]
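Two of these corruption strategies can be written as small row-stochastic matrices. The sketch below assumes the convention that entry `q[i][j]` is the probability of moving from token `i` to token `j` in one forward step, with a single hypothetical noise rate `beta`; the function names and the four-token vocabulary are illustrative only.

```python
def uniform_transition(vocab_size, beta):
    """Keep the token with probability 1 - beta, else resample uniformly."""
    k = vocab_size
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)]
            for i in range(k)]

def absorbing_transition(vocab_size, beta, mask_id):
    """Move any real token to the mask with probability beta; the mask never reverts."""
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    for i in range(vocab_size):
        if i == mask_id:
            q[i][i] = 1.0                  # absorbing state
        else:
            q[i][i] = 1.0 - beta
            q[i][mask_id] = beta
    return q

def forward_step(dist, q):
    """Push a distribution over tokens through one step (row vector times matrix)."""
    n = len(q)
    return [sum(dist[i] * q[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0, 0.0]                # start as token 0 with certainty
q_abs = absorbing_transition(4, beta=0.5, mask_id=3)
for _ in range(20):
    dist = forward_step(dist, q_abs)
print(dist)                                # nearly all mass ends on the mask id
```

Repeating the same loop with `uniform_transition` drives the distribution toward uniform instead, which is the other stationary behaviour the paragraph describes.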

The absorbing-state variant is particularly important because it draws a formal connection between diffusion and masked language modeling. As Austin et al. observe, when the forward process only ever masks tokens, the reverse process is trained to predict masked positions, which is exactly the BERT-style objective up to a choice of masking schedule. The connection clarifies that BERT and its descendants can be viewed as one-step approximations to absorbing-state diffusion, and conversely that masked diffusion models generalize masked language models by iterating the denoising step many times to support generation.^[1] D3PM also showed that autoregressive models emerge from a deterministic left-to-right masking schedule, placing the three paradigms (autoregressive, masked, diffusion) on a single spectrum.

Score Entropy Discrete Diffusion (SEDD)

A second major step came in 2024 with SEDD, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution," by Aaron Lou, Chenlin Meng, and Stefano Ermon, which received the Best Paper award at ICML 2024.^[3] SEDD addresses a longstanding theoretical obstacle: standard score matching, which is the workhorse loss for continuous diffusion, does not naturally apply to discrete spaces because there is no gradient of the log density to estimate.

Lou et al. introduced a new loss called score entropy that extends score matching to discrete spaces by estimating ratios of the data distribution rather than gradients of its logarithm. The score entropy loss provides a principled and computationally tractable training objective for discrete diffusion models with general (not just absorbing) corruption processes. Empirically, SEDD reduced perplexity by 25 to 75 percent relative to prior discrete diffusion baselines at comparable scales, and obtained roughly six to eight times better unconditional generative perplexity than an unannealed GPT-2. SEDD was the first discrete diffusion result to be broadly competitive with autoregressive baselines on standard language modeling benchmarks at the GPT-2 scale, and its ICML Best Paper award marked the moment that diffusion language modeling moved from a curiosity to a serious research direction.^[3]
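The idea of learning a ratio rather than a gradient can be checked numerically on a single pair of states. The sketch below assumes the per-pair form of the loss, s - r log s + K(r) with K(r) = r (log r - 1), where r is the true ratio p(y)/p(x) and s is the model's positive estimate of it; the weights and the expectation over states that appear in the full objective are omitted, and the form should be checked against the paper before being relied on.

```python
import math

def score_entropy_term(s, ratio):
    """Per-pair score entropy (assumed form; weights and expectation omitted).

    s     : the model's positive estimate of p(y) / p(x)
    ratio : the true ratio p(y) / p(x)
    """
    k = ratio * (math.log(ratio) - 1.0)    # constant that shifts the minimum to zero
    return s - ratio * math.log(s) + k

true_ratio = 2.5
grid = [i / 10 for i in range(1, 51)]      # candidate estimates 0.1 .. 5.0
best = min(grid, key=lambda s: score_entropy_term(s, true_ratio))
print(best, score_entropy_term(best, true_ratio))
```

The term is convex in `s` and its derivative `1 - ratio / s` vanishes only at `s = ratio`, so minimizing it pushes the estimate toward the true ratio while the log barrier keeps the estimate positive.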

The Masked Diffusion framework: MDLM and MD4

Building on D3PM's absorbing-state insight and SEDD's tractable training, two NeurIPS 2024 papers refined the masked diffusion framework into something close to a unified recipe. The first, MDLM (Simple and Effective Masked Diffusion Language Models) by Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov, presents a substitution-based parameterization that reduces the absorbing-state diffusion variational bound to a mixture of standard masked language modeling cross-entropy losses with appropriate weights.^[9] The MDLM paper also derives a Rao-Blackwellized objective that further reduces variance during training. With these simplifications and modern engineering practices, masked diffusion models match or approach autoregressive perplexity on standard benchmarks at GPT-2 scale.^[9]
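A weighted masked-language-modeling loss of this kind can be estimated with a few lines of code. The sketch below is a simplified illustration rather than the MDLM implementation: it assumes a linear schedule in which each token is masked independently with probability `t`, which gives a cross-entropy weight of `1 / t`, and `log_prob_fn` is a hypothetical model interface.

```python
import math
import random

MASK = -1  # hypothetical id for the mask symbol

def masked_diffusion_loss(tokens, log_prob_fn, rng):
    """One Monte Carlo sample of a masked-diffusion training loss.

    log_prob_fn(noisy, i, token) is assumed to return
    log p(token at position i | noisy sequence).
    """
    t = rng.uniform(1e-3, 1.0)                     # noise level, kept away from zero
    noisy = [MASK if rng.random() < t else tok for tok in tokens]
    loss = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:                           # only masked positions contribute
            loss -= log_prob_fn(noisy, i, clean)
    return loss / (t * len(tokens))                # 1/t weight, averaged over length

# A model that knows nothing: uniform over a 50-token vocabulary.
vocab = 50
uniform_model = lambda noisy, i, tok: -math.log(vocab)
rng = random.Random(0)
print(masked_diffusion_loss(list(range(64)), uniform_model, rng))
```

Because the expected fraction of masked tokens is `t`, the `1 / t` weight cancels it on average, so for the uniform model the loss averages to `log(vocab)` per token regardless of the noise level drawn.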

The second, MD4 (Simplified and Generalized Masked Diffusion for Discrete Data) by Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias, generalizes the same family. MD4 shows that the continuous-time variational objective for masked diffusion is a simple weighted integral of cross-entropy losses, supports state-dependent masking schedules, and reports state-of-the-art discrete diffusion results that exceed autoregressive likelihood on pixel-level image modeling and surpass prior diffusion language models on four out of five zero-shot language tasks at GPT-2 scale.^[10] Together MDLM and MD4 cemented masked diffusion as the dominant practical instantiation of discrete diffusion for text, and provided the simplified training recipes that LLaDA, Mercury, and Dream all build upon.

Adapting autoregressive models: DiffuGPT and DiffuLLaMA

Training a large diffusion language model from scratch is expensive, so a parallel research line asks whether an existing pretrained autoregressive checkpoint can be cheaply converted into a diffusion model. The reference result is DiffuGPT and DiffuLLaMA, presented in "Scaling Diffusion Language Models via Adaptation from Autoregressive Models" by Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, and Lingpeng Kong, submitted in October 2024 and accepted at ICLR 2025.^[13] The work converts GPT-2 and LLaMA base models, spanning parameter sizes from 127M to 7B, into masked diffusion models using fewer than 200 billion tokens of continued training; DiffuLLaMA in particular was adapted from LLaMA-2 7B on a mixture of SlimPajama and Starcoder data. The authors report that the resulting models outperform earlier diffusion language models and are competitive with their autoregressive counterparts, while acquiring diffusion-native abilities such as filling in the middle of a sequence without reordering the prompt. The AR-adaptation idea was influential: Dream 7B and the later LLaDA 2.0 family both initialize from autoregressive checkpoints rather than training the diffusion model from a random start.^[13]

LLaDA: the first 8B-scale demonstration

The frontier-scale moment for diffusion language modeling arrived in February 2025 with LLaDA, "Large Language Diffusion Models," by Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li from the Gaoling School of Artificial Intelligence at Renmin University of China and Ant Group.^[4] LLaDA is a masked discrete diffusion model whose architecture is a standard transformer without the causal mask used by decoder-only LLMs, trained with a forward masking process and a reverse generation process that predicts masked tokens. The training and SFT pipeline closely mirrors a conventional large language model recipe, with diffusion appearing only in the loss and at sampling time.

LLaDA was trained from scratch on 2.3 trillion tokens using approximately 0.13 million H800 GPU-hours, releasing both a 1B and an 8B variant. The 8B base model performs comparably to LLaMA 3 8B on the standard suite of academic benchmarks. On MMLU LLaDA 8B scores 65.9 versus 65.4 for LLaMA 3 8B Base; on GSM8K it scores 70.3 versus 48.7; on MATH it scores 31.4 versus 16.0; and on HumanEval it scores 35.4 versus 34.8.^[4] After supervised fine-tuning the model exhibits competitive multi-turn dialogue, instruction following, and the previously discussed reversal-curse behavior. The headline contribution of LLaDA is not any single benchmark number but the demonstration that the diffusion paradigm scales: it is the first publicly reported masked diffusion language model trained at the same scale as competitive autoregressive baselines, and the first to match them on the standard academic suite. The authors state that "LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities."^[4] Subsequent work extended LLaDA to vision-language inputs (LLaDA-V), to mathematical reasoning, and to instruction-tuned chat checkpoints.^[4]

The LLaDA family: LLaDA 1.5, LLaDA-MoE, and LLaDA 2.0

The original LLaDA was aligned with supervised fine-tuning only, and the same group released a series of successors that added preference optimization, sparse mixture-of-experts architectures, and frontier-scale parameter counts.

LLaDA 1.5, "Variance-Reduced Preference Optimization for Large Language Diffusion Models" by Fengqi Zhu and colleagues (submitted May 2025), addresses the difficulty of applying preference-based alignment to diffusion models. Because a diffusion model's likelihood is only available through a high-variance evidence-lower-bound (ELBO) estimate, naive Direct Preference Optimization is unstable. The authors introduce VRPO (Variance-Reduced Preference Optimization), a framework that analyzes the bias and variance of the preference loss and applies unbiased variance-reduction strategies including optimal Monte Carlo budget allocation and antithetic sampling. LLaDA 1.5 improves over the supervised-fine-tuning-only LLaDA baseline by reported margins of +4.7 on GSM8K, +3.0 on HumanEval, +1.8 on MBPP, +4.0 on IFEval, and +4.3 on Arena-Hard.^[14]

LLaDA-MoE, "A Sparse MoE Diffusion Language Model" by Fengqi Zhu and colleagues from Ant Group and Renmin University (submitted September 2025), is reported as the first open-source mixture-of-experts diffusion language model trained from scratch. It was pretrained on approximately 20 trillion tokens and uses a learned router to activate only about 1.4 billion of its roughly 7 billion total parameters per token, which keeps inference cost low while preserving the capacity of a larger model. The authors report state-of-the-art results among diffusion language models, surpassing LLaDA, LLaDA 1.5, and Dream, with the instruct-tuned variant described as comparable to Qwen2.5-3B-Instruct.^[15]
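The gap between total and active parameters comes from routing each token to a small subset of experts. The sketch below shows generic top-k routing as commonly used in sparse MoE layers; it is an assumption for illustration and not a description of LLaDA-MoE's actual router, and the experts here are toy scalar functions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    total = sum(probs[e] for e in top)
    return {e: probs[e] / total for e in top}

def moe_layer(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route(router_logits, k)
    return sum(w * experts[e](x) for e, w in gates.items())

# Four toy experts; expert m multiplies its input by m.
experts = [lambda x, m=m: m * x for m in range(4)]
logits = [0.0, 1.0, 3.0, 2.0]              # hypothetical router scores for one token
print(route(logits, k=2), moe_layer(1.0, logits, experts, k=2))
```

Only two of the four experts execute for this token, so the compute per token tracks the active subset while the model's capacity tracks the full set.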

LLaDA 2.0, "Scaling Up Diffusion Language Models to 100B" from the InclusionAI team at Ant Group (released December 2025), pushes the mixture-of-experts line to frontier scale by converting a pretrained autoregressive model into a diffusion model with a three-phase block-level training schedule. The release includes two instruction-tuned MoE variants: LLaDA 2.0-mini at 16 billion total parameters and LLaDA 2.0-flash at 100 billion total parameters with only about 6.1 billion activated per token. The authors report that LLaDA 2.0-flash outperforms open dense models of similar scale while reducing computational cost, making it the largest publicly described diffusion language model as of early 2026.^[16]

Mercury: Inception Labs' commercial diffusion language model

Where LLaDA established the research case for diffusion at scale, Mercury from Inception Labs established the commercial case. Inception Labs was founded by three researchers with deep credentials in diffusion modeling and language: Stanford professor Stefano Ermon (co-author of the SEDD paper), UCLA professor Aditya Grover, and Cornell professor Volodymyr Kuleshov (co-author of MDLM and Block Diffusion). The company, whose backing includes a $50M financing round led by Menlo Ventures with participation from Andrew Ng and Andrej Karpathy, launched Mercury Coder, billed as the first commercial-scale diffusion-based large language model, on February 26, 2025.^[5]^[11] The launch announcement appeared under the headline "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model," and stated that "our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips."^[5]

Mercury Coder ships in Mini and Small variants. According to the technical report "Mercury: Ultra-Fast Language Models Based on Diffusion," Mercury Coder Mini achieves approximately 1,109 tokens per second on NVIDIA H100 GPUs, while Mercury Coder Small achieves approximately 737 tokens per second, in both cases roughly an order of magnitude faster than speed-optimized autoregressive frontier models of comparable quality.^[7] On the Copilot Arena coding leaderboard Mercury Coder Mini placed second on quality and first on speed at the time of launch, matching or exceeding GPT-4o Mini and Claude 3.5 Haiku while running at a fraction of their latency. The company subsequently added Mercury Chat for general conversation and made the models available through Amazon Bedrock and Azure AI Foundry.^[5]^[7]

The Mercury report describes the underlying model as a transformer that predicts multiple tokens simultaneously, trained with a masked diffusion objective in the MDLM/MD4 family. It does not give exact parameter counts for public-facing variants, but the design philosophy is clear: amortize compute by removing the strict sequential bottleneck of next-token prediction, then exploit parallelism on existing GPU hardware. The result is a paradigm in which the number of forward passes scales with the number of denoising steps rather than with the number of generated tokens, and in which latency is largely decoupled from output length over typical coding workloads.^[7]

In early 2026 Inception Labs announced Mercury 2, which it describes as the first reasoning-capable diffusion large language model. According to the company's launch materials, Mercury 2 generates output through parallel refinement rather than sequential decoding, runs at approximately 1,009 tokens per second on NVIDIA Blackwell GPUs, supports a 128K-token context window, and offers tunable reasoning. Inception positions it as roughly 5 times faster than leading speed-optimized autoregressive models while remaining competitive in quality, and prices it at $0.25 per million input tokens and $0.75 per million output tokens through an OpenAI-API-compatible endpoint. As with the first generation, the company does not disclose parameter counts. These figures are vendor-reported and have not been independently benchmarked here.^[19]

Block Diffusion (BD3-LM, 2025)

Pure diffusion models suffer from two limitations relative to autoregressive systems: they cannot easily generate variable-length sequences, and they cannot reuse a KV cache the way a standard transformer decoder can. Block Diffusion (BD3-LM, "Block Discrete Denoising Diffusion Language Models") was proposed to address both. The paper, by Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov, was accepted as an Oral at ICLR 2025.^[6]

The core idea is to partition a token sequence into blocks and to model the joint distribution as an autoregressive product over blocks, with each block generated by a discrete diffusion process internally. When the block size equals the full sequence length, BD3-LM degenerates into a standard masked diffusion model; when the block size equals one, it degenerates into a standard autoregressive transformer. Intermediate block sizes interpolate smoothly between the two regimes, trading off sample efficiency (favored by autoregressive) against parallelism and controllability (favored by diffusion). The authors show that BD3-LM supports flexible-length generation, KV caching, and parallel sampling within each block, achieves state-of-the-art perplexity among diffusion language models, and effectively bridges the autoregressive and diffusion paradigms with a single architecture. The framework also clarified that the production-scale Mercury models could be viewed as instances of block diffusion with carefully chosen block sizes.^[6]
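The block structure can be illustrated with a nested loop: autoregressive over blocks on the outside, iterative unmasking inside each block. This is a schematic sketch and not the BD3-LM sampler; `toy_denoise_block` is a hypothetical model call that simply proposes each slot's global position as its token, `call_log` exists only to count model calls, and slots are unmasked left to right within a block for simplicity.

```python
MASK = None
call_log = []                                   # records one entry per model call

def toy_denoise_block(context, block):
    """Hypothetical model call: propose a token for every masked slot."""
    call_log.append(len(context))
    return {i: len(context) + i for i, tok in enumerate(block) if tok is MASK}

def generate(num_blocks, block_size, steps_per_block, denoise_block):
    context = []                                # finished blocks, fixed and cacheable
    for _ in range(num_blocks):                 # autoregressive over blocks
        block = [MASK] * block_size
        for step in range(steps_per_block):     # diffusion within the block
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            if not masked:
                break
            proposals = denoise_block(context, block)
            # Unmask an even share of the remaining slots (ceiling division).
            k = -(-len(masked) // (steps_per_block - step))
            for i in masked[:k]:
                block[i] = proposals[i]
        context.extend(block)
    return context

print(generate(num_blocks=2, block_size=4, steps_per_block=2,
               denoise_block=toy_denoise_block))
```

Setting `block_size=1` makes one model call per token, which is the autoregressive limit, while a single block spanning the whole sequence reduces to the plain masked diffusion loop. Because finished blocks never change, their keys and values could be cached in a real implementation.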

Dream 7B and the open-weights frontier

In August 2025, Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong from the University of Hong Kong and Huawei Noah's Ark Lab released Dream 7B, "Dream 7B: Diffusion Large Language Models," with open weights.^[12] Dream 7B is a 7 billion parameter discrete diffusion language model that, like LLaDA, follows the masked diffusion recipe. Two training tricks were emphasized in the report: AR-based LLM initialization, in which the diffusion model is initialized from a pretrained autoregressive checkpoint and then continued under the masked diffusion objective, and context-adaptive token-level noise rescheduling, which adjusts the per-token corruption rate based on local context to stabilize training.

The authors report that Dream 7B consistently outperforms previous open diffusion language models on general, mathematical, and coding benchmarks while remaining competitive with autoregressive LLaMA-class models of similar scale. A companion release, Dream-Coder 7B, specialized the model for code; its Instruct variant attains 21.4 percent pass@1 on LiveCodeBench. Both Dream 7B and Dream-Coder 7B were released with open weights on Hugging Face, making them among the most capable open diffusion language models available at the time of release and providing a public counterpart to the closed-source Mercury family.^[12]

Gemini Diffusion

In May 2025 Google DeepMind unveiled Gemini Diffusion, an experimental text diffusion model that generates output by refining noise step by step and that "generates entire blocks of tokens at once" rather than predicting tokens sequentially. DeepMind reports that the model corrects errors during generation for more consistent outputs and that it excels at editing tasks in the context of mathematics and code. The model page reports a sampling speed of roughly 1,479 tokens per second (excluding an approximately 0.84-second overhead), and Google's announcement frames the goal as delivering "the performance of Gemini 2.0 Flash-Lite at 5x the speed." On DeepMind's reported benchmarks Gemini Diffusion scores 89.6 percent on HumanEval, 76.0 percent on MBPP, 23.3 percent on AIME 2025, and 69.1 percent on Global MMLU (Lite).^[18] As of the date of this article Gemini Diffusion is offered only as an experimental demo to gated testers rather than as a generally available production model, and all of these figures are vendor-reported.^[18]

Seed Diffusion

Seed Diffusion is a high-speed discrete diffusion language model from ByteDance's Seed team, described in the technical report "Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference" (submitted August 2025). The report focuses on structured code generation as a proving ground for the discrete diffusion approach and emphasizes inference speed: the Seed Diffusion Preview is reported to reach approximately 2,146 tokens per second on NVIDIA H20 GPUs, which the authors describe as roughly 5.4 times faster than autoregressive models of comparable scale and as faster than the contemporary Mercury and Gemini Diffusion systems. The team attributes this throughput to a two-stage diffusion training procedure, constrained-order learning, and on-policy learning of an efficient parallel decoding order, and claims state-of-the-art results on the speed-quality Pareto frontier for code models. The model is offered through ByteDance rather than as open weights, and the speed figures are author-reported.^[17]

How do diffusion language models compare to autoregressive models?

The diffusion-versus-autoregressive comparison has three dimensions worth separating: speed, quality, and reasoning.

Speed. Diffusion models can in principle generate many tokens per forward pass. In practice the number of denoising steps determines the actual latency: at a fixed quality, Mercury Coder Mini achieves roughly 1,109 tokens per second on H100, compared to perhaps 100 to 200 tokens per second for autoregressive models of similar quality.^[7] LLaDA and Dream 7B research checkpoints report similar parallel decoding behavior, although the gap narrows when speculative decoding or other autoregressive acceleration techniques are applied to baseline systems. The advantage is largest for short to medium outputs and shrinks for very long generations, where the per-step transformer cost grows with sequence length.

Quality. As of early 2026, the best open and closed diffusion language models are competitive with comparably sized autoregressive models on standard academic benchmarks. LLaDA 8B matches or exceeds LLaMA 3 8B on MMLU, GSM8K, MATH, and HumanEval.^[4] Mercury Coder matches or exceeds GPT-4o Mini and Claude 3.5 Haiku on coding benchmarks while running 5 to 10 times faster.^[7] Dream 7B is broadly competitive with LLaMA-class 7B baselines.^[12] However, no diffusion language model has yet been shown to match the absolute frontier of GPT-4-class or Claude-class systems for general chat, complex reasoning, or agentic workflows.

Reasoning. Diffusion models exhibit some characteristic strengths and weaknesses on reasoning workloads. The bidirectional nature of masked diffusion gives them an advantage on reversal-style tasks and on global-constraint problems; LLaDA's reversal-curse experiments are one example.^[4] On chain-of-thought style reasoning, however, diffusion models do not naturally produce a single causal trace, which complicates the application of standard inference-time scaling techniques developed for autoregressive reasoning models. Several research lines, including DoT-Plaid (chain-of-thought in latent space) and the more recent work on integrating diffusion with state-space models such as Mamba and Mamba-2, attempt to bridge this gap. Although Inception Labs bills Mercury 2 as reasoning-capable, the field has not yet produced an independently verified diffusion-native equivalent of OpenAI's o-series or DeepSeek-R1.^[8]

What open problems remain for diffusion language models?

Several open problems remain unresolved as of 2026.

The first is scaling. For most of 2025 the dense LLaDA 8B was the largest publicly verified diffusion language model trained from scratch. The mixture-of-experts line has since pushed the public ceiling higher: LLaDA-MoE was pretrained from scratch on roughly 20 trillion tokens, and LLaDA 2.0-flash reaches 100 billion total parameters (about 6.1 billion activated) by converting an autoregressive checkpoint.^[15]^[16] Whether the diffusion paradigm continues to scale in the same way as autoregressive models into the hundreds of billions of dense parameters, and whether it produces similar emergent capabilities, remains empirically open. Inception Labs has not disclosed parameter counts for Mercury 2, but its reported capabilities suggest that frontier-scale diffusion is at least commercially feasible.^[5]^[7]^[19]

The second is prompting and inference-time control. Autoregressive models accept a prompt by simply prepending it to the decoded context. Masked diffusion models must instead condition by leaving prompt tokens unmasked throughout the denoising process, a technique sometimes called prefix conditioning. This works for standard chat but raises subtle questions about caching, batching, and tool use that the field is still actively investigating.^[9]^[12]
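
Prefix conditioning can be sketched as a sampling loop in which only the generation region is ever masked. The code below is a minimal toy under stated assumptions, not the sampler of LLaDA, Dream, or any other cited system: `MASK`, `toy_denoiser`, and `generate_with_prefix` are hypothetical names, the "denoiser" returns random guesses in place of a trained network, and the confidence-ordered unmasking schedule is just one plausible choice.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for the mask token


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def generate_with_prefix(prompt, gen_len, num_steps, vocab_size=100, seed=0):
    """Unmask gen_len positions over at most num_steps passes; the prompt stays fixed."""
    rng = random.Random(seed)
    # The prompt is written in unmasked; only the continuation starts as MASK.
    tokens = list(prompt) + [MASK] * gen_len
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_denoiser(tokens, vocab_size, rng)
        # Commit an equal share of the remaining masks per step (ceiling division),
        # taking the most confident guesses first.
        steps_left = num_steps - step
        quota = -(-len(masked) // steps_left)
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = guesses[i][0]
        # Prompt positions are never in `masked`, so they are never rewritten.
    return tokens


prompt = [11, 12, 13]
out = generate_with_prefix(prompt, gen_len=8, num_steps=4)
print(out[:len(prompt)] == prompt, MASK in out)
```

Because the model attends bidirectionally over the whole sequence at every step, each pass reprocesses the prompt along with the partially unmasked continuation. That is one reason caching is less straightforward here than in autoregressive decoding, where the prompt's key-value cache is computed once.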

The third is reasoning. As discussed above, no diffusion language model has yet demonstrated frontier-level chain-of-thought reasoning. The closest analogues are work on continuous latent reasoning (related conceptually to ideas explored in latent reasoning architectures) and Block Diffusion's hybrid approach, but no diffusion-native equivalent of inference-time scaling for reasoning models has been established.^[6]

The fourth is multimodality. LLaDA-V extends LLaDA to vision-language tasks, but the integration of diffusion language modeling with the audio, vision, and tool-use modalities of modern frontier systems is still nascent. Native multimodal diffusion that combines text, image, and code diffusion in a single model remains an active research direction.

The fifth is alignment and safety. Reinforcement learning from human feedback, mixture-of-experts routing, and the various alignment techniques developed for autoregressive systems do not all transfer cleanly to the diffusion sampler. Whether the diffusion paradigm requires its own alignment toolkit, or whether existing techniques can be adapted, is an open question.

What is the adoption status of diffusion language models in 2025-2026?

By mid-2026 the diffusion language modeling paradigm has moved from research curiosity to a recognized alternative to autoregressive transformer decoders, but it has not displaced them. The research community has produced strong open-weights releases (LLaDA, LLaDA-MoE, LLaDA 2.0, Dream 7B, MDLM, MD4), an ICML 2024 Best Paper (SEDD), and an ICLR 2025 Oral (Block Diffusion). The commercial frontier is led by Inception Labs' Mercury family, now in its Mercury 2 generation and available through Amazon Bedrock, Azure AI Foundry, and a public API, alongside Google DeepMind's experimental Gemini Diffusion and ByteDance's high-speed Seed Diffusion.^[17]^[18]^[19] Open-source ecosystems including Hugging Face have first-class support for masked diffusion models, and NVIDIA has shipped diffusion-based molecular generation tools (GenMol) that share architectural lineage with MDLM.^[9]

Whether diffusion will eventually replace autoregressive generation as the dominant paradigm for large language models depends on whether the speed advantages continue to compound at frontier scale, whether reasoning workflows can be made diffusion-native, and whether multimodal integration matures. As of 2026, the most plausible outcome is hybrid: block diffusion architectures and Mercury-style commercial systems suggest that the future may well consist of models that combine the parallelism of diffusion with the cache-friendly structure of autoregressive decoding, rather than a winner-take-all replacement of one paradigm by the other.^[6]^[7]

References

1. Austin, Jacob; Johnson, Daniel D.; Ho, Jonathan; Tarlow, Daniel; van den Berg, Rianne. "Structured Denoising Diffusion Models in Discrete State-Spaces." arXiv:2107.03006, NeurIPS 2021. https://arxiv.org/abs/2107.03006 Accessed 2026-05-31.
2. Li, Xiang Lisa; Thickstun, John; Gulrajani, Ishaan; Liang, Percy; Hashimoto, Tatsunori B. "Diffusion-LM Improves Controllable Text Generation." arXiv:2205.14217, NeurIPS 2022. https://arxiv.org/abs/2205.14217 Accessed 2026-05-31.
3. Lou, Aaron; Meng, Chenlin; Ermon, Stefano. "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." arXiv:2310.16834, ICML 2024 Best Paper. https://arxiv.org/abs/2310.16834 Accessed 2026-05-31.
4. Nie, Shen; Zhu, Fengqi; You, Zebin; Zhang, Xiaolu; Ou, Jingyang; Hu, Jun; Zhou, Jun; Lin, Yankai; Wen, Ji-Rong; Li, Chongxuan. "Large Language Diffusion Models." arXiv:2502.09992, February 2025. https://arxiv.org/abs/2502.09992 Accessed 2026-05-31.
5. Inception Labs. "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model." Inception Labs Blog, February 26, 2025. https://www.inceptionlabs.ai/blog/introducing-mercury Accessed 2026-05-31.
6. Arriola, Marianne; Gokaslan, Aaron; Chiu, Justin T.; Yang, Zhihan; Qi, Zhixuan; Han, Jiaqi; Sahoo, Subham Sekhar; Kuleshov, Volodymyr. "Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models." arXiv:2503.09573, ICLR 2025 Oral. https://arxiv.org/abs/2503.09573 Accessed 2026-05-31.
7. Inception Labs et al. "Mercury: Ultra-Fast Language Models Based on Diffusion." arXiv:2506.17298, 2025. https://arxiv.org/abs/2506.17298 Accessed 2026-05-31.
8. Gulrajani, Ishaan; Hashimoto, Tatsunori B. "Likelihood-Based Diffusion Language Models." arXiv:2305.18619, NeurIPS 2023 (Plaid). https://arxiv.org/abs/2305.18619 Accessed 2026-05-31.
9. Sahoo, Subham Sekhar; Arriola, Marianne; Schiff, Yair; Gokaslan, Aaron; Marroquin, Edgar; Chiu, Justin T.; Rush, Alexander; Kuleshov, Volodymyr. "Simple and Effective Masked Diffusion Language Models." arXiv:2406.07524, NeurIPS 2024. https://arxiv.org/abs/2406.07524 Accessed 2026-05-31.
10. Shi, Jiaxin; Han, Kehang; Wang, Zhe; Doucet, Arnaud; Titsias, Michalis K. "Simplified and Generalized Masked Diffusion for Discrete Data." arXiv:2406.04329, NeurIPS 2024 (MD4). https://arxiv.org/abs/2406.04329 Accessed 2026-05-31.
11. Maginative. "Inception Labs Launches Mercury, the First Commercial Diffusion-Based Language Model." Maginative, February 2025. https://www.maginative.com/article/inception-labs-launches-mercury-the-first-commercial-diffusion-based-language-model/ Accessed 2026-05-31.
12. Ye, Jiacheng; Xie, Zhihui; Zheng, Lin; Gao, Jiahui; Wu, Zirui; Jiang, Xin; Li, Zhenguo; Kong, Lingpeng. "Dream 7B: Diffusion Large Language Models." arXiv:2508.15487, August 2025. https://arxiv.org/abs/2508.15487 Accessed 2026-05-31.
13. Gong, Shansan; Agarwal, Shivam; Zhang, Yizhe; Ye, Jiacheng; Zheng, Lin; Li, Mukai; An, Chenxin; Zhao, Peilin; Bi, Wei; Han, Jiawei; Peng, Hao; Kong, Lingpeng. "Scaling Diffusion Language Models via Adaptation from Autoregressive Models." arXiv:2410.17891, ICLR 2025 (DiffuGPT and DiffuLLaMA). https://arxiv.org/abs/2410.17891 Accessed 2026-05-31.
14. Zhu, Fengqi; Wang, Rongzhen; Nie, Shen; Zhang, Xiaolu; Wu, Chunwei; Hu, Jun; Zhou, Jun; Chen, Jianfei; Lin, Yankai; Wen, Ji-Rong; Li, Chongxuan. "LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models." arXiv:2505.19223, May 2025. https://arxiv.org/abs/2505.19223 Accessed 2026-05-31.
15. Zhu, Fengqi; You, Zebin; Xing, Yipeng; Huang, Zenan; Liu, Lin; et al. "LLaDA-MoE: A Sparse MoE Diffusion Language Model." arXiv:2509.24389, September 2025. https://arxiv.org/abs/2509.24389 Accessed 2026-05-31.
16. InclusionAI (Ant Group). "LLaDA2.0: Scaling Up Diffusion Language Models to 100B." arXiv:2512.15745, December 2025. https://arxiv.org/abs/2512.15745 Accessed 2026-05-31.
17. ByteDance Seed. "Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference." arXiv:2508.02193, August 2025. https://arxiv.org/abs/2508.02193 Accessed 2026-05-31.
18. Google DeepMind. "Gemini Diffusion." DeepMind model page and Google blog announcement, May 2025. https://deepmind.google/models/gemini-diffusion/ and https://blog.google/technology/google-deepmind/gemini-diffusion/ Accessed 2026-05-31.
19. Inception Labs. "Introducing Mercury 2" and "Inception Launches Mercury 2, the Fastest Reasoning LLM." Inception Labs Blog and Business Wire press release, February 2026. https://www.inceptionlabs.ai/blog/introducing-mercury-2 and https://www.businesswire.com/news/home/20260224034496/en/Inception-Launches-Mercury-2-the-Fastest-Reasoning-LLM-5x-Faster-Than-Leading-Speed-Optimized-LLMs-with-Dramatically-Lower-Inference-Cost Accessed 2026-05-31.
