Diffusion Language Models
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,735 words
Diffusion language models (DLMs, sometimes written dLLMs at frontier scale) are a family of generative models for text that synthesize sequences by reversing a stochastic corruption process rather than by predicting one token at a time. Where conventional large language models such as the GPT family or Llama 3 are causal autoregressive systems that factor the joint probability of a sequence into a product of left-to-right conditionals, diffusion language models instead start from a fully corrupted or masked sequence and iteratively denoise it, refining many positions in parallel at each step. The approach inherits its mathematical scaffolding from continuous-space diffusion models such as DDPM, adapted to either continuous word-embedding spaces or to discrete token spaces with absorbing or uniform corruption.[^1][^2][^3]
The motivation for studying diffusion language models has shifted over time. Early work in 2021 and 2022 was driven by the desire for controllable generation and for an alternative to the strict left-to-right factorization of autoregressive models. By 2024 the focus had moved to closing the likelihood gap with autoregressive baselines and to demonstrating that masked discrete diffusion is a competitive paradigm at scale. In 2025 the field reached a frontier-scale milestone with LLaDA, an 8-billion-parameter masked diffusion model trained from scratch that matches Llama 3 8B on standard benchmarks, and a commercial milestone with Mercury from Inception Labs, the first widely deployed diffusion-based language model API. Additional 2025 releases, including Block Diffusion (BD3-LM) from Cornell and Dream 7B from the University of Hong Kong and Huawei Noah's Ark Lab, consolidated the case that diffusion can serve as a credible substitute for next-token prediction in many production settings.[^1][^4][^5][^6]
This article surveys the technical foundations of diffusion language models, traces the principal research lines from D3PM (2021) through Diffusion-LM (2022), SEDD (2024), MDLM and MD4 (2024), LLaDA (2025), Block Diffusion (2025), Mercury (2025), and Dream 7B (2025), and discusses the speed and quality trade-offs that distinguish them from autoregressive systems built on the transformer architecture.
Two structural motivations drive interest in diffusion for language. The first is parallel decoding. An autoregressive transformer must produce tokens one at a time, paying the cost of a full forward pass through the model for each emitted token; this is the central bottleneck of long-form decoding and the reason why techniques such as FlashAttention and speculative decoding receive so much engineering attention. A diffusion language model can in principle update every position in a sequence at every denoising step. The number of forward passes is therefore set by the number of denoising steps rather than by the sequence length, which allows trade-offs between latency and quality that the autoregressive paradigm does not naturally expose. Mercury Coder Mini, for example, reports throughput of approximately 1,109 tokens per second on NVIDIA H100 GPUs, roughly an order of magnitude faster than speed-optimized autoregressive models of similar quality.[^5][^7]
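The difference in pass counts can be illustrated with a minimal sketch. The function names are hypothetical, and the model ignores per-pass cost, which still grows with sequence length in both paradigms.

```python
def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder needs one forward pass per emitted token."""
    return num_tokens


def diffusion_passes(num_tokens: int, denoising_steps: int) -> int:
    """A diffusion sampler needs one forward pass per denoising step.

    Each pass covers every position, so the count is independent of
    num_tokens (the argument is kept only to make that explicit).
    """
    return denoising_steps


def tokens_per_pass(num_tokens: int, passes: int) -> float:
    """Average number of tokens finalized per forward pass."""
    return num_tokens / passes


# Illustrative numbers only: a 512-token output with different step budgets.
for steps in (512, 128, 32):
    passes = diffusion_passes(512, steps)
    print(steps, "steps ->", tokens_per_pass(512, passes), "tokens per pass")
```

Lowering the step budget raises the tokens finalized per pass, which is the latency and quality dial described above; how far it can be lowered before quality degrades is an empirical question this sketch does not address.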
The second motivation is controllability. Because a diffusion sampler maintains a representation of the entire sequence at every step, classifier-guided or gradient-based control can be applied to that representation in much the way it is applied to images in continuous diffusion. Diffusion-LM, the first widely cited diffusion model for text, was introduced specifically to demonstrate fine-grained controllable generation, and subsequent work has explored infilling, length control, and global constraints that are awkward to express in left-to-right decoding.[^2]
A third, more recent motivation is bidirectional reasoning. Autoregressive models are notoriously susceptible to the so-called reversal curse: a model that learns "A is B" during pretraining does not reliably answer the reverse question "B is A?" without explicit data. Because a masked diffusion model does not assume a fixed token ordering during training, it can produce predictions in arbitrary orders. LLaDA reports that on a reversal poem completion task its 8B model outperforms GPT-4o (45.6 vs 34.3), although it underperforms in the forward direction (51.8 vs 82.7), an asymmetry that the authors attribute to the order-agnostic training objective.[^4]
The first widely cited diffusion model built specifically for text was Diffusion-LM by Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto, released in May 2022.[^2] Diffusion-LM adapts the Gaussian diffusion machinery developed for images to text by embedding discrete tokens into a continuous space, running the forward Gaussian corruption process on those embeddings, and learning a denoising network that reverses it. A learned rounding step maps the final denoised vectors back to tokens. The principal experimental result was on six fine-grained controllable generation tasks, where Diffusion-LM significantly outperformed GPT-2-based plug-and-play and fine-tuning baselines by applying classifier guidance directly to the continuous latent variables. It also functioned as a proof of concept that continuous diffusion in embedding space can produce fluent, fixed-length text.[^2]
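The embed, corrupt, and round pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the 2-D embedding table is hypothetical, and nearest-neighbor rounding stands in for the learned rounding step.

```python
import math
import random

# Hypothetical 2-D embedding table; real models learn high-dimensional embeddings.
EMBEDDINGS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [-1.0, -1.0]}


def forward_noise(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I), as in DDPM."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def round_to_token(vec, embeddings):
    """Map a continuous vector to the token with the nearest embedding.

    Assumption: plain L2 nearest neighbor replaces the learned rounding step.
    """
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(embeddings, key=lambda tok: sqdist(vec, embeddings[tok]))


rng = random.Random(0)
# Light corruption (alpha_bar close to 1) leaves the vector near its embedding.
lightly_noised = forward_noise(EMBEDDINGS["cat"], alpha_bar=0.999, rng=rng)
recovered = round_to_token(lightly_noised, EMBEDDINGS)
```

As `alpha_bar` approaches 0 the sample approaches pure Gaussian noise, which is the starting point the denoising network has to reverse.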
The continuous embedding approach was scaled and refined by Plaid (Likelihood-Based Diffusion Language Models) from Ishaan Gulrajani and Tatsunori B. Hashimoto, presented at NeurIPS 2023. Plaid 1B reported the first diffusion-based likelihood numbers that exceeded a 124M GPT-2 baseline on standard benchmarks and introduced training recipes oriented around maximum likelihood rather than denoising quality alone, providing a foundation for later scaling studies. Even so, by late 2024 the continuous embedding line was largely overtaken by discrete masked diffusion, which is closer to how text is actually represented in tokenizers and which exhibited better empirical perplexity at large scale.[^8]
The discrete diffusion line begins with D3PM (Discrete Denoising Diffusion Probabilistic Models), introduced in "Structured Denoising Diffusion Models in Discrete State-Spaces," published at NeurIPS 2021 by Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg.[^1] D3PM generalizes the DDPM forward process from Gaussian noise on continuous tensors to Markov transition matrices on discrete tokens. Instead of adding noise to coordinates, the forward chain probabilistically swaps tokens for other tokens, including a designated "mask" token. By selecting different transition matrices, D3PM unifies several corruption strategies: uniform random replacement, swap to nearest-neighbor tokens in an embedding space, and absorbing-state replacement in which any token can be replaced by a special mask symbol but the mask itself never reverts.[^1]
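Two of these corruption strategies can be written down directly as row-stochastic matrices. The sketch below assumes a single-step corruption rate `beta` and a tiny vocabulary; the function names are illustrative rather than taken from any D3PM codebase.

```python
def uniform_transition(vocab_size, beta):
    """Keep the token with probability 1 - beta, otherwise resample uniformly.

    Equivalent to (1 - beta) * I + (beta / K) * ones, with K = vocab_size.
    """
    off = beta / vocab_size
    return [
        [(1.0 - beta) + off if i == j else off for j in range(vocab_size)]
        for i in range(vocab_size)
    ]


def absorbing_transition(vocab_size, beta, mask_id):
    """Keep the token with probability 1 - beta, otherwise jump to the mask.

    The mask row is one-hot on itself, so a masked token never reverts.
    """
    q = [[0.0] * vocab_size for _ in range(vocab_size)]
    for i in range(vocab_size):
        if i == mask_id:
            q[i][mask_id] = 1.0
        else:
            q[i][i] = 1.0 - beta
            q[i][mask_id] = beta
    return q


def step(dist, q):
    """Push a categorical distribution (row vector) through one forward step."""
    n = len(q)
    return [sum(dist[i] * q[i][j] for i in range(n)) for j in range(n)]


# Start from token 0 with certainty; token 3 plays the role of the mask.
dist = [1.0, 0.0, 0.0, 0.0]
q_absorb = absorbing_transition(4, beta=0.5, mask_id=3)
for _ in range(20):
    dist = step(dist, q_absorb)  # probability mass drains into the mask state
```

Under the absorbing matrix, repeated steps move all probability mass onto the mask, whereas the uniform matrix drives the distribution toward uniform over the vocabulary.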
The absorbing-state variant is particularly important because it draws a formal connection between diffusion and masked language modeling. As Austin et al. observe, when the forward process only ever masks tokens, the reverse process is trained to predict exactly the masked positions, which is the BERT-style objective up to a choice of masking schedule. The connection clarifies that BERT and its descendants can be viewed as one-step approximations to absorbing-state diffusion, and conversely that masked diffusion models generalize masked language models by iterating the denoising step many times to support generation.[^1] D3PM also showed that autoregressive models emerge from a deterministic left-to-right masking schedule, placing the three paradigms (autoregressive, masked, diffusion) on a single spectrum.
A second major step came in 2024 with SEDD, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution," by Aaron Lou, Chenlin Meng, and Stefano Ermon, which received the Best Paper award at ICML 2024.[^3] SEDD addresses a longstanding theoretical obstacle: standard score matching, which is the workhorse loss for continuous diffusion, does not naturally apply to discrete spaces because there is no gradient of the log density to estimate.
Lou et al. introduced a new loss called score entropy that extends score matching to discrete spaces by estimating ratios of the data distribution rather than gradients of its logarithm. The score entropy loss provides a principled and computationally tractable training objective for discrete diffusion models with general (not just absorbing) corruption processes. Empirically, SEDD reduced perplexity by 25 to 75 percent relative to prior discrete diffusion baselines at comparable scales, and obtained roughly six to eight times better unconditional generative perplexity than an unannealed GPT-2. SEDD was the first discrete diffusion result to be broadly competitive with autoregressive baselines on standard language modeling benchmarks at the GPT-2 scale, and its reception as ICML best paper marked the moment that diffusion language modeling moved from a curiosity to a serious research direction.[^3]
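The key property of score entropy, that it is non-negative and vanishes only when the model's score equals the true probability ratio, can be checked numerically. The sketch below uses the pointwise form s − r·log s + r(log r − 1) for a model score s and a true ratio r = p(y)/p(x); the toy distribution and the unit weights are assumptions made for illustration.

```python
import math


def score_entropy_term(s, ratio):
    """Pointwise score entropy for a model score s > 0 and a true ratio > 0.

    The term is convex in s, non-negative, and zero only at s == ratio.
    """
    return s - ratio * math.log(s) + ratio * (math.log(ratio) - 1.0)


def score_entropy(scores, p, x):
    """Sum the pointwise term over all y != x, with unit weights (assumption)."""
    return sum(
        score_entropy_term(scores[y], p[y] / p[x])
        for y in range(len(p))
        if y != x
    )


# Toy three-token data distribution (hypothetical values).
p = [0.5, 0.3, 0.2]
exact_ratios = [p[y] / p[0] for y in range(len(p))]

loss_at_truth = score_entropy(exact_ratios, p, x=0)    # numerically ~0
loss_elsewhere = score_entropy([1.0, 1.0, 1.0], p, x=0)  # strictly positive
```

In practice the true ratios are unknown, so training uses a denoising variant of the loss; this sketch only illustrates why the ratio is the minimizer.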
Building on D3PM's absorbing-state insight and SEDD's tractable training, two NeurIPS 2024 papers refined the masked diffusion framework into something close to a unified recipe. The first, MDLM (Simple and Effective Masked Diffusion Language Models) by Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov, presents a substitution-based parameterization that reduces the absorbing-state diffusion variational bound to a mixture of standard masked language modeling cross-entropy losses with appropriate weights.[^9] The MDLM paper also derives a Rao-Blackwellized objective that further reduces variance during training. With these simplifications and modern engineering practices, masked diffusion models match or approach autoregressive perplexity on standard benchmarks at GPT-2 scale.[^9]
The second, MD4 (Simplified and Generalized Masked Diffusion for Discrete Data) by Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias, generalizes the same family. MD4 shows that the continuous-time variational objective for masked diffusion is a simple weighted integral of cross-entropy losses, supports state-dependent masking schedules, and reports state-of-the-art discrete diffusion results that exceed autoregressive likelihood on pixel-level image modeling and surpass prior diffusion language models on four out of five zero-shot language tasks at GPT-2 scale.[^10] Together MDLM and MD4 cemented masked diffusion as the dominant practical instantiation of discrete diffusion for text, and provided the simplified training recipes that LLaDA, Mercury, and Dream all build upon.
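The "weighted integral of cross-entropy losses" can be written out explicitly. The notation below is an assumption chosen for this article rather than copied from either paper: $\alpha_t$ is the probability that a token is still unmasked at time $t$ (decreasing from 1 to 0), $\mathrm{m}$ is the mask token, and $p_\theta$ is the network's prediction of the clean token.

```latex
\mathcal{L}_\infty
  = \int_0^1 \frac{\alpha_t'}{1-\alpha_t}\,
    \mathbb{E}_{q(x_t \mid x_0)}\!\left[
      \sum_{n \,:\, x_t^{n} = \mathrm{m}}
        \log p_\theta\!\left(x_0^{n} \mid x_t\right)
    \right] dt

% Linear schedule \alpha_t = 1 - t, so \alpha_t' = -1 and the weight is -1/t:
\mathcal{L}_\infty
  = \int_0^1 \frac{1}{t}\,
    \mathbb{E}_{q(x_t \mid x_0)}\!\left[
      -\sum_{n \,:\, x_t^{n} = \mathrm{m}}
        \log p_\theta\!\left(x_0^{n} \mid x_t\right)
    \right] dt
```

Because $\alpha_t' \le 0$ and the log-probabilities are non-positive, the integrand is non-negative. With the linear schedule the objective becomes a masked-token cross-entropy weighted by $1/t$, so lightly masked inputs, where few positions contribute, receive proportionally more weight per position.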
The frontier-scale moment for diffusion language modeling arrived in February 2025 with LLaDA, "Large Language Diffusion Models," by Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li from the Gaoling School of Artificial Intelligence at Renmin University of China and Ant Group.[^4] LLaDA is a masked discrete diffusion model whose architecture is a standard transformer that omits the causal attention mask, trained with a forward masking process and a reverse generation process that predicts masked tokens. The training and SFT pipeline closely mirrors a conventional large language model recipe, with diffusion appearing only in the loss and at sampling time.
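The forward masking process and the loss on masked positions can be sketched as follows. This is a toy illustration, not LLaDA's training code: `predict_proba` is a hypothetical stand-in for the network, and normalizing by sequence length is an assumption.

```python
import math
import random

MASK = "<mask>"


def mask_sequence(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, predict_proba):
    """Cross-entropy on masked positions only, weighted by 1/t.

    predict_proba(noisy, i, target) is a hypothetical stand-in for the model:
    it returns the probability assigned to `target` at position i.
    Dividing by len(tokens) is an assumed normalization.
    """
    nll = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            nll -= math.log(predict_proba(noisy, i, clean))
    return nll / (t * len(tokens))


def uniform_proba(noisy, i, target):
    """Toy predictor: uniform over a hypothetical 4-token vocabulary."""
    return 1.0 / 4


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
t = rng.uniform(0.05, 1.0)  # one masking ratio per training example
noisy = mask_sequence(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, t, uniform_proba)
```

Unmasked positions contribute nothing to the loss, which is why the rest of the pipeline can look like an ordinary language model recipe: only the corruption step and the weighting differ.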
LLaDA was trained from scratch on 2.3 trillion tokens using approximately 0.13 million H800 GPU-hours, in both a 1B and an 8B variant. The 8B base model performs comparably to Llama 3 8B on the standard suite of academic benchmarks. On MMLU, LLaDA 8B scores 65.9 versus 65.4 for Llama 3 8B Base; on GSM8K it scores 70.3 versus 48.7; on MATH it scores 31.4 versus 16.0; and on HumanEval it scores 35.4 versus 34.8.[^4] After supervised fine-tuning the model exhibits competitive multi-turn dialogue, instruction following, and the previously discussed reversal-curse behavior. The headline contribution of LLaDA is not any single benchmark number but the demonstration that the diffusion paradigm scales: it is the first publicly reported masked diffusion language model trained at the same scale as competitive autoregressive baselines, and the first to match them on the standard academic suite. Subsequent work extended LLaDA to vision-language inputs (LLaDA-V), to mathematical reasoning, and to instruction-tuned chat checkpoints.[^4]
Where LLaDA established the research case for diffusion at scale, Mercury from Inception Labs established the commercial case. Inception Labs was founded by three researchers with deep credentials in diffusion modeling and language: Stanford professor Stefano Ermon (co-author of the SEDD paper), UCLA professor Aditya Grover, and Cornell professor Volodymyr Kuleshov (co-author of MDLM and Block Diffusion). The company launched Mercury Coder, billed as the first commercial-scale diffusion-based large language model, on February 26, 2025. Its backing includes a $50M financing round led by Menlo Ventures with participation from Andrew Ng and Andrej Karpathy.[^5][^11]
Mercury Coder ships in Mini and Small variants. According to the technical report "Mercury: Ultra-Fast Language Models Based on Diffusion," Mercury Coder Mini achieves approximately 1,109 tokens per second on NVIDIA H100 GPUs, while Mercury Coder Small achieves approximately 737 tokens per second, in both cases roughly an order of magnitude faster than speed-optimized autoregressive frontier models of comparable quality.[^7] On the Copilot Arena coding leaderboard Mercury Coder Mini placed second on quality and first on speed at the time of launch, matching or exceeding GPT-4o Mini and Claude 3.5 Haiku while running at a fraction of their latency. The company has since released Mercury 2 and Mercury Chat for general chat, made the models available through Amazon Bedrock and Azure Foundry, and continues to publish updated technical reports.[^5][^7]
The Mercury report describes the underlying model as a transformer that predicts multiple tokens simultaneously, trained with a masked diffusion objective in the MDLM/MD4 family. It does not give exact parameter counts for public-facing variants, but the design philosophy is clear: amortize compute by removing the strict sequential bottleneck of next-token prediction, then exploit parallelism on existing GPU hardware. The result is a paradigm in which generation cost scales with the number of denoising steps rather than with the number of generated tokens, and in which latency is largely decoupled from output length over typical coding workloads.[^7]
Pure diffusion models suffer from two limitations relative to autoregressive systems: they cannot easily generate variable-length sequences, and they cannot reuse a KV cache the way a standard transformer decoder can. Block Diffusion (BD3-LM, "Block Discrete Denoising Diffusion Language Models") was proposed to address both. The paper, by Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov, was accepted as an Oral at ICLR 2025.[^6]
The core idea is to partition a token sequence into blocks and to model the joint distribution as an autoregressive product over blocks, with each block generated by a discrete diffusion process internally. When the block size equals the full sequence length, BD3-LM degenerates into a standard masked diffusion model; when the block size equals one, it degenerates into a standard autoregressive transformer. Intermediate block sizes interpolate smoothly between the two regimes, trading off sample efficiency (favored by autoregressive) against parallelism and controllability (favored by diffusion). The authors show that BD3-LM supports flexible-length generation, KV caching, and parallel sampling within each block, achieves state-of-the-art perplexity among diffusion language models, and effectively bridges the autoregressive and diffusion paradigms with a single architecture. The framework also clarified that the production-scale Mercury models could be viewed as instances of block diffusion with carefully chosen block sizes.[^6]
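The interpolation between the two regimes can be made concrete by counting forward passes for different block sizes. This is a simplified cost model rather than the BD3-LM sampler: it assumes blocks are generated left to right and that every denoising step finalizes at least one token, so a block never needs more steps than it has tokens.

```python
def partition_into_blocks(length, block_size):
    """Return (start, end) index pairs that cover range(length) in order."""
    return [
        (start, min(start + block_size, length))
        for start in range(0, length, block_size)
    ]


def forward_passes(length, block_size, steps_per_block):
    """Total denoising passes when blocks are generated one after another.

    Assumption: each step finalizes at least one token, so a block of n
    tokens needs at most n steps regardless of the step budget.
    """
    return sum(
        min(steps_per_block, end - start)
        for start, end in partition_into_blocks(length, block_size)
    )


length = 1024
autoregressive_like = forward_passes(length, block_size=1, steps_per_block=16)
full_diffusion = forward_passes(length, block_size=length, steps_per_block=16)
hybrid = forward_passes(length, block_size=32, steps_per_block=8)
```

With a block size of one the count equals the sequence length, as in autoregressive decoding; with a single block it equals the step budget, as in pure masked diffusion. Intermediate block sizes sit in between, and completed blocks can serve as a cached prefix for later ones.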
In August 2025, Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong from the University of Hong Kong and Huawei Noah's Ark Lab released Dream 7B, "Dream 7B: Diffusion Large Language Models," with open weights.[^12] Dream 7B is a 7 billion parameter discrete diffusion language model that, like LLaDA, follows the masked diffusion recipe. Two training tricks were emphasized in the report: AR-based LLM initialization, in which the diffusion model is initialized from a pretrained autoregressive checkpoint and then continued under the masked diffusion objective, and context-adaptive token-level noise rescheduling, which adjusts the per-token corruption rate based on local context to stabilize training.
The authors report that Dream 7B consistently outperforms previous open diffusion language models on general, mathematical, and coding benchmarks while remaining competitive with autoregressive Llama-class models of similar scale. A companion release, Dream-Coder 7B, specialized the model for code; its Instruct variant attains 21.4 percent pass@1 on LiveCodeBench. Both Dream 7B and Dream-Coder 7B were released with open weights on Hugging Face, making them the most capable open diffusion language models available at the time of release and providing a public counterpart to the closed-source Mercury family.[^12]
The diffusion-versus-autoregressive comparison has three dimensions worth separating: speed, quality, and reasoning.
Speed. Diffusion models can in principle generate many tokens per forward pass. In practice the number of denoising steps determines the actual latency: at a fixed quality, Mercury Coder Mini achieves roughly 1,109 tokens per second on H100, compared to perhaps 100 to 200 tokens per second for autoregressive models of similar quality.[^7] LLaDA and Dream 7B research checkpoints report similar parallel decoding behavior, although the gap narrows when speculative decoding or other autoregressive acceleration techniques are applied to baseline systems. The advantage is largest for short to medium outputs and shrinks for very long generations, where the per-step transformer cost grows with sequence length.
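The throughput figures translate into wall-clock time and speedup with simple arithmetic. The sketch below uses only the numbers quoted above; the helper names are illustrative, and the 1,000-token output length is an arbitrary example.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Wall-clock time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


def speedup(fast_tps, slow_tps):
    """Throughput ratio between two systems."""
    return fast_tps / slow_tps


MERCURY_MINI_TPS = 1109          # reported figure on H100
AR_BASELINE_TPS = (100, 200)     # rough range quoted for comparable AR models

for baseline in AR_BASELINE_TPS:
    print(
        f"vs {baseline} tok/s: {speedup(MERCURY_MINI_TPS, baseline):.1f}x faster, "
        f"{generation_seconds(1000, baseline):.1f}s vs "
        f"{generation_seconds(1000, MERCURY_MINI_TPS):.1f}s per 1,000 tokens"
    )
```

The implied speedup ranges from about 5.5x against a 200 tokens-per-second baseline to about 11x against a 100 tokens-per-second baseline, which is where the "order of magnitude" characterization comes from.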
Quality. As of early 2026, the best open and closed diffusion language models are competitive with comparably sized autoregressive models on standard academic benchmarks. LLaDA 8B matches or exceeds Llama 3 8B on MMLU, GSM8K, MATH, and HumanEval.[^4] Mercury Coder matches or exceeds GPT-4o Mini and Claude 3.5 Haiku on coding benchmarks while running 5 to 10 times faster.[^7] Dream 7B is broadly competitive with Llama-class 7B baselines.[^12] However, no diffusion language model has yet been shown to match the absolute frontier of GPT-4-class or Claude-class systems for general chat, complex reasoning, or agentic workflows.
Reasoning. Diffusion models exhibit some characteristic strengths and weaknesses on reasoning workloads. The bidirectional nature of masked diffusion gives them an advantage on reversal-style tasks and on global-constraint problems; LLaDA's reversal-curse experiments are one example.[^4] On chain-of-thought style reasoning, however, diffusion models do not naturally produce a single causal trace, which complicates the application of standard inference-time scaling techniques developed for autoregressive reasoning models. Several research lines, including DoT-Plaid (chain-of-thought in latent space) and the more recent work on integrating diffusion with state-space models such as Mamba and Mamba-2, attempt to bridge this gap, but the field has not yet produced a diffusion-native equivalent of OpenAI's o-series or DeepSeek-R1.[^8]
Several open problems remain unresolved as of 2026.
The first is scaling. LLaDA 8B is the largest publicly verified diffusion language model trained from scratch, and Dream 7B is the largest publicly available open one. Whether the diffusion paradigm continues to scale in the same way as autoregressive models at the 30B, 70B, or 400B parameter range, and whether it produces similar emergent capabilities, is empirically open. Inception Labs has not disclosed parameter counts for Mercury 2, but its capabilities suggest that frontier-scale diffusion is at least feasible at commercial scale.[^5][^7]
The second is prompting and inference-time control. Autoregressive models accept a prompt by simply prepending it to the decoded context. Masked diffusion models must instead condition by leaving prompt tokens unmasked throughout the denoising process, a technique sometimes called prefix conditioning. This works for standard chat but raises subtle questions about caching, batching, and tool use that the field is still actively investigating.[^9][^12]
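Prefix conditioning can be sketched as an iterative unmasking loop in which prompt positions are never eligible for replacement. This is a minimal illustration, not any particular model's sampler: `denoise` is a hypothetical stand-in for the network, and committing the most confident predictions first is one common heuristic assumed here.

```python
MASK = "<mask>"


def sample_with_prefix(prompt, response_len, steps, denoise):
    """Iteratively unmask a response while the prompt tokens stay fixed.

    denoise(seq) is a hypothetical stand-in for the model: it returns one
    (token, confidence) pair per position of seq.
    """
    seq = list(prompt) + [MASK] * response_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = denoise(seq)
        # Spread the remaining masked positions evenly over remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Commit the k most confident predictions; prompt indices are never
        # in `masked`, so the prefix is left untouched.
        for i in sorted(masked, key=lambda j: preds[j][1], reverse=True)[:k]:
            seq[i] = preds[i][0]
    return seq


def toy_denoise(seq):
    """Toy model: predicts a position-named token, more confident on the left."""
    return [(f"w{i}", 1.0 / (1 + i)) for i in range(len(seq))]


result = sample_with_prefix(["Q", ":"], response_len=5, steps=3, denoise=toy_denoise)
```

Because the whole sequence, prompt included, is re-encoded at every step, the prompt's attention states are not cached by default, which is one source of the caching and batching questions mentioned above.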
The third is reasoning. As discussed above, no diffusion language model has yet demonstrated frontier-level chain-of-thought reasoning. The closest analogues are work on continuous latent reasoning (related conceptually to ideas explored in latent reasoning architectures) and Block Diffusion's hybrid approach, but no diffusion-native equivalent of inference-time scaling for reasoning models has been established.[^6]
The fourth is multimodality. LLaDA-V extends LLaDA to vision-language tasks, but the integration of diffusion language modeling with the audio, vision, and tool-use modalities of modern frontier systems is still nascent. Native multimodal diffusion that combines text, image, and code diffusion in a single model remains an active research direction.
The fifth is alignment and safety. Reinforcement learning from human feedback and the other alignment techniques developed for autoregressive systems, along with architectural components such as mixture-of-experts routing, do not all transfer cleanly to the diffusion sampler. Whether the diffusion paradigm requires its own alignment toolkit, or whether existing techniques can be adapted, is an open question.
By mid-2026 the diffusion language modeling paradigm has moved from research curiosity to a recognized alternative to autoregressive transformer decoders, but it has not displaced them. The research community has produced strong open-weights releases (LLaDA, Dream 7B, MDLM, MD4), an ICML 2024 Best Paper (SEDD), and an ICLR 2025 Oral (Block Diffusion). The commercial frontier is led by Inception Labs' Mercury family, available through Amazon Bedrock, Azure Foundry, and a public API. Open-source ecosystems including Hugging Face have first-class support for masked diffusion models; ByteDance's Seed Diffusion, reportedly the fastest industry-grade diffusion LLM, is based on the MDLM recipe; and NVIDIA has shipped diffusion-based molecular generation tools (GenMol) that share architectural lineage with MDLM.[^9]
Whether diffusion will eventually replace autoregressive generation as the dominant paradigm for large language models depends on whether the speed advantages continue to compound at frontier scale, whether reasoning workflows can be made diffusion-native, and whether multimodal integration matures. As of 2026, the most plausible outcome is hybrid: block diffusion architectures and Mercury-style commercial systems suggest that the future may well consist of models that combine the parallelism of diffusion with the cache-friendly structure of autoregressive decoding, rather than a winner-take-all replacement of one paradigm by the other.[^6][^7]