Discrete diffusion language model
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,761 words
A discrete diffusion language model is a class of generative model for text that produces tokens by iteratively denoising a corrupted sequence, rather than by predicting one token at a time from left to right.[1][2] The family includes uniform, Gaussian-structured, and absorbing-state (masked) variants, with the masked formulation now dominant for language because it admits a simple cross-entropy training objective and connects to the masked language modeling literature.[1][3][4] Discrete diffusion was formalized for categorical data by Hoogeboom et al. (multinomial diffusion, 2021) and Austin et al. (D3PM, 2021), advanced by Lou et al. (SEDD, 2023) through score-entropy losses, and simplified by Sahoo et al. (MDLM, 2024) and Shi et al. (MD4, 2024).[1][2][3][4][5] In 2025, the approach moved from research to deployment with the 8-billion-parameter LLaDA model from Renmin University and Ant Group, Inception Labs' commercial Mercury Coder, and Google DeepMind's experimental Gemini Diffusion shown at Google I/O 2025.[6][7][8][9] The central engineering claim is that parallel iterative denoising allows much higher output throughput than autoregressive decoding on the same hardware, with reported figures above one thousand tokens per second on NVIDIA H100 GPUs for Mercury Coder and roughly 1,479 tokens per second for Gemini Diffusion.[7][10][11]
Standard large language models factor the joint probability of a token sequence into a product of conditional probabilities, with each token predicted given all previous tokens. This causal language model formulation, popularized by GPT-2 and downstream systems, requires L forward passes to generate a sequence of length L, since each token depends on the prefix produced before it.[12] At inference time the model maintains a KV cache of past keys and values to avoid recomputing prefix attention, but the sequential dependency between successive tokens still forces O(L) wall-clock steps for batch-one generation.[12] Techniques like speculative decoding partially mitigate that latency by drafting several tokens in parallel and verifying them, yet the generative model itself remains causal.[13]
Denoising diffusion probabilistic models, introduced for images by Ho, Jain, and Abbeel in 2020, define a forward Markov chain that gradually adds Gaussian noise to data and learn a reverse Markov chain that removes that noise.[14] The training objective is a weighted variational bound that the authors relate to denoising score matching under Langevin dynamics, and the resulting DDPM achieved state-of-the-art image fidelity on CIFAR-10 and high-resolution LSUN benchmarks.[14] Diffusion has since become the dominant paradigm for image and video synthesis. Applying the same framework to text required generalizing the forward and reverse kernels from continuous Gaussians to discrete state transitions.
Tokens are categorical, not real-valued. Adding Gaussian noise to a one-hot vector and rounding it back is lossy and ill-defined, which made early attempts at "continuous embedding diffusion" for text underperform autoregressive baselines on language modeling. Hoogeboom, Nielsen, Jaini, Forré, and Welling addressed this in February 2021 by introducing multinomial diffusion, a forward process that gradually adds categorical noise by mixing each token's one-hot distribution with a uniform distribution over the vocabulary at each step, together with a denoising network trained to invert that process.[5] The same paper also proposed Argmax Flows, which map continuous densities through an argmax to categorical samples. Multinomial diffusion outperformed dequantization-based baselines on character-level text and segmentation maps, and provided the categorical noise schedule that later work would generalize.[5]
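The mixing step described above can be sketched in a few lines. This is a toy illustration, assuming a four-token vocabulary and a hypothetical per-step noise level `beta`; the function name `multinomial_step` is made up for this example and is not taken from the paper's code.

```python
def multinomial_step(probs, beta):
    """One forward step: mix a categorical distribution with the uniform one.

    probs: current per-token distribution over the vocabulary (sums to 1).
    beta:  hypothetical noise level in [0, 1] for this step.
    """
    k = len(probs)
    return [(1.0 - beta) * p + beta / k for p in probs]

# A clean token is a one-hot distribution; repeated steps drift toward uniform.
one_hot = [1.0, 0.0, 0.0, 0.0]
after_one = multinomial_step(one_hot, beta=0.2)
after_many = one_hot
for _ in range(50):
    after_many = multinomial_step(after_many, beta=0.2)
```

After many steps the distribution is close to uniform, which is the stationary distribution the reverse network learns to denoise from.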
The non-autoregressive generation problem was also studied outside the diffusion framing. Iterative refinement methods like Mask-Predict and the BERT-derived non-autoregressive translation literature predict all positions in parallel and refine over several rounds, but rely on hand-designed schedules rather than a principled probabilistic objective.[21] D3PM and follow-up work cast iterative parallel decoding as the reverse of a tractable forward Markov chain, which gave a single variational training objective and a clean inference algorithm without bespoke heuristics.[1][3][4]
Austin, Johnson, Ho, Tarlow, and van den Berg formalized the family in Structured Denoising Diffusion Probabilistic Models (D3PM), submitted to arXiv on July 7, 2021 and published at NeurIPS 2021.[1] D3PM defines a discrete forward Markov chain x_0 to x_T over tokens with a transition matrix Q_t applied at each timestep, so the marginal q(x_t | x_0) is x_0 multiplied by the cumulative product of the Q matrices.[1] D3PM showed that uniform Q matrices (multinomial diffusion) are only one option, and that the framework admits several structured corruption processes: Gaussian-shaped Q in token-embedding space (analogous to continuous diffusion), nearest-neighbor Q based on learned embeddings, and an absorbing-state Q in which each token has a fixed probability of being replaced by a special MASK token and otherwise stays fixed.[1] The absorbing variant draws a formal connection between discrete diffusion and the BERT-style masked language model, since at large t every token has been replaced by MASK and the reverse process amounts to iterative mask-filling.[1] D3PM also introduced a hybrid loss combining the variational lower bound with an auxiliary cross-entropy term on predicted x_0, and demonstrated competitive results on CIFAR-10 and LM1B text modeling.[1]
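As a rough illustration of the transition-matrix view, the sketch below builds absorbing-state Q matrices for a toy four-symbol vocabulary and multiplies them to obtain q(x_t | x_0). The vocabulary size, the per-step probabilities in `betas`, and all function names are assumptions made for this example.

```python
VOCAB = 4   # toy vocabulary: ids 0..2 are ordinary tokens (assumption)
MASK = 3    # id 3 plays the role of the MASK token (assumption)

def absorbing_q(beta):
    """Row-stochastic Q_t: stay with prob 1 - beta, jump to MASK with prob beta."""
    q = [[0.0] * VOCAB for _ in range(VOCAB)]
    for i in range(VOCAB):
        if i == MASK:
            q[i][MASK] = 1.0            # MASK is absorbing: it never leaves
        else:
            q[i][i] = 1.0 - beta
            q[i][MASK] = beta
    return q

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(VOCAB)) for j in range(VOCAB)]
            for i in range(VOCAB)]

def cumulative_q(betas):
    """Cumulative product Q_1 Q_2 ... Q_t."""
    q_bar = [[1.0 if i == j else 0.0 for j in range(VOCAB)] for i in range(VOCAB)]
    for beta in betas:
        q_bar = matmul(q_bar, absorbing_q(beta))
    return q_bar

def marginal(x0, betas):
    """q(x_t | x_0): the row of the cumulative product selected by one-hot x_0."""
    return cumulative_q(betas)[x0]

betas = [0.1, 0.2, 0.5]            # hypothetical per-step mask probabilities
probs = marginal(0, betas)         # token 0 survives with prob 0.9 * 0.8 * 0.5
```

Because a token either survives every step or ends up as MASK, the marginal has only two nonzero entries, which is why the absorbing process is so cheap to sample at an arbitrary t.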
Lou, Meng, and Ermon proposed score entropy discrete diffusion (SEDD) on arXiv on October 25, 2023.[2] Their contribution was a new loss they called score entropy, which generalizes Hyvärinen-style score matching from continuous spaces to discrete state spaces by parameterizing ratios p_t(y) / p_t(x) of the data distribution rather than gradients of a log density (which are ill-defined on a discrete state space).[2] The training objective is then a tractable Bregman divergence between learned and true ratios, evaluated only on observed transitions, and the denoising sampler at inference uses these ratios to choose which tokens to update.[2] SEDD reduced perplexity by 25 to 75 percent relative to prior diffusion language models and outperformed a comparably sized GPT-2 on generative perplexity while requiring up to 32 times fewer network evaluations for similar generation quality.[2] The paper was presented as an oral at ICML 2024, where it received a best paper award.[15]
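To make the Bregman-divergence reading concrete, here is a toy sketch of a pointwise score-entropy term on a three-state distribution. It assumes the true probabilities are known and uses unit weights, which a real training loss cannot do (the paper works with a denoising variant instead); the function names and the example distribution are hypothetical.

```python
import math

def score_entropy_term(s, ratio):
    """Pointwise term s - r*log(s) + r*(log(r) - 1); zero exactly when s == r."""
    return s - ratio * math.log(s) + ratio * (math.log(ratio) - 1.0)

def score_entropy(scores, p, x):
    """Sum over y != x of the pointwise term, with unit weights (an assumption).

    scores[y] is the model's estimate of the ratio p[y] / p[x].
    """
    return sum(score_entropy_term(scores[y], p[y] / p[x])
               for y in range(len(p)) if y != x)

p = [0.5, 0.3, 0.2]                       # toy distribution over three states
true_ratios = [1.0, 0.6, 0.4]             # p[y] / p[0]
perturbed = [1.0, 0.9, 0.1]               # a wrong estimate of the same ratios

loss_at_truth = score_entropy(true_ratios, p, x=0)
loss_perturbed = score_entropy(perturbed, p, x=0)
```

The term is convex in `s` and nonnegative, so the loss vanishes only when every estimated ratio matches the true one, without ever needing a normalizing constant.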
Two concurrent works in June 2024 simplified the absorbing-state objective. Sahoo, Arriola, Schiff, Gokaslan, Marroquin, Chiu, Rush, and Kuleshov posted MDLM: Simple and Effective Masked Diffusion Language Models on June 11, 2024.[3] They observed that under the absorbing-state forward process the continuous-time variational lower bound reduces to a weighted mixture of standard masked-language-modeling cross-entropy losses, and that a Rao-Blackwellized estimator further reduces the gradient variance and improves stability.[3] Five days earlier, on June 6, 2024, Shi, Han, Wang, Doucet, and Titsias posted MD4: Simplified and Generalized Masked Diffusion for Discrete Data with a closely related derivation, showing that the continuous-time ELBO is a simple weighted integral of cross-entropy losses with a closed-form weight depending on the masking schedule.[4] MD4 also allowed state-dependent masking schedules. Both papers set a new state of the art among diffusion language models at the GPT-2 scale on OpenWebText, and MD4 reported 2.75 bits per dimension on CIFAR-10 and 3.40 on ImageNet 64x64 (better than autoregressive baselines of similar size).[4] The MDLM/MD4 result is the operational reason discrete diffusion can be trained today using almost the same code path as a masked encoder: the per-step loss is just a re-weighted cross-entropy loss on the masked positions.[3][4]
In the absorbing-state setting, the forward kernel is parameterized by a monotone schedule alpha_t in [0, 1] with alpha_0 = 1 and alpha_T = 0. Given a clean sequence x_0, each token is independently kept with probability alpha_t and replaced with MASK with probability 1 minus alpha_t. The marginal q(x_t | x_0) is therefore simple to sample from at any t. The reverse model is a transformer that takes the partially masked sequence x_t and the time index t and outputs a predicted x_0 distribution over the vocabulary at every masked position.[1][3][4] At training time the loss is a weighted cross-entropy on the masked tokens. At inference time the sampler starts from a fully masked sequence at t = T and iteratively unmasks tokens by sampling from the predicted x_0 and choosing how many positions to commit per step, with the remaining positions held masked or re-masked according to a schedule.[3][4][6]
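The forward corruption is simple enough to sketch directly. The snippet below is a minimal illustration, assuming the linear schedule alpha_t = 1 minus t over T and string tokens; `forward_mask`, `alpha_linear`, and the `[MASK]` literal are names chosen for this example.

```python
import random

MASK = "[MASK]"   # placeholder mask symbol (assumption)

def alpha_linear(t, T):
    """Keep probability under a linear schedule: alpha_0 = 1, alpha_T = 0."""
    return 1.0 - t / T

def forward_mask(x0, t, T, rng):
    """Sample x_t from q(x_t | x_0): keep each token independently with prob alpha_t."""
    keep = alpha_linear(t, T)
    return [tok if rng.random() < keep else MASK for tok in x0]

rng = random.Random(0)
x0 = ["the", "cat", "sat", "on", "the", "mat"]
x_clean = forward_mask(x0, t=0, T=10, rng=rng)    # nothing masked
x_half = forward_mask(x0, t=5, T=10, rng=rng)     # about half masked on average
x_noise = forward_mask(x0, t=10, T=10, rng=rng)   # everything masked
```

Since each position is corrupted independently, x_t at any t can be drawn in one pass over the sequence without simulating the intermediate steps.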
A common choice in MDLM and MD4 is the linear schedule alpha_t = 1 minus t over T, which makes the expected mask rate proportional to t, so that sampling t uniformly also samples mask rates uniformly.[3][4] The cosine schedule borrowed from the DDPM literature is also used and tends to spend more denoising steps on lightly masked sequences, where most of the per-token information is concentrated.[4] Under the linear schedule, with time rescaled to [0, 1], the continuous-time ELBO reduces to the integral from 0 to 1 of (1 / (1 minus alpha_t)) times the expected cross-entropy on masked positions, which is the closed form Sahoo et al. and Shi et al. show is equivalent to a re-weighted classical masked language modeling loss.[3][4]
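Written out, the objective described in this paragraph takes roughly the following form. The notation is an assumption made for this sketch rather than a quotation from either paper: time is rescaled to [0, 1], alpha_t' denotes the time derivative of the schedule, m is the MASK token, and p_theta is the network's predicted distribution over clean tokens.

```latex
% Negative ELBO for absorbing-state diffusion in continuous time (sketch).
\mathcal{L}
  = \int_0^1 \frac{-\alpha_t'}{1-\alpha_t}\,
    \mathbb{E}_{q(x_t \mid x_0)}\!\left[
      \sum_{i \,:\, x_t^i = m} -\log p_\theta\!\left(x_0^i \mid x_t\right)
    \right] dt,
\qquad
\alpha_t = 1 - t
\;\Rightarrow\;
\frac{-\alpha_t'}{1-\alpha_t} = \frac{1}{t} = \frac{1}{1-\alpha_t}.
```

The inner sum is an ordinary masked-language-modeling cross-entropy, so the schedule enters only through the scalar weight in front of it.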
Training a masked discrete diffusion language model uses an almost identical recipe to training a BERT-style masked encoder, with three differences.[3][4] First, the mask ratio is sampled from the noise schedule rather than fixed (BERT uses a fixed 15 percent mask rate). Second, the per-step loss is weighted by a factor derived from the masking schedule (its time derivative divided by the current mask probability), which gives high weight to lightly masked timesteps and lower weight to heavily masked ones. Third, the network uses bidirectional rather than causal attention, whether it is built as an encoder-only stack or as a decoder-only stack with the causal mask removed, since the reverse process is non-causal and benefits from attending over the whole masked sequence.[3][4] In practice LLaDA and Mercury both use a decoder-only-style transformer with non-causal attention.[6][7] Training corpora and optimizer settings closely match autoregressive LLM pipelines: LLaDA 8B Base was pre-trained on 2.3 trillion tokens with a transformer optimized for autoregressive-like throughput on the GPU.[16]
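A single training example under this recipe might look like the sketch below. It assumes the linear schedule with time in (0, 1], so the loss weight is 1 / t, and it replaces the transformer with a hypothetical `uniform_model` stand-in so that the snippet runs on its own; the lower bound `t_min` is an assumption added to keep the weight finite.

```python
import math
import random

MASK = -1   # placeholder id for the MASK token (assumption)

def uniform_model(xt, t, vocab_size):
    """Stand-in for the transformer: a uniform distribution at every position."""
    return [[1.0 / vocab_size] * vocab_size for _ in xt]

def masked_diffusion_loss(x0, xt, t, probs):
    """(1 / t) * sum over masked positions of -log p(x0_i | x_t), for alpha_t = 1 - t."""
    ce = sum(-math.log(probs[i][x0[i]]) for i, tok in enumerate(xt) if tok == MASK)
    return ce / t

def training_example(x0, vocab_size, rng, t_min=1e-3):
    """Sample a noise level, corrupt x0, and score the model on the masked positions."""
    t = rng.uniform(t_min, 1.0)                        # mask rate drawn per example
    xt = [MASK if rng.random() < t else tok for tok in x0]
    probs = uniform_model(xt, t, vocab_size)
    return masked_diffusion_loss(x0, xt, t, probs)

# Deterministic check: two masked positions at t = 0.5 with a 4-token vocabulary.
x0 = [0, 1, 2, 3]
xt = [0, MASK, 2, MASK]
fixed_loss = masked_diffusion_loss(x0, xt, 0.5, uniform_model(xt, 0.5, 4))
random_loss = training_example(x0, vocab_size=4, rng=random.Random(0))
```

Apart from the sampled mask rate and the 1 / t weight, this is the same computation a BERT-style masked encoder performs, which is the point the paragraph makes.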
Inference proceeds in K denoising steps, where K can be much smaller than the sequence length L. Each step takes the current partially masked sequence, runs one forward pass through the transformer, and produces a probability distribution over the vocabulary at every masked position.[3][4] The sampler then commits some subset of positions, typically those with the highest confidence (top-k unmasking by predicted token probability), and leaves the rest masked for subsequent steps.[6] When the schedule completes at t = 0, no MASK tokens remain. Because every step operates on the entire sequence in parallel, generating L tokens takes K forward passes regardless of L (each pass still attends over the full sequence, so per-pass cost and memory grow with L), in contrast to L forward passes for an autoregressive model with the same architecture.[7] LLaDA's official sampler uses iterative remasking, where low-confidence unmasked positions can be re-corrupted and predicted again in later steps to allow error correction.[6]
The number of denoising steps K is the principal quality-speed knob. SEDD reported similar generation quality with up to 32 times fewer network evaluations and, without temperature annealing, better generative perplexity than un-annealed GPT-2.[2] MDLM and MD4 demonstrated that the same training objective supports semi-autoregressive sampling where the sequence is generated in left-to-right blocks of variable length, recovering classical generation at one extreme and full parallel sampling at the other.[3][4] In production, Mercury Coder Mini and Mercury Coder Small reach 1,109 and 737 tokens per second respectively on NVIDIA H100 GPUs, figures that Inception Labs reports as five to ten times faster than speed-optimized autoregressive models of similar quality.[7][10]
Concretely, an inference run starts with x_T fully masked and proceeds for K steps. At each step the transformer outputs a categorical distribution p(y | x_t) over the vocabulary for every position; the sampler then selects positions to commit using either top-k confidence (commit the n_t positions with the highest predicted log-probability) or a confidence threshold (commit every position whose predicted log-probability exceeds a cutoff).[6][10] LLaDA's official implementation additionally allows remasking: positions whose predicted token has low confidence after a commit step can be re-corrupted to MASK and predicted again later, which trades step count for accuracy on hard positions.[6] The Mercury technical report describes a similar confidence-aware scheduler tuned per task and reports that aggressive scheduling (K around 32 for short outputs) is sufficient for coding tasks while longer text typically uses K in the 64 to 128 range.[10]
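The loop described above can be sketched with a toy denoiser. This is an illustration only, not LLaDA's or Mercury's sampler: `toy_denoiser` is a hypothetical stand-in that always prefers a fixed `TARGET` sequence, tokens are chosen greedily rather than sampled, the commit count per step is an even split chosen for simplicity, and remasking is left out.

```python
import math

MASK = -1
VOCAB = 10
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]   # sequence the toy denoiser prefers (assumption)

def toy_denoiser(xt):
    """Stand-in for one transformer forward pass: a distribution per position."""
    rows = []
    for i in range(len(xt)):
        conf = 0.5 + 0.05 * i                        # later positions are more confident
        row = [(1.0 - conf) / (VOCAB - 1)] * VOCAB
        row[TARGET[i]] = conf
        rows.append(row)
    return rows

def sample(length, num_steps, denoiser):
    """Start fully masked; each step commits the most confident masked positions."""
    xt = [MASK] * length
    passes = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(xt) if tok == MASK]
        if not masked:
            break
        probs = denoiser(xt)                         # one forward pass per step
        passes += 1
        best = {}
        for i in masked:
            token = max(range(VOCAB), key=probs[i].__getitem__)
            best[i] = (probs[i][token], token)       # (confidence, predicted token)
        n_commit = math.ceil(len(masked) / (num_steps - step))
        chosen = sorted(masked, key=lambda i: best[i][0], reverse=True)[:n_commit]
        for i in chosen:
            xt[i] = best[i][1]
    return xt, passes

tokens, passes = sample(len(TARGET), num_steps=4, denoiser=toy_denoiser)
```

Here eight tokens are produced in four forward passes; lowering `num_steps` trades quality for speed in a real model, although this toy denoiser is too simple to show the quality loss.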
Nie, Zhu, You, Zhang, Ou, Hu, Zhou, Lin, Wen, and Li released Large Language Diffusion Models on arXiv on February 14, 2025, with authors from Renmin University of China and Ant Group.[6] LLaDA 8B Base is a masked diffusion transformer trained from scratch on 2.3 trillion tokens, the first published 8B-parameter diffusion language model evaluated head-to-head against LLaMA 3 8B.[16] On standard zero-shot and few-shot benchmarks the base model reported 65.9 on 5-shot MMLU (versus 65.4 for LLaMA 3 8B Base and 45.9 for LLaMA 2 7B), 70.3 on 4-shot GSM8K (versus 48.7 and 13.1), 35.4 on 0-shot HumanEval (versus 34.8 and 12.8), and 31.4 on 4-shot MATH (versus 16.0 and 4.3).[16] The paper claims LLaDA is competitive with LLaMA 3 8B in in-context learning and supervised fine-tuning, and that it mitigates the reversal curse, completing a poem when shown its ending and asked for its beginning more reliably than GPT-4o.[6] LLaDA-8B-Base and LLaDA-8B-Instruct are released on Hugging Face with an official PyTorch implementation on GitHub.[17]
Inception Labs launched Mercury Coder in February 2025, billing it as the first commercially available diffusion-based LLM.[7][18] The company was founded by Stefano Ermon (Stanford), Aditya Grover (UCLA), and Volodymyr Kuleshov (Cornell); Ermon's and Kuleshov's academic groups are behind SEDD and MDLM respectively.[18] Two coding variants shipped at launch: Mercury Coder Mini at 1,109 tokens per second and Mercury Coder Small at 737 tokens per second, both on NVIDIA H100 hardware.[7] On Copilot Arena human evaluations Mercury Coder Mini tied for second place by quality among coding assistants, ahead of speed-optimized models like GPT-4o Mini and Gemini 1.5 Flash, with an average latency of about 25 milliseconds per response.[7][18] A follow-up technical report, Mercury: Ultra-Fast Language Models Based on Diffusion, appeared on arXiv on June 17, 2025, providing a detailed account of the architecture and benchmarks.[10] On the MultiPL-E coding benchmark Mercury Coder Small reached 82.0 on C++, 83.9 on JavaScript, and 82.6 on TypeScript, and on fill-in-the-middle tasks it averaged 84.8, exceeding Codestral 2501 at 82.5.[10] See the Inception Labs and Mercury articles for company and product detail.
Google DeepMind announced Gemini Diffusion at Google I/O 2025 on May 20, 2025, as an experimental text diffusion model behind a waitlist.[9][11] DeepMind reported an average sampling speed of 1,479 tokens per second (with 0.84 seconds of fixed overhead per request), four to five times faster than the company's prior Gemini Flash family while matching its coding performance.[11][19] Reported scores on the demo include 89.6 percent on HumanEval and 76.0 percent on MBPP coding tasks, and 23.3 percent on AIME 2025 math (versus 20.0 percent for Gemini 2.0 Flash-Lite).[11] As of May 2025 Gemini Diffusion was available only via the experimental waitlist; the full Gemini family otherwise comprises autoregressive transformer models.[19]
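The two reported numbers combine into an end-to-end latency estimate. The arithmetic below is a back-of-the-envelope sketch that assumes the fixed overhead is simply added to the sampling time, which the source figures suggest but do not state outright; the function names are made up for this example.

```python
def generation_seconds(num_tokens, tokens_per_second=1479.0, overhead_seconds=0.84):
    """Estimated wall-clock time: fixed overhead plus tokens / sampling speed."""
    return overhead_seconds + num_tokens / tokens_per_second

def effective_tokens_per_second(num_tokens, **kwargs):
    """Throughput a caller would observe once the overhead is included."""
    return num_tokens / generation_seconds(num_tokens, **kwargs)

short_reply = generation_seconds(100)     # overhead dominates short outputs
long_reply = generation_seconds(1479)     # 0.84 s overhead + 1.0 s of sampling
observed_rate = effective_tokens_per_second(1479)
```

Under this additive assumption the headline rate is approached only on long outputs, since the overhead is amortized over more tokens.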
The structural differences between discrete diffusion and autoregressive models are summarized in the following table.
| Property | Autoregressive LLM | Masked discrete diffusion LLM |
|---|---|---|
| Generation order | Left-to-right, one token per step | Any order, multiple tokens per step |
| Forward passes per L tokens | L | K (chosen at sample time; K << L typical) |
| Attention pattern | Causal (lower triangular) | Bidirectional over current x_t |
| Per-step loss | Next-token cross-entropy | Weighted cross-entropy on masked positions |
| KV cache reuse | Standard, large win | Limited; the masked sequence changes each step |
| Reversal queries | Often fails (reversal curse) | Trains symmetrically over positions |
Because every denoising step processes the full sequence in parallel, throughput in tokens per second scales with sequence length up to the attention bottleneck, while autoregressive systems require either prefix-shared batching or speculative decoding to amortize the per-step cost.[7][11] Conversely, the KV cache optimization that drives modern autoregressive inference is harder to apply to diffusion models because tokens later in the schedule can change the conditioning of tokens already partially decoded, so each step typically recomputes attention over the full sequence.[10] Diffusion models also do not have a natural "stop" token; they emit a sequence of fixed length L set in advance, although MDLM and the Mercury technical report describe semi-autoregressive variants that generate in fixed-length blocks and append additional blocks until an end condition is met.[3][10]
Quality per parameter at the 1B-to-8B scale is the main remaining gap. LLaDA 8B Base is competitive with LLaMA 3 8B Base on MMLU, GSM8K, HumanEval, and MATH after pre-training on 2.3 trillion tokens, but as of mid-2025 no diffusion language model has been demonstrated at the 70B-plus scale that defines frontier autoregressive systems, and scaling-law behavior for discrete diffusion is still an open research question.[6][16][20]
Four open issues remained as of mid-2025.
First, scaling laws are not well characterized. The autoregressive scaling laws of Hoffmann et al. and Kaplan et al. provided a predictable cost-to-quality tradeoff for next-token-prediction models, but no published analogue exists for masked diffusion at frontier scale.[20] LLaDA's authors reported that diffusion scaling closely tracked the autoregressive baseline they trained at smaller scales, but only up to 8B parameters and 2.3 trillion tokens.[16]
Second, per-parameter quality lags slightly. Even where LLaDA matched LLaMA 3 8B on aggregate benchmarks, individual evaluations (notably code and tool use) sometimes favored the autoregressive baseline, and Mercury Coder Mini tied for second rather than first on Copilot Arena.[7][16] SEDD likewise improved over GPT-2 but did not displace larger autoregressive baselines at matched compute.[2]
Third, inference is harder to batch and cache. Because each denoising step operates on a different masked sequence, the KV cache cannot be reused across steps in the way it is reused across tokens in autoregressive inference. Mercury and Gemini Diffusion's reported throughputs come from running the full forward pass each step on a single H100 (or comparable hardware) without long-prefix KV reuse, which constrains how the throughput advantage translates to long-context applications.[10][11]
Fourth, output length must be fixed in advance. Autoregressive models can stop at any point by emitting a special end-of-sequence token, but a masked diffusion model in its pure form denoises a buffer of length L chosen at sample time. The Mercury technical report and MDLM both discuss block-wise semi-autoregressive sampling as a workaround: the model denoises a fixed-size block, then appends another block conditioned on the committed prefix, and so on until an end token appears within a block.[3][10] This recovers variable-length output at the cost of partially serializing the inference loop.[10]
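The block-wise workaround can be sketched as an outer loop around a per-block denoiser. Everything here is illustrative: `toy_denoise_block` is a hypothetical stand-in for running the K denoising steps on one block conditioned on the committed prefix, and the scripted token list only exists so the example terminates.

```python
EOS = "<eos>"   # placeholder end-of-sequence token (assumption)

def toy_denoise_block(prefix, block_size):
    """Stand-in for denoising one block given the prefix committed so far."""
    script = ["def", "f", "(", ")", ":", "pass", EOS]
    start = len(prefix)
    block = script[start:start + block_size]
    return block + [EOS] * (block_size - len(block))   # pad short blocks with EOS

def generate(denoise_block, block_size, max_blocks):
    """Append fixed-size blocks until one of them contains the end token."""
    out = []
    for _ in range(max_blocks):
        block = denoise_block(out, block_size)
        if EOS in block:
            out.extend(block[:block.index(EOS)])       # keep tokens before the end token
            break
        out.extend(block)
    return out

tokens_b4 = generate(toy_denoise_block, block_size=4, max_blocks=8)
tokens_b3 = generate(toy_denoise_block, block_size=3, max_blocks=8)
```

Blocks are produced one after another, so the outer loop is sequential even though each block is denoised in parallel, which is the partial serialization the paragraph refers to.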
Discrete diffusion language models had three principal deployments by mid-2025.
The first is code generation, where the Mercury Coder family from Inception Labs targets latency-sensitive coding workflows (autocomplete, in-IDE edit) that benefit from the low time-to-first-token and the fill-in-the-middle ability inherent to mask-based modeling.[7][10] On fill-in-the-middle benchmarks Mercury Coder Small reported 84.8 percent average accuracy, exceeding Codestral 2501.[10] See AI code generation.
The second is infilling and controllable editing. SEDD demonstrated controllable infilling without any architectural change because the masked positions can be anywhere in the sequence and the model conditions on all visible context simultaneously, in contrast to causal left-to-right models that require special-purpose fill-in-the-middle training.[2] MDLM and MD4 inherit the same property.[3][4]
The third is research demonstration at frontier scale, exemplified by Google DeepMind's Gemini Diffusion at Google I/O 2025 and by LLaDA 8B's open release on Hugging Face.[6][11][17] These models exist mainly to demonstrate that the diffusion paradigm scales to multi-billion parameters and to gather user feedback before broader deployment.[11][19] By May 2026, the Mercury family had matured to second-generation models (Mercury 2 and Mercury Edit 2) optimized for reasoning and code editing, indicating active commercial development.[18]