# Discrete diffusion language model

> Source: https://aiwiki.ai/wiki/discrete_diffusion
> Updated: 2026-06-09
> Categories: Diffusion Models, Generative AI, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **discrete diffusion language model** is a class of generative model for text that produces tokens by iteratively denoising a corrupted sequence, rather than by predicting one token at a time from left to right.[^1][^2] The family includes uniform, Gaussian-structured, and absorbing-state (masked) variants, with the masked formulation now dominant for language because it admits a simple cross-entropy training objective and connects to the masked language modeling literature.[^1][^3][^4] Discrete diffusion was formalized for categorical data by Hoogeboom et al. (multinomial diffusion, 2021) and Austin et al. (D3PM, 2021), advanced by Lou et al. (SEDD, 2023) through score-entropy losses, and simplified by Sahoo et al. (MDLM, 2024) and Shi et al. (MD4, 2024).[^1][^2][^3][^4][^5] In 2025, the approach moved from research to deployment with the 8-billion-parameter LLaDA model from Renmin University and Ant Group, Inception Labs' commercial Mercury Coder, and Google DeepMind's experimental [Gemini Diffusion](/wiki/gemini_diffusion) shown at Google I/O 2025.[^6][^7][^8][^9] The central engineering claim is that parallel iterative denoising allows much higher output throughput than autoregressive decoding on the same hardware, with reported figures above one thousand tokens per second on NVIDIA H100 GPUs for Mercury Coder and roughly 1,479 tokens per second for Gemini Diffusion.[^7][^10][^11]

## Background

### Autoregressive language models and sequential generation

Standard [large language model](/wiki/large_language_model)s factor the joint probability of a token sequence into a product of conditional probabilities, with each token predicted given all previous tokens. This [causal language model](/wiki/causal_language_model) formulation, popularized in [gpt-2](/wiki/gpt-2) and downstream systems, requires L forward passes to generate a sequence of length L, since each token depends on the prefix produced before it.[^12] At inference time the model maintains a [kv cache](/wiki/kv_cache) of past keys and values to avoid recomputing prefix attention, but the sequential dependency between successive tokens still forces O(L) wall-clock steps for batch-one generation.[^12] Techniques like [speculative decoding](/wiki/speculative_decoding) partially mitigate that latency by speculating tokens in parallel and verifying them, yet the generative model itself remains causal.[^13]
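Written out, the factorization described above takes the following form. The subscript notation x_{<i} for the prefix before position i is an assumption of this sketch rather than notation taken from the cited report.

```latex
p_\theta(x_1, \dots, x_L) \;=\; \prod_{i=1}^{L} p_\theta\left(x_i \mid x_{<i}\right)
```

Sampling x_i requires the already sampled prefix x_{<i}, which is why generation proceeds in L sequential steps.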

### Diffusion models for continuous data

Denoising diffusion probabilistic models, introduced for images by Ho, Jain, and Abbeel in 2020, define a forward Markov chain that gradually adds Gaussian noise to data and learn a reverse Markov chain that removes that noise.[^14] The training objective is a weighted variational bound that the authors relate to denoising [score matching](/wiki/score_matching) under Langevin dynamics, and the resulting [ddpm](/wiki/ddpm) achieved state-of-the-art image fidelity on CIFAR-10 and high-resolution LSUN benchmarks.[^14] Diffusion has since become the dominant paradigm for image and video synthesis. Applying the same framework to text required generalizing the forward and reverse kernels from continuous Gaussians to discrete state transitions.

### The discrete-data challenge

Tokens are categorical, not real-valued. Adding Gaussian noise to a one-hot vector and rounding it back is lossy and ill-defined, which made early attempts at "continuous embedding diffusion" for text underperform autoregressive baselines on language modeling. Hoogeboom, Nielsen, Jaini, Forre, and Welling addressed this in February 2021 by introducing **multinomial diffusion**, a forward process that gradually adds categorical noise by mixing each token's one-hot distribution with a uniform vocabulary distribution at each step, and a denoising network trained to invert that process.[^5] The same paper also proposed Argmax Flows, which map continuous densities through an argmax to categorical samples. Multinomial diffusion outperformed dequantization-based baselines on character-level text and segmentation maps, and provided the categorical noise schedule that later work would generalize.[^5]
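The mixing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `multinomial_step` and the per-step mixing weight `beta` are hypothetical names chosen for the sketch.

```python
def multinomial_step(probs, beta):
    """One forward step: mix a categorical distribution with the uniform one.

    `beta` (an assumed name) is the fraction of probability mass that is
    replaced by uniform noise over the vocabulary at this step.
    """
    k = len(probs)
    return [(1.0 - beta) * p + beta / k for p in probs]

# One-hot distribution over a 4-symbol vocabulary; symbol 2 is the clean token.
dist = [0.0, 0.0, 1.0, 0.0]
for _ in range(3):
    dist = multinomial_step(dist, beta=0.1)

# The clean symbol stays the most likely one, but mass leaks to the others.
print(dist)
```

Each step shrinks the gap between the distribution and the uniform distribution by a factor of (1 - beta), so after enough steps the clean token is no longer recoverable from a single position.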

The non-autoregressive generation problem was also studied outside the diffusion framing. Iterative refinement methods like Mask-Predict and the [bert](/wiki/bert)-derived non-autoregressive translation literature predict all positions in parallel and refine over several rounds, but used hand-designed schedules rather than a principled probabilistic objective.[^21] D3PM and follow-up work cast iterative parallel decoding as the reverse of a tractable forward Markov chain, which gave a single variational training objective and a clean inference algorithm without bespoke heuristics.[^1][^3][^4]

## Mathematical formulation

### D3PM transition kernels

Austin, Johnson, Ho, Tarlow, and van den Berg formalized the family in **Structured Denoising Diffusion Models in Discrete State-Spaces**, the paper that introduced Discrete Denoising Diffusion Probabilistic Models (D3PM); it was submitted to arXiv on July 7, 2021 and published at NeurIPS 2021.[^1] D3PM defines a discrete forward Markov chain from x_0 to x_T over tokens with a transition matrix Q_t applied at each timestep, so the marginal q(x_t | x_0) is the one-hot row vector x_0 multiplied by the cumulative product Q_1 Q_2 ... Q_t.[^1] D3PM showed that uniform Q matrices (multinomial diffusion) are only one option, and that the framework admits several structured corruption processes: a discretized-Gaussian Q that favors transitions between nearby ordinal values (analogous to continuous diffusion), a nearest-neighbor Q based on token embeddings, and an **absorbing-state** Q in which each token has a fixed probability of being replaced by a special MASK token and otherwise stays fixed.[^1] The absorbing variant draws a formal connection between discrete diffusion and the [BERT](/wiki/bert)-style [masked language model](/wiki/masked_language_model), since at large t every token has been replaced by MASK and the reverse process amounts to iterative mask-filling.[^1] D3PM also introduced a hybrid loss combining the variational lower bound with an auxiliary [cross-entropy](/wiki/cross-entropy) term on predicted x_0, and demonstrated competitive results on CIFAR-10 and LM1B text modeling.[^1]
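To make the transition-matrix view concrete, the sketch below builds a uniform kernel and an absorbing-state kernel for a tiny vocabulary and computes q(x_t | x_0) as a one-hot row vector times the product of the kernels. The helper names (`uniform_q`, `absorbing_q`, `marginal`) and the constant per-step rate `beta` are assumptions made for illustration; D3PM itself uses time-dependent schedules.

```python
def uniform_q(k, beta):
    """Uniform kernel: keep the token with prob 1 - beta, else resample uniformly."""
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)] for i in range(k)]

def absorbing_q(k, beta, mask_id):
    """Absorbing kernel: non-MASK tokens jump to MASK with prob beta; MASK never leaves."""
    q = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if i == mask_id:
            q[i][i] = 1.0
        else:
            q[i][i] = 1.0 - beta
            q[i][mask_id] = beta
    return q

def marginal(x0, kernels):
    """q(x_t | x_0): the one-hot row for x0 multiplied by Q_1 Q_2 ... Q_t."""
    k = len(kernels[0])
    row = [1.0 if j == x0 else 0.0 for j in range(k)]
    for q in kernels:
        row = [sum(row[i] * q[i][j] for i in range(k)) for j in range(k)]
    return row

# Three real tokens plus MASK at index 3; two forward steps with beta = 0.5.
absorbed = marginal(0, [absorbing_q(4, 0.5, mask_id=3)] * 2)
mixed = marginal(0, [uniform_q(4, 0.5)] * 2)
print(absorbed)  # mass sits only on the original token and on MASK
print(mixed)     # mass spreads over every token
```

Under the absorbing kernel a position is either still its original token or MASK, which is what makes the reverse process look like iterative mask-filling.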

### Score entropy and SEDD

Lou, Meng, and Ermon proposed **score entropy discrete diffusion** (SEDD) on arXiv on October 25, 2023.[^2] Their contribution was a new loss they called **score entropy**, which generalizes Hyvarinen-style score matching from continuous spaces to discrete state spaces by parameterizing ratios p_t(y) / p_t(x) of the data distribution rather than gradients of a log density (which are ill-defined when the state space is discrete).[^2] The training objective is then a tractable Bregman divergence between learned and true ratios, evaluated only on observed transitions, and the denoising sampler at inference uses these ratios to choose which tokens to update.[^2] SEDD reduced [perplexity](/wiki/perplexity) by 25 to 75 percent relative to prior diffusion language models and outperformed a comparably sized [GPT-2](/wiki/gpt-2) on generative perplexity while requiring up to 32 times fewer network evaluations for similar generation quality.[^2] The paper received a best paper award at ICML 2024.[^15]
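The per-pair shape of such a ratio-matching loss can be illustrated numerically. The sketch below uses the form s - a log s + a (log a - 1), where `a` stands for a true ratio p_t(y) / p_t(x) and `s` for the model's estimate of it; the function name is hypothetical, and the per-pair weights and the sum over neighboring states that a full objective would include are omitted.

```python
import math

def score_entropy_term(s, a):
    """Per-pair penalty for estimating the ratio `a` with `s` (both positive).

    It is zero when s == a and strictly positive otherwise, which is the
    Bregman-divergence behavior the surrounding text refers to.
    """
    return s - a * math.log(s) + a * (math.log(a) - 1.0)

true_ratio = 2.0
for estimate in (1.0, 2.0, 3.0):
    print(estimate, score_entropy_term(estimate, true_ratio))
```

Because the penalty is defined only for positive `s`, a model trained this way is naturally parameterized to output positive ratios, for example through an exponential.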

### The masked-diffusion ELBO simplification (MDLM and MD4)

Two concurrent works in June 2024 simplified the absorbing-state objective. Sahoo, Arriola, Schiff, Gokaslan, Marroquin, Chiu, Rush, and Kuleshov posted **MDLM: Simple and Effective Masked Diffusion Language Models** on June 11, 2024.[^3] They observed that under the absorbing-state forward process the continuous-time variational lower bound reduces to a weighted mixture of standard masked-language-modeling cross-entropy losses, and that a Rao-Blackwellized estimator further reduces the gradient variance and improves stability.[^3] Five days earlier, on June 6, 2024, Shi, Han, Wang, Doucet, and Titsias posted **MD4: Simplified and Generalized Masked Diffusion for Discrete Data** with a closely related derivation, showing that the continuous-time ELBO is a simple weighted integral of cross-entropy losses with a closed-form weight depending on the masking schedule.[^4] MD4 also allowed state-dependent masking schedules. Both papers achieved new state-of-the-art among diffusion language models at the GPT-2 scale on OpenWebText, and MD4 reported 2.75 bits per dimension on CIFAR-10 and 3.40 on ImageNet 64x64 (better than autoregressive baselines of similar size).[^4] The MDLM/MD4 result is the operational reason discrete diffusion can be trained today using almost the same code path as a masked encoder: the per-step loss is just a re-weighted [cross entropy loss](/wiki/cross_entropy_loss) on the masked positions.[^3][^4]

### Forward and reverse processes in the masked formulation

In the absorbing-state setting, the forward kernel is parameterized by a monotone schedule alpha_t in [0, 1] with alpha_0 = 1 and alpha_T = 0. Given a clean sequence x_0, each token is independently kept with probability alpha_t and replaced with MASK with probability 1 minus alpha_t. The marginal q(x_t | x_0) is therefore simple to sample from at any t. The reverse model is a [transformer](/wiki/transformer) that takes the partially masked sequence x_t and the time index t and outputs a predicted x_0 distribution over the vocabulary at every masked position.[^1][^3][^4] At training time the loss is a weighted cross-entropy on the masked tokens. At inference time the sampler starts from a fully masked sequence at t = T and iteratively unmasks tokens by sampling from the predicted x_0 and choosing how many positions to commit per step, with the remaining positions held masked or re-masked according to a schedule.[^3][^4][^6]
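Sampling from the forward marginal is a one-line operation, as the following sketch shows. The `MASK` string and the function name `mask_tokens` are placeholders chosen for illustration, not identifiers from any of the cited codebases.

```python
import random

MASK = "[MASK]"  # placeholder symbol for the absorbing state

def mask_tokens(tokens, alpha_t, rng):
    """Sample x_t from q(x_t | x_0): keep each token independently with prob alpha_t."""
    return [tok if rng.random() < alpha_t else MASK for tok in tokens]

rng = random.Random(0)
x0 = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(x0, alpha_t=1.0, rng=rng))  # t = 0: nothing is masked
print(mask_tokens(x0, alpha_t=0.5, rng=rng))  # about half the positions are masked
print(mask_tokens(x0, alpha_t=0.0, rng=rng))  # t = T: everything is masked
```

Because positions are corrupted independently, x_t at any noise level can be drawn directly from x_0 without simulating the intermediate steps.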

A common choice in MDLM and MD4 is the linear schedule alpha_t = 1 minus t over T, which makes the expected mask rate proportional to t and, when t is sampled uniformly, covers all mask rates evenly.[^3][^4] The cosine schedule borrowed from [DDPM](/wiki/ddpm) is also used and tends to spend more denoising steps on lightly masked sequences, where most of the per-token information is concentrated.[^4] Under the linear schedule, with time rescaled to the unit interval so that alpha_t = 1 minus t, the continuous-time ELBO reduces to the integral from 0 to 1 of (1 / (1 minus alpha_t)), which equals 1 / t, times the expected cross-entropy on masked positions. This is the closed form that Sahoo et al. and Shi et al. show is equivalent to a re-weighted classical masked language modeling loss.[^3][^4]
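The reduction can be written out explicitly. The notation below is an assumption of this sketch: time is rescaled to the unit interval, alpha'_t denotes the time derivative of the schedule, and the sum runs over positions that are masked in x_t. The first line is the general negative ELBO in the form used by the masked-diffusion papers; the second substitutes the linear schedule.

```latex
\begin{aligned}
\mathcal{L}
  &= \mathbb{E}_{q} \int_{0}^{1} \frac{\alpha_t'}{1 - \alpha_t}
     \sum_{\ell \,:\, x_t^{\ell} = \mathrm{MASK}}
     \log p_\theta\left(x_0^{\ell} \mid x_t\right) \, dt \\
  &= \mathbb{E}_{q} \int_{0}^{1} \frac{1}{t}
     \sum_{\ell \,:\, x_t^{\ell} = \mathrm{MASK}}
     \left[ -\log p_\theta\left(x_0^{\ell} \mid x_t\right) \right] dt
  \qquad \text{for } \alpha_t = 1 - t,\ \alpha_t' = -1 .
\end{aligned}
```

The bracketed term is the ordinary cross-entropy on a masked position, so the only difference from a masked language modeling loss is the 1 / t weight and the integration over noise levels.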

## Algorithms

### Training

Training a masked discrete diffusion language model uses an almost identical recipe to training a [BERT](/wiki/bert)-style masked encoder, with three differences.[^3][^4] First, the mask ratio is sampled from the noise schedule rather than fixed (BERT uses a fixed 15 percent mask rate). Second, the per-step loss is weighted by a schedule-dependent factor (the derivative of the masking schedule divided by the current mask rate), which gives high weight to lightly masked timesteps and lower weight to heavily masked ones. Third, the network can use either an encoder-only or a decoder-only architecture as long as attention is bidirectional, since the reverse process is non-causal and benefits from seeing the whole masked sequence.[^3][^4] In practice LLaDA and Mercury both use a decoder-only [transformer](/wiki/transformer) with non-causal attention.[^6][^7] Training corpora and optimizer settings closely match autoregressive LLM pipelines: LLaDA 8B Base was pre-trained on 2.3 trillion tokens with a transformer optimized for autoregressive-like throughput on the GPU.[^16]
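A single training step under the linear schedule can be sketched as follows. Everything here is a simplified stand-in: `uniform_model` is a hypothetical placeholder for the transformer (it ignores its input and predicts a uniform distribution), `MASK = -1` is an arbitrary id, and time is assumed to be rescaled to the unit interval so that the weight is 1 / t.

```python
import math
import random

MASK = -1  # hypothetical id for the MASK token

def uniform_model(x_t, vocab_size):
    """Placeholder denoiser: a uniform distribution over the vocabulary per position."""
    return [[1.0 / vocab_size] * vocab_size for _ in x_t]

def masked_diffusion_loss(x0, t, vocab_size, rng, model=uniform_model):
    """One Monte Carlo sample of the weighted loss for a single sequence."""
    alpha_t = 1.0 - t                                    # linear schedule
    x_t = [tok if rng.random() < alpha_t else MASK for tok in x0]
    probs = model(x_t, vocab_size)
    # Cross-entropy is accumulated only on positions that were masked.
    ce = sum(-math.log(probs[i][x0[i]]) for i, tok in enumerate(x_t) if tok == MASK)
    return ce / t                                        # weight 1 / (1 - alpha_t)

rng = random.Random(0)
t = rng.uniform(0.05, 1.0)   # mask level drawn per example; lower bound avoids 1 / 0
loss = masked_diffusion_loss([3, 1, 4, 1, 5], t, vocab_size=8, rng=rng)
print(t, loss)
```

In a real pipeline the placeholder model would be replaced by a bidirectional transformer and the loss averaged over a batch, but the masking, the restriction to masked positions, and the schedule weight are the three ingredients the paragraph above lists.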

### Inference

Inference proceeds in K denoising steps, where K can be much smaller than the sequence length L. Each step takes the current partially masked sequence, runs one forward pass through the transformer, and produces a probability distribution over the vocabulary at every masked position.[^3][^4] The sampler then commits some subset of positions, typically those with the highest confidence (top-k unmasking by predicted token probability), and leaves the rest masked for subsequent steps.[^6] When the schedule completes at t = 0, no MASK tokens remain. Because every step operates on the entire sequence in parallel, generating L tokens takes K forward passes regardless of L, in contrast to L forward passes for an autoregressive model with the same architecture; the cost of each pass still grows with L, and memory limits on attention bound how long the sequence can be.[^7] LLaDA's official sampler uses iterative remasking, where low-confidence unmasked positions can be re-corrupted and predicted again in later steps to allow error correction.[^6]
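The loop described above can be sketched with a toy denoiser in place of the transformer. The names `toy_denoiser` and `sample`, the three-symbol vocabulary, and the rule of committing an equal share of the remaining positions at each step are all assumptions for illustration; the sketch commits greedy (argmax) predictions and does not remask.

```python
import math

MASK = None
VOCAB = ["a", "b", "c"]

def toy_denoiser(x_t):
    """Hypothetical stand-in for the transformer: one distribution per position."""
    out = []
    for i, _ in enumerate(x_t):
        peak = 0.5 + 0.04 * i                      # later positions are more confident
        rest = (1.0 - peak) / (len(VOCAB) - 1)
        out.append([peak if v == i % len(VOCAB) else rest for v in range(len(VOCAB))])
    return out

def sample(length, num_steps, denoiser=toy_denoiser):
    x = [MASK] * length                            # start fully masked
    passes = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(x) if tok is MASK]
        if not masked:
            break
        probs = denoiser(x)                        # one forward pass over the whole sequence
        passes += 1
        best = {i: max(range(len(VOCAB)), key=lambda v: probs[i][v]) for i in masked}
        # Commit an equal share of what is left, highest confidence first.
        n_commit = math.ceil(len(masked) / (num_steps - step))
        ranked = sorted(masked, key=lambda i: probs[i][best[i]], reverse=True)
        for i in ranked[:n_commit]:
            x[i] = VOCAB[best[i]]
    return x, passes

tokens, passes = sample(length=8, num_steps=4)
print(tokens, passes)
```

Eight tokens are produced in four forward passes here, two positions per pass, which is the K-versus-L trade the section describes.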

### Steps versus quality

The number of denoising steps K is the principal quality-speed knob. SEDD reported similar generation quality with 32 times fewer network evaluations than autoregressive sampling, and better generative perplexity than un-annealed [GPT-2](/wiki/gpt-2) (that is, GPT-2 sampled without temperature scaling).[^2] MDLM and MD4 demonstrated that the same training objective supports semi-autoregressive sampling, in which the sequence is generated in left-to-right blocks whose size is chosen at sampling time, recovering classical generation at one extreme (blocks of one token) and full parallel sampling at the other (a single block covering the whole sequence).[^3][^4] In production, Mercury Coder Mini and Mercury Coder Small reach 1,109 and 737 tokens per second respectively on NVIDIA H100 GPUs, figures that Inception Labs reports as five to ten times faster than speed-optimized autoregressive models of similar quality.[^7][^10]
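The two extremes of semi-autoregressive sampling can be checked with simple arithmetic on the number of forward passes. The function below is a back-of-the-envelope sketch; its name and parameters are hypothetical, and it counts passes only, ignoring that passes over longer sequences cost more.

```python
import math

def forward_passes(seq_len, block_size, steps_per_block):
    """Forward passes needed to fill seq_len tokens block by block."""
    return math.ceil(seq_len / block_size) * steps_per_block

print(forward_passes(1024, block_size=1, steps_per_block=1))      # token-by-token: 1024
print(forward_passes(1024, block_size=1024, steps_per_block=64))  # fully parallel: 64
print(forward_passes(1024, block_size=128, steps_per_block=16))   # in between: 128
```

Intermediate block sizes trade some parallelism for the ability to condition each block on an already committed prefix.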

### Confidence-based unmasking and remasking

Concretely, an inference run starts with x_T fully masked and proceeds for K steps. At each step the transformer outputs a categorical distribution p(y | x_t) over the vocabulary for every position; the sampler then selects positions to commit using either top-k confidence (commit the n_t positions with the highest predicted log-probability) or a confidence threshold (commit every position whose predicted log-probability exceeds a fixed cutoff).[^6][^10] LLaDA's official implementation additionally allows **remasking**: positions whose predicted token has low confidence after a commit step can be re-corrupted to MASK and predicted again later, which trades step count for accuracy on hard positions.[^6] The Mercury technical report describes a similar confidence-aware scheduler tuned per task and reports that aggressive scheduling (K around 32 for short outputs) is sufficient for coding tasks while longer text typically uses K in the 64 to 128 range.[^10]
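The remasking idea can be illustrated in isolation. The sketch below re-corrupts the committed positions with the lowest confidence; it is a simplified illustration under assumed names (`remask_lowest`, `n_remask`) and should not be read as LLaDA's exact procedure.

```python
MASK = None

def remask_lowest(tokens, confidences, n_remask):
    """Return a copy of `tokens` with the n_remask least confident committed positions masked."""
    committed = [i for i, tok in enumerate(tokens) if tok is not MASK]
    worst = set(sorted(committed, key=lambda i: confidences[i])[:n_remask])
    return [MASK if i in worst else tok for i, tok in enumerate(tokens)]

tokens = ["the", "cat", MASK, "mat"]
confidences = [0.9, 0.2, 0.0, 0.6]    # confidence recorded when each token was committed
print(remask_lowest(tokens, confidences, n_remask=1))
```

A remasked position is predicted again at the next step with more context visible, which is how the sampler gets a chance to correct an early mistake.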

## Notable models

### LLaDA (February 2025)

Nie, Zhu, You, Zhang, Ou, Hu, Zhou, Lin, Wen, and Li released **Large Language Diffusion Models** on arXiv on February 14, 2025, with authors from Renmin University of China and Ant Group.[^6] LLaDA 8B Base is a masked diffusion transformer trained from scratch on 2.3 trillion tokens, the first published 8B-parameter diffusion language model evaluated head-to-head against [llama 3](/wiki/llama_3) 8B.[^16] On standard zero-shot and few-shot benchmarks the base model reported 65.9 on 5-shot [mmlu](/wiki/mmlu) (versus 65.4 for LLaMA 3 8B Base and 45.9 for [llama 2](/wiki/llama_2) 7B), 70.3 on 4-shot [gsm8k](/wiki/gsm8k) (versus 48.7 and 13.1), 35.4 on 0-shot [humaneval](/wiki/humaneval) (versus 34.8 and 12.8), and 31.4 on 4-shot MATH (versus 16.0 and 4.3).[^16] The paper claims LLaDA is competitive with LLaMA 3 8B in in-context learning and supervised fine-tuning, and that it mitigates the reversal curse, completing a poem when shown its ending and asked for its beginning more reliably than [gpt 4o](/wiki/gpt_4o).[^6] LLaDA-8B-Base and LLaDA-8B-Instruct are released on Hugging Face with an official PyTorch implementation on GitHub.[^17]

### Mercury and Mercury Coder (February 2025)

Inception Labs launched **Mercury Coder** in February 2025, billing it as the first commercially available diffusion-based LLM.[^7][^18] The company was founded by Stefano Ermon (Stanford), Aditya Grover (UCLA), and Volodymyr Kuleshov (Cornell); Ermon and Kuleshov are co-authors of SEDD and MDLM respectively.[^18] Two coding variants shipped at launch: Mercury Coder Mini at 1,109 tokens per second and Mercury Coder Small at 737 tokens per second, both on [NVIDIA H100](/wiki/nvidia_h100) hardware.[^7] On Copilot Arena human evaluations Mercury Coder Mini tied for second place by quality among coding assistants, ahead of speed-optimized models like GPT-4o Mini and Gemini 1.5 Flash, with an average latency of about 25 milliseconds per response.[^7][^18] A follow-up technical report, **Mercury: Ultra-Fast Language Models Based on Diffusion**, appeared on arXiv on June 17, 2025, providing a detailed account of the architecture and benchmarks.[^10] On the MultiPL-E coding benchmark Mercury Coder Small reached 82.0 on C++, 83.9 on JavaScript, and 82.6 on TypeScript; on fill-in-the-middle tasks it averaged 84.8, exceeding Codestral 2501 at 82.5.[^10] See [inception labs](/wiki/inception_labs) and [mercury inception](/wiki/mercury_inception) for company and product detail.

### Gemini Diffusion (May 2025)

Google DeepMind announced **Gemini Diffusion** at Google I/O 2025 on May 20, 2025, as an experimental text diffusion model behind a waitlist.[^9][^11] DeepMind reported an average sampling speed of 1,479 tokens per second (with 0.84 seconds of fixed overhead per request), four to five times faster than the company's prior Gemini Flash family while matching its coding performance.[^11][^19] Reported scores on the demo include 89.6 percent on [humaneval](/wiki/humaneval) and 76.0 percent on [mbpp](/wiki/mbpp) coding tasks, and 23.3 percent on AIME 2025 math (vs 20.0 percent for Gemini 2.0 Flash-Lite).[^11] As of May 2025 Gemini Diffusion was available only via the experimental waitlist; the full [gemini](/wiki/gemini) family otherwise comprises autoregressive transformer models.[^19]
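The two reported numbers can be combined into a rough end-to-end estimate. The sketch below assumes, without confirmation from the source, that the fixed overhead and the sampling rate simply add; the function name is hypothetical.

```python
def estimated_latency(num_tokens, tokens_per_second=1479.0, overhead_seconds=0.84):
    """Rough response time in seconds under an additive overhead-plus-throughput model."""
    return overhead_seconds + num_tokens / tokens_per_second

for n in (100, 1000, 5000):
    print(n, round(estimated_latency(n), 2))
```

Under this assumption the fixed overhead dominates short responses, so the headline tokens-per-second figure is most representative for long outputs.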

## Comparison to autoregressive models

The structural differences between discrete diffusion and autoregressive models are summarized in the table below.

| Property | Autoregressive LLM | Masked discrete diffusion LLM |
|---|---|---|
| Generation order | Left-to-right, one token per step | Any order, multiple tokens per step |
| Forward passes per L tokens | L | K (chosen at sample time; K << L typical) |
| Attention pattern | Causal (lower triangular) | Bidirectional over current x_t |
| Per-step loss | Next-token cross-entropy | Weighted cross-entropy on masked positions |
| KV cache reuse | Standard, large win | Limited; the masked sequence changes each step |
| Reversal queries | Often fails (reversal curse) | Trains symmetrically over positions |

Because every denoising step processes the full sequence in parallel, throughput in tokens per second scales with sequence length up to the attention bottleneck, while autoregressive systems require either prefix-shared batching or [speculative decoding](/wiki/speculative_decoding) to amortize the per-step cost.[^7][^11] Conversely, the [kv cache](/wiki/kv_cache) optimization that drives modern autoregressive inference is harder to apply to diffusion models because tokens later in the schedule can change the conditioning of tokens already partially decoded, so each step typically recomputes attention over the full sequence.[^10] Diffusion models also do not have a natural "stop" token; they emit a sequence of fixed length L set in advance, although MDLM and the Mercury technical report describe semi-autoregressive variants that generate in fixed-length blocks and append additional blocks until an end condition is met.[^3][^10]

Quality per parameter at the 1B-to-8B scale is the main remaining gap. LLaDA 8B Base is competitive with [llama 3](/wiki/llama_3) 8B Base on [mmlu](/wiki/mmlu), [gsm8k](/wiki/gsm8k), [humaneval](/wiki/humaneval), and MATH after training on a comparable token budget, but as of mid-2025 no diffusion language model has been demonstrated at the 70B-plus scale that defines frontier autoregressive systems, and scaling-law behavior for discrete diffusion is still an open research question.[^6][^16][^20]

## Limitations

Four open issues remained as of mid-2025.

First, **scaling laws are not well characterized**. The autoregressive scaling laws of Hoffmann et al. and Kaplan et al. provided a predictable cost-to-quality tradeoff for next-token-prediction models, but no published analogue exists for masked diffusion at frontier scale.[^20] LLaDA's authors reported that diffusion scaling closely tracked the autoregressive baseline they trained at smaller scales, but only up to 8B parameters and 2.3 trillion tokens.[^16]

Second, **per-parameter quality lags slightly**. Even where LLaDA matched LLaMA 3 8B on aggregate benchmarks, individual evaluations (notably code and tool use) sometimes favored the autoregressive baseline, and Mercury Coder Mini tied for second rather than first on Copilot Arena.[^7][^16] SEDD likewise improved over GPT-2 but did not displace larger autoregressive baselines at matched compute.[^2]

Third, **inference is harder to batch and cache**. Because each denoising step operates on a different masked sequence, the [kv cache](/wiki/kv_cache) cannot be reused across steps in the way it is reused across tokens in autoregressive inference. Mercury and Gemini Diffusion's reported throughputs come from running the full forward pass each step on a single H100 (or comparable hardware) without long-prefix KV reuse, which constrains how the throughput advantage translates to long-context applications.[^10][^11]

Fourth, **output length must be fixed in advance**. Autoregressive models can stop at any point by emitting a special end-of-sequence token, but a masked diffusion model in its pure form denoises a buffer of length L chosen at sample time. The Mercury technical report and MDLM both discuss block-wise semi-autoregressive sampling as a workaround: the model denoises a fixed-size block, then appends another block conditioned on the committed prefix, and so on until an end token appears within a block.[^3][^10] This recovers variable-length output at the cost of partially serializing the inference loop.[^10]
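The block-wise workaround can be sketched as a simple loop. Here `toy_block_denoiser` is a hypothetical stand-in for running the full denoising schedule on one block, and the `<eos>` marker and canned token list are invented for the example.

```python
EOS = "<eos>"

def toy_block_denoiser(prefix, block_size):
    """Hypothetical stand-in: return one fully denoised block given the committed prefix."""
    words = ["def", "f", "(", ")", ":", "pass", EOS]
    start = len(prefix)
    return [words[min(start + i, len(words) - 1)] for i in range(block_size)]

def generate_blockwise(block_size, max_blocks, denoise_block=toy_block_denoiser):
    out = []
    for _ in range(max_blocks):
        block = denoise_block(out, block_size)     # denoise a fixed-size block
        if EOS in block:
            out.extend(block[:block.index(EOS)])   # keep tokens before the end marker
            break
        out.extend(block)                          # commit the block and continue
    return out

print(generate_blockwise(block_size=4, max_blocks=8))
```

Blocks are produced one after another, so the loop is sequential across blocks even though each block is denoised in parallel, which is the partial serialization noted above.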

## Adoption and applications

Discrete diffusion language models had three principal areas of application by mid-2025.

The first is **code generation**, where the Mercury Coder family from Inception Labs targets latency-sensitive coding workflows (autocomplete, in-IDE edit) that benefit from the low time-to-first-token and the fill-in-the-middle ability inherent to mask-based modeling.[^7][^10] On fill-in-the-middle benchmarks Mercury Coder Small reported 84.8 percent average accuracy, exceeding Codestral 2501.[^10] See [ai code generation](/wiki/ai_code_generation).

The second is **infilling and controllable editing**. SEDD demonstrated controllable infilling without any architectural change because the masked positions can be anywhere in the sequence and the model conditions on all visible context simultaneously, in contrast to causal left-to-right models that require special-purpose fill-in-the-middle training.[^2] MDLM and MD4 inherit the same property.[^3][^4]

The third is **research demonstration at frontier scale**, exemplified by Google DeepMind's Gemini Diffusion at Google I/O 2025 and by LLaDA 8B's open release on Hugging Face.[^6][^11][^17] These models exist mainly to demonstrate that the diffusion paradigm scales to multi-billion parameters and to gather user feedback before broader deployment.[^11][^19] By May 2026, the Mercury family had matured to second-generation models (Mercury 2 and Mercury Edit 2) optimized for reasoning and code editing, indicating active commercial development.[^18]

## See also

- [diffusion model](/wiki/diffusion_model)
- [ddpm](/wiki/ddpm)
- [masked language model](/wiki/masked_language_model)
- [bert](/wiki/bert)
- [causal language model](/wiki/causal_language_model)
- [transformer](/wiki/transformer)
- [gpt-2](/wiki/gpt-2)
- [llama 3](/wiki/llama_3)
- [gemini](/wiki/gemini)
- [inception labs](/wiki/inception_labs)
- [mercury inception](/wiki/mercury_inception)
- [score matching](/wiki/score_matching)
- [cross entropy loss](/wiki/cross_entropy_loss)
- [diffusion language models](/wiki/diffusion_language_models)
- [speculative decoding](/wiki/speculative_decoding)
- [humaneval](/wiki/humaneval)
- [mbpp](/wiki/mbpp)
- [mmlu](/wiki/mmlu)
- [gsm8k](/wiki/gsm8k)

## References

[^1]: Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg, "Structured Denoising Diffusion Models in Discrete State-Spaces", arXiv, 2021-07-07. https://arxiv.org/abs/2107.03006. Accessed 2026-05-26.

[^2]: Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution", arXiv, 2023-10-25. https://arxiv.org/abs/2310.16834. Accessed 2026-05-26.

[^3]: Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, Volodymyr Kuleshov, "Simple and Effective Masked Diffusion Language Models", arXiv, 2024-06-11. https://arxiv.org/abs/2406.07524. Accessed 2026-05-26.

[^4]: Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias, "Simplified and Generalized Masked Diffusion for Discrete Data", arXiv, 2024-06-06. https://arxiv.org/abs/2406.04329. Accessed 2026-05-26.

[^5]: Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forre, Max Welling, "Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions", arXiv, 2021-02-10. https://arxiv.org/abs/2102.05379. Accessed 2026-05-26.

[^6]: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "Large Language Diffusion Models", arXiv, 2025-02-14. https://arxiv.org/abs/2502.09992. Accessed 2026-05-26.

[^7]: Inception Labs, "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model", Inception Labs blog, 2025-02-26. https://www.inceptionlabs.ai/blog/introducing-mercury. Accessed 2026-05-26.

[^8]: Maginative, "Inception Labs Launches Mercury, the First Commercial Diffusion-Based Language Model", Maginative, 2025-02-26. https://www.maginative.com/article/inception-labs-launches-mercury-the-first-commercial-diffusion-based-language-model/. Accessed 2026-05-26.

[^9]: Google, "Google I/O 2025: 100 things Google announced", Google Keyword blog, 2025-05-21. https://blog.google/technology/ai/google-io-2025-all-our-announcements/. Accessed 2026-05-26.

[^10]: Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov, "Mercury: Ultra-Fast Language Models Based on Diffusion", arXiv, 2025-06-17. https://arxiv.org/abs/2506.17298. Accessed 2026-05-26.

[^11]: Google DeepMind, "Gemini Diffusion", DeepMind models page. https://deepmind.google/models/gemini-diffusion/. Accessed 2026-05-26.

[^12]: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, "Language Models are Unsupervised Multitask Learners", OpenAI technical report (GPT-2), 2019-02-14. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Accessed 2026-05-26.

[^13]: Yaniv Leviathan, Matan Kalman, Yossi Matias, "Fast Inference from Transformers via Speculative Decoding", arXiv, 2022-11-30. https://arxiv.org/abs/2211.17192. Accessed 2026-05-26.

[^14]: Jonathan Ho, Ajay Jain, Pieter Abbeel, "Denoising Diffusion Probabilistic Models", arXiv, 2020-06-19. https://arxiv.org/abs/2006.11239. Accessed 2026-05-26.

[^15]: International Conference on Machine Learning, "ICML 2024 Outstanding Paper Awards", ICML announcement, 2024-07-23. https://icml.cc/virtual/2024/awards_detail. Accessed 2026-05-26.

[^16]: Shen Nie et al., "Large Language Diffusion Models", arXiv:2502.09992 HTML version, 2025-02-14. https://arxiv.org/html/2502.09992. Accessed 2026-05-26.

[^17]: ML-GSAI, "LLaDA official PyTorch implementation and model release", GitHub repository, 2025-02-14. https://github.com/ML-GSAI/LLaDA. Accessed 2026-05-26.

[^18]: Inception Labs, "Inception company page", Inception Labs, 2025. https://www.inceptionlabs.ai/. Accessed 2026-05-26.

[^19]: Sharon Goldman, "At Google I/O, Gemini Diffusion's speed and coding skills hint at the next phase of the AI model wars", Fortune, 2025-05-21. https://fortune.com/2025/05/21/gemini-diffusion-google-io-sleeper-hit-blazing-speed-ai-model-wars/. Accessed 2026-05-26.

[^20]: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al., "Training Compute-Optimal Large Language Models" (Chinchilla scaling laws), arXiv, 2022-03-29. https://arxiv.org/abs/2203.15556. Accessed 2026-05-26.

[^21]: Marjan Ghazvininejad, Omer Levy, Yinhan Liu, Luke Zettlemoyer, "Mask-Predict: Parallel Decoding of Conditional Masked Language Models", EMNLP-IJCNLP 2019, arXiv, 2019-04-19. https://arxiv.org/abs/1904.09324. Accessed 2026-05-26.

