LLaDA (Large Language Diffusion)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,971 words
LLaDA (Large Language Diffusion with mAsking) is a family of non-autoregressive large language models that generate text by iteratively denoising a sequence of mask tokens rather than predicting tokens left to right. Introduced by researchers at Renmin University of China and Ant Group in the February 2025 paper "Large Language Diffusion Models" (arXiv:2502.09992), LLaDA showed that a masked discrete diffusion model trained from scratch at the 8B parameter scale on roughly 2.3 trillion tokens can rival a comparably sized autoregressive baseline such as Llama 3 8B on standard general, mathematical, and coding benchmarks.[^1][^2] The work has been cited as the first credible large-scale demonstration that the core capabilities of LLMs (scaling, in-context learning, instruction following, compression) are not inherent to the autoregressive objective and can emerge from a bidirectional generative model.[^1][^3] LLaDA has since been extended into LLaDA 1.5 (with preference optimization), the multimodal LLaDA-V, a native mixture-of-experts variant LLaDA-MoE, and 100B-parameter LLaDA 2.0 models, and it is widely considered the open analogue to commercial diffusion LLMs such as Inception Labs' Mercury.[^4][^5][^6][^7]
Diffusion models were initially developed for continuous data such as images and audio and were popularized through denoising diffusion probabilistic models (DDPM) and related score-based formulations.[^8] Extending diffusion to discrete sequences proved difficult: while masked language modeling (as in BERT) is structurally similar to a single denoising step, building a fully generative, iterative diffusion process for text required new theoretical tools. The Score Entropy Discrete Diffusion (SEDD) framework, presented by Aaron Lou, Chenlin Meng, and Stefano Ermon at ICML 2024, introduced a score entropy loss for discrete data and demonstrated that a discrete diffusion model could outperform comparably sized GPT-2 baselines on language modeling perplexity.[^9] SEDD established proof of concept at small scales but did not approach modern LLM benchmarks.
Other relevant precursors included absorbing-state diffusion models, where the forward process gradually replaces tokens with a special [MASK] symbol; such formulations are equivalent to a generalized masked language model with a random mask ratio and have been shown to admit a particularly simple variational training objective.[^1] LLaDA built on this masked-diffusion view rather than on the score matching approach of SEDD, and combined it with the modern Transformer architecture and the pretraining-plus-supervised fine-tuning paradigm used by Llama 3 and related LLMs.[^1]
The LLaDA project is hosted at the GitHub repository ML-GSAI/LLaDA, maintained by the Machine Learning Group at the Gaoling School of Artificial Intelligence (Renmin University of China). The release timeline through May 2026 is:
| Date | Release | Notes |
|---|---|---|
| 2025-02-14 | LLaDA 8B Base and 8B Instruct, arXiv v1[^1] | First 8B-scale masked-diffusion LLM trained from scratch on ~2.3T tokens. |
| 2025-02-18 | arXiv v2 of "Large Language Diffusion Models"[^1] | Revised benchmark tables and additional ablations. |
| 2025-05-22 | LLaDA-V paper (arXiv:2505.16933)[^5] | Visual instruction tuning on top of the LLaDA 8B backbone. |
| 2025-05-25 | LLaDA 1.5 paper (arXiv:2505.19223)[^4] | Variance-Reduced Preference Optimization (VRPO) for preference alignment. |
| 2025-09 | LLaDA-MoE 7B-A1B announced at the Bund Conference[^6] | First native Mixture-of-Experts diffusion language model. |
| 2025-10-18 | arXiv v3 of the LLaDA paper[^1] | Camera-ready version for NeurIPS 2025. |
| 2025-12 | LLaDA 2.0 mini and LLaDA 2.0 flash (flash: 100B params)[^7] | Joint work by Ant Group, Renmin University, Zhejiang University, and Westlake University. |
| 2025-12 | LLaDA 2.0 added as a day-0 supported model in SGLang via the LMSYS dLLM framework[^10] | First production-grade serving framework support. |
The original LLaDA paper was accepted as a NeurIPS 2025 Oral, presented in San Diego in December 2025, by which time it had accumulated more than 100 citations.[^11][^3]
LLaDA defines a continuous-time forward process indexed by t in [0, 1]. Starting from a clean token sequence x_0, at time t each token is independently replaced by a special mask token M with probability t and retained with probability 1 - t. At t = 0 the sequence is fully clean, and at t = 1 the sequence is fully masked.[^1][^2] Because tokens are corrupted independently and the only corruption is replacement with the absorbing mask state, the marginal distribution at any time t is fully described by the mask ratio, and the process admits a particularly tractable variational lower bound.[^1]
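The forward process described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: `MASK_ID`, `forward_mask`, and the toy token ids are assumptions introduced here.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask token M; real models reserve a vocabulary id


def forward_mask(x0: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Return x_t: each token of x0 is independently replaced by MASK_ID with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    is_masked = rng.random(x0.shape) < t  # one independent coin flip per position
    return np.where(is_masked, MASK_ID, x0)


rng = np.random.default_rng(0)
x0 = rng.integers(0, 1000, size=10_000)  # toy "clean" sequence of token ids

for t in (0.0, 0.25, 0.5, 1.0):
    xt = forward_mask(x0, t, rng)
    print(f"t={t:.2f}  observed mask ratio={np.mean(xt == MASK_ID):.3f}")
```

The observed mask ratio should track t closely, which is the sense in which the corruption level at time t is summarized by the mask ratio alone.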
LLaDA trains a Transformer "mask predictor" p_theta(x_0 | x_t) that, conditioned on a partially masked sequence x_t, predicts the original tokens at the masked positions. The training loss is a reweighted cross-entropy on the masked positions only:
L(theta) = - E_{t, x_0, x_t} [ (1/t) * sum_i 1[x_t^i = M] * log p_theta(x_0^i | x_t) ]
The factor 1/t accounts for the random mask ratio and makes the objective an upper bound on the negative log-likelihood of the model distribution, so minimizing it is a principled likelihood-based training objective rather than the conventional masked language modeling heuristic with a fixed mask rate.[^1][^2] Pretraining samples t uniformly from [0, 1]; supervised fine-tuning constrains masking to response tokens only so that prompts remain visible.[^1]
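A one-sample Monte Carlo estimate of this loss can be sketched as follows. The sketch is illustrative only: `MASK_ID`, `VOCAB_SIZE`, `uniform_predictor` (a stand-in for the Transformer mask predictor), and the small `eps` floor on t are assumptions introduced here, and the loss is left as an unnormalized sum over positions to match the formula above.

```python
import numpy as np

MASK_ID = 0     # hypothetical: id 0 is reserved for the mask token M
VOCAB_SIZE = 8  # toy vocabulary; ids 1..7 are ordinary tokens


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_diffusion_loss(x0, predictor, rng, eps=1e-3):
    """One-sample estimate of the 1/t-weighted cross-entropy on masked positions."""
    t = rng.uniform(eps, 1.0)                 # eps (assumption) avoids dividing by zero
    is_masked = rng.random(x0.shape) < t      # forward process: mask with probability t
    xt = np.where(is_masked, MASK_ID, x0)
    log_probs = log_softmax(predictor(xt))    # shape (len(x0), VOCAB_SIZE)
    token_log_lik = log_probs[np.arange(x0.shape[0]), x0]  # log p(x_0^i | x_t)
    return -(token_log_lik * is_masked).sum() / t


def uniform_predictor(xt):
    """Stand-in mask predictor that assigns equal probability to every token."""
    return np.zeros((xt.shape[0], VOCAB_SIZE))


rng = np.random.default_rng(0)
x0 = rng.integers(1, VOCAB_SIZE, size=64)
losses = [masked_diffusion_loss(x0, uniform_predictor, rng) for _ in range(4000)]
# For a uniform predictor every masked token costs log(VOCAB_SIZE), and on average
# t * len(x0) tokens are masked, so the 1/t weight gives len(x0) * log(VOCAB_SIZE).
print(np.mean(losses), len(x0) * np.log(VOCAB_SIZE))
```

The two printed numbers should be close, which illustrates how the 1/t factor makes heavily and lightly masked samples contribute on the same scale.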
The mask predictor is a Transformer that follows the layout of decoder-only LLMs but uses bidirectional self-attention (no causal mask). For the 8B variant the authors deliberately chose vanilla multi-head attention rather than grouped-query attention, reduced the feed-forward dimension to keep the total parameter count comparable to LLaMA 3 8B, and used a custom tokenizer adapted to their training corpus. Pretraining used a fixed sequence length of 4096 tokens.[^2] Because the model is a mask-based discrete diffusion model rather than a continuous Gaussian diffusion model, prior theoretical work shows that explicit time conditioning is not required, and LLaDA omits it.[^12]
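Generation begins from a fully masked sequence at t = 1 and proceeds backward in N discrete steps to t = 0. At each step the mask predictor predicts a distribution over the original tokens at every masked position simultaneously. When moving from time t to the next time step s < t, a fraction s/t of the predicted tokens is remasked in expectation, so a fraction 1 - s/t of the currently masked positions is committed and the mask ratio stays aligned with the forward process. Two remasking strategies are supported: random remasking, which chooses the positions to remask uniformly at random, and low-confidence remasking, which remasks the positions whose predictions have the lowest confidence.[^1][^2]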
For practical use the authors also support a semi-autoregressive block sampler: the response is divided into fixed-size blocks generated left to right, with diffusion denoising inside each block. This restores some of the locality benefits of autoregressive decoding while preserving bidirectional modeling within a block.[^2][^13] With this scheme and dLLM-Cache acceleration, LLaDA 8B Base has been reported to reach 54.75 tokens/second on standard benchmarks, versus 33.79 tokens/second for LLaMA 3 8B, at comparable accuracy.[^13]
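The block-wise sampling loop can be sketched as below. This is a toy illustration under stated assumptions, not the repository's `generate()` implementation: the predictor returns random logits, the linear commit schedule is a simplification chosen here, and every name (`MASK_ID`, `make_toy_predictor`, `denoise_block`, `semi_autoregressive_generate`) is hypothetical. Within a block, each pass keeps the highest-confidence predictions and leaves the rest masked, which is the confidence-based way of choosing what to remask.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the mask token


def make_toy_predictor(vocab_size, seed=0):
    """Stand-in for the mask predictor: random logits that never predict MASK_ID."""
    rng = np.random.default_rng(seed)

    def predictor(x):
        logits = rng.normal(size=(x.shape[0], vocab_size))
        logits[:, MASK_ID] = -np.inf
        return logits

    return predictor


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def denoise_block(x, start, end, steps, predictor):
    """Fill x[start:end] in place over `steps` passes, keeping the most confident predictions."""
    block_len = end - start
    for step in range(1, steps + 1):
        masked = np.flatnonzero(x[start:end] == MASK_ID) + start
        probs = softmax(predictor(x))  # one bidirectional pass over the whole sequence
        pred, conf = probs.argmax(axis=-1), probs.max(axis=-1)
        # Linear schedule (assumption): after pass k, block_len * k // steps tokens are committed.
        n_commit = (block_len * step) // steps - (block_len - masked.size)
        if n_commit <= 0:
            continue
        keep = masked[np.argsort(-conf[masked])[:n_commit]]
        x[keep] = pred[keep]  # commit these; all other masked positions stay masked
    return x


def semi_autoregressive_generate(prompt, gen_len, block_len, steps_per_block, predictor):
    """Generate gen_len tokens block by block, left to right, denoising inside each block."""
    x = np.concatenate([prompt, np.full(gen_len, MASK_ID)])
    for start in range(len(prompt), len(x), block_len):
        end = min(start + block_len, len(x))
        denoise_block(x, start, end, steps_per_block, predictor)
    return x


prompt = np.array([5, 9, 2])
out = semi_autoregressive_generate(
    prompt, gen_len=8, block_len=4, steps_per_block=2,
    predictor=make_toy_predictor(vocab_size=16),
)
print(out)
```

Setting `block_len` equal to `gen_len` recovers plain diffusion sampling over the whole response, while `block_len=1` degenerates to one-token-at-a-time decoding.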
LLaDA 8B Base was pretrained on approximately 2.3 trillion tokens of filtered web data using roughly 0.13 million H800 GPU hours and a compute budget of about 10^23 FLOPs. The data mix and quality filters were guided by scaled-down autoregressive baselines. Training experienced one crash at the 1.2T-token mark, which was mitigated by checkpoint resumption and a learning-rate reduction from 4e-4 to 1e-4.[^2]
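As a rough consistency check on these figures, the common rule of thumb of about 6 FLOPs per parameter per training token for dense Transformers can be applied. The rule itself, the round value of 8e9 parameters, and the variable names are assumptions introduced here, not numbers taken from the paper.

```python
PARAMS = 8e9        # "8B" taken as a round 8e9 (assumption)
TOKENS = 2.3e12     # ~2.3 trillion pretraining tokens
GPU_HOURS = 0.13e6  # ~0.13 million H800 GPU hours

# Rule of thumb (assumption): training cost ~ 6 * parameters * tokens.
train_flops = 6 * PARAMS * TOKENS
implied_flops_per_gpu_second = train_flops / (GPU_HOURS * 3600)

print(f"estimated training compute: {train_flops:.2e} FLOPs")
print(f"implied sustained throughput: {implied_flops_per_gpu_second:.2e} FLOP/s per GPU")
```

The estimate comes out at roughly 1.1e23 FLOPs, which agrees with the stated budget of about 10^23 FLOPs.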
Supervised fine-tuning used about 4.5 million instruction-response pairs covering code, math, and dialogue. During SFT, mask sampling is restricted to response tokens so the model learns conditional generation given a fixed prompt.[^2]
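Response-only masking differs from pretraining only in which positions are eligible for corruption. A minimal sketch, with `MASK_ID` and `sft_mask` as hypothetical names introduced here:

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the mask token


def sft_mask(prompt, response, t, rng):
    """Mask response tokens with probability t; prompt tokens always stay visible."""
    x0 = np.concatenate([prompt, response])
    maskable = np.arange(x0.shape[0]) >= prompt.shape[0]  # True only on response positions
    is_masked = maskable & (rng.random(x0.shape) < t)
    xt = np.where(is_masked, MASK_ID, x0)
    return xt, is_masked  # is_masked also selects the positions that enter the loss


rng = np.random.default_rng(0)
prompt = np.array([11, 12, 13])
response = np.array([21, 22, 23, 24, 25])
xt, loss_positions = sft_mask(prompt, response, t=0.6, rng=rng)
print(xt)
print(loss_positions)
```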
The two original release artifacts, hosted at GSAI-ML/LLaDA-8B-Base and GSAI-ML/LLaDA-8B-Instruct on Hugging Face, are MIT-licensed and ship as standard safetensors checkpoints loadable through the Hugging Face Transformers library with custom modeling code. The repository provides a generate() helper, a get_log_likelihood() utility for conditional likelihood evaluation, and chat.py / app.py demos.[^14][^15]
LLaDA 1.5, released in May 2025, applies preference optimization to the 8B Instruct checkpoint. The challenge addressed by the paper is that aligning a masked diffusion model with DPO-style methods requires likelihood ratios, which can only be estimated through the high-variance ELBO over random mask samples. The authors introduce Variance-Reduced Preference Optimization (VRPO), which analyzes the bias and variance of the preference optimization estimator and proposes two unbiased variance-reduction techniques: an optimal allocation of the Monte Carlo sampling budget across diffusion timesteps, and antithetic sampling, which shares the sampled timesteps and mask patterns between the ELBO estimates of the policy and the reference model. Applying VRPO to LLaDA 8B Instruct yields reported gains of +4.7 on GSM8K, +3.0 on HumanEval, +1.8 on MBPP, +4.0 on IFEval, and +4.3 on Arena-Hard over the SFT-only baseline.[^4]
LLaDA-V (arXiv:2505.16933, May 2025) extends LLaDA into a multimodal model by adding a vision encoder and an MLP projector that maps visual features into the language embedding space, in the LLaVA tradition. Training uses visual instruction tuning data while keeping the masked-diffusion training objective for the language backbone. Despite LLaDA's textual benchmarks lagging Qwen2-7B, LLaDA-V is reported to be highly competitive with LLaMA3-V and to narrow the gap to Qwen2-VL, and is presented as the state-of-the-art purely diffusion-based multimodal LLM at the time of publication.[^5]
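The projector step can be illustrated with a toy NumPy sketch. Everything concrete here is an assumption for illustration: the two-layer MLP with GELU follows common LLaVA-style practice rather than a confirmed LLaDA-V configuration, and the dimensions, weights, and function names are hypothetical.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def project_visual_features(feats, w1, b1, w2, b2):
    """Two-layer MLP mapping (n_patches, d_vision) features to (n_patches, d_model)."""
    return gelu(feats @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(0)
d_vision, d_model, n_patches, n_text = 32, 64, 9, 5  # toy sizes (assumptions)

w1, b1 = rng.normal(scale=0.02, size=(d_vision, d_model)), np.zeros(d_model)
w2, b2 = rng.normal(scale=0.02, size=(d_model, d_model)), np.zeros(d_model)

visual_feats = rng.normal(size=(n_patches, d_vision))  # stand-in for vision-encoder output
text_embeds = rng.normal(size=(n_text, d_model))       # stand-in for token embeddings

visual_tokens = project_visual_features(visual_feats, w1, b1, w2, b2)
backbone_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(backbone_input.shape)  # (14, 64)
```

The language backbone then sees projected visual tokens and text embeddings as one sequence, with masking applied on the text side only in this sketch.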
In September 2025 Ant Group and Renmin University announced LLaDA-MoE-7B-A1B, described as the first native Mixture of Experts diffusion language model, with roughly 1B active parameters per token out of a 7B total.[^6] This was followed in late 2025 by LLaDA 2.0 mini and LLaDA 2.0 flash, the latter a 100B-parameter model jointly developed by Ant Group, Renmin University, Zhejiang University, and Westlake University. LMSYS announced day-0 support for the LLaDA 2.0 series in the SGLang serving framework via a new block-diffusion runtime.[^7][^10]
The headline experimental comparison in the original paper is between LLaDA 8B Base and Llama 3 8B Base, both evaluated zero-shot or few-shot on standard pretraining benchmarks. Selected results reported in the paper:[^1][^2]
| Benchmark | LLaDA 8B Base | LLaMA3 8B Base |
|---|---|---|
| MMLU | 65.9 | 65.4 |
| CMMLU | 69.9 | 50.7 |
| C-Eval | 70.5 | 51.7 |
| ARC-C | 47.9 | 53.1 |
| PIQA | 74.3 | 80.6 |
| GSM8K | 70.3 | 48.7 |
| HumanEval | 35.4 | 34.8 |
| MBPP | 40.0 | 48.8 |
LLaDA 8B Base is roughly on par with the baseline on MMLU and HumanEval, clearly ahead on the Chinese benchmarks (CMMLU, C-Eval) and on elementary math (GSM8K), and behind on ARC-C, PIQA, and MBPP. The authors interpret these results as evidence that scaling, in-context learning, and instruction following can emerge from a bidirectional masked-diffusion objective rather than uniquely from next-token prediction.[^1]
After SFT, LLaDA 8B Instruct trails LLaMA 3 8B Instruct on many alignment benchmarks. The gap is attributed to post-training: the LLaMA 3 release uses RLHF, whereas LLaDA 8B Instruct received SFT only. The diffusion model nevertheless produces fluent multi-turn dialogue in qualitative case studies.[^1][^2] The LLaDA 1.5 paper subsequently closed part of this gap using VRPO.[^4]
A motivating empirical phenomenon for LLaDA is the so-called reversal curse: autoregressive LLMs trained on facts of the form "A is B" frequently fail to answer "Who is B?" because next-token prediction induces a strong left-to-right inductive bias. The LLaDA paper evaluates this on a Chinese poem completion task with 496 pairs of adjacent lines, querying each model either with the previous line (forward) or with the following line (reversal). The reported accuracies are:[^1][^16]
LLaDA achieves substantially more symmetric forward and reverse accuracy than either autoregressive baseline, and outperforms GPT-4o on the reversal direction. The authors and subsequent commentary attribute this to LLaDA treating all positions uniformly during training, without a left-to-right directional bias.[^1][^16]
LLaDA's most cited contribution is conceptual: it provides the first large-scale, instruction-tuned demonstration that a non-autoregressive LLM can approach the in-context-learning quality of a comparable AR baseline, weakening the long-standing argument that the autoregressive factorization is necessary for general LLM capability.[^1][^3] Downstream consequences include:
LLaDA is also frequently cited as the open academic counterpart to commercial diffusion LLMs. Inception Labs, co-founded by Stefano Ermon, released Mercury and Mercury Coder in early 2025 as "the world's first commercial-scale diffusion large language model", marketing throughputs of 1109 tokens/second (Mercury Coder Mini) and 737 tokens/second (Mercury Coder Small) on NVIDIA H100 GPUs while remaining competitive with speed-optimized models such as GPT-4o Mini and Claude 3.5 Haiku on Copilot Arena.[^17] Mercury 2, launched in February 2026, advertises 5x speedups over leading speed-optimized LLMs.[^18] Mercury's underlying technique is not openly published in the same detail as LLaDA, but Inception founders co-invented earlier diffusion-model and discrete-sampling techniques, and the design space (block-parallel iterative denoising) closely parallels the LLaDA family.[^17]
The original LLaDA paper and follow-up commentary identify several limitations that remain open as of mid-2026: