LLaDA (Large Language Diffusion)

Updated Jun 7, 2026

LLaDA (Large Language Diffusion with mAsking) is a family of non-autoregressive large language models that generate text by iteratively denoising a sequence of mask tokens rather than predicting tokens left to right. Introduced by researchers at Renmin University of China and Ant Group in the February 2025 paper "Large Language Diffusion Models" (arXiv:2502.09992), LLaDA showed that a masked discrete diffusion model trained from scratch at the 8B parameter scale on roughly 2.3 trillion tokens can rival a comparably sized autoregressive baseline such as Llama 3 8B on standard general, mathematical, and coding benchmarks.^[1]^[2] The work has been cited as the first credible large-scale demonstration that the core capabilities of LLMs (scaling, in-context learning, instruction following, compression) are not inherent to the autoregressive objective and can emerge from a bidirectional generative model.^[1]^[3] LLaDA has since been extended into LLaDA 1.5 (with preference optimization), the multimodal LLaDA-V, a native mixture-of-experts variant LLaDA-MoE, and the LLaDA 2.0 series, whose largest model (LLaDA 2.0 flash) has 100B parameters. It is frequently cited as the open analogue to commercial diffusion LLMs such as Inception Labs' Mercury.^[4]^[5]^[6]^[7]

Background

Diffusion models were initially developed for continuous data such as images and audio and were popularized through denoising diffusion probabilistic models (DDPM) and related score-based formulations.^[8] Extending diffusion to discrete sequences proved difficult: while masked language modeling (as in BERT) is structurally similar to a single denoising step, building a fully generative, iterative diffusion process for text required new theoretical tools. The Score Entropy Discrete Diffusion (SEDD) framework, presented by Aaron Lou, Chenlin Meng, and Stefano Ermon at ICML 2024, introduced a score entropy loss for discrete data and demonstrated that a discrete diffusion model could outperform comparably sized GPT-2 baselines on language modeling perplexity.^[9] SEDD established proof of concept at small scales but did not approach modern LLM benchmarks.

Other relevant precursors included absorbing-state diffusion models, where the forward process gradually replaces tokens with a special [MASK] symbol; such formulations are equivalent to a generalized masked language model with a random mask ratio and have been shown to admit a particularly simple variational training objective.^[1] LLaDA built on this masked-diffusion view rather than on the score matching approach of SEDD, and combined it with the modern Transformer architecture and the pretraining-plus-supervised fine-tuning paradigm used by Llama 3 and related LLMs.^[1]

History and Releases

The LLaDA project is hosted at the GitHub repository ML-GSAI/LLaDA, maintained by the Machine Learning Group at the Gaoling School of Artificial Intelligence (Renmin University of China). The release timeline through May 2026 is:

Date	Release	Notes
2025-02-14	LLaDA 8B Base and 8B Instruct, arXiv v1^[1]	First 8B-scale masked-diffusion LLM trained from scratch on ~2.3T tokens.
2025-02-18	arXiv v2 of "Large Language Diffusion Models"^[1]	Revised benchmark tables and additional ablations.
2025-05-22	LLaDA-V paper (arXiv:2505.16933)^[5]	Visual instruction tuning on top of the LLaDA 8B backbone.
2025-05-25	LLaDA 1.5 paper (arXiv:2505.19223)^[4]	Variance-Reduced Preference Optimization (VRPO) for preference alignment.
2025-09	LLaDA-MoE 7B-A1B announced at the Bund Conference^[6]	First native Mixture-of-Experts diffusion language model.
2025-10-18	arXiv v3 of the LLaDA paper^[1]	Camera-ready version for NeurIPS 2025.
2025-12	LLaDA 2.0 mini and LLaDA 2.0 flash (100B params)^[7]	Joint work by Ant Group, Renmin University, Zhejiang University, Westlake University.
2025-12	LLaDA 2.0 added as a day-0 supported model in SGLang via the LMSYS dLLM framework^[10]	First production-grade serving framework support.

The original LLaDA paper was accepted as a NeurIPS 2025 Oral, presented in San Diego in December 2025, by which time it had accumulated more than 100 citations.^[11]^[3]

Technical Details

Masked-diffusion forward process

LLaDA defines a continuous-time forward process indexed by t in [0, 1]. Starting from a clean token sequence x_0, at time t each token is independently replaced by a special mask token M with probability t and retained with probability 1 - t. At t = 0 the sequence is fully clean, and at t = 1 the sequence is fully masked.^[1]^[2] Because tokens are corrupted independently and the only corruption is replacement with the absorbing mask state, the marginal distribution at any time t is fully described by the mask ratio, and the process admits a particularly tractable variational lower bound.^[1]
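The forward process described above is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the authors' code: the `MASK` string, the `forward_mask` helper, and the example sentence are all invented here, and tokens are plain Python strings rather than vocabulary ids.

```python
import random

MASK = "[MASK]"  # illustrative stand-in for the mask token M


def forward_mask(tokens, t, rng):
    """Replace each token with MASK independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # rng.random() is in [0, 1), so t = 0 masks nothing and t = 1 masks everything
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
x0 = "the quick brown fox jumps over the lazy dog".split()
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(x0, t, rng))
```

At t = 0 the sequence comes back unchanged and at t = 1 every position is masked, matching the two endpoints of the process.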

Training objective

LLaDA trains a Transformer "mask predictor" p_theta(x_0 | x_t) that, conditioned on a partially masked sequence x_t, predicts the original tokens at the masked positions. The training loss is a reweighted cross-entropy on the masked positions only:

L(theta) = - E_{t, x_0, x_t} [ (1/t) * sum_i 1[x_t^i = M] * log p_theta(x_0^i | x_t) ]

The factor 1/t compensates for the random mask ratio and makes the objective an upper bound on the model's negative log-likelihood, so minimizing it is a likelihood-based training procedure rather than the conventional masked language modeling heuristic with a fixed mask rate.^[1]^[2] Pretraining samples t uniformly from [0, 1]; supervised fine-tuning constrains masking to response tokens only so that prompts remain visible.^[1]
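To make the reweighting concrete, here is a small sketch that evaluates the loss for one (x_0, x_t, t) triple. It assumes a hypothetical `predict_proba(xt, i, target)` callback standing in for p_theta and uses a uniform predictor over a four-symbol toy vocabulary; none of these names come from the LLaDA codebase.

```python
import math

MASK = "[MASK]"  # illustrative stand-in for the mask token M


def masked_diffusion_loss(x0, xt, t, predict_proba):
    """Reweighted cross-entropy over the masked positions of xt, for one sample."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1] so that 1/t is defined")
    nll = 0.0
    for i, tok in enumerate(xt):
        if tok == MASK:  # unmasked positions contribute nothing
            nll -= math.log(predict_proba(xt, i, x0[i]))
    return nll / t  # the 1/t factor from the objective


def uniform_predictor(xt, i, target, vocab_size=4):
    """Hypothetical stand-in for p_theta: uniform over a toy vocabulary."""
    return 1.0 / vocab_size


x0 = ["a", "b", "c", "d"]
xt = ["a", MASK, "c", MASK]
loss = masked_diffusion_loss(x0, xt, 0.5, uniform_predictor)
print(loss)  # two masked tokens, each costing ln(4), divided by t = 0.5
```

With the uniform predictor, roughly a fraction t of tokens is masked and the sum is divided by t, so the expected loss is the same for every t: the sequence length times ln(4).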

Mask predictor architecture

The mask predictor is a Transformer that follows the layout of existing decoder-only LLMs but drops the causal mask, so self-attention is bidirectional. For the 8B variant the authors deliberately chose vanilla multi-head attention rather than grouped-query attention, reduced the feed-forward dimension to keep the total parameter count comparable to Llama 3 8B, and used a custom tokenizer adapted to their training corpus. Pretraining used a fixed sequence length of 4096 tokens.^[2] Prior theoretical work shows that a masked discrete diffusion model, unlike a continuous Gaussian diffusion model, does not need explicit time conditioning, so LLaDA omits it.^[12]

Sampling: iterative denoising and remasking

Generation begins from a fully masked sequence at t = 1 and proceeds backward in N discrete steps to t = 0. At each step the mask predictor predicts a distribution over the original tokens at every masked position simultaneously. When moving from time t to the next time s < t, predictions are committed for a fraction (t - s)/t of the currently masked positions in expectation, and the remaining fraction s/t is remasked. Two remasking strategies are supported:^[1]^[2]

Random remasking: tokens to remask are chosen uniformly among the just-predicted positions, recovering the principled variational sampler.
Low-confidence remasking: tokens with the lowest model confidence are remasked, biasing the sampler toward keeping high-confidence predictions and improving downstream task accuracy at small step counts.
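The two strategies differ only in how the positions to remask are chosen. The sketch below assumes that a fraction s/t of the currently masked positions stays masked when stepping from t to s, and it rounds that to a fixed count instead of drawing per-token coin flips. The `positions_to_remask` helper and the confidence values are hypothetical.

```python
import random


def positions_to_remask(confidence, t, s, strategy="low_confidence", rng=None):
    """Pick which just-predicted positions go back to the mask state.

    confidence maps each currently masked position to the model's confidence
    in its predicted token. Stepping from time t to s < t keeps roughly a
    fraction s/t of those positions masked.
    """
    if not 0.0 <= s < t <= 1.0:
        raise ValueError("need 0 <= s < t <= 1")
    n_remask = round(len(confidence) * s / t)  # fixed count: a simplification
    if strategy == "random":
        rng = rng or random.Random()
        return set(rng.sample(sorted(confidence), n_remask))
    if strategy == "low_confidence":
        # the least confident predictions are remasked first
        return set(sorted(confidence, key=confidence.get)[:n_remask])
    raise ValueError(f"unknown strategy: {strategy}")


conf = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.4}
print(positions_to_remask(conf, t=1.0, s=0.5))  # {1, 3}
```

Here half of the four predictions are kept; low-confidence remasking sends positions 1 and 3 (confidence 0.2 and 0.4) back to the mask state, while the random strategy would pick any two positions.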

For practical use the authors also support a semi-autoregressive block sampler: the response is divided into fixed-size blocks generated left to right, with diffusion denoising inside each block. This restores some of the locality benefits of autoregressive decoding while preserving bidirectional modeling within a block.^[2]^[13] On standard benchmarks LLaDA 8B Base running with this scheme has been reported to reach 54.75 tokens/second versus 33.79 tokens/second for LLaMA 3 8B at comparable accuracy when combined with dLLM-Cache acceleration.^[13]
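The control flow of such a block sampler can be sketched with a toy predictor. Everything here is illustrative: `toy_predict` is a hypothetical stand-in for the mask predictor, `MASK` is represented by `None`, and each step commits an equal share of the block's remaining masks, keeping the most confident predictions first. The repository's actual sampler may differ in its schedule and tie-breaking.

```python
MASK = None  # illustrative mask marker for this toy


def toy_predict(seq):
    """Hypothetical mask predictor: {position: (token, confidence)} for masked slots."""
    return {i: (f"tok{i}", 1.0 / (1 + i)) for i, x in enumerate(seq) if x is MASK}


def block_sample(prompt, gen_length, block_length, steps_per_block, predict=toy_predict):
    """Fill the response block by block, left to right, denoising inside each block."""
    seq = list(prompt) + [MASK] * gen_length
    start, end = len(prompt), len(prompt) + gen_length
    for b0 in range(start, end, block_length):
        block = range(b0, min(b0 + block_length, end))
        for steps_left in range(steps_per_block, 0, -1):
            # the predictor sees the whole sequence, but only this block is committed
            preds = {i: p for i, p in predict(seq).items() if i in block}
            if not preds:
                break
            n_keep = -(-len(preds) // steps_left)  # ceil: block is full after the last step
            best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:n_keep]
            for i in best:
                seq[i] = preds[i][0]  # everything else in the block stays masked
    return seq


print(block_sample(["p0", "p1"], gen_length=4, block_length=2, steps_per_block=2))
```

Setting `block_length` equal to `gen_length` recovers plain diffusion sampling over the whole response, while `block_length=1` degenerates to token-by-token left-to-right decoding.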

Pretraining scale

LLaDA 8B Base was pretrained on approximately 2.3 trillion tokens of filtered web data using roughly 0.13 million H800 GPU hours and a compute budget of about 10^23 FLOPs. The data mix and quality filters were guided by scaled-down autoregressive baselines. Training experienced one crash at the 1.2T-token mark, which was mitigated by checkpoint resumption and a learning-rate reduction from 4e-4 to 1e-4.^[2]
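The quoted compute budget can be sanity-checked with the common rule of thumb of about 6 FLOPs per parameter per training token. That rule is an assumption brought in here, not a figure from the paper, and the implied per-GPU throughput is only a back-of-the-envelope number.

```python
params = 8e9        # 8B parameters
tokens = 2.3e12     # ~2.3T pretraining tokens
gpu_hours = 0.13e6  # ~0.13 million H800 GPU hours

# Rule-of-thumb estimate (assumption): ~6 FLOPs per parameter per token
flops = 6 * params * tokens
print(f"{flops:.2e}")  # 1.10e+23

# Sustained throughput per GPU implied by the two reported figures
flops_per_gpu_second = flops / (gpu_hours * 3600)
print(f"{flops_per_gpu_second:.2e}")  # 2.36e+14
```

The estimate lands at roughly 1.1 x 10^23 FLOPs, consistent with the "about 10^23" figure quoted above.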

Supervised fine-tuning used about 4.5 million instruction-response pairs covering code, math, and dialogue. During SFT, mask sampling is restricted to response tokens so the model learns conditional generation given a fixed prompt.^[2]
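The only change from pretraining is where the noise is applied. A minimal sketch, with an invented `sft_mask` helper and toy string tokens:

```python
import random

MASK = "[MASK]"  # illustrative stand-in for the mask token M


def sft_mask(prompt, response, t, rng):
    """Mask response tokens with probability t; prompt tokens are never masked."""
    noisy_response = [MASK if rng.random() < t else tok for tok in response]
    return list(prompt) + noisy_response


prompt = ["What", "is", "2+2", "?"]
response = ["It", "is", "4", "."]
print(sft_mask(prompt, response, 0.5, random.Random(0)))
```

The loss is then computed on the masked response positions only, so the model learns to reconstruct a response given a fully visible prompt.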

Variants

LLaDA 8B Base and 8B Instruct

The two original release artifacts, hosted at GSAI-ML/LLaDA-8B-Base and GSAI-ML/LLaDA-8B-Instruct on Hugging Face, are MIT-licensed and ship as standard safetensors checkpoints loadable through the Hugging Face Transformers library with custom modeling code. The repository provides a generate() helper, a get_log_likelihood() utility for conditional likelihood evaluation, and chat.py / app.py demos.^[14]^[15]
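Loading a checkpoint with custom modeling code generally looks like the sketch below. It uses the standard Transformers `from_pretrained` calls with `trust_remote_code=True`; the `load_llada` wrapper and the choice of bfloat16 are assumptions made here, and the model card remains the authoritative reference for the exact snippet.

```python
MODEL_ID = "GSAI-ML/LLaDA-8B-Instruct"


def load_llada(model_id=MODEL_ID):
    """Load tokenizer and model. Needs torch, transformers, and a multi-GB download."""
    # imported lazily so this file can be read and imported without the heavy dependencies
    import torch
    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code is required because the checkpoint ships its own modeling code
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    )
    return tokenizer, model.eval()
```

Sampling then goes through the repository's own helper rather than the usual autoregressive `model.generate` path, since decoding is an iterative denoising loop.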

LLaDA 1.5 and VRPO

LLaDA 1.5, released in May 2025, applies preference optimization to the 8B Instruct checkpoint. The challenge addressed by the paper is that aligning a masked diffusion model with DPO-style methods requires likelihood ratios, which can only be estimated through a high-variance ELBO computed over random mask samples. The authors introduce Variance-Reduced Preference Optimization (VRPO), which analyzes the bias and variance that these ELBO estimates introduce into the preference gradient and proposes unbiased variance-reduction techniques, including optimal allocation of the Monte Carlo sampling budget and antithetic sampling. Applying VRPO to LLaDA 8B Instruct yields reported gains of +4.7 on GSM8K, +3.0 on HumanEval, +1.8 on MBPP, +4.0 on IFEval, and +4.3 on Arena-Hard over the SFT-only baseline.^[4]
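Why shared randomness helps when a loss depends on the difference of two noisy estimates can be shown with a toy experiment. This is not the VRPO estimator: it is a generic common-random-numbers demonstration, offered on the assumption that the paper's variance reduction relies on the same effect, and every function name and constant below is invented for illustration.

```python
import random
import statistics


def noisy_estimate(true_value, noise):
    """Toy Monte Carlo estimate: the true value plus sampling noise."""
    return true_value + noise


def difference_variance(shared, trials=2000, seed=0):
    """Variance of (estimate A - estimate B) with shared or independent noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        n1 = rng.gauss(0.0, 1.0)
        n2 = n1 if shared else rng.gauss(0.0, 1.0)
        # the second estimator responds to the noise slightly differently (factor 0.9)
        diffs.append(noisy_estimate(-10.0, n1) - noisy_estimate(-10.5, 0.9 * n2))
    return statistics.variance(diffs)


print("independent noise:", difference_variance(shared=False))
print("shared noise:     ", difference_variance(shared=True))
```

With shared draws most of the noise cancels in the difference, so its variance drops by roughly two orders of magnitude in this toy setup.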

LLaDA-V (vision)

LLaDA-V (arXiv:2505.16933, May 2025) extends LLaDA into a multimodal model by adding a vision encoder and an MLP projector that maps visual features into the language embedding space, in the LLaVA tradition. Training uses visual instruction tuning data while keeping the masked-diffusion training objective for the language backbone. Despite LLaDA's textual benchmarks lagging Qwen2-7B, LLaDA-V is reported to be highly competitive with LLaMA3-V and to narrow the gap to Qwen2-VL, and is presented as the state-of-the-art purely diffusion-based multimodal LLM at the time of publication.^[5]

LLaDA-MoE and LLaDA 2.0

In September 2025 Ant Group and Renmin University announced LLaDA-MoE-7B-A1B, described as the first native Mixture of Experts diffusion language model, with roughly 1B active parameters per token out of a 7B total.^[6] This was followed in late 2025 by LLaDA 2.0 mini and LLaDA 2.0 flash, the latter a 100B-parameter model jointly developed by Ant Group, Renmin University, Zhejiang University, and Westlake University. LMSYS announced day-0 support for the LLaDA 2.0 series in the SGLang serving framework via a new block-diffusion runtime.^[7]^[10]

Comparison with Autoregressive LLMs

The headline experimental comparison in the original paper is between LLaDA 8B Base and Llama 3 8B Base, both evaluated zero-shot or few-shot on standard pretraining benchmarks. Selected results reported in the paper:^[1]^[2]

Benchmark	LLaDA 8B Base	LLaMA3 8B Base
MMLU	65.9	65.4
CMMLU	69.9	50.7
C-Eval	70.5	51.7
ARC-C	47.9	53.1
PIQA	74.3	80.6
GSM8K	70.3	48.7
HumanEval	35.4	34.8
MBPP	40.0	48.8

LLaDA 8B Base is roughly level with Llama 3 8B Base on MMLU and HumanEval, clearly ahead on the Chinese benchmarks (CMMLU, C-Eval) and on GSM8K, and behind on ARC-C, PIQA, and MBPP. The authors interpret these results as evidence that scaling, in-context learning, and instruction following can emerge from a bidirectional masked-diffusion objective rather than uniquely from next-token prediction.^[1]
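The per-benchmark margins are easy to read off programmatically. The snippet below only restates the numbers from the table above and computes the differences; the variable names are incidental.

```python
# benchmark: (LLaDA 8B Base, Llama 3 8B Base), copied from the table above
scores = {
    "MMLU": (65.9, 65.4),
    "CMMLU": (69.9, 50.7),
    "C-Eval": (70.5, 51.7),
    "ARC-C": (47.9, 53.1),
    "PIQA": (74.3, 80.6),
    "GSM8K": (70.3, 48.7),
    "HumanEval": (35.4, 34.8),
    "MBPP": (40.0, 48.8),
}

# positive delta means LLaDA scores higher
deltas = {name: round(llada - llama, 1) for name, (llada, llama) in scores.items()}
for name, delta in deltas.items():
    print(f"{name:10s} {delta:+.1f}")

behind = [name for name, delta in deltas.items() if delta < 0]
print("LLaDA behind on:", behind)
```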

After SFT, LLaDA 8B Instruct trails Llama 3 8B Instruct on many benchmarks. The authors attribute the gap to post-training: the Llama 3 release adds RLHF on top of SFT, whereas LLaDA 8B Instruct uses SFT only. The diffusion model nevertheless produces fluent multi-turn dialogue in qualitative case studies.^[1]^[2] The LLaDA 1.5 paper subsequently closed part of this gap using VRPO.^[4]

Reversal curse experiments

A motivating empirical phenomenon for LLaDA is the so-called reversal curse: autoregressive LLMs trained on facts of the form "A is B" frequently fail to answer "Who is B?" because next-token prediction induces a strong left-to-right inductive bias. The LLaDA paper evaluates this on a Chinese poem completion task with 496 pairs of adjacent lines, querying each model either with the previous line (forward) or with the following line (reversal). The reported accuracies are:^[1]^[16]

Model	Forward (%)	Reversal (%)
GPT-4o	82.7	34.3
Qwen 2.5	75.9	38.0
LLaDA 8B	51.8	45.6

LLaDA's forward and reversal accuracies are far closer to each other than those of either autoregressive baseline: it trails both on the forward task but outperforms both GPT-4o and Qwen 2.5 on the reversal task. The authors and subsequent commentary attribute this to LLaDA treating all positions uniformly during training, without a left-to-right directional bias.^[1]^[16]
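The asymmetry can be quantified as the gap between forward and reversal accuracy for each model. The numbers are copied from the table above; the gap metric itself is just a convenient summary chosen here, not one defined in the paper.

```python
# model: (forward %, reversal %), copied from the table above
poem_accuracy = {
    "GPT-4o": (82.7, 34.3),
    "Qwen 2.5": (75.9, 38.0),
    "LLaDA 8B": (51.8, 45.6),
}

# forward-minus-reversal gap in percentage points; smaller means more symmetric
gaps = {model: round(fwd - rev, 1) for model, (fwd, rev) in poem_accuracy.items()}
for model, gap in gaps.items():
    print(f"{model:9s} gap = {gap:.1f} points")

best_reversal = max(poem_accuracy, key=lambda m: poem_accuracy[m][1])
print("best reversal accuracy:", best_reversal)
```

The gap is 48.4 points for GPT-4o, 37.9 for Qwen 2.5, and 6.2 for LLaDA 8B.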

Applications and Significance

LLaDA's most cited contribution is conceptual: it provides the first large-scale, instruction-tuned demonstration that a non-autoregressive LLM can approach the in-context-learning quality of a comparable AR baseline, weakening the long-standing argument that the autoregressive factorization is necessary for general LLM capability.^[1]^[3] Downstream consequences include:

A growing line of follow-up work on masked-diffusion LLMs that reuses LLaDA's training recipe and checkpoints, including blockwise SFT methods for reconciling bidirectional attention with autoregressive decoding and adaptive parallel decoders for diffusion LLMs.^[13]
Native Mixture-of-Experts diffusion language models (LLaDA-MoE, LLaDA 2.0 flash) that demonstrate the recipe scales to 100B parameters.^[6]^[7]
Multimodal diffusion LLMs (LLaDA-V) that suggest purely diffusion-based stacks can be competitive with hybrid autoregressive-diffusion vision-language models.^[5]
Production serving stacks (SGLang's block-diffusion runtime, the LMSYS dLLM framework) that target diffusion LLMs as a first-class deployment target rather than as a research curiosity.^[10]

LLaDA is also frequently cited as the open academic counterpart to commercial diffusion LLMs. Inception Labs, co-founded by Stefano Ermon, released Mercury and Mercury Coder in early 2025 as "the world's first commercial-scale diffusion large language model", marketing throughputs of 1109 tokens/second (Mercury Coder Mini) and 737 tokens/second (Mercury Coder Small) on NVIDIA H100 GPUs while remaining competitive with speed-optimized models such as GPT-4o Mini and Claude 3.5 Haiku on Copilot Arena.^[17] Mercury 2, launched in February 2026, advertises 5x speedups over leading speed-optimized LLMs.^[18] Mercury's underlying technique is not openly published in the same detail as LLaDA, but Inception founders co-invented earlier diffusion-model and discrete-sampling techniques, and the design space (block-parallel iterative denoising) closely parallels the LLaDA family.^[17]

Limitations

The original LLaDA paper and follow-up commentary identify several limitations that remain open as of mid-2026:

Inference compute. Achieving the reported accuracies typically requires a number of denoising steps equal to the response length. Because every step is a full forward pass over the whole sequence, naive LLaDA sampling can cost more per generated token than autoregressive decoding. The repository explicitly notes that step reduction is possible at some performance cost, motivating the semi-autoregressive block sampler and external accelerators such as dLLM-Cache and SlowFast Sampling.^[13]^[15]
KV-cache incompatibility. Because LLaDA uses full bidirectional attention and re-attends to the entire sequence each denoising step, standard autoregressive KV-caching does not apply directly, and specialized caching schemes are still an active research area.^[10]^[13]
Alignment recipe. Mainstream RLHF and DPO methods assume tractable likelihoods. Adapting them to masked diffusion required the VRPO machinery introduced in LLaDA 1.5, and even with VRPO some preference benchmarks lag well-tuned AR models that have access to mature post-training pipelines.^[4]
Benchmark coverage. Reported results emphasize knowledge, math, and code benchmarks; long-context, agentic, and frontier reasoning benchmarks are less thoroughly characterized in the public literature than for comparable AR models.^[1]^[11]
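The first two limitations can be made concrete with a rough count of how many token positions pass through the network during generation. This is a deliberately crude model under stated assumptions: it counts positions processed, not FLOPs, it ignores attention cost and any diffusion-specific caching, and both helper functions are hypothetical.

```python
def ar_token_passes(prompt_len, gen_len):
    """AR decoding with a KV cache: the prompt once, then one new token per step."""
    return prompt_len + gen_len


def diffusion_token_passes(prompt_len, gen_len, steps):
    """Naive bidirectional denoising: every step re-processes the full sequence."""
    return steps * (prompt_len + gen_len)


prompt_len, gen_len = 128, 256
print(ar_token_passes(prompt_len, gen_len))                  # 384
for steps in (256, 64):
    print(steps, diffusion_token_passes(prompt_len, gen_len, steps))
# 256 steps -> 98304 positions, 64 steps -> 24576 positions
```

Under this count, one step per generated token costs 256 times as much as cached autoregressive decoding for the example sizes, which is why fewer steps, block-wise sampling, and approximate caches all target the same bottleneck.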

Related Work

Diffusion model and DDPM: the broader family of generative models that LLaDA adapts to discrete language data.
Score matching: the continuous-data analogue of the discrete loss used in SEDD, the closest precursor in the discrete-diffusion-for-language literature.^[9]
Masked Language Model and BERT: structurally similar to a single LLaDA denoising step at a fixed mask rate, but trained without an iterative generative process and without the 1/t reweighting that turns the loss into a likelihood bound.^[1]
Llama 3 and GPT-4o: the autoregressive baselines against which LLaDA is benchmarked in the original paper.^[1]
Mixture of Experts: the architectural pattern adopted in LLaDA-MoE and LLaDA 2.0 flash.^[6]^[7]
Direct Preference Optimization (DPO) and RLHF: the alignment paradigms whose adaptation to masked diffusion motivated VRPO in LLaDA 1.5.^[4]

References

1. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "Large Language Diffusion Models", arXiv:2502.09992, 2025-02-14 (v1) / 2025-10-18 (v3). https://arxiv.org/abs/2502.09992. Accessed 2026-05-21.
2. Shen Nie et al., "Large Language Diffusion Models", HTML version, arXiv, 2025-10-18. https://arxiv.org/html/2502.09992. Accessed 2026-05-21.
3. L.J., "NeurIPS 2025 oral: Diffusion model officially enters LLM", Medium, 2025-12. https://medium.com/@zljdanceholic/neurips-2025-oral-diffusion-model-officially-enters-llm-fc68441cebc0. Accessed 2026-05-21.
4. Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models", arXiv:2505.19223, 2025-05-25. https://arxiv.org/abs/2505.19223. Accessed 2026-05-21.
5. Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li, "LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning", arXiv:2505.16933, 2025-05-22. https://arxiv.org/abs/2505.16933. Accessed 2026-05-21.
6. AIBase, "Challenging Conventional Wisdom! Ant Group and Renmin University to Launch the First Native MoE Diffusion Language Model in the Industry at the 2025 Bund Conference", AIBase News, 2025-09. https://www.aibase.com/news/21246. Accessed 2026-05-21.
7. 36Kr Europe, "Milestone: First 100B Diffusion Language Model Unveiled, Technical Report Details Inside", 36Kr, 2025-12. https://eu.36kr.com/en/p/3592063556468736. Accessed 2026-05-21.
8. Jonathan Ho, Ajay Jain, Pieter Abbeel, "Denoising Diffusion Probabilistic Models", arXiv:2006.11239, 2020-06-19. https://arxiv.org/abs/2006.11239. Accessed 2026-05-21.
9. Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution", arXiv:2310.16834, 2023-10-25 (ICML 2024 Best Paper). https://arxiv.org/abs/2310.16834. Accessed 2026-05-21.
10. LMSYS Org, "Power Up Diffusion LLMs: Day-0 Support for LLaDA 2.0", LMSYS Blog, 2025-12-19. https://www.lmsys.org/blog/2025-12-19-diffusion-llm/. Accessed 2026-05-21.
11. NeurIPS 2025, "Large Language Diffusion Models, Oral Presentation, San Diego", NeurIPS Virtual Site, 2025-12. https://neurips.cc/virtual/2025/loc/san-diego/oral/118609. Accessed 2026-05-21.
12. GSAI-ML, "LLaDA-8B-Instruct model card and discussion", Hugging Face, 2025. https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct. Accessed 2026-05-21.
13. Daniel Israel et al., "Accelerating Diffusion LLMs via Adaptive Parallel Decoding", arXiv:2506.00413, 2025-06. https://arxiv.org/pdf/2506.00413. Accessed 2026-05-21.
14. ML-GSAI, "LLaDA GitHub repository", GitHub, 2025. https://github.com/ML-GSAI/LLaDA. Accessed 2026-05-21.
15. GSAI-ML, "LLaDA-8B-Base model card", Hugging Face, 2025. https://huggingface.co/GSAI-ML/LLaDA-8B-Base. Accessed 2026-05-21.
16. ML-GSAI, "LLaDA project page", Renmin University ML group, 2025. https://ml-gsai.github.io/LLaDA-demo/. Accessed 2026-05-21.
17. Inception Labs, "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model", Inception Blog, 2025-02. https://www.inceptionlabs.ai/blog/introducing-mercury. Accessed 2026-05-21.
18. Inception Labs / BusinessWire, "Inception Launches Mercury 2, the Fastest Reasoning LLM, 5x Faster Than Leading Speed-Optimized LLMs", BusinessWire, 2026-02-24. https://www.businesswire.com/news/home/20260224034496/en/Inception-Launches-Mercury-2-the-Fastest-Reasoning-LLM-5x-Faster-Than-Leading-Speed-Optimized-LLMs-with-Dramatically-Lower-Inference-Cost. Accessed 2026-05-21.
