SEDD (Score Entropy Discrete Diffusion)

Updated Jul 16, 2026

Score Entropy Discrete Diffusion (SEDD) is a discrete diffusion model for language and other discrete data introduced by Aaron Lou, Chenlin Meng, and Stefano Ermon at Stanford University in the paper Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (arXiv:2310.16834, October 2023).^[1] The core contribution is a loss function called score entropy, which the authors describe as a principled extension of score matching to discrete state spaces; this loss trains a network to estimate the ratios between the marginal probabilities of neighboring states at each noise level, rather than estimating the distribution itself.^[1]^[2] SEDD was the first non-autoregressive language model to match GPT-2 at comparable model size on standard perplexity benchmarks; the paper received a Best Paper Award at ICML 2024, one of ten such honors that year.^[3]^[4]^[5] The released codebase and Hugging Face checkpoints made SEDD a reference implementation that subsequent discrete diffusion language models (notably MDLM and MD4) compare against and build upon.^[6]^[7]^[8]

Background

Continuous-state denoising diffusion models had, by 2023, become the dominant approach for image, audio, and video generation, but they had not been competitive with autoregressive transformers on natural language. The standard recipe of corrupting data with Gaussian noise and recovering it by learning a Stein score does not transfer to text, because tokens live in a finite vocabulary with no continuous structure.^[1]^[2] Several prior frameworks adapted diffusion to discrete data: D3PM defined discrete Markov forward processes with categorical transition kernels; Concrete Score Matching attempted to learn neighboring-probability ratios with an L2 loss; and continuous-relaxation approaches such as PLAID embedded tokens into a continuous space before diffusing. None of these had matched the quality of a similarly sized autoregressive baseline on text, with reported perplexities typically much higher than GPT-2 small at the time SEDD was released.^[1]^[2]

The SEDD paper traces its motivation to this gap. The authors argue that the right object to learn for a discrete diffusion model is the concrete score, a vector of ratios p_t(y)/p_t(x) for neighboring states y of the current state x, and that the obstacle had been the absence of a stable, mode-covering loss for those positive-valued targets.^[1]^[9] Score entropy is the loss they propose to close that gap.^[1]

Aaron Lou completed the work as a Stanford computer science PhD student advised by Stefano Ermon. Chenlin Meng, also a Stanford PhD with Ermon, is a co-author and a co-founder of Pika; Lou has since taken a position leading the Strategic Explorations team at OpenAI.^[10]

Technical Details

Discrete diffusion framework

SEDD operates on sequences of tokens from a finite vocabulary. The forward process is a continuous-time Markov chain (CTMC) governed by a family of transition-rate matrices Q_t, with non-negative off-diagonal entries and zero column sums, so the marginal densities p_t evolve under the linear ODE dp_t/dt = Q_t p_t. The reverse-time process is itself a CTMC whose rates depend on the data only through the ratios p_t(y)/p_t(x) of the marginal distribution at adjacent states; learning those ratios is therefore sufficient to simulate generation, much as learning the Stein score is sufficient in continuous diffusion.^[1]^[9]

The vocabulary-sized Q_t is generally too large to store, so SEDD uses two structured graph kernels that admit closed-form transition matrices:

Uniform: every token transitions to any other token in the vocabulary at an equal rate, with a uniform stationary distribution.
Absorbing (mask): tokens are independently absorbed into a special MASK state, in the spirit of BERT-style masked language modeling. The stationary distribution puts all mass on the fully masked sequence.^[1]^[9]
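
The two kernels can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the uniform matrix uses the scaling Q = 11^T - N I (an assumption about normalization), and the last index is assumed to stand in for MASK. The sketch checks that each rate matrix has zero column sums and that the closed-form transition matrix agrees with a brute-force matrix exponential.

```python
import numpy as np

def rate_uniform(n):
    """Uniform graph: off-diagonal rate 1, diagonal 1 - n, so columns sum to zero."""
    return np.ones((n, n)) - n * np.eye(n)

def rate_absorbing(n):
    """Absorbing graph on n states; the last index is assumed to be MASK."""
    q = -np.eye(n)
    q[-1, :] += 1.0  # every state flows into MASK; the MASK column becomes all zeros
    return q

def transition_uniform(n, sigma):
    """Closed form of exp(sigma * Q) for the uniform graph."""
    decay = np.exp(-n * sigma)
    return decay * np.eye(n) + (1.0 - decay) / n * np.ones((n, n))

def transition_absorbing(n, sigma):
    """Closed form of exp(sigma * Q): keep a token w.p. exp(-sigma), else move to MASK."""
    decay = np.exp(-sigma)
    p = decay * np.eye(n)
    p[-1, :] += 1.0 - decay
    return p

def expm_series(a, terms=60):
    """Truncated Taylor series for the matrix exponential (adequate for tiny matrices)."""
    out = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        out = out + term
    return out

n, sigma = 5, 0.7  # toy vocabulary size and accumulated noise level
for rate, transition in ((rate_uniform, transition_uniform),
                         (rate_absorbing, transition_absorbing)):
    q = rate(n)
    print(rate.__name__,
          "zero column sums:", np.allclose(q.sum(axis=0), 0.0),
          "closed form matches:", np.allclose(transition(n, sigma), expm_series(sigma * q)))
```

Because both closed forms only need a scalar decay factor, corrupting a token never requires materializing a vocabulary-sized matrix, which is the practical reason for restricting to these two graphs.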

Empirically, the absorbing variant (SEDD Absorb) outperforms the uniform variant on text and is the configuration used for the headline GPT-2 comparisons.^[1]^[11]

Why naive extensions of score matching fail

In continuous diffusion, score matching learns the gradient of the log density, which is unconstrained in sign. In the discrete setting the analogous target is a positive ratio, and a naive L2 loss on those ratios (Concrete Score Matching) has no mechanism to penalize negative or zero predictions. The SEDD paper shows that this leads to estimators that can place mass on invalid outputs and that the gradients are not well behaved near zero.^[1]^[9] Earlier discrete-diffusion paradigms such as D3PM avoid the ratio entirely and instead learn a mean prediction of the clean data, but doing so requires variational bounds that become loose in the continuous-time limit and that do not factor neatly across positions.^[1]

The score entropy objective

The proposed loss is a Bregman divergence built from K(a) = a(log a - 1), defined for a network output s_θ(x) that predicts the vector of neighboring ratios at state x:

L_SE = E_{x~p}[ sum_{y != x} w_{xy} ( s_θ(x)_y - (p(y)/p(x)) log s_θ(x)_y + K(p(y)/p(x)) ) ]

with non-negative weights w_{xy} and a positive network parameterization (for example, an exponentiated output). Three properties are emphasized in the paper:^[1]^[9]

Proper. The unique minimizer over s_θ equals the true ratios p(y)/p(x). The loss is convex in s_θ, which supports stable optimization, and its logarithmic term acts as a barrier that diverges as any predicted ratio approaches zero, so non-positive outputs are excluded.
Denoising. Theorem 3.4 of the paper shows that, when p is a noisy marginal p_t = E_{x_0}[p_t(. | x_0)] produced by the forward CTMC, the score entropy is equivalent up to an additive constant to a denoising score entropy in which the unknown ratios p_t(y)/p_t(x) are replaced by the conditional ratios p_t(y | x_0)/p_t(x | x_0). Those conditional ratios are available in closed form for the uniform and absorbing kernels, so a Monte Carlo estimator needs only one forward pass per sample.
Weighted. Choosing the weights w_{xy} to match the transition rates of the forward CTMC yields an evidence lower bound on the model likelihood, so minimizing the weighted score entropy bounds perplexity.^[1]^[9]
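
The "proper" property can be checked numerically on a toy example. The sketch below is illustrative only: `score_entropy` and `l2_ratio_loss` are hypothetical helper names, the loss is evaluated at a single state x with hand-picked true ratios, and the weights default to 1. It evaluates the summand of L_SE directly and contrasts it with a plain L2 loss on the ratios.

```python
import numpy as np

def score_entropy(s, r, w=None):
    """Sum over neighbors y of w_y * (s_y - r_y * log(s_y) + K(r_y)), with K(a) = a(log a - 1)."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    w = np.ones_like(r) if w is None else np.asarray(w, dtype=float)
    k = r * (np.log(r) - 1.0)
    return float(np.sum(w * (s - r * np.log(s) + k)))

def l2_ratio_loss(s, r):
    """Plain squared error on the ratios, for contrast."""
    return float(np.sum((np.asarray(s, dtype=float) - np.asarray(r, dtype=float)) ** 2))

# Toy distribution p = [0.5, 0.3, 0.2]; at state x = 0 the neighbor ratios are p(y) / p(x).
true_ratios = np.array([0.3, 0.2]) / 0.5

print(score_entropy(true_ratios, true_ratios))        # zero up to rounding at the true ratios
print(score_entropy(1.5 * true_ratios, true_ratios))  # strictly positive away from them
print(score_entropy(np.full(2, 1e-8), true_ratios))   # grows without bound as s -> 0
print(l2_ratio_loss(-true_ratios, true_ratios))       # L2 stays finite at negative outputs
```

Setting the derivative of each summand with respect to s_y to zero gives 1 - r_y / s_y = 0, so s_y = r_y, and substituting back shows the K term cancels the rest exactly, which is why the minimum value is zero.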

For the absorbing kernel the loss reduces (up to terms determined by the noise schedule) to a sum of cross-entropy losses against the clean tokens at masked positions, with weights set by the schedule; this connection is the bridge that later masked-diffusion work (MDLM and MD4) builds on.^[7]^[8]

Sampling

At inference time SEDD initializes a sequence from the stationary distribution (all tokens drawn uniformly, or for the absorbing kernel a fully masked sequence) and integrates the reverse CTMC using the learned ratios. The paper introduces an analytical sampler that exploits the closed-form transition matrices and a Tweedie-style correction; this allows running with far fewer denoising steps than the sequence length while still matching baseline quality. The authors report that 1024-token samples reach GPT-2 quality with roughly 32 times fewer network evaluations than the one-evaluation-per-token cost of autoregressive decoding, that is, on the order of 32 denoising steps rather than 1,024.^[1]^[9] Because the model parameterizes ratios directly, conditioning on tokens at arbitrary positions (left, right, middle, or scattered) is a Bayes-rule rearrangement of the ratios and requires no additional training, enabling zero-shot infilling.^[1]^[2]
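
The role of the ratios in generation can be illustrated on a single token with a three-symbol vocabulary. The sketch below makes several simplifying assumptions that are not from the paper: the exact ratios of a known toy distribution stand in for the trained network, the noise schedule is constant, and a plain Euler discretization of the reverse master equation is used instead of the paper's analytical sampler. Rather than drawing samples, it evolves the whole distribution so the result is deterministic.

```python
import numpy as np

N, T, STEPS = 3, 2.0, 2000
Q = np.ones((N, N)) - N * np.eye(N)   # uniform graph, columns sum to zero
p0 = np.array([0.7, 0.2, 0.1])        # toy "data" distribution over 3 tokens

def marginal(t):
    """Exact forward marginal p_t = exp(t Q) p0 for the uniform graph."""
    decay = np.exp(-N * t)
    return decay * p0 + (1.0 - decay) / N

def reverse_rates(t):
    """Reverse rate matrix: Qbar[y, x] = (p_t(y) / p_t(x)) * Q[x, y] for y != x."""
    p = marginal(t)
    ratios = p[:, None] / p[None, :]            # ratios[y, x] = p_t(y) / p_t(x)
    qbar = ratios * Q.T
    np.fill_diagonal(qbar, 0.0)
    np.fill_diagonal(qbar, -qbar.sum(axis=0))   # diagonal makes each column sum to zero
    return qbar

q = np.full(N, 1.0 / N)               # start from the stationary (uniform) distribution
dt = T / STEPS
for k in range(STEPS):
    t = T - k * dt                    # forward time of the current state, always > 0
    q = q + dt * reverse_rates(t) @ q # Euler step of the reverse master equation

print(np.round(q, 3))                 # should land close to p0
```

The only data-dependent quantity in `reverse_rates` is the matrix of ratios, which is the sense in which learning the ratios is sufficient for generation; a trained model would supply an approximation of that matrix in place of the exact values used here.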

Architecture and training setup

The network is a Diffusion Transformer (DiT)-style encoder-only Transformer with adaptive layer-norm time conditioning and rotary positional embeddings. SEDD Small (about the parameter count of GPT-2 small) and SEDD Medium (about the parameter count of GPT-2 medium) are released. Training uses a batch size of 512, learning rate 3e-4, linear warmup over the first 2,000 iterations, gradient-norm clipping at 1, and an EMA decay of 0.9999, on nodes of 8 A100 80GB or 16 A100 40GB GPUs.^[11]^[9]
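
The reported hyperparameters can be collected into a small configuration object, shown here as a hedged sketch. The class and function names are hypothetical and do not come from the released codebase, and holding the learning rate constant after warmup is an assumption, since the text above specifies only the warmup phase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values as reported above; field names are illustrative.
    batch_size: int = 512
    learning_rate: float = 3e-4
    warmup_steps: int = 2000
    grad_clip_norm: float = 1.0
    ema_decay: float = 0.9999

def warmup_lr(step, cfg):
    """Linear warmup from 0 to the peak rate; constant afterwards (an assumption)."""
    return cfg.learning_rate * min(1.0, step / cfg.warmup_steps)

def ema_update(ema_value, param_value, cfg):
    """One exponential-moving-average update of a single parameter value."""
    return cfg.ema_decay * ema_value + (1.0 - cfg.ema_decay) * param_value

cfg = TrainConfig()
for step in (0, 500, 2000, 10000):
    print(step, warmup_lr(step, cfg))
```

With a decay of 0.9999 the moving average has an effective horizon of roughly 10,000 updates, so the averaged weights lag the raw weights noticeably early in training.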

Experiments and Results

SEDD is trained on OpenWebText for the zero-shot GPT-2 comparison, on the One Billion Words corpus for likelihood evaluation, and on character-level text8 for ablations. The headline result is that across five of the standard zero-shot perplexity datasets used to evaluate GPT-2, the upper-bound perplexities of SEDD Absorb are competitive with the GPT-2 numbers at comparable model size: at both sizes the SEDD bound is lower on WikiText2, PTB, and WikiText103 and higher on LAMBADA and One Billion Words.^[1]^[9]

Dataset	GPT-2 Small	SEDD Absorb Small	GPT-2 Medium	SEDD Absorb Medium
LAMBADA	45.04	<= 50.92	35.66	<= 42.77
WikiText2	42.43	<= 41.84	31.80	<= 31.04
PTB	138.43	<= 114.24	123.14	<= 87.12
WikiText103	41.60	<= 40.62	31.39	<= 29.98
One Billion Words	75.20	<= 79.29	55.72	<= 61.19

Values are perplexities; lower is better. SEDD numbers are upper bounds because the model defines a likelihood through an ELBO. Sources: arXiv:2310.16834 v3 and the project blog post.^[1]^[9]
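
The per-dataset gaps in the table can be tallied with a short script. This is only arithmetic on the numbers listed above; the helper names are hypothetical.

```python
# Perplexities copied from the table above as (GPT-2, SEDD Absorb upper bound).
SMALL = {
    "LAMBADA": (45.04, 50.92),
    "WikiText2": (42.43, 41.84),
    "PTB": (138.43, 114.24),
    "WikiText103": (41.60, 40.62),
    "One Billion Words": (75.20, 79.29),
}
MEDIUM = {
    "LAMBADA": (35.66, 42.77),
    "WikiText2": (31.80, 31.04),
    "PTB": (123.14, 87.12),
    "WikiText103": (31.39, 29.98),
    "One Billion Words": (55.72, 61.19),
}

def relative_change(gpt2, sedd):
    """Percent change of the SEDD bound relative to GPT-2 (negative means SEDD is lower)."""
    return 100.0 * (sedd - gpt2) / gpt2

def sedd_wins(table):
    """Datasets where the SEDD upper bound is already below the GPT-2 perplexity."""
    return [name for name, (gpt2, sedd) in table.items() if sedd < gpt2]

for label, table in (("small", SMALL), ("medium", MEDIUM)):
    print(label, "SEDD bound lower on:", sedd_wins(table))
    for name, (gpt2, sedd) in table.items():
        print(f"  {name}: {relative_change(gpt2, sedd):+.1f}%")
```

Since the SEDD entries are upper bounds, a dataset where the bound is already below the GPT-2 value is a definite win, while a bound above the GPT-2 value does not by itself settle which model has the lower true perplexity.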

Against earlier language diffusion baselines (D3PM, other predecessors of SEDD, and continuous-embedding diffusion such as PLAID), the paper reports 25 to 75 percent reductions in perplexity at comparable model sizes.^[1]^[4] On unconditional text generation, the authors report that SEDD samples have a generative perplexity (as measured by a stronger evaluator model) 6 to 8 times lower than that of un-annealed GPT-2 samples; that is, SEDD does not require temperature scaling or nucleus sampling to avoid degenerate outputs. The MAUVE divergence against the reference distribution is comparable to that of nucleus-sampled GPT-2.^[1]^[9] The paper also demonstrates infilling tasks where prompts appear at arbitrary positions, which is awkward for a left-to-right GPT model but native to SEDD.^[1]

Variants and Follow-Up Work

SEDD anchored a wave of follow-up work that simplified or generalized the framework.

MDLM (Masked Diffusion Language Models). Sahoo et al. (Cornell and collaborators), in work published at NeurIPS 2024, focused exclusively on the absorbing kernel and derived a Rao-Blackwellized continuous-time objective that they showed equals a weighted average of standard masked language modeling cross-entropy losses, a formulation that does not invoke the CTMC machinery. They report a roughly 17 percent improvement on the LM1B perplexity bound relative to SEDD trained for 33 billion tokens.^[7]
MD4 (Simplified and Generalized Masked Diffusion). Shi, Han, Wang, Doucet, and Titsias at Google DeepMind independently established that the continuous-time variational objective for masked discrete diffusion reduces to a simple weighted integral of cross-entropies and extended the framework to state-dependent masking schedules. MD4 outperforms SEDD on most of the reported language benchmarks and competes with autoregressive models.^[8]
LLaDA (Large Language Diffusion with mAsking). Released in 2025 by researchers at Renmin University and Ant Group, LLaDA is an 8 billion parameter masked diffusion language model that scales the absorbing-kernel recipe and reaches parity with LLaMA3 8B on a number of zero-shot benchmarks. It is descended methodologically from SEDD and the MDLM/MD4 line.^[12]
Inception Labs Mercury. Mercury and Mercury Coder, released in 2025 by Inception Labs (cofounded by Stefano Ermon, an author on SEDD), are commercial diffusion language models that the company markets as providing very high throughput for code and chat. Reporting on Mercury and its lineage points back to the score entropy paper as the technical predecessor.^[13]

The original SEDD reference implementation lives in the GitHub repository louaaron/Score-Entropy-Discrete-Diffusion under the MIT license, with pretrained checkpoints louaaron/sedd-small and louaaron/sedd-medium released on Hugging Face. The codebase exposes both the absorbing and uniform graphs and the analytical sampler.^[6]^[14]

Applications and Significance

The headline application of SEDD is non-autoregressive language modeling. By matching a GPT-2-scale autoregressive baseline at the same parameter count, SEDD was widely read as the first credible empirical demonstration that the autoregressive next-token paradigm is not the only viable route to a competitive language model.^[1]^[2]^[15] Three properties of SEDD are repeatedly cited in subsequent work and commentary:

Controllable, prompt-agnostic generation. Because the model parameterizes ratios, it can condition on observed tokens at any positions without the train-time tricks (suffix prompting, fill-in-the-middle, prefix LM objectives) that autoregressive models need.^[1]^[9]
Compute and quality trade-off. The number of reverse-diffusion steps can be reduced, with a graceful quality degradation, in a way that is structurally unavailable to a strictly sequential autoregressive sampler.^[1]^[4]
Faithful sampling without annealing. SEDD reaches sample quality competitive with nucleus-sampled GPT-2 from the unmodified model distribution, sidestepping the temperature and top-p hyperparameters that autoregressive samplers rely on to avoid degeneration.^[1]^[9]

Beyond language, the score entropy framework applies in principle to any discrete generative task. Subsequent work has explored protein sequence design and other categorical-data domains using SEDD-style ratio modeling.^[16]

Limitations

The SEDD paper and independent evaluations are explicit about several weaknesses.

Short-prompt conditional generation. A 2024 survey of diffusion language modeling notes that SEDD is slightly weaker than GPT-2 when conditioning on short prompts, and is less diverse on some prompted generation metrics.^[15]
FLOP and KV-cache inefficiency. Iterative denoising recomputes attention over the full sequence at every step, and the mask tokens that dominate early sampling steps consume compute without contributing semantic content. Non-causal attention also makes KV caching nontrivial relative to a decoder-only autoregressive model.^[15]
Fixed sequence length. A SEDD generation operates on a pre-allocated buffer of fixed length, in contrast to autoregressive sampling that can extend a sequence indefinitely.^[15]
Unproven scaling. The original paper did not demonstrate scaling beyond GPT-2 size. The released checkpoints stop at GPT-2 medium parameter counts; evidence that the recipe continues to compete with modern autoregressive large language models came later, through MDLM, MD4, and LLaDA rather than from SEDD itself.^[7]^[8]^[12]

Reception and Awards

The paper appeared on arXiv on 25 October 2023; the third revision was posted on 6 June 2024.^[1] It was accepted as an oral and a Best Paper at the 41st International Conference on Machine Learning, held in Vienna, Austria from 21 to 27 July 2024, where it was one of ten Best Paper award recipients.^[4]^[5] Stanford AI Lab published a congratulatory announcement dated 26 July 2024.^[3] The paper is published in the ICML 2024 proceedings.^[17]

Comparison

System	Year	Loss / training objective	Forward process	Notes
D3PM (Austin et al.)	2021	Variational ELBO on mean prediction	Discrete Markov chain (categorical)	Tractable but loose continuous-time bounds.
Concrete Score Matching	2022	L2 on ratio estimates	Continuous-time discrete	No positivity constraint; unstable.
SEDD	2023	Score entropy / denoising score entropy	CTMC, absorbing or uniform	First non-AR LM to match GPT-2 at comparable size.
MDLM	2024	Rao-Blackwellized weighted cross-entropy	Absorbing only	Outperforms SEDD on LM1B at matched compute.
MD4	2024	Simple weighted integral of cross-entropies	Absorbing, with state-dependent schedules	Outperforms SEDD on most reported benchmarks.
LLaDA	2025	Masked diffusion objective	Absorbing	8B parameters; competitive with LLaMA3 8B zero-shot.

Sources: arXiv:2310.16834 v3 for SEDD;^[1] the MDLM and MD4 arXiv preprints;^[7]^[8] LLaDA reporting.^[12]

SEDD sits at the intersection of three lines of research: continuous-time denoising diffusion generative models, masked language modeling in the spirit of BERT, and energy-based probabilistic modeling, where modeling unnormalized ratios is a familiar device for avoiding intractable partition functions. The Diffusion Transformer architecture from Peebles and Xie supplied the conditioning backbone. Successors in the diffusion language model line include MDLM, MD4, LLaDA, and the commercial Mercury family from Inception Labs.^[1]^[7]^[8]^[12]^[13]

References

1. Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution", arXiv, 2023-10-25 (v1) / 2024-06-06 (v3). https://arxiv.org/abs/2310.16834. Accessed 2026-05-21.
2. Aaron Lou, "Language Modeling by Estimating the Ratios of the Data Distribution", aaronlou.com blog, 2024. https://aaronlou.com/blog/2024/discrete-diffusion/. Accessed 2026-05-21.
3. Cindy Duong, "Congratulations to Aaron Lou, Chenlin Meng, and Stefano Ermon for an ICML 2024 Best Paper Award!", Stanford Artificial Intelligence Laboratory, 2024-07-26. https://ai.stanford.edu/news/congratulations-to-aaron-lou-chenlin-meng-and-stefano-ermon-for-an-icml-2024-best-paper-award/. Accessed 2026-05-21.
4. AIhub editorial team, "Congratulations to the #ICML2024 award winners", AIhub, 2024-07-25. https://aihub.org/2024/07/25/congratulations-to-the-icml2024-award-winners/. Accessed 2026-05-21.
5. ICML, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (Poster, Best Paper)", icml.cc virtual program, 2024. https://icml.cc/virtual/2024/poster/34686. Accessed 2026-05-21.
6. Aaron Lou, "louaaron/Score-Entropy-Discrete-Diffusion (README)", GitHub, 2024. https://github.com/louaaron/Score-Entropy-Discrete-Diffusion. Accessed 2026-05-21.
7. Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, Volodymyr Kuleshov, "Simple and Effective Masked Diffusion Language Models", arXiv, 2024-06-11 (revised 2024-11-10). https://arxiv.org/abs/2406.07524. Accessed 2026-05-21.
8. Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias, "Simplified and Generalized Masked Diffusion for Discrete Data", arXiv, 2024-06-06 (revised 2025-01-16). https://arxiv.org/abs/2406.04329. Accessed 2026-05-21.
9. Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (HTML v3)", arXiv, 2024. https://arxiv.org/html/2310.16834v3. Accessed 2026-05-21.
10. Aaron Lou, "Aaron Lou (personal site)", aaronlou.com, 2024. https://aaronlou.com/. Accessed 2026-05-21.
11. Greg Schoeninger, "ArXiv Dives: Text Diffusion with SEDD", Oxen.ai blog, 2024. https://ghost.oxen.ai/arxiv-dives-text-diffusion-with-sedd/. Accessed 2026-05-21.
12. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "Large Language Diffusion Models (LLaDA)", arXiv, 2025-02. https://arxiv.org/abs/2502.09992. Accessed 2026-05-21.
13. The Batch editorial team, "Mercury Coder May Be the First Commercially Available Language Diffusion Model", DeepLearning.AI The Batch, 2025. https://www.deeplearning.ai/the-batch/mercury-coder-may-be-the-first-commercially-available-language-diffusion-model/. Accessed 2026-05-21.
14. Aaron Lou, "louaaron/sedd-small (model card)", Hugging Face, 2024. https://huggingface.co/louaaron/sedd-small. Accessed 2026-05-21.
15. Justin Deschenaux, Caglar Gulcehre, "Promises, Outlooks and Challenges of Diffusion Language Modeling", arXiv, 2024-06-17. https://arxiv.org/html/2406.11473v1. Accessed 2026-05-21.
16. Alex Carlin, "Score entropy discrete diffusion models for protein design", alexcarlin.bearblog.dev, 2024. https://alexcarlin.bearblog.dev/score-entropy-discrete-diffusion-models-for-protein-design/. Accessed 2026-05-21.
17. Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete diffusion modeling by estimating the ratios of the data distribution", Proceedings of the 41st International Conference on Machine Learning (ICML 2024), 2024. https://dl.acm.org/doi/10.5555/3692070.3693403. Accessed 2026-05-21.
