SEDD (Score Entropy Discrete Diffusion)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,856 words
Score Entropy Discrete Diffusion (SEDD) is a discrete diffusion model for language and other discrete data introduced by Aaron Lou, Chenlin Meng, and Stefano Ermon at Stanford University in the paper Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (arXiv:2310.16834, October 2023).[^1] The core contribution is a novel loss function called score entropy, which the authors describe as a principled extension of score matching to discrete state spaces; this loss trains a network to estimate the ratios between marginal data distributions at adjacent noise levels, rather than estimating the distribution itself.[^1][^2] SEDD was the first non-autoregressive language model to match a well-known GPT-2-scale autoregressive transformer on standard perplexity benchmarks; the paper received a Best Paper Award at ICML 2024, one of ten such honors that year.[^3][^4][^5] The released codebase and Hugging Face checkpoints made SEDD a reference implementation that subsequent discrete diffusion language models (notably MDLM and MD4) compare against and build upon.[^6][^7][^8]
Continuous-state denoising diffusion models had, by 2023, become the dominant approach for image, audio, and video generation, but they had not been competitive with autoregressive transformers on natural language. The standard recipe of corrupting data with Gaussian noise and recovering it by learning a Stein score does not transfer to text, because tokens live in a finite vocabulary with no continuous structure.[^1][^2] Several prior frameworks adapted diffusion to discrete data: D3PM defined discrete Markov forward processes with categorical transition kernels; Concrete Score Matching attempted to learn neighboring-probability ratios with an L2 loss; and continuous-relaxation approaches such as PLAID embedded tokens into a continuous space before diffusing. None of these had matched the quality of a similarly sized autoregressive baseline on text, with reported perplexities typically much higher than GPT-2 small at the time SEDD was released.[^1][^2]
The SEDD paper traces its motivation to this gap. The authors argue that the right object to learn for a discrete diffusion model is the concrete score, a vector of ratios p_t(y)/p_t(x) for neighboring states y of the current state x, and that the obstacle had been the absence of a stable, mode-covering loss for those positive-valued targets.[^1][^9] Score entropy is the loss they propose to close that gap.[^1]
Aaron Lou completed the work as a Stanford computer science PhD student advised by Stefano Ermon. Chenlin Meng, also a Stanford PhD with Ermon, is a co-author and a co-founder of Pika; Lou has since taken a position leading the Strategic Explorations team at OpenAI.[^10]
SEDD operates on sequences of tokens from a finite vocabulary. The forward process is a continuous-time Markov chain (CTMC) governed by a family of transition-rate matrices Q_t, with non-negative off-diagonal entries and zero column sums, so the marginal distributions p_t evolve under the linear ODE dp_t/dt = Q_t p_t. The reverse-time process is itself a CTMC whose rates depend on the data only through the ratios p_t(y)/p_t(x) of the marginal distribution at adjacent states; learning those ratios is therefore sufficient to simulate generation, much as learning the Stein score is sufficient in continuous diffusion.[^1][^9]
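The role of the ratios can be checked numerically on a toy state space. The sketch below assumes the usual form of the reverse rates, in which the rate of jumping from x to y in reverse time is (p(y)/p(x)) times the forward rate from y to x. The helper names `random_rate_matrix` and `reverse_rates` are illustrative and do not come from the SEDD codebase.

```python
import numpy as np

def random_rate_matrix(n, rng):
    """Toy generator with the column convention: Q[y, x] is the rate x -> y."""
    Q = rng.random((n, n))
    np.fill_diagonal(Q, 0.0)
    Q -= np.diag(Q.sum(axis=0))  # set each diagonal so the column sums to zero
    return Q

def reverse_rates(Q, p):
    """Assumed reverse generator: Qbar[y, x] = p[y] / p[x] * Q[x, y] for y != x."""
    ratio = p[:, None] / p[None, :]      # ratio[y, x] = p[y] / p[x]
    Qbar = ratio * Q.T                   # Q.T[y, x] = Q[x, y]
    np.fill_diagonal(Qbar, 0.0)
    Qbar -= np.diag(Qbar.sum(axis=0))    # restore zero column sums
    return Qbar

rng = np.random.default_rng(0)
Q = random_rate_matrix(5, rng)
p = rng.random(5) + 0.1
p /= p.sum()                             # any strictly positive marginal works

Qbar = reverse_rates(Q, p)
forward_drift = Q @ p                    # dp/dt under the forward chain
reverse_drift = Qbar @ p                 # dp/dt under the reverse chain

# The reverse chain undoes the forward flow of probability mass.
print(np.allclose(reverse_drift, -forward_drift))
```

Only the ratios p(y)/p(x) and the known forward rates enter `reverse_rates`, which is why a network that estimates the ratios is enough to run the chain backward.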
The vocabulary-sized Q_t is generally too large to store, so SEDD uses two structured graph kernels that admit closed-form transition matrices:
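- Uniform: every token transitions to every other token at the same rate, so the stationary distribution is uniform over the vocabulary.
- Absorbing: every token transitions only to a special MASK state, in the spirit of BERT-style masked language modeling. The stationary distribution puts all mass on the fully masked sequence.[^1][^9]

Empirically, the absorbing variant (SEDD Absorb) outperforms the uniform variant on text and is the configuration used for the headline GPT-2 comparisons.[^1][^11]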
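A small sketch of the two kernels shows why closed-form transition matrices are available. It assumes a unit-rate normalization in which Q + I is idempotent (the paper's own scaling may differ), so that exp(σQ) = e^{-σ} I + (1 - e^{-σ})(Q + I). The function names and the series-based `expm_series` check are illustrative only.

```python
import numpy as np

def uniform_rate(n):
    """Jump to a uniformly random token at unit rate (column convention)."""
    return np.ones((n, n)) / n - np.eye(n)

def absorbing_rate(n, mask):
    """Jump only to the MASK state at unit rate; MASK itself never leaves."""
    Q = -np.eye(n)
    Q[mask, :] += 1.0
    return Q

def closed_form_transition(Q, sigma):
    """exp(sigma * Q), valid under the assumption that Q + I is idempotent."""
    n = Q.shape[0]
    A = Q + np.eye(n)
    return np.exp(-sigma) * np.eye(n) + (1.0 - np.exp(-sigma)) * A

def expm_series(M, terms=40):
    """Plain Taylor series for the matrix exponential, used only as a check."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

n, mask, sigma = 6, 5, 0.7
Q_uniform = uniform_rate(n)
Q_absorb = absorbing_rate(n, mask)

P_uniform = closed_form_transition(Q_uniform, sigma)
P_absorb = closed_form_transition(Q_absorb, sigma)

print(np.allclose(P_uniform, expm_series(sigma * Q_uniform)))
print(np.allclose(P_absorb, expm_series(sigma * Q_absorb)))

# After a long time the absorbing chain sits on MASK with probability ~1.
P_late = closed_form_transition(Q_absorb, 50.0)
```

Each column of `P_uniform` or `P_absorb` is the distribution of a noised token given its clean value, so a single token can be corrupted without ever materializing a vocabulary-sized matrix.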
In continuous diffusion, score matching learns the gradient of the log density, which is unconstrained in sign. In the discrete setting the analogous target is a positive ratio, and a naive L2 loss on those ratios (Concrete Score Matching) has no mechanism to penalize negative or zero predictions. The SEDD paper shows that this leads to estimators that can place mass on invalid outputs and that the gradients are not well behaved near zero.[^1][^9] Earlier discrete-diffusion paradigms such as D3PM avoid the ratio entirely and instead learn a mean prediction of the clean data, but doing so requires variational bounds that become loose in the continuous-time limit and that do not factor neatly across positions.[^1]
The proposed loss is a Bregman divergence built from K(a) = a(log a - 1), defined for a network output s_θ(x) that predicts the vector of neighboring ratios at state x:
L_SE = E_{x~p}[ sum_{y != x} w_{xy} ( s_θ(x)_y - (p(y)/p(x)) log s_θ(x)_y + K(p(y)/p(x)) ) ]
with non-negative weights w_{xy} and a positive network parameterization (for example, an exponentiated output). Three properties are emphasized in the paper:[^1][^9]
- The loss is minimized when s_θ equals the true ratios p(y)/p(x). The convexity of the underlying Bregman divergence ensures stable optimization and a log-barrier that excludes negative outputs.
- When p is a noisy marginal p_t = E_{x_0}[p_t(. | x_0)] produced by the forward CTMC, the score entropy is equivalent up to an additive constant to a denoising score entropy in which the unknown ratios p_t(y)/p_t(x) are replaced by the conditional ratios p_t(y | x_0)/p_t(x | x_0). Those conditional ratios are available in closed form for the uniform and absorbing kernels, so a Monte Carlo estimator needs only one forward pass per sample.
- Choosing w_{xy} to match the transition rates of the forward CTMC yields an evidence lower bound on the model likelihood, so minimizing the weighted score entropy bounds perplexity.[^1][^9]

For the absorbing kernel the loss reduces (up to terms determined by the noise schedule) to a sum of cross-entropy losses against the clean tokens at masked positions, with weights set by the schedule; this connection is the bridge that later masked-diffusion work (MDLM and MD4) builds on.[^7][^8]
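The first two properties can be checked on a four-state toy problem. The sketch below evaluates the score entropy exactly (no sampling) with unit weights and a uniform kernel in the unit-rate normalization; those choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def K(a):
    return a * (np.log(a) - 1.0)

def score_entropy(s, p):
    """Exact score entropy with unit weights; s[x, y] estimates p[y] / p[x]."""
    n = len(p)
    off = ~np.eye(n, dtype=bool)             # only pairs with y != x contribute
    ratio = p[None, :] / p[:, None]
    terms = s - ratio * np.log(s) + K(ratio)
    return float(np.sum(p[:, None] * terms * off))

def denoising_score_entropy(s, p0, P):
    """Same loss with conditional ratios; P[x, x0] = p_t(x | x0)."""
    n = len(p0)
    off = ~np.eye(n, dtype=bool)
    total = 0.0
    for x0 in range(n):
        cond = P[:, x0]
        ratio = cond[None, :] / cond[:, None]
        terms = s - ratio * np.log(s) + K(ratio)
        total += p0[x0] * np.sum(cond[:, None] * terms * off)
    return float(total)

rng = np.random.default_rng(0)
n, sigma = 4, 0.8
p0 = np.array([0.5, 0.3, 0.15, 0.05])        # toy clean-data distribution
P = np.exp(-sigma) * np.eye(n) + (1.0 - np.exp(-sigma)) * np.ones((n, n)) / n
pt = P @ p0                                  # noisy marginal p_t

true_ratio = pt[None, :] / pt[:, None]
se_at_truth = score_entropy(true_ratio, pt)  # zero at the true ratios

# Positive "network outputs" via exponentiation, as the text suggests.
s_a = np.exp(rng.normal(size=(n, n)))
s_b = np.exp(rng.normal(size=(n, n)))
se_random = score_entropy(s_a, pt)           # strictly positive away from truth

# The two losses differ by a constant that does not depend on s.
gap_a = denoising_score_entropy(s_a, p0, P) - score_entropy(s_a, pt)
gap_b = denoising_score_entropy(s_b, p0, P) - score_entropy(s_b, pt)
print(se_at_truth, se_random, gap_a - gap_b)
```

Because the gap is independent of the prediction, both losses have the same gradients with respect to the network, and the denoising form is the one that can be estimated from samples.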
At inference time SEDD initializes a sequence from the stationary distribution (all tokens drawn uniformly for the uniform kernel, or a fully masked sequence for the absorbing kernel) and integrates the reverse CTMC using the learned ratios. The paper introduces an analytical sampler that exploits the closed-form transition matrices and a Tweedie-style correction; this allows running with far fewer denoising steps than the sequence length while still matching baseline quality. The authors report that 1,024-token samples reach GPT-2 quality with roughly 32 times fewer network evaluations than the sequence length, that is, about 32 denoising steps instead of 1,024.[^1][^9] Because the model parameterizes ratios directly, conditioning on tokens at arbitrary positions (left, right, middle, or scattered) is a Bayes-rule rearrangement of the ratios and requires no additional training, enabling zero-shot infilling.[^1][^2]
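The shape of absorbing-kernel sampling and infilling can be illustrated with a deliberately simplified loop. This is not the paper's analytical sampler: it assumes a linear masking schedule (a token is masked with probability t at time t), under which a reverse step from t to s unmasks each masked position with probability (t - s)/t, and it uses a hypothetical `toy_denoiser` that returns uniform token probabilities in place of a trained network.

```python
import numpy as np

MASK = -1  # sentinel id for the absorbing state (illustrative choice)

def toy_denoiser(tokens, vocab_size):
    """Hypothetical stand-in for the network: uniform probabilities per position."""
    return np.full((len(tokens), vocab_size), 1.0 / vocab_size)

def sample(length, vocab_size, num_steps, prompt=None, seed=0):
    """Simplified reverse process for the absorbing kernel, with optional infilling."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    for pos, tok in (prompt or {}).items():
        tokens[pos] = tok                     # clamp prompt tokens at any position
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        probs = toy_denoiser(tokens, vocab_size)   # one model call per step
        masked = np.flatnonzero(tokens == MASK)
        # Assumed linear schedule: unmask each masked slot w.p. (t - s) / t.
        chosen = masked[rng.random(len(masked)) < (t - s) / t]
        for pos in chosen:
            tokens[pos] = rng.choice(vocab_size, p=probs[pos])
    return tokens

# 16 tokens in 4 model calls, with tokens fixed at both ends of the sequence.
out = sample(length=16, vocab_size=5, num_steps=4, prompt={0: 3, 15: 1})
print(out)
```

Prompt positions are never masked, so they are never resampled, and the final step unmasks every remaining position because (t - 0)/t equals 1. The number of model calls is `num_steps`, independent of the sequence length.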
The network is a Diffusion Transformer (DiT)-style encoder-only Transformer with adaptive layer-norm time conditioning and rotary positional embeddings. SEDD Small (about the parameter count of GPT-2 small) and SEDD Medium (about the parameter count of GPT-2 medium) are released. Training uses a batch size of 512, learning rate 3e-4, linear warmup over the first 2,000 iterations, gradient-norm clipping at 1, and an EMA decay of 0.9999, on nodes of 8 A100 80GB or 16 A100 40GB GPUs.[^11][^9]
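The reported hyperparameters can be collected into a small configuration sketch. The `TrainConfig` class, `learning_rate`, and `ema_update` below are hypothetical names, and the assumption that the learning rate stays constant after warmup is ours; only the numeric values come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 512
    lr: float = 3e-4
    warmup_steps: int = 2000
    grad_clip_norm: float = 1.0
    ema_decay: float = 0.9999

def learning_rate(step, cfg):
    """Linear warmup to cfg.lr, then constant (the constant tail is an assumption)."""
    return cfg.lr * min(1.0, step / cfg.warmup_steps)

def ema_update(ema_params, params, decay):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

cfg = TrainConfig()
lr_start = learning_rate(0, cfg)
lr_mid = learning_rate(1000, cfg)       # halfway through warmup
lr_late = learning_rate(10_000, cfg)    # after warmup
ema = ema_update([0.0, 1.0], [1.0, 1.0], cfg.ema_decay)
print(lr_start, lr_mid, lr_late, ema)
```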
SEDD is trained on OpenWebText for the zero-shot GPT-2 comparison, on the One Billion Words corpus for likelihood evaluation, and on character-level text8 for ablations. The headline result is that across five of the standard zero-shot perplexity datasets used to evaluate GPT-2, the upper-bound perplexities of SEDD Absorb are competitive with the GPT-2 numbers at comparable model size: at both sizes SEDD Absorb is lower on WikiText2, PTB, and WikiText103 and higher on LAMBADA and One Billion Words.[^1][^9]
| Dataset | GPT-2 Small | SEDD Absorb Small | GPT-2 Medium | SEDD Absorb Medium |
|---|---|---|---|---|
| LAMBADA | 45.04 | <= 50.92 | 35.66 | <= 42.77 |
| WikiText2 | 42.43 | <= 41.84 | 31.80 | <= 31.04 |
| PTB | 138.43 | <= 114.24 | 123.14 | <= 87.12 |
| WikiText103 | 41.60 | <= 40.62 | 31.39 | <= 29.98 |
| One Billion Words | 75.20 | <= 79.29 | 55.72 | <= 61.19 |
Values are perplexities; lower is better. SEDD numbers are upper bounds because the model defines a likelihood through an ELBO. Sources: arXiv:2310.16834 v3 and the project blog post.[^1][^9]
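The comparison in the table can be tallied directly. The sketch below copies the table's values and computes, for each model size, the relative gap between the SEDD upper bound and the GPT-2 perplexity; the variable and function names are illustrative.

```python
# (GPT-2 Small, SEDD Absorb Small, GPT-2 Medium, SEDD Absorb Medium)
perplexity = {
    "LAMBADA": (45.04, 50.92, 35.66, 42.77),
    "WikiText2": (42.43, 41.84, 31.80, 31.04),
    "PTB": (138.43, 114.24, 123.14, 87.12),
    "WikiText103": (41.60, 40.62, 31.39, 29.98),
    "One Billion Words": (75.20, 79.29, 55.72, 61.19),
}

def relative_gap(gpt2, sedd):
    """Negative values mean the SEDD upper bound is below the GPT-2 perplexity."""
    return (sedd - gpt2) / gpt2

small_wins = [d for d, (g, s, _, _) in perplexity.items() if s <= g]
medium_wins = [d for d, (_, _, g, s) in perplexity.items() if s <= g]

for name, (g_s, s_s, g_m, s_m) in perplexity.items():
    print(f"{name:18s} small {relative_gap(g_s, s_s):+.1%}  "
          f"medium {relative_gap(g_m, s_m):+.1%}")
```

At both sizes the SEDD bound is lower on WikiText2, PTB, and WikiText103 and higher on LAMBADA and One Billion Words. Because the SEDD figures are upper bounds, the true gaps can only be more favorable to SEDD than the ones printed here.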
Against the existing crop of discrete diffusion baselines (D3PM, SEDD's predecessors, and continuous-embedding diffusion such as PLAID), the paper reports 25 to 75 percent reductions in perplexity at comparable model sizes.[^1][^4] On unconditional text generation, the authors report that SEDD produces samples whose generative perplexity (as measured by a stronger evaluator model) is 6 to 8 times better than un-annealed GPT-2 samples; that is, SEDD does not require temperature scaling or nucleus sampling to avoid degenerate outputs. The MAUVE divergence against the reference distribution is comparable to that of nucleus-sampled GPT-2.[^1][^9] The paper also demonstrates infilling tasks where prompts appear at arbitrary positions, which is awkward for a left-to-right GPT model but native to SEDD.[^1]
SEDD anchored a wave of follow-up work that simplified or generalized the framework.
The original SEDD reference implementation lives in the GitHub repository louaaron/Score-Entropy-Discrete-Diffusion under the MIT license, with pretrained checkpoints louaaron/sedd-small and louaaron/sedd-medium released on Hugging Face. The codebase exposes both the absorbing and uniform graphs and the analytical sampler.[^6][^14]
The headline application of SEDD is non-autoregressive language modeling. By matching a GPT-2-scale autoregressive baseline at the same parameter count, SEDD was widely read as the first credible empirical demonstration that the autoregressive next-token paradigm is not the only viable route to a competitive language model.[^1][^2][^15] Three properties of SEDD are repeatedly cited in subsequent work and commentary:
Beyond language, the score entropy framework applies to any discrete generative task. Subsequent work has explored protein sequence design and other categorical-data domains using SEDD-style ratio modeling.[^16]
The SEDD paper and independent evaluations are explicit about several weaknesses.
The paper appeared on arXiv on 25 October 2023; the third revision was posted on 6 June 2024.[^1] It was accepted as an oral and a Best Paper at the 41st International Conference on Machine Learning, held in Vienna, Austria from 21 to 27 July 2024, where it was one of ten Best Paper award recipients.[^4][^5] Stanford AI Lab published a congratulatory announcement dated 26 July 2024.[^3] The paper is published in the ICML 2024 proceedings.[^17]
| System | Year | Loss / training objective | Forward process | Notes |
|---|---|---|---|---|
| D3PM (Austin et al.) | 2021 | Variational ELBO on mean prediction | Discrete Markov chain (categorical) | Tractable but loose continuous-time bounds. |
| Concrete Score Matching | 2022 | L2 on ratio estimates | Continuous-time discrete | No positivity constraint; unstable. |
| SEDD | 2023 | Score entropy / denoising score entropy | CTMC, absorbing or uniform | First non-AR LM to match GPT-2 at comparable size. |
| MDLM | 2024 | Rao-Blackwellized weighted cross-entropy | Absorbing only | Outperforms SEDD on LM1B at matched compute. |
| MD4 | 2024 | Simple weighted integral of cross-entropies | Absorbing, with state-dependent schedules | Outperforms SEDD on most reported benchmarks. |
| LLaDA | 2025 | Masked diffusion objective | Absorbing | 8B parameters; competitive with LLaMA3 8B zero-shot. |
Sources: arXiv:2310.16834 v3 for SEDD;[^1] the MDLM and MD4 arXiv preprints;[^7][^8] LLaDA reporting.[^12]
SEDD sits at the intersection of three lines of research: continuous-time denoising diffusion generative models, masked language modeling in the spirit of BERT, and energy-based probabilistic modeling, where modeling unnormalized ratios is a familiar device for avoiding intractable partition functions. The Diffusion Transformer architecture from Peebles and Xie supplied the conditioning backbone. Followers in the diffusion language models line include MDLM, MD4, LLaDA, and the commercial Mercury family from Inception Labs.[^1][^7][^8][^12][^13]