# SEDD (Score Entropy Discrete Diffusion)

> Source: https://aiwiki.ai/wiki/sedd
> Updated: 2026-07-16
> Categories: Diffusion Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Score Entropy Discrete Diffusion (SEDD)** is a discrete [diffusion model](/wiki/diffusion_model) for language and other discrete data introduced by Aaron Lou, Chenlin Meng, and Stefano Ermon at [Stanford University](/wiki/stanford_university) in the paper *Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution* (arXiv:2310.16834, October 2023).[^1] The core contribution is a loss function called *score entropy*, which the authors describe as a principled extension of [score matching](/wiki/score_matching) to discrete state spaces; this loss trains a network to estimate the ratios `p_t(y)/p_t(x)` between the probabilities of neighboring states under the noised data distribution, rather than estimating the distribution itself.[^1][^2] SEDD was the first non-autoregressive language model to match a [GPT-2](/wiki/gpt-2)-scale [autoregressive transformer](/wiki/generative_pre-trained_transformer) on standard perplexity benchmarks; the paper received a Best Paper Award at [ICML](/wiki/icml) 2024, one of ten such awards that year.[^3][^4][^5] The released codebase and Hugging Face checkpoints made SEDD a reference implementation that subsequent discrete [diffusion language models](/wiki/diffusion_language_models) (notably MDLM and MD4) compare against and build upon.[^6][^7][^8]

## Background

Continuous-state [denoising diffusion](/wiki/ddpm) models had, by 2023, become the dominant approach for image, audio, and video generation, but they had not been competitive with autoregressive [transformers](/wiki/transformer) on natural language. The standard recipe of corrupting data with Gaussian noise and recovering it by learning a Stein score does not transfer to text, because tokens live in a finite vocabulary with no continuous structure.[^1][^2] Several prior frameworks adapted diffusion to discrete data: D3PM defined discrete Markov forward processes with categorical transition kernels; Concrete Score Matching attempted to learn neighboring-probability ratios with an L2 loss; and continuous-relaxation approaches such as PLAID embedded tokens into a continuous space before diffusing. None of these had matched the quality of a similarly sized autoregressive baseline on text, with reported perplexities typically much higher than [GPT-2](/wiki/gpt-2) small at the time SEDD was released.[^1][^2]

The SEDD paper traces its motivation to this gap. The authors argue that the right object to learn for a discrete diffusion model is the *concrete score*, a vector of ratios `p_t(y)/p_t(x)` for neighboring states `y` of the current state `x`, and that the obstacle had been the absence of a stable, mode-covering loss for those positive-valued targets.[^1][^9] Score entropy is the loss they propose to close that gap.[^1]

Aaron Lou completed the work as a [Stanford](/wiki/stanford_university) computer science PhD student advised by Stefano Ermon. Chenlin Meng, also a Stanford PhD with Ermon, is a co-author and a co-founder of Pika; Lou has since taken a position leading the Strategic Explorations team at [OpenAI](/wiki/openai).[^10]

## Technical Details

### Discrete diffusion framework

SEDD operates on sequences of tokens from a finite vocabulary. The forward process is a continuous-time Markov chain (CTMC) governed by a family of transition-rate matrices `Q_t`, with non-negative off-diagonal entries and zero column sums, so the marginal densities `p_t` evolve under the linear ODE `dp_t/dt = Q_t p_t`. The reverse-time process is itself a CTMC whose rates depend on the data only through the ratios `p_t(y)/p_t(x)` of the marginal distribution at adjacent states; learning those ratios is therefore sufficient to simulate generation, much as learning the Stein score is sufficient in continuous diffusion.[^1][^9]
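The forward ODE and the ratio-dependent reverse rates can be illustrated on a four-state toy chain. The sketch below is not taken from the SEDD codebase: the function names, the time-constant `Q`, and the normalization of the uniform rate matrix are assumptions made for illustration. It uses the convention that `Q[y, x]` is the rate of jumping from `x` to `y`, so columns sum to zero.

```python
import numpy as np

def uniform_rate_matrix(n):
    """Rate matrix with Q[y, x] = rate of jumping from x to y; columns sum to zero."""
    return np.ones((n, n)) / n - np.eye(n)

def euler_marginal(Q, p0, t, steps=20_000):
    """Integrate dp/dt = Q p with explicit Euler (Q is held constant in time here)."""
    p = p0.copy()
    dt = t / steps
    for _ in range(steps):
        p = p + dt * (Q @ p)
    return p

def reverse_rate_matrix(Q, p):
    """Reverse rates R[y, x] = (p[y] / p[x]) * Q[x, y] for y != x.

    The diagonal is then set so that every column of R sums to zero.
    """
    R = Q.T * (p[:, None] / p[None, :])
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(R, -R.sum(axis=0))
    return R

Q = uniform_rate_matrix(4)
p0 = np.array([0.7, 0.2, 0.1, 0.0])   # toy data distribution (assumed)
p_t = euler_marginal(Q, p0, t=0.5)    # noised marginal at time t
R = reverse_rate_matrix(Q, p_t)       # depends on p_t only through ratios

# The reverse generator undoes the forward flow: R p_t = -(Q p_t).
print(np.allclose(R @ p_t, -(Q @ p_t)))
```

The reverse matrix is built from `Q` and the ratios `p_t(y)/p_t(x)` alone, which is why a network that estimates those ratios is enough to run generation.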

The vocabulary-sized `Q_t` is generally too large to store, so SEDD uses two structured graph kernels that admit closed-form transition matrices:

- **Uniform**: every token transitions to any other token in the vocabulary at an equal rate, with a uniform stationary distribution.
- **Absorbing (mask)**: tokens are independently absorbed into a special `MASK` state, in the spirit of [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)-style [masked language modeling](/wiki/masked_language_model). The stationary distribution puts all mass on the fully masked sequence.[^1][^9]

Empirically, the absorbing variant (SEDD Absorb) outperforms the uniform variant on text and is the configuration used for the headline GPT-2 comparisons.[^1][^11]
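A minimal sketch of why these two kernels are convenient: for both, the transition matrix after a cumulative noise level `sigma` has a simple closed form, so no matrix exponential over the vocabulary is needed. The rate normalization, the choice of the last index as the mask state, and the function names are assumptions for illustration and may differ from the scaling used in the paper.

```python
import numpy as np

def uniform_transition(n, sigma):
    """exp(sigma * Q) for Q = 11^T / n - I.

    A token is kept with probability e^-sigma, otherwise resampled uniformly.
    """
    keep = np.exp(-sigma)
    return keep * np.eye(n) + (1.0 - keep) * np.ones((n, n)) / n

def absorbing_transition(n, sigma):
    """exp(sigma * Q) for the absorbing graph; index n - 1 plays the MASK role."""
    keep = np.exp(-sigma)
    P = keep * np.eye(n)
    P[n - 1, :] += 1.0 - keep   # each token moves to MASK with prob. 1 - e^-sigma
    return P

# Column x of each matrix is the distribution of the noised token given clean token x.
P_uniform = uniform_transition(5, sigma=0.3)
P_absorb = absorbing_transition(5, sigma=0.3)
print(P_absorb[:, 0])   # clean token 0: stays put or jumps to MASK
print(P_absorb[:, 4])   # MASK never leaves the mask state
```

As `sigma` grows, the uniform kernel approaches the uniform distribution in every column, and the absorbing kernel puts all mass on the mask state.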

### Why naive extensions of score matching fail

In continuous diffusion, score matching learns the gradient of the log density, which is unconstrained in sign. In the discrete setting the analogous target is a positive ratio, and a naive L2 loss on those ratios (Concrete Score Matching) has no mechanism to penalize negative or zero predictions. The SEDD paper shows that this leads to estimators that can place mass on invalid outputs and that the gradients are not well behaved near zero.[^1][^9] Earlier discrete-diffusion paradigms such as D3PM avoid the ratio entirely and instead learn a *mean* prediction of the clean data, but doing so requires variational bounds that become loose in the continuous-time limit and that do not factor neatly across positions.[^1]
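The contrast can be seen on a single scalar ratio. The sketch below compares a squared-error penalty with one summand of the score entropy loss defined in the next subsection; the target ratio `r = 0.5` and the function names are illustrative assumptions.

```python
import math

def l2_ratio_loss(s, r):
    """Squared error on a ratio estimate s against a target ratio r."""
    return (s - r) ** 2

def score_entropy_term(s, r):
    """One summand of score entropy: s - r*log(s) + r*(log(r) - 1); needs s > 0."""
    return s - r * math.log(s) + r * (math.log(r) - 1.0)

r = 0.5
for s in (-0.1, 1e-6, 0.5, 2.0):
    l2 = l2_ratio_loss(s, r)
    # The log term is undefined for s <= 0, which is what enforces positivity.
    se = score_entropy_term(s, r) if s > 0 else float("inf")
    print(f"s={s:>8}: L2={l2:.4f}  score-entropy={se:.4f}")
```

The squared error stays small and finite for an invalid negative estimate, while the log term grows without bound as the estimate approaches zero from above.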

### The score entropy objective

The proposed loss is built on the Bregman divergence generated by `-log`, with `K(a) = a(log a - 1)` as a normalizing term that makes the loss non-negative. It is defined for a network output `s_θ(x)` that predicts the vector of neighboring ratios at state `x`:

```
L_SE = E_{x~p}[ sum_{y != x} w_{xy} ( s_θ(x)_y  -  (p(y)/p(x)) log s_θ(x)_y  +  K(p(y)/p(x)) ) ]
```

with non-negative weights `w_{xy}` and a positive network parameterization (for example, an exponentiated output). Three properties are emphasized in the paper:[^1][^9]

1. **Proper.** The unique minimizer over `s_θ` equals the true ratios `p(y)/p(x)`. The loss is convex in `s_θ`, which helps keep optimization stable, and its `-log s_θ` term acts as a log-barrier that rules out zero or negative outputs.
2. **Denoising.** Theorem 3.4 of the paper shows that, when `p` is a noisy marginal `p_t = E_{x_0}[p_t(. | x_0)]` produced by the forward CTMC, the score entropy is equivalent up to an additive constant to a *denoising score entropy* in which the unknown ratios `p_t(y)/p_t(x)` are replaced by the conditional ratios `p_t(y | x_0)/p_t(x | x_0)`. Those conditional ratios are available in closed form for the uniform and absorbing kernels, so a Monte Carlo estimator needs only one forward pass per sample.
3. **Weighted.** Choosing the weights `w_{xy}` to match the transition rates of the forward CTMC yields an evidence lower bound on the model likelihood, so minimizing the weighted score entropy bounds perplexity.[^1][^9]
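The first two properties can be checked numerically on a toy state space. This is a sketch under stated assumptions, not the paper's implementation: every pair of distinct states counts as neighbors, the weights default to 1, the noise comes from a uniform kernel with an arbitrary noise level, and all names are hypothetical.

```python
import numpy as np

def K(a):
    return a * (np.log(a) - 1.0)

def score_entropy(s, p, w=None):
    """s[x, y] estimates p[y] / p[x]; sum over y != x, averaged over x ~ p."""
    n = len(p)
    w = np.ones((n, n)) if w is None else w
    r = p[None, :] / p[:, None]                  # r[x, y] = p[y] / p[x]
    off = ~np.eye(n, dtype=bool)
    terms = w * (s - r * np.log(s) + K(r))
    return float((p[:, None] * terms * off).sum())

def denoising_score_entropy(s, p0, P, w=None):
    """Same loss with conditional ratios; P[x, x0] = p_t(x | x0)."""
    n = len(p0)
    w = np.ones((n, n)) if w is None else w
    off = ~np.eye(n, dtype=bool)
    total = 0.0
    for x0 in range(n):
        c = P[:, x0]                             # conditional marginal p_t(. | x0)
        r = c[None, :] / c[:, None]              # p_t(y | x0) / p_t(x | x0)
        terms = w * (s - r * np.log(s) + K(r))
        total += p0[x0] * (c[:, None] * terms * off).sum()
    return float(total)

rng = np.random.default_rng(0)
n = 4
keep = np.exp(-0.7)                              # arbitrary noise level (assumed)
P = keep * np.eye(n) + (1 - keep) / n            # uniform-kernel transition matrix
p0 = np.array([0.6, 0.3, 0.1, 0.0])              # toy clean distribution
p_t = P @ p0                                     # noisy marginal
true_ratios = p_t[None, :] / p_t[:, None]

s_a = np.exp(rng.normal(size=(n, n)))            # two arbitrary positive outputs
s_b = np.exp(rng.normal(size=(n, n)))
gap_a = denoising_score_entropy(s_a, p0, P) - score_entropy(s_a, p_t)
gap_b = denoising_score_entropy(s_b, p0, P) - score_entropy(s_b, p_t)

print(score_entropy(true_ratios, p_t))           # loss at the true ratios
print(gap_a, gap_b)                              # gap between the two objectives
```

The loss vanishes at the true ratios, and the gap between the two objectives does not depend on the network output, so both have the same minimizer even though only the denoising form is computable from samples.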

For the absorbing kernel the loss reduces (up to terms determined by the noise schedule) to a sum of [cross-entropy](/wiki/cross-entropy) losses against the clean tokens at masked positions, with weights set by the schedule; this connection is the bridge that later masked-diffusion work (MDLM and MD4) builds on.[^7][^8]
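The shape of that reduced objective can be sketched as a cross-entropy restricted to masked positions. Everything specific below is an assumption for illustration: the `MASK` sentinel, the random stand-in logits, and in particular the weight `1 / t`, which corresponds to a linear schedule that masks each token with probability `t`; the exact weighting in SEDD, MDLM, or MD4 depends on their noise schedules.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for the mask token

def masked_cross_entropy(logits, clean_tokens, noisy_tokens, weight):
    """Weighted cross-entropy against clean tokens, counted only at masked positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(clean_tokens)), clean_tokens]
    is_masked = noisy_tokens == MASK
    return float(weight * token_nll[is_masked].sum())

rng = np.random.default_rng(0)
vocab, length, t = 8, 6, 0.5
clean = rng.integers(0, vocab, size=length)
noisy = np.where(rng.random(length) < t, MASK, clean)   # mask each token w.p. t
logits = rng.normal(size=(length, vocab))               # stand-in for network output
loss = masked_cross_entropy(logits, clean, noisy, weight=1.0 / t)
print(loss)
```

Positions that are still visible contribute nothing, so the loss looks like BERT-style masked language modeling with a schedule-dependent weight.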

### Sampling

At inference time SEDD initializes a sequence from the stationary distribution (all tokens drawn uniformly, or for the absorbing kernel a fully masked sequence) and integrates the reverse CTMC using the learned ratios. The paper introduces an *analytical sampler* that exploits the closed-form transition matrices and a Tweedie-style correction; this allows sampling with far fewer denoising steps than there are tokens in the sequence while still matching baseline quality. The authors report that 1024-token samples reach GPT-2 quality with roughly 32 times fewer network evaluations than token-by-token autoregressive decoding, which needs one evaluation per generated token.[^1][^9] Because the model parameterizes ratios directly, conditioning on tokens at arbitrary positions (left, right, middle, or scattered) is a Bayes-rule rearrangement of the ratios and requires no additional training, enabling zero-shot infilling.[^1][^2]
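The overall loop for the absorbing kernel can be sketched with a stand-in model. This is a simplified ancestral-style unmasking loop, not the paper's analytical sampler: it assumes a linear schedule in which a token is masked with probability `t`, uses a hypothetical `toy_denoiser` that returns uniform token probabilities in place of a trained network, and handles infilling by clamping the observed positions.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for the mask token

def toy_denoiser(tokens, vocab):
    """Stand-in for the trained network: uniform distribution at every position."""
    return np.full((len(tokens), vocab), 1.0 / vocab)

def sample(length, vocab, steps, rng, observed=None, denoiser=toy_denoiser):
    """Start fully masked and unmask over `steps` reverse steps.

    `observed` maps position -> token id for tokens that are given in advance.
    """
    tokens = np.full(length, MASK)
    for pos, tok in (observed or {}).items():
        tokens[pos] = tok                        # clamp known tokens (infilling)
    times = np.linspace(1.0, 0.0, steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        probs = denoiser(tokens, vocab)          # one network evaluation per step
        masked = np.flatnonzero(tokens == MASK)
        # Under the assumed linear schedule, a token that is masked at time t
        # is revealed by time s with probability (t - s) / t.
        reveal = masked[rng.random(len(masked)) < (t - s) / t]
        for pos in reveal:
            tokens[pos] = rng.choice(vocab, p=probs[pos])
    return tokens

out = sample(length=16, vocab=5, steps=4, rng=np.random.default_rng(0),
             observed={0: 3, 15: 1})
print(out)
```

Here 16 tokens are produced with 4 model calls, and the clamped tokens at positions 0 and 15 act as a prompt that sits at both ends of the sequence.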

### Architecture and training setup

The network is a [Diffusion Transformer (DiT)](/wiki/diffusion_transformer)-style encoder-only [Transformer](/wiki/transformer) with adaptive layer-norm time conditioning and rotary positional embeddings. SEDD Small (about the parameter count of GPT-2 small) and SEDD Medium (about the parameter count of GPT-2 medium) are released. Training uses a batch size of 512, learning rate `3e-4`, linear warmup over the first 2,000 iterations, gradient-norm clipping at 1, and an EMA decay of 0.9999, on nodes of 8 [A100](/wiki/nvidia_a100) 80GB or 16 A100 40GB GPUs.[^11][^9]
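The reported optimization settings can be written down as three small helpers. This is a framework-free sketch with hypothetical function names; it assumes the warmup starts from zero and that the learning rate stays constant afterwards, neither of which is stated above.

```python
def learning_rate(step, base_lr=3e-4, warmup_steps=2000):
    """Linear warmup to base_lr over warmup_steps, then constant (assumed)."""
    return base_lr * min(1.0, step / warmup_steps)

def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over a dict of float parameters."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of float gradients so their L2 norm is at most max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

print(learning_rate(1000))               # halfway through warmup
print(clip_by_global_norm([3.0, 4.0]))   # norm 5 rescaled to norm 1
```

With a decay of 0.9999 the averaged weights move slowly, so it is typically the EMA copy rather than the raw weights that is used for evaluation and sampling.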

## Experiments and Results

SEDD is trained on OpenWebText for the zero-shot GPT-2 comparison, on the One Billion Words corpus for likelihood evaluation, and on character-level text8 for ablations. The headline result is that on the five standard zero-shot perplexity datasets used to evaluate GPT-2, the upper-bound perplexities of SEDD Absorb are competitive with the GPT-2 numbers at comparable model size: at both sizes SEDD beats GPT-2 on WikiText2, PTB, and WikiText103 and trails it on LAMBADA and One Billion Words.[^1][^9]

| Dataset | GPT-2 Small | SEDD Absorb Small | GPT-2 Medium | SEDD Absorb Medium |
|---|---|---|---|---|
| LAMBADA | 45.04 | <= 50.92 | 35.66 | <= 42.77 |
| WikiText2 | 42.43 | <= 41.84 | 31.80 | <= 31.04 |
| PTB | 138.43 | <= 114.24 | 123.14 | <= 87.12 |
| WikiText103 | 41.60 | <= 40.62 | 31.39 | <= 29.98 |
| One Billion Words | 75.20 | <= 79.29 | 55.72 | <= 61.19 |

Values are perplexities; lower is better. SEDD numbers are upper bounds (`<=`) because the model's likelihood is only bounded from below by an ELBO, which translates into an upper bound on perplexity. Sources: arXiv:2310.16834 v3 and the project blog post.[^1][^9]
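As a small worked example, the sketch below shows how a negative ELBO (in nats) turns into a perplexity upper bound and tallies the table above by dataset. The helper names are hypothetical; the numbers are copied from the table.

```python
import math

def perplexity_upper_bound(neg_elbo_nats, num_tokens):
    """exp of the per-token negative ELBO.

    The negative ELBO is at least the true negative log-likelihood,
    so this value upper-bounds the true perplexity.
    """
    return math.exp(neg_elbo_nats / num_tokens)

# (GPT-2, SEDD Absorb upper bound) pairs copied from the table above.
small = {"LAMBADA": (45.04, 50.92), "WikiText2": (42.43, 41.84),
         "PTB": (138.43, 114.24), "WikiText103": (41.60, 40.62),
         "One Billion Words": (75.20, 79.29)}
medium = {"LAMBADA": (35.66, 42.77), "WikiText2": (31.80, 31.04),
          "PTB": (123.14, 87.12), "WikiText103": (31.39, 29.98),
          "One Billion Words": (55.72, 61.19)}

def sedd_wins(table):
    """Datasets where the SEDD upper bound is at or below the GPT-2 perplexity."""
    return [name for name, (gpt2, sedd) in table.items() if sedd <= gpt2]

print(sedd_wins(small))    # → ['WikiText2', 'PTB', 'WikiText103']
print(sedd_wins(medium))   # → ['WikiText2', 'PTB', 'WikiText103']
```

Because the SEDD values are upper bounds, a win in this tally is conservative: the true perplexity can only be lower than the reported number.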

Against earlier diffusion language models (discrete diffusion baselines such as D3PM, and continuous-embedding diffusion such as PLAID), the paper reports 25 to 75 percent reductions in perplexity at comparable model sizes.[^1][^4] On unconditional text generation, the authors report that SEDD produces samples whose generative perplexity (as measured by a stronger evaluator model) is 6 to 8 times better than that of un-annealed GPT-2 samples; SEDD therefore does not need temperature scaling or nucleus sampling to avoid degenerate outputs. The MAUVE divergence against the reference distribution is comparable to that of nucleus-sampled GPT-2.[^1][^9] The paper also demonstrates infilling tasks where prompts appear at arbitrary positions, which is awkward for a left-to-right [GPT](/wiki/gpt_generative_pre-trained_transformer) model but native to SEDD.[^1]

## Variants and Follow-Up Work

SEDD anchored a wave of follow-up work that simplified or generalized the framework.

- **MDLM (Masked Diffusion Language Models).** Sahoo et al. (Cornell and collaborators; NeurIPS 2024) focused exclusively on the absorbing kernel and derived a Rao-Blackwellized continuous-time objective that they showed equals a weighted average of standard masked language modeling cross-entropy losses. They report a roughly 17 percent improvement on the LM1B perplexity bound relative to SEDD trained for 33 billion tokens, without invoking the CTMC machinery.[^7]
- **MD4 (Simplified and Generalized Masked Diffusion).** Shi, Han, Wang, Doucet, and Titsias at [Google DeepMind](/wiki/google_deepmind) independently established that the continuous-time variational objective for masked discrete diffusion reduces to a simple weighted integral of cross-entropies and extended the framework to state-dependent masking schedules. MD4 outperforms SEDD on most of the reported language benchmarks and competes with autoregressive models.[^8]
- **LLaDA (Large Language Diffusion with mAsking).** Released in 2025 by researchers at Renmin University and Ant Group, LLaDA is an 8 billion parameter masked diffusion language model that scales the absorbing-kernel recipe and reaches parity with [LLaMA3](/wiki/llama_3) 8B on a number of zero-shot benchmarks. It is descended methodologically from SEDD and the MDLM/MD4 line.[^12]
- **[Inception Labs](/wiki/inception_labs) Mercury.** Mercury and Mercury Coder, released in 2025 by Inception Labs (cofounded by Stefano Ermon, an author on SEDD), are commercial diffusion language models that the company markets as providing very high throughput for code and chat. Reporting on Mercury and its lineage points back to the score entropy paper as the technical predecessor.[^13]

The original SEDD reference implementation lives in the GitHub repository `louaaron/Score-Entropy-Discrete-Diffusion` under the MIT license, with pretrained checkpoints `louaaron/sedd-small` and `louaaron/sedd-medium` released on Hugging Face. The codebase exposes both the absorbing and uniform graphs and the analytical sampler.[^6][^14]

## Applications and Significance

The headline application of SEDD is non-autoregressive language modeling. By matching a GPT-2-scale autoregressive baseline at the same parameter count, SEDD was widely read as the first credible empirical demonstration that the autoregressive next-token paradigm is not the only viable route to a competitive language model.[^1][^2][^15] Three properties of SEDD are repeatedly cited in subsequent work and commentary:

1. **Controllable, prompt-agnostic generation.** Because the model parameterizes ratios, it can condition on observed tokens at any positions without the train-time tricks (suffix prompting, fill-in-the-middle, prefix LM objectives) that autoregressive models need.[^1][^9]
2. **Compute and quality trade-off.** The number of reverse-diffusion steps can be reduced, with a graceful quality degradation, in a way that is structurally unavailable to a strictly sequential autoregressive sampler.[^1][^4]
3. **Faithful sampling without annealing.** SEDD reaches sample quality competitive with nucleus-sampled GPT-2 from the unmodified model distribution, sidestepping the temperature and top-p hyperparameters that autoregressive samplers rely on to avoid degeneration.[^1][^9]

Beyond language, the score entropy framework applies to any discrete generative task. Subsequent work has explored protein sequence design and other categorical-data domains using SEDD-style ratio modeling.[^16]

## Limitations

The SEDD paper and independent evaluations are explicit about several weaknesses.

- **Short-prompt conditional generation.** A 2024 survey of diffusion language modeling notes that SEDD is slightly weaker than GPT-2 when conditioning on short prompts, and is less diverse on some prompted generation metrics.[^15]
- **FLOP and KV-cache inefficiency.** Iterative denoising recomputes attention over the full sequence at every step, and the mask tokens that dominate early sampling steps consume compute without contributing semantic content. Non-causal attention also makes KV caching nontrivial relative to a decoder-only autoregressive model.[^15]
- **Fixed sequence length.** A SEDD generation operates on a pre-allocated buffer of fixed length, in contrast to autoregressive sampling that can extend a sequence indefinitely.[^15]
- **Scaling beyond GPT-2 scale was not demonstrated in the original paper.** The released checkpoints stop at GPT-2 medium parameter counts; large-scale evidence that the recipe continues to compete with modern autoregressive [large language models](/wiki/large_language_model) came later, through MDLM, MD4, and LLaDA rather than from SEDD itself.[^7][^8][^12]

## Reception and Awards

The paper appeared on arXiv on 25 October 2023; the third revision was posted on 6 June 2024.[^1] It was accepted as an oral and a Best Paper at the 41st International Conference on Machine Learning, held in Vienna, Austria from 21 to 27 July 2024, where it was one of ten Best Paper award recipients.[^4][^5] [Stanford AI Lab](/wiki/stanford_hai) published a congratulatory announcement dated 26 July 2024.[^3] The paper is published in the ICML 2024 proceedings.[^17]

## Comparison

| System | Year | Loss / training objective | Forward process | Notes |
|---|---|---|---|---|
| D3PM (Austin et al.) | 2021 | Variational ELBO on mean prediction | Discrete Markov chain (categorical) | Tractable but loose continuous-time bounds. |
| Concrete Score Matching | 2022 | L2 on ratio estimates | Continuous-time discrete | No positivity constraint; unstable. |
| **SEDD** | 2023 | **Score entropy / denoising score entropy** | **CTMC, absorbing or uniform** | **First non-AR LM to match GPT-2 at comparable size.** |
| MDLM | 2024 | Rao-Blackwellized weighted cross-entropy | Absorbing only | Outperforms SEDD on LM1B at matched compute. |
| MD4 | 2024 | Simple weighted integral of cross-entropies | Absorbing, with state-dependent schedules | Outperforms SEDD on most reported benchmarks. |
| LLaDA | 2025 | Masked diffusion objective | Absorbing | 8B parameters; competitive with [LLaMA3](/wiki/llama_3) 8B zero-shot. |

Sources: arXiv:2310.16834 v3 for SEDD;[^1] the MDLM and MD4 arXiv preprints;[^7][^8] LLaDA reporting.[^12]

## Related Work

SEDD sits at the intersection of three lines of research: continuous-time [denoising diffusion](/wiki/ddpm) generative models, [masked language modeling](/wiki/masked_language_model) in the spirit of [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), and energy-based probabilistic modeling, where modeling unnormalized ratios is a familiar device for avoiding intractable partition functions. The [Diffusion Transformer](/wiki/diffusion_transformer) architecture from Peebles and Xie supplied the conditioning backbone. Followers in the [diffusion language models](/wiki/diffusion_language_models) line include MDLM, MD4, LLaDA, and the commercial Mercury family from [Inception Labs](/wiki/inception_labs).[^1][^7][^8][^12][^13]

## See also

- [Diffusion Language Models](/wiki/diffusion_language_models)
- [DDPM](/wiki/ddpm)
- [Diffusion Transformer (DiT)](/wiki/diffusion_transformer)
- [Score matching](/wiki/score_matching)
- [Masked Language Model](/wiki/masked_language_model)
- [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)
- [GPT-2](/wiki/gpt-2)
- [Inception Labs](/wiki/inception_labs)
- [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo)
- [Markov chain](/wiki/markov_chain)

## References

[^1]: Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution", arXiv, 2023-10-25 (v1) / 2024-06-06 (v3). https://arxiv.org/abs/2310.16834. Accessed 2026-05-21.
[^2]: Aaron Lou, "Language Modeling by Estimating the Ratios of the Data Distribution", aaronlou.com blog, 2024. https://aaronlou.com/blog/2024/discrete-diffusion/. Accessed 2026-05-21.
[^3]: Cindy Duong, "Congratulations to Aaron Lou, Chenlin Meng, and Stefano Ermon for an ICML 2024 Best Paper Award!", Stanford Artificial Intelligence Laboratory, 2024-07-26. https://ai.stanford.edu/news/congratulations-to-aaron-lou-chenlin-meng-and-stefano-ermon-for-an-icml-2024-best-paper-award/. Accessed 2026-05-21.
[^4]: AIhub editorial team, "Congratulations to the #ICML2024 award winners", AIhub, 2024-07-25. https://aihub.org/2024/07/25/congratulations-to-the-icml2024-award-winners/. Accessed 2026-05-21.
[^5]: ICML, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (Poster, Best Paper)", icml.cc virtual program, 2024. https://icml.cc/virtual/2024/poster/34686. Accessed 2026-05-21.
[^6]: Aaron Lou, "louaaron/Score-Entropy-Discrete-Diffusion (README)", GitHub, 2024. https://github.com/louaaron/Score-Entropy-Discrete-Diffusion. Accessed 2026-05-21.
[^7]: Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, Volodymyr Kuleshov, "Simple and Effective Masked Diffusion Language Models", arXiv, 2024-06-11 (revised 2024-11-10). https://arxiv.org/abs/2406.07524. Accessed 2026-05-21.
[^8]: Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias, "Simplified and Generalized Masked Diffusion for Discrete Data", arXiv, 2024-06-06 (revised 2025-01-16). https://arxiv.org/abs/2406.04329. Accessed 2026-05-21.
[^9]: Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (HTML v3)", arXiv, 2024. https://arxiv.org/html/2310.16834v3. Accessed 2026-05-21.
[^10]: Aaron Lou, "Aaron Lou (personal site)", aaronlou.com, 2024. https://aaronlou.com/. Accessed 2026-05-21.
[^11]: Greg Schoeninger, "ArXiv Dives: Text Diffusion with SEDD", Oxen.ai blog, 2024. https://ghost.oxen.ai/arxiv-dives-text-diffusion-with-sedd/. Accessed 2026-05-21.
[^12]: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "Large Language Diffusion Models (LLaDA)", arXiv, 2025-02. https://arxiv.org/abs/2502.09992. Accessed 2026-05-21.
[^13]: The Batch editorial team, "Mercury Coder May Be the First Commercially Available Language Diffusion Model", DeepLearning.AI The Batch, 2025. https://www.deeplearning.ai/the-batch/mercury-coder-may-be-the-first-commercially-available-language-diffusion-model/. Accessed 2026-05-21.
[^14]: Aaron Lou, "louaaron/sedd-small (model card)", Hugging Face, 2024. https://huggingface.co/louaaron/sedd-small. Accessed 2026-05-21.
[^15]: Justin Deschenaux, Caglar Gulcehre, "Promises, Outlooks and Challenges of Diffusion Language Modeling", arXiv, 2024-06-17. https://arxiv.org/html/2406.11473v1. Accessed 2026-05-21.
[^16]: Alex Carlin, "Score entropy discrete diffusion models for protein design", alexcarlin.bearblog.dev, 2024. https://alexcarlin.bearblog.dev/score-entropy-discrete-diffusion-models-for-protein-design/. Accessed 2026-05-21.
[^17]: Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete diffusion modeling by estimating the ratios of the data distribution", Proceedings of the 41st International Conference on Machine Learning (ICML 2024), 2024. https://dl.acm.org/doi/10.5555/3692070.3693403. Accessed 2026-05-21.