# Stefano Ermon

> Source: https://aiwiki.ai/wiki/stefano_ermon
> Updated: 2026-06-27
> Categories: Generative AI, Machine Learning, People
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Stefano Ermon** is an Italian computer scientist and an associate professor of computer science at [Stanford University](/wiki/stanford_university), best known for foundational work on score-based [generative models](/wiki/generative_model), the approach that underpins much of the modern theory of [diffusion models](/wiki/diffusion_model). With his doctoral student Yang Song he introduced noise conditional score networks in 2019 and a continuous-time, stochastic differential equation formulation of score-based models in 2021, and he teaches Stanford's CS236 Deep Generative Models course. In 2024 he co-founded Inception (also known as Inception Labs), a Palo Alto company that builds diffusion-based [large language models](/wiki/large_language_model) such as Mercury, and he serves as its chief executive. [1][2][3]

## Who is Stefano Ermon?

Ermon is a researcher in machine learning and generative AI whose work spans probabilistic modeling, [imitation learning](/wiki/imitation_learning), and artificial intelligence for sustainability and social good. He is affiliated with the Stanford Artificial Intelligence Laboratory and is a fellow of the Woods Institute for the Environment. His best-known scientific contribution, score-based generative modeling, reframed sample generation as the task of estimating and following the gradient of a data distribution, and it became one of the theoretical pillars of diffusion-based image, audio, and video synthesis. [1][4][5]

## Where did Stefano Ermon study?

Ermon studied electrical engineering at the University of Padova in Italy, where he earned a Bachelor of Science in 2006 and a Master of Science in 2008, both summa cum laude. [4][5]

He then moved to the United States for doctoral study at Cornell University, where he pursued a PhD in computer science with a minor in applied mathematics from 2008 to early 2015. His dissertation was titled "Decision Making and Inference under Limited Information and High Dimensionality," and his advisors were Carla P. Gomes and Bart Selman. While at Cornell he held a McMullen Fellowship and worked within the group that helped define computational sustainability, a research area that applies computational methods to environmental, economic, and societal problems. [4][6]

## What does Stefano Ermon research at Stanford?

Ermon joined Stanford University as an assistant professor in November 2014 and was later promoted to associate professor. He is affiliated with the Stanford Artificial Intelligence Laboratory and is a fellow of the Woods Institute for the Environment. His group studies methods for probabilistic modeling, generative modeling, and decision making, with applications that range from image synthesis to satellite imagery analysis. He also teaches CS236 Deep Generative Models, a graduate course that covers autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, energy-based models, and score-based diffusion models. The course's lecture materials are widely used outside Stanford. [4][5]

### What are score-based generative models?

The contribution most associated with Ermon is score-based generative modeling. In a 2019 paper titled "Generative Modeling by Estimating Gradients of the Data Distribution," Ermon and his student Yang Song proposed learning the gradient of the log probability density of data, a quantity known as the score, rather than the density itself. In the paper's own words, "samples are produced via Langevin dynamics using gradients of the data distribution estimated with [score matching](/wiki/score_matching)," and the authors reported that their models "produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets." Earlier generative methods often modeled the probability density directly or relied on adversarial training, and both approaches carried known difficulties around normalization constants and training stability. Song and Ermon argued that the score function avoids the intractable normalization constant entirely, since the gradient of a log density does not depend on it. [7][8]
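The normalization argument can be written out in a few lines. The notation here (an unnormalized density $\tilde{p}_\theta$ and a normalizing constant $Z_\theta$) is introduced for illustration and is not quoted from the paper.

$$
p_\theta(\mathbf{x}) = \frac{\tilde{p}_\theta(\mathbf{x})}{Z_\theta},
\qquad
Z_\theta = \int \tilde{p}_\theta(\mathbf{x})\, d\mathbf{x}
$$

$$
\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})
= \nabla_{\mathbf{x}} \log \tilde{p}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z_\theta
= \nabla_{\mathbf{x}} \log \tilde{p}_\theta(\mathbf{x})
$$

The last step holds because $Z_\theta$ is a constant with respect to $\mathbf{x}$, so its gradient is zero. A model of the score therefore never needs to evaluate the integral that defines $Z_\theta$.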

The method has two main parts. First, the authors trained a single neural network, the noise conditional score network (NCSN), to estimate scores across many levels of added Gaussian noise. Perturbing the data with a range of noise scales addresses a practical problem, namely that score estimates are unreliable in regions of low data density, and the added noise spreads probability mass so that the network receives an informative training signal everywhere. Second, for generation they used annealed Langevin dynamics, a sampling procedure that starts from large noise and gradually reduces it, following the estimated score at each level to move random samples toward the data distribution. The paper appeared at the Conference on Neural Information Processing Systems (NeurIPS) in 2019 as an oral presentation, and a 2020 follow-up, "Improved Techniques for Training Score-Based Generative Models," refined the technique and scaled it to higher-resolution images. [7][8]
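The sampling half of the method can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the trained NCSN is replaced by the exact score of a Gaussian data distribution perturbed with noise, and the function names, noise schedule, step count, and `eps` value are all assumptions chosen for the toy problem. The step-size rule, which scales the step with the square of the current noise level, is modeled on the schedule described in the 2019 paper.

```python
import numpy as np

def gaussian_score(x, sigma, mu=3.0, data_std=0.5):
    """Exact score of N(mu, data_std^2) data after adding N(0, sigma^2) noise.

    Stand-in (assumption) for a trained noise conditional score network.
    """
    return -(x - mu) / (data_std**2 + sigma**2)

def annealed_langevin(score_fn, sigmas, n_samples=5000,
                      steps_per_level=100, eps=2e-5, seed=0):
    """Run Langevin dynamics at each noise level, from largest to smallest."""
    rng = np.random.default_rng(seed)
    # Arbitrary starting points, far from the data distribution.
    x = rng.uniform(-10.0, 10.0, size=n_samples)
    for sigma in sigmas:
        # Step size shrinks with the noise level.
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_samples)
            # Langevin update: drift along the score plus injected noise.
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

# Geometric noise schedule from large to small (illustrative values).
sigmas = np.geomspace(10.0, 0.01, num=10)
samples = annealed_langevin(gaussian_score, sigmas)
print(f"sample mean {samples.mean():.2f}, sample std {samples.std():.2f}")
```

In this toy setting the samples should end up concentrated near the data mean of 3.0 with a spread close to the data standard deviation of 0.5. With a learned score network the loop is the same, and only `score_fn` changes.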

In 2021 Song, Ermon, and several coauthors unified score-based models and diffusion models within a single framework based on stochastic differential equations. The paper, "Score-Based Generative Modeling through Stochastic Differential Equations," described data corruption as a continuous-time forward process that slowly injects noise until the data becomes a simple known distribution. Sample generation runs the corresponding reverse-time process, which removes noise and depends only on the time-varying score of the perturbed data. The authors showed that many earlier methods, including the noise conditional score network and denoising diffusion, are discrete instances of this continuous view. They also introduced a predictor-corrector sampler that combines numerical integration with score-based correction steps, and a probability flow ordinary differential equation that shares the same marginal distributions as the stochastic process and allows exact likelihood computation. The paper received an Outstanding Paper Award at the International Conference on Learning Representations (ICLR) in 2021. This line of work sits beside the denoising diffusion approach developed elsewhere, and together these efforts established the theoretical basis for diffusion models that later powered systems such as [Stable Diffusion](/wiki/stable_diffusion). [9][10]
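The three processes described above can be written compactly. The symbols are assumptions introduced for illustration: $\mathbf{f}$ is a drift term, $g$ a scalar diffusion coefficient, $\mathbf{w}$ and $\bar{\mathbf{w}}$ forward-time and reverse-time Wiener processes, and $p_t$ the marginal density of the perturbed data at time $t$.

$$
\text{forward:}\quad d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}
$$

$$
\text{reverse:}\quad d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt + g(t)\,d\bar{\mathbf{w}}
$$

$$
\text{probability flow ODE:}\quad d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt
$$

The only unknown quantity in the reverse process and in the ODE is the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, which is why a time-dependent score model is enough to generate samples.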

### What is discrete diffusion and the SEDD paper?

Ermon's group later extended score-based ideas from continuous data such as images to discrete data such as text. The 2024 paper "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution," with Aaron Lou and Chenlin Meng, introduced a loss called score entropy that extends score matching to discrete spaces, and used it to build score entropy discrete diffusion (SEDD) models. The authors reported that SEDD reduced perplexity relative to existing language diffusion methods by 25 to 75 percent, was competitive with autoregressive language models, and outperformed GPT-2 while allowing similar quality with far fewer network evaluations. The paper won a Best Paper Award at the International Conference on Machine Learning (ICML) in 2024, and the work is part of the research lineage behind diffusion-based language modeling. [17][18]
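The "ratios" in the paper's title can be stated concretely. The following is a simplified sketch of the score entropy idea, with the diffusion time index dropped and with notation chosen for illustration: $s_\theta(x)_y$ is the model output for moving from token state $x$ to a different state $y$, and $w_{xy} \ge 0$ are weights.

$$
s_\theta(x)_y \approx \frac{p(y)}{p(x)}, \qquad y \neq x
$$

$$
\mathcal{L}_{\mathrm{SE}} = \mathbb{E}_{x \sim p}\left[\sum_{y \neq x} w_{xy}\left(s_\theta(x)_y - \frac{p(y)}{p(x)}\,\log s_\theta(x)_y + K\!\left(\frac{p(y)}{p(x)}\right)\right)\right],
\qquad K(a) = a(\log a - 1)
$$

Each summand is minimized when $s_\theta(x)_y$ equals the true ratio $p(y)/p(x)$, and it equals zero there, so the loss is non-negative and vanishes only when every ratio is matched. As with the continuous score, a ratio of probabilities does not depend on the normalizing constant.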

### How does Ermon's imitation learning work relate to GANs?

With his student [Jonathan Ho](/wiki/jonathan_ho), Ermon introduced generative adversarial imitation learning, or GAIL, at the 2016 Conference on Neural Information Processing Systems. Imitation learning seeks to recover a policy from demonstrations of expert behavior without access to a reward signal. Classical approaches often first infer a reward function through inverse reinforcement learning and then optimize a policy against it, which is computationally heavy. GAIL skips the explicit reward step. It adapts ideas from [generative adversarial networks](/wiki/generative_adversarial_network) by training a discriminator to distinguish expert state-action pairs from those produced by the learner, while the policy is updated to fool the discriminator and so to match the expert behavior. The method became a widely cited reference point in imitation learning and reinforcement learning. [11]
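The adversarial game can be summarized as a saddle-point problem. The notation is introduced here for illustration: $\pi$ is the learner's policy, $\pi_E$ the expert policy, $D$ a discriminator with outputs in $(0, 1)$, $H(\pi)$ an entropy regularizer on the policy, and $\lambda \ge 0$ its weight. Sign conventions vary between presentations.

$$
\min_{\pi}\,\max_{D}\;\; \mathbb{E}_{\pi}\big[\log D(s, a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s, a)\big)\big] - \lambda H(\pi)
$$

Under this convention the discriminator is pushed toward 1 on state-action pairs from the learner and toward 0 on pairs from the expert, while the policy is updated to make its own pairs indistinguishable from the expert's.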

Ermon has also published extensively on probabilistic inference and [variational inference](/wiki/variational_inference), on discrete optimization and counting problems, and on techniques for improving the training and evaluation of generative models. Jonathan Ho, an early member of his group, went on to co-author the denoising diffusion probabilistic models paper that is often cited alongside the score-based work as a starting point for diffusion methods. [4]

### How does Ermon apply AI to sustainability and social good?

A second strand of Ermon's research applies machine learning to sustainability and development. His group used satellite imagery and street-level images to estimate poverty, crop yields, and other livelihood indicators in regions where ground data is scarce. Work on poverty mapping in Africa was selected by Scientific American as one of its world-changing ideas for 2016, and crop-yield prediction models from his group won first place in the World Bank Big Data Innovation Challenge in 2017. In 2018 Ermon co-founded Atlas AI, a company focused on economic and agricultural forecasting from geospatial data, where he served as a chief technical advisor. [4][12]

## What is Inception Labs?

Inception, also known as Inception Labs, is a Palo Alto, California company that Ermon co-founded in 2024 with Aditya Grover and Volodymyr Kuleshov, two former collaborators who had worked with him at Stanford. Grover became a professor at the University of California, Los Angeles, and Kuleshov a professor at Cornell. The company emerged from stealth in February 2025, and Ermon serves as its chief executive. [2][3][13]

Inception applies diffusion methods to text generation. Most large language models are autoregressive, producing one token at a time in sequence, so each token depends on the tokens before it. A diffusion language model instead starts from a noisy or masked draft and refines many tokens in parallel across several denoising steps. Inception argues that this parallel design can raise generation speed and lower cost relative to sequential decoding. [13][14]
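The difference in the number of model calls can be illustrated with a toy sketch. This is not Inception's method or any real model: `toy_denoiser` is a hypothetical stand-in that already knows the target text and assigns made-up confidences, and `tokens_per_step` is an assumed parameter. The sketch only shows how committing several tokens per denoising step reduces the number of sequential steps.

```python
from typing import List, Tuple

MASK = "<mask>"

def toy_denoiser(draft: List[str], target: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for a trained denoising network.

    Proposes a (token, confidence) pair for every position in one call.
    A real model would predict these from the partially masked draft.
    """
    n = len(draft)
    return [(target[i], 1.0 - ((i * 3) % n) / n) for i in range(n)]

def parallel_refine(target: List[str], tokens_per_step: int) -> int:
    """Fill a fully masked draft, committing several tokens per step."""
    draft = [MASK] * len(target)
    steps = 0
    while MASK in draft:
        proposals = toy_denoiser(draft, target)  # one model call per step
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        # Commit the most confident masked positions in this step.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            draft[i] = proposals[i][0]
        steps += 1
    return steps

def autoregressive_steps(target: List[str]) -> int:
    """Sequential decoding makes one model call per generated token."""
    return len(target)

target = "def add(a, b): return a + b".split()  # 7 whitespace-separated tokens
print("autoregressive calls:", autoregressive_steps(target))
print("parallel refinement calls:", parallel_refine(target, tokens_per_step=3))
```

For the seven-token example, sequential decoding needs seven calls while the parallel sketch needs three. Whether a real diffusion language model is faster in practice also depends on the cost of each call and on how many refinement steps are needed for acceptable quality.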

### What is Mercury?

The company's first product family is named [Mercury](/wiki/mercury_inception). Inception introduced it in February 2025 under the banner "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model," and released variants aimed at coding, including Mercury Coder Mini and Mercury Coder Small. The underlying network is a Transformer trained to refine a draft from noise over a small number of denoising steps, modifying many tokens at once rather than left to right. The company stated that "Mercury is up to 10x faster than frontier speed-optimized LLMs" and that "our models run at over 1000 tokens/sec on NVIDIA H100s," whereas, by its account, speed-optimized autoregressive models reach at most about 200 tokens per second. Inception reported throughput of about 1,109 tokens per second for Mercury Coder Mini and about 737 tokens per second for Mercury Coder Small on NVIDIA H100 hardware. On the HumanEval coding benchmark, the company reported a score of 88.0 for Mercury Coder Mini against 90.0 for a competing model of similar class. [13][15]

In November 2025 Inception announced a 50 million dollar funding round led by Menlo Ventures, with participation from Mayfield, Innovation Endeavors, M12, Snowflake Ventures, Databricks, and NVentures, along with angel investments from Andrew Ng and Andrej Karpathy. Alongside the round the company released an updated Mercury model aimed at software development and reported integrations with several developer tools. In February 2026 Inception launched Mercury 2, which it described as a reasoning-capable diffusion model that is several times faster than leading speed-optimized large language models while lowering inference cost. [3][12][19]

## Why is Stefano Ermon significant for generative AI?

Score-based generative modeling reframed sample generation as the task of estimating and following the gradient of a data distribution rather than directly modeling the distribution. The stochastic differential equation formulation that followed connected this view to diffusion processes and gave a single mathematical language for a family of generative methods. These ideas, developed in Ermon's group alongside parallel work on denoising diffusion, became part of the standard toolkit for image, audio, and video generation. His later move to apply diffusion to language, both in the SEDD research and in the commercial Mercury models, placed him among researchers exploring alternatives to the autoregressive design that dominates large language models. [9][16]

## What awards has Stefano Ermon received?

Ermon has received several awards for his research. In 2018 he received the IJCAI Computers and Thought Award, given to artificial intelligence researchers under the age of 35, for foundational work in probabilistic reasoning, machine learning, and decision making with broad societal impact. He held a Sloan Research Fellowship in 2020 and a Microsoft Research Faculty Fellowship in 2019, and he received the National Science Foundation CAREER Award in 2017. His papers have won an Outstanding Paper Award at the International Conference on Learning Representations in 2021 and a Best Paper Award at the International Conference on Machine Learning in 2024, among other conference honors. [4][6][17]

## Selected facts

| Field | Detail |
| --- | --- |
| Full name | Stefano Ermon |
| Occupation | Computer scientist |
| Known for | Score-based generative models, diffusion language models |
| Position | Associate professor of computer science, Stanford University |
| Affiliations | Stanford Artificial Intelligence Laboratory, Woods Institute for the Environment |
| Teaching | CS236 Deep Generative Models |
| PhD | Cornell University, computer science, 2015 |
| PhD advisors | Carla P. Gomes and Bart Selman |
| Earlier degrees | BSc 2006 and MSc 2008, electrical engineering, University of Padova |
| Companies | Inception (co-founder and chief executive, 2024), Atlas AI (co-founder, 2018) |
| Notable awards | IJCAI Computers and Thought Award (2018), Sloan Research Fellowship (2020), NSF CAREER Award (2017), ICLR Outstanding Paper (2021), ICML Best Paper (2024) |

## References

1. "Stefano Ermon." Stanford Computer Science. https://cs.stanford.edu/~ermon/
2. "Inception emerges from stealth with a new type of AI model." TechCrunch. February 26, 2025. https://techcrunch.com/2025/02/26/inception-emerges-from-stealth-with-a-new-type-of-ai-model/
3. "Inception raises $50 million to build diffusion models for code and text." TechCrunch. November 6, 2025. https://techcrunch.com/2025/11/06/inception-raises-50-million-to-build-diffusion-models-for-code-and-text/
4. "Stefano Ermon, Curriculum Vitae." Stanford Computer Science. https://cs.stanford.edu/~ermon/cv.pdf
5. "Stefano Ermon." Stanford Computer Science people directory. https://cs.stanford.edu/people/ermon/
6. "Stefano Ermon." Cornell University Computer Science. https://www.cs.cornell.edu/~ermonste/
7. Yang Song and Stefano Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." arXiv:1907.05600. 2019. https://arxiv.org/abs/1907.05600
8. Yang Song and Stefano Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." Advances in Neural Information Processing Systems 32, 2019. https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf
9. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. "Score-Based Generative Modeling through Stochastic Differential Equations." International Conference on Learning Representations, 2021. https://openreview.net/forum?id=PxTIG12RRHS
10. "Generative Modeling by Estimating Gradients of the Data Distribution." Yang Song, blog explainer. https://yang-song.net/blog/2021/score/
11. Jonathan Ho and Stefano Ermon. "Generative Adversarial Imitation Learning." Advances in Neural Information Processing Systems 29, 2016. https://proceedings.neurips.cc/paper/2016/hash/cc7e2b878868cbae992d1fb743995d8f-Abstract.html
12. "From the Lab to the Frontier: The Story Behind Inception." Menlo Ventures. 2025. https://menlovc.com/perspective/from-the-lab-to-the-frontier-the-story-behind-inception/
13. "Mercury: Ultra-Fast Language Models Based on Diffusion." Inception Labs. arXiv:2506.17298. 2025. https://arxiv.org/abs/2506.17298
14. "Inception Labs: Making LLMs Faster and More Cost-Efficient." The New Stack. 2025. https://thenewstack.io/inception-labs-making-llms-faster-and-more-cost-efficient/
15. "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model." Inception Labs. 2025. https://www.inceptionlabs.ai/blog/introducing-mercury
16. Jonathan Ho, Ajay Jain, and Pieter Abbeel. "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems 33, 2020. https://arxiv.org/abs/2006.11239
17. "Congratulations to Aaron Lou, Chenlin Meng, and Stefano Ermon for an ICML 2024 Best Paper Award." Stanford Artificial Intelligence Laboratory. 2024. https://ai.stanford.edu/news/congratulations-to-aaron-lou-chenlin-meng-and-stefano-ermon-for-an-icml-2024-best-paper-award/
18. Aaron Lou, Chenlin Meng, and Stefano Ermon. "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." International Conference on Machine Learning, 2024. arXiv:2310.16834. https://arxiv.org/abs/2310.16834
19. "Inception Launches Mercury 2, the Fastest Reasoning LLM." Inception Labs / Business Wire. February 24, 2026. https://www.businesswire.com/news/home/20260224034496/en/Inception-Launches-Mercury-2-the-Fastest-Reasoning-LLM-5x-Faster-Than-Leading-Speed-Optimized-LLMs-with-Dramatically-Lower-Inference-Cost

