Mercury (Inception Labs)

Updated Jul 23, 2026

Mercury is a family of commercial-scale diffusion-based large language models developed by Inception Labs, a Palo Alto startup co-founded in 2024 by Stanford professor Stefano Ermon together with his former students Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell).^[1]^[2] The first model in the family, Mercury Coder, was unveiled on February 26, 2025 and was promoted by the company as the world's first commercial-scale diffusion LLM (dLLM).^[3]^[1] Mercury generates text by iteratively denoising blocks of tokens in parallel rather than predicting one token at a time. The company reports that this yields throughput of more than 1,000 tokens per second for its fastest variant on a single NVIDIA H100, roughly an order of magnitude above the rates it reports for speed-optimized autoregressive baselines such as GPT-4o Mini and Claude 3.5 Haiku.^[3]^[1]^[4] A technical report describing the system, "Mercury: Ultra-Fast Language Models Based on Diffusion," was posted to arXiv in June 2025.^[4]

Infobox

Field	Value
Developer	Inception Labs (Inception AI, Inc.)
Co-founders	Stefano Ermon, Aditya Grover, Volodymyr Kuleshov^[1]^[2]
Headquarters	Palo Alto, California^[1]^[2]
First public model	Mercury Coder (Mini and Small)^[3]^[4]
Initial release	February 26, 2025^[3]^[1]
Architecture	Diffusion language model with Transformer backbone^[4]
Reported throughput (H100)	1,109 tokens/s (Mini), 737 tokens/s (Small)^[4]
Technical report	arXiv:2506.17298 (June 17, 2025)^[4]
Seed funding	$50 million, led by Menlo Ventures (November 2025)^[2]^[5]
Listed price (Mercury)	$0.25 / 1M input tokens; $1.00 / 1M output tokens^[6]

Background

Diffusion language models before Mercury

Diffusion models had become the dominant paradigm for image and video generation by the early 2020s. Ermon's group at Stanford was one of the originators of the family: its work on score matching and score-based generative modeling fed into later methods such as DDPM and systems such as Stable Diffusion.^[7] Extending diffusion to discrete text was a research priority for the same group. In October 2023, Aaron Lou, Chenlin Meng and Ermon released the SEDD (Score Entropy Discrete Diffusion) paper, which introduced a score-entropy loss for discrete diffusion and reported language-modeling perplexities competitive with autoregressive GPT-2.^[8] SEDD was recognized as an ICML 2024 Best Paper.^[8] A separate line of academic work culminated in February 2025 with the LLaDA (Large Language Diffusion) model from Renmin University and Ant Group, an 8-billion-parameter masked diffusion LLM trained from scratch.^[9]

Mercury Coder appeared in the same month as LLaDA but was positioned differently: it was a hosted commercial product targeting production code-completion workloads, not an academic checkpoint.^[3]^[10]

Founding of Inception Labs

Inception Labs was founded in 2024 by Ermon, "tapping two former students" to co-lead the company: Aditya Grover (Assistant Professor at UCLA) and Volodymyr Kuleshov (Assistant Professor at Cornell Tech).^[1]^[2] All three held tenure-track university positions and had collaborated on generative modeling research for more than a decade.^[11] Menlo Ventures described the trio as scientists who "left to commercialize the technology, recognizing that scaling diffusion language models to production required compute, capital, and engineering capacity that simply didn't exist inside a university."^[11]

The company's early backing came from Mayfield Fund, which led an undisclosed initial round before the public launch.^[2] On November 6, 2025, Inception announced a $50 million seed round led by Menlo Ventures, with participation from Mayfield, Innovation Endeavors, Microsoft's M12, Snowflake Ventures, Databricks Investment and Nvidia's NVentures; angel investors included Andrew Ng and Andrej Karpathy.^[2]^[5]

The Mercury Coder launch (February 26, 2025)

Inception emerged from stealth on February 26, 2025 with a blog post titled "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model," accompanied by simultaneous coverage in TechCrunch and other outlets.^[3]^[1] The initial release comprised two code-focused variants, Mercury Coder Mini and Mercury Coder Small, available through a hosted playground at chat.inceptionlabs.ai (run in partnership with Lambda Labs) and through an API endpoint for enterprise customers.^[3]

Ermon framed the launch as a paradigm change in inference speed. Speaking to TechCrunch, he explained that conventional autoregressive LLMs are inherently sequential because "you cannot generate the second word until you've generated the first one," whereas a dLLM emits and refines an entire block of tokens at once.^[1] An Inception spokesperson told TechCrunch that "our 'small' coding model is as good as GPT-4o mini while more than 10 times as fast" and that the company already had unnamed Fortune 100 customers seeking lower-latency AI.^[1]

The launch post reported throughputs of 1,109 tokens per second for Mercury Coder Mini and 737 tokens per second for Mercury Coder Small on a single NVIDIA H100, speeds the company described as previously achievable only on custom inference silicon.^[3]^[4]

Technical details

Architecture

The Mercury models keep a standard Transformer backbone but replace autoregressive next-token prediction with a discrete diffusion training objective. According to the arXiv technical report, the models are "trained to predict multiple tokens in parallel," and generation proceeds "in a coarse-to-fine way" through iterative denoising of a noised block of tokens.^[4]^[3] Inputs and outputs are token sequences exactly as in a conventional LLM, so Mercury is, in the authors' words, "a drop-in replacement for a typical autoregressive LLM."^[3]

Unlike a left-to-right sampler that emits one token per forward pass, a Mercury sampling step updates all (or many) positions in the output window simultaneously. The number of denoising iterations is configurable; fewer iterations trade quality for additional speed, and more iterations recover quality at the cost of more compute per request. The technical report frames this as a tunable speed/quality lever that does not exist in pure autoregressive decoders.^[4]
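The parallel-refinement idea can be illustrated with a toy sampler. The sketch below is not Mercury's algorithm, which the sources here do not specify; it assumes a masked-diffusion-style scheme in which each pass proposes tokens for all masked positions and commits the most confident ones. The names `toy_denoiser` and `parallel_denoise` are hypothetical, and the "model" is a stand-in that already knows the answer, so the example shows only how the number of passes is decoupled from sequence length, not the quality side of the trade-off.

```python
import random

MASK = "_"

def toy_denoiser(tokens, target):
    """Stand-in for a trained model (assumption): for every masked position,
    propose a token and a confidence. A real dLLM would run one Transformer
    forward pass here; this toy looks the token up in `target` and attaches
    a random confidence."""
    return {
        i: (target[i], random.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }

def parallel_denoise(target, num_steps, seed=0):
    """Fill a fully masked block in at most `num_steps` passes, committing
    the most confident proposals at each pass."""
    random.seed(seed)
    tokens = [MASK] * len(target)
    passes = 0
    for step in range(num_steps):
        proposals = toy_denoiser(tokens, target)
        if not proposals:
            break  # nothing left to fill
        passes += 1
        remaining_steps = num_steps - step
        # Commit an equal share of the still-masked positions per remaining pass.
        quota = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens, passes

target = "def add ( a , b ) : return a + b".split()
tokens, passes = parallel_denoise(target, num_steps=4)
print(" ".join(tokens))
print(passes, "parallel passes versus", len(target), "sequential next-token passes")
```

Raising or lowering `num_steps` changes how many positions are committed per pass, which is the knob the paragraph above describes; in a real model, fewer passes would mean each commitment is made with less refinement.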

Throughput on NVIDIA H100

The headline benchmark scores and throughput figures reported in the arXiv paper and the Inception blog are:^[4]^[3]^[10]

Model	HumanEval	MBPP	MultiPL-E	EvalPlus	Throughput (H100)
Mercury Coder Mini	88.0	77.1	74.1	78.6	1,109 tok/s
Mercury Coder Small	90.0	76.6	76.2	80.4	737 tok/s

For context, the same launch post reported Claude 3.5 Haiku at roughly 61 tokens per second and GPT-4o Mini at roughly 59 tokens per second on equivalent measurements, an order-of-magnitude gap that Inception attributes to parallel decoding rather than to a custom accelerator.^[3]
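The size of the gap follows from simple division of the quoted figures. The short calculation below uses only the throughput numbers reported by Inception; the helper name `speedups` is illustrative, and the ratios inherit whatever measurement conditions the company used.

```python
# Throughput figures (tokens per second) as quoted in the launch post.
mercury = {"Mercury Coder Mini": 1109, "Mercury Coder Small": 737}
baselines = {"Claude 3.5 Haiku": 61, "GPT-4o Mini": 59}

def speedups(fast, slow):
    """Return {(fast_model, slow_model): throughput ratio} for every pair."""
    return {(f, s): fast[f] / slow[s] for f in fast for s in slow}

ratios = speedups(mercury, baselines)
for (f, s), r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{f} vs {s}: {r:.1f}x")
```

On these figures the ratios run from about 12x (Small against Claude 3.5 Haiku) to about 19x (Mini against GPT-4o Mini).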

Coding benchmarks

The Mercury technical report evaluates the two Coder variants on a standard suite that includes HumanEval, MBPP, EvalPlus, MultiPL-E (a multi-language translation of HumanEval covering C++, Java, JavaScript, PHP, Bash and TypeScript), Fill-in-the-Middle, LiveCodeBench and BigCodeBench.^[4]^[10] On HumanEval, Mercury Coder Small scores 90.0 (matching the GPT-4o Mini result quoted alongside it), while Mercury Coder Mini ties with the company's reported Claude 3.5 Haiku score of 88.0.^[3]^[4] On MultiPL-E, Mercury Coder Small reaches 76.2 and on EvalPlus it reaches 80.4.^[4]^[10]

The Fill-in-the-Middle setting, which Inception emphasizes as the workload most representative of real code-completion traffic, is where Mercury's reported results are strongest relative to competitors: Mercury Coder Small reports 84.8 average accuracy, ahead of Mistral's Codestral 2501 at 82.5.^[4]^[10]
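For readers unfamiliar with the setting, a fill-in-the-middle task hides a span of code and asks the model to reconstruct it from the text on both sides. The sketch below is a hypothetical illustration of that task shape only; the function names and the exact-match scoring rule are assumptions, not the harness used in the technical report.

```python
def make_fim_task(source, start, end):
    """Split `source` into (prefix, middle, suffix). The model is shown the
    prefix and suffix and must produce the hidden middle."""
    return source[:start], source[start:end], source[end:]

def exact_match(prediction, middle):
    """Assumed scoring rule: whitespace-insensitive exact match."""
    return prediction.strip() == middle.strip()

source = "def area(w, h):\n    return w * h\n"
start = source.index("return")
end = source.index("\n", start)
prefix, middle, suffix = make_fim_task(source, start, end)

print(repr(prefix))   # context before the gap
print(repr(middle))   # span the model must reconstruct
print(repr(suffix))   # context after the gap
print(exact_match("return w * h", middle))
```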

Copilot Arena

An independent evaluation came from Copilot Arena, a third-party platform for live benchmarking of code LLMs. The launch post and the technical report both note that Mercury Coder Mini placed second on user-preference quality while having the lowest measured latency, with an average completion latency of 25 milliseconds, roughly four times faster than GPT-4o Mini.^[3]^[4]

Variants and platform

Mercury Coder Mini and Mercury Coder Small

The February 2025 launch included two sizes targeted at edit-style and completion-style coding workloads. Mini emphasizes throughput at roughly 1,100 tokens per second and is the workhorse used in Copilot Arena's apply-edit evaluations; Small trades some throughput for higher accuracy on HumanEval and MultiPL-E.^[3]^[4]

Mercury (general chat) and Mercury 2

Inception subsequently expanded the family. A general-chat Mercury was introduced after the Coder launch; Inception's documentation lists Mercury with a 128K-token context window.^[12] On February 24, 2026, Inception announced Mercury 2, described as "the fastest reasoning LLM" and the first reasoning dLLM, with reported throughput around 1,000 tokens per second versus roughly 89 tokens per second for Claude Haiku reasoning and roughly 71 for GPT-5 Mini on the company's benchmarks.^[13] On May 14, 2026 the company posted "The Next Step for dLLMs: Scaling up Mercury," describing an upgraded Mercury with reported improvements in coding, instruction following, mathematical problem solving and knowledge recall, priced at $0.25 per million input tokens and $1.00 per million output tokens.^[6]
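The listed Mercury prices translate into per-request costs as follows. The prices are the ones quoted above; the workload (request count and token sizes) is a made-up example, and the function name `request_cost` is illustrative.

```python
INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens (listed price)
OUTPUT_PRICE_PER_M = 1.00  # USD per 1M output tokens (listed price)

def request_cost(input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical workload: 10,000 requests, each with a 2,000-token prompt
# and a 500-token completion.
total = 10_000 * request_cost(2_000, 500)
print(f"per request: ${request_cost(2_000, 500):.4f}, total: ${total:.2f}")
```

For this assumed workload each request costs a tenth of a cent, or about $10 for the full batch; output tokens account for half of that despite being a fifth of the volume, because they are priced four times higher.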

Distribution and integrations

Mercury is delivered through several channels. Inception runs its own API platform and the chat.inceptionlabs.ai playground (hosted by Lambda Labs), and the model is also routed through OpenRouter and Models.dev.^[6]^[3] Mercury foundation models were added to Amazon Bedrock Marketplace and Amazon SageMaker JumpStart.^[14] The November 2025 funding announcement specifically credited customer integrations, including the low-code site builder Buildglare, the open-source coding-agent extension Kilo Code and the IDE-side assistant ProxyAI, as commercial signals supporting the round.^[2]^[15]

Significance

Mercury matters in two distinct ways. First, it operationalized at commercial scale a research idea (discrete diffusion for natural language) that had until then existed only as academic checkpoints such as SEDD and LLaDA.^[8]^[9] Second, it offers a different lever for cheap and fast inference than the techniques that dominate autoregressive serving stacks: speculative decoding, continuous batching and quantization all keep the next-token loop intact, whereas diffusion replaces that loop with a small number of parallel refinement passes.^[4]^[11] Menlo Ventures argued the bet was an attempt to do to autoregressive decoding what Transformers did to RNNs.^[11]

The practical pitch is latency-sensitive in-editor tooling. An inline code suggestion has to arrive quickly enough not to interrupt typing, and the 25-millisecond Copilot Arena latency fits that budget far better than the hundreds of milliseconds that mid-range autoregressive models consume per completion.^[3]^[4] Integrations such as Buildglare's apply-edit pipeline use Mercury precisely because parallel refinement is well suited to rewriting a section of code in place.^[15]

Limitations and criticisms

The publicly reported numbers come primarily from Inception's own blog, its technical report and the third-party Copilot Arena leaderboard. Some peer comparisons (notably the Claude 3.5 Haiku and GPT-4o Mini throughput figures of about 60 tokens per second) are based on Inception's own measurements and may not match what those vendors achieve on optimized serving stacks.^[3] On reasoning-heavy benchmarks the Mercury Coder report shows lower absolute scores than on HumanEval (LiveCodeBench at 17.0 for Mini and 25.0 for Small, BigCodeBench at 42.0 for Mini and 45.5 for Small), reflecting the smaller scale of the Coder variants relative to frontier reasoning models.^[4]

Mercury is also a hosted, closed model. Unlike LLaDA, which released weights and a paper, Mercury Coder weights are not publicly downloadable; access is via API or marketplace, and the architecture and training details published so far in the arXiv report are partial relative to a full system card.^[4]^[9]

Comparison with related diffusion language models

System	Origin	Type	Public weights	Notes
SEDD	Aaron Lou, Chenlin Meng, Stefano Ermon (Stanford), Oct 2023^[8]	Academic discrete diffusion (score entropy)	Yes (GitHub)	ICML 2024 Best Paper; introduced score-entropy loss^[8]
LLaDA (8B)	Renmin University / Ant Group, Feb 2025^[9]	Academic masked diffusion LLM	Yes	First open 8B-scale diffusion LLM, instruction-tuned^[9]
Mercury Coder (Mini, Small)	Inception Labs, Feb 26, 2025^[3]	Commercial diffusion code LLM	No (API only)	First commercial dLLM; 1,109 / 737 tok/s on H100^[4]
Mercury (chat)	Inception Labs, 2025 / refresh May 14, 2026^[6]	Commercial diffusion chat LLM	No	128K context; $0.25/$1.00 per 1M tokens^[6]^[12]
Mercury 2	Inception Labs, Feb 24, 2026^[13]	Commercial reasoning diffusion LLM	No	"First reasoning dLLM"; ~1,000 tok/s^[13]

The academic systems (SEDD, LLaDA) established that discrete diffusion could match autoregressive language modeling on perplexity and downstream tasks; Mercury is the engineering effort to push those ideas to commercial-grade throughput on a hosted product.^[8]^[9]^[4]

References

1. Maxwell Zeff, "Inception emerges from stealth with a new type of AI model", TechCrunch, 2025-02-26. https://techcrunch.com/2025/02/26/inception-emerges-from-stealth-with-a-new-type-of-ai-model/ (accessed 2026-05-21).
2. Charles Rollet, "Inception raises $50 million to build diffusion models for code and text", TechCrunch, 2025-11-06. https://techcrunch.com/2025/11/06/inception-raises-50-million-to-build-diffusion-models-for-code-and-text/ (accessed 2026-05-21).
3. Inception Labs, "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model", Inception Labs blog, 2025-02-26. https://www.inceptionlabs.ai/blog/introducing-mercury (accessed 2026-05-21).
4. Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov, "Mercury: Ultra-Fast Language Models Based on Diffusion", arXiv:2506.17298, 2025-06-17. https://arxiv.org/abs/2506.17298 (accessed 2026-05-21).
5. Inception Labs (via Business Wire), "Inception Raises $50M to Power Diffusion LLMs, Increasing LLM Speed and Efficiency by up to 10X and Unlocking Real-Time, Accessible AI Applications", Business Wire, 2025-11-06. https://www.businesswire.com/news/home/20251106570339/en/Inception-Raises-50M-to-Power-Diffusion-LLMs-Increasing-LLM-Speed-and-Efficiency-by-up-to-10X-and-Unlocking-Real-Time-Accessible-AI-Applications (accessed 2026-05-21).
6. Inception Labs, "The Next Step for dLLMs: Scaling up Mercury", Inception Labs blog, 2026-05-14. https://www.inceptionlabs.ai/blog/mercury-refreshed (accessed 2026-05-21).
7. Stefano Ermon, "Stefano Ermon (Stanford University faculty page)", Stanford Computer Science, 2026. https://cs.stanford.edu/~ermon/ (accessed 2026-05-21).
8. Aaron Lou, Chenlin Meng, Stefano Ermon, "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution", arXiv:2310.16834, 2023-10-25. https://arxiv.org/abs/2310.16834 (accessed 2026-05-21).
9. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li, "Large Language Diffusion Models (LLaDA)", arXiv:2502.09992, 2025-02-14. https://arxiv.org/abs/2502.09992 (accessed 2026-05-21).
10. Asif Razzaq, "Inception Labs Introduces Mercury: A Diffusion-Based Language Model for Ultra-Fast Code Generation", MarkTechPost, 2025-06-26. https://www.marktechpost.com/2025/06/26/inception-labs-introduces-mercury-a-diffusion-based-language-model-for-ultra-fast-code-generation/ (accessed 2026-05-21).
11. Menlo Ventures, "From the Lab to the Frontier: The Story Behind Inception", Menlo Ventures perspective, 2025-11-06. https://menlovc.com/perspective/from-the-lab-to-the-frontier-the-story-behind-inception/ (accessed 2026-05-21).
12. Inception Labs, "Models, Endpoints, and Pricing", Inception Platform documentation, 2026. https://docs.inceptionlabs.ai/get-started/models (accessed 2026-05-21).
13. Inception Labs (via Business Wire), "Inception Launches Mercury 2, the Fastest Reasoning LLM, 5x Faster Than Leading Speed-Optimized LLMs, with Dramatically Lower Inference Cost", Business Wire, 2026-02-24. https://www.businesswire.com/news/home/20260224034496/en/Inception-Launches-Mercury-2-the-Fastest-Reasoning-LLM-5x-Faster-Than-Leading-Speed-Optimized-LLMs-with-Dramatically-Lower-Inference-Cost (accessed 2026-05-21).
14. AWS Machine Learning Blog, "Mercury foundation models from Inception Labs are now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart", Amazon Web Services, 2025. https://aws.amazon.com/blogs/machine-learning/mercury-foundation-models-from-inception-labs-are-now-available-in-amazon-bedrock-marketplace-and-amazon-sagemaker-jumpstart/ (accessed 2026-05-21).
15. Inception Labs, "Buildglare: Accelerating Low-Code Web Development with Mercury Coder", Inception Labs blog, 2025. https://www.inceptionlabs.ai/blog/buildglare-and-inception (accessed 2026-05-21).
