Mercury (Inception Labs)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,415 words
Mercury is a family of commercial-scale diffusion-based large language models developed by Inception Labs, a Palo Alto startup co-founded in 2024 by Stanford professor Stefano Ermon together with his former students Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell).[^1][^2] The first model in the family, Mercury Coder, was unveiled on February 26, 2025 and was promoted by the company as the world's first commercial-grade diffusion LLM (dLLM).[^3][^1] Mercury generates text by iteratively denoising blocks of tokens in parallel rather than predicting one token at a time. According to the company, this yields throughput of more than 1,000 tokens per second on a single NVIDIA H100, more than ten times the rate it measured for speed-optimized autoregressive baselines such as GPT-4o Mini and Claude 3.5 Haiku.[^3][^1][^4] A technical report describing the system, "Mercury: Ultra-Fast Language Models Based on Diffusion," was posted to arXiv in June 2025.[^4]
| Field | Value |
|---|---|
| Developer | Inception Labs (Inception AI, Inc.) |
| Co-founders | Stefano Ermon, Aditya Grover, Volodymyr Kuleshov[^1][^2] |
| Headquarters | Palo Alto, California[^1][^2] |
| First public model | Mercury Coder (Mini and Small)[^3][^4] |
| Initial release | February 26, 2025[^3][^1] |
| Architecture | Diffusion language model with Transformer backbone[^4] |
| Reported throughput (H100) | 1,109 tokens/s (Mini), 737 tokens/s (Small)[^4] |
| Technical report | arXiv:2506.17298 (June 17, 2025)[^4] |
| Seed funding | $50 million, led by Menlo Ventures (November 2025)[^1][^5] |
| Listed price (Mercury) | $0.25 / 1M input tokens; $1.00 / 1M output tokens[^6] |
Diffusion models had become the dominant paradigm for image and video generation by the early 2020s. Ermon's group at Stanford was one of the originators of the family, with score-matching and score-based generative modeling foundations that fed into systems such as Stable Diffusion and DDPM.[^7] Extending diffusion to discrete text was a research priority for the same group: in October 2023, Aaron Lou, Chenlin Meng and Ermon released the SEDD (Score Entropy Discrete Diffusion) paper, which introduced a score-entropy loss for discrete diffusion and reported language-modeling perplexities competitive with autoregressive GPT-2.[^8] SEDD was recognized as an ICML 2024 Best Paper.[^8] A separate line of academic work culminated in February 2025 with the LLaDA (Large Language Diffusion) model from Renmin University and Ant Group, an 8-billion-parameter masked diffusion LLM trained from scratch.[^9]
Mercury Coder appeared in the same month as LLaDA but was positioned differently: it was a hosted commercial product targeting production code-completion workloads, not an academic checkpoint.[^3][^10]
Ermon founded Inception Labs in 2024, "tapping two former students" to co-lead the company: Aditya Grover (Assistant Professor at UCLA) and Volodymyr Kuleshov (Assistant Professor at Cornell Tech).[^1][^2] All three held tenure-track university positions and had collaborated on generative modeling research for more than a decade.[^11] Menlo Ventures described the trio as scientists who "left to commercialize the technology, recognizing that scaling diffusion language models to production required compute, capital, and engineering capacity that simply didn't exist inside a university."[^11]
The company's early backing came from Mayfield Fund, which led an undisclosed initial round before the public launch.[^2] On November 6, 2025, Inception announced a $50 million seed round led by Menlo Ventures, with participation from Mayfield, Innovation Endeavors, Microsoft's M12, Snowflake Ventures, Databricks Investment and Nvidia's NVentures; angel investors included Andrew Ng and Andrej Karpathy.[^1][^5]
Inception emerged from stealth on February 26, 2025 with a blog post titled "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model," accompanied by simultaneous coverage in TechCrunch and other outlets.[^3][^1] The initial release comprised two code-focused variants, Mercury Coder Mini and Mercury Coder Small, available through a hosted playground at chat.inceptionlabs.ai (run in partnership with Lambda Labs) and through an API endpoint for enterprise customers.[^3]
Ermon framed the launch as a paradigm change in inference speed. Speaking to TechCrunch, he explained that conventional autoregressive LLMs are inherently sequential because "you cannot generate the second word until you've generated the first one," whereas a dLLM emits and refines an entire block of tokens at once.[^1] An Inception spokesperson told TechCrunch that "our 'small' coding model is as good as GPT-4o mini while more than 10 times as fast" and that the company already had unnamed Fortune 100 customers seeking lower-latency AI.[^1]
The launch post reported 1,109 tokens per second for Mercury Coder Mini and 737 tokens per second for Mercury Coder Small on a single NVIDIA H100, speeds the company described as previously achievable only on custom inference silicon.[^3][^4]
The Mercury models keep a standard Transformer backbone but replace autoregressive next-token prediction with a discrete diffusion training objective. According to the arXiv technical report, the models are "trained to predict multiple tokens in parallel," and generation proceeds "in a coarse-to-fine way" through iterative denoising of a noised block of tokens.[^4][^3] Inputs and outputs are token sequences exactly as in a conventional LLM, so Mercury is, in the authors' words, "a drop-in replacement for a typical autoregressive LLM."[^3]
Unlike a left-to-right sampler that emits one token per forward pass, a Mercury sampling step updates all (or many) positions in the output window simultaneously. The number of denoising iterations is configurable; fewer iterations trade quality for additional speed, and more iterations recover quality at the cost of more compute per request. The technical report frames this as a tunable speed/quality lever that does not exist in pure autoregressive decoders.[^4]
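The step-count lever described above can be illustrated with a toy sampler. This is a minimal sketch of generic confidence-based parallel unmasking, not Inception's actual algorithm, which the source does not specify at this level of detail. The names `parallel_denoise` and `make_toy_denoiser` are hypothetical, and the "denoiser" is a stand-in that always proposes the right token, so the sketch shows only how the number of forward passes scales with the step budget, not the quality side of the trade-off.

```python
import math
from typing import Callable, List, Tuple

MASK = "<mask>"
Prediction = Tuple[str, float]  # (proposed token, confidence in [0, 1])


def parallel_denoise(
    length: int,
    denoiser: Callable[[List[str]], List[Prediction]],
    num_steps: int,
) -> Tuple[List[str], int]:
    """Fill `length` masked slots using at most `num_steps` denoiser calls."""
    tokens = [MASK] * length
    calls = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = denoiser(tokens)  # one forward pass scores every position
        calls += 1
        # Commit enough positions that the block is complete by the last step.
        quota = math.ceil(len(masked) / (num_steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in most_confident[:quota]:
            tokens[i] = preds[i][0]
    return tokens, calls


def make_toy_denoiser(target: List[str]) -> Callable[[List[str]], List[Prediction]]:
    """Hypothetical stand-in for a trained network (assumption, not a real model).

    It always proposes the target token, with confidence falling left to right.
    """
    def denoiser(tokens: List[str]) -> List[Prediction]:
        n = len(target)
        return [(target[i], 1.0 - i / n) for i in range(n)]
    return denoiser


target = "def add ( a , b ) : return a + b".split()  # 12 tokens
out, calls = parallel_denoise(len(target), make_toy_denoiser(target), num_steps=4)
print(" ".join(out), "| forward passes:", calls)
```

With a budget of 4 steps the 12-token block is filled in 4 forward passes, three positions per pass. A one-token-per-pass decoder would need 12 passes for the same output.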
The headline benchmark scores and throughput figures reported in the arXiv paper and the Inception blog are:
| Model | HumanEval | MBPP | MultiPL-E | EvalPlus | Throughput (H100) |
|---|---|---|---|---|---|
| Mercury Coder Mini | 88.0 | 77.1 | 74.1 | 78.6 | 1,109 tok/s |
| Mercury Coder Small | 90.0 | 76.6 | 76.2 | 80.4 | 737 tok/s |
[^4][^3][^10]
For context, the same launch post reported Claude 3.5 Haiku at roughly 61 tokens per second and GPT-4o Mini at roughly 59 tokens per second on equivalent measurements, an order-of-magnitude gap that Inception attributes to parallel decoding rather than to a custom accelerator.[^3]
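The size of that gap can be checked with simple arithmetic. The sketch below uses only the throughput figures quoted in this article, all of which are Inception's own reported measurements. The helper names `speedup` and `seconds_for` are illustrative, and the time estimate assumes generation time is simply tokens divided by throughput, ignoring time-to-first-token and network overhead.

```python
# Tokens per second as reported by Inception; not independently measured.
reported_tps = {
    "Mercury Coder Mini": 1109,
    "Mercury Coder Small": 737,
    "Claude 3.5 Haiku": 61,
    "GPT-4o Mini": 59,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of the two reported throughputs."""
    return reported_tps[fast] / reported_tps[slow]


def seconds_for(model: str, num_tokens: int) -> float:
    """Idealized generation time: tokens / throughput, with no fixed overhead."""
    return num_tokens / reported_tps[model]


for mercury in ("Mercury Coder Mini", "Mercury Coder Small"):
    for baseline in ("Claude 3.5 Haiku", "GPT-4o Mini"):
        print(f"{mercury} vs {baseline}: {speedup(mercury, baseline):.1f}x")

print(f"1,000 tokens on Mercury Coder Mini: {seconds_for('Mercury Coder Mini', 1000):.2f} s")
print(f"1,000 tokens on GPT-4o Mini: {seconds_for('GPT-4o Mini', 1000):.2f} s")
```

On these figures the ratios range from about 12x (Small against either baseline) to nearly 19x (Mini against GPT-4o Mini).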
The Mercury technical report evaluates the two Coder variants on a standard suite that includes HumanEval, MBPP, EvalPlus, MultiPL-E (a multi-language translation of HumanEval covering C++, Java, JavaScript, PHP, Bash and TypeScript), Fill-in-the-Middle, LiveCodeBench and BigCodeBench.[^4][^10] On HumanEval, Mercury Coder Small scores 90.0 (matching the GPT-4o Mini result quoted alongside it), while Mercury Coder Mini ties with the company's reported Claude 3.5 Haiku score of 88.0.[^3][^4] On MultiPL-E, Mercury Coder Small reaches 76.2 and on EvalPlus it reaches 80.4.[^4][^10]
The Fill-in-the-Middle setting, which Inception emphasizes as the workload most representative of real code-completion traffic, is where the diffusion architecture excels: Mercury Coder Small reports 84.8 average accuracy, ahead of Mistral's Codestral 2501 at 82.5.[^4][^10]
An independent evaluation came from Copilot Arena, the third-party live code-LLM benchmarking platform. The launch post and the technical report both note that Mercury Coder Mini placed second on user-preference quality while having the lowest measured latency, with an average completion latency of 25 milliseconds, roughly four times faster than GPT-4o Mini.[^3][^4]
The February 2025 launch included two sizes targeted at edit-style and completion-style coding workloads. Mini emphasizes throughput at roughly 1,100 tokens per second and is the workhorse used in Copilot Arena's apply-edit evaluations; Small trades some throughput for higher accuracy on HumanEval and MultiPL-E.[^3][^4]
Inception subsequently expanded the family. A general-chat Mercury was introduced after the Coder launch, and on May 14, 2026 the company posted "The Next Step for dLLMs: Scaling up Mercury," an upgrade with reported improvements in coding, instruction following, mathematical problem solving and knowledge recall, priced at $0.25 per million input tokens and $1.00 per million output tokens.[^6] Inception's documentation lists Mercury with a 128K-token context window.[^12] On February 24, 2026, Inception announced Mercury 2, described as "the fastest reasoning LLM" and the first reasoning dLLM, with reported throughput around 1,000 tokens per second versus roughly 89 tokens per second for Claude Haiku reasoning and roughly 71 for GPT-5 Mini on the company's benchmarks.[^13]
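As a rough illustration of what the listed Mercury prices mean per request, the sketch below applies the $0.25 and $1.00 per-million-token rates quoted above. The function name `request_cost_usd` and the example token counts are hypothetical, and the calculation assumes billing is strictly linear in token counts with no minimums or discounts, which the source does not confirm.

```python
from decimal import Decimal

# Listed Mercury prices quoted in this article (USD per 1M tokens).
INPUT_USD_PER_MILLION = Decimal("0.25")
OUTPUT_USD_PER_MILLION = Decimal("1.00")
MILLION = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Linear cost estimate; assumes no minimum charge or volume discount."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / MILLION


# Hypothetical request: an 8,000-token prompt with a 500-token completion.
cost = request_cost_usd(8_000, 500)
print(f"Estimated cost: ${cost:.4f}")
```

Under these assumptions the example request costs a quarter of a cent, with the prompt accounting for most of it despite the lower input rate.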
Mercury is delivered through several channels. Inception runs its own API platform and the chat.inceptionlabs.ai playground (Lambda Labs hosted), and the model is also routed through OpenRouter and Models.dev.[^6][^3] Mercury foundation models were added to Amazon Bedrock Marketplace and Amazon SageMaker JumpStart.[^14] The November 2025 funding announcement specifically credited customer integrations including the low-code site builder Buildglare, the open-source coding-agent extension Kilo Code and the IDE-side assistant ProxyAI as commercial signals supporting the round.[^1][^15]
Mercury matters in two distinct ways. First, it operationalized at commercial scale a research idea (discrete diffusion for natural language) that had until then existed only as academic checkpoints such as SEDD and LLaDA.[^8][^9] Second, it offers a different lever for cheap and fast inference than the techniques that dominate autoregressive serving stacks: speculative decoding, continuous batching and quantization all keep the next-token loop intact, whereas diffusion replaces that loop with a small number of parallel refinement passes.[^4][^11] Menlo Ventures argued the bet was an attempt to do to autoregression what Transformers did to RNNs.[^11]
The practical pitch is latency-sensitive in-editor tooling. The 25-millisecond latency measured on Copilot Arena fits the time budget of an inline code suggestion that should not interrupt typing, whereas mid-range autoregressive models take hundreds of milliseconds per completion.[^3][^4] Integrations such as Buildglare's apply-edit pipeline use Mercury precisely because parallel refinement is well suited to rewriting a section of code in place.[^15]
The publicly reported numbers come primarily from Inception's own blog, its technical report and the third-party Copilot Arena leaderboard; some peer comparisons (notably the Claude 3.5 Haiku and GPT-4o Mini throughput figures of about 60 tokens per second) are quoted by Inception against its own measurements and may not match what those vendors achieve on optimized serving stacks.[^3] On reasoning-heavy benchmarks the Mercury Coder report shows lower absolute scores than on HumanEval (LiveCodeBench at 17.0 for Mini and 25.0 for Small, BigCodeBench at 42.0 for Mini and 45.5 for Small), reflecting the smaller scale of the Coder variants relative to frontier reasoning models.[^4]
Mercury is also a hosted, closed model. Unlike LLaDA, which released weights and a paper, Mercury Coder weights are not publicly downloadable; access is via API or marketplace, and the architecture and training details published so far in the arXiv report are partial relative to a full system card.[^4][^9]
| System | Origin | Type | Public weights | Notes |
|---|---|---|---|---|
| SEDD | Aaron Lou, Chenlin Meng, Stefano Ermon (Stanford), Oct 2023[^8] | Academic discrete diffusion (score entropy) | Yes (GitHub) | ICML 2024 Best Paper; introduced score-entropy loss[^8] |
| LLaDA (8B) | Renmin University / Ant Group, Feb 2025[^9] | Academic masked diffusion LLM | Yes | First open 8B-scale diffusion LLM, instruction-tuned[^9] |
| Mercury Coder (Mini, Small) | Inception Labs, Feb 26, 2025[^3] | Commercial diffusion code LLM | No (API only) | First commercial dLLM; 1,109 / 737 tok/s on H100[^4] |
| Mercury (chat) | Inception Labs, 2025 / refresh May 14, 2026[^6] | Commercial diffusion chat LLM | No | 128K context; $0.25/$1.00 per 1M tokens[^6][^12] |
| Mercury 2 | Inception Labs, Feb 24, 2026[^13] | Commercial reasoning diffusion LLM | No | "First reasoning dLLM"; ~1,000 tok/s[^13] |
The academic systems (SEDD, LLaDA) established that discrete diffusion could match autoregressive language modeling on perplexity and downstream tasks; Mercury is the engineering effort to push those ideas to commercial-grade throughput on a hosted product.[^8][^9][^4]