Inception Labs
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,748 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,748 words
Add missing citations, update stale details, or suggest a clearer explanation.
Inception Labs (often referred to simply as Inception) is a Palo Alto, California-based artificial intelligence startup that commercializes diffusion language models (dLLMs) for text and code generation. Founded in 2024 by Stanford University computer science professor Stefano Ermon together with two of his former Ph.D. students, UCLA professor Aditya Grover and Cornell University professor Volodymyr Kuleshov, the company released Mercury Coder in February 2025, which it describes as the first commercial-scale diffusion-based large language model.[^1][^2] Inception positions Mercury as a faster, more compute-efficient alternative to mainstream autoregressive large language models such as those produced by OpenAI, Anthropic, and Google DeepMind.[^3]
Rather than generating text token-by-token from left to right as autoregressive transformer models do, Inception's models begin with a noisy or masked draft of the full response and iteratively refine multiple tokens in parallel through a denoising process inspired by image diffusion models.[^4] The company reports throughput in excess of 1,000 tokens per second on a single NVIDIA H100 GPU for its coding models, several times higher than the throughput of speed-optimized frontier models of comparable quality.[^4][^5] Inception raised a $50 million seed financing round, announced in November 2025, led by Menlo Ventures with participation from Mayfield, Innovation Endeavors, Microsoft's M12 fund, Snowflake Ventures, Databricks Investment and Nvidia's NVentures, along with angel investments from Andrew Ng and Andrej Karpathy.[^2][^6]
The company was co-founded by three computer science researchers who collaborated for nearly a decade on diffusion modeling and generative AI prior to launching Inception. Stefano Ermon, the company's chief executive officer, is an associate professor of computer science at Stanford University where he leads work on probabilistic generative modeling. He is widely credited as a co-inventor of modern diffusion models, the same family of techniques that underpin image and video generators such as Sora and Midjourney.[^7][^2] According to Menlo Ventures' partnership announcement, Ermon and his collaborators "pursued a contrarian hypothesis that diffusion models could match autoregressive approaches for language generation," eventually demonstrating that "a diffusion model matched the quality of autoregressive models on text generation and ran 10x faster."[^7]
Aditya Grover, a co-founder, is an assistant professor of computer science at the University of California, Los Angeles (UCLA). He completed his Ph.D. at Stanford under Ermon's supervision and is a co-author on several of the foundational research papers that Inception commercializes.[^1][^7] Volodymyr Kuleshov, the third co-founder, is an assistant professor of computer science at Cornell Tech, the Cornell University campus in New York City. He likewise completed his Ph.D. at Stanford under Ermon. Kuleshov leads an academic research group that has been particularly active in publishing the modern theory of masked and block diffusion language models.[^8][^9]
Beyond diffusion modeling, the founding team is associated with several broadly used techniques in modern generative AI. Members of the team are credited as co-inventors of Direct Preference Optimization (DPO), a popular alternative to reinforcement learning from human feedback for aligning language models, as well as work that contributed to Flash Attention.[^2][^7] Inception's broader research and engineering team draws from Stanford, UCLA, Cornell, Google DeepMind, Meta AI, Microsoft AI and OpenAI according to the company's website.[^10]
Inception's commercial models build on several years of academic work on discrete diffusion models for language. Three lines of research are particularly relevant to the company's products.
The first is SEDD (Score Entropy Discrete Diffusion), introduced in the 2024 paper "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution" by Aaron Lou, Chenlin Meng, and Stefano Ermon. The paper proposed score entropy, a loss that extends score matching to discrete spaces, and demonstrated that the resulting diffusion language model could substantially close the perplexity gap with autoregressive models, in some settings beating GPT-2 on generative perplexity while requiring far fewer function evaluations.[^11] On X (Twitter), Ermon described the work as "diffusion models finally bridging the gap with autoregressive models on language."[^12]
The second is MDLM (Masked Diffusion Language Models), introduced in "Simple and Effective Masked Diffusion Language Models" by Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Edgar Mariano Marroquin, Alexander M. Rush, Yair Schiff, Justin T. Chiu and Volodymyr Kuleshov, published at NeurIPS 2024. MDLM introduced a substitution-based parameterization that reduces the absorbing-state diffusion training objective to a mixture of classical masked language modeling losses, achieving state-of-the-art likelihoods among diffusion models and approaching autoregressive perplexity.[^9][^13]
The third is BD3-LMs (Block Discrete Denoising Diffusion Language Models), introduced in "Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models" by Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo and Volodymyr Kuleshov, accepted as an oral presentation at ICLR 2025. BD3-LMs decompose a token sequence into blocks and perform discrete diffusion within each block, interpolating between autoregressive and diffusion language models by tuning the block size and providing variance-reduced training and data-driven noise schedules.[^14][^15]
These three works, all closely tied to Ermon's Stanford group and Kuleshov's Cornell group, established that masked and block-wise discrete diffusion can match or approach autoregressive language model perplexity while supporting parallel inference and infilling. The Mercury technical report by Inception cites these foundations and notes that its production models adopt a Transformer denoiser trained with a masked diffusion objective at large scale.[^4]
Inception Labs was incorporated in Palo Alto, California in 2024.[^1][^16] According to Mayfield, the firm led an initial investment in the company in July 2024.[^16] At that time the company operated in stealth mode, and the precise size of the early-stage round was not publicly disclosed.[^1]
The company emerged from stealth on February 26, 2025, when it publicly unveiled Mercury Coder. TechCrunch's coverage that day noted that "the Mayfield Fund invested in the company," that Inception had secured "several customers, including unnamed Fortune 100 companies," and that Ermon declined to disclose specifics of the early funding.[^1] In April 2025, the company was selected to participate in the AWS Generative AI Accelerator cohort, according to publicly available announcements about that program.[^17]
On November 6, 2025, Inception announced a $50 million seed financing round led by Menlo Ventures. Participating investors included Mayfield (which had backed the company earlier), Innovation Endeavors, M12 (Microsoft's venture fund), Snowflake Ventures, Databricks Investment, and NVentures (Nvidia's venture capital arm), as well as angel investments from Andrew Ng and Andrej Karpathy.[^2][^6][^18] TechCrunch reported the figure as "$50 million in seed funding" and noted that the funding would support continued development of diffusion models for code and text and the scaling of the Mercury family.[^2] The company has not publicly disclosed its post-money valuation as part of the announcement.[^2]
Inception's first publicly available product was Mercury Coder, announced on February 26, 2025 as part of the company's emergence from stealth.[^3][^1] The model was released in two variants: Mercury Coder Mini and Mercury Coder Small. According to the company, Mercury Coder is "the first publicly available dLLM" and a coding-focused diffusion language model designed to deliver both high throughput and competitive code quality.[^3]
In Inception's launch blog post, Mercury Coder Mini was reported to achieve approximately 1,109 tokens per second of output throughput on an NVIDIA H100 GPU, while Mercury Coder Small was reported at approximately 737 tokens per second on the same hardware.[^3][^19] The company stated that these speeds are roughly 5-10x faster than speed-optimized autoregressive frontier models running in the 50-200 tokens per second range, and up to 20x faster than larger frontier models that produce fewer than 50 tokens per second.[^3]
On standard coding benchmarks reported in the company's launch materials and the subsequent Mercury technical report, Mercury Coder Small achieved 90.0 pass@1 on HumanEval and 76.2 on MultiPL-E, while Mercury Coder Mini achieved 88.0 on HumanEval and 74.1 on MultiPL-E. On fill-in-the-middle code tasks, Mercury Coder Small reached 84.8 and Mini reached 82.2, exceeding Codestral 2501's reported 82.5.[^19][^4] These results place Mercury Coder broadly in line with speed-optimized competitors such as GPT-4o Mini, Claude 3.5 Haiku, Gemini 2.0 Flash-Lite and Codestral on quality, while substantially exceeding them on throughput.[^4][^19]
In Copilot Arena, a human-evaluation platform that pits coding models against each other inside an IDE, Inception reported that Mercury Coder Mini ranked second overall while posting the lowest average latency of any participating model, at roughly 25 milliseconds.[^4][^19]
In June 2025 Inception's research team published a technical report titled "Mercury: Ultra-Fast Language Models Based on Diffusion" on arXiv, authored by Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha and the three founders Stefano Ermon, Aditya Grover and Volodymyr Kuleshov.[^4] The report describes Mercury Coder Mini and Small as Transformer-based denoisers trained on trillions of tokens on clusters of NVIDIA H100 GPUs, using a masked diffusion training objective and an iterative parallel decoding procedure at inference time.[^4]
According to the technical report, the models support a standard context length of 32,768 tokens, extensible up to 128,000 tokens.[^4] The "coarse-to-fine" decoding procedure starts from a fully masked or noised draft of the response and refines all positions in parallel across a small number of denoising steps, in contrast to autoregressive decoding which advances one token at a time.[^4]
Following Mercury Coder, Inception expanded the Mercury family to general-purpose chat. The company released a general chat-oriented version of Mercury, marketed simply as Mercury, in 2025.[^20] In its blog post introducing the general chat model, Inception reported that Mercury supports a 128,000-token context window, is OpenAI API-compatible, and is positioned for use cases including chat, summarization and general text generation at high concurrency.[^20]
On third-party benchmarking by Artificial Analysis, the company reported that Mercury matches the quality of speed-optimized frontier models such as GPT-4.1 Nano and Claude 3.5 Haiku while running over 7x faster, with throughput around 708 tokens per second versus roughly 96 tokens per second for GPT-4.1 Nano on the same evaluation.[^20] The model achieved 69 percent on MMLU-Pro, 85 percent on HumanEval and 83 percent on MATH-500 in Inception's reported numbers.[^20]
Inception also announced that Mercury would serve as the founding LLM partner for Microsoft's NLWeb project, a Microsoft Research initiative that allows publishers to add natural-language interfaces to their websites.[^20]
In late August 2025 Inception announced that Mercury and Mercury Coder were available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. AWS's announcement, dated August 27, 2025, highlighted that the models deliver "ultra-fast generation speeds of up to 1,100 tokens per second on NVIDIA H100 GPUs, up to 10 times faster than comparable models," support for fill-in-the-middle code generation and context lengths up to 128,000 tokens.[^21]
In addition to AWS Bedrock and SageMaker JumpStart, Inception lists availability via its own API platform, Azure AI Foundry, OpenRouter, Models.dev and Poe.[^22][^23] The company offers OpenAI-compatible API endpoints, enabling drop-in replacement of OpenAI client calls.[^23]
In early 2026 Inception introduced a refreshed and significantly upgraded generation of the Mercury family. Mercury Edit 2, a coding- and editing-focused diffusion LLM, was introduced on March 30, 2026.[^5] Mercury 2, branded by the company as "the fastest reasoning LLM" and described as the first reasoning-capable diffusion LLM, was announced on February 24, 2026 and discussed further in a May 14, 2026 company blog post.[^24][^5]
According to Inception's announcement, Mercury 2 includes larger parameter counts, higher-quality training data, an enhanced denoiser architecture, new training objectives and faster inference algorithms, with improvements across coding, math, instruction following and knowledge recall.[^5] In Inception's reported numbers, Mercury 2 delivers approximately 5x faster generation than leading speed-optimized reasoning models while reaching roughly 1,000 tokens per second of output throughput on standard benchmarks, compared with around 89 tokens per second for Claude Haiku 4.5 in reasoning mode and around 71 tokens per second for GPT-5 Mini in the same evaluation.[^24]
On quality, Inception reported that Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, 71.3 on IFBench, 67.3 on LiveCodeBench, 38.4 on SciCode and 52.9 on Tau2. According to the company, these scores place Mercury 2 in competitive range with Claude Haiku 4.5 and GPT-5.2 Mini on quality while delivering roughly an order of magnitude more throughput.[^24]
Pricing for Mercury 2 was set at $0.25 per million input tokens and $0.75 per million output tokens at launch, undercutting Gemini 3 Flash's $0.50 / $3.00 by half on input and four times on output, according to the company.[^24] Mercury Edit 2 launched at the same per-token pricing.[^5] Inception's web site as of May 2026 advertises 99.5 percent uptime and enterprise service-level agreements for both models.[^10]
The technical premise of Inception's products is that text generation can be reformulated as iterative parallel denoising rather than sequential next-token prediction. In the company's framing, traditional autoregressive transformer decoders such as GPT and LLaMA generate a response token by token, conditioning each step on the entire previously generated prefix. This makes inference fundamentally sequential and bounded by the latency of the single autoregressive step, regardless of how much GPU parallelism is available.[^4]
Diffusion language models invert this picture. Inception's models begin with a sequence of mask or noise tokens of a chosen length and apply a Transformer-based denoiser that, in each step, predicts a refined version of all tokens in parallel. The denoiser is iterated for a small number of steps until it converges on a fluent response. Because each step processes every position in parallel, the wall-clock latency is dominated by the number of denoising steps rather than the length of the output, and GPU throughput is much more effectively used.[^4]
The technical lineage is rooted in masked language modeling (which has long allowed bidirectional context) and in continuous-state diffusion models for images and video (which proved that iterative denoising can produce high-quality samples). The masked diffusion language model formulations developed in MDLM and BD3-LMs provide the theoretical grounding, while SEDD provides an alternative score-based perspective.[^9][^14][^11] Inception's Mercury technical report describes a production-scale implementation that combines these ideas with a Transformer denoiser, masked diffusion training and an inference-time parallel decoding procedure.[^4]
Inception's marketing and technical materials draw a direct contrast between Mercury and mainstream autoregressive large language models.[^3][^4]
On speed, the company reports that on standard NVIDIA H100 hardware, Mercury Coder Small and Mini deliver roughly 737 and 1,109 tokens per second respectively, that the general-purpose Mercury chat model delivers roughly 708 tokens per second, and that Mercury 2 reaches roughly 1,000 tokens per second even in reasoning mode. By comparison, speed-optimized autoregressive frontier models are typically reported in the 50-200 tokens per second range on similar hardware, and the largest autoregressive frontier models often produce fewer than 50 tokens per second.[^3][^4][^20][^24]
On cost, the company has launched Mercury 2 and Mercury Edit 2 at $0.25 per million input tokens and $0.75 per million output tokens, while earlier Mercury family members carried pricing around $0.25 per million input tokens and $1.00 per million output tokens. Inception positions these prices as competitive with or below the lowest-cost autoregressive APIs of similar quality.[^24][^5]
On quality, Inception reports that Mercury Coder is broadly competitive with GPT-4o Mini, Claude 3.5 Haiku, Gemini 2.0 Flash-Lite and Codestral on coding benchmarks, that Mercury matches GPT-4.1 Nano and Claude 3.5 Haiku on general chat benchmarks, and that Mercury 2 sits in competitive range of Claude Haiku 4.5 and GPT-5.2 Mini on reasoning benchmarks.[^4][^20][^24] These claims are based primarily on Inception's own reporting and benchmark partner Artificial Analysis; longer-running independent evaluations remain limited at the time of writing.[^20]
On deployment flexibility, diffusion language models offer some properties not native to autoregressive decoders. Because the denoiser conditions on the entire output context (including future positions), Mercury supports infilling and fill-in-the-middle code completion as a first-class operation rather than via specialized tokens.[^4] Because the number of denoising steps can be tuned, latency can be traded against quality in ways that left-to-right autoregressive decoding does not allow.[^4]
There are trade-offs as well. Diffusion language models historically have lagged the very best autoregressive systems on tasks that require long, free-form, open-ended generation, and the field's longest-running benchmarks are designed around autoregressive baselines. Independent academic evaluation of how well masked-diffusion language modeling scales relative to the best autoregressive models at trillion-token training budgets is still an open research area. Inception's published benchmarks indicate the gap on common benchmarks has been substantially narrowed, but the relative behavior of dLLMs on long-context reasoning, agentic tool use and emergent behaviors at frontier scale remains a subject of ongoing study.[^4][^11]
Press coverage of Inception's emergence from stealth in February 2025 emphasized the contrarian nature of the company's bet on diffusion for text. TechCrunch's coverage described Inception's approach as "a new type of AI model" and quoted Ermon noting that with traditional LLMs "you cannot generate the second word until you've generated the first one," whereas diffusion models can generate text in parallel.[^1] The New Stack characterized Mercury 2 as "10x faster than Claude, ChatGPT, Gemini" in its reasoning category.[^25] PYMNTS noted Inception's positioning as a Silicon Valley startup building faster LLMs.[^26]
Mayfield's portfolio announcement in 2025 framed Mercury as the first commercial-scale diffusion LLM that "runs 10x faster and at 1/10th the cost of traditional LLMs," and drew parallels between the platform shift Inception is pursuing and earlier technology transitions.[^16] Menlo Ventures, the lead investor in the November 2025 financing, described the founders' bet as "a contrarian hypothesis that diffusion models could match autoregressive approaches for language generation" and characterized Inception as commercializing diffusion-based reasoning at production scale.[^7]
Adoption signals include availability of Mercury and Mercury Coder on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart from August 2025, on Azure AI Foundry, on OpenRouter and Poe, and as the founding LLM partner for Microsoft's NLWeb project.[^20][^21][^22][^23] The technical report describes Fortune 100 deployments and high-throughput enterprise workloads, although Inception has generally not disclosed specific customer names.[^1][^4]
Inception's research output has also been cited by other diffusion language model efforts. The MDLM line of work, developed in Kuleshov's group, has been adopted as the basis for ByteDance's Seed Diffusion (described in third-party reporting as an industry-grade diffusion LLM) and for Nvidia's Genmol effort on molecular generation, according to Inception-associated public materials.[^8]
As of May 2026, Inception's publicly described product line consists of:
Inception's website lists availability through its own API, AWS Bedrock and Azure Foundry, an OpenAI-compatible interface, enterprise service-level agreements at 99.5 percent or higher uptime, and a research and engineering organization drawn from Stanford, UCLA, Cornell, Google DeepMind, Meta AI, Microsoft AI and OpenAI.[^10] The company continues to position diffusion language modeling as a platform shift on the order of the transition from recurrent neural networks to transformer models, with the founders publicly arguing that diffusion will become the default architecture for foundation-model inference where latency and cost matter.[^7][^10]