Deepgram
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,950 words
Deepgram is an American voice artificial intelligence company headquartered in San Francisco, California. Founded in 2015 by Scott Stephenson, Adam Sypniewski, and Noah Shutty, all of whom had backgrounds in experimental physics at the University of Michigan, the company builds and operates proprietary deep learning models for speech recognition, text-to-speech synthesis, and voice agent orchestration. Its commercial speech-to-text product line is anchored by the Nova family (Nova-1, Nova-2, and Nova-3), while its text-to-speech offering is branded Aura, with Aura-2 reaching general availability in April 2025. In June 2025 Deepgram released its Voice Agent API, a unified speech-to-speech interface that combines its own ASR and TTS engines with pluggable large language models. In January 2026 the company closed a $130 million Series C round at a $1.3 billion post-money valuation, becoming the first speech AI specialist to reach unicorn status during the current wave of voice AI investment.
Deepgram occupies a position in the natural language processing market that is distinct from both hyperscale cloud providers and pure transcription vendors. Unlike Google, Amazon, and Microsoft, which embed speech APIs into broader cloud platforms, Deepgram is structured as a speech-first model lab that trains end-to-end neural networks specifically for production deployments. Unlike consumer-facing transcription apps, the company has from its earliest days focused on the developer and enterprise channel, exposing its models through a REST and WebSocket API rather than a packaged user interface. As of early 2026, Deepgram reports that more than 200,000 developers and over 1,300 organizations build on its platform, with the company having cumulatively processed more than 50,000 years of audio and over one trillion words of transcribed speech.
Deepgram is a privately held Delaware corporation operating principally from San Francisco. Scott Stephenson serves as chief executive officer, with co-founder Adam Sypniewski as chief technology officer. The company's product lineup comprises three primary categories: streaming and pre-recorded speech-to-text APIs (the Nova family), real-time text-to-speech synthesis (the Aura family), and a higher-level Voice Agent API that combines both into a turn-taking conversational interface. Deepgram also operates a custom model team that fine-tunes its base models for specific verticals, including healthcare, financial services, contact centers, and the quick-service restaurant industry.
Deepgram's commercial model is API-first and consumption-based. Customers are billed per minute of audio processed (for transcription) or per character generated (for synthesis), with discounted rates for committed volume. The company offers cloud-hosted, virtual private cloud, and on-premises deployment modes, with the latter two targeted at regulated industries such as healthcare and finance where audio cannot leave a customer-controlled boundary. Nova-3 pricing begins at $0.0077 per minute for pre-recorded audio and $0.0058 per minute for streaming, while Aura-2 starts at $0.030 per 1,000 characters. The Voice Agent API is offered at a flat rate of $4.50 per hour of conversation.
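The consumption-based billing described above reduces to simple arithmetic. The following is a minimal sketch, assuming the list rates quoted in this section and ignoring committed-volume discounts; the function and constant names are illustrative and are not part of any Deepgram SDK.

```python
from decimal import Decimal

# List rates in USD as quoted in this article; treat them as a snapshot,
# not as a current price sheet.
NOVA3_PRERECORDED_PER_MIN = Decimal("0.0077")
NOVA3_STREAMING_PER_MIN = Decimal("0.0058")
AURA2_PER_1K_CHARS = Decimal("0.030")
VOICE_AGENT_PER_HOUR = Decimal("4.50")


def transcription_cost(minutes, streaming=False):
    """Cost of transcribing `minutes` of audio at the quoted Nova-3 rate."""
    rate = NOVA3_STREAMING_PER_MIN if streaming else NOVA3_PRERECORDED_PER_MIN
    return Decimal(str(minutes)) * rate


def synthesis_cost(characters):
    """Cost of synthesizing `characters` of text at the quoted Aura-2 rate."""
    return Decimal(characters) / Decimal(1000) * AURA2_PER_1K_CHARS


def voice_agent_cost(hours):
    """Flat hourly cost of a Voice Agent API conversation."""
    return Decimal(str(hours)) * VOICE_AGENT_PER_HOUR
```

Under these assumptions, `transcription_cost(1000)` evaluates to 7.70 USD for 1,000 minutes of pre-recorded audio, and `synthesis_cost(50000)` to 1.50 USD for 50,000 characters. `Decimal` is used instead of floats so that sub-cent rates do not accumulate rounding error.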
The story of Deepgram begins not in software but in an underground particle physics laboratory. Scott Stephenson and Noah Shutty were both graduate students in physics at the University of Michigan during the early 2010s, working on the China Dark Matter Experiment, an effort to detect weakly interacting massive particles (WIMPs) using cryogenic germanium detectors in the Jinping Underground Laboratory. Their experimental task involved sifting through enormous volumes of waveform data to identify rare, faint signatures of particle interactions buried in background noise. The technical problem of finding sparse, subtle patterns in long continuous waveforms turned out to be a close cousin of the problem of finding spoken words in audio recordings.
In parallel with their academic work, Stephenson and Shutty had been experimenting with wearable recording devices and had accumulated hundreds of hours of personal audio. They wanted to be able to search through this archive but found that the speech recognition tools available at the time, dominated by acoustic-model and language-model pipelines built on Gaussian mixture models and n-gram language models, were not accurate enough for unconstrained conversational audio. Stephenson realized that the deep learning waveform-analysis techniques he had been applying to dark matter signatures could be repurposed for speech. The bet was that an end-to-end neural network trained directly on raw audio would outperform the assembled pipelines that defined commercial ASR at the time.
The two founders, joined by fellow Michigan physicist Adam Sypniewski as a third co-founder, moved to the San Francisco Bay Area and incorporated the company in 2015. They applied to the Y Combinator accelerator and were accepted into the Winter 2016 batch, which provided initial seed funding and the network that would shape Deepgram's investor base over the following decade.
Deepgram's first commercial product was a general-purpose speech transcription API built on end-to-end deep neural networks. At the time of launch, the prevailing architectures in commercial ASR still relied on hybrid hidden Markov model and deep neural network systems, with separate pronunciation lexicons and language models. Deepgram took the more aggressive position that a single neural network trained on enough audio could learn the joint distribution of acoustics and language directly, eliminating the need for the older pipeline.
This approach was technologically defensible but commercially difficult to sell during the company's first several years. Enterprise buyers were accustomed to procurement processes that referenced word error rate benchmarks on clean speech datasets, where the gap between Deepgram and incumbent vendors was modest. The company's real advantage, robustness on noisy real-world audio in domains like call centers, voicemails, and multi-speaker meetings, did not show up clearly in those benchmarks. Stephenson has described the period from 2017 to 2021 as a slow process of finding customers who had measured ASR performance themselves and concluded that the standard benchmarks were inadequate proxies for their workloads.
Deepgram's early customer wins came from contact center operators, media monitoring firms, and a handful of government and research customers. NASA contracted with Deepgram to build a custom model for transcribing communications between Mission Control and the International Space Station, an unusually demanding audio domain because of bandwidth-limited radio links, technical jargon, and overlapping speakers. The NASA deployment became one of the company's most visible reference customers and helped establish credibility in adjacent regulated industries.
Deepgram has raised capital across seed, Series A, Series B, and Series C rounds. Its funding history is summarized below.
| Round | Date | Amount | Lead investor(s) | Notes |
|---|---|---|---|---|
| Seed | 2016 | Undisclosed | Y Combinator | W16 batch |
| Seed extension | 2018 | $1.8M | Compound (then Metamorphic Ventures) | |
| Series A | March 2020 | $12M | Wing VC | Tigris Partners, Y Combinator, SAP.iO also participated |
| Series B (first tranche) | Feb 2021 | ~$25M | Tiger Global | Initial close |
| Series B (extension) | Nov 2022 | $47M | Madrona | Brought total Series B to $72M; investors included Alkeon, BlackRock, In-Q-Tel, Citi Ventures, Nvidia, and SAP.iO |
| Series C | Jan 2026 | $130M | AVP | $1.3B post-money valuation; Twilio, ServiceNow Ventures, SAP, Princeville Capital, Alumni Ventures, University of Michigan, and Columbia University also participated |
Cumulative funding through the Series C exceeds $215 million. The Series B in particular was notable for the breadth of strategic investors, with the participation of In-Q-Tel signaling adoption by U.S. intelligence community customers and Nvidia's involvement reflecting the increasingly tight relationship between GPU vendors and speech model labs. Citi Ventures and SAP.iO returned to participate in subsequent rounds, foreshadowing the financial services and enterprise software customer expansion that followed.
The Series C announcement on January 13, 2026 also coincided with Deepgram's acquisition of OfOne, a Y Combinator-backed startup focused on AI-driven drive-thru ordering for quick-service restaurants. OfOne's technology became the foundation of a vertical-specific product called Deepgram for Restaurants, joining the existing nova-3-medical model in Deepgram's portfolio of domain-tuned speech systems.
Deepgram's technology stack is built around three core capabilities: automatic speech recognition, neural text-to-speech, and a voice agent orchestration layer. The company runs its own training infrastructure, develops its own model architectures (which it has not published in full academic detail), and serves models from a unified inference runtime called Deepgram Enterprise Runtime that is optimized for latency-sensitive real-time deployment.
The Nova model line represents the company's flagship product family. Nova models are trained end-to-end on a mixture of supervised audio-text pairs, weakly supervised audio with pseudo-labels, and large quantities of synthetic data generated to cover rare acoustic and linguistic conditions.
| Model | Released | Notable characteristics |
|---|---|---|
| Nova (Nova-1) | 2022 | First Nova generation. Trained on over 100 domains and 47 billion tokens. 22% WER reduction over the prior generation; 23 to 78 times faster than competing services. Starting price $0.0043 per minute. |
| Nova-2 | Nov 2023 | 18.4% relative WER improvement over Nova-1; 36.4% relative improvement over OpenAI Whisper Large. Improved punctuation by 22.6% and capitalization by 31.4%. |
| Nova-3 | Feb 2025 | First model to support live code-switching across 10 languages in a single stream. Median streaming WER of 6.84% across a 2,703-file benchmark. Introduced self-serve keyterm prompting and a dedicated nova-3-medical variant. |
Each Nova generation has emphasized a different axis of improvement. Nova-1 prioritized cost and throughput, an explicit response to the price-sensitivity of contact center workloads. Nova-2 narrowed the accuracy gap to research-quality models like Whisper while preserving Deepgram's latency advantage. Nova-3 turned attention to the long tail of the audio distribution, building a representation-learning framework that helped the training pipeline detect and target under-represented acoustic conditions in the corpus. Nova-3 also introduced live code-switching for ten supported languages without requiring the caller to indicate which language is being spoken.
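The Nova comparisons above are stated as relative word error rate (WER) changes, which are easy to confuse with absolute percentage-point differences. A minimal sketch of both quantities follows, using the standard WER definition (substitutions, deletions, and insertions divided by reference word count). The numbers in the usage note are hypothetical and chosen only to reproduce an 18.4% relative figure; they are not Deepgram's measured error rates.

```python
def word_error_rate(substitutions, deletions, insertions, reference_words):
    """Standard WER: edit errors divided by the number of reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer (0.184 means 18.4%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, a hypothetical model that moves from 10.0% WER to 8.16% WER shows an absolute gain of 1.84 percentage points but a relative reduction of about 18.4%, since `relative_wer_reduction(0.10, 0.0816)` is approximately 0.184.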
The Nova line is offered in both streaming and pre-recorded modes. Streaming is delivered over a WebSocket interface and returns incremental transcripts with median end-of-utterance latency under 300 milliseconds. Pre-recorded mode operates as a standard REST API and is typically used for batch transcription of recorded audio files. Both modes share the same underlying model weights, with streaming variants tuned to balance partial-hypothesis stability against responsiveness.
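As an illustration of the pre-recorded mode, the sketch below builds (but does not send) a REST request using only the Python standard library. The endpoint URL, the `Token` authorization scheme, the `model` query parameter, and the JSON body with a `url` field are assumptions that are not stated in this article; verify them against Deepgram's current API reference before use. The helper name `build_prerecorded_request` is hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for pre-recorded transcription; confirm in the API reference.
API_BASE = "https://api.deepgram.com/v1/listen"


def build_prerecorded_request(audio_url, api_key, model="nova-3"):
    """Build a POST request asking the service to transcribe a hosted audio file."""
    query = urllib.parse.urlencode({"model": model})
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Streaming mode differs in kind rather than in parameters: the client holds a WebSocket open, sends audio frames as they are captured, and receives incremental transcripts on the same connection.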
Deepgram entered the text-to-speech market in 2024 with the original Aura model, positioning it as a real-time TTS engine specifically engineered for use inside voice agent loops where latency and consistency matter more than the dramatic prosody of entertainment-oriented voices. The launch was explicit about this positioning, framing Aura as the missing complementary piece to Nova for builders who wanted to assemble a complete voice stack without crossing vendor boundaries.
Aura-2 followed on April 15, 2025 and significantly expanded the product. It offered more than 40 professional-grade English voice personas at launch, with Spanish voices added in June 2025, and is engineered for time-to-first-byte latency under 200 milliseconds in streaming mode. Aura-2 is built on the same Deepgram Enterprise Runtime that serves Nova, allowing the speech-in and speech-out paths to share infrastructure and latency budgets. Deepgram has presented public listening test results in which listeners preferred Aura-2 roughly 60% of the time over competing services from ElevenLabs, Cartesia, and OpenAI in enterprise scenarios such as appointment confirmation, customer support, and order-taking.
The Voice Agent API, announced in preview during 2024 and made generally available on June 16, 2025, is Deepgram's highest-level product. It exposes a single WebSocket interface that handles bidirectional audio: the developer streams microphone audio in, and Aura-2 synthesized speech streams back out. Internally, the API combines speech-to-text via Nova-3, turn-taking and barge-in detection, function calling, large language model orchestration, and text-to-speech via Aura-2.
A distinguishing design choice is that the LLM stage is pluggable. Developers can let Deepgram orchestrate the conversation using a default model selection, or they can configure the API to call out to their own LLM endpoint, including hosted models from third parties. This allows enterprises that have already built around a specific model family to retain that choice while still benefiting from Deepgram's tightly integrated ASR and TTS pipeline. The API also supports mid-session control: prompts, voices, and even model selections can be changed during a single ongoing conversation without tearing down the connection.
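The pluggable LLM stage and mid-session control can be pictured with a toy model of one conversational turn. This is a conceptual sketch only: `VoiceAgentSession`, its fields, and the stub functions are hypothetical names invented for illustration, and nothing here reflects the actual message format or internals of the Voice Agent API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stage signatures: each stage is just a callable that can be swapped.
ThinkFn = Callable[[List[str]], str]


@dataclass
class VoiceAgentSession:
    transcribe: Callable[[bytes], str]       # stands in for the ASR stage
    think: ThinkFn                           # pluggable LLM stage
    synthesize: Callable[[str, str], bytes]  # stands in for TTS: (text, voice)
    voice: str = "default-voice"
    history: List[str] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one turn: audio in, transcript, LLM reply, audio out."""
        user_text = self.transcribe(audio)
        self.history.append(f"user: {user_text}")
        reply = self.think(self.history)
        self.history.append(f"agent: {reply}")
        return self.synthesize(reply, self.voice)

    def update(self, *, think: Optional[ThinkFn] = None,
               voice: Optional[str] = None) -> None:
        """Swap the LLM or voice mid-session without discarding history."""
        if think is not None:
            self.think = think
        if voice is not None:
            self.voice = voice


# Stub stages so the sketch runs without any network access.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str, voice: str) -> bytes:
    return f"[{voice}] {text}".encode("utf-8")


def echo_llm(history: List[str]) -> str:
    return "You said: " + history[-1].split(": ", 1)[1]
```

The design point is that the session object owns the connection state and conversation history, while the LLM is just one replaceable callable. Calling `update(think=...)` between turns changes the model without resetting `history`, which mirrors the mid-session changes described above.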
In benchmark results that Deepgram published alongside the GA launch, the Voice Agent API achieved a 6.4% lower task error rate than OpenAI's competing real-time voice product and a 29.3% lower task error rate than an ElevenLabs Conversational AI configuration on a third-party scenario benchmark. The product is priced at $4.50 per hour of conversation, which Deepgram positions as a single all-in figure for ASR, TTS, and orchestration.
In addition to its general-purpose Nova and Aura models, Deepgram operates a custom model practice that fine-tunes base models for specific verticals. The first publicly named vertical variant is nova-3-medical, a fine-tune of Nova-3 on clinical dictation and medical terminology that targets healthcare scribe and ambient clinical documentation use cases. Following the OfOne acquisition in January 2026, Deepgram for Restaurants packages domain-tuned ASR with operational integrations for drive-thru and quick-service restaurant deployments. Additional vertical fine-tunes are available under custom contract for financial services, insurance, and air traffic control applications.
Deepgram's customer base spans large enterprises, mid-market application companies, and developer-driven startups. Named enterprise customers include Citi (financial services), Twilio (communications platform), Spotify (media), Optum (healthcare), Jack in the Box (quick-service restaurants), Aircall, OpenPhone, Vapi, and Groq. NASA uses Deepgram to transcribe communications between Mission Control and the International Space Station.
Use cases concentrate in several broad categories:
The speech AI market in 2026 contains a small number of model labs that train and operate their own production speech systems, alongside the hyperscale cloud providers that offer commodity ASR APIs as part of broader platforms. Deepgram is most directly compared to AssemblyAI, Speechmatics, and the speech APIs offered by Google, Amazon, and Microsoft on the transcription side, and to ElevenLabs and OpenAI on the synthesis side.
| Provider | STT product | TTS product | Voice agent product | Headquarters | Notes |
|---|---|---|---|---|---|
| Deepgram | Nova-3 | Aura-2 | Voice Agent API | San Francisco, USA | API-first, end-to-end neural; both STT and TTS in-house |
| AssemblyAI | Universal-3 | None first-party | LeMUR + partner TTS | San Francisco, USA | STT specialist with audio intelligence overlay |
| Speechmatics | Ursa | None first-party | None | Cambridge, UK | Multilingual STT specialist, strong on accent coverage |
| OpenAI | Whisper Large v3 | TTS-1 / GPT-4o realtime | Realtime API | San Francisco, USA | Open-weights Whisper plus closed Realtime API |
| ElevenLabs | Scribe | v3 | Conversational AI | London / New York | TTS-led, expanded into STT and voice agents |
| Google Cloud | Speech-to-Text | Cloud TTS / Chirp HD | Dialogflow CX | Mountain View, USA | Cloud-platform bundled, breadth over specialization |
| Amazon Web Services | Transcribe | Polly | Lex | Seattle, USA | Cloud-platform bundled |
| Microsoft Azure | Speech to Text | Neural TTS | Azure AI Speech / Bot Service | Redmond, USA | Cloud-platform bundled, tight Office integration |
Deepgram's distinctive position in this landscape rests on several attributes. It is one of only a few independent vendors that train and operate both first-party STT and first-party TTS at production scale, in contrast to AssemblyAI and Speechmatics (STT-only) and ElevenLabs (TTS-led with derivative STT). It pursues a price-per-minute model that targets the high-volume contact center and developer market, undercutting the per-minute rates of the hyperscale cloud providers while preserving margin through model efficiency. Its emphasis on streaming latency, with sub-300 millisecond end-of-utterance latencies on the STT side and sub-200 millisecond time-to-first-byte on the TTS side, is engineered specifically for the conversational agent use case rather than for offline batch transcription.
A second axis of comparison is depth of accent and language coverage. Speechmatics has historically led on accented English and on languages outside the top ten by speaker volume, drawing on its long heritage as a UK-based speech research organization. Deepgram's coverage has expanded substantially with Nova-3, which supports live code-switching across English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch in a single audio stream. For workloads dominated by call center English with occasional Spanish, Deepgram's code-switching capability is often the more usable feature; for workloads in less-resourced languages, Speechmatics retains an edge.
The comparison with Whisper is structurally different because Whisper is distributed as open weights. Many organizations self-host Whisper Large v3 for cost or data residency reasons. Deepgram's competitive argument against self-hosted Whisper is that, once one accounts for the cost of GPU infrastructure, inference engineering, model maintenance, and the latency penalty of running an offline-oriented model in a streaming setting, the all-in cost of a managed API is often lower. Nova-3 has been benchmarked at up to 36% lower word error rate than Whisper Large v3 on select datasets according to Deepgram's testing, although direct comparisons depend heavily on the evaluation corpus.
Deepgram has not published its model architectures in the level of detail that would be expected of an academic research lab. Its public communications emphasize empirical performance on the company's own production benchmarks rather than novel architectural contributions. This reflects a deliberate positioning: the company sells access to deployed models, not papers, and its competitive moat is built around proprietary training data, custom data augmentation pipelines, and inference engineering rather than around a single architectural innovation.
That said, several aspects of Deepgram's approach are publicly known. The company trains end-to-end on raw audio, in contrast to the older hybrid HMM-DNN pipelines that dominated commercial ASR through the mid-2010s. It uses large quantities of weakly labeled and synthetic data alongside high-quality labeled corpora, an approach that resembles the self-training and pseudo-labeling techniques described in academic ASR literature. For Nova-3, the company introduced a representation-learning framework that compresses audio into a latent embedding space and uses that representation to identify under-represented acoustic conditions, allowing the training pipeline to specifically target the long tail of audio difficulty.
Deepgram's inference engineering is similarly proprietary but has been described publicly in terms of throughput targets. The company claims that its models are 23 to 78 times faster than competing services on per-GPU throughput, and that the Voice Agent API achieves end-to-end speech-to-speech latencies low enough to support natural turn-taking in human-AI conversation. The Enterprise Runtime that serves the models is designed to run identically in Deepgram's cloud, in customer-managed virtual private clouds, and in on-premises deployments, which is a meaningful operational distinction in regulated industries.
Industry coverage has generally framed Deepgram as one of the strongest independent specialists in the speech AI segment. TechCrunch, Built In, and SiliconANGLE have characterized the company as a leading challenger to hyperscale cloud speech APIs, particularly in the contact center and voice agent verticals. Coverage of the Series C round in January 2026 highlighted three themes: the rapid growth of voice agent workloads, Deepgram's positioning as a fully vertical speech stack rather than a transcription-only vendor, and the strategic implications of the OfOne acquisition as a template for vertical packaging.
Some critique from the broader speech AI community has focused on the company's relatively limited public research output, which contrasts with the more open posture of organizations like OpenAI (which released Whisper as open weights) and Speechmatics (which has historically published research papers). A second area of critique concerns vendor-run benchmarks: published comparisons between Nova models and competing services use Deepgram-selected audio datasets, and the magnitude of accuracy gaps in those comparisons is generally larger than what independent evaluations show. As with most large-scale ASR systems, Nova models also exhibit accuracy variation across speaker accents, with the strongest performance on majority American and British English and progressively weaker performance on under-represented accents.
Following the Series C and the OfOne acquisition, Deepgram has signaled three near-term strategic priorities. The first is continued investment in the Voice Agent API as the company's highest-margin and fastest-growing product, including expansion of the supported language set and deeper integration with major LLM providers. The second is vertical packaging on the model of Deepgram for Restaurants, with healthcare, financial services, and contact center bundles as likely candidates for similar treatment. The third is geographic expansion, with the Series C capital intended in part to fund hiring outside North America and to support European data residency requirements for regulated customers.
The company has also signaled continued model investment, with successor generations to Nova-3 and Aura-2 already in development. Public statements have emphasized improvements in non-English language coverage, lower latency for streaming TTS, and tighter integration of the speech and language stages of the voice agent loop, including exploration of more deeply integrated speech-to-speech models that would shorten the path from input audio to output audio inside the runtime.