Deepgram Nova-3 is the third-generation automatic speech recognition (ASR) model developed by Deepgram, a San Francisco-based voice AI company. The model reached general availability on February 12, 2025, and is designed for both real-time streaming and batch transcription workloads. Nova-3 introduced several capabilities that were novel to the commercial ASR market at the time of release, including live code-switching across ten languages in a single audio stream, self-serve keyterm prompting without model retraining, and a median streaming word error rate (WER) of 6.84% across a 2,703-file benchmark spanning ten production domains. The model is accessed through Deepgram's cloud API and has been adopted by enterprise customers in healthcare, financial services, contact centers, and fast-food drive-thru automation.
Deepgram was founded in 2015 by Scott Stephenson, Noah Shutty, and Adam Sypniewski, all of whom had backgrounds in physics research at the University of Michigan. Stephenson and Shutty had originally built wearable recording devices as a life-logging experiment and found themselves unable to search or index hundreds of hours of recorded audio with the speech recognition technology available at the time. That friction pointed toward a commercial opportunity. The three founders moved to the San Francisco Bay Area and went through Y Combinator in the Winter 2016 batch.
Rather than building on top of existing ASR systems, Deepgram trained end-to-end deep learning models on diverse audio from the start. This approach differed from older pipeline-based ASR systems, which passed audio through separate acoustic models, pronunciation dictionaries, and language models in sequence. By training a single model end-to-end, Deepgram's architecture could optimize directly for the transcription task and generalize more readily to new audio conditions.
The company raised a $72 million Series B round in November 2022, with participation from Madrona, Alkeon, BlackRock, Tiger Global, Wing VC, Citi Ventures, SAP.iO, In-Q-Tel, Nvidia, and Y Combinator. In January 2026, Deepgram announced a $130 million Series C round led by AVP at a $1.3 billion valuation, with existing investors rejoining and new participants including Twilio, ServiceNow Ventures, SAP, and Alumni Ventures. The company also acquired OfOne, a Y Combinator-backed startup specializing in AI-powered drive-thru ordering, as part of that round.
As of early 2026, Deepgram reported having processed more than 50,000 cumulative years of audio, surpassed 1 trillion words transcribed, and counted more than 200,000 developers building on its APIs. Named enterprise customers include Citi, Twilio, Spotify, NASA, Kore.ai, and Jack in the Box.
Deepgram's Nova model line represents its primary commercial ASR product family. The original Nova model (sometimes called Nova-1 to distinguish it from successors) was released in 2022. It was trained on data spanning more than 100 domains and 47 billion tokens, which Deepgram described at launch as the most extensive training of any commercial ASR model then available. Nova achieved a 22% WER reduction over its predecessor and delivered inference speeds 23 to 78 times faster than competing services at the time, starting at $0.0043 per minute.
Nova-2 followed with a focus on pushing accuracy further and broadening language coverage. Compared to Nova-1, Nova-2 delivered an 18.4% relative WER reduction and reached an overall median WER of 8.4% across tested domains. It also brought a 36.4% relative improvement over OpenAI Whisper Large, improved punctuation accuracy by 22.6%, and reduced capitalization error rates by 31.4%. Nova-2 supported English for both streaming and pre-recorded modes at launch, with multilingual expansion coming later. Pricing held at $0.0043 per minute for pre-recorded audio.
Nova-3 built on this progression by targeting accuracy at the tails of the distribution, which meant investing specifically in challenging acoustic conditions, multilingual audio, and domain-specific terminology.
Deepgram has not published a detailed technical paper describing Nova-3's architecture, but the company has shared information about the core design choices that distinguish it from Nova-2.
The central architectural innovation is a sophisticated audio embedding framework based on representation learning. During training, Nova-3 compresses audio into a dense latent space and uses that representation to identify under-represented acoustic conditions in the training set. This approach allows the model to detect where it lacks sufficient data and target those regions specifically with additional examples. Deepgram describes the goal as ensuring the model has been exposed to the full diversity of real-world audio rather than simply interpolating from common, clean-speech examples.
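Deepgram has not released code for this framework, but the general technique the description implies, embedding a corpus and flagging low-density regions of the latent space as data gaps, can be illustrated with a short, self-contained sketch. The embeddings below are random stand-ins for real audio embeddings, and every name is hypothetical:

```python
# Generic sketch of embedding-based data-gap detection (not Deepgram's actual code).
# A real pipeline would compute `embeddings` by running each audio clip through an
# encoder; random vectors are used here so the example runs end to end.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 256))   # stand-in for encoder output per clip

# Local density proxy: mean distance to the k nearest neighbors. Clips whose
# neighborhoods are unusually sparse sit in under-represented acoustic regions.
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
dists, _ = nn.kneighbors(embeddings)
sparsity = dists[:, 1:].mean(axis=1)          # drop self-distance in column 0

threshold = np.quantile(sparsity, 0.99)       # flag the sparsest 1% of clips
gap_indices = np.where(sparsity > threshold)[0]
print(f"{len(gap_indices)} clips flagged for targeted data collection")
```

Clips flagged this way would then guide where additional training examples are collected or synthesized, which is the behavior Deepgram describes at a high level.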
Training proceeded in multiple stages. The first stage combined synthetic code-switched audio generated at large scale with curated real-world multilingual datasets. Generating synthetic code-switched data was necessary because labeled audio in which speakers naturally switch between languages mid-sentence is rare and expensive to collect. The second stage used advanced audio-text alignment techniques to create adversarial training examples: cases where the audio is ambiguous or the correct transcription is counterintuitive, so that the model learns to handle edge cases rather than relying on statistical priors. A third stage applied targeted data augmentation specifically for specialized long-tail vocabulary, covering medical terminology, financial jargon, and alphanumeric sequences that appear infrequently in general training corpora.
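Deepgram has not described its generation method, but the basic idea behind synthesizing code-switched training text, splicing material from monolingual sources so that the language changes mid-utterance, can be sketched as a toy illustration (the spliced text would still need paired audio, for example via TTS; none of this reflects Deepgram's actual pipeline):

```python
# Toy sketch of code-switched text synthesis from monolingual phrase pools.
# Real systems would use aligned corpora and linguistic constraints on switch
# points; this only illustrates the shape of the data being produced.
import random

def splice_code_switched(en_phrases, es_phrases, switch_prob=0.4):
    """Build one synthetic code-switched utterance from two phrase pools."""
    out, lang = [], "en"
    for _ in range(random.randint(3, 6)):
        pool = en_phrases if lang == "en" else es_phrases
        out.append(random.choice(pool))
        if random.random() < switch_prob:      # switch languages mid-utterance
            lang = "es" if lang == "en" else "en"
    return " ".join(out)

print(splice_code_switched(
    ["I need to check", "the account balance", "before Friday"],
    ["por favor", "la cuenta corriente", "muchas gracias"],
))
```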
Nova-3 maintains a unified model architecture rather than routing audio through separate language-specific models. This design choice is what enables real-time code-switching: the model does not need to switch between language models when a speaker changes languages; it handles the transition within a single inference pass. The tradeoff is that building one model that generalizes well across dozens of languages requires significantly more training data and compute than building separate per-language models.
A specialized variant, nova-3-medical, is available for English-language healthcare transcription and is fine-tuned on medical terminology and clinical dictation patterns.
Deepgram evaluated Nova-3 on a benchmark dataset of 2,703 audio files representing 81.69 hours of recorded audio across ten production domains: air traffic control, conversational AI, drive-thru ordering, finance, medical, meeting, phone call, podcast, video and media, and voicemail.
The benchmark results are as follows:
| Mode | Nova-3 WER | Next-best competitor | Improvement |
|---|---|---|---|
| Streaming | 6.84% | 14.92% | 54.2% lower |
| Batch | 5.26% | 10.00% | 47.4% lower |
Nova-3 also achieves up to 36% lower WER than OpenAI Whisper Large V3 on select datasets, according to Deepgram's testing.
Compared directly to Nova-2, the improvement in streaming WER is from 9.09% to 6.84%, a relative reduction of approximately 25%. Batch WER dropped from 8.4% to 5.26%, a relative reduction of about 37%.
Deepgram attributes the gains in part to Nova-3's targeted treatment of acoustic tail conditions. In domains with challenging audio such as drive-thru (high background noise, speaker-to-microphone distance, overlapping speech) and air traffic control (rapid numeric sequences, non-standard pronunciation conventions), models trained on general audio tend to fail disproportionately. By augmenting training data specifically for these conditions, Nova-3 narrows the performance gap between clean-speech domains and noisy real-world deployments.
Numeric recognition also improved noticeably. Nova-3 handles sequences of numbers, identifiers, and entity strings (common in financial, medical, and logistics transcription) more accurately than Nova-2. Punctuation and paragraph structuring for English improved as well, as did word-level timestamp precision.
Nova-3 is offered in two operational modes: streaming (real-time) and pre-recorded (batch).
In streaming mode, Nova-3 receives audio over a WebSocket connection and returns partial transcripts incrementally as the audio arrives. Latency from the end of a spoken phrase to the return of a stable transcript is below 300 milliseconds for the majority of requests. This latency profile is suitable for voice agent applications where the conversational loop requires a rapid response: the ASR layer must return a transcript quickly enough for the downstream language model to generate a reply and the text-to-speech layer to synthesize audio before the pause in conversation becomes awkward. Deepgram claims this makes Nova-3 up to 40 times faster than competing models that include speaker diarization.
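A minimal streaming sketch against Deepgram's documented /v1/listen WebSocket endpoint follows, using the third-party websockets package. The query parameters, control message, and response fields follow Deepgram's public documentation, but exact names should be verified against current docs; the API key and file name are placeholders:

```python
import asyncio
import json

import websockets  # pip install websockets

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder
URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true"
)

async def stream_file(path: str) -> None:
    # `extra_headers` is the keyword in websockets <= 13; newer releases
    # renamed it to `additional_headers`.
    async with websockets.connect(
        URL, extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    ) as ws:

        async def sender() -> None:
            with open(path, "rb") as f:
                while chunk := f.read(3200):   # ~100 ms of 16 kHz, 16-bit mono PCM
                    await ws.send(chunk)
                    await asyncio.sleep(0.1)   # pace the upload at roughly real time
            await ws.send(json.dumps({"type": "CloseStream"}))  # documented control message

        async def receiver() -> None:
            async for message in ws:
                result = json.loads(message)
                alts = result.get("channel", {}).get("alternatives", [])
                if result.get("is_final") and alts and alts[0].get("transcript"):
                    print(alts[0]["transcript"])

        await asyncio.gather(sender(), receiver())

asyncio.run(stream_file("call_16khz_mono.raw"))
```

Interim results arrive as the audio streams in; the `is_final` flag marks transcript segments that will no longer be revised, which is what a voice agent waits on before invoking the LLM.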
In batch mode, Nova-3 processes pre-recorded audio files and returns complete transcripts at higher accuracy. Batch transcription is appropriate for workflows like meeting summarization, podcast captioning, call center quality assurance, and compliance archiving. The lower WER of batch mode (5.26% vs. 6.84% for streaming) reflects the ability to use longer context windows and bidirectional processing when real-time constraints are relaxed. Processing speed in batch mode is approximately 30 to 33 seconds of compute per hour of audio, which allows Deepgram to deliver results far faster than real-time even for long files.
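Batch transcription is a single HTTPS request. A minimal sketch with the requests library, again following the documented /v1/listen REST endpoint (the response-field path is an assumption to check against current docs; the key and URL are placeholders); swapping the model parameter to nova-3-medical would select the healthcare variant described earlier:

```python
import requests

# Pre-recorded (batch) transcription of a remote file via Deepgram's REST API.
resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "nova-3", "smart_format": "true"},
    headers={"Authorization": "Token YOUR_API_KEY"},
    json={"url": "https://example.com/meeting.wav"},
    timeout=600,
)
resp.raise_for_status()
# Response layout per Deepgram's docs: results -> channels -> alternatives.
transcript = resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)
```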
At launch in February 2025, Nova-3 supported streaming code-switching across ten languages: English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. Code-switching refers to the natural phenomenon of speakers alternating between languages within a conversation or even within a single sentence. Before Nova-3, commercial ASR systems generally required callers to stay in one language or required the application to detect the language in advance and select a different model. Nova-3 handles these transitions within a single inference pass.
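In API terms, this is a parameter change rather than a model swap. Per Deepgram's documentation, setting the language parameter to "multi" (rather than a specific code such as "en" or "es") enables the code-switching mode, though the exact value should be verified against current docs:

```python
# Hedged sketch: "language=multi" is Deepgram's documented setting for
# Nova-3 live code-switching across the supported languages.
params = {"model": "nova-3", "language": "multi", "smart_format": "true"}
```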
The multilingual variant of Nova-3 has been expanded in subsequent updates to include a broader set of languages. Beyond the original ten, languages supported across Nova-3 models include Arabic (with regional variants including ar-AE and ar-SA), Bengali, Bulgarian, Catalan, Chinese (Cantonese and Mandarin in both simplified and traditional scripts), Croatian, Czech, Danish, Estonian, Finnish, Greek, Gujarati, Hebrew, Hungarian, Indonesian, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Norwegian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and all major English regional variants.
An update in mid-2025 added ten languages: Greek, Romanian, Slovak, Catalan, Lithuanian, Latvian, Estonian, Flemish, Swiss German, and Malay. This expansion also introduced multilingual keyterm prompting, extending the keyterm feature to non-English vocabulary. The update showed WER reductions exceeding 20% for several languages, including Malay, Romanian, and Slovak. A notable pattern in the multilingual data is that streaming models outperformed batch models for roughly half the supported languages, an unusual result that Deepgram attributes to the training curriculum for real-time audio.
Performance on non-European languages such as Hindi and Japanese tends to run slightly higher in WER than on European languages, a reflection of relative data scarcity. In noisy conditions (0 dB SNR), Japanese WER has been reported to rise from around 4.8% to 11.9% and Portuguese from 3.9% to 9.7%, illustrating the sensitivity of multilingual models to background noise.
Keyterm prompting is a self-serve customization feature that allows developers to pass up to 500 tokens (roughly 100 words or key terms) in an API request to improve recognition of specific vocabulary. Common use cases include brand names, product names, drug names, technical abbreviations, employee names, and regional slang. When keyterms are provided, Nova-3 increases the likelihood of transcribing those terms correctly without any model retraining.
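In practice, terms are passed as repeated query parameters on the transcription request. A hedged sketch follows: "keyterm" is the parameter Deepgram documents for Nova-3 (the earlier Nova-2-era feature used "keywords"), while the terms, key, and URL below are illustrative placeholders:

```python
import requests

# Repeating the "keyterm" key passes multiple terms in one request.
params = [
    ("model", "nova-3"),
    ("smart_format", "true"),
    ("keyterm", "Deepgram"),
    ("keyterm", "metoprolol"),
    ("keyterm", "dexmedetomidine"),
]
resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params=params,
    headers={"Authorization": "Token YOUR_API_KEY"},
    json={"url": "https://example.com/clinic_dictation.wav"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```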
The feature is significant for enterprise deployments because custom vocabulary training previously required submitting audio and text datasets to Deepgram and waiting for a fine-tuned model to be built. Keyterm prompting shifts that process to request time: a developer can inject a new term in seconds without coordination with Deepgram's model team.
Deepgram has reported that a veterinary customer achieved a 625% improvement in keyterm recognition accuracy for medical terminology using this feature. In the multilingual expansion update, keyterm prompting was extended to all supported languages, so enterprises with global operations can load domain-specific vocabulary in whichever language is active in the conversation.
Nova-3 is available through Deepgram's self-serve API under two plans:
| Plan | Nova-3 Monolingual Streaming | Nova-3 Monolingual Pre-Recorded | Nova-3 Multilingual Streaming | Nova-3 Multilingual Pre-Recorded |
|---|---|---|---|---|
| Pay As You Go | $0.0048/min | $0.0077/min | $0.0058/min | $0.0092/min |
| Growth ($4K+/yr) | $0.0042/min | $0.0065/min | $0.0050/min | $0.0078/min |
New accounts receive $200 in free credits. Enterprise pricing requires a sales conversation and can include private deployments, volume discounts, and committed usage contracts.
For comparison, Nova-2 was priced at $0.0061 per minute for streaming (in beta) and $0.0043 per minute for pre-recorded audio. Nova-3's pre-recorded rate of $0.0077 per minute is therefore roughly 26 to 79% higher, depending on which Nova-2 rate is taken as the baseline, though its monolingual streaming rate of $0.0048 per minute sits below Nova-2's beta streaming price. Nova-3 at $0.0077 per minute for pre-recorded audio is still cheaper per minute than comparable cloud provider offerings from Google, AWS, and Azure. Deepgram's internal analysis claims Nova-3 offers a cost advantage of more than 2x versus those platforms while delivering a 53% lower WER.
For a workload of 1,000 hours per month, costs under the Pay As You Go plan work out to approximately $288 for monolingual streaming and $552 for multilingual pre-recorded.
Deepgram released Aura-2, its second-generation text-to-speech model, in April 2025, roughly two months after Nova-3. Aura-2 is designed as a companion to Nova-3 for complete voice agent pipelines, covering the synthesis side of a conversation while Nova-3 handles transcription.
Aura-2 delivers a time-to-first-byte latency of under 200 milliseconds and a real-time factor of 0.111x (synthesizing one second of audio in roughly 110 milliseconds). Deepgram benchmarked Aura-2 against ElevenLabs, Azure, Google, Cartesia, PlayHT, and OpenAI TTS on latency and found Aura-2 outperformed all of them on TTFB.
The model offers over 40 English voices with localized accents covering American, British, Australian, Irish, Filipino, and other variants. Voice design for Aura-2 prioritized clarity and business appropriateness rather than expressiveness or emotional range, making the voices well-suited to customer service, IVR, healthcare intake, and financial services interactions. Domain-specific pronunciation is built in, covering drug names, legal references, alphanumeric identifiers, dates, times, and currency values.
Pricing for Aura-2 is $0.030 per 1,000 characters, which Deepgram positions as cheaper than Cartesia Sonic ($0.038) and ElevenLabs Flash ($0.050). In a preference study across 2,794 pairwise comparisons, Aura-2 voices were chosen approximately 60% of the time in customer service scenarios against competing systems.
Aura-2 supports cloud, VPC, and on-premises deployment through Deepgram Enterprise Runtime, which is relevant for healthcare and financial services customers with data residency or compliance requirements.
Nova-3 occupies the ASR component in most modern real-time voice agent stacks, where three services operate in a low-latency loop: an ASR model transcribes audio, a large language model generates a response, and a TTS model synthesizes speech. Each component's latency compounds, so the ASR layer typically needs to return a result in well under 500 milliseconds to keep the overall conversational latency under two seconds.
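As a rough illustration of how the budget compounds (the figures below are hypothetical, not measurements of any particular stack):

```python
# Hypothetical per-turn latency budget for a voice agent. An ASR stage that
# returns a stable transcript in under ~300 ms leaves room for the LLM and
# TTS stages while keeping the total under the ~2-second comfort threshold.
budget_ms = {
    "asr_final_transcript": 300,
    "llm_first_token": 600,
    "llm_remaining_tokens": 700,
    "tts_first_byte": 200,
    "network_overhead": 150,
}
print(sum(budget_ms.values()), "ms total")  # 1950 ms
```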
Vapi is a developer-focused platform for building and deploying voice agents. It abstracts the orchestration of ASR, LLM, and TTS services so developers can configure a voice agent through an API or dashboard rather than building the pipeline from scratch. Vapi added Nova-3 to its transcriber options on February 13, 2025, the day after its general availability, allowing its users to switch from Nova-2 to Nova-3 with a single configuration change. As of 2026, Vapi handles over 62 million calls per month for enterprise customers. Deepgram is listed as a named partner on Vapi's platform documentation.
Retell AI provides voice agent infrastructure with a focus on call center and healthcare automation. Retell AI integrates Deepgram for its STT layer. Healthcare and financial services customers using Retell have reported workloads in the range of tens of thousands of calls monthly. The platform's compliance monitoring and real-time analytics capabilities complement Nova-3's accuracy improvements in regulated verticals.
Pipecat is an open-source Python framework for building real-time multimodal AI pipelines, maintained by Daily (now Pipecat AI). It provides a modular pipeline where STT, LLM, and TTS services are connected as composable services. Pipecat includes native DeepgramSTTService and DeepgramTTSService implementations that use Deepgram's WebSocket API for streaming audio. Nova-3 is listed as a supported STT model in Pipecat documentation and has been used in production examples including an AWS-hosted patient outreach demo that streams audio through an AWS SageMaker endpoint to Nova-3. Pipecat is popular with voice AI developers because its open-source nature allows full customization and self-hosting, and Deepgram's self-hosted Enterprise Runtime can run alongside a Pipecat deployment for organizations with strict data handling requirements.
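Pipecat's published examples wire the Deepgram service up in a few lines. A sketch in that style follows; the import path and constructor signature vary across Pipecat releases, so treat the details as assumptions to verify against the installed version:

```python
import os

from deepgram import LiveOptions                      # from the deepgram-sdk package
from pipecat.services.deepgram import DeepgramSTTService

# Constructor arguments follow the style of Pipecat's published examples.
stt = DeepgramSTTService(
    api_key=os.environ["DEEPGRAM_API_KEY"],
    live_options=LiveOptions(model="nova-3", language="en-US"),
)
# In a full agent, `stt` sits in the pipeline between the transport input and
# the LLM, e.g. Pipeline([transport.input(), stt, llm, tts, transport.output()]).
```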
Nova-3 is also available on AWS Marketplace as a hosted streaming endpoint, enabling procurement through existing cloud vendor agreements. It appears in the STT configuration options for other voice orchestration frameworks including LiveKit Agents and Bland AI. Companies like Groq have used Deepgram in their voice AI product demonstrations, pairing Nova-3 with Groq's fast LLM inference to minimize conversational latency.
| Provider | Model | Streaming WER | Batch WER | Latency | Price (streaming) | Multilingual |
|---|---|---|---|---|---|---|
| Deepgram | Nova-3 | 6.84% | 5.26% | <300ms | $0.0048/min | Yes (live code-switching) |
| AssemblyAI | Universal-2 | ~8% | ~6.8% | ~400ms | $0.0032/min | Limited |
| OpenAI | Whisper Large V3 | ~10-12% | ~7.4% | High (batch-optimized) | Free (self-host) | Yes (separate models) |
| OpenAI | GPT-4o Transcribe | ~2.5% | ~2.5% | Moderate | Higher per-min cost | Yes |
| Google | Cloud Speech v2 | ~9-10% | ~8% | ~400ms | $0.016/min | Yes |
| AWS | Transcribe | ~10-12% | ~9% | ~500ms | $0.024/min | Limited |
| Azure | Speech Services | ~9-11% | ~8% | ~400ms | $0.016/min | Yes |
Note: WER figures from independent benchmarks vary depending on the audio domain tested. Figures above reflect best-available published data from third-party comparisons.
Whisper is an open-source speech recognition model released by OpenAI in September 2022 that is free to run self-hosted. Whisper Large V3 remains widely used because it requires no per-minute payment when self-hosted, performs well on clean English speech, and has broad community support.
Nova-3 outperforms Whisper on accuracy in nearly all real-world conditions where Deepgram has published comparisons. In a comparative study of 200+ audio samples across seven languages, Deepgram found evaluators preferred Nova-3 over Whisper at ratios up to 8-to-1 on certain languages. The accuracy gap widens further in noisy environments, where Whisper's lack of specific noise-condition training becomes more apparent.
Where Whisper retains a competitive advantage is on cost, flexibility, and accuracy on clean long-form English audio. Self-hosting Whisper on a modern GPU costs a fraction of Nova-3's per-minute API rate. Whisper also supports more languages through the open-source community's fine-tuned variants. For English-only podcast transcription, clean meeting recordings, or media archiving, Whisper Large V3 or the newer open-source Whisper variants are credible alternatives that may outperform Nova-3 on specific clean-audio tasks even while losing on noisy or multilingual ones.
AssemblyAI Universal-2 was released in late 2024 and is AssemblyAI's flagship ASR model. Universal-2 emphasizes conversational accuracy, proper noun recognition, and downstream intelligence features such as speaker diarization, sentiment analysis, entity detection, and summarization bundled with the transcription output.
In WER benchmarks, Universal-2 and Nova-3 are competitive. AssemblyAI's published benchmark data shows Universal-2 achieving a 24% improvement in proper noun accuracy and 30% fewer hallucinations compared to Whisper Large V3. Deepgram's benchmark data shows Nova-3 at 6.84% streaming WER compared to Universal-2's reported 8.1% on English streaming benchmarks, though the two companies use different test sets, making direct comparison imprecise.
The practical distinction between the two models often comes down to what surrounds the transcription. AssemblyAI bundles intelligence features in its API responses, making it a single-call solution for applications that need transcription plus summaries, topics, or sentiment. Deepgram's focus is tighter: highly accurate, low-latency transcription with self-serve vocabulary customization, leaving downstream processing to the application layer or to a separate LLM call. For real-time voice agent applications, Nova-3's lower latency and higher streaming accuracy give it an advantage. For asynchronous content analysis, Universal-2's bundled intelligence features may reduce overall integration complexity.
Several customer deployments have been documented publicly:
Citi uses Deepgram for financial services transcription, a domain where accurate recognition of ticker symbols, numeric sequences, and regulatory terminology is particularly valuable. Twilio, which provides programmable communications infrastructure, integrates Deepgram into its voice AI offerings and also participated in Deepgram's Series C funding round. Spotify has used Deepgram for podcast-related audio processing. NASA has used Deepgram in research contexts, consistent with the air traffic control domain appearing in Nova-3's benchmark. Jack in the Box and the OfOne acquisition positioned Deepgram directly in the fast-food drive-thru automation market, where background noise, regional accents, and speed of service are central challenges that Nova-3 was specifically designed to address.
Vapi, cited as both a customer and a distribution partner, embedded Nova-3 across its voice agent platform on launch day, giving Nova-3 immediate exposure to Vapi's developer user base.
Developer reception to Nova-3 has been generally positive, with most commentary focusing on two areas: the accuracy improvement in noisy and multilingual conditions, and the keyterm prompting feature.
The real-time code-switching capability received particular attention from developers building global customer service applications, who had previously needed to maintain separate ASR pipelines for different language regions. The ability to handle code-switching in a single stream reduces both infrastructure complexity and per-call latency.
Deepgram received the 2025 Voice AI Technology Excellence Award from CUSTOMER Magazine, an industry publication covering contact center and CX technology.
On pricing, some independent comparisons noted that Nova-3's pre-recorded price is roughly 1.8 times that of Nova-2, and that alternatives like Soniox offer competitive accuracy at lower per-minute rates for developers with tight cost constraints. The pricing increase is consistent with the pattern of differentiated enterprise-grade API models commanding a premium over predecessor versions.
Some developers have also observed that the 100-keyterm limit per request can be restrictive for organizations with very large custom vocabularies, and that non-European languages such as Hindi and Japanese show measurably higher WER than European languages in both clean and noisy conditions.
A third-party review rated Nova-3 at 4.5 out of 5 stars, describing it as "the most accurate real-time STT tested" while noting the cost premium compared to open-source alternatives and the WER gap for certain non-European languages.
Nova-3 has several documented limitations:
Background noise degrades multilingual accuracy significantly. At 0 dB signal-to-noise ratio (heavy background noise), Japanese WER has been reported to rise from around 4.8% to 11.9% and Portuguese from 3.9% to 9.7%, relative increases of over 100%. Language detection confidence drops when noise is high, sometimes requiring application-level confirmation prompts.
Domain-specific terminology in non-English languages is less well-covered than English. Healthcare deployments have reported 15 to 20% lower accuracy on medical terms in non-English languages, and financial call transcription shows elevated WER on non-English legal and financial jargon.
The keyterm prompting feature is capped at approximately 100 words per request (500 tokens), which may be insufficient for organizations with very large proprietary terminology sets. Exceeding this limit requires a custom model engagement.
Nova-3 is an API-hosted model. While an enterprise on-premises runtime exists, the primary product is cloud-hosted, which creates data residency considerations for certain industries and geographies.
No public technical paper has been published describing Nova-3's architecture in detail, which limits independent reproducibility of results and makes it harder for researchers to understand where and why the model succeeds or fails.
The concurrent streaming request limit on the self-serve tier is 1,200 requests, which can be a constraint for organizations with sudden traffic spikes.