Amazon Nova Sonic
Last reviewed: Jun 8, 2026
Sources: 6 citations
Review status: Source-backed
Revision: v1 · 1,556 words
Amazon Nova Sonic is a real-time speech-to-speech foundation model developed by Amazon and offered through Amazon Bedrock. Announced on April 8, 2025, it is part of the Amazon Nova family of foundation models. Unlike conventional voice stacks that chain together separate automatic speech recognition (ASR), large language model (LLM), and text-to-speech (TTS) components, Nova Sonic unifies speech understanding and speech generation in a single model. It accepts spoken audio as input and produces spoken audio as output natively, which lets it model the prosody, tone, and timing of conversation, reduce latency, and handle natural turn-taking and interruptions.[1][2]
Amazon positions Nova Sonic as infrastructure for building conversational voice agents, with built-in tool use (function calling) and retrieval so that a voice application can fetch real-time information, query proprietary data, or take action in external systems while talking with a user.[1][3] Components of Nova Sonic already power Alexa+, Amazon's upgraded generative AI assistant.[3]
Nova Sonic is delivered on Amazon Bedrock through a new bidirectional streaming API, InvokeModelWithBidirectionalStream, built on the HTTP/2 protocol with an event-driven design that interleaves input and output audio streams in real time.[1] The model ID is amazon.nova-sonic-v1:0. At launch it was available in the US East (N. Virginia) AWS Region and had to be enabled through the Bedrock console's model access page.[1]
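The event-driven, interleaved design can be illustrated without any AWS calls. The sketch below is a toy model only: `fake_model`, `send_audio`, `receive_audio`, and `run_session` are hypothetical names, plain `asyncio` queues stand in for the HTTP/2 stream, and none of it reflects the actual Bedrock SDK surface. It shows a sender and a receiver running concurrently, so output events can arrive while input is still being sent.

```python
import asyncio


async def fake_model(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Hypothetical stand-in for the model: answers each input chunk as it arrives."""
    while True:
        chunk = await inbound.get()
        if chunk is None:              # end-of-input sentinel
            await outbound.put(None)   # propagate end-of-output
            return
        await outbound.put(f"reply-to-{chunk}")


async def send_audio(inbound: asyncio.Queue, chunks: list, log: list) -> None:
    """Push input chunks one at a time, pausing briefly like a microphone would."""
    for chunk in chunks:
        log.append(f"sent {chunk}")
        await inbound.put(chunk)
        await asyncio.sleep(0.01)      # assumed capture cadence, purely illustrative
    await inbound.put(None)


async def receive_audio(outbound: asyncio.Queue, log: list) -> None:
    """Consume output events until the end-of-output sentinel."""
    while True:
        event = await outbound.get()
        if event is None:
            return
        log.append(f"received {event}")


async def run_session(chunks: list) -> list:
    """Run sender, model, and receiver concurrently and return the event log."""
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    log: list = []
    await asyncio.gather(
        fake_model(inbound, outbound),
        send_audio(inbound, chunks, log),
        receive_audio(outbound, log),
    )
    return log


if __name__ == "__main__":
    for line in asyncio.run(run_session(["a", "b", "c"])):
        print(line)
```

In a request/response API the caller would finish sending before reading anything back; here the log mixes "sent" and "received" entries, which is the property the bidirectional stream is meant to provide.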
Because the model processes voice end to end rather than passing text between separate subsystems, Amazon describes it as capable of adapting its delivery to the speaker's tone and style, gracefully handling user interruptions (barge-in), and remaining robust to background noise. The model also produces real-time text transcripts of both sides of a conversation alongside the generated speech, which is useful for logging, analytics, and downstream automation.[1]
Nova Sonic is the speech-focused member of the broader Amazon Nova family, a set of foundation models that Amazon first introduced on December 3, 2024, at AWS re:Invent. The family spans several modalities.[4]
Within this lineup, Nova Sonic is the component that gives the family a native voice interface, complementing the text, image, and video models rather than replacing them.[1][4]
A traditional voice assistant pipeline runs in three stages: an ASR model transcribes the user's speech to text, an LLM reasons over that text and writes a reply, and a TTS model reads the reply aloud. Each stage adds latency, and information about how something was said (emotion, emphasis, hesitation) is typically discarded at the ASR step, making the resulting conversation feel brittle and stilted.[1][2]
Nova Sonic collapses these stages into one unified model architecture that maps speech input directly to speech output. By keeping the audio signal in a single model, it can take acoustic context into account when deciding what to say and how to say it, supporting more natural and expressive spoken dialogue with low latency.[1] On top of this core, Amazon exposes capabilities aimed at production voice agents, including tool use (function calling), retrieval-augmented generation (RAG), interruption handling, and live transcription.[1][3]
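On the application side, tool use generally means the model names a function and supplies arguments, and the application runs it and returns the result. The sketch below shows that dispatch step under loud assumptions: the request and response dict shapes, the `tool` decorator, `handle_tool_request`, and `get_order_status` are all hypothetical, and the event format Nova Sonic actually uses is not described in this article.

```python
import json
from typing import Any, Callable, Dict

# Registry of callable tools, keyed by function name (hypothetical design).
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so it can be invoked by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    # Hypothetical lookup; a real agent would query a backend system here.
    return {"order_id": order_id, "status": "shipped"}


def handle_tool_request(request: dict) -> dict:
    """Run the requested tool and wrap its result, or report an error.

    `request` is assumed to look like
    {"name": "<tool name>", "arguments": "<JSON object as a string>"}.
    """
    name = request["name"]
    fn = TOOLS.get(name)
    if fn is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        args = json.loads(request["arguments"])
        return {"name": name, "result": fn(**args)}
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the function signature.
        return {"name": name, "error": str(exc)}
```

Returning errors as data rather than raising lets the voice agent explain the failure to the user instead of dropping the conversation.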
Amazon and third parties published several performance figures at launch. The benchmark numbers below are the providers' own claims and should be read with that attribution.
Latency was measured by the independent firm Artificial Analysis. According to that benchmarking, Nova Sonic delivers an average customer-perceived latency of 1.09 seconds, defined as the time from when the customer finishes talking to when the system generates the first speech response. The same benchmarking reported 1.18 seconds for OpenAI's GPT-4o (Realtime) and 1.41 seconds for Google's Gemini 2.0 Flash via Gemini's experimental Live API.[2][5]
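To put the three latency figures on a common scale, the short calculation below expresses Nova Sonic's reported latency as a relative reduction against each of the other two. The input numbers are the ones quoted above; the helper names `relative_reduction` and `compare_to` are illustrative, and the resulting percentages (roughly 7.6 percent and 22.7 percent) are derived here rather than taken from the cited sources.

```python
from typing import Dict

# Average perceived latencies in seconds, as quoted in the text above.
LATENCY_S: Dict[str, float] = {
    "Nova Sonic": 1.09,
    "GPT-4o (Realtime)": 1.18,
    "Gemini 2.0 Flash (Live API)": 1.41,
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


def compare_to(reference: str) -> Dict[str, float]:
    """Relative latency reduction of `reference` against every other entry."""
    ref = LATENCY_S[reference]
    return {
        name: relative_reduction(latency, ref)
        for name, latency in LATENCY_S.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, fraction in compare_to("Nova Sonic").items():
        print(f"Nova Sonic vs {name}: {fraction:.1%} lower")
```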
For speech recognition accuracy, Amazon's own evaluations claim a word error rate (WER) of 4.2 percent on the Multilingual LibriSpeech benchmark, which it states is 36.4 percent lower in relative terms than that of OpenAI's GPT-4o Transcribe model. On the noisier Augmented Multi-Party Interaction (AMI) benchmark, Amazon reports a 46.7 percent relative reduction in WER for English compared with GPT-4o Transcribe.[2][5] Rohit Prasad, Amazon's head scientist for artificial general intelligence, framed this as the model being less prone to speech recognition errors than competing AI voice models.[3]
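For readers unfamiliar with the metric, the sketch below computes WER in the standard way, as word-level edit distance divided by the number of reference words, and shows how a relative reduction relates two WER values. The function names and the example sentences are illustrative. If both Multilingual LibriSpeech claims are taken at face value, the comparison model's WER works out to roughly 6.6 percent; that figure is a derivation, not a number stated in the cited sources.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def implied_baseline(wer: float, relative_reduction: float) -> float:
    """Baseline WER implied by `wer` being `relative_reduction` lower than it."""
    return wer / (1.0 - relative_reduction)


if __name__ == "__main__":
    print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
    print(round(implied_baseline(4.2, 0.364), 2))
```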
On cost, Amazon describes Nova Sonic as among the most cost-efficient options in its class, stating it is nearly 80 percent less expensive than OpenAI's GPT-4o (Realtime). Amazon did not disclose a specific per-unit price in the launch announcement, and the comparison is Amazon's own.[2][3]
At launch the model supported three expressive voices, with both masculine-sounding and feminine-sounding options, generally available in English and able to produce American and British English accents. Amazon expanded language coverage over the following months, adding Spanish around June 2025 and French, Italian, and German around July 2025.[1] Other practical specifications at release included a 300,000-token context window and a default streaming connection limit of about eight minutes, with conversations able to continue across new connections by passing prior chat history.[1]
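The connection limit means a long conversation has to be carried across several connections. The sketch below shows one way an application might track that, under stated assumptions: `VoiceSession` and `Connection` are hypothetical classes, the 30-second safety margin is an arbitrary choice, and no real Bedrock connection is opened. Each new connection is seeded with the chat history accumulated so far.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

CONNECTION_LIMIT_S = 8 * 60   # approximate default limit mentioned in the text
ROLLOVER_MARGIN_S = 30        # hypothetical safety margin before the limit

Turn = Tuple[str, str]        # (role, text)


@dataclass
class Connection:
    opened_at: float
    seed_history: List[Turn]  # history passed in when the connection was opened


@dataclass
class VoiceSession:
    clock: Callable[[], float] = time.monotonic
    history: List[Turn] = field(default_factory=list)
    connections: List[Connection] = field(default_factory=list)

    def _open_connection(self) -> None:
        # A real client would open a new stream here and replay the history.
        self.connections.append(Connection(self.clock(), list(self.history)))

    def _needs_rollover(self) -> bool:
        age = self.clock() - self.connections[-1].opened_at
        return age >= CONNECTION_LIMIT_S - ROLLOVER_MARGIN_S

    def add_turn(self, role: str, text: str) -> None:
        """Record a turn, rolling over to a fresh connection when needed."""
        if not self.connections or self._needs_rollover():
            self._open_connection()
        self.history.append((role, text))
```

Injecting the clock keeps the rollover logic testable without waiting eight minutes; production code would leave the default `time.monotonic` in place.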
| Attribute | Detail |
|---|---|
| Developer | Amazon (Amazon Web Services) |
| Model family | Amazon Nova |
| Type | Real-time speech-to-speech (voice-to-voice) foundation model |
| Announced | April 8, 2025[1][3] |
| Availability | Amazon Bedrock; US East (N. Virginia) at launch[1] |
| Model ID | amazon.nova-sonic-v1:0[1] |
| API | InvokeModelWithBidirectionalStream (bidirectional streaming, HTTP/2)[1] |
| Voices | Three expressive voices (masculine- and feminine-sounding)[2] |
| Languages at launch | English (American and British accents)[1][2] |
| Languages added | Spanish (~June 2025); French, Italian, German (~July 2025)[1] |
| Context window | 300,000 tokens[1] |
| Capabilities | Tool use / function calling, RAG, interruption handling, live transcription[1] |
| Avg. perceived latency | 1.09 s (Artificial Analysis; vs 1.18 s GPT-4o Realtime, 1.41 s Gemini 2.0 Flash)[2][5] |
| ASR accuracy | 4.2% WER on Multilingual LibriSpeech (Amazon claim)[2][5] |
| Cost claim | ~80% less expensive than GPT-4o Realtime (Amazon claim)[2][3] |
Nova Sonic is consumed as a managed service on Amazon Bedrock, so customers do not host the model themselves. Integration is through the bidirectional streaming API, and Amazon shipped SDK support across several languages, including C++, Java, JavaScript, Kotlin, Ruby, Rust, and Swift, along with an experimental Python SDK for early development.[1] Pricing and usage follow standard Bedrock billing.
Amazon describes a range of intended use cases, including customer support and contact-center call automation, interactive education and language learning, gaming, and other voice-enabled assistant and agent applications.[1][3] The same speech technology underpins consumer-facing experiences as well: Amazon has stated that components of Nova Sonic already power Alexa+.[3]
On December 2, 2025, Amazon announced Amazon Nova 2 Sonic, a next-generation speech-to-speech model that builds on the original. Amazon says Nova 2 Sonic maintains the price-performance and low latency of the first release while improving model intelligence, function-calling consistency, and ASR robustness (including better handling of alphanumeric inputs, short utterances, and 8 kHz telephony audio). It expands language coverage to English, French, Italian, German, Spanish, Portuguese, and Hindi, and introduces polyglot voices that can switch languages within a single conversation. Amazon reported availability in the US East (N. Virginia), US West (Oregon), and Asia Pacific (Tokyo) Regions.[6]
Nova Sonic is significant as a fully unified speech-to-speech model offered to enterprises as a managed service on AWS, rather than as a research demo or a consumer feature. By folding ASR, reasoning, and TTS into one model, it targets the latency and naturalness limitations of pipeline-based voice systems, and by adding tool use and retrieval it is aimed squarely at production voice agents for customer service and similar workloads.[1][3]
It also slots Amazon into a competitive field of real-time conversational voice models that includes OpenAI's Realtime API built around GPT-4o, Google's Gemini Live, and specialist voice providers such as ElevenLabs. Amazon's pitch combines competitive latency and ASR accuracy with a substantially lower cost claim, all delivered inside the AWS ecosystem where many enterprise customers already run their data and applications.[2][3]