Amazon Nova Sonic

Updated Jun 8, 2026

Amazon Nova Sonic is a real-time speech-to-speech foundation model developed by Amazon and offered through Amazon Bedrock. Announced on April 8, 2025, it is part of the Amazon Nova family of foundation models. Unlike conventional voice stacks that chain together separate automatic speech recognition (ASR), a large language model, and text-to-speech (TTS) components, Nova Sonic unifies speech understanding and speech generation into a single model. It accepts spoken audio as input and produces spoken audio as output natively, which lets it model the prosody, tone, and timing of conversation, reduce latency, and handle natural turn-taking and interruptions.^[1]^[2]

Amazon positions Nova Sonic as infrastructure for building conversational voice agents, with built-in tool use (function calling) and retrieval so that a voice application can fetch real-time information, query proprietary data, or take action in external systems while talking with a user.^[1]^[3] Components of Nova Sonic already power Alexa+, Amazon's upgraded generative AI assistant.^[3]

Overview

Nova Sonic is delivered on Amazon Bedrock through a new bidirectional streaming API, InvokeModelWithBidirectionalStream, built on the HTTP/2 protocol with an event-driven design that interleaves input and output audio streams in real time.^[1] The model ID is amazon.nova-sonic-v1:0. At launch it was available in the US East (N. Virginia) AWS Region and had to be enabled through the Bedrock console's model access page.^[1]
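
The interleaving of input and output streams described above can be illustrated with a small simulation. This is a minimal sketch that does not call Amazon Bedrock: `fake_model`, `send_audio`, `receive_audio`, and `run_session` are hypothetical names, and two in-memory queues stand in for the HTTP/2 stream. It only shows the pattern of sending audio chunks while concurrently consuming output events.

```python
import asyncio


async def fake_model(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Hypothetical stand-in for the service: one output event per input chunk."""
    while True:
        chunk = await inbound.get()
        if chunk is None:  # end-of-input sentinel
            await outbound.put(None)
            return
        await outbound.put(f"reply-to-{chunk}")


async def send_audio(chunks: list[str], inbound: asyncio.Queue, log: list[str]) -> None:
    """Push input chunks upstream without waiting for any reply."""
    for chunk in chunks:
        await inbound.put(chunk)
        log.append(f"sent {chunk}")
        await asyncio.sleep(0)  # yield so the other tasks can run in between
    await inbound.put(None)


async def receive_audio(outbound: asyncio.Queue, log: list[str]) -> None:
    """Consume output events as they arrive, concurrently with sending."""
    while True:
        event = await outbound.get()
        if event is None:
            return
        log.append(f"received {event}")


async def run_session(chunks: list[str]) -> list[str]:
    """Run sender, receiver, and the simulated model side by side."""
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    log: list[str] = []
    await asyncio.gather(
        fake_model(inbound, outbound),
        send_audio(chunks, inbound, log),
        receive_audio(outbound, log),
    )
    return log
```

Calling `asyncio.run(run_session(["a", "b", "c"]))` returns a log in which every reply follows the chunk that triggered it, while sending and receiving proceed as separate tasks rather than in strict request-response turns.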

Because the model processes voice end to end rather than passing text between separate subsystems, Amazon describes it as capable of adapting its delivery to the speaker's tone and style, gracefully handling user interruptions (barge-in), and remaining robust to background noise. The model also produces real-time text transcripts of both sides of a conversation alongside the generated speech, which is useful for logging, analytics, and downstream automation.^[1]

The Nova family

Nova Sonic is the speech-focused member of the broader Amazon Nova family, a set of foundation models that Amazon first introduced on December 3, 2024, at AWS re:Invent. The family spans several modalities:^[4]

- Text and multimodal understanding models: Amazon Nova Micro (text only, lowest latency and cost), Amazon Nova Lite, Amazon Nova Pro, and Amazon Nova Premier (the most capable, used for complex reasoning and as a teacher for model distillation).
- Amazon Nova Canvas, an image generation model.
- Amazon Nova Reel, a video generation model.
- Nova Act, an agentic offering for automating browser-based UI workflows, previewed shortly before Nova Sonic.^[3]

Within this lineup, Nova Sonic is the component that gives the family a native voice interface, complementing the text, image, and video models rather than replacing them.^[1]^[4]

What Nova Sonic does (unified speech-to-speech)

A traditional voice assistant pipeline runs in three stages: an ASR model transcribes the user's speech to text, an LLM reasons over that text and writes a reply, and a TTS model reads the reply aloud. Each stage adds latency, and information about how something was said (emotion, emphasis, hesitation) is typically discarded at the ASR step, making the resulting conversation feel brittle and stilted.^[1]^[2]

Nova Sonic collapses these stages into one unified model architecture that maps speech input directly to speech output. By keeping the audio signal in a single model, it can take acoustic context into account when deciding what to say and how to say it, supporting more natural and expressive spoken dialogue with low latency.^[1] On top of this core, Amazon exposes capabilities aimed at production voice agents:^[1]^[3]

- Function calling, also known as tool use, so the model can invoke external APIs and services during a conversation.
- Agentic workflows that let a voice agent decide when to fetch real-time data from the internet, parse a proprietary data source, or act in an external application. Amazon SVP and Head Scientist of AGI Rohit Prasad said the model "excels at routing user requests to different APIs."^[3]
- Knowledge grounding with enterprise data using Retrieval-Augmented Generation (RAG).
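
On the application side, function calling usually comes down to routing a tool request from the model to local code and returning the result. The following is a minimal sketch of that routing step, not Nova Sonic's actual wire format: the request shape (`name`, `arguments`), the `get_order_status` tool, and its data are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        TOOLS[name] = fn
        return fn
    return register


@tool("get_order_status")
def get_order_status(order_id: str) -> dict:
    # Hypothetical in-memory lookup; a real agent would call an external API here.
    orders = {"A100": "shipped", "A101": "processing"}
    return {"order_id": order_id, "status": orders.get(order_id, "unknown")}


def handle_tool_request(request_json: str) -> str:
    """Route a model-issued tool request to the matching function."""
    request = json.loads(request_json)
    fn = TOOLS.get(request["name"])
    if fn is None:
        result = {"error": f"unknown tool: {request['name']}"}
    else:
        result = fn(**request.get("arguments", {}))
    return json.dumps({"name": request["name"], "result": result})
```

A voice application would pass the returned JSON back to the model, which then continues the spoken reply using the tool result. A retrieval step for RAG could be registered in the same way as another tool.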

Capabilities and benchmarks

Amazon and third parties published several performance figures at launch. Each figure below is attributed to whoever reported it: the latency comparison comes from Artificial Analysis, while the accuracy and cost comparisons are Amazon's own claims.

Latency was measured by the independent firm Artificial Analysis. According to that benchmarking, Nova Sonic delivers an average customer-perceived latency of 1.09 seconds, defined as the time from when the customer finishes talking to when the system generates the first speech response. Artificial Analysis reported this against 1.18 seconds for OpenAI's GPT-4o (Realtime) and 1.41 seconds for Google's Gemini 2.0 Flash via Gemini's experimental Live API.^[2]^[5]
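
The latency definition above can be written out as a short calculation. This is a sketch under stated assumptions: the `Turn` type and function names are hypothetical, and the percentage gaps are derived here from the three reported averages rather than reported by any source.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_speech_end: float  # seconds: when the user finished talking
    first_audio_out: float  # seconds: when the first response audio was generated


def perceived_latency(turn: Turn) -> float:
    """Time from end of user speech to first speech response, in seconds."""
    return turn.first_audio_out - turn.user_speech_end


def average_perceived_latency(turns: list[Turn]) -> float:
    """Mean perceived latency over a set of conversational turns."""
    return sum(perceived_latency(t) for t in turns) / len(turns)


def relative_gap(faster: float, slower: float) -> float:
    """Fraction by which `faster` undercuts `slower`."""
    return (slower - faster) / slower


# Reported averages from the article, in seconds.
NOVA_SONIC, GPT_4O_REALTIME, GEMINI_2_FLASH = 1.09, 1.18, 1.41

gap_vs_gpt = relative_gap(NOVA_SONIC, GPT_4O_REALTIME)    # about 0.076
gap_vs_gemini = relative_gap(NOVA_SONIC, GEMINI_2_FLASH)  # about 0.227
```

On these figures, the reported averages differ by roughly 8 percent and 23 percent respectively.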

For speech recognition accuracy, Amazon's own evaluations report a word error rate (WER) of 4.2 percent on the Multilingual LibriSpeech benchmark, which it states is a 36.4 percent relative reduction compared with OpenAI's GPT-4o Transcribe model. On the noisier Augmented Multi-Party Interaction (AMI) benchmark, Amazon reports an English WER that is 46.7 percent lower, in relative terms, than that of GPT-4o Transcribe.^[2]^[5] Prasad framed this as the model being less prone to speech recognition errors than competing AI voice models.^[3]
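
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. A relative reduction then compares two such rates. The sketch below uses the standard Levenshtein formulation. The example sentences and the 10 percent and 6 percent rates are hypothetical illustrations, not benchmark data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which the candidate's WER is lower than the baseline's."""
    return (baseline_wer - candidate_wer) / baseline_wer


# One deleted word ("the") and one substitution ("lights" -> "light"): 2 errors in 5 words.
example_wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")

# Hypothetical rates: a drop from 10% to 6% WER is a 40% relative reduction.
example_reduction = relative_reduction(0.10, 0.06)
```

Note the difference between absolute and relative terms: in the hypothetical case the WER falls by 4 percentage points, which is a 40 percent relative reduction.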

On cost, Amazon describes Nova Sonic as among the most cost-efficient options in its class, stating it is nearly 80 percent less expensive than OpenAI's GPT-4o (Realtime). Amazon did not disclose a specific per-unit price in the launch announcement, and the comparison is Amazon's own.^[2]^[3]

At launch the model supported three expressive voices, with both masculine-sounding and feminine-sounding options, generally available in English and able to produce American and British English accents. Amazon expanded language coverage over the following months, adding Spanish around June 2025 and French, Italian, and German around July 2025.^[1] Other practical specifications at release included a 300,000-token context window and a default streaming connection limit of about eight minutes, with conversations able to continue across new connections by passing prior chat history.^[1]
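
Continuing a conversation across the connection limit means opening a new connection before the old one expires and seeding it with the history so far. The sketch below models only that bookkeeping. The `VoiceSession` and `Connection` classes and the 30-second safety margin are hypothetical, and no network calls are made.

```python
from dataclasses import dataclass, field

CONNECTION_LIMIT_S = 8 * 60  # default limit of about eight minutes, per the article
ROLLOVER_MARGIN_S = 30       # hypothetical margin: roll over before the limit is hit


@dataclass
class Connection:
    opened_at: float            # seconds on the caller's clock
    seeded_history: list[dict]  # chat history passed when the connection opened


@dataclass
class VoiceSession:
    history: list[dict] = field(default_factory=list)
    connections: list[Connection] = field(default_factory=list)

    def _open_connection(self, now: float) -> None:
        # Pass a copy of the prior chat history to the new connection.
        self.connections.append(Connection(now, list(self.history)))

    def _needs_rollover(self, now: float) -> bool:
        if not self.connections:
            return True
        age = now - self.connections[-1].opened_at
        return age >= CONNECTION_LIMIT_S - ROLLOVER_MARGIN_S

    def add_turn(self, role: str, text: str, now: float) -> None:
        """Record a turn, opening a fresh connection first if the current one is near its limit."""
        if self._needs_rollover(now):
            self._open_connection(now)
        self.history.append({"role": role, "text": text})
```

For example, turns at 0 s and 10 s share the first connection, while a turn at 460 s triggers a second connection that starts with the first two turns as its seeded history.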

Specifications

Attribute	Detail
Developer	Amazon (Amazon Web Services)
Model family	Amazon Nova
Type	Real-time speech-to-speech (voice-to-voice) foundation model
Announced	April 8, 2025^[1]^[3]
Availability	Amazon Bedrock; US East (N. Virginia) at launch^[1]
Model ID	`amazon.nova-sonic-v1:0`^[1]
API	`InvokeModelWithBidirectionalStream` (bidirectional streaming, HTTP/2)^[1]
Voices	Three expressive voices (masculine- and feminine-sounding)^[2]
Languages at launch	English (American and British accents)^[1]^[2]
Languages added	Spanish (~June 2025); French, Italian, German (~July 2025)^[1]
Context window	300,000 tokens^[1]
Capabilities	Tool use / function calling, RAG, interruption handling, live transcription^[1]
Avg. perceived latency	1.09 s (Artificial Analysis; vs 1.18 s GPT-4o Realtime, 1.41 s Gemini 2.0 Flash)^[2]^[5]
ASR accuracy	4.2% WER on Multilingual LibriSpeech (Amazon claim)^[2]^[5]
Cost claim	~80% less expensive than GPT-4o Realtime (Amazon claim)^[2]^[3]

Availability (Bedrock)

Nova Sonic is consumed as a managed service on Amazon Bedrock, so customers do not host the model themselves. Integration is through the bidirectional streaming API, and Amazon shipped SDK support across several languages, including C++, Java, JavaScript, Kotlin, Ruby, Rust, and Swift, along with an experimental Python SDK for early development.^[1] Pricing and usage follow standard Bedrock billing.

Amazon describes a range of intended use cases, including customer support and contact-center call automation, interactive education and language learning, gaming, and other voice-enabled assistant and agent applications.^[1]^[3] The same speech technology underpins consumer-facing experiences as well: Amazon has stated that components of Nova Sonic already power Alexa+.^[3]

Successor

On December 2, 2025, Amazon announced Amazon Nova 2 Sonic, a next-generation speech-to-speech model that builds on the original. Amazon says Nova 2 Sonic maintains the price-performance and low latency of the first release while improving model intelligence, function-calling consistency, and ASR robustness (including better handling of alphanumeric inputs, short utterances, and 8 kHz telephony audio). It expands language coverage to English, French, Italian, German, Spanish, Portuguese, and Hindi, and introduces polyglot voices that can switch languages within a single conversation. Amazon reported availability in the US East (N. Virginia), US West (Oregon), and Asia Pacific (Tokyo) Regions.^[6]

Significance

Nova Sonic is significant as a fully unified speech-to-speech model offered to enterprises as a managed service on AWS, rather than as a research demo or a consumer feature. By folding ASR, reasoning, and TTS into one model, it targets the latency and naturalness limitations of pipeline-based voice systems, and by adding tool use and retrieval it is aimed squarely at production voice agents for customer service and similar workloads.^[1]^[3]

It also slots Amazon into a competitive field of real-time conversational voice models that includes OpenAI's Realtime API built around GPT-4o, Google's Gemini Live, and specialist voice providers such as ElevenLabs. Amazon's pitch combines competitive latency and ASR accuracy with a substantially lower cost claim, all delivered inside the AWS ecosystem where many enterprise customers already run their data and applications.^[2]^[3]

References

1. AWS News Blog, "Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications," April 8, 2025. https://aws.amazon.com/blogs/aws/introducing-amazon-nova-sonic-human-like-voice-conversations-for-generative-ai-applications/
2. Amazon Press Center, "Introducing Amazon Nova Sonic: A New Gen AI Model for Building Voice Applications and Agents," April 8, 2025. https://press.aboutamazon.com/2025/4/introducing-amazon-nova-sonic-a-new-gen-ai-model-for-building-voice-applications-and-agents
3. Kyle Wiggers, "Amazon unveils a new AI voice model, Nova Sonic," TechCrunch, April 8, 2025. https://techcrunch.com/2025/04/08/amazon-unveils-a-new-ai-voice-model-nova-sonic/
4. Amazon, "Amazon Nova: Meet our new foundation models in Amazon Bedrock," December 3, 2024. https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws
5. BigDATAwire, "Amazon Nova Sonic Brings Unified Speech Understanding and Generation to Amazon Bedrock," April 2025. https://www.bigdatawire.com/this-just-in/amazon-nova-sonic-brings-unified-speech-understanding-and-generation-to-amazon-bedrock/
6. AWS News Blog, "Introducing Amazon Nova 2 Sonic: next-generation speech-to-speech model for conversational AI," December 2, 2025. https://aws.amazon.com/blogs/aws/introducing-amazon-nova-2-sonic-next-generation-speech-to-speech-model-for-conversational-ai/
