Hume Octave 2

Updated Jul 16, 2026

Hume Octave 2 is a multilingual emotional text-to-speech model released by Hume AI on October 1, 2025.^[1] It is the second generation of the company's Octave family, which stands for Omni-Capable Text and Voice Engine.^[2] The model speaks 11 languages, generates audio in under 200 milliseconds, and costs half as much per character as the first Octave release.^[1] It introduces voice conversion, phoneme editing, and multi-speaker dialogue, three features that had previously been split across different vendors rather than offered inside a single speech-language model.^[1] Octave 2 is deployed on dedicated inference hardware in partnership with SambaNova and is also the speech backbone for EVI 4 mini, the latest version of Hume's Empathic Voice Interface.^[5]^[6]

The defining bet of Octave is that a text-to-speech system should reason about what it is reading rather than only convert phonemes to waveforms.^[3] Octave is trained on speech and emotion tokens alongside text, and the inference loop uses an LLM-style backbone to decide tone, rhythm, and cadence before audio is synthesized.^[3] Octave 2 keeps that architecture and rebuilds it for lower latency, broader language coverage, and editing controls that approach what audio engineers expect from a professional voiceover tool.^[1] Hume positions the model against ElevenLabs v3, Cartesia Sonic, and OpenAI's TTS endpoints, with the primary differentiator being emotional inference and direct phoneme-level control.^[15]

Background

Hume AI and the Octave program

Hume AI is a New York-based research company founded in March 2021 by Alan Cowen, a former Google AI researcher with a psychology background.^[20] The company's commercial product line has two pillars: the Empathic Voice Interface, a speech-to-speech conversational system, and Octave, a standalone text-to-speech model.^[20] Both are sold through a single developer platform at dev.hume.ai, and both share Hume's underlying expression measurement technology.

Octave is the company's answer to a long-standing complaint about neural TTS. Most early neural voices sounded natural in the studio sense but flat in the dramatic sense. They could read a paragraph cleanly without learning when a line should be whispered, shouted, or rushed. Hume's argument was that prosody is not a finishing layer on top of speech synthesis. It is the meaning itself. The Octave family treats the choice of tone as a language-modeling problem, with prosodic tokens predicted in the same way an LLM predicts the next word.^[3]

Octave 1

Octave 1 launched on February 26, 2025, under the full name Omni-Capable Text and Voice Engine.^[2] It was the first commercial text-to-speech system built on a large language model trained on speech and emotion tokens in addition to text.^[3] Hume's launch materials described it as a model that understands what it is saying, in the sense that it can infer the emotional context of a line and adjust delivery before generating audio.^[3]

Octave 1 supported English only. It accepted a written prompt plus an optional acting instruction, then produced audio with prosody appropriate to the instruction.^[3] A user could write the same line of dialogue and ask the model to deliver it as a sarcastic aside, a sincere apology, or a frantic warning, and the output would differ in pitch contour, pacing, and timbre.^[13] The model also supported voice design from a textual description, voice cloning from short samples, and emotion-tagged scripts.^[14]

In a blind preference study with 180 human raters and 120 prompts that Hume published alongside the launch, Octave 1 was preferred over ElevenLabs Voice Design 71.6 percent of the time on audio quality, 51.7 percent on naturalness, and 57.7 percent on how well the speech matched the description of the desired voice.^[15] ElevenLabs Voice Design was the natural comparison target because both systems generate voices from textual specifications rather than only from cloned samples.

Octave 1 entered general availability with a free tier of 10,000 characters per month and paid tiers running from $7 per month upward.^[13] The launch was covered by VentureBeat, MarkTechPost, and Techmeme.^[13]^[14]

Why a second generation

Four limits of Octave 1 shaped the design of Octave 2. The model only spoke English, which excluded the majority of the global TTS market. Inference latency was around 330 milliseconds in typical conditions, too slow for many real-time conversational uses. Per-character pricing was higher than competitive offerings such as Cartesia Sonic. And the model offered no granular editing controls; a user who wanted to fix a single mispronunciation had to regenerate the entire passage and hope the rest of the delivery survived. The full release of Octave 2 followed on October 1, 2025.^[1]

Octave 2 capabilities

Octave 2 is a single model that handles synthesis, voice design, cloning, and editing across 11 languages. The headline numbers from Hume's launch announcement are summarized in the table below.^[1]

Feature	Octave 1 (Feb 2025)	Octave 2 (Oct 2025)
Languages supported	English only	11 (Arabic, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Spanish)
Latency (typical)	About 330 ms	Under 200 ms
Cost vs Octave 1	Baseline	About 50% lower per character
Multi-speaker dialogue	No	Yes
Voice conversion	No	Yes
Phoneme editing	No	Yes
Pronunciation reliability on numbers, symbols, repeats	Inconsistent	Improved (Hume internal evaluations)
Inference hardware	Standard GPUs	Dedicated stack on SambaNova chips ^[5]
API access	Yes	Yes, plus EVI 4 mini integration

The under-200-millisecond figure refers to time-to-first-audio.^[1] For a conversational turn, Hume reports an end-to-end round trip of roughly 100 to 300 milliseconds when Octave 2 is paired with SambaNova hardware, depending on prompt length.^[5] This brings the system into a latency range where users perceive the response as immediate, which is the threshold most conversational designers target.
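
The difference between time-to-first-audio and total generation time can be made concrete with a small timing harness. The sketch below is illustrative only: `measure_stream` and `fake_stream` are hypothetical helpers, and the fake stream stands in for whatever chunk iterator a real streaming client would return. It does not call any Hume endpoint.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[bytes]) -> dict:
    """Consume an audio chunk stream and record two latencies.

    ttfa_s  : seconds until the first chunk arrived (time-to-first-audio)
    total_s : seconds until the stream was exhausted
    """
    start = time.perf_counter()
    ttfa: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return {"ttfa_s": ttfa, "total_s": total, "bytes": total_bytes}


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # simulated generation and network delay
        yield b"\x00" * 320      # placeholder audio bytes


result = measure_stream(fake_stream())
print(f"first audio after {result['ttfa_s'] * 1000:.1f} ms, "
      f"done after {result['total_s'] * 1000:.1f} ms")
```

A conversational product mostly cares about `ttfa_s`, since playback can begin while later chunks are still being generated.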

The 50 percent price reduction comes mostly from architecture and hardware changes rather than from a margin cut.^[1] Hume rebuilt the inference stack specifically for Octave 2 and ran it on SambaNova's reconfigurable dataflow chips, which the two companies say produce a substantial throughput advantage on speech-language workloads compared with the general-purpose GPU stack used for Octave 1.^[4]^[5]

Voice conversion

Voice conversion in Octave 2 lets a user replace one speaker's voice with another while keeping the phonetic timing and prosody of the original take.^[1] The typical use case is dubbing or AI voiceover touch-up. A studio can record a take with a stand-in actor, then transfer the original lead actor's voice onto the take without re-recording the dialogue. The original performance is preserved, including the breath placement, hesitations, and emphasis pattern of the human actor. Only the timbre and identity of the voice change.

This is technically different from cloning. Cloning generates new speech in a target voice from text. Conversion takes an existing audio file and re-renders it in a different voice while keeping the source performance. The two operations have different failure modes. Cloning can drift from the source style when the script is long. Conversion can sound mechanical when the source recording is heavily slurred or of poor quality. Octave 2 supports both modes from the same API.

Phoneme editing

Phoneme editing exposes the underlying acoustic tokens to the user. A producer can select a generated take, identify a single mispronounced name or stressed syllable, and edit it in place without regenerating the surrounding audio.^[1] The control surface in Hume's playground is similar to a piano roll in a digital audio workstation, with phoneme cells laid out in time and edit handles for duration and pitch.

The practical effect is that brand names, technical jargon, and foreign loanwords can be corrected after the fact rather than worked around in the prompt. Octave 1 required users to spell words phonetically in the input text to coax the desired pronunciation, a workaround that often broke prosody elsewhere in the line. Octave 2 keeps the prosody and only nudges the phonemes that need it.

Multi-speaker dialogue

Octave 2 can render conversations between multiple voices in a single generation request rather than stitching together separate takes from different API calls.^[1] Each speaker is identified by a voice ID or description, and the model renders the back-and-forth with timing that respects the conversational rhythm. Turn-taking, overlaps, and brief acknowledgments such as "yeah" or "mm-hmm" can be specified inline.

Multi-speaker dialogue is a recognized weakness of single-utterance TTS systems. When two takes are generated separately and concatenated, the energy levels, pacing, and ambient timbre rarely match. By generating both sides of a conversation in one pass, Octave 2 avoids most of that mismatch. The feature is aimed at podcast producers, audiobook publishers with dialogue-heavy fiction, and game studios producing character banter.
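
As a rough illustration of the single-request idea, the sketch below turns a plain "Speaker: line" script into one payload that lists every turn with its voice. The payload shape, the `turns`, `voice_id`, and `text` field names, and the `build_dialogue_request` helper are all assumptions made for this example, not Hume's documented schema.

```python
def build_dialogue_request(script: str, voices: dict[str, str]) -> dict:
    """Convert a 'Speaker: line' script into one hypothetical request body.

    `voices` maps speaker names used in the script to voice IDs.
    All turns are placed in a single payload so that one generation
    call could render the whole exchange.
    """
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in voices:
            raise ValueError(f"unrecognized speaker line: {line!r}")
        turns.append({"voice_id": voices[speaker], "text": text.strip()})
    return {"turns": turns}


script = """
Ana: Did you see the launch numbers?
Ben: Mm-hmm.
Ana: And?
"""
request = build_dialogue_request(script, {"Ana": "voice_a", "Ben": "voice_b"})
print(len(request["turns"]))
```

Keeping every turn in one payload is what separates this approach from stitching: the model sees the whole exchange before rendering any of it.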

Pronunciation reliability

Hume reports that Octave 2 handles numbers, dates, repeated words, and rare symbols more reliably than Octave 1.^[1] The model was specifically retrained on examples that previous versions struggled with, including phone numbers, addresses, mathematical expressions, code identifiers, and dialogue tags like "he said, he said, he said" where naive systems collapse into a single utterance.

Multilingual support

Octave 2 launched fluent in 11 languages. They are listed alphabetically in the table below along with notes on Hume's stated readiness level at launch.^[1]

Language	Region focus	Voice cloning supported	Notes
Arabic	Modern Standard Arabic	Yes	First Hume model to support Arabic
English	US and UK varieties	Yes	Backwards-compatible with Octave 1 voices
French	European French	Yes	Quebec accent available via cloning
German	Standard German	Yes
Hindi	Standard Hindi	Yes	Devanagari script input
Italian	Standard Italian	Yes
Japanese	Standard Japanese	Yes	Promoted as flagship language in SambaNova partnership ^[5]
Korean	Standard Korean	Yes
Portuguese	Brazilian and European	Yes	Two regional defaults
Russian	Standard Russian	Yes
Spanish	Castilian and Latin American	Yes	Two regional defaults

Hume's launch announcement stated that support for at least 20 languages was planned for the months following release, with the next batch expected to add Chinese (Mandarin), Dutch, Polish, Turkish, and several South Asian and Southeast Asian languages.^[1] The company has not given a fixed schedule. Each language requires its own training data, expert linguistic review, and prosody tuning, and Hume has historically been conservative about announcing language launches before voice quality meets internal targets.

A distinctive property of Octave 2's multilingual mode is cross-language accent prediction. When a voice is cloned from a 15-second sample of a native English speaker and then asked to read a Spanish passage, the model attempts to predict how that specific speaker would sound speaking Spanish, including the accent and timing pattern.^[1] This is the opposite of typical multilingual TTS, which renders cloned voices in the new language using the native pronunciation of that language without preserving any of the original speaker's identity. Hume's approach trades some pronunciation accuracy for identity preservation.

Voice cloning

Voice cloning in Octave 2 requires a 15-second audio sample of a native speaker reading prepared text.^[9] The sample is uploaded through the dashboard or the API, processed in a few seconds, and made available as a voice ID that can be used for generation.^[9] The same voice can be used across all 11 supported languages, with the cross-language accent prediction described above.^[1]
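
A client could check the sample length locally before uploading. The sketch below uses only Python's standard `wave` module and treats the 15-second figure as a minimum, which is an assumption; the helper names are hypothetical and no Hume API is called.

```python
import io
import wave

# 15 seconds comes from the article; treating it as a minimum that is
# enforced client-side is an assumption of this sketch.
MIN_SAMPLE_SECONDS = 15.0


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_valid_clone_sample(data: bytes) -> bool:
    """True if the WAV sample is at least MIN_SAMPLE_SECONDS long."""
    return wav_duration_seconds(data) >= MIN_SAMPLE_SECONDS


def make_silent_wav(seconds: float, rate: int = 16_000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


print(is_valid_clone_sample(make_silent_wav(15.0)))
print(is_valid_clone_sample(make_silent_wav(10.0)))
```

A duration check says nothing about audio quality or speaker consent; it only catches samples that are obviously too short before an upload is attempted.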

Cloning is available on the Creator plan and higher.^[12] The Creator plan, listed at $7 to $14 per month depending on the billing period and tier path, offers unlimited voice creation and use, with overage on synthesis charged separately.^[12] Enterprise plans add unlimited API access and commercial licensing terms suitable for production deployments at scale.^[12] Hume publishes a content policy that prohibits cloning a real person's voice without consent and requires customers to obtain explicit permission for any commercial voice based on a real recording.^[9]

For identity verification, Hume requires customers to record a verbal consent statement in the same session that the cloning sample is provided.^[9] The statement says the speaker consents to use of their voice. The system stores the consent recording alongside the cloned voice and uses it to gate deployment.^[9] The verbal consent flow is closer in design to the verification used by Sesame CSM than to the looser sample upload flow used by older TTS APIs.

API and pricing

Octave 2 is available through the Hume API at dev.hume.ai.^[10] The endpoint accepts a JSON request with the text to synthesize, a voice ID or description, optional acting instructions, and an output format specification.^[10] Output formats include WAV, MP3, and streaming PCM.^[10] The streaming endpoint emits audio chunks as soon as they are available, which is the path most real-time applications use.
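
The request described above can be sketched as a small payload builder. Everything about the shape is an assumption made for illustration: the field names (`text`, `voice_id`, `voice_description`, `acting_instructions`, `format`), the format identifiers, and the rule that exactly one of voice ID or description is supplied. The real schema is defined in Hume's API documentation.^[10]

```python
import json
from typing import Optional

# Format identifiers are assumptions; the article lists WAV, MP3,
# and streaming PCM as the available outputs.
ALLOWED_FORMATS = {"wav", "mp3", "pcm"}


def build_tts_request(
    text: str,
    voice_id: Optional[str] = None,
    voice_description: Optional[str] = None,
    acting_instructions: Optional[str] = None,
    output_format: str = "mp3",
) -> str:
    """Serialize a hypothetical TTS request body to JSON."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Assumption: a request names a voice by ID or by description, not both.
    if (voice_id is None) == (voice_description is None):
        raise ValueError("provide exactly one of voice_id or voice_description")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")

    body = {"text": text, "format": output_format}
    if voice_id is not None:
        body["voice_id"] = voice_id
    else:
        body["voice_description"] = voice_description
    if acting_instructions:
        body["acting_instructions"] = acting_instructions
    return json.dumps(body, ensure_ascii=False)


print(build_tts_request(
    "I told you this would happen.",
    voice_description="a tired middle-aged narrator",
    acting_instructions="sarcastic aside",
    output_format="wav",
))
```

Validating the payload before sending keeps malformed requests from consuming rate-limited calls.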

The table below summarizes the published pricing tiers for individual and business plans as of the October 2025 launch. Enterprise pricing is custom.^[12]

Plan	Monthly price	Characters included	Overage rate	Voice cloning	Commercial license
Free	$0	10,000	Not available	Limited	No
Starter	$3	30,000	$0.20 per 1,000	Limited	No
Creator	$7 to $14	140,000	$0.15 per 1,000	Yes	Yes
Pro	$70	1,000,000	$0.12 per 1,000	Yes	Yes
Scale	$200	Higher allotment	Volume discount	Yes	Yes
Business	$500	Higher allotment	Volume discount	Yes	Yes
Enterprise	Custom	Unlimited available	Negotiated	Unlimited	Yes

The headline pay-as-you-go rate for Octave 2 is approximately $7.60 per million characters generated, which Hume's pricing page describes as the lowest among the major top-tier multilingual TTS providers as of the launch date.^[12] For dedicated enterprise deployments with reserved capacity, Hume quotes a marginal cost of under one cent per minute of generated audio, which is competitive with the cost basis of Cartesia and significantly below ElevenLabs v3 at the same usage tier.
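
The plan arithmetic can be worked through in a few lines. The figures below are copied from the pricing table and the headline rate quoted above; the dictionary and function names are illustrative, and the Creator plan is entered at the lower bound of its $7 to $14 range as a simplifying assumption.

```python
# Figures copied from the article's pricing table; names are illustrative.
PLANS = {
    "Starter": {"monthly_usd": 3.0, "included_chars": 30_000, "overage_per_1k": 0.20},
    # Assumption: lower bound of the "$7 to $14" range.
    "Creator": {"monthly_usd": 7.0, "included_chars": 140_000, "overage_per_1k": 0.15},
    "Pro": {"monthly_usd": 70.0, "included_chars": 1_000_000, "overage_per_1k": 0.12},
}

HEADLINE_USD_PER_MILLION = 7.60  # pay-as-you-go rate quoted in the article


def monthly_cost(plan: str, chars_used: int) -> float:
    """Subscription price plus overage for characters beyond the allotment."""
    p = PLANS[plan]
    extra_chars = max(0, chars_used - p["included_chars"])
    return round(p["monthly_usd"] + extra_chars / 1000 * p["overage_per_1k"], 2)


def headline_cost(chars_used: int) -> float:
    """Cost of the same volume at the quoted per-million-character rate."""
    return round(chars_used / 1_000_000 * HEADLINE_USD_PER_MILLION, 2)


print(monthly_cost("Starter", 50_000))   # 3.00 + 20 * 0.20
print(monthly_cost("Pro", 1_500_000))    # 70.00 + 500 * 0.12
print(headline_cost(1_500_000))
```

Running both functions on the same volume shows how far the tiered plan prices and the headline per-million rate diverge, which matters when estimating costs for high-volume workloads.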

Rate limits depend on the plan. The Free tier is rate-limited to a few concurrent requests, which is enough for testing but not for production. The Pro tier supports moderate concurrency. Scale and Business plans support higher concurrency and provide service-level commitments. Enterprise customers get dedicated inference capacity on SambaNova hardware with negotiated availability targets.

Empathic Voice Interface integration

Octave 2 is the speech backbone of EVI 4 mini, the version of Hume's Empathic Voice Interface released alongside it in October 2025.^[6] EVI 4 mini is the smaller of two planned EVI 4 variants.^[11] It pairs Octave 2 for synthesis with Hume's expression measurement front end and a pluggable language model for response generation.^[11] Customers can bring their own LLM, including frontier models from Anthropic, OpenAI, or Google, or use one of Hume's hosted options.^[11]

The practical impact of pairing EVI with Octave 2 is that the system can detect a caller's emotional state from incoming audio, generate a response that takes that state into account, and synthesize the reply with prosody appropriate to the moment. If the caller sounds upset, the response is rendered with a slower pace and lower energy. If the caller is excited, the response matches that energy without veering into mockery. The system is interruptible mid-utterance, which keeps the back-and-forth feeling natural.

EVI 4 mini is sold to customer service operators, telehealth providers, and conversational education platforms. The pitch is not that the AI is more accurate than the older speech-to-speech systems, although Hume claims it is. The pitch is that callers tend to stay longer in the conversation when the system sounds like it understands them, which translates into measurable outcomes such as completion rates on intake forms or adherence to medication reminders.

EVI 3, the prior generation, remains supported for customers who have not yet migrated.^[11] EVI 4, the full-size variant, was previewed at the same time as the mini release but had not entered general availability as of the article date.

Comparison to competitors

The text-to-speech market in late 2025 had three reasonably mature contenders aimed at developers building expressive voice products: Hume with Octave 2, ElevenLabs with v3, and Cartesia with Sonic 2. OpenAI's TTS endpoints offered a simpler, lower-control alternative aimed at general developers. The table below compares the four on the dimensions developers tend to evaluate.

Dimension	Octave 2	ElevenLabs v3	Cartesia Sonic 2	OpenAI TTS-1-HD
Launch date	October 1, 2025	June 5, 2025	Mid 2025	November 6, 2023
Languages at launch	11	32	15	About 50
Latency target	Under 200 ms	About 250 ms	Around 90 ms	Around 300 ms
Voice cloning sample	15 seconds	1 minute (Instant)	3 seconds	Not offered
Emotional control	LLM-inferred from script and instructions	Audio tag system in v3	Limited tags	None
Voice conversion	Yes	No (separate dubbing product)	No	No
Phoneme editing	Yes	No	No	No
Multi-speaker dialogue	Yes, single request	Limited, requires manual stitching	No	No
Per-character cost	~$7.60 per 1M	~$15.00 per 1M	~$5 per 1M	~$15.00 per 1M
Dedicated hardware	SambaNova chips	Standard GPU	Standard GPU	OpenAI infrastructure
Strongest claim	Emotional accuracy and editing	Largest language coverage	Lowest latency	Easiest integration

In third-party benchmark studies released after Octave 2 launched, the picture is mixed. On pure speech naturalness measured by mean opinion score, ElevenLabs v3 scored 89.6 percent in a survey published by Pixazo, while Octave 2 scored 78.5 percent.^[16] On pronunciation accuracy in the same survey, ElevenLabs scored 87.1 percent against Octave 2's 80 percent.^[16] In blind preference tests focused specifically on emotional nuance, Hume tends to lead,^[15] with reviewers describing Octave 2 as more capable of authentic empathy and subtle mood shifts but less consistent on long-form narration.^[17] ElevenLabs holds an advantage when the use case is straightforward voiceover or audiobook narration where consistency outweighs emotional range.

Cartesia Sonic 2 is the latency leader. With sub-100-millisecond time-to-first-audio in good network conditions, it remains the default choice for use cases where speed dominates emotional expressiveness, such as voice agents fielding high-volume inbound calls. Cartesia trades emotional range and editing tools for raw speed, which is the inverse of Hume's bet.

OpenAI's TTS endpoints sit in a different segment. They are easier to integrate, support more languages, and require no voice design step, but they do not expose emotional controls or cloning. Octave 2 competes against OpenAI TTS only on the segment of the market that cares about expression. For developers who want a serviceable narrator voice in a few lines of code, OpenAI remains the path of least resistance.

Sesame CSM, released earlier in 2025, is a related but architecturally distinct system focused on conversational presence rather than voice design. It targets the same emotional-intelligence segment as Hume but with a different emphasis on prosodic continuity across turns. The two are sometimes evaluated together by buyers shopping for empathic voice technology rather than for narration.

Reception

The technical press received Octave 2 favorably. TestingCatalog, FunBlocks, and AllAboutAI ran feature pieces covering the launch and emphasized the multilingual jump from English-only and the 50 percent price cut.^[6]^[7]^[8] Reviewers consistently called out voice conversion and phoneme editing as the most differentiated features, on the grounds that no other commercial TTS provider was shipping both in a single API.^[7]^[8] The Product Hunt launch on October 1 drew 119 upvotes in the first day, with comments focused on the multilingual coverage and the Japanese voice.

Reception inside the AI engineering community was more measured. Several developers noted that Octave 2's 11-language coverage at launch was narrower than ElevenLabs v3's 32-language list, and that for many production applications the choice would come down to whether emotional control or language coverage mattered more. Others praised the SambaNova partnership as the clearest example to date of a frontier speech-language model running on non-GPU inference hardware in production rather than as a benchmarking exercise.^[19]

A recurring criticism in user reviews collected by Murf and Fish Audio is that Octave 2 occasionally drifts from a chosen voice over very long passages, particularly in languages other than English.^[17] Hume has acknowledged this as a known limit of cross-language cloning at launch.

The broader strategic read from analysts at Contrary Research and TestingCatalog is that Hume is now positioned more clearly as the emotion specialist in the TTS market, rather than as a generalist trying to compete with ElevenLabs on every dimension.^[6]^[20] The 11-language coverage is enough to make the product viable for global products, the latency is fast enough for conversational use, and the price is low enough to remove cost as a primary objection.

References

1. Hume AI. "Octave 2: next-generation multilingual voice AI." Hume Blog. October 1, 2025. https://www.hume.ai/blog/octave-2-launch
2. Hume AI. "Introducing OCTAVE (Omni-Capable Text and Voice Engine)." Hume Blog. February 26, 2025. https://www.hume.ai/blog/introducing-octave
3. Hume AI. "Octave TTS: the first text-to-speech system that understands what it's saying." Hume Blog. February 26, 2025. https://www.hume.ai/blog/octave-the-first-text-to-speech-model-that-understands-what-its-saying
4. Hume AI. "Hume AI brings expressive speech to SambaNova-powered language models." Hume Blog. October 2025. https://www.hume.ai/blog/case-study-hume-sambanova
5. SambaNova Systems. "SambaNova and Hume AI Unleash Lightning-Fast, Multilingual Speech-Language Model to Redefine Conversational AI for Global Enterprises." Business Wire press release. October 1, 2025. https://www.businesswire.com/news/home/20251001557100/en/SambaNova-and-Hume-AI-Unleash-Lightning-Fast-Multilingual-Speech-Language-Model-to-Redefine-Conversational-AI-for-Global-Enterprises
6. TestingCatalog. "Hume AI launches Octave 2 and EVI 4 mini voice models." October 2025. https://www.testingcatalog.com/hume-ai-launches-octave-2-and-evi-4-mini-voice-models/
7. FunBlocks AI Reviews. "Octave 2 by Hume AI: The Next-Generation Multilingual Text-to-Speech Redefining AI Voices." 2025. https://www.funblocks.net/aitools/reviews/hume-2
8. AllAboutAI. "Meet Octave 2 by Hume AI: A Next-Gen Text-to-Speech Model." 2025. https://www.allaboutai.com/ai-news/meet-octave-2-by-hume-ai-a-next-gen-text-to-speech-model/
9. Hume AI. "Voice Cloning." Hume API Documentation. https://dev.hume.ai/docs/voice/voice-cloning
10. Hume AI. "Text-to-Speech (TTS) overview." Hume API Documentation. https://dev.hume.ai/docs/text-to-speech-tts/overview
11. Hume AI. "Empathic Voice Interface (EVI) Version." Hume API Documentation. https://dev.hume.ai/docs/speech-to-speech-evi/configuration/evi-version
12. Hume AI. "Pricing." https://www.hume.ai/pricing
13. VentureBeat. "Hume launches new text-to-speech model Octave that generates custom AI voices with adjustable emotions." Carl Franzen. February 26, 2025. https://venturebeat.com/ai/hume-launches-text-to-speech-model-octave
14. MarkTechPost. "Hume Introduces Octave TTS: A New Text-to-Speech Model that Creates Custom AI Voices with Tailored Emotions." February 26, 2025. https://www.marktechpost.com/2025/02/26/hume-introduces-octave-tts-a-new-text-to-speech-model-that-creates-custom-ai-voices-with-tailored-emotions/
15. Hume AI. "How does Hume Octave compare to other leading TTS models like ElevenLabs?" Hume Blog. 2025. https://www.hume.ai/blog/octave-tts-study-performance
16. Pixazo. "Hume AI vs ElevenLabs: Comparing Two Expressive Text-to-Speech Models." 2025. https://www.pixazo.ai/blog/hume-ai-vs-elevenlabs
17. Murf. "Hume AI vs ElevenLabs Tried Both and Here's the Winner." 2026. https://murf.ai/blog/hume-ai-vs-elevenlabs
18. Cartesia. "ElevenLabs vs Hume." https://cartesia.ai/vs/elevenlabs-vs-hume
19. ITBrief. "SambaNova debuts Hume AI voice models for emotional speech AI." October 2025. https://itbrief.news/story/sambanova-debuts-hume-ai-voice-models-for-emotional-speech-ai
20. Contrary Research. "Report: Hume AI Business Breakdown and Founding Story." https://research.contrary.com/company/hume-ai
