Hume Octave 2
Last reviewed
May 16, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 3,850 words
Hume Octave 2 is a multilingual emotional text-to-speech model released by Hume AI on October 1, 2025. It is the second generation of the company's Octave family; the name stands for Omni-Capable Text and Voice Engine. The model speaks 11 languages, begins returning audio in under 200 milliseconds, and costs about half as much per character as the first Octave release. It introduces voice conversion, phoneme editing, and multi-speaker dialogue, three features that had previously been split across different vendors rather than offered inside a single speech-language model. Octave 2 is deployed on dedicated inference hardware in partnership with SambaNova and is also the speech backbone for EVI 4 mini, the latest version of Hume's Empathic Voice Interface.
The defining bet of Octave is that a text-to-speech system should reason about what it is reading rather than only convert phonemes to waveforms. Octave is trained on speech and emotion tokens alongside text, and the inference loop uses an LLM-style backbone to decide tone, rhythm, and cadence before audio is synthesized. Octave 2 keeps that architecture and rebuilds it for lower latency, broader language coverage, and editing controls that approach what audio engineers expect from a professional voiceover tool. Hume positions the model against ElevenLabs v3, Cartesia Sonic, and OpenAI's TTS endpoints, with the primary differentiator being emotional inference and direct phoneme-level control.
Hume AI is a New York-based research company founded in March 2021 by Alan Cowen, a former Google AI researcher with a psychology background. The company's commercial product line has two pillars: the Empathic Voice Interface, a speech-to-speech conversational system, and Octave, a standalone text-to-speech model. Both are sold through a single developer platform at dev.hume.ai, and both share Hume's underlying expression measurement technology.
Octave is the company's answer to a long-standing complaint about neural TTS. Most early neural voices sounded natural in the studio sense but flat in the dramatic sense. They could read a paragraph cleanly without learning when a line should be whispered, shouted, or rushed. Hume's argument was that prosody is not a finishing layer on top of speech synthesis. It is the meaning itself. The Octave family treats the choice of tone as a language-modeling problem, with prosodic tokens predicted in the same way an LLM predicts the next word.
Octave 1 launched on February 26, 2025, under the full name Omni-Capable Text and Voice Engine. It was the first commercial text-to-speech system built on a large language model trained on speech and emotion tokens in addition to text. Hume's launch materials described it as a model that understands what it is saying, in the sense that it can infer the emotional context of a line and adjust delivery before generating audio.
Octave 1 supported English only. It accepted a written prompt plus an optional acting instruction, then produced audio with prosody appropriate to the instruction. A user could write the same line of dialogue and ask the model to deliver it as a sarcastic aside, a sincere apology, or a frantic warning, and the output would differ in pitch contour, pacing, and timbre. The model also supported voice design from a textual description, voice cloning from short samples, and emotion-tagged scripts.
In a blind preference study with 180 human raters and 120 prompts that Hume published alongside the launch, Octave 1 was preferred over ElevenLabs Voice Design 71.6 percent of the time on audio quality, 51.7 percent on naturalness, and 57.7 percent on how well the speech matched the description of the desired voice. ElevenLabs Voice Design was the natural comparison target because both systems generate voices from textual specifications rather than only from cloned samples.
Octave 1 entered general availability with a free tier of 10,000 characters per month and paid tiers running from $7 per month upward. The launch was covered by VentureBeat, MarkTechPost, and Techmeme.
Four limits of Octave 1 shaped the design of Octave 2. The model only spoke English, which excluded the majority of the global TTS market. Inference latency was around 330 milliseconds in typical conditions, too slow for many real-time conversational uses. Per-character pricing was higher than competitive offerings such as Cartesia Sonic. And the model offered no granular editing controls; a user who wanted to fix a single mispronunciation had to regenerate the entire passage and hope the rest of the delivery survived. The full release of Octave 2 followed on October 1, 2025.
Octave 2 is a single model that handles synthesis, voice design, cloning, and editing across 11 languages. The headline numbers from Hume's launch announcement are summarized in the table below.
| Feature | Octave 1 (Feb 2025) | Octave 2 (Oct 2025) |
|---|---|---|
| Languages supported | English only | 11 (Arabic, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Spanish) |
| Latency (typical) | About 330 ms | Under 200 ms |
| Cost vs Octave 1 | Baseline | About 50% lower per character |
| Multi-speaker dialogue | No | Yes |
| Voice conversion | No | Yes |
| Phoneme editing | No | Yes |
| Pronunciation reliability on numbers, symbols, repeats | Inconsistent | Improved (Hume internal evaluations) |
| Inference hardware | Standard GPUs | Dedicated stack on SambaNova chips |
| API access | Yes | Yes, plus EVI 4 mini integration |
The under-200-millisecond figure refers to time-to-first-audio. For a conversational turn, Hume reports an end-to-end round trip of roughly 100 to 300 milliseconds when Octave 2 is paired with SambaNova hardware, depending on prompt length. This brings the system into a latency range where users perceive the response as immediate, which is the threshold most conversational designers target.
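The distinction between time-to-first-audio and total generation time can be made concrete with a small measurement helper. The sketch below is illustrative only: `measure_stream` and `fake_stream` are hypothetical names, and the simulated stream stands in for a real streaming response rather than calling any Hume endpoint.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_audio_s, total_time_s, total_bytes) for a chunk stream."""
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            # Time-to-first-audio: the delay before the first chunk arrives.
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        raise ValueError("stream produced no audio chunks")
    return first_chunk_at, total, total_bytes


def fake_stream(first_delay_s: float, chunk_count: int, gap_s: float) -> Iterator[bytes]:
    """Simulated streaming response (an assumption, not a real client)."""
    time.sleep(first_delay_s)          # model "thinking" before the first audio
    for _ in range(chunk_count):
        yield b"\x00" * 320            # placeholder PCM bytes
        time.sleep(gap_s)              # pacing between later chunks


if __name__ == "__main__":
    ttfa, total, size = measure_stream(fake_stream(0.05, 3, 0.01))
    print(f"first audio after {ttfa * 1000:.0f} ms, done after {total * 1000:.0f} ms, {size} bytes")
```

A listener starts hearing speech at the first value, which is why streaming systems are usually judged on it rather than on total generation time.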
The 50 percent price reduction comes mostly from architecture and hardware changes rather than from a margin cut. Hume rebuilt the inference stack specifically for Octave 2 and ran it on SambaNova's reconfigurable dataflow chips, which the two companies say produce a substantial throughput advantage on speech-language workloads compared with the general-purpose GPU stack used for Octave 1.
Voice conversion in Octave 2 lets a user replace one speaker's voice with another while keeping the phonetic timing and prosody of the original take. The typical use case is dubbing or AI voiceover touch-up. A studio can record a take with a stand-in actor, then transfer the original lead actor's voice onto the take without re-recording the dialogue. The original performance is preserved, including the breath placement, hesitations, and emphasis pattern of the human actor. Only the timbre and identity of the voice change.
This is technically different from cloning. Cloning generates new speech in a target voice from text. Conversion takes an existing audio file and re-renders it in a different voice while keeping the source performance. The two operations have different failure modes. Cloning can drift from the source style when the script is long. Conversion can sound mechanical when the source recording is heavily slurred or of poor quality. Octave 2 supports both modes from the same API.
Phoneme editing exposes the underlying acoustic tokens to the user. A producer can select a generated take, identify a single mispronounced name or stressed syllable, and edit it in place without regenerating the surrounding audio. The control surface in Hume's playground is similar to a piano roll in a digital audio workstation, with phoneme cells laid out in time and edit handles for duration and pitch.
The practical effect is that brand names, technical jargon, and foreign loanwords can be corrected after the fact rather than worked around in the prompt. Octave 1 required users to spell words phonetically in the input text to coax the desired pronunciation, a workaround that often broke prosody elsewhere in the line. Octave 2 keeps the prosody and only nudges the phonemes that need it.
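The idea of editing one phoneme while leaving the rest of a take untouched can be sketched as a data structure. Everything here is an assumption made for illustration: the `PhonemeCell` fields, the ARPABET-style symbols, and the `edit_cell` helper are hypothetical and do not describe Hume's internal representation or API.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class PhonemeCell:
    symbol: str        # phoneme label (ARPABET-style here, an assumed notation)
    start_ms: int      # position of the cell on the timeline
    duration_ms: int
    pitch_hz: float


def edit_cell(take: List[PhonemeCell], index: int,
              symbol: Optional[str] = None,
              pitch_hz: Optional[float] = None) -> List[PhonemeCell]:
    """Return a new take with one cell changed; every other cell is reused unchanged."""
    if not 0 <= index < len(take):
        raise IndexError(f"no phoneme cell at index {index}")
    cell = take[index]
    edited = list(take)
    edited[index] = replace(
        cell,
        symbol=cell.symbol if symbol is None else symbol,
        pitch_hz=cell.pitch_hz if pitch_hz is None else pitch_hz,
    )
    return edited


if __name__ == "__main__":
    # A four-cell take in which the third vowel was rendered incorrectly.
    take = [
        PhonemeCell("HH", 0, 60, 180.0),
        PhonemeCell("Y", 60, 50, 185.0),
        PhonemeCell("AH", 110, 120, 190.0),
        PhonemeCell("M", 230, 80, 175.0),
    ]
    fixed = edit_cell(take, 2, symbol="UW")
    print([c.symbol for c in fixed])
```

The sketch only changes a symbol or pitch. Changing a duration would also require shifting the start times of later cells, which is left out to keep the example short.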
Octave 2 can render conversations between multiple voices in a single generation request rather than stitching together separate takes from different API calls. Each speaker is identified by a voice ID or description, and the model renders the back-and-forth with timing that respects the conversational rhythm. Turn-taking, overlaps, and brief acknowledgments such as "yeah" or "mm-hmm" can be specified inline.
Multi-speaker dialogue is a recognized weakness of single-utterance TTS systems. When two takes are generated separately and concatenated, the energy levels, pacing, and ambient timbre rarely match. By generating both sides of a conversation in one pass, Octave 2 avoids most of that mismatch. The feature is aimed at podcast producers, audiobook publishers with dialogue-heavy fiction, and game studios producing character banter.
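A single-request dialogue can be pictured as an ordered list of turns, each tied to a speaker's voice. The sketch below only assembles such a payload. The field names (`utterances`, `voice_id`, `text`) and the helper `build_dialogue_request` are assumptions for illustration and should be checked against Hume's API reference before use.

```python
import json
from typing import Dict, List, Tuple


def build_dialogue_request(turns: List[Tuple[str, str]], voices: Dict[str, str]) -> str:
    """Serialize (speaker, line) turns into one JSON body, preserving turn order."""
    utterances = []
    for speaker, text in turns:
        if speaker not in voices:
            raise KeyError(f"no voice configured for speaker {speaker!r}")
        # Hypothetical schema: one entry per turn, tagged with that speaker's voice.
        utterances.append({"voice_id": voices[speaker], "text": text})
    return json.dumps({"utterances": utterances})


if __name__ == "__main__":
    voices = {"host": "voice-host-001", "guest": "voice-guest-002"}  # made-up IDs
    turns = [
        ("host", "So the launch went well?"),
        ("guest", "Mm-hmm. Better than we expected."),
        ("host", "Yeah, I heard."),
    ]
    print(build_dialogue_request(turns, voices))
```

Keeping every turn in one body is what lets a model plan pacing and energy across speakers, in contrast to issuing one call per line and concatenating the results.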
Hume reports that Octave 2 handles numbers, dates, repeated words, and rare symbols more reliably than Octave 1. The model was specifically retrained on examples that previous versions struggled with, including phone numbers, addresses, mathematical expressions, code identifiers, and repeated phrases such as "he said, he said, he said", which naive systems tend to collapse into a single utterance.
Octave 2 launched fluent in 11 languages. They are listed alphabetically in the table below along with notes on Hume's stated readiness level at launch.
| Language | Region focus | Voice cloning supported | Notes |
|---|---|---|---|
| Arabic | Modern Standard Arabic | Yes | First Hume model to support Arabic |
| English | US and UK varieties | Yes | Backwards-compatible with Octave 1 voices |
| French | European French | Yes | Quebec accent available via cloning |
| German | Standard German | Yes | |
| Hindi | Standard Hindi | Yes | Devanagari script input |
| Italian | Standard Italian | Yes | |
| Japanese | Standard Japanese | Yes | Promoted as flagship language in SambaNova partnership |
| Korean | Standard Korean | Yes | |
| Portuguese | Brazilian and European | Yes | Two regional defaults |
| Russian | Standard Russian | Yes | |
| Spanish | Castilian and Latin American | Yes | Two regional defaults |
Hume's launch announcement stated that support for at least 20 languages was planned for the months following release, with the next batch expected to add Chinese (Mandarin), Dutch, Polish, Turkish, and several South Asian and Southeast Asian languages. The company has not given a fixed schedule. Each language requires its own training data, expert linguistic review, and prosody tuning, and Hume has historically been conservative about announcing language launches before voice quality meets internal targets.
A distinctive property of Octave 2's multilingual mode is cross-language accent prediction. When a voice is cloned from a 15-second sample of a native English speaker and then asked to read a Spanish passage, the model attempts to predict how that specific speaker would sound speaking Spanish, including the accent and timing pattern. This is the opposite of typical multilingual TTS, which renders cloned voices in the new language using the native pronunciation of that language without preserving any of the original speaker's identity. Hume's approach trades some pronunciation accuracy for identity preservation.
Voice cloning in Octave 2 requires a 15-second audio sample of a native speaker reading prepared text. The sample is uploaded through the dashboard or the API, processed in a few seconds, and made available as a voice ID that can be used for generation. The same voice can be used across all 11 supported languages, with the cross-language accent prediction described above.
Cloning is available on the Creator plan and higher. The Creator plan, listed at $7 to $14 per month depending on billing period and tier path, offers unlimited voice creation and use, with overage on synthesis charged separately. Enterprise plans add unlimited API access and commercial licensing terms suitable for production deployments at scale. Hume publishes a content policy that prohibits cloning a real person's voice without consent and requires customers to obtain explicit permission for any commercial voice based on a real recording.
For identity verification, Hume requires customers to record a verbal consent statement in the same session that the cloning sample is provided. The statement says the speaker consents to use of their voice. The system stores the consent recording alongside the cloned voice and uses it to gate deployment. The verbal consent flow is closer in design to the verification used by Sesame CSM than to the looser sample upload flow used by older TTS APIs.
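The gating rule described above, consent recorded in the same session as the sample and checked before deployment, can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the `ClonedVoice` record, its session fields, and `can_deploy` are hypothetical and are not taken from Hume's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClonedVoice:
    voice_id: str
    sample_session_id: str                     # session in which the sample was recorded
    consent_session_id: Optional[str] = None   # session in which consent was recorded, if any


def can_deploy(voice: ClonedVoice) -> bool:
    """Allow deployment only when consent exists and comes from the sample's session."""
    return (
        voice.consent_session_id is not None
        and voice.consent_session_id == voice.sample_session_id
    )


if __name__ == "__main__":
    ok = ClonedVoice("voice-a", "sess-1", "sess-1")
    missing = ClonedVoice("voice-b", "sess-2")
    mismatched = ClonedVoice("voice-c", "sess-3", "sess-9")
    print(can_deploy(ok), can_deploy(missing), can_deploy(mismatched))
```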
Octave 2 is available through the Hume API at dev.hume.ai. The endpoint accepts a JSON request with the text to synthesize, a voice ID or description, optional acting instructions, and an output format specification. Output formats include WAV, MP3, and streaming PCM. The streaming endpoint emits audio chunks as soon as they are available, which is the path most real-time applications use.
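The request shape described above can be sketched as a small payload builder. The JSON field names, the `build_tts_request` helper, and the rule that exactly one of voice ID or description is supplied are all assumptions made for illustration; the authoritative schema is Hume's API reference. The sketch builds the body only and sends nothing over the network.

```python
import json
from typing import Optional

# Output formats named in the article; the lowercase identifiers are assumed.
ALLOWED_FORMATS = {"wav", "mp3", "pcm"}


def build_tts_request(text: str,
                      voice_id: Optional[str] = None,
                      voice_description: Optional[str] = None,
                      acting_instructions: Optional[str] = None,
                      output_format: str = "wav") -> str:
    """Return a JSON body for a hypothetical synthesis request."""
    if not text:
        raise ValueError("text must not be empty")
    if (voice_id is None) == (voice_description is None):
        raise ValueError("provide exactly one of voice_id or voice_description")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")

    voice = {"id": voice_id} if voice_id is not None else {"description": voice_description}
    body = {"text": text, "voice": voice, "format": output_format}
    if acting_instructions is not None:
        # Optional delivery hint such as "whispered" or "frantic".
        body["acting_instructions"] = acting_instructions
    return json.dumps(body)


if __name__ == "__main__":
    print(build_tts_request(
        "The train leaves at 9:45.",
        voice_description="a calm station announcer",
        acting_instructions="unhurried, reassuring",
        output_format="mp3",
    ))
```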
The table below summarizes the published pricing tiers for individual and business plans as of the October 2025 launch. Enterprise pricing is custom.
| Plan | Monthly price | Characters included | Overage rate | Voice cloning | Commercial license |
|---|---|---|---|---|---|
| Free | $0 | 10,000 | Not available | Limited | No |
| Starter | $3 | 30,000 | $0.20 per 1,000 | Limited | No |
| Creator | $7 to $14 | 140,000 | $0.15 per 1,000 | Yes | Yes |
| Pro | $70 | 1,000,000 | $0.12 per 1,000 | Yes | Yes |
| Scale | $200 | Higher allotment | Volume discount | Yes | Yes |
| Business | $500 | Higher allotment | Volume discount | Yes | Yes |
| Enterprise | Custom | Unlimited available | Negotiated | Unlimited | Yes |
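To see how the tiers in the table translate into a monthly bill, the arithmetic is the plan price plus overage on characters beyond the included allotment. The sketch below uses only the three rows with fully published numbers. The `Plan` type and `monthly_cost` helper are hypothetical, and the Creator price is taken at the $7 lower bound as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_usd: float
    included_chars: int
    overage_usd_per_1k: float


# Figures copied from the pricing table above.
PLANS = {
    "starter": Plan("Starter", 3.0, 30_000, 0.20),
    "creator": Plan("Creator", 7.0, 140_000, 0.15),   # lower bound of "$7 to $14"
    "pro": Plan("Pro", 70.0, 1_000_000, 0.12),
}


def monthly_cost(plan: Plan, chars_used: int) -> float:
    """Plan price plus overage for characters beyond the included allotment."""
    if chars_used < 0:
        raise ValueError("chars_used must be non-negative")
    extra_chars = max(0, chars_used - plan.included_chars)
    return plan.monthly_usd + (extra_chars / 1000) * plan.overage_usd_per_1k


if __name__ == "__main__":
    for key, used in [("starter", 50_000), ("creator", 140_000), ("pro", 1_250_000)]:
        print(f"{PLANS[key].name}: {used:,} chars -> ${monthly_cost(PLANS[key], used):.2f}")
```

For example, 1,250,000 characters on Pro is 250,000 characters over the allotment, so the bill is $70 plus 250 × $0.12, or about $100.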
The headline pay-as-you-go rate for Octave 2 is approximately $7.60 per million characters generated, which Hume's pricing page describes as the lowest among the major top-tier multilingual TTS providers as of the launch date. For dedicated enterprise deployments with reserved capacity, Hume quotes a marginal cost of under one cent per minute of generated audio, which is competitive with the cost basis of Cartesia and significantly below ElevenLabs v3 at the same usage tier.
Rate limits depend on the plan. The Free tier is rate-limited to a few concurrent requests, which is enough for testing but not for production. The Pro tier supports moderate concurrency. Scale and Business plans support higher concurrency and provide service-level commitments. Enterprise customers get dedicated inference capacity on SambaNova hardware with negotiated availability targets.
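On tiers with low concurrency ceilings, a client typically caps its own in-flight requests rather than relying on the server to reject the excess. The sketch below shows one common way to do that with `asyncio.Semaphore`. The `run_limited` helper, the `synthesize` callable, and the limit of 2 are assumptions for illustration, since the article does not give exact per-plan limits.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_limited(texts: List[str],
                      synthesize: Callable[[str], Awaitable[bytes]],
                      max_concurrent: int) -> List[bytes]:
    """Synthesize all texts, never running more than max_concurrent calls at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(text: str) -> bytes:
        async with semaphore:          # waits here while the cap is reached
            return await synthesize(text)

    # gather returns results in the same order as the input texts.
    return await asyncio.gather(*(worker(t) for t in texts))


async def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real API call (an assumption, not a Hume client)."""
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


if __name__ == "__main__":
    lines = ["one", "two", "three", "four", "five"]
    audio = asyncio.run(run_limited(lines, fake_synthesize, max_concurrent=2))
    print([len(a) for a in audio])
```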
Octave 2 is the speech backbone of EVI 4 mini, the version of Hume's Empathic Voice Interface released alongside it in October 2025. EVI 4 mini is the smaller of two planned EVI 4 variants. It pairs Octave 2 for synthesis with Hume's expression measurement front end and a pluggable language model for response generation. Customers can bring their own LLM, including frontier models from Anthropic, OpenAI, or Google, or use one of Hume's hosted options.
The practical impact of pairing EVI with Octave 2 is that the system can detect a caller's emotional state from incoming audio, generate a response that takes that state into account, and synthesize the reply with prosody appropriate to the moment. If the caller sounds upset, the response is rendered with a slower pace and lower energy. If the caller is excited, the response matches that energy without veering into mockery. The system is interruptible mid-utterance, which keeps the back-and-forth feeling natural.
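The detect, respond, and synthesize loop described above can be sketched as three pluggable stages. This is a toy illustration, not EVI's implementation: the state labels, the delivery presets in `DELIVERY_BY_STATE`, and the `handle_turn` function are all hypothetical, and the stages are passed in as plain callables so that any language model or synthesizer could be substituted.

```python
from typing import Callable, Dict

# Hypothetical delivery presets mirroring the examples in the text.
DELIVERY_BY_STATE: Dict[str, Dict[str, str]] = {
    "upset": {"pace": "slower", "energy": "lower"},
    "excited": {"pace": "normal", "energy": "matched"},
}
NEUTRAL_DELIVERY: Dict[str, str] = {"pace": "normal", "energy": "normal"}


def handle_turn(caller_audio: bytes,
                detect_state: Callable[[bytes], str],
                generate_reply: Callable[[bytes, str], str],
                synthesize: Callable[[str, Dict[str, str]], bytes]) -> bytes:
    """Run one conversational turn: detect state, draft a reply, pick a delivery."""
    state = detect_state(caller_audio)
    reply_text = generate_reply(caller_audio, state)
    delivery = DELIVERY_BY_STATE.get(state, NEUTRAL_DELIVERY)
    return synthesize(reply_text, delivery)


if __name__ == "__main__":
    # Stub stages so the flow can be run without any external service.
    audio = handle_turn(
        b"...",
        detect_state=lambda _audio: "upset",
        generate_reply=lambda _audio, state: f"I can hear this has been frustrating ({state}).",
        synthesize=lambda text, delivery: f"[{delivery['pace']}/{delivery['energy']}] {text}".encode(),
    )
    print(audio.decode())
```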
EVI 4 mini is sold to customer service operators, telehealth providers, and conversational education platforms. The pitch is not that the AI is more accurate than the older speech-to-speech systems, although Hume claims it is. The pitch is that callers tend to stay longer in the conversation when the system sounds like it understands them, which translates into measurable outcomes such as completion rates on intake forms or adherence to medication reminders.
EVI 3, the prior generation, remains supported for customers who have not yet migrated. EVI 4, the full-size variant, was previewed at the same time as the mini release but had not entered general availability as of the article date.
The text-to-speech market in late 2025 had three reasonably mature contenders aimed at developers building expressive voice products: Hume with Octave 2, ElevenLabs with v3, and Cartesia with Sonic 2. OpenAI's TTS endpoints offered a simpler, lower-control alternative aimed at general developers. The table below compares the four on the dimensions developers tend to evaluate.
| Dimension | Octave 2 | ElevenLabs v3 | Cartesia Sonic 2 | OpenAI TTS-1-HD |
|---|---|---|---|---|
| Launch date | October 1, 2025 | June 5, 2025 | Mid 2025 | November 6, 2023 |
| Languages at launch | 11 | 32 | 15 | About 50 |
| Latency target | Under 200 ms | About 250 ms | Around 90 ms | Around 300 ms |
| Voice cloning sample | 15 seconds | 1 minute (Instant) | 3 seconds | Not offered |
| Emotional control | LLM-inferred from script and instructions | Audio tag system in v3 | Limited tags | None |
| Voice conversion | Yes | No (separate dubbing product) | No | No |
| Phoneme editing | Yes | No | No | No |
| Multi-speaker dialogue | Yes, single request | Limited, requires manual stitching | No | No |
| Per-character cost | ~$7.60 per 1M | ~$15.00 per 1M | ~$5 per 1M | ~$15.00 per 1M |
| Dedicated hardware | SambaNova chips | Standard GPU | Standard GPU | OpenAI infrastructure |
| Strongest claim | Emotional accuracy and editing | Largest language coverage | Lowest latency | Easiest integration |
In third-party benchmark studies released after Octave 2 launched, the picture is mixed. On pure speech naturalness measured by mean opinion score, ElevenLabs v3 scored 89.6 percent in a survey published by Pixazo, while Octave 2 scored 78.5 percent. On pronunciation accuracy in the same survey, ElevenLabs scored 87.1 percent against Octave 2's 80 percent. In blind preference tests focused specifically on emotional nuance, Hume tends to lead, with reviewers describing Octave 2 as more capable of authentic empathy and subtle mood shifts but less consistent on long-form narration. ElevenLabs holds an advantage when the use case is straightforward voiceover or audiobook narration where consistency outweighs emotional range.
Cartesia Sonic 2 is the latency leader. With sub-100-millisecond time-to-first-audio in good network conditions, it remains the default choice for use cases where speed dominates emotional expressiveness, such as voice agents fielding high-volume inbound calls. Cartesia trades emotional range and editing tools for raw speed, which is the inverse of Hume's bet.
OpenAI's TTS endpoints sit in a different segment. They are easier to integrate, support more languages, and require no voice design step, but they do not expose emotional controls or cloning. Octave 2 competes against OpenAI TTS only on the segment of the market that cares about expression. For developers who want a serviceable narrator voice in a few lines of code, OpenAI remains the path of least resistance.
Sesame CSM, released earlier in 2025, is a related but architecturally distinct system focused on conversational presence rather than voice design. It targets the same emotional-intelligence segment as Hume but with a different emphasis on prosodic continuity across turns. The two are sometimes evaluated together by buyers shopping for empathic voice technology rather than for narration.
The technical press received Octave 2 favorably. TestingCatalog, AllAboutAI, and FunBlocks ran feature pieces covering the launch and emphasized the multilingual jump from English-only and the 50 percent price cut. Reviewers consistently called out voice conversion and phoneme editing as the most differentiated features, on the grounds that no other commercial TTS provider was shipping both in a single API. The Product Hunt launch on October 1 drew 119 upvotes in the first day, with comments focused on the multilingual coverage and the Japanese voice.
Reception inside the AI engineering community was more measured. Several developers noted that Octave 2's 11-language coverage at launch was narrower than ElevenLabs v3's 32-language list, and that for many production applications the choice would come down to whether emotional control or language coverage mattered more. Others praised the SambaNova partnership as the clearest example to date of a frontier speech-language model running on non-GPU inference hardware in production rather than as a benchmarking exercise.
A recurring criticism in user reviews collected by Murf and Fish Audio is that Octave 2 occasionally drifts from a chosen voice over very long passages, particularly in languages other than English. Hume has acknowledged this as a known limit of cross-language cloning at launch.
The broader strategic read from analysts at Contrary Research and TestingCatalog is that Hume is now positioned more clearly as the emotion specialist in the TTS market, rather than as a generalist trying to compete with ElevenLabs on every dimension. The 11-language coverage is enough to make the model viable for global deployments, the latency is low enough for conversational use, and the price is low enough to remove cost as a primary objection.