ElevenLabs v3

Updated Jun 26, 2026

Eleven v3 (alpha)
Developer	ElevenLabs
Type	Text-to-speech model
Alpha release	June 5, 2025^[1]
General availability	February 2, 2026^[2]
Architecture	Successor to Eleven Multilingual v2; new model with deeper text understanding^[3]
Languages	70+ at launch^[4]
Character limit	5,000 characters per request (alpha and GA)^[5]
Output formats	MP3 at 44,100 Hz, 128 kbps, plus PCM and other formats^[6]
Key feature	Inline audio tags such as `[laughs]`, `[whispers]`, `[excited]` for emotion and delivery control^[7]
Model ID	`eleven_v3` (API parameter)^[8]
Pricing during alpha	80% discount in the user interface through June 2025; API alpha opened in mid-2025^[9]
Real-time use	Not optimized for live conversation; Flash v2.5 remains the low-latency option^[10]

Eleven v3, marketed by ElevenLabs as Eleven v3 (alpha), is a third-generation text-to-speech model that ElevenLabs released in public alpha on June 5, 2025 and described as "the most expressive Text to Speech model ever."^[1] Its signature feature is an inline markup language of bracketed "audio tags" such as [laughs], [whispers], [excited], and [sighs] that direct emotion, delivery, and non-verbal reactions inside the script itself, alongside support for 70+ languages and native multi-speaker dialogue.^[1]^[7] Because its higher-fidelity output trades away the low-latency profile of the older Flash and Turbo families, ElevenLabs positions Eleven v3 for offline workloads such as audiobooks, character voice acting, and cinematic narration rather than live agent calls.^[10]

The London-based audio artificial intelligence startup paired the alpha rollout with an 80 percent price cut in the ElevenLabs web interface through the end of June 2025, and a public alpha of the v3 application programming interface followed shortly after.^[1]^[13] ElevenLabs moved Eleven v3 into general availability on February 2, 2026, reporting that testers preferred the GA model to the alpha 72 percent of the time, alongside a wider expansion of the company's publisher tooling and ElevenCreative audiobook environment.^[2]^[14] Third-party benchmarks placed v3 near the top of expressive text-to-speech leaderboards, but reviews were mixed: testers praised the dramatic delivery and audio-tag system while criticizing slower generation, reduced control compared to Multilingual v2, and content filters that blocked some creative use cases.^[15]

What is Eleven v3?

Eleven v3 is ElevenLabs' flagship expressive text-to-speech model, built around four ideas: directable emotional delivery through inline audio tags, native multi-speaker dialogue, coverage of more than 70 languages, and improved handling of dense numeric and symbolic text. It is the successor to Eleven Multilingual v2, and ElevenLabs describes it as a new architecture rather than a refinement of the older model, with a larger network and a higher-fidelity voice codec that improve audio quality at the cost of generation speed.^[3]^[24] Unlike the company's low-latency Flash and Turbo lines, which are tuned for real-time conversation, v3 is aimed at prepared content where a producer can wait for a higher-quality render: audiobooks, podcasts, dubbing, animation, and cinematic narration.^[10]^[12]

Functionally, v3 lets a writer direct an audio scene the way a screenplay does. A short passage might open with [whispers] over background tension, shift to [shouts] for a confrontation, and end with [sighs] for resignation, all from the same cloned voice, while a separate Text to Dialogue interface renders an entire multi-voice scene, complete with overlaps and interruptions, in a single pass.^[7]^[26]

Why did ElevenLabs build Eleven v3?

ElevenLabs was founded in 2022 by Mati Staniszewski and Piotr Dabkowski, two friends from Warsaw who started the company after watching a dubbed American film whose Polish voice track stripped most of the original emotional performance. Staniszewski had worked at Palantir Technologies and Dabkowski had been a machine learning engineer at Google. Their first commercial product was an English text-to-speech engine, internally called v1, that drew attention in early 2023 for producing audio that listeners often struggled to distinguish from a human reader.^[16]

A second generation followed: Eleven English v2 sharpened the original voice quality, and Eleven Multilingual v2, released in 2023, extended the system to 28 languages while keeping a cloned speaker's accent and timbre across them. Multilingual v2 quickly became the company's default model for narration and audiobook work because it produced steady, neutral delivery with relatively few mispronunciations. ElevenLabs also shipped lower-latency siblings, Flash v2.5 and Turbo v2.5, that traded some expressive range for inference speeds suited to live agent platforms. By the end of 2024 the product line included instant and professional voice cloning, an AI dubbing studio, a voice library with thousands of community submissions, and a conversational agent platform that competed with offerings from OpenAI and Google.^[17]

The gap that Eleven v3 was built to close was expressive performance. Even with Multilingual v2's emotional range, scripted reads still sounded controlled in a way that worked for nonfiction but felt flat in fiction, gaming, and animation. Writers asking the model to whisper, shout, sob, or laugh on cue had to chain SSML breaks and prompt tricks. Multi-speaker dialogue had to be produced one speaker at a time and stitched together. The v3 research project, which ElevenLabs began describing publicly in early 2025, was an attempt to solve those problems with a single new model.

When was Eleven v3 released?

ElevenLabs announced Eleven v3 on June 5, 2025 via a post on its corporate X account and a microsite at elevenlabs.io/v3. The launch message described the model as "the most expressive Text to Speech model ever," highlighted 70+ language support, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers], and offered an 80 percent discount on v3 generation in the ElevenLabs user interface through the end of June 2025.^[18] The model was labelled an alpha and a research preview, and ElevenLabs cautioned that it required more careful prompt engineering than Multilingual v2, that professional voice clones were not yet fully tuned for the new architecture, and that latency was higher than on the company's Flash and Turbo lines.^[19]

The rollout proceeded in stages. On launch day v3 became selectable from the Text to Speech and Studio dropdowns on elevenlabs.io for all paid plans, with the free tier excluded. A few weeks later ElevenLabs opened a public alpha of the v3 endpoint on its hosted application programming interface, with the model identifier eleven_v3 available through the standard POST /v1/text-to-speech/:voice_id route and dedicated text-to-dialogue endpoints for multi-speaker scenes.^[20] The official ElevenLabs developer account had confirmed that public API access would roll out shortly after the user interface launch and would follow the same alpha pricing pattern for early adopters.^[21]
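As a rough illustration of the request shape described above, the sketch below builds (but does not send) a call to the text-to-speech route with the eleven_v3 model identifier. The base URL, the `/v1` prefix (borrowed from the dialogue route quoted later in this article), the `xi-api-key` header, the `output_format` query parameter, and the placeholder voice ID and key are assumptions not stated in this article and should be checked against the current API reference.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io"  # assumption: not stated in the article
MAX_CHARS = 5000  # per-request ceiling for Eleven v3 cited in this article


def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (without sending) a POST to the text-to-speech route using eleven_v3."""
    if len(text) > MAX_CHARS:
        raise ValueError("Eleven v3 accepts at most 5,000 characters per request")
    body = json.dumps({"text": text, "model_id": "eleven_v3"}).encode("utf-8")
    # output_format value is an assumed name for MP3 at 44,100 Hz, 128 kbps.
    url = f"{API_BASE}/v1/text-to-speech/{voice_id}?output_format=mp3_44100_128"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


req = build_tts_request(
    "VOICE_ID_PLACEHOLDER",
    '[whispers] "They\'re already inside."',
    "YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can be inspected without credentials.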

Eleven v3 left alpha on February 2, 2026, when ElevenLabs declared the model generally available. The company framed the GA release around two improvements over the alpha: "More stable" delivery and "More accurate" handling of numbers, symbols, and specialized notation across languages, and reported that in blind comparisons users preferred the GA model 72 percent of the time over the alpha.^[2] Coverage of both the alpha and GA focused on the audio-tag system, which represented a clear break from the brittle SSML-style markup used by older text-to-speech systems. Outlets framed the release as a step toward voice generation that could be directed like a stage performer rather than configured like a synthesizer; some compared it to expressive systems from rivals such as Hume's Octave line and to the open-weights conversational model Sesame CSM.^[22] Staniszewski used subsequent press appearances to argue that audio models would commoditize over time and that the lasting moat would sit in tooling, voice rights, and distribution rather than the underlying acoustic model.^[23]

What can Eleven v3 do?

Eleven v3 is built around four ideas: expressive delivery, multi-speaker dialogue, broad language coverage, and accurate handling of complex text. Internally, ElevenLabs has described v3 as a new architecture rather than a refinement of Multilingual v2, with a larger model and a higher-fidelity voice codec that improves audio quality at the cost of generation speed.^[24] The company also raised the model's ceiling for contextual understanding so that a single passage can shift mood, accent, and even speaker without losing the cloned voice's identity.

Headline capabilities

Capability	Behavior in Eleven v3	Notes
Inline audio tags	Bracketed cues such as `[laughs]`, `[whispers]`, `[shouts]`, `[sarcastically]` change emotion, volume, pacing, accent, or insert non-verbal reactions^[25]	Tag behavior depends on the chosen voice and surrounding context; results vary across voices
Multi-speaker dialogue	Dedicated Text to Dialogue interface generates an entire scene with multiple voices in one pass, including overlaps and interruptions^[26]	First ElevenLabs model with native multi-speaker support
Multilingual output	70+ languages spanning major languages such as English, Mandarin, Spanish, German, French, and Japanese plus smaller languages such as Luxembourgish, Lingala, Sindhi, and Cebuano^[27]	Coverage more than doubles Multilingual v2's 28 languages
Accent preservation	A cloned voice keeps its original timbre and accent when speaking any of the supported languages^[28]	Supports cross-language dubbing without re-cloning the speaker
Complex text accuracy	At GA the error rate on chemical formulas, phone numbers, addresses, abbreviations, and currency formats dropped from 15.3 percent to 4.9 percent, a 68 percent reduction versus the alpha^[2]^[29]	Especially relevant for educational, technical, and finance content
Studio-quality output	MP3 at 44,100 Hz 128 kbps plus various PCM and Opus options through the API^[30]	Same audio format menu as older models
Context length	Up to 5,000 characters per request, roughly five minutes of audio^[31]	Multilingual v2 supports 10,000 characters per request
Voice cloning compatibility	Works with Instant Voice Clones and designed voices; Professional Voice Clones are usable but were not fully optimized at alpha^[32]	ElevenLabs flagged this limitation in launch documentation
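The 68 percent figure in the complex-text row above is a relative reduction, not a difference in percentage points. A short check, using only the two error rates quoted in the table:

```python
alpha_error = 15.3  # percent, alpha error rate quoted in the table
ga_error = 4.9      # percent, GA error rate quoted in the table

# Relative reduction: the drop divided by the starting error rate.
relative_reduction = (alpha_error - ga_error) / alpha_error
print(f"{relative_reduction:.1%}")  # prints 68.0%
```

The absolute drop is 10.4 percentage points; dividing by the 15.3 percent starting point gives the roughly 68 percent relative improvement.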

In practice, creators have used this screenplay-style direction for animated character dialogue, audiobook performances with distinct narrator and character voices, and dubbed video that preserves the original speaker's emotional arc.

What are the trade-offs?

The new architecture has costs. Generation is slower than on Multilingual v2 and substantially slower than Flash v2.5 or Turbo v2.5, which makes v3 unsuitable for live conversational use cases where the agent must respond within a few hundred milliseconds. ElevenLabs explicitly recommends Flash v2.5 for real-time agents and steers v3 toward batch workloads such as audiobooks, podcasts, dubbing, and content pre-production.^[33] The 5,000-character ceiling per request is also half of Multilingual v2's limit, which forces longer pieces to be chunked and stitched.
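The chunking step mentioned above can be sketched as a greedy sentence packer. This is a minimal illustration, not an ElevenLabs utility: the function name and the sentence-splitting heuristic are hypothetical, and a production splitter would also need to keep audio tags attached to the line they direct.

```python
import re

MAX_CHARS = 5000  # per-request ceiling for Eleven v3 cited in this article


def chunk_script(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after ., !, or ? followed by whitespace; a deliberately simple heuristic.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single oversized sentence is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_script("One. Two! Three?", max_chars=10))  # ['One. Two!', 'Three?']
```

Each chunk would then be rendered in its own request and the resulting audio files concatenated; splitting at sentence boundaries keeps the joins at natural pauses.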

Alpha testers reported that v3's expressive range can introduce variability across takes, and some users argued that Multilingual v2 still offers tighter control for neutral narration where every sentence must land the same way.^[34] ElevenLabs has acknowledged the gap and continues to keep Multilingual v2 in the product line as the recommended default for steady reads.

What are Eleven v3 audio tags?

Audio tags are the most visible feature of Eleven v3. Writers embed cues inside a regular text prompt by wrapping a directive in square brackets, for example: She paused. [whispers] "They're already inside." ElevenLabs documents five broad categories of tags, although in practice the line between them blurs and tags can be combined.^[35]

Documented audio-tag categories

Category	Purpose	Representative tags
Emotions and tone	Color the next line with a specific feeling	`[sad]`, `[angry]`, `[happily]`, `[sorrowful]`, `[nervously]`, `[excited]`, `[cheerfully]`, `[sternly]`, `[sarcastically]`, `[dramatic tone]`
Delivery direction	Change volume, pacing, or speech rate	`[whispers]`, `[shouts]`, `[pause]`, `[rushed]`, `[slows down]`, `[deliberate]`, `[rapid-fire]`, `[drawn out]`, `[continues after a beat]`
Non-verbal reactions	Insert authentic human sounds	`[laughs]`, `[laughs softly]`, `[giggles]`, `[sighs]`, `[sigh of relief]`, `[clears throat]`, `[gasps]`, `[gulps]`, `[breathes]`, `[stammers]`
Character voices and accents	Switch into a persona or regional accent	`[pirate voice]`, `[French accent]`, `[British accent]`, `[Southern US accent]`, `[American accent]`
Sound effects and environment	Drop in basic non-speech sound elements	`[clapping]`, `[explosion]`, `[gunshot]`

ElevenLabs has stressed that audio tags are interpreted by the model rather than executed as pre-recorded clips, so a [gunshot] or [explosion] will sound stylized and context-dependent rather than a sample from a sound effects library, and the same tag may produce slightly different audio across different voices.^[36] Tags can also be stacked: [whispering][pause] "Don't move." [sigh of relief] produces a single line that whispers, holds, delivers the dialogue, and exhales out of the moment, all in one generation.
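Because tags are plain bracketed text, a script can be inspected on the client before it is sent, for instance to list the tags in use or to measure the spoken text alone. The helpers below are a hypothetical sketch of that pre-processing; they do not reproduce how the model itself interprets tags.

```python
import re

# Matches one bracketed directive such as [whispers] or [sigh of relief].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(script: str) -> list[str]:
    """Return the bracketed audio tags in order of appearance, without brackets."""
    return [tag.strip() for tag in TAG_PATTERN.findall(script)]


def strip_tags(script: str) -> str:
    """Return the spoken text only, with tags removed and whitespace collapsed."""
    return " ".join(TAG_PATTERN.sub(" ", script).split())


line = '[whispering][pause] "Don\'t move." [sigh of relief]'
print(extract_tags(line))  # ['whispering', 'pause', 'sigh of relief']
print(strip_tags(line))    # "Don't move."
```

A tag count per line from `extract_tags` is one simple way to spot over-marked passages during script review.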

The company recommends that users start with a small number of tags on a strong, stable voice rather than over-marking a script, and notes that Instant Voice Clones generally outperformed Professional Voice Clones on audio-tag accuracy during the alpha. Independent guides published after the launch echo this advice and add that combining a delivery tag with an emotion tag ([whispering][nervously]) produces more reliable results than chaining several emotion tags in a row.^[37]

How do you keep audio tags stable?

Prompt engineering for v3 has crystallized into a small set of patterns since the alpha opened. Writers anchor a scene with a setting line that establishes mood ([soft] [intimate]) before the first piece of dialogue, then alternate between plain text and a delivery cue when the energy of the scene changes. ElevenLabs documentation distinguishes between persistent tags, which stay in effect until a contradictory tag arrives, and transient tags such as [gasps] or [clears throat], which return the voice to its baseline after a single beat.^[53]

Audio tags also interact with the underlying voice. ElevenLabs has reported that v3 stability is highest on voices designed for the model in the in-app Voice Design flow, somewhat lower on Instant Voice Clones, and lowest on legacy Professional Voice Clones that have not been retuned for the alpha codec. As GA approached, the company began rolling out a v3-compatible retrain pipeline for professional clone owners.

How many languages does Eleven v3 support?

Eleven v3 supported more than 70 languages at launch, more than doubling the 28 languages covered by Multilingual v2. ElevenLabs has stated that this expansion increases the share of the global population that can use the model in a native language from roughly 60 percent to roughly 90 percent.^[38] The list spans every major regional language family and reaches into languages where high-quality synthetic voices had been thin on the ground, including Luxembourgish, Lingala, Cebuano, Sindhi, and several African and Central Asian languages.

Language support in v3 is paired with three features that previous models did not have together. First, automatic language detection: the model can read a prompt that mixes English and another language in the same paragraph without requiring a hard switch. Second, accent and timbre preservation when a single cloned voice speaks a language different from its training source, which is what allows the AI dubbing studio to dub a Spanish actor's interview into Japanese without sounding like a different speaker. Third, audio tags work across languages, so [whispers] or [laughs] produces appropriate non-verbal sounds whether the surrounding line is in French, Polish, or Tagalog.^[39]

Language quality is not uniform. ElevenLabs has framed v3 as a research preview and has noted that smaller languages can show stronger accent artifacts and occasional mispronunciations, particularly for technical vocabulary. Multilingual v2 remains in the catalog for users who need the most stable performance on a narrow set of widely supported languages, while v3 is positioned as the higher-ceiling option for projects that need either broad coverage or expressive performance.

What is the Text to Dialogue API?

The Text to Dialogue endpoint is the most structurally novel piece of Eleven v3. Earlier ElevenLabs models could only produce one speaker per generation, so a conversation between a narrator, a child character, and a villain required three separate API calls and a manual stitch. Text to Dialogue collapses this into a single request that takes a list of turns, each tagged with a voice identifier and an optional set of audio tags, and renders the full scene in one pass with consistent room tone and emotional continuity across speakers.^[54] ElevenLabs describes the endpoint as automatically managing "speaker transitions, emotional changes, and interruptions" to produce a cohesive, overlapping audio file.^[26]

The model handles three behaviors that are difficult to script outside a unified pass. Interruptions cut one speaker off mid-line and lift the second voice in tempo. Overlaps allow short phrases such as [gasps] or [laughs] to ride over the tail of another speaker's line. Cross-speaker reactions let an utterance from one character shape the prosody of the next character's reply. These behaviors are part of why community demos of four-character animated scenes produced in a single generation drew so much attention at launch.^[55]

In the API, the dialogue endpoint accepts a JSON list of speaker turns at POST /v1/text-to-dialogue with each entry specifying voice_id and text, optionally including voice_settings to override per-turn style or stability values. The same audio tag vocabulary used in single-speaker generation works inside each turn. Third-party guides have collected reusable templates for two-person interviews, four-character dramatic scenes, and call-center training material.
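A request body for the dialogue route might be assembled as in the sketch below. The article names only `voice_id`, `text`, and the optional `voice_settings` per turn; the top-level `inputs` and `model_id` field names, the helper function, and the placeholder voice IDs are assumptions for illustration.

```python
import json


def build_dialogue_payload(turns: list[tuple[str, str]], model_id: str = "eleven_v3") -> str:
    """Serialize (voice_id, text) turns into a JSON body for POST /v1/text-to-dialogue."""
    if not turns:
        raise ValueError("a dialogue needs at least one turn")
    inputs = [{"voice_id": voice_id, "text": text} for voice_id, text in turns]
    # "inputs" and "model_id" are assumed field names, not confirmed by this article.
    return json.dumps({"model_id": model_id, "inputs": inputs}, ensure_ascii=False)


payload = build_dialogue_payload([
    ("NARRATOR_VOICE_ID", "The door creaked open."),
    ("VILLAIN_VOICE_ID", "[laughs] You're too late."),
])
print(payload)
```

Keeping the turns as an ordered list mirrors the single-pass rendering described above: the order of entries is the order of speech, and audio tags travel inside each turn's text.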

How is Eleven v3 used in Studio and ElevenCreative?

Eleven v3 is wired into Studio 3.0, the renamed and expanded successor to ElevenLabs Studio that the company shipped alongside GA in February 2026. Studio 3.0 is positioned as a full audio and video production environment rather than a thin wrapper around the text-to-speech endpoint, and it consolidates manuscript ingestion, voice generation, audio tag editing, multi-speaker assembly, music scoring, video lip-sync, and export into a single browser-based timeline.^[56] Inside Studio, v3 is selected on a per-clip basis, which lets editors mix steady Multilingual v2 takes for neutral narration with v3 takes for emotional peaks in the same project.

The audiobook surface, branded ElevenCreative Audiobooks, sits one level up from Studio and targets trade publishers and self-publishing authors. The flow accepts an EPUB or DOCX manuscript, splits it into chapters, lets the producer assign character voices, runs v3 on dialogue passages while leaving narration on Multilingual v2 by default, and exports a publisher-ready audiobook bundle with chapter markers and metadata. ElevenLabs paired the launch with InAudio, the distribution arm built on infrastructure from the company's 2025 Findaway acquisition, which lets producers push a finished audiobook directly to Spotify, Audible, and library wholesalers.^[57]

ElevenLabs also launched a v3-aware iteration of its dubbing studio, which uses the model's accent-preserving multilingual delivery to retain the original speaker's voice across language switches. An English documentary can be relocalized into Spanish, French, and Japanese in parallel, with v3 picking up emotion cues from the source audio.

Why is Eleven v3 not used for live agents?

ElevenLabs spent 2025 rebranding its conversational AI product into ElevenAgents, and the platform reached a wider March 2026 launch that bundled phone, web, and chat surfaces into a single agent runtime with first-turn latency reported under 500 milliseconds.^[58] Eleven v3 is not the recommended speech engine for ElevenAgents, because its higher-fidelity codec and broader expressive range come with generation latency too long for natural turn-taking. The agent stack defaults to Flash v2.5 or Turbo v2.5 for the spoken side of a live call, and ElevenLabs documentation explicitly steers callers building production agents toward the Flash family when latency budgets are tight.^[59]

What v3 does contribute is offline production work that surrounds a live deployment. Prompted greetings, scripted voicemail messages, training samples, and outbound batch calls are usually rendered in v3 to capture more emotional range, then cached and streamed by the agent at runtime as audio assets. The IBM partnership announced in March 2026 to bring ElevenLabs voices into IBM watsonx Orchestrate uses this pattern: live agent turns run on Flash v2.5, while pre-recorded brand greetings and longer narrative segments are generated in v3 and served as static clips.^[60]

The clearest internal demarcation is the latency budget. Real-time agent turns must complete a first-byte response inside a few hundred milliseconds; Flash v2.5 hits that envelope and v3 does not. For workloads where the writer can afford to wait for a higher-quality render, v3 is the engine of choice.
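The routing rule implied by the latency budget can be written as a one-line decision. In this sketch the 500 ms cutoff is a hypothetical threshold chosen for illustration, and `eleven_flash_v2_5` is an assumed identifier for Flash v2.5; only `eleven_v3` is given in this article.

```python
from typing import Optional

FLASH_MODEL_ID = "eleven_flash_v2_5"  # assumed identifier for Flash v2.5
V3_MODEL_ID = "eleven_v3"             # identifier given in this article
REALTIME_BUDGET_MS = 500              # hypothetical cutoff, not an ElevenLabs figure


def choose_model(latency_budget_ms: Optional[float]) -> str:
    """Pick a model ID; None means an offline render with no latency constraint."""
    if latency_budget_ms is not None and latency_budget_ms <= REALTIME_BUDGET_MS:
        return FLASH_MODEL_ID
    return V3_MODEL_ID


print(choose_model(300))   # live agent turn
print(choose_model(None))  # pre-rendered greeting or narration
```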

How much does Eleven v3 cost?

Eleven v3 is accessible through both the ElevenLabs web interface and the company's hosted application programming interface. In the web app, the model appears in the model selector under Text to Speech and in Studio projects, and is available on every paid subscription tier (Starter, Creator, Pro, Scale, and Business). The free tier does not include v3 generation. The public application programming interface uses the model identifier eleven_v3 and supports the standard text-to-speech endpoint as well as dedicated text-to-dialogue endpoints for multi-speaker scenes, including a Beta dialogue endpoint that lets callers pass per-speaker voice assignments in a single request.^[40]

ElevenLabs uses a credit model rather than per-character billing for most plans, where each subscription tier comes with a monthly pool of credits and v3 consumes one credit per character generated, identical to the rate on Multilingual v2 once the alpha discount expires.^[41] On the API, ElevenLabs lists Multilingual v2 and v3 at the same headline rate of $0.10 per 1,000 characters, twice the $0.05 per 1,000 characters charged on the Flash and Turbo models.^[63] Monthly credit allotments under the standard creator plans range from 10,000 credits on the free tier (Multilingual v2 only, no v3 access) to 30,000 credits on Starter, 100,000 credits on Creator, 500,000 on Pro, and into the millions on Scale and Business plans. The 80 percent alpha discount that ran through June 2025 reduced the effective character cost of v3 by a factor of five in the user interface and was used to seed creator adoption during the launch window. After the alpha discount ended, the effective per-character cost on the credit plans landed in a band that third-party trackers have placed at roughly $0.17 to $0.30 per 1,000 characters depending on plan tier.^[42]

A second route to Eleven v3 access runs through third-party model aggregators. Platforms such as WaveSpeed, Kie.ai, and several enterprise voice routing services expose eleven_v3 and the Text to Dialogue endpoint behind their own application programming interfaces, which appeals to developers who mix multiple speech vendors behind a single billing layer.^[43]

Effective per-character cost across tiers

Plan	Monthly credits	Headline price	Effective rate on v3
Free	10,000 (v3 excluded)	$0	n/a
Starter	30,000	$5	About $0.17 per 1,000 characters
Creator	100,000	$22	About $0.22 per 1,000 characters
Pro	500,000	$99	About $0.20 per 1,000 characters
Scale	2,000,000	$330	About $0.165 per 1,000 characters
Business	11,000,000	$1,320	About $0.12 per 1,000 characters
API list rate	n/a	$0.10 per 1,000 characters	Same as Multilingual v2; double the Flash and Turbo rate^[63]
Alpha promo (June 2025)	n/a	80% off in-app	About one-fifth of the post-alpha rate
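The effective rates in the table follow from dividing each plan's headline price by its credit pool, given the one-credit-per-character rate stated above. A short sketch that reproduces the arithmetic from the table's own figures (the dictionary and function names are illustrative):

```python
# (monthly credits, headline price in USD) copied from the table above
PLANS = {
    "Starter": (30_000, 5),
    "Creator": (100_000, 22),
    "Pro": (500_000, 99),
    "Scale": (2_000_000, 330),
    "Business": (11_000_000, 1_320),
}


def effective_rate_per_1k(credits: int, price_usd: float) -> float:
    """USD per 1,000 characters, assuming one credit per character."""
    return price_usd / credits * 1000


for plan, (credits, price) in PLANS.items():
    print(f"{plan}: ${effective_rate_per_1k(credits, price):.3f} per 1,000 characters")

# An 80 percent discount leaves 20 percent of the price: a factor-of-five reduction.
discount_factor = 1 / (1 - 0.80)
```

These figures assume the full monthly credit pool is spent on v3; unused credits raise the realized cost per character.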

Pricing trackers flag that the per-character rate on v3 is substantially higher than on rival and open-weights options: Hume Octave 2 is priced at roughly $7.60 per million characters, and self-hosted Sesame CSM carries no licensing fee but requires user-provided compute.^[61] ElevenLabs has argued that the headline price reflects the cost of the voice library, audio tag training, and the publisher tooling that wraps the model rather than just inference.

How does Eleven v3 compare to competitors?

ElevenLabs operates in an increasingly crowded expressive text-to-speech market. The closest comparisons in 2025 and early 2026 were OpenAI's TTS models, Hume's Octave family (and its successor Hume Octave 2), Cartesia Sonic, and the open-source Sesame CSM conversational model. Each system makes a different trade-off between latency, expressivity, control style, and licensing.

Eleven v3 in context (selected models, 2025 to 2026)

Model	Vendor	Approximate launch	Languages	Expressive control method	Latency profile	Notable strength
Eleven v3	ElevenLabs	Alpha June 2025; GA February 2026	70+	Inline bracketed audio tags	Offline; not real-time	Dramatic delivery, voice cloning ecosystem
Eleven Flash v2.5	ElevenLabs	2024	32	SSML-style break and emphasis	~75 ms inference, ~150 ms TTFA	Conversational agents, low-latency calls
OpenAI TTS	OpenAI	2023; Realtime API 2024	English-first, multilingual via Realtime	Plain-English voice instructions in prompt	Interactive plus a Realtime API for sub-second responses	Instructable voice character via prose direction
Hume Octave 2	Hume AI	October 2025	11+	Plain-English emotional instructions plus voice conversion and phoneme editing	Under 200 ms latency	Emotional intelligence, voice conversion
Cartesia Sonic	Cartesia	2024; Sonic-2 2025; Sonic-3 2026	Multilingual, growing	Limited emotion controls focused on speed	Sub-100 ms TTFB; Sonic-3 ~40 ms TTFA	Real-time voice agents on State Space Models
Sesame CSM	Sesame AI Labs	March 2025	English-first	Conversational context; open weights	Designed for conversational responsiveness	Open-weights companion-style voice model

On raw expressivity for prepared content, Eleven v3 and Hume's Octave line trade blows: third-party listening tests have placed both at the top of the expressive-speech category, with Octave 2 favored for emotional nuance through a plain-English directive vocabulary and Eleven v3 favored for cinematic delivery and a deeper voice library.^[44] For real-time voice agents, Cartesia Sonic-3 leads on latency with a time-to-first-audio around 40 milliseconds, and ElevenLabs steers customers toward Flash v2.5 rather than v3 for that workload. OpenAI's instructable voices offer flexible character control but operate on a smaller voice catalog and do not provide cloning, while Sesame CSM is the open-weights option for developers who want to host a conversational voice model themselves.

A recurring framing in 2026 coverage is that no single model dominates on both quality and latency at the same time. ElevenLabs splits this between Flash v2.5 and Eleven v3, Cartesia between Sonic-3 and a slower high-fidelity tier, and Hume through different Octave 2 modes. Producers route real-time turns through a fast model and pre-rendered content through v3 or an Octave 2 quality tier.

Audio-tag style versus directive prose

The expressive control vocabulary is the most visible philosophical difference among the three top expressive models. Eleven v3 uses inline bracketed tags that read like stage directions inside a screenplay; Hume Octave 2 uses plain-English directives passed as a separate field on the request; OpenAI's Realtime API accepts a short voice description prompt that biases the speaker style for an entire session. Reviewers in 2026 have argued that the tag approach in v3 is easier to version-control because the cues live inside the script, while the directive approach in Octave 2 is easier for casual users. Game studios and animation houses, where dialogue is iterated in writers' rooms, have leaned toward v3 for the script-embedded format; emotional interview tools and AI companion apps have leaned toward Octave 2.

How was Eleven v3 received?

The launch reception was loud and split. Coverage in the days after June 5, 2025 leaned positive on the model's expressive ceiling: outlets including Sifted, Geeky Gadgets, AIBase, and several developer-focused newsletters described v3 as a step change in synthetic voice acting, with multi-speaker dialogue and audio tags singled out as the features that came closest to performing rather than narrating.^[45] In one widely circulated demo, a single voice handled a four-character animated scene complete with overlaps and reactions, which had previously required either heavy editing or a dedicated multi-speaker pipeline.

At the same time, three threads of criticism appeared quickly. The first concerned consistency: heavy use of audio tags can change voice character across takes, and some testers reported that v3 felt less controllable than Multilingual v2 when the goal was a steady, neutral narration. A working consensus among long-form audiobook producers settled on Multilingual v2 for most chapters and v3 for emotionally charged passages where its dramatic range pays off.^[46] The second concerned content filters: users on the company's community forum complained that v3 declined to render some profanity and intense emotional content even in clearly artistic contexts, which they argued narrowed its usefulness for fiction and game dialogue.^[47] The third concerned post-alpha pricing: the 80 percent discount during June 2025 made v3 a near drop-in upgrade, but several reviewers noted that the post-discount rate sat well above community open-weights options such as Dia from Nari Labs, which had improved quickly through 2025.^[48]

Independent benchmark trackers placed Eleven v3 near the top of expressive text-to-speech leaderboards in late 2025 and early 2026. Artificial Analysis recorded v3 with an Elo score of 1196 in its speech model rankings after the general availability release, placing it second in its category, with first place taken by a real-time variant from a different vendor that traded off some expressive range for sub-second response.^[49] The general availability release sharpened the accuracy story: ElevenLabs reported that v3's error rate on numbers, symbols, and specialized notation fell from 15.3 percent in the alpha to 4.9 percent at GA, and that testers preferred the GA model 72 percent of the time in blind comparisons.^[2] Reviewers wrote that v3's combination of voice cloning, multilingual reach, and audio tags was hard to match end-to-end even where individual rivals beat it on one axis, and that the company's broader catalog (Multilingual v2 for stability, Flash and Turbo for latency, v3 for expression) gave producers a way to mix and match.^[50]

General availability arrived on February 2, 2026, paired with an expansion of ElevenLabs' publisher tools and the launch of a dedicated audiobook environment inside ElevenCreative.^[51] By that point ElevenLabs had moved past $330 million in annualized revenue and reached an $11 billion valuation, and Staniszewski used the GA milestone to argue that voice was becoming a primary interface for AI products.^[52]

Adoption signals after general availability

By spring 2026 a handful of adoption signals had appeared. A March 2026 Sacesta industry survey of voice AI deployments reported that Eleven v3 was the most commonly cited model for prerecorded brand voice, audiobook narration, and animation dialogue, while Flash v2.5 led the live-agent category by a wide margin.^[62] Several large North American publishers moved their default narration workflow into ElevenCreative Audiobooks during the same window, with v3 reserved for character voices.

Independent reviewers in 2026 began to read v3 as part of a broader product story rather than a standalone model launch. In this reading, ElevenLabs has shifted from a model vendor to an audio platform, with v3 as the expressive layer, Flash v2.5 as the conversational layer, Eleven Music as the score layer, and Eleven Scribe as the speech-to-text layer, all feeding the same content graph.

References

ElevenLabs (June 5, 2025). "Introducing Eleven v3 (alpha)." X (formerly Twitter). https://x.com/elevenlabsio/status/1930689774278570003
ElevenLabs Blog (February 2, 2026). "Eleven v3 is Now Generally Available." https://elevenlabs.io/blog/eleven-v3-is-now-generally-available
ElevenLabs. "Eleven v3: Most Expressive AI Voice Model." https://elevenlabs.io/v3
ElevenLabs Documentation. "Models." https://elevenlabs.io/docs/overview/models
ElevenLabs Documentation. "Models: Eleven v3 character limits." https://elevenlabs.io/docs/overview/models
ElevenLabs. "Eleven v3 (alpha): output formats." https://elevenlabs.io/v3
ElevenLabs Blog. "What are Eleven v3 Audio Tags and why they matter." https://elevenlabs.io/blog/v3-audiotags
ElevenLabs Documentation. "Eleven v3 model identifier." https://elevenlabs.io/docs/overview/models
ElevenLabs (June 5, 2025). "Eleven v3 alpha launch." https://x.com/elevenlabsio/status/1930689774278570003
Inworld AI. "ElevenLabs v3 review and latency guidance." https://inworld.ai/resources/elevenlabs-v3-review
ElevenLabs Blog. "Eleven v3 audio tags overview." https://elevenlabs.io/blog/v3-audiotags
ElevenLabs. "Eleven v3 use cases." https://elevenlabs.io/v3
ElevenLabs Developers (June 5, 2025). "Eleven v3 alpha in UI; public API access for Eleven v3 (alpha) coming soon." https://x.com/ElevenLabsDevs/status/1930690086204821639
Publishing Perspectives (February 2026). "ElevenLabs Summit: An Audiobook Company That Isn't About Audiobooks." https://publishingperspectives.com/2026/02/elevenlabs-summit-an-audiobook-company-that-isnt-about-audiobooks/
ElevenLabs Magazine (2026). "ElevenLabs Eleven v3 Model Complete Guide 2026." https://elevenlabsmagazine.com/elevenlabs-eleven-v3-model-complete-guide-2026/
Wikipedia. "ElevenLabs." https://en.wikipedia.org/wiki/ElevenLabs
ElevenLabs. "Product overview." https://elevenlabs.io
ElevenLabs (June 5, 2025). "Introducing Eleven v3 (alpha)." https://x.com/elevenlabsio/status/1930689774278570003
ElevenLabs Blog. "Audio tags and research-preview caveats." https://elevenlabs.io/blog/v3-audiotags
ElevenLabs Documentation. "Text to Speech and Text to Dialogue endpoints." https://elevenlabs.io/docs/overview/models
ElevenLabs Developers (2025). "Public API for Eleven v3 (alpha) coming soon." https://x.com/ElevenLabsDevs/status/1930690086204821639
CXO Today. "ElevenLabs introduces Eleven v3 (alpha): the most expressive Text to Speech model." https://cxotoday.com/press-release/elevenlabs-introduces-eleven-v3-alpha-the-most-expressive-text-to-speech-model/
TechCrunch (October 29, 2025). "ElevenLabs CEO says AI audio models will be 'commoditized' over time." https://techcrunch.com/2025/10/29/elevenlabs-ceo-says-ai-audio-models-will-be-commoditized-over-time/
Inworld AI. "Eleven v3 architecture and codec notes." https://inworld.ai/resources/elevenlabs-v3-review
ElevenLabs Blog. "Eleven v3 Audio Tags: Precision Delivery Control for AI Speech." https://elevenlabs.io/blog/eleven-v3-audio-tags-precision-delivery-control-for-ai-speech
ElevenLabs Blog. "Eleven v3 Audio Tags: Multi-Character Dialogue in AI Speech." https://elevenlabs.io/blog/eleven-v3-audio-tags-bringing-multi-character-dialogue-to-life
ElevenLabs Documentation. "Eleven v3 language coverage." https://elevenlabs.io/docs/overview/models
ElevenLabs Magazine. "Accent preservation in Eleven v3." https://elevenlabsmagazine.com/elevenlabs-eleven-v3-model-complete-guide-2026/
Inworld AI. "68 percent reduction in complex text errors." https://inworld.ai/resources/elevenlabs-v3-review
ElevenLabs. "Audio output formats." https://elevenlabs.io/v3
ElevenLabs Documentation. "Eleven v3 character limits." https://elevenlabs.io/docs/overview/models
ElevenLabs Blog. "Voice cloning compatibility with v3." https://elevenlabs.io/blog/v3-audiotags
Inworld AI. "Real-time guidance: use Flash v2.5 for live agents." https://inworld.ai/resources/elevenlabs-v3-review
ElevenLabs Magazine. "Community comparison of v3 and Multilingual v2 stability." https://elevenlabsmagazine.com/elevenlabs-eleven-v3-model-complete-guide-2026/
ElevenLabs Blog. "Eleven v3 audio tags categories." https://elevenlabs.io/blog/v3-audiotags
ElevenLabs Blog. "Audio tag behavior is voice and context dependent." https://elevenlabs.io/blog/v3-audiotags
Jonathan Mast. "ElevenLabs v3 Audio Tags User Guide." https://jonathanmast.com/elevenlabs-v3-audio-tags-user-guide-mastering-emotional-voice-control/
ElevenLabs (June 5, 2025). "From 33 to 70+ languages; 60 percent to 90 percent of the world's population." https://x.com/elevenlabsio/status/1930689774278570003
AlphaAvenue. "ElevenLabs v3 multilingual and accent preservation." https://alphaavenue.ai/en/magazine/technologies/elevenlabs-v3-setting-new-standards-in-ai-powered-speech-synthesis/
ElevenLabs Documentation. "eleven_v3 API parameter and dialogue endpoints." https://elevenlabs.io/docs/overview/models
ElevenLabs Help Center. "How much does it cost to generate using Eleven v3 (Alpha)?" https://help.elevenlabs.io/hc/en-us/articles/35869113958801-How-much-does-it-cost-to-generate-using-Eleven-v3-Alpha
Inworld AI. "Eleven v3 effective per-1,000-character pricing band." https://inworld.ai/resources/elevenlabs-v3-review
Kie.ai. "Affordable ElevenLabs Eleven v3 API for Multilingual Text to Dialogue with Audio Tags." https://kie.ai/elevenlabs/text-to-dialogue-v3
SurePrompts. "Voice Generation Models Compared (2026)." https://sureprompts.com/blog/voice-generation-models-compared-2026
CXO Today. "Eleven v3 alpha launch coverage." https://cxotoday.com/media-coverage/elevenlabs-introduces-eleven-v3-alpha-the-most-expressive-text-to-speech-model/
ElevenLabs Magazine. "Long-form audiobook workflow notes." https://elevenlabsmagazine.com/elevenlabs-eleven-v3-model-complete-guide-2026/
Robo Rhythms. "ElevenLabs Just Dropped v3: Here's What's Actually New." https://www.roborhythms.com/elevenlabs-v3-whats-new/
Robo Rhythms. "Post-discount pricing and Dia comparison." https://www.roborhythms.com/elevenlabs-v3-whats-new/
Inworld AI. "Artificial Analysis Elo score for Eleven v3." https://inworld.ai/resources/elevenlabs-v3-review
SurePrompts. "End-to-end vendor comparison." https://sureprompts.com/blog/voice-generation-models-compared-2026
Publishing Perspectives (February 2026). "ElevenLabs Summit and ElevenCreative audiobook environment." https://publishingperspectives.com/2026/02/elevenlabs-summit-an-audiobook-company-that-isnt-about-audiobooks/
TechCrunch (February 5, 2026). "ElevenLabs CEO: Voice is the next interface for AI." https://techcrunch.com/2026/02/05/elevenlabs-ceo-voice-is-the-next-interface-for-ai/
ElevenLabs Documentation. "Best practices for v3 audio tags." https://elevenlabs.io/docs/overview/capabilities/text-to-speech/best-practices
ElevenLabs Blog. "Eleven v3 Audio Tags: Multi-Character Dialogue in AI Speech." https://elevenlabs.io/blog/eleven-v3-audio-tags-bringing-multi-character-dialogue-to-life
Webfuse. "What is new in ElevenLabs V3." https://www.webfuse.com/blog/what-is-new-in-elevenlabs-v3
ElevenLabs Magazine. "ElevenLabs Studio 3.0: Complete Guide for Creators (2026)." https://elevenlabsmagazine.com/elevenlabs-studio-3-complete-guide-2026/
Publishing Perspectives (February 2026). "ElevenCreative Audiobooks and the InAudio distribution surface." https://publishingperspectives.com/2026/02/elevenlabs-summit-an-audiobook-company-that-isnt-about-audiobooks/
ElevenLabs Documentation. "ElevenAgents overview." https://elevenlabs.io/docs/eleven-agents/overview
Webfuse. "ElevenLabs cheat sheet 2026: agents, streaming, and models." https://www.webfuse.com/elevenlabs-cheat-sheet
IBM Newsroom (March 25, 2026). "Enterprise AI Finds its Voice: ElevenLabs and IBM Bring Premium Voice Capabilities to Agentic AI." https://newsroom.ibm.com/2026-03-25-enterprise-ai-finds-its-voice-elevenlabs-and-ibm-bring-premium-voice-capabilities-to-agentic-ai
BuildMVPFast. "AI Voice TTS Pricing (April 2026): ElevenLabs, Inworld, Deepgram, OpenAI." https://www.buildmvpfast.com/api-costs/ai-voice
Sacesta (2026). "ElevenAgents: Build Conversational AI Voice Agents for Phone, WhatsApp & Chat (2026 Guide)." https://www.sacesta.com/our-work/blog/elevenlabs-agents-conversational-ai-guide-2026
ElevenLabs. "ElevenAPI Pricing for creators and businesses of all sizes." https://elevenlabs.io/pricing/api
