ElevenLabs v3
Last reviewed
May 17, 2026
Sources
62 citations
Review status
Source-backed
Revision
v2 · 5,469 words
| Eleven v3 (alpha) | |
|---|---|
| Developer | ElevenLabs |
| Type | Text-to-speech model |
| Alpha release | June 5, 2025[1] |
| General availability | February 2026[2] |
| Architecture | Successor to Eleven Multilingual v2; new model with deeper text understanding[3] |
| Languages | 70+ at launch[4] |
| Character limit | 5,000 characters per request (alpha and GA)[5] |
| Output formats | MP3 at 44,100 Hz, 128 kbps, plus PCM and Opus options[6] |
| Key feature | Inline audio tags such as [laughs], [whispers], [excited] for emotion and delivery control[7] |
| Model ID | eleven_v3 (API parameter)[8] |
| Pricing during alpha | 80% discount in the user interface through June 2025; API alpha opened in mid-2025[9] |
| Real-time use | Not optimized for live conversation; Flash v2.5 remains the low-latency option[10] |
Eleven v3, marketed by ElevenLabs as Eleven v3 (alpha), is a third-generation text-to-speech model that the London-based audio artificial intelligence startup released in public alpha on June 5, 2025. The model was promoted by the company as its most expressive synthetic voice system to date and introduced an inline markup language of bracketed "audio tags" such as [laughs], [whispers], [excited], and [sighs] that direct emotion, delivery, and non-verbal reactions inside the script itself.[11] Eleven v3 supports more than seventy languages at launch, native multi-speaker dialogue generation, and improved handling of dense numeric and symbolic text, although it sacrifices the low-latency profile of the older Flash and Turbo families and is therefore positioned for offline workloads such as audiobooks, character voice acting, and cinematic narration rather than live agent calls.[12]
The alpha rollout was paired with an 80 percent price cut in the ElevenLabs web interface through the end of June 2025, and a public alpha of the v3 application programming interface followed shortly after.[13] ElevenLabs moved Eleven v3 into general availability in February 2026, alongside a wider expansion of the company's publisher tooling and ElevenCreative audiobook environment.[14] Third-party benchmarks placed v3 near the top of expressive text-to-speech leaderboards, but reviews were mixed: testers praised the dramatic delivery and audio-tag system while criticizing slower generation, reduced control compared to Multilingual v2, and content filters that blocked some creative use cases.[15]
ElevenLabs was founded in 2022 by Mati Staniszewski and Piotr Dabkowski, two friends from Warsaw who started the company after watching a dubbed American film whose Polish voice track stripped most of the original emotional performance. Staniszewski had worked at Palantir Technologies and Dabkowski had been a machine learning engineer at Google. Their first commercial product was an English text-to-speech engine, internally called v1, that drew attention in early 2023 for producing audio that listeners often struggled to distinguish from a human reader.[16]
A second generation followed: Eleven English v2 sharpened the original voice quality, and Eleven Multilingual v2, released in 2023, extended the system to 28 languages while keeping a cloned speaker's accent and timbre across them. Multilingual v2 quickly became the company's default model for narration and audiobook work because it produced steady, neutral delivery with relatively few mispronunciations. ElevenLabs also shipped lower-latency siblings, Flash v2.5 and Turbo v2.5, that traded some expressive range for inference speeds suited to live agent platforms. By the end of 2024 the product line included instant and professional voice cloning, an AI dubbing studio, a voice library with thousands of community submissions, and a conversational agent platform that competed with offerings from OpenAI and Google.[17]
The gap that Eleven v3 was built to close was expressive performance. Even with Multilingual v2's emotional range, scripted reads still sounded controlled in a way that worked for nonfiction but felt flat in fiction, gaming, and animation. Writers asking the model to whisper, shout, sob, or laugh on cue had to chain SSML breaks and prompt tricks. Multi-speaker dialogue had to be produced one speaker at a time and stitched together. The v3 research project, which ElevenLabs began describing publicly in early 2025, was an attempt to solve those problems with a single new model.
ElevenLabs announced Eleven v3 on June 5, 2025 via a post on its corporate X account and a microsite at elevenlabs.io/v3. The launch message described the model as "the most expressive Text to Speech model ever," highlighted 70+ language support, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers], and offered an 80 percent discount on v3 generation in the ElevenLabs user interface through the end of June 2025.[18] The model was labelled an alpha and a research preview, and ElevenLabs cautioned that it required more careful prompt engineering than Multilingual v2, that professional voice clones were not yet fully tuned for the new architecture, and that latency was higher than on the company's Flash and Turbo lines.[19]
The rollout proceeded in stages. On launch day v3 became selectable from the Text to Speech and Studio dropdowns on elevenlabs.io for all paid plans, with the free tier excluded. A few weeks later ElevenLabs opened a public alpha of the v3 endpoint for the company's hosted application programming interface, with the model identifier eleven_v3 available through the standard POST /v1/text-to-speech/:voice_id route and dedicated text-to-dialogue endpoints for multi-speaker scenes.[20] The official ElevenLabs developer account confirmed that public API access was scheduled to roll out shortly after the user interface launch and would follow the same alpha pricing pattern for early adopters.[21]
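As a rough illustration of the single-speaker route described above, the following Python sketch builds (but does not send) a request. The base URL, the `/v1` prefix, the `xi-api-key` header name, and the JSON field names `text` and `model_id` are assumptions that this article does not state and should be checked against the current ElevenLabs API reference; `build_tts_request` is a hypothetical helper, not part of any official SDK.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io"  # assumed base URL
MODEL_ID = "eleven_v3"                  # model identifier named in the article
MAX_CHARS = 5000                        # per-request character limit for v3


def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build a POST request for the text-to-speech route without sending it."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters; split it first")
    # Field names below are assumptions; verify them against the API reference.
    body = json.dumps({"text": text, "model_id": MODEL_ID}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/v1/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can be inspected without credentials.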
Coverage of the launch focused on the audio-tag system, which represented a clear break from the brittle SSML-style markup used by older text-to-speech systems. Outlets framed the release as a step toward voice generation that could be directed like a stage performer rather than configured like a synthesizer; some compared it to expressive systems from rivals such as Hume's Octave line and to the open-weights conversational model Sesame CSM.[22] Staniszewski used subsequent press appearances to argue that audio models would commoditize over time and that the lasting moat would sit in tooling, voice rights, and distribution rather than the underlying acoustic model.[23]
Eleven v3 is built around four ideas: expressive delivery, multi-speaker dialogue, broad language coverage, and accurate handling of complex text. Internally, ElevenLabs has described v3 as a new architecture rather than a refinement of Multilingual v2, with a larger model and a higher-fidelity voice codec that improves audio quality at the cost of generation speed.[24] The company also raised the model's ceiling for contextual understanding so that a single passage can shift mood, accent, and even speaker without losing the cloned voice's identity.
| Capability | Behavior in Eleven v3 | Notes |
|---|---|---|
| Inline audio tags | Bracketed cues such as [laughs], [whispers], [shouts], [sarcastically] change emotion, volume, pacing, accent, or insert non-verbal reactions[25] | Tag behavior depends on the chosen voice and surrounding context; results vary across voices |
| Multi-speaker dialogue | Dedicated Text to Dialogue interface generates an entire scene with multiple voices in one pass, including overlaps and interruptions[26] | First ElevenLabs model with native multi-speaker support |
| Multilingual output | 70+ languages spanning major languages such as English, Mandarin, Spanish, German, French, and Japanese plus smaller languages such as Luxembourgish, Lingala, Sindhi, and Cebuano[27] | Coverage more than doubles Multilingual v2's 28 languages |
| Accent preservation | A cloned voice keeps its original timbre and accent when speaking any of the supported languages[28] | Supports cross-language dubbing without re-cloning the speaker |
| Complex text accuracy | ElevenLabs reports a 68 percent reduction in errors on chemical formulas, phone numbers, addresses, abbreviations, and currency formats compared with v2[29] | Especially relevant for educational, technical, and finance content |
| Studio-quality output | MP3 at 44,100 Hz 128 kbps plus various PCM and Opus options through the API[30] | Same audio format menu as older models |
| Context length | Up to 5,000 characters per request, roughly five minutes of audio[31] | Multilingual v2 supports 10,000 characters per request |
| Voice cloning compatibility | Works with Instant Voice Clones and designed voices; Professional Voice Clones are usable but were not fully optimized at alpha[32] | ElevenLabs flagged this limitation in launch documentation |
The practical effect is that v3 lets a writer direct an audio scene the way a screenplay does. A short passage might open with [whispers] over background tension, shift to [shouts] for a confrontation, and end with [sighs] for resignation, all from the same cloned voice. Creators have used it for animated character dialogue, audiobook performances with distinct narrator and character voices, and dubbed video that preserves the original speaker's emotional arc.
The new architecture has costs. Generation is slower than on Multilingual v2 and substantially slower than Flash v2.5 or Turbo v2.5, which makes v3 unsuitable for live conversational use cases where the agent must respond within a few hundred milliseconds. ElevenLabs explicitly recommends Flash v2.5 for real-time agents and steers v3 toward batch workloads such as audiobooks, podcasts, dubbing, and content pre-production.[33] The 5,000-character ceiling per request is also half of Multilingual v2's limit, which forces longer pieces to be chunked and stitched.
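The chunking that the 5,000-character ceiling forces can be sketched as follows. This is a minimal illustration that assumes sentence boundaries are acceptable split points; `chunk_script` is a hypothetical helper rather than anything ElevenLabs ships.

```python
import re


def chunk_script(text: str, limit: int = 5000) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is rendered as a separate request and the resulting audio is concatenated afterwards. Because a delivery cue can carry over between sentences, a production splitter would likely also repeat the relevant audio tag at the start of each new chunk.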
Alpha testers reported that v3's expressive range can introduce variability across takes, and some users argued that Multilingual v2 still offers tighter control for neutral narration where every sentence must land the same way.[34] ElevenLabs has acknowledged the gap and continues to keep Multilingual v2 in the product line as the recommended default for steady reads.
Audio tags are the most visible feature of Eleven v3. The system lets writers embed cues inside a regular text prompt by wrapping a directive in square brackets, as in: She paused. [whispers] "They're already inside." ElevenLabs documents five broad categories of tags, although in practice the line between them blurs and tags can be combined.[35]
| Category | Purpose | Representative tags |
|---|---|---|
| Emotions and tone | Color the next line with a specific feeling | [sad], [angry], [happily], [sorrowful], [nervously], [excited], [cheerfully], [sternly], [sarcastically], [dramatic tone] |
| Delivery direction | Change volume, pacing, or speech rate | [whispers], [shouts], [pause], [rushed], [slows down], [deliberate], [rapid-fire], [drawn out], [continues after a beat] |
| Non-verbal reactions | Insert authentic human sounds | [laughs], [laughs softly], [giggles], [sighs], [sigh of relief], [clears throat], [gasps], [gulps], [breathes], [stammers] |
| Character voices and accents | Switch into a persona or regional accent | [pirate voice], [French accent], [British accent], [Southern US accent], [American accent] |
| Sound effects and environment | Drop in basic non-speech sound elements | [clapping], [explosion], [gunshot] |
ElevenLabs has stressed that audio tags are interpreted by the model rather than executed as pre-recorded clips, so a [gunshot] or [explosion] will sound stylized and context-dependent rather than a sample from a sound effects library, and the same tag may produce slightly different audio across different voices.[36] Tags can also be stacked: [whispering][pause] "Don't move." [sigh of relief] produces a single line that whispers, holds, delivers the dialogue, and exhales out of the moment, all in one generation.
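For tooling that lints or previews a script, the bracket syntax is simple to separate from spoken text on the client side. The sketch below is an illustrative tokenizer only; it says nothing about how the model itself interprets tags, and `tokenize` is a hypothetical name.

```python
import re

# Either a bracketed tag or a run of text that contains no brackets.
TAG_OR_TEXT = re.compile(r"\[([^\[\]]+)\]|([^\[\]]+)")


def tokenize(script: str) -> list[tuple[str, str]]:
    """Split a v3-style script into ("tag", name) and ("text", words) tokens."""
    tokens: list[tuple[str, str]] = []
    for match in TAG_OR_TEXT.finditer(script):
        tag, text = match.group(1), match.group(2)
        if tag is not None:
            tokens.append(("tag", tag.strip()))
        elif text.strip():
            # Whitespace-only gaps between stacked tags are dropped.
            tokens.append(("text", text.strip()))
    return tokens
```

Applied to the stacked example above, the tokenizer yields two tags, one text segment, and a closing tag, which is enough to count tags per line or to flag over-marked passages before generation.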
The company recommends that users start with a small number of tags on a strong, stable voice rather than over-marking a script, and that Instant Voice Clones generally outperform Professional Voice Clones on audio-tag accuracy in the alpha. Independent guides published after the launch echo this advice and add that combining a delivery tag with an emotion tag ([whispering][nervously]) produces more reliable results than chaining several emotion tags in a row.[37]
Prompt engineering for v3 has crystallized into a small set of patterns since the alpha opened. Writers anchor a scene with a setting line that establishes mood ([soft] [intimate]) before the first piece of dialogue, then alternate between plain text and a delivery cue when the energy of the scene changes. ElevenLabs documentation distinguishes between persistent tags, which stay in effect until a contradictory tag arrives, and transient tags such as [gasps] or [clears throat], which return the voice to its baseline after a single beat.[53]
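The persistent versus transient distinction can be modelled as simple bookkeeping when reviewing a script. The following sketch is an assumption-laden illustration: the two transient tags come from the examples above, every other tag is treated as persistent, and the `CONFLICTS` table deciding which tags contradict each other is entirely hypothetical. It tracks what a reader would expect to be in effect, not what the model actually does.

```python
import re

# Transient examples named in the text; all other tags are assumed persistent.
TRANSIENT = {"gasps", "clears throat"}
# Hypothetical contradiction table: a new tag evicts the tags listed for it.
CONFLICTS = {"whispers": {"shouts"}, "shouts": {"whispers"}}


def annotate(script: str) -> list[tuple[tuple[str, ...], tuple[str, ...], str]]:
    """Return (persistent_tags, transient_tags, text) for each spoken segment."""
    active: list[str] = []
    pending: list[str] = []
    segments = []
    for tag, text in re.findall(r"\[([^\[\]]+)\]|([^\[\]]+)", script):
        name = tag.strip()
        if name:
            if name in TRANSIENT:
                pending.append(name)  # applies to the next segment only
            else:
                evicted = CONFLICTS.get(name, set())
                active = [t for t in active if t not in evicted]
                if name not in active:
                    active.append(name)
        elif text.strip():
            segments.append((tuple(active), tuple(pending), text.strip()))
            pending = []  # the voice returns to its baseline after one beat
    return segments
```

In this model a setting line such as [soft] [intimate] stays attached to every later segment, while a [gasps] cue is attached to one segment and then cleared. A transient tag with no text after it is dropped, which a fuller implementation would need to handle.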
Audio tags also interact with the underlying voice. ElevenLabs has reported that v3 stability is highest on voices designed for the model in the in-app Voice Design flow, somewhat lower on Instant Voice Clones, and lowest on legacy Professional Voice Clones that have not been retuned for the alpha codec. As GA approached, the company began rolling out a v3-compatible retrain pipeline for professional clone owners.
Multilingual coverage was one of the headline upgrades. Multilingual v2 supported 28 languages at launch; Eleven v3 covers more than seventy at alpha. ElevenLabs has stated that this expansion increases the share of the global population that can use the model in a native language from roughly 60 percent to roughly 90 percent.[38] The list spans every major regional language family and reaches into languages where high-quality synthetic voices had been thin on the ground, including Luxembourgish, Lingala, Cebuano, Sindhi, and several African and Central Asian languages.
Language support in v3 is paired with three features that previous models did not have together. First, automatic language detection: the model can read a prompt that mixes English and another language in the same paragraph without requiring a hard switch. Second, accent and timbre preservation when a single cloned voice speaks a language different from its training source, which is what allows the AI dubbing studio to dub a Spanish actor's interview into Japanese without sounding like a different speaker. Third, audio tags work across languages, so [whispers] or [laughs] produces appropriate non-verbal sounds whether the surrounding line is in French, Polish, or Tagalog.[39]
Language quality is not uniform. ElevenLabs has framed v3 as a research preview and has noted that smaller languages can show stronger accent artifacts and occasional mispronunciations, particularly for technical vocabulary. Multilingual v2 remains in the catalog for users who need the most stable performance on a narrow set of widely supported languages, while v3 is positioned as the higher-ceiling option for projects that need either broad coverage or expressive performance.
The Text to Dialogue endpoint is the most structurally novel piece of Eleven v3. Earlier ElevenLabs models could only produce one speaker per generation, so a conversation between a narrator, a child character, and a villain required three separate API calls and a manual stitch. Text to Dialogue collapses this into a single request that takes a list of turns, each tagged with a voice identifier and an optional set of audio tags, and renders the full scene in one pass with consistent room tone and emotional continuity across speakers.[54]
The model handles three behaviors that are difficult to script outside a unified pass. Interruptions cut one speaker off mid-line and lift the second voice in tempo. Overlaps allow short phrases such as [gasps] or [laughs] to ride over the tail of another speaker's line. Cross-speaker reactions let an utterance from one character shape the prosody of the next character's reply. These behaviors are part of why community demos of four-character animated scenes produced in a single generation drew so much attention at launch.[55]
In the API, the dialogue endpoint accepts a JSON list of speaker turns at POST /v1/text-to-dialogue with each entry specifying voice_id and text, optionally including voice_settings to override per-turn style or stability values. The same audio tag vocabulary used in single-speaker generation works inside each turn. Third-party guides have collected reusable templates for two-person interviews, four-character dramatic scenes, and call-center training material.
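A request body of the shape described above might be assembled as in the sketch below. The per-turn keys `voice_id`, `text`, and `voice_settings` follow the description in this article; wrapping the list in an `inputs` field next to `model_id`, and the `stability` setting used in the usage note, are assumptions to verify against the API reference. `Turn` and `build_dialogue_body` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    voice_id: str
    text: str  # may contain inline audio tags such as [whispers]
    voice_settings: Optional[dict] = None  # optional per-turn override


def build_dialogue_body(turns: list[Turn]) -> str:
    """Serialize speaker turns into a JSON body for the dialogue endpoint."""
    inputs = []
    for turn in turns:
        entry = {"voice_id": turn.voice_id, "text": turn.text}
        if turn.voice_settings is not None:
            entry["voice_settings"] = turn.voice_settings
        inputs.append(entry)
    # The "inputs" wrapper key is an assumption, not taken from this article.
    return json.dumps({"model_id": "eleven_v3", "inputs": inputs})
```

For example, `build_dialogue_body([Turn("narrator_voice", "[whispers] They are here."), Turn("villain_voice", "[laughs] Too late.", {"stability": 0.3})])` produces a two-turn scene in which only the second speaker overrides a voice setting; the voice identifiers here are placeholders.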
Eleven v3 is wired into Studio 3.0, the renamed and expanded successor to ElevenLabs Studio that the company shipped alongside GA in February 2026. Studio 3.0 is positioned as a full audio and video production environment rather than a thin wrapper around the text-to-speech endpoint, and it consolidates manuscript ingestion, voice generation, audio tag editing, multi-speaker assembly, music scoring, video lip-sync, and export into a single browser-based timeline.[56] Inside Studio, v3 is selected on a per-clip basis, which lets editors mix steady Multilingual v2 takes for neutral narration with v3 takes for emotional peaks in the same project.
The audiobook surface, branded ElevenCreative Audiobooks, sits one level up from Studio and targets trade publishers and self-publishing authors. The flow accepts an EPUB or DOCX manuscript, splits it into chapters, lets the producer assign character voices, runs v3 on dialogue passages while leaving narration on Multilingual v2 by default, and exports a publisher-ready audiobook bundle with chapter markers and metadata. ElevenLabs paired the launch with InAudio, the distribution arm built on infrastructure from the company's 2025 Findaway acquisition, which lets producers push a finished audiobook directly to Spotify, Audible, and library wholesalers.[57]
ElevenLabs also launched a v3-aware iteration of its dubbing studio, which uses the model's accent-preserving multilingual delivery to retain the original speaker's voice across language switches. An English documentary can be relocalized into Spanish, French, and Japanese in parallel, with v3 picking up emotion cues from the source audio.
ElevenLabs spent 2025 rebranding its conversational AI product into ElevenAgents, and the platform reached a wider March 2026 launch that bundled phone, web, and chat surfaces into a single agent runtime with first-turn latency reported under 500 milliseconds.[58] Eleven v3 is not the recommended speech engine for ElevenAgents, because its higher-fidelity codec and broader expressive range come with generation latency too long for natural turn-taking. The agent stack defaults to Flash v2.5 or Turbo v2.5 for the spoken side of a live call, and ElevenLabs documentation explicitly steers callers building production agents toward the Flash family when latency budgets are tight.[59]
What v3 does contribute is offline production work that surrounds a live deployment. Prompted greetings, scripted voicemail messages, training samples, and outbound batch calls are usually rendered in v3 to capture more emotional range, then cached and streamed by the agent at runtime as audio assets. The IBM partnership announced in March 2026 to bring ElevenLabs voices into IBM watsonx Orchestrate uses this pattern: live agent turns run on Flash v2.5, while pre-recorded brand greetings and longer narrative segments are generated in v3 and served as static clips.[60]
The clearest internal demarcation is the latency budget. Real-time agent turns must complete a first-byte response inside a few hundred milliseconds; Flash v2.5 hits that envelope and v3 does not. For workloads where the writer can afford to wait for a higher-quality render, v3 is the engine of choice.
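The latency-budget rule can be expressed as a small routing function. This is a sketch under stated assumptions: the 500 ms cutoff is a hypothetical stand-in for "a few hundred milliseconds", and the identifiers `eleven_flash_v2_5` and `eleven_multilingual_v2` are assumed model IDs that this article does not list (only `eleven_v3` is named in the text).

```python
from typing import Optional

REALTIME_BUDGET_MS = 500  # hypothetical cutoff for a live agent turn


def pick_model(latency_budget_ms: Optional[int], expressive: bool = True) -> str:
    """Choose a model ID from a latency budget; None means an offline render."""
    if latency_budget_ms is not None and latency_budget_ms <= REALTIME_BUDGET_MS:
        return "eleven_flash_v2_5"  # assumed ID for the low-latency model
    # Offline work: v3 for expressive reads, Multilingual v2 for steady ones.
    return "eleven_v3" if expressive else "eleven_multilingual_v2"
```

A live call turn with a 300 ms budget routes to the Flash model, while a cached brand greeting with no budget routes to v3.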
Eleven v3 is accessible through both the ElevenLabs web interface and the company's hosted application programming interface. In the web app, the model appears in the model selector under Text to Speech and in Studio projects, and is available on every paid subscription tier (Starter, Creator, Pro, Scale, and Business). The free tier does not include v3 generation. The public application programming interface uses the model identifier eleven_v3 and supports the standard text-to-speech endpoint as well as dedicated text-to-dialogue endpoints for multi-speaker scenes, including a Beta dialogue endpoint that lets callers pass per-speaker voice assignments in a single request.[40]
ElevenLabs bills most plans through credits rather than direct per-character charges: each subscription tier comes with a monthly pool of credits, and v3 consumes one credit per character generated, the same rate as Multilingual v2 once the alpha discount expires.[41] Monthly credit allotments range from 10,000 credits on the free tier (Multilingual v2 only, no v3 access) to 30,000 credits on Starter, 100,000 on Creator, 500,000 on Pro, and into the millions on Scale and Business plans. The 80 percent alpha discount that ran through June 2025 cut the effective character cost of v3 in the user interface to one-fifth of the standard rate and was used to seed creator adoption during the launch window. After the alpha discount ended, the API price for v3 settled into a band that third-party trackers have placed at roughly $0.17 to $0.30 per 1,000 characters depending on plan tier.[42]
A second route to Eleven v3 access runs through third-party model aggregators. Platforms such as WaveSpeed, Kie.ai, and several enterprise voice routing services expose eleven_v3 and the Text to Dialogue endpoint behind their own application programming interfaces, which appeals to developers who mix multiple speech vendors behind a single billing layer.[43]
| Plan | Monthly credits | Headline price | Effective rate on v3 |
|---|---|---|---|
| Free | 10,000 (v3 excluded) | $0 | n/a |
| Starter | 30,000 | $5 | About $0.17 per 1,000 characters |
| Creator | 100,000 | $22 | About $0.22 per 1,000 characters |
| Pro | 500,000 | $99 | About $0.20 per 1,000 characters |
| Scale | 2,000,000 | $330 | About $0.165 per 1,000 characters |
| Business | 11,000,000 | $1,320 | About $0.12 per 1,000 characters |
| Alpha promo (June 2025) | n/a | 80% off in-app | About one-fifth of the post-alpha rate |
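The effective rates in the table follow from dividing each plan's headline price by its monthly credits, given one credit per character on v3. A short sketch of that arithmetic, using only the figures in the table; applying the 80 percent alpha discount to these plan prices is an illustrative simplification, and the function names are hypothetical.

```python
# (headline price in USD, monthly credits) taken from the table above.
PLANS = {
    "Starter": (5, 30_000),
    "Creator": (22, 100_000),
    "Pro": (99, 500_000),
    "Scale": (330, 2_000_000),
    "Business": (1_320, 11_000_000),
}
ALPHA_DISCOUNT = 0.80  # in-app promotion through June 2025


def rate_per_1k(plan: str) -> float:
    """USD per 1,000 characters, assuming one credit per character."""
    price, credits = PLANS[plan]
    return price / credits * 1000


def alpha_rate_per_1k(plan: str) -> float:
    """Effective rate while the alpha discount applied (one-fifth of normal)."""
    return rate_per_1k(plan) * (1 - ALPHA_DISCOUNT)


if __name__ == "__main__":
    for name in PLANS:
        print(f"{name}: ${rate_per_1k(name):.3f} per 1,000 characters")
```

The calculation assumes the full monthly allotment is spent on v3; unused credits raise the real cost per character.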
Pricing trackers flag that the per-character rate on v3 is substantially higher than on some alternatives: Hume Octave 2 is priced at roughly $7.60 per million characters, and self-hosted Sesame CSM carries no model fee but requires user-provided compute.[61] ElevenLabs has argued that the headline price reflects the cost of the voice library, audio tag training, and the publisher tooling that wraps the model rather than just inference.
ElevenLabs operates in an increasingly crowded expressive text-to-speech market. The closest comparisons in 2025 and early 2026 were OpenAI's TTS models, Hume's Octave family (and its successor Hume Octave 2), Cartesia Sonic, and the open-source Sesame CSM conversational model. Each system makes a different trade-off between latency, expressivity, control style, and licensing.
| Model | Vendor | Approximate launch | Languages | Expressive control method | Latency profile | Notable strength |
|---|---|---|---|---|---|---|
| Eleven v3 | ElevenLabs | Alpha June 2025; GA February 2026 | 70+ | Inline bracketed audio tags | Offline; not real-time | Dramatic delivery, voice cloning ecosystem |
| Eleven Flash v2.5 | ElevenLabs | 2024 | 32 | SSML-style break and emphasis | ~75 ms inference, ~150 ms TTFA | Conversational agents, low-latency calls |
| OpenAI TTS | OpenAI | 2023; Realtime API 2024 | English-first, multilingual via Realtime | Plain-English voice instructions in prompt | Interactive plus a Realtime API for sub-second responses | Instructable voice character via prose direction |
| Hume Octave 2 | Hume AI | October 2025 | 11+ | Plain-English emotional instructions plus voice conversion and phoneme editing | Under 200 ms latency | Emotional intelligence, voice conversion |
| Cartesia Sonic | Cartesia | 2024; Sonic-2 2025; Sonic-3 2026 | Multilingual, growing | Limited emotion controls focused on speed | Sub-100 ms TTFB; Sonic-3 ~40 ms TTFA | Real-time voice agents on State Space Models |
| Sesame CSM | Sesame AI Labs | March 2025 | English-first | Conversational context; open weights | Designed for conversational responsiveness | Open-weights companion-style voice model |
On raw expressivity for prepared content, Eleven v3 and Hume's Octave line trade blows: third-party listening tests have placed both at the top of the expressive-speech category, with Octave 2 favored for emotional nuance through a plain-English directive vocabulary and Eleven v3 favored for cinematic delivery and a deeper voice library.[44] For real-time voice agents, Cartesia Sonic-3 leads on latency with a time-to-first-audio around 40 milliseconds, and ElevenLabs steers customers toward Flash v2.5 rather than v3 for that workload. OpenAI's instructable voices offer flexible character control but operate on a smaller voice catalog and do not provide cloning, while Sesame CSM is the open-weights option for developers who want to host a conversational voice model themselves.
A recurring framing in 2026 coverage is that no single model dominates on both quality and latency at the same time. ElevenLabs splits this between Flash v2.5 and Eleven v3, Cartesia between Sonic-3 and a slower high-fidelity tier, and Hume through different Octave 2 modes. Producers route real-time turns through a fast model and pre-rendered content through v3 or an Octave 2 quality tier.
The expressive control vocabulary is the most visible philosophical difference among the three top expressive models. Eleven v3 uses inline bracketed tags that read like stage directions inside a screenplay; Hume Octave 2 uses plain-English directives passed as a separate field on the request; OpenAI's Realtime API accepts a short voice description prompt that biases the speaker style for an entire session. Reviewers in 2026 have argued that the tag approach in v3 is easier to version-control because the cues live inside the script, while the directive approach in Octave 2 is easier for casual users. Game studios and animation houses, where dialogue is iterated in writers' rooms, have leaned toward v3 for the script-embedded format; emotional interview tools and AI companion apps have leaned toward Octave 2.
The launch reception was loud and split. Coverage in the days after June 5, 2025 leaned positive on the model's expressive ceiling: outlets including Sifted, Geeky Gadgets, AIBase, and several developer-focused newsletters described v3 as a step change in synthetic voice acting, with multi-speaker dialogue and audio tags singled out as the features that came closest to performing rather than narrating.[45] In one widely circulated demo, a single generation handled a four-character animated scene complete with overlaps and reactions, which had previously required either heavy editing or a dedicated multi-speaker pipeline.
At the same time, three threads of criticism appeared quickly. The first concerned consistency: heavy use of audio tags can change voice character across takes, and some testers reported that v3 felt less controllable than Multilingual v2 when the goal was a steady, neutral narration. A working consensus among long-form audiobook producers settled on Multilingual v2 for most chapters and v3 for emotionally charged passages where its dramatic range pays off.[46] The second concerned content filters: users on the company's community forum complained that v3 declined to render some profanity and intense emotional content even in clearly artistic contexts, which they argued narrowed its usefulness for fiction and game dialogue.[47] The third concerned post-alpha pricing: the 80 percent discount during June 2025 made v3 a near drop-in upgrade, but several reviewers noted that the post-discount rate sat well above community open-weights options such as Dia from Nari Labs, which had improved quickly through 2025.[48]
Independent benchmark trackers placed Eleven v3 near the top of expressive text-to-speech leaderboards in late 2025 and early 2026. Artificial Analysis recorded v3 with an Elo score of 1196 in its speech model rankings after the general availability release, placing it second in its category, with first place taken by a real-time variant from a different vendor that traded off some expressive range for sub-second response.[49] Reviewers wrote that v3's combination of voice cloning, multilingual reach, and audio tags was hard to match end-to-end even where individual rivals beat it on one axis, and that the company's broader catalog (Multilingual v2 for stability, Flash and Turbo for latency, v3 for expression) gave producers a way to mix and match.[50]
General availability arrived in February 2026, paired with an expansion of ElevenLabs' publisher tools and the launch of a dedicated audiobook environment inside ElevenCreative.[51] By that point ElevenLabs had moved past $330 million in annualized revenue and reached an $11 billion valuation, and Staniszewski used the GA milestone to argue that voice was becoming a primary interface for AI products.[52]
By spring 2026 a handful of adoption signals had appeared. A March 2026 Sacesta industry survey of voice AI deployments reported that Eleven v3 was the most commonly cited model for prerecorded brand voice, audiobook narration, and animation dialogue, while Flash v2.5 led the live-agent category by a wide margin.[62] Several large North American publishers moved their default narration workflow into ElevenCreative Audiobooks during the same window, with v3 reserved for character voices.
Independent reviewers in 2026 began to read v3 as part of a broader product story rather than a standalone model launch: ElevenLabs has shifted from a model vendor to an audio platform, with v3 as the expressive layer, Flash v2.5 as the conversational layer, Eleven Music as the score layer, and Eleven Scribe as the speech-to-text layer feeding the same content graph.