Cartesia is a San Francisco-based AI company focused on real-time voice synthesis, speech recognition, and state space model (SSM) research. Founded in 2023 by researchers from Stanford University's AI Lab, the company builds text-to-speech and speech-to-text systems optimized for interactive applications where latency is the primary constraint: voice agents, customer service automation, accessibility tools, and live audio pipelines. Its flagship product, the Sonic model family, has among the lowest published time-to-first-audio figures in streaming TTS and powers millions of conversations per month across enterprise deployments.
The company sits at an unusual intersection: it is simultaneously a model research lab publishing work on SSM architectures, including the Mamba line, and a commercial voice platform serving developers at API scale. Sonic 3, released in October 2025, generates speech with 90ms model latency, supports 42 languages, and adds native laughter and fine-grained emotion controls. A parallel speech-to-text product called Ink launched in June 2025, completing Cartesia's ambition to own both sides of the voice pipeline. Total funding reached $191 million by late 2025, with Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA among the investors.
Cartesia traces its origins to the Stanford AI Lab, where a group of PhD students and their advisors spent years developing state-space model architectures as an alternative to the transformer. The key theoretical work arrived in two waves.
Albert Gu completed his Stanford PhD in 2023 with a thesis titled "Modeling sequences with structured state spaces." His work on the S4 (Structured State Space sequence) model and its successors established that linear recurrent architectures could match transformer performance on sequence modeling benchmarks while using fundamentally less memory and compute. In December 2023, Gu and Tri Dao (author of the FlashAttention work) co-authored the Mamba paper, which introduced a selective state space mechanism that matched or exceeded transformer performance on language modeling benchmarks while running at significantly lower cost. Mamba demonstrated linear scaling in sequence length versus the quadratic scaling of standard attention, a difference that matters substantially for streaming audio, where a 30-second clip at common codec rates represents thousands of tokens.
Karan Goel, a Stanford PhD student working with professor Christopher Ré, had spent years alongside Gu thinking about how SSMs could move from research benchmarks into production systems. Ré is a Stanford computer science professor known for co-founding Snorkel and SambaNova, along with two companies later acquired by Apple, and received a MacArthur Fellowship in 2015. He became part of the founding circle alongside Goel, Gu, Arjun Desai, and Brandon Yang.
Cartesia was formally incorporated in 2023. The founding thesis was direct: SSMs offered memory and latency characteristics that transformers could not easily replicate for real-time inference, and audio was the natural first application because conversational voice requires fast, continuous generation rather than batch processing. Generating the next frame of audio from an SSM requires updating a fixed-size hidden state rather than attending over the full context window, which means the time per generation step stays constant regardless of how long the conversation has been running.
Albert Gu joined Cartesia as Chief Scientist while also accepting an assistant professor position at Carnegie Mellon University. Karan Goel became CEO. Arjun Desai and Brandon Yang took engineering leadership roles. The company built its first product, Sonic, through 2023 and launched it publicly in 2024.
Cartesia raised a $27 million seed round led by Index Ventures, with participation from Lightspeed, General Catalyst, A* Capital, Factory, Conviction, SV Angel, and approximately 90 angel investors. The round was announced in 2024 alongside the first public release of the Sonic API.
In March 2025, the company announced a $64 million Series A led by Kleiner Perkins. Index Ventures, Lightspeed, A*, Factory, Greycroft, Dell Technologies Capital, and Samsung Ventures all participated. The announcement coincided with the launch of Sonic 2.0 and Sonic Turbo. Total capital raised at that point reached $91 million.
In October 2025, Cartesia disclosed a further $100 million round with Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA participating, announced simultaneously with the release of Sonic 3. NVIDIA's participation was notable given its strategic interest in inference-efficient model architectures. Total disclosed funding reached $191 million.
| Round | Amount | Lead investor | Date |
|---|---|---|---|
| Seed | $27M | Index Ventures | 2024 |
| Series A | $64M | Kleiner Perkins | March 2025 |
| Series B | $100M | Kleiner Perkins | October 2025 |
| Total | $191M | — | — |
The company's investors span both traditional venture firms (Index, Kleiner Perkins, Lightspeed) and strategic corporate investors (NVIDIA, Dell Technologies Capital, Samsung Ventures), reflecting the dual relevance of Cartesia's work to both the AI software market and the hardware supply chain. NVIDIA's investment is particularly relevant because SSM inference does not map as cleanly onto standard tensor core operations as transformer attention does, creating an incentive for NVIDIA to support research that optimizes SSM kernels for its GPU architectures.
The original Sonic model launched in 2024 as Cartesia's first commercial TTS product. The company described a model latency of 135ms, positioned as the lowest available for a model of its quality class at the time. The original Sonic supported English and a limited set of additional languages, and introduced Cartesia's streaming API design: rather than generating a complete audio file before returning it, the API begins streaming audio bytes within milliseconds of receiving input text. This design is a direct consequence of the SSM architecture, which can produce audio frames one at a time without needing to re-read the full prior context on each step.
Sonic 2.0 launched on March 6, 2025, alongside the Series A announcement. The update represented a substantial architectural rework. Despite being roughly twice the parameter count of the original Sonic, Sonic 2.0 ran faster. Cartesia attributed this to improvements in the underlying SSM architecture that reduced per-step compute while increasing model capacity. Model latency dropped to 90ms for the full Sonic 2.0 variant.
Sonic Turbo, a smaller and faster sibling released at the same time, achieves 40ms model latency, which was the lowest published figure from any major TTS provider at the time. Sonic Turbo is available only through Cartesia's first-party API, not through third-party GPU clouds or serving partners.
Both models at launch supported 15 languages: English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, and Turkish. Voice cloning required just three seconds of audio. The models also introduced two new API endpoints not present in the original Sonic: voice changing (applying a voice style to existing audio) and infill editing (replacing a segment of audio while preserving surrounding context).
In blind preference evaluations run by Cartesia, Sonic 2.0 was preferred over ElevenLabs Flash V2 by 61.4% of listeners versus 38.6%, and was preferred roughly 1.5 to 1 over its nearest competitor overall. Sonic 2.0 and Sonic Turbo are both priced at $46.70 per million characters.
Sonic 3 launched in October 2025 alongside Cartesia's $100 million funding round. The model brought several meaningful advances over Sonic 2.0.
Language support expanded from 15 to 42 languages, covering an estimated 95% of global economic activity. Languages added in Sonic 3 include Arabic, Bulgarian, Czech, Danish, Finnish, Greek, Hebrew, Hungarian, Indonesian, Romanian, Slovak, Ukrainian, Vietnamese, and others.
The most distinctive feature of Sonic 3 is built-in emotional expression. Developers can insert plain-text tags such as [laughter], or bracketed emotion tags of the same form, directly into input text, and the model generates non-verbal vocalizations at the appropriate point in the audio. According to Cartesia, no competing streaming TTS product offered comparable laughter and emotion generation at launch. The model handles excitement, sadness, hesitation, and similar states without requiring separate API calls or post-processing.
Latency specifications for Sonic 3 are 90ms model latency and 190ms end-to-end latency as measured from API call to the first audio byte arriving at the client over a typical network. The Turbo variant of Sonic 3 maintains the 40ms model latency of Sonic Turbo. Sonic 3 also added fine-grained volume and speed modulation through API parameters.
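To illustrate how these controls combine, the sketch below shows the shape a Sonic 3 request might take. The field names are assumptions for illustration rather than Cartesia's documented schema; the inline [laughter] tag follows the plain-text convention described above.

```python
# Hypothetical request payload for a Sonic 3 generation call. Field names
# ("model_id", "voice", "speed", "volume") are illustrative assumptions,
# not Cartesia's documented schema.
request = {
    "model_id": "sonic-3",                    # assumed model identifier
    "voice": {"id": "example-voice-id"},      # placeholder voice ID
    "transcript": (
        "That's the best news I've heard all week! [laughter] "
        "Let me pull up your account."
    ),
    "output_format": {"container": "raw", "encoding": "pcm_s16le",
                      "sample_rate": 24000},
    "speed": 1.0,    # fine-grained speed control (assumed neutral value)
    "volume": 1.0,   # fine-grained volume control (assumed neutral value)
}
```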
Retell AI integrated Sonic 3 into its platform at launch as a configuration switch, giving existing Retell customers access to the new model without any API migration work.
In June 2025, Cartesia released Ink, a family of streaming speech-to-text models designed to complement Sonic in full-stack voice pipelines. The debut model, Ink-Whisper, is an optimized variant of OpenAI's Whisper tuned specifically for low-latency transcription in conversational settings.
Ink-Whisper addresses the specific failure modes that matter in voice agent deployments: telephony audio artifacts, proper nouns and domain-specific terminology, background noise, disfluencies and silence, and accent variation. The model uses dynamic chunking to handle variable-length audio and interruptions without requiring fixed frame sizes.
Performance claims center on time-to-complete-transcript (TTCT), where Cartesia stated Ink-Whisper was the fastest streaming STT model it tested at launch. Pricing on the Scale plan is $0.13 per hour of audio, billed per second. Launch integrations covered Vapi, LiveKit, and Pipecat, with Voiceflow support added later.
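A small worked example of what per-second billing implies for typical call lengths (the durations below are arbitrary):

```python
# Per-second billing at the published Scale-plan rate of $0.13 per hour of audio.
# The call durations are made-up examples.
RATE_PER_HOUR = 0.13
RATE_PER_SECOND = RATE_PER_HOUR / 3600        # ~ $0.000036 per second

for seconds in (45, 270, 1800):               # 45 s, 4.5 min, 30 min calls
    cost = seconds * RATE_PER_SECOND
    print(f"{seconds} s of audio -> ${cost:.4f}")
# -> roughly $0.0016, $0.0098, and $0.0650 respectively
```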
With Ink handling the speech-to-text step and Sonic handling text-to-speech, Cartesia's Line platform (described below) can offer a fully Cartesia-owned voice pipeline rather than routing through external ASR providers.
Line is Cartesia's voice agent development and deployment platform, announced in 2025. Where Sonic and Ink are low-level model APIs, Line is a higher-level platform for building, iterating, and operating voice agents without assembling the infrastructure stack from scratch.
The platform is code-first: agents are written as code using the Line SDK, developed locally, and deployed with a single command. From a text prompt or template, a developer can have a deployed agent running in minutes.
Key Line features include turn detection, conversation state management, LLM integration, and coordination of parallel background tasks.
Voximplant, a cloud communications platform, announced support for Cartesia Line agents in February 2026, enabling Line-built voice agents to operate on actual telephony infrastructure for inbound and outbound phone calls.
Cartesia's technical differentiation rests on applying SSMs to audio generation in place of the transformer-based architectures used by most competitors.
The core computational distinction concerns how each architecture handles prior context. A transformer with standard self-attention processes all previous tokens in parallel, with memory and compute scaling as O(n²) in sequence length. An SSM maintains a fixed-size hidden state updated recurrently, with compute scaling as O(n). For audio specifically, this difference is significant: a 30-second clip sampled at typical codec rates represents thousands of tokens, and a multi-turn conversation may span hundreds of thousands.
Beyond raw compute efficiency, SSMs have a favorable property for streaming generation. Because they operate recurrently, the model can generate the next audio frame without re-reading the entire prior context. The computation required per output step is constant regardless of conversation length. This is what enables the sub-50ms time-to-first-audio figures that Sonic Turbo achieves. Transformer-based TTS systems must typically complete a full forward pass before generating the first token, adding latency proportional to input length.
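A minimal sketch of this property, using arbitrary toy matrices that have nothing to do with Sonic's actual architecture: the recurrent update touches only a fixed-size state, while an attention step must scan a key/value cache that grows with every prior token.

```python
import numpy as np

# Toy illustration of per-step cost; dimensions and matrices are arbitrary.
rng = np.random.default_rng(0)
d_state, d_in, d_out = 256, 64, 64

A = rng.normal(0, 0.01, (d_state, d_state))   # state transition
B = rng.normal(0, 0.01, (d_state, d_in))      # input projection
C = rng.normal(0, 0.01, (d_out, d_state))     # output projection

h = np.zeros(d_state)                          # fixed-size hidden state

def ssm_step(x):
    """One SSM generation step: cost and memory are constant no matter
    how many steps have already been generated."""
    global h
    h = A @ h + B @ x                          # update fixed-size state
    return C @ h                               # features for the next audio frame

def attention_step(q, keys, values):
    """One attention step: the key/value cache grows with every prior token,
    so per-step cost and memory scale with context length."""
    scores = keys @ q / np.sqrt(q.shape[0])    # O(context_length) work
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values
```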
Cartesia built what it describes as a multi-stream SSM architecture for audio: separate state representations for different data streams, one for text conditioning and one for audio generation, connected through a conditioning mechanism. This allows the model to condition audio output on text input in real time without the two streams interfering with one another's recurrent state.
The architecture also benefits from constant memory consumption during inference. Unlike a transformer KV cache that grows with context length, an SSM's hidden state stays the same size regardless of how long the conversation has run. This matters for infrastructure cost at scale: serving many simultaneous long-running voice calls does not require allocating proportionally larger memory per call.
Cartesia's researchers acknowledge a trade-off: SSMs trained naively can struggle with certain forms of long-range recall compared to attention, because information older than the state's capacity can be lost. The Mamba selective state mechanism addressed part of this by allowing the model to learn which input information to retain versus discard, but it remains an area of ongoing research.
The 40ms time-to-first-audio (TTFA) figure associated with Sonic Turbo is one of the lowest published latency specifications in commercial TTS. To place it in context:
Human conversational response latency begins to feel delayed at roughly 200 to 300ms from when someone finishes speaking. At 40ms model latency, the TTS component of a voice agent pipeline contributes minimal perceptible delay. The bottleneck shifts to other steps: speech recognition, LLM inference, and network transit.
| Provider | Model | TTFA (model) | Notes |
|---|---|---|---|
| Cartesia | Sonic Turbo | 40ms | First-party API only |
| Cartesia | Sonic 3 | 90ms | 190ms end-to-end |
| ElevenLabs | Flash v2.5 | 75ms | Published spec |
| Deepgram | Aura | ~250ms | End-to-end target |
| OpenAI | TTS-1 | 300ms+ | Non-streaming mode |
Vapi, after integrating with all major TTS providers, reported that Cartesia was the only provider achieving consistently sub-200ms end-to-end latency across all languages, which was the stated reason for making Cartesia the default provider in its voice agent platform.
Cartesia measures latency at two distinct points. Model latency is the time between receiving input text and producing the first audio byte internally. End-to-end latency includes the network round trip and is what a developer's application actually observes. The 40ms figure is model latency; end-to-end latency depends on network conditions but Cartesia cites 190ms under typical conditions for Sonic 3.
Cartesia's API is REST-based and supports both synchronous (single response) and streaming modes. The streaming interface uses WebSockets or server-sent events to return audio chunks as they are generated. This is how the sub-200ms end-to-end figures are achieved: the application begins playing audio before the full response has been generated.
The API accepts plain text input along with parameters for voice ID, speed, volume, emotion tags (for Sonic 3), and output audio format. Supported output formats include raw PCM, MP3, and Opus. Separate endpoints handle voice cloning (instant and professional), voice changing, and infill editing.
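A sketch of how a client might measure end-to-end time-to-first-audio over the streaming interface follows. The endpoint URL, message format, and field names here are assumptions for illustration, not Cartesia's documented protocol.

```python
# Measuring end-to-end time-to-first-audio (TTFA) over a streaming WebSocket
# connection. URL and message shape are placeholders; consult the provider's
# API reference for the real protocol.
import asyncio
import json
import time

import websockets  # third-party package: `pip install websockets`

ENDPOINT = "wss://api.example.com/tts/websocket"  # placeholder endpoint

async def measure_ttfa(text: str, voice_id: str) -> float:
    async with websockets.connect(ENDPOINT) as ws:
        start = time.perf_counter()
        await ws.send(json.dumps({
            "model_id": "sonic-3",         # assumed field names
            "voice": {"id": voice_id},
            "transcript": text,
            "output_format": {"container": "raw", "encoding": "pcm_s16le",
                              "sample_rate": 24000},
        }))
        first_chunk = await ws.recv()       # first audio bytes arrive here
        ttfa = time.perf_counter() - start
        # remaining chunks would be drained and handed to the audio player
        return ttfa

# ttfa = asyncio.run(measure_ttfa("Hello there!", "example-voice-id"))
# print(f"end-to-end TTFA: {ttfa * 1000:.0f} ms")
```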
Pricing uses a credit system. For standard TTS, 1 credit equals 1 character of input text. Professional voice clone generation costs 1.5 credits per character. STT via Ink is billed per second of audio.
| Plan | Monthly cost | Credits included | Voice cloning |
|---|---|---|---|
| Free | $0 | Limited | No |
| Startup | $49 | 1.25M credits | Instant cloning |
| Growth | $99+ | Variable | Instant + Pro cloning |
| Scale | $239/mo (billed annually) | $299 usage credit pool | Full access |
| Enterprise | Custom | Custom | Custom |
At the Startup tier, 1.25M credits per month is sufficient to generate roughly 15 to 20 hours of speech depending on text verbosity. Training a professional voice clone consumes 1M credits as a one-time cost.
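The arithmetic behind those estimates, under an assumed speech density of roughly 1,200 characters per minute of generated audio (the actual yield depends on the text and voice settings):

```python
# Rough credit arithmetic for the Startup tier. CHARS_PER_MINUTE is an
# assumption for illustration, not a published figure.
MONTHLY_CREDITS = 1_250_000       # Startup tier allocation
CREDITS_PER_CHAR = 1              # standard TTS: 1 credit per character
CHARS_PER_MINUTE = 1_200          # assumed speech density

chars = MONTHLY_CREDITS / CREDITS_PER_CHAR
hours = chars / CHARS_PER_MINUTE / 60
print(f"~{hours:.0f} hours of standard TTS per month")        # ~17 hours

# Training one professional voice clone is a one-time 1,000,000-credit cost,
# and generation with it costs 1.5 credits per character.
pvc_hours = (MONTHLY_CREDITS - 1_000_000) / 1.5 / CHARS_PER_MINUTE / 60
print(f"~{pvc_hours:.1f} hours of PVC speech left that month") # ~2.3 hours
```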
For enterprise customers, Cartesia also offers dedicated regional deployments with PCI-compliant configurations. These environments include data isolation, encryption at rest and in transit, and audit logging for compliance-sensitive workloads in healthcare, financial services, and legal applications. Regional deployments are available in North America and Europe.
Cartesia offers two voice cloning tiers with meaningfully different use cases.
Instant voice cloning (IVC) requires as little as three seconds of audio. The resulting clone is available immediately through the API and captures the speaker's accent, timbre, and vocal character. IVC is available on Startup tier and above. The brief audio requirement is itself a product of the SSM architecture: the model can infer a speaker's vocal characteristics from a short sample without needing the extended fine-tuning that transformer-based cloning often requires.
Professional voice cloning (PVC) involves a fine-tuning process that takes longer to complete but produces higher fidelity. PVC is designed for brand voice applications where a company needs consistent, reliable representation of a specific speaker or a custom created voice character. Training a PVC costs 1M credits; generating speech with it costs 1.5 credits per character. PVC is available on Growth tier and above.
Both cloning methods work across all languages Sonic supports. A voice cloned from an English speaker can generate French, Japanese, or Arabic output while preserving the original speaker's vocal identity as closely as possible. This language-portable cloning is used in localization workflows where a single recorded voice needs to cover multiple markets.
Voice cloning is also accessible through Cartesia's web playground for prototyping without writing code.
Cartesia's terms of service require users to have rights to the voice being cloned and prohibit use of cloned voices for fraud, impersonation, or other deceptive purposes. The company does not publish details of technical measures used to detect misuse.
Cartesia runs a significant portion of its GPU inference workload on Together AI's infrastructure rather than operating its own data centers. Together AI provides GPU clusters with NVLink intra-node connectivity, GPU-direct RDMA over InfiniBand for inter-node operations, and WekaFS storage configured for the random-read I/O profile typical of audio training workloads.
The arrangement gives Cartesia deep cluster access to run a custom inference engine optimized specifically for SSM architectures, rather than relying on generic serving stacks designed for transformer inference. Together AI's case study reports that the partnership enables Cartesia to achieve under 200ms end-to-end latency with 2x faster performance relative to other providers, at half the infrastructure cost.
The Sonic model has been served on Together AI's clusters in production since its launch, handling millions of audio minutes daily. Together AI also offers Cartesia Sonic 2.0 and Sonic 3 as hosted model endpoints for enterprise customers who prefer to route through Together AI's compliance and billing infrastructure.
Cartesia's go-to-market relies primarily on developer-led adoption through its API, playground, and documentation, with ecosystem integrations into voice agent platforms as the primary scaling mechanism.
Vapi is a voice agent orchestration platform that handles turn detection, LLM routing, and TTS provider connections for developers building phone-based AI agents. After evaluating all major TTS providers, Vapi selected Cartesia as its default provider and embedded Cartesia in its homepage demo. Vapi cited consistent sub-200ms end-to-end latency across all supported languages as the deciding factor. Ink-Whisper was subsequently added as an STT option within Vapi as well.
Retell AI is a competing voice agent platform that also integrated Cartesia as a first-class option. Retell users can switch to Sonic 3 through a configuration change without any API migration. The integration includes all Sonic 3 capabilities: 42 languages, custom pronunciation dictionaries, speed and volume controls, and emotion tags.
Together AI operates as both an infrastructure partner (described above) and a distribution channel. Enterprises that use Together AI's model serving platform can access Sonic 2.0 and Sonic 3 through Together AI's APIs and billing rather than directly through Cartesia.
| Customer or platform | Category |
|---|---|
| ServiceNow | Enterprise software |
| Cresta | Contact center AI |
| Decagon | Customer support AI |
| Quora | Consumer technology |
| Thoughtly | GTM voice agents |
| Yelp | Reviews and local search |
| DoorDash | Food delivery |
| LiveKit | Real-time audio/video infrastructure |
| Pipecat | Voice agent framework |
| Voiceflow | Conversational AI builder |
| Voximplant | Cloud communications |
The company reported more than 50,000 API customers and millions of conversations per month processed across its infrastructure as of late 2025, with enterprise clients including organizations from financial services, healthcare, and technology sectors.
ElevenLabs and Cartesia are the two companies most frequently compared in developer discussions of voice AI APIs. They have distinct positioning.
ElevenLabs was founded in 2022 and has built its reputation around voice quality and a large library of preset voices, reaching a valuation above $3 billion by early 2025. Its strengths are breadth: 70+ languages, 5,000+ voices, a dubbing product, and Conversational AI endpoints. ElevenLabs' standard models are transformer-based, which enables certain forms of expressiveness but limits how low latency can go.
Cartesia's strengths concentrate in latency and the specific feature set for real-time voice agents. Its SSM architecture produces consistently lower model latency than transformer alternatives, and its streaming API design is built from the ground up for conversational use cases.
| Feature | Cartesia Sonic 3 | ElevenLabs |
|---|---|---|
| Lowest model latency | 40ms (Turbo) | 75ms (Flash v2.5) |
| End-to-end latency | ~190ms | ~200ms+ |
| Language count | 42 | 70+ |
| Laughter / emotion tags | Yes | Limited |
| Preset voice library | 450+ | 5,000+ |
| Voice cloning (min audio) | 3 seconds | ~1 minute |
| On-device deployment | Yes (Edge library) | No |
| Built-in STT product | Yes (Ink) | Yes (Scribe) |
| Voice agent platform | Yes (Line) | Yes (Conversational AI) |
| Primary positioning | Real-time voice agents | Content creation, dubbing |
| Relative pricing | Lower | Higher |
Blind preference tests run by Cartesia showed Sonic 2.0 preferred over ElevenLabs Flash V2 by 61.4% to 38.6% of listeners in head-to-head evaluations. ElevenLabs' higher-tier models produce more expressive voice quality in non-real-time contexts, and ElevenLabs is generally the market preference for content creation where millisecond-level latency does not matter. For voice agents, where every 50ms of latency is perceptible to the caller, Cartesia's architecture advantages are more significant.
In 2025, Cartesia open-sourced Edge, a library for running SSM models directly on device hardware without sending audio or text to cloud infrastructure. The initial target is Apple M-series chips. The Edge library is designed to run Sonic models locally in real time, using the SSM architecture's constant memory footprint to stay within the memory constraints of consumer hardware.
On-device TTS eliminates the network round trip that typically accounts for 100 to 150ms of the end-to-end latency figure in cloud deployments. For applications where the device is reliably close to the model (such as a smartphone running a local voice assistant), on-device execution can push end-to-end latency below 100ms.
On-device deployment also addresses data privacy requirements. For healthcare, legal, financial services, and enterprise security applications, keeping voice data on-device rather than transmitting it to cloud APIs removes a category of compliance exposure entirely. No audio leaves the device, and there is no dependency on network availability.
The Llamba model family, released in February 2025, extends on-device capability to language modeling. Llamba-1B, Llamba-3B, and Llamba-8B are SSM language models distilled from the Llama 3 series. The distillation approach produces models that run with SSM latency characteristics while retaining much of the knowledge from Llama's training data. At 1B to 8B parameters, these models are sized to run on consumer and mobile hardware.
Beyond Sonic and Ink, Cartesia has published SSM research that advances the state of the field.
The Mamba-3B-SlimPJ post demonstrated SSMs matching the best transformer architectures at the 3B parameter scale on language modeling benchmarks, an important proof point for the thesis that SSMs are not limited to specialized audio tasks.
Llamba (February 2025) showed that distilling transformer knowledge into SSM architectures works at scale. Llamba models run faster than their Llama teacher models while retaining most downstream task performance, and they are designed to be deployable on consumer hardware.
Mamba-3 was published at ICLR 2026 in collaboration with researchers at Carnegie Mellon University, Princeton University, and Together AI. The paper introduced three architectural improvements: exponential-trapezoidal discretization, complex-valued state updates, and a Multi-Input Multi-Output (MIMO) recurrence formulation. At the 1.5B parameter scale, Mamba-3 achieved 1.8 percentage points of average downstream accuracy improvement over Mamba-2 while using states half the size. The MIMO variant contributed 1.2 points of that improvement by boosting accuracy without increasing decoding latency.
The primary commercial application for Cartesia's technology is automated voice agents: AI systems handling inbound and outbound phone calls for appointment scheduling, customer service triage, sales qualification, and similar tasks. Companies like Vapi, Retell, Thoughtly, and others build the orchestration layer; Cartesia provides the TTS and STT components.
Latency is the variable that determines whether these interactions feel natural or mechanical. A voice pipeline includes ASR, LLM inference, and TTS in sequence. Minimizing each step's contribution changes whether the caller perceives they are talking to a person or waiting for a system to process their input. Cartesia's position is that reducing TTS latency from 300ms to 40ms removes enough of the gap that the remaining delay from LLM inference becomes the primary perceived bottleneck, not audio generation.
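A back-of-envelope budget makes the argument concrete. The ASR, LLM, and network figures below are illustrative assumptions; only the TTS numbers come from the specifications quoted earlier.

```python
# Back-of-envelope latency budget for one voice-agent turn. ASR, LLM, and
# network values are illustrative assumptions, not measured figures.
def turn_latency(tts_ms: float, asr_ms: float = 150,
                 llm_first_token_ms: float = 350, network_ms: float = 100) -> float:
    """Time from end of caller speech to the first audio byte played back."""
    return asr_ms + llm_first_token_ms + tts_ms + network_ms

for label, tts in (("batch TTS", 300), ("Sonic 3", 90), ("Sonic Turbo", 40)):
    print(f"{label}: ~{turn_latency(tts):.0f} ms to first audio")
# -> roughly 900 ms, 690 ms, and 640 ms respectively; once TTS drops to tens
#    of milliseconds, LLM inference dominates the remaining delay
```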
Cartesia's voice cloning and AI voiceover product serve content creators who need consistent narration across large volumes of output: audiobooks, explainer videos, e-learning courses, and podcast-format content. The Sonic 3 emotion controls allow more expressive delivery than models producing flat, neutral speech, which matters when narration needs to hold a listener's attention for extended periods.
Screenreader and real-time document reading applications benefit from low-latency TTS. Cartesia's API is fast enough to drive applications where audio output must track real-time text generation closely, such as read-aloud features for users with visual impairments or reading disabilities. The brevity of the cloning sample requirement also makes it practical for individual users to create personalized voices from their own speech.
With 42-language support in Sonic 3 and voice cloning that preserves vocal identity across languages, Cartesia is used in localization workflows where content originally recorded in one language needs to be converted to another while retaining the speaker's voice characteristics. A corporate training video recorded by an executive in English can be localized to Spanish, French, or Japanese with the same voice.
Several constraints are relevant for teams evaluating Cartesia.
Language support at 42 languages in Sonic 3 is broad but narrower than ElevenLabs at 70+. Lower-resource languages in Southeast Asia, sub-Saharan Africa, and the Middle East are often not covered. Teams with requirements for these languages may need to supplement Cartesia with other providers.
The 500-character input limit on Sonic Turbo constrains use cases involving continuous long-form passages at maximum speed. The standard Sonic 3 does not carry this limit but runs at 90ms rather than 40ms model latency.
Professional voice cloning costs 1M credits to create, which at the Startup tier ($49/month for 1.25M credits) consumes most of a month's credit allocation for a single voice. Teams needing multiple high-fidelity branded voices face meaningful upfront costs.
Sonic Turbo, the model that achieves the 40ms TTFA figure used in most comparisons, is only available through Cartesia's direct API. Third-party platforms that resell or proxy Cartesia serve Sonic 2.0 or Sonic 3 at higher latency.
On-device deployment via the Edge library, while available, is in early development as of 2025 with Apple M-series as the primary supported target. Windows, Android, and embedded device targets are not yet fully supported.
Cartesia's voice library at 450+ voices is smaller than ElevenLabs' 5,000+. Teams looking for a large catalog of diverse preset voices without doing custom cloning may find the selection more limited.