Cartesia is a San Francisco-based AI company focused on real-time voice synthesis, speech recognition, and state space model (SSM) research. Founded in 2023 by researchers from Stanford University's AI Lab, the company builds text-to-speech and speech-to-text systems optimized for interactive applications where latency is the primary constraint: voice agents, customer service automation, accessibility tools, and live audio pipelines. Its flagship product, the Sonic model family, has among the lowest published time-to-first-audio figures in streaming TTS and powers millions of conversations per month across enterprise deployments.
The company sits at an unusual intersection: it is simultaneously a model research lab publishing work on SSM architectures, including the Mamba line, and a commercial voice platform serving developers at API scale. Sonic 3, released in October 2025, generates speech with 90ms model latency, supports 42 languages, and adds native laughter and fine-grained emotion controls. A parallel speech-to-text product called Ink launched in June 2025, completing Cartesia's ambition to own both sides of the voice pipeline. Total funding reached $191 million by late 2025, with Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA among the investors.
Cartesia traces its origins to the Stanford AI Lab, where a group of PhD students and their advisors spent years developing state-space model architectures as an alternative to the transformer. The key theoretical work arrived in two waves.
Albert Gu completed his Stanford PhD in 2023 with a thesis titled "Modeling sequences with structured state spaces." His work on the S4 (Structured State Space sequence) model and its successors established that linear recurrent architectures could match transformer performance on sequence modeling benchmarks while using fundamentally less memory and compute. In December 2023, Gu and Tri Dao (author of the FlashAttention work) co-authored the Mamba paper, which introduced a selective state space mechanism that matched or exceeded transformer performance on language modeling benchmarks while running at significantly lower cost. Mamba demonstrated linear scaling in sequence length versus the quadratic scaling of standard attention, a difference that matters substantially for streaming audio, where a 30-second clip at common codec rates represents thousands of tokens.
Karan Goel, a Stanford PhD student working with professor Christopher Ré, had spent years alongside Gu thinking about how SSMs could move from research benchmarks into production systems. Ré is a Stanford computer science professor known for co-founding Snorkel and SambaNova, along with two companies later acquired by Apple, and received a MacArthur Fellowship in 2015. He became part of the founding circle alongside Goel, Gu, Arjun Desai, and Brandon Yang.
Cartesia was formally incorporated in 2023. The founding thesis was direct: SSMs offered memory and latency characteristics that transformers could not easily replicate for real-time inference, and audio was the natural first application because conversational voice requires fast, continuous generation rather than batch processing. Generating the next frame of audio from an SSM requires updating a fixed-size hidden state rather than attending over the full context window, which means the time per generation step stays constant regardless of how long the conversation has been running.
Albert Gu joined Cartesia as Chief Scientist while also accepting an assistant professor position at Carnegie Mellon University. Karan Goel became CEO. Arjun Desai and Brandon Yang took engineering leadership roles. The company built its first product, Sonic, through 2023 and launched it publicly in 2024.
Cartesia raised a $27 million seed round led by Index Ventures, with participation from Lightspeed, General Catalyst, A* Capital, Factory, Conviction, SV Angel, and approximately 90 angel investors. The round was announced in 2024 alongside the first public release of the Sonic API.
In March 2025, the company announced a $64 million Series A led by Kleiner Perkins. Index Ventures, Lightspeed, A*, Factory, Greycroft, Dell Technologies Capital, and Samsung Ventures all participated. The announcement coincided with the launch of Sonic 2.0 and Sonic Turbo. Total capital raised at that point reached $91 million.
In October 2025, Cartesia disclosed a further $100 million round with Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA participating, announced simultaneously with the release of Sonic 3. NVIDIA's participation was notable given its strategic interest in inference-efficient model architectures. Total disclosed funding reached $191 million.
| Round | Amount | Lead investor | Date |
|---|---|---|---|
| Seed | $27M | Index Ventures | 2024 |
| Series A | $64M | Kleiner Perkins | March 2025 |
| Series B | $100M | Kleiner Perkins | October 2025 |
| Total | $191M | — | — |
The company's investors span both traditional venture firms (Index, Kleiner Perkins, Lightspeed) and strategic corporate investors (NVIDIA, Dell Technologies Capital, Samsung Ventures), reflecting the dual relevance of Cartesia's work to both the AI software market and the hardware supply chain. NVIDIA's investment is particularly relevant because SSM inference does not map as cleanly onto standard tensor core operations as transformer attention does, creating an incentive for NVIDIA to support research that optimizes SSM kernels for its GPU architectures.
The original Sonic model launched in 2024 as Cartesia's first commercial TTS product. The company described a model latency of 135ms, positioned as the lowest available for a model of its quality class at the time. The original Sonic supported English and a limited set of additional languages, and introduced Cartesia's streaming API design: rather than generating a complete audio file before returning it, the API begins streaming audio bytes within milliseconds of receiving input text. This design is a direct consequence of the SSM architecture, which can produce audio frames one at a time without needing to re-read the full prior context on each step.
Sonic 2.0 launched on March 6, 2025, alongside the Series A announcement. The update represented a substantial architectural rework. Despite being roughly twice the parameter count of the original Sonic, Sonic 2.0 ran faster. Cartesia attributed this to improvements in the underlying SSM architecture that reduced per-step compute while increasing model capacity. Model latency dropped to 90ms for the full Sonic 2.0 variant.
Sonic Turbo, a smaller and faster sibling released at the same time, achieves 40ms model latency, which was the lowest published figure from any major TTS provider at the time. Sonic Turbo is available only through Cartesia's first-party API, not through third-party GPU clouds or serving partners.
Both models at launch supported 15 languages: English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, and Turkish. Voice cloning required just three seconds of audio. The models also introduced two new API endpoints not present in the original Sonic: voice changing (applying a voice style to existing audio) and infill editing (replacing a segment of audio while preserving surrounding context).
In blind preference evaluations run by Cartesia, Sonic 2.0 was preferred over ElevenLabs Flash V2 by 61.4% of listeners versus 38.6%, and was preferred roughly 1.5 to 1 over its nearest competitor overall. Sonic 2.0 and Sonic Turbo are both priced at $46.70 per million characters.
Sonic 3 launched in October 2025 alongside Cartesia's $100 million funding round. The model brought several meaningful advances over Sonic 2.0.
Language support expanded from 15 to 42 languages, covering an estimated 95% of global economic activity. Languages added in Sonic 3 include Arabic, Bulgarian, Czech, Danish, Finnish, Greek, Hebrew, Hungarian, Indonesian, Romanian, Slovak, Ukrainian, Vietnamese, and others.
The most distinctive feature of Sonic 3 is built-in emotional expression. Developers can insert plain-text tags such as [laughter], or bracketed emotion tags of the same form, directly into input text, and the model generates non-verbal vocalizations at the appropriate point in the audio. According to Cartesia, no competing streaming TTS product offered comparable laughter and emotion generation at launch. The model handles excitement, sadness, hesitation, and similar states without requiring separate API calls or post-processing.
Latency specifications for Sonic 3 are 90ms model latency and 190ms end-to-end latency as measured from API call to the first audio byte arriving at the client over a typical network. The Turbo variant of Sonic 3 maintains the 40ms model latency of Sonic Turbo. Sonic 3 also added fine-grained volume and speed modulation through API parameters.
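To illustrate how these controls combine, the sketch below shows the shape a Sonic 3 request might take. The field names are assumptions for illustration rather than Cartesia's documented schema; the inline [laughter] tag follows the plain-text convention described above.

```python
# Hypothetical request payload for a Sonic 3 generation call. Field names
# ("model_id", "voice", "speed", "volume") are illustrative assumptions,
# not Cartesia's documented schema.
request = {
    "model_id": "sonic-3",                    # assumed model identifier
    "voice": {"id": "example-voice-id"},      # placeholder voice ID
    "transcript": (
        "That's the best news I've heard all week! [laughter] "
        "Let me pull up your account."
    ),
    "output_format": {"container": "raw", "encoding": "pcm_s16le",
                      "sample_rate": 24000},
    "speed": 1.0,    # fine-grained speed control (assumed neutral value)
    "volume": 1.0,   # fine-grained volume control (assumed neutral value)
}
```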
Retell AI integrated Sonic 3 into its platform at launch as a configuration switch, giving existing Retell customers access to the new model without any API migration work.
In June 2025, Cartesia released Ink, a family of streaming speech-to-text models designed to complement Sonic in full-stack voice pipelines. The debut model, Ink-Whisper, is an optimized variant of OpenAI's Whisper tuned specifically for low-latency transcription in conversational settings.
Ink-Whisper addresses the specific failure modes that matter in voice agent deployments: telephony audio artifacts, proper nouns and domain-specific terminology, background noise, disfluencies and silence, and accent variation. The model uses dynamic chunking to handle variable-length audio and interruptions without requiring fixed frame sizes.
Performance claims center on time-to-complete-transcript (TTCT), where Cartesia stated Ink-Whisper was the fastest streaming STT model it tested at launch. Pricing on the Scale plan is $0.13 per hour of audio, billed per second. Launch integrations covered Vapi, LiveKit, and Pipecat, with Voiceflow support added later.
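A small worked example of what per-second billing implies for typical call lengths (the durations below are arbitrary):

```python
# Per-second billing at the published Scale-plan rate of $0.13 per hour of audio.
# The call durations are made-up examples.
RATE_PER_HOUR = 0.13
RATE_PER_SECOND = RATE_PER_HOUR / 3600        # ~ $0.000036 per second

for seconds in (45, 270, 1800):               # 45 s, 4.5 min, 30 min calls
    cost = seconds * RATE_PER_SECOND
    print(f"{seconds} s of audio -> ${cost:.4f}")
# -> roughly $0.0016, $0.0098, and $0.0650 respectively
```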
With Ink handling the speech-to-text step and Sonic handling text-to-speech, Cartesia's Line platform (described below) can offer a fully Cartesia-owned voice pipeline rather than routing through external ASR providers.
Line is Cartesia's voice agent development and deployment platform, announced in 2025. Where Sonic and Ink are low-level model APIs, Line is a higher-level platform for building, iterating, and operating voice agents without assembling the infrastructure stack from scratch.
The platform is code-first: agents are written as code using the Line SDK, developed locally, and deployed with a single command. From a text prompt or template, a developer can have a deployed agent running in minutes.
Key Line features include turn detection, conversation state management, LLM integration, and coordination of parallel background tasks.
Voximplant, a cloud communications platform, announced support for Cartesia Line agents in February 2026, enabling Line-built voice agents to operate on actual telephony infrastructure for inbound and outbound phone calls.
Cartesia's technical differentiation rests on applying SSMs to audio generation in place of the transformer-based architectures used by most competitors.
The core computational distinction concerns how each architecture handles prior context. A transformer with standard self-attention processes all previous tokens in parallel, with memory and compute scaling as O(n²) in sequence length. An SSM maintains a fixed-size hidden state updated recurrently, with compute scaling as O(n). For audio specifically, this difference is significant: a 30-second clip sampled at typical codec rates represents thousands of tokens, and a multi-turn conversation may span hundreds of thousands.
Beyond raw compute efficiency, SSMs have a favorable property for streaming generation. Because they operate recurrently, the model can generate the next audio frame without re-reading the entire prior context. The computation required per output step is constant regardless of conversation length. This is what enables the sub-50ms time-to-first-audio figures that Sonic Turbo achieves. Transformer-based TTS systems must typically complete a full forward pass before generating the first token, adding latency proportional to input length.
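A minimal sketch of this property, using arbitrary toy matrices that have nothing to do with Sonic's actual architecture: the recurrent update touches only a fixed-size state, while an attention step must scan a key/value cache that grows with every prior token.

```python
import numpy as np

# Toy illustration of per-step cost; dimensions and matrices are arbitrary.
rng = np.random.default_rng(0)
d_state, d_in, d_out = 256, 64, 64

A = rng.normal(0, 0.01, (d_state, d_state))   # state transition
B = rng.normal(0, 0.01, (d_state, d_in))      # input projection
C = rng.normal(0, 0.01, (d_out, d_state))     # output projection

h = np.zeros(d_state)                          # fixed-size hidden state

def ssm_step(x):
    """One SSM generation step: cost and memory are constant no matter
    how many steps have already been generated."""
    global h
    h = A @ h + B @ x                          # update fixed-size state
    return C @ h                               # features for the next audio frame

def attention_step(q, keys, values):
    """One attention step: the key/value cache grows with every prior token,
    so per-step cost and memory scale with context length."""
    scores = keys @ q / np.sqrt(q.shape[0])    # O(context_length) work
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values
```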
Cartesia built what it describes as a multi-stream SSM architecture for audio: separate state representations for different data streams, one for text conditioning and one for audio generation, connected through a conditioning mechanism. This allows the model to condition audio output on text input in real time without the two streams interfering with one another's recurrent state.
The architecture also benefits from constant memory consumption during inference. Unlike a transformer KV cache that grows with context length, an SSM's hidden state stays the same size regardless of how long the conversation has run. This matters for infrastructure cost at scale: serving many simultaneous long-running voice calls does not require allocating proportionally larger memory per call.
Cartesia's researchers acknowledge a trade-off: SSMs trained naively can struggle with certain forms of long-range recall compared to attention, because information older than the state's capacity can be lost. The Mamba selective state mechanism addressed part of this by allowing the model to learn which input information to retain versus discard, but it remains an area of ongoing research.
The 40ms time-to-first-audio (TTFA) figure associated with Sonic Turbo is one of the lowest published latency specifications in commercial TTS. To place it in context:
Human conversational response latency begins to feel delayed at roughly 200 to 300ms from when someone finishes speaking. At 40ms model latency, the TTS component of a voice agent pipeline contributes minimal perceptible delay. The bottleneck shifts to other steps: speech recognition, LLM inference, and network transit.
| Provider | Model | TTFA (model) | Notes |
|---|---|---|---|
| Cartesia | Sonic Turbo | 40ms | First-party API only |
| Cartesia | Sonic 3 | 90ms | 190ms end-to-end |
| ElevenLabs | Flash v2.5 | 75ms | Published spec |
| Deepgram | Aura | ~250ms | End-to-end target |
| OpenAI | TTS-1 | 300ms+ | Non-streaming mode |
Vapi, after integrating with all major TTS providers, reported that Cartesia was the only provider achieving consistently sub-200ms end-to-end latency across all languages, which was the stated reason for making Cartesia the default provider in its voice agent platform.
Cartesia measures latency at two distinct points. Model latency is the time between receiving input text and producing the first audio byte internally. End-to-end latency includes the network round trip and is what a developer's application actually observes. The 40ms figure is model latency; end-to-end latency depends on network conditions but Cartesia cites 190ms under typical conditions for Sonic 3.
Cartesia's API is REST-based and supports both synchronous (single response) and streaming modes. The streaming interface uses WebSockets or server-sent events to return audio chunks as they are generated. This is how the sub-200ms end-to-end figures are achieved: the application begins playing audio before the full response has been generated.
The API accepts plain text input along with parameters for voice ID, speed, volume, emotion tags (for Sonic 3), and output audio format. Supported output formats include raw PCM, MP3, and Opus. Separate endpoints handle voice cloning (instant and professional), voice changing, and infill editing.
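A sketch of how a client might measure end-to-end time-to-first-audio over the streaming interface follows. The endpoint URL, message format, and field names here are assumptions for illustration, not Cartesia's documented protocol.

```python
# Measuring end-to-end time-to-first-audio (TTFA) over a streaming WebSocket
# connection. URL and message shape are placeholders; consult the provider's
# API reference for the real protocol.
import asyncio
import json
import time

import websockets  # third-party package: `pip install websockets`

ENDPOINT = "wss://api.example.com/tts/websocket"  # placeholder endpoint

async def measure_ttfa(text: str, voice_id: str) -> float:
    async with websockets.connect(ENDPOINT) as ws:
        start = time.perf_counter()
        await ws.send(json.dumps({
            "model_id": "sonic-3",         # assumed field names
            "voice": {"id": voice_id},
            "transcript": text,
            "output_format": {"container": "raw", "encoding": "pcm_s16le",
                              "sample_rate": 24000},
        }))
        first_chunk = await ws.recv()       # first audio bytes arrive here
        ttfa = time.perf_counter() - start
        # remaining chunks would be drained and handed to the audio player
        return ttfa

# ttfa = asyncio.run(measure_ttfa("Hello there!", "example-voice-id"))
# print(f"end-to-end TTFA: {ttfa * 1000:.0f} ms")
```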
Pricing uses a credit system. For standard TTS, 1 credit equals 1 character of input text. Professional voice clone generation costs 1.5 credits per character. STT via Ink is billed per second of audio.
| Plan | Monthly cost | Credits included | Voice cloning |
|---|---|---|---|
| Free | $0 | Limited | No |
| Startup | $49 | 1.25M credits | Instant cloning |
| Growth | $99+ | Variable | Instant + Pro cloning |
| Scale | $239/mo (billed annually) | $299 usage credit pool | Full access |
| Enterprise | Custom | Custom | Custom |
At the Startup tier, 1.25M credits per month is sufficient to generate roughly 15 to 20 hours of speech depending on text verbosity. Training a professional voice clone consumes 1M credits as a one-time cost.
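The arithmetic behind those estimates, under an assumed speech density of roughly 1,200 characters per minute of generated audio (the actual yield depends on the text and voice settings):

```python
# Rough credit arithmetic for the Startup tier. CHARS_PER_MINUTE is an
# assumption for illustration, not a published figure.
MONTHLY_CREDITS = 1_250_000       # Startup tier allocation
CREDITS_PER_CHAR = 1              # standard TTS: 1 credit per character
CHARS_PER_MINUTE = 1_200          # assumed speech density

chars = MONTHLY_CREDITS / CREDITS_PER_CHAR
hours = chars / CHARS_PER_MINUTE / 60
print(f"~{hours:.0f} hours of standard TTS per month")        # ~17 hours

# Training one professional voice clone is a one-time 1,000,000-credit cost,
# and generation with it costs 1.5 credits per character.
pvc_hours = (MONTHLY_CREDITS - 1_000_000) / 1.5 / CHARS_PER_MINUTE / 60
print(f"~{pvc_hours:.1f} hours of PVC speech left that month") # ~2.3 hours
```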
For enterprise customers, Cartesia also offers dedicated regional deployments with PCI-compliant configurations. These environments include data isolation, encryption at rest and in transit, and audit logging for compliance-sensitive workloads in healthcare, financial services, and legal applications. Regional deployments are available in North America and Europe.
Cartesia offers two voice cloning tiers with meaningfully different use cases.
Instant voice cloning (IVC) requires as little as three seconds of audio. The resulting clone is available immediately through the API and captures the speaker's accent, timbre, and vocal character. IVC is available on Startup tier and above. The brief audio requirement is itself a product of the SSM architecture: the model can infer a speaker's vocal characteristics from a short sample without needing the extended fine-tuning that transformer-based cloning often requires.
Professional voice cloning (PVC) involves a fine-tuning process that takes longer to complete but produces higher fidelity. PVC is designed for brand voice applications where a company needs consistent, reliable representation of a specific speaker or a custom created voice character. Training a PVC costs 1M credits; generating speech with it costs 1.5 credits per character. PVC is available on Growth tier and above.
Both cloning methods work across all languages Sonic supports. A voice cloned from an English speaker can generate French, Japanese, or Arabic output while preserving the original speaker's vocal identity as closely as possible. This language-portable cloning is used in localization workflows where a single recorded voice needs to cover multiple markets.
Voice cloning is also accessible through Cartesia's web playground for prototyping without writing code.
Cartesia's terms of service require users to have rights to the voice being cloned and prohibit use of cloned voices for fraud, impersonation, or other deceptive purposes. The company does not publish details of technical measures used to detect misuse.
Cartesia runs a significant portion of its GPU inference workload on Together AI's infrastructure rather than operating its own data centers. Together AI provides GPU clusters with NVLink intra-node connectivity, GPU-direct RDMA over InfiniBand for inter-node operations, and WekaFS storage configured for the random-read I/O profile typical of audio training workloads.
The arrangement gives Cartesia deep cluster access to run a custom inference engine optimized specifically for SSM architectures, rather than relying on generic serving stacks designed for transformer inference. Together AI's case study reports that the partnership enables Cartesia to achieve under 200ms end-to-end latency with 2x faster performance relative to other providers, at half the infrastructure cost.
The Sonic model has been served on Together AI's clusters in production since its launch, handling millions of audio minutes daily. Together AI also offers Cartesia Sonic 2.0 and Sonic 3 as hosted model endpoints for enterprise customers who prefer to route through Together AI's compliance and billing infrastructure.
Cartesia's go-to-market relies primarily on developer-led adoption through its API, playground, and documentation, with ecosystem integrations into voice agent platforms as the primary scaling mechanism.
Vapi is a voice agent orchestration platform that handles turn detection, LLM routing, and TTS provider connections for developers building phone-based AI agents. After evaluating all major TTS providers, Vapi selected Cartesia as its default provider and embedded Cartesia in its homepage demo. Vapi cited consistent sub-200ms end-to-end latency across all supported languages as the deciding factor. Ink-Whisper was subsequently added as an STT option within Vapi as well.
Retell AI is a competing voice agent platform that also integrated Cartesia as a first-class option. Retell users can switch to Sonic 3 through a configuration change without any API migration. The integration includes all Sonic 3 capabilities: 42 languages, custom pronunciation dictionaries, speed and volume controls, and emotion tags.
Together AI operates as both an infrastructure partner (described above) and a distribution channel. Enterprises that use Together AI's model serving platform can access Sonic 2.0 and Sonic 3 through Together AI's APIs and billing rather than directly through Cartesia.
| Customer or platform | Category |
|---|---|
| ServiceNow | Enterprise software |
| Cresta | Contact center AI |
| Decagon | Customer support AI |
| Quora | Consumer technology |
| Thoughtly | GTM voice agents |
| Yelp | Reviews and local search |
| DoorDash | Food delivery |
| LiveKit | Real-time audio/video infrastructure |
| Pipecat | Voice agent framework |
| Voiceflow | Conversational AI builder |
| Voximplant | Cloud communications |
The company reported more than 50,000 API customers and millions of conversations per month processed across its infrastructure as of late 2025, with enterprise clients including organizations from financial services, healthcare, and technology sectors.
ElevenLabs and Cartesia are the two companies most frequently compared in developer discussions of voice AI APIs. They have distinct positioning.
ElevenLabs was founded in 2022 and has built its reputation around voice quality and a large library of preset voices, reaching a valuation above $3 billion by early 2025. Its strengths are breadth: 70+ languages, 5,000+ voices, a dubbing product, and Conversational AI endpoints. ElevenLabs' standard models are transformer-based, which enables certain forms of expressiveness but limits how low latency can go.
Cartesia's strengths concentrate in latency and the specific feature set for real-time voice agents. Its SSM architecture produces consistently lower model latency than transformer alternatives, and its streaming API design is built from the ground up for conversational use cases.
| Feature | Cartesia Sonic 3 | ElevenLabs |
|---|---|---|
| Lowest model latency | 40ms (Turbo) | 75ms (Flash v2.5) |
| End-to-end latency | ~190ms | ~200ms+ |
| Language count | 42 | 70+ |
| Laughter / emotion tags | Yes | Limited |
| Preset voice library | 450+ | 5,000+ |
| Voice cloning (min audio) | 3 seconds | ~1 minute |
| On-device deployment | Yes (Edge library) | No |
| Built-in STT product | Yes (Ink) | Yes (Scribe) |
| Voice agent platform | Yes (Line) | Yes (Conversational AI) |
| Primary positioning | Real-time voice agents | Content creation, dubbing |
| Relative pricing | Lower | Higher |
Blind preference tests run by Cartesia showed Sonic 2.0 preferred over ElevenLabs Flash V2 by 61.4% to 38.6% of listeners in head-to-head evaluations. ElevenLabs' higher-tier models produce more expressive voice quality in non-real-time contexts, and ElevenLabs is generally the market preference for content creation where millisecond-level latency does not matter. For voice agents, where every 50ms of latency is perceptible to the caller, Cartesia's architecture advantages are more significant.
In 2025, Cartesia open-sourced Edge, a library for running SSM models directly on device hardware without sending audio or text to cloud infrastructure. The initial target is Apple M-series chips. The Edge library is designed to run Sonic models locally in real time, using the SSM architecture's constant memory footprint to stay within the memory constraints of consumer hardware.
On-device TTS eliminates the network round trip that typically accounts for 100 to 150ms of the end-to-end latency figure in cloud deployments. For applications where the device is reliably close to the model (such as a smartphone running a local voice assistant), on-device execution can push end-to-end latency below 100ms.
On-device deployment also addresses data privacy requirements. For healthcare, legal, financial services, and enterprise security applications, keeping voice data on-device rather than transmitting it to cloud APIs removes a category of compliance exposure entirely. No audio leaves the device, and there is no dependency on network availability.
The Llamba model family, released in February 2025, extends on-device capability to language modeling. Llamba-1B, Llamba-3B, and Llamba-8B are SSM language models distilled from the Llama 3 series. The distillation approach produces models that run with SSM latency characteristics while retaining much of the knowledge from Llama's training data. At 1B to 8B parameters, these models are sized to run on consumer and mobile hardware.
Beyond Sonic and Ink, Cartesia has published SSM research that advances the state of the field.
The Mamba-3B-SlimPJ post demonstrated SSMs matching the best transformer architectures at the 3B parameter scale on language modeling benchmarks, an important proof point for the thesis that SSMs are not limited to specialized audio tasks.
Llamba (February 2025) showed that distilling transformer knowledge into SSM architectures works at scale. Llamba models run faster than their Llama teacher models while retaining most downstream task performance, and they are designed to be deployable on consumer hardware.
Mamba-3 was published at ICLR 2026 in collaboration with researchers at Carnegie Mellon University, Princeton University, and Together AI. The paper introduced three architectural improvements: exponential-trapezoidal discretization, complex-valued state updates, and a Multi-Input Multi-Output (MIMO) recurrence formulation. At the 1.5B parameter scale, Mamba-3 achieved 1.8 percentage points of average downstream accuracy improvement over Mamba-2 while using states half the size. The MIMO variant contributed 1.2 points of that improvement by boosting accuracy without increasing decoding latency.
The primary commercial application for Cartesia's technology is automated voice agents: AI systems handling inbound and outbound phone calls for appointment scheduling, customer service triage, sales qualification, and similar tasks. Companies like Vapi, Retell, Thoughtly, and others build the orchestration layer; Cartesia provides the TTS and STT components.
Latency is the variable that determines whether these interactions feel natural or mechanical. A voice pipeline includes ASR, LLM inference, and TTS in sequence. Minimizing each step's contribution changes whether the caller perceives they are talking to a person or waiting for a system to process their input. Cartesia's position is that reducing TTS latency from 300ms to 40ms removes enough of the gap that the remaining delay from LLM inference becomes the primary perceived bottleneck, not audio generation.
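A back-of-envelope budget makes the argument concrete. The ASR, LLM, and network figures below are illustrative assumptions; only the TTS numbers come from the specifications quoted earlier.

```python
# Back-of-envelope latency budget for one voice-agent turn. ASR, LLM, and
# network values are illustrative assumptions, not measured figures.
def turn_latency(tts_ms: float, asr_ms: float = 150,
                 llm_first_token_ms: float = 350, network_ms: float = 100) -> float:
    """Time from end of caller speech to the first audio byte played back."""
    return asr_ms + llm_first_token_ms + tts_ms + network_ms

for label, tts in (("batch TTS", 300), ("Sonic 3", 90), ("Sonic Turbo", 40)):
    print(f"{label}: ~{turn_latency(tts):.0f} ms to first audio")
# -> roughly 900 ms, 690 ms, and 640 ms respectively; once TTS drops to tens
#    of milliseconds, LLM inference dominates the remaining delay
```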
Cartesia's voice cloning and AI voiceover product serve content creators who need consistent narration across large volumes of output: audiobooks, explainer videos, e-learning courses, and podcast-format content. The Sonic 3 emotion controls allow more expressive delivery than models producing flat, neutral speech, which matters when narration needs to hold a listener's attention for extended periods.
Screenreader and real-time document reading applications benefit from low-latency TTS. Cartesia's API is fast enough to drive applications where audio output must track real-time text generation closely, such as read-aloud features for users with visual impairments or reading disabilities. The brevity of the cloning sample requirement also makes it practical for individual users to create personalized voices from their own speech.
With 42-language support in Sonic 3 and voice cloning that preserves vocal identity across languages, Cartesia is used in localization workflows where content originally recorded in one language needs to be converted to another while retaining the speaker's voice characteristics. A corporate training video recorded by an executive in English can be localized to Spanish, French, or Japanese with the same voice.
Several constraints are relevant for teams evaluating Cartesia.
Language support at 42 languages in Sonic 3 is broad but narrower than ElevenLabs at 70+. Lower-resource languages in Southeast Asia, sub-Saharan Africa, and the Middle East are often not covered. Teams with requirements for these languages may need to supplement Cartesia with other providers.
The 500-character input limit on Sonic Turbo constrains use cases involving continuous long-form passages at maximum speed. The standard Sonic 3 does not carry this limit but runs at 90ms rather than 40ms model latency.
Professional voice cloning costs 1M credits to create, which at the Startup tier ($49/month for 1.25M credits) consumes most of a month's credit allocation for a single voice. Teams needing multiple high-fidelity branded voices face meaningful upfront costs.
Sonic Turbo, the model that achieves the 40ms TTFA figure used in most comparisons, is only available through Cartesia's direct API. Third-party platforms that resell or proxy Cartesia serve Sonic 2.0 or Sonic 3 at higher latency.
On-device deployment via the Edge library, while available, is in early development as of 2025 with Apple M-series as the primary supported target. Windows, Android, and embedded device targets are not yet fully supported.
Cartesia's voice library at 450+ voices is smaller than ElevenLabs' 5,000+. Teams looking for a large catalog of diverse preset voices without doing custom cloning may find the selection more limited.