Rime (company)
Last reviewed
Jun 4, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 2,110 words
Rime (also styled Rime Labs, and reachable at rime.ai) is an American artificial intelligence company that builds text-to-speech and spoken-language models tuned specifically for business voice agents: phone-ordering systems, interactive voice response (IVR) trees, contact centers, and other high-volume calling applications. Founded in 2022 and based in San Francisco, the company is led by chief executive Lily Clifford, a former Stanford computational-linguistics PhD student, and approaches speech synthesis from a sociolinguistics angle. Rather than the polished "radio voice" common to earlier TTS systems, Rime trains on a proprietary dataset of spontaneous, full-duplex conversation recorded from everyday speakers, aiming for voices that breathe, hesitate, use filler words, laugh, and code-switch the way real people do. Its models, including Mist, Arcana, and Coda, are marketed on a combination of naturalness, low latency, deterministic pronunciation control, and the ability to run on-premises. By early 2026 the company said its technology powered on the order of 100 million phone conversations per month.
Rime was founded in 2022 in San Francisco. (Some profiles and bios date the founding to 2023, reflecting the gap between early experimentation and incorporation; the company's own materials and its seed-round announcement give 2022.) The three co-founders are:
| Co-founder | Role | Background |
|---|---|---|
| Lily Clifford | CEO | Linguistics graduate of Pitzer College; PhD student in computational linguistics at Stanford, specializing in sociophonetics (how social and demographic factors shape speech), before leaving the program to work on speech synthesis full-time |
| Brooke Larson | Co-founder | PhD linguist; previously worked on Amazon Alexa |
| Ares Geovanos | Co-founder | Stanford-trained engineer and product builder |
Clifford has described the company's origin as a reaction against TTS systems trained on audiobooks and scripted, self-conscious "studio" speech. Drawing on her sociolinguistics training, she argued that compelling, human-like synthesis required capturing unselfconscious, conversational speech instead. To build that foundation, the founders set up an in-house recording studio in San Francisco's Mid-Market neighborhood and assembled what the company describes as a very large proprietary dataset of full-duplex, speech-to-speech interactions, including interruptions, backchannel responses ("mm-hmm"), laughter, breathing, and disfluencies. That dataset, annotated in-house by multilingual PhD linguists to a claimed 98 to 100 percent transcription accuracy, became the basis for Rime's models.
On May 29, 2025, Rime announced a $5.5 million seed round led by Unusual Ventures, with participation from Founders You Should Know, Cadenza, and a group of angel investors that included Aaron King, Alex Levin, Rebecca Greene, Michael Akilian, Maran Nelson, Nick Arner, Molly Mielke, Arnaud Schenk, Coyne Lloyd, Sarah Veit Wallis, Mike Heller, Zhenya Loginov, and Monica Black. The financing was first reported by Axios. Coverage at the time put the company's total funding raised at more than $8 million across two rounds, implying an earlier pre-seed or angel financing before the 2025 seed.
| Round | Date | Amount | Lead investor |
|---|---|---|---|
| Seed | May 29, 2025 | $5.5 million | Unusual Ventures |
| Total raised (reported) | as of mid-2025 | ~$8.6 million over 2 rounds | n/a |
The company said the seed money would go toward expanding its team and continuing to build out its technology to serve a growing enterprise customer base. As of early 2025 Rime described itself as a lean team of roughly nine to ten people.
(As of this writing no Series A round has been publicly announced. Claims of a later or larger round should be treated with caution until confirmed by the company or a primary report.)
Rime ships speech-synthesis models through a developer API and through voice-AI platform partners. Its lineup has evolved quickly, with two model families, the speed-oriented Mist line and the expressive Arcana line, later joined by a flagship model called Coda. A common production pattern is to use more than one model in the same call: an expressive model for open-ended greetings and responses, and a deterministic model for reading out product names, customer IDs, and spelled email addresses.
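The mixed-model pattern described above can be sketched as a small routing step that runs before synthesis. This is a minimal illustration, not Rime's API: the model labels, the `route` function, and the regular expressions are all hypothetical, and a production system would use a more careful detector for content that must be read out exactly.

```python
import re
from dataclasses import dataclass

# Hypothetical labels for illustration; not actual Rime model identifiers.
EXPRESSIVE_MODEL = "expressive"
DETERMINISTIC_MODEL = "deterministic"

# Crude detectors for content that must be pronounced exactly.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ID_LIKE = re.compile(r"\b\d{4,}\b|\b[A-Z]{2,}-?\d{3,}\b")


@dataclass
class Segment:
    text: str
    model: str


def route(sentences):
    """Assign each sentence to a model class based on its content."""
    segments = []
    for sentence in sentences:
        needs_exact_reading = EMAIL.search(sentence) or ID_LIKE.search(sentence)
        model = DETERMINISTIC_MODEL if needs_exact_reading else EXPRESSIVE_MODEL
        segments.append(Segment(sentence, model))
    return segments


if __name__ == "__main__":
    turn = [
        "Hi, thanks for calling! How can I help today?",
        "Your order number is AB-49213.",
        "I have your email as j.doe@example.com.",
    ]
    for seg in route(turn):
        print(f"{seg.model:13s} {seg.text}")
```

Routing at the sentence level keeps greetings and open-ended replies on the expressive path while anything containing an identifier or email address goes to the deterministic one.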
Mist is Rime's original low-latency model, built for high-volume, business-critical speech. Mist v2, announced March 6, 2025, was positioned as one of the fastest conversational TTS models available, with model latency as low as about 70 milliseconds when deployed on-premises and time-to-first-audio in the low hundreds of milliseconds in the cloud (the company and partners have cited figures around 225 ms p50 on dedicated cloud endpoints). Its headline feature is deterministic pronunciation control: a developer can define once, through the API, how a particular word or term should sound, and that pronunciation then holds consistently across every voice, flow, and channel. Mist v2 supports English and Spanish. A later iteration, Mist v3, was framed as enterprise-scale TTS. Mist is a non-autoregressive model, which contributes to its speed and determinism.
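The idea of defining a pronunciation once and having it hold everywhere can be illustrated with a local lookup table applied to text before it is sent to any voice. This is only a sketch of the concept: `apply_lexicon`, the lexicon format, and the respellings are assumptions made for illustration and do not represent Rime's API or its phonetic notation.

```python
import re


def apply_lexicon(text, lexicon):
    """Replace each known term with its fixed respelling.

    The same input always yields the same output, so every voice
    receives an identical pronunciation hint for a given term.
    """
    if not lexicon:
        return text
    # Longest terms first so multiword entries win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lowered = {term.lower(): spoken for term, spoken in lexicon.items()}
    return pattern.sub(lambda m: lowered[m.group(0).lower()], text)


if __name__ == "__main__":
    # Hypothetical respellings, chosen only to show the mechanism.
    lexicon = {"Nguyen": "win", "gyro": "yee-roh"}
    print(apply_lexicon("One gyro for Mr. Nguyen, and a Gyro to go.", lexicon))
```

Because the substitution is a pure function of the text and the lexicon, the result does not vary between calls, which is the property the deterministic-pronunciation feature is marketed on.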
Arcana, introduced May 9, 2025, is Rime's expressive "spoken language model." Architecturally it is a multimodal, autoregressive model that generates discrete audio tokens from text: it pairs a pre-trained large language model backbone with a high-resolution audio codec and autoregressively decodes flattened codec representations from coarse to fine. It was trained in three stages: pre-training on a large corpus of text-audio pairs, supervised fine-tuning on Rime's proprietary conversational dataset, and speaker-specific fine-tuning for its flagship voices. Distinctive capabilities include generating an effectively unlimited number of novel voices from a short text description or a fictional name (for example, "a young woman working in tech"); inserting paralinguistic elements such as laughter and sighs via inline tokens like <laugh>; rendering whispering and sarcasm; and code-switching seamlessly between languages within a single utterance. At launch Rime cited a time-to-first-token of around 200 ms and public-cloud latency of around 300 ms. It pointedly declined to publish first-party comparison benchmarks, framing skepticism of such charts as part of its pitch.
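One way to read "flattened codec representations from coarse to fine" is that each audio frame carries several codebook indices, ordered from coarsest to finest, and the frames are laid end to end into a single token stream that a language model predicts one token at a time. The sketch below illustrates that reading only. The per-frame interleaving, the function names, and the toy `next_token` callback are assumptions; Arcana's actual token layout is not described in the text above.

```python
def flatten_codes(frames):
    """Turn per-frame code lists (coarse first) into one token sequence."""
    return [code for frame in frames for code in frame]


def unflatten_codes(tokens, num_codebooks):
    """Regroup a flat token sequence into frames of num_codebooks codes."""
    if len(tokens) % num_codebooks != 0:
        raise ValueError("token count is not a multiple of num_codebooks")
    return [
        tokens[i:i + num_codebooks]
        for i in range(0, len(tokens), num_codebooks)
    ]


def decode(next_token, num_frames, num_codebooks):
    """Autoregressive loop: each new token is predicted from the prefix.

    next_token stands in for a model; it receives the tokens generated
    so far and returns the next one.
    """
    tokens = []
    for _ in range(num_frames * num_codebooks):
        tokens.append(next_token(tokens))
    return unflatten_codes(tokens, num_codebooks)


if __name__ == "__main__":
    # Toy stand-in for a model: the next token is just the prefix length.
    frames = decode(lambda prefix: len(prefix), num_frames=2, num_codebooks=3)
    print(frames)
```

In this layout the coarse code of a frame is always generated before its finer codes, so the finer predictions can condition on it.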
Arcana v2 expanded the model to more languages and added on-prem deployment, advertising 300+ total voices with 35 flagship voices spanning English (including UK, Australian, and Southern US accents), Spanish, natively bilingual English/Spanish, French, and German. Arcana v3, announced February 4, 2026, cut on-prem model latency to roughly 120 ms time-to-first-byte (about 200 ms via the cloud API), pushed concurrency past 100 simultaneous generations per machine, added word-level timestamps, and extended language coverage to ten languages: English, Hindi, Spanish, Arabic, French, Portuguese, German, Japanese, Hebrew, and Tamil.
Coda, launched alongside the general availability of Rime's on-prem offering in late 2025, became the company's recommended flagship model for production deployments. Rime describes Coda as combining enterprise-grade speed and concurrency with highly expressive, real-human-sounding voices, built on an LLM backbone paired with a dedicated speech inference engine. It offers on the order of 180 unique human voices, supports brand voice cloning, runs in the cloud or on-prem, and emphasizes stronger multilingual coverage.
In April 2025 Rime open-sourced Rimecaster, a speaker-representation model that converts voice samples into dense vector embeddings capturing speaker-specific characteristics. Built on NVIDIA's TitaNet architecture and trained on Rime's full-duplex, multilingual conversational data, the company says it produces embeddings roughly four times denser than baseline approaches. Released under a CC-BY-4.0 license on Hugging Face and compatible with NVIDIA NeMo, it had been downloaded thousands of times shortly after release. Rime positions it as infrastructure to help the broader community build better voice-AI models rather than as a revenue product.
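Speaker embeddings of this kind are typically compared with cosine similarity: vectors from the same speaker should point in nearly the same direction. The sketch below shows only that comparison step on hand-written toy vectors. It does not load Rimecaster, and the `same_speaker` helper and its 0.7 threshold are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Hypothetical decision rule: similar enough means same speaker."""
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real speaker embeddings are much larger.
    clip_1 = [0.9, 0.1, 0.2]
    clip_2 = [0.8, 0.2, 0.1]
    clip_3 = [0.0, 1.0, 0.0]
    print(same_speaker(clip_1, clip_2))
    print(same_speaker(clip_1, clip_3))
```

In practice the threshold would be tuned on labeled pairs of recordings rather than fixed in advance.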
Rime models can be consumed through its own cloud API or deployed in a customer's virtual private cloud or fully on-premises on the customer's own GPUs. Rime On-Prem reached general availability on November 6, 2025, with the company advertising up to roughly 2x higher concurrency for Arcana on NVIDIA H100 GPUs. On-prem and VPC options, together with SOC 2 Type II and HIPAA compliance (and PCI compliance cited by partners), are aimed at regulated buyers in healthcare, financial services, and telecom that cannot send audio to a third-party cloud. Rime's models are also distributed through voice-AI and telephony platforms, including Together AI, SignalWire, Telnyx, and others, making them selectable as a TTS engine inside those stacks.
Rime's central thesis is that the bottleneck for business voice AI is data, not model size. Where many TTS systems learn from audiobooks, podcasts, or other curated read-speech, Rime trains on full-duplex recordings of real, spontaneous conversation with "everyday people," and labels that audio for sociolinguistically meaningful detail: accent and dialect, allophonic variation, prosodic stress, and both subconscious and rhetorical use of filler words, pauses, laughter, and other paralinguistics. The stated goal is speech that sounds like a real person on a phone call rather than a voice actor in a booth, which the company argues matters commercially because callers respond differently to natural-sounding agents. Rime has said it trained its models on the Weights & Biases platform. Its work sits at the intersection of generative AI, speech recognition and synthesis, conversational AI, and the broader move toward production AI agents that handle live customer interactions.
Rime's models are used heavily in restaurant phone ordering, healthcare back-office automation, telecom support, agent training, and enterprise customer support. The company has said it serves more than 20 large enterprise customers and, by its own and partners' accounts, grew from powering tens of millions of phone conversations per month in early 2025 to on the order of 100 million per month by late 2025, having roughly doubled its customer base over that period. One frequently cited deployment runs through ConverseNow, a restaurant voice-AI vendor that Rime says powers roughly 80 percent of Wingstop and Domino's phone orders in North America. Rime and its distribution partner Together AI have reported customer outcomes including a 15 percent lift in sales at a national restaurant chain, a 75 percent reduction in call abandonment at a telecom provider, and a 10 percent increase in call success rates, though these figures come from the company and should be read as vendor-reported results.
Rime competes in the fast-growing market for real-time, conversational TTS aimed at voice agents and contact centers. Its most prominent rivals are ElevenLabs, Cartesia, Play.ht (PlayAI), and Deepgram (Aura), along with the speech offerings of large cloud providers such as Google, Microsoft Azure, and Amazon (Polly) and labs like OpenAI. In third-party comparisons, Cartesia's Sonic and Deepgram's Aura are typically cited as the latency leaders, ElevenLabs as the broadest and most natural multilingual voice library, and Play.ht as among the widest in language coverage. Rime differentiates on three axes: training data drawn from real, full-duplex conversation rather than read-speech; deterministic, developer-controllable pronunciation aimed at production reliability; and flexible deployment including true on-prem for regulated, high-volume callers. Observers also note that, as a seed-stage company, Rime is less heavily capitalized than ElevenLabs and some other rivals, which it counters by focusing narrowly on the business voice-agent use case rather than the broad creator and media market.