Sesame (formally Sesame AI Labs) is a San Francisco-based artificial intelligence company founded in June 2023, best known for developing the Conversational Speech Model (CSM) and the Maya and Miles voice personas that went viral in early 2025. The company, co-founded by Oculus VR co-founder Brendan Iribe and former Discord AI lead Ankit Kumar, focuses on what it calls "voice presence": the quality of spoken AI interaction that feels genuinely alive rather than synthesized. Sesame emerged from stealth in February 2025 with a web demo that attracted over one million users in its first few weeks, generated more than five million minutes of conversation, and prompted widespread claims that it had crossed the uncanny valley of voice AI. The company has raised over $307 million in total funding, including a $250 million Series B announced in October 2025 co-led by Sequoia Capital and Spark Capital.
Brendan Iribe is best known for co-founding Oculus VR in 2012, alongside Palmer Luckey, Nate Mitchell, and Michael Antonov. Oculus, which built the Rift virtual reality headset, was acquired by Facebook (now Meta) for approximately $2 billion in 2014. Iribe served as CEO of Oculus through the early years of its integration into Facebook before departing in 2018. The experience of building a hardware computing platform from scratch, navigating the acquisition, and watching the company scale from startup to a major division of one of the world's largest technology companies gave Iribe a particular perspective on what it took to create a new computing interface that people would actually use every day.
After leaving Oculus, Iribe spent several years as an angel investor. One of those investments was in Ubiquity6, a startup building augmented reality tools for spatial computing. Ubiquity6 was co-founded by Ankit Kumar, who had been working in the AR and shared-experience space since at least 2017. Ubiquity6 was acquired by Discord in 2021. Kumar stayed on through the acquisition, first as CTO of the Ubiquity6 team within Discord and then as AI engineering lead for Discord's Clyde chatbot, giving him substantial hands-on experience training conversational language and speech models at production scale. It was during the Clyde work that Kumar developed deep expertise in the specific challenge of making speech models feel natural in real-time conversation.
Iribe and Kumar had kept in contact through the Ubiquity6 investment, and by mid-2023 they had aligned around a shared conviction: the next significant computing interface was not a screen but an ear. The most transformative thing an AI could do was to hold a conversation that felt real. They co-founded Sesame in June 2023. Ryan Brown, who had worked at Oculus from 2013 to 2019 and then at Meta Reality Labs as a director of Research Engineering, joined as the third founding member the same month.
The company operated in stealth for roughly twenty months, hiring researchers and engineers and beginning the large-scale data collection and training work that would eventually produce CSM.
Sesame raised a $10.1 million seed round in September 2023 from undisclosed investors. This capital funded the early team build-out and the first phases of training data collection and model development.
The Series A came quickly after Sesame's public debut. On February 27, 2025, the same day the company published its research blog post and opened its interactive demo to the public, Sesame closed a $47.5 million Series A. Andreessen Horowitz led the round, with Spark Capital, Matrix Partners, and BIG Ventures also participating. Anjney Midha, a general partner at a16z who had been following the company since stealth, championed the investment inside the firm. The a16z announcement described Sesame as pursuing "an ambitious and important vision: to create a voice AI that crosses the uncanny valley."
In April 2025, Bloomberg reported that Sesame was in discussions for a significantly larger round, with Sequoia Capital and Spark Capital reportedly eyeing a $200 million investment at a valuation above $1 billion. The formal Series B closed in October 2025 at $250 million, co-led by Sequoia and Spark. Total funding raised reached $307.6 million.
Sequoia's investment note, authored in connection with the Series B announcement, described voice as "the next great shift, where voice becomes a primary interface to AI" and positioned Sesame as the company best placed to make that shift happen. The note cited the Maya and Miles demo as the first real evidence that AI voices could create genuine emotional presence rather than functional utility.
| Round | Amount | Date | Lead investor(s) |
|---|---|---|---|
| Seed | $10.1 million | September 2023 | Undisclosed |
| Series A | $47.5 million | February 2025 | Andreessen Horowitz |
| Series B | $250 million | October 2025 | Sequoia Capital, Spark Capital |
Alongside the three co-founders, Sesame assembled a leadership team with strong roots in the Oculus and Meta Reality Labs lineage.
Nate Mitchell, the fourth Oculus co-founder, joined as Chief Product Officer in June 2025. Mitchell had been one of the key product architects of the Oculus Rift and had shaped the consumer experience of early VR hardware. His arrival at Sesame completed an unusual reunion: three of Oculus's four co-founders were now working together again on the company's smart glasses hardware program.
Hans Hartmann, who had served as COO at Oculus and had held executive roles at Fitbit following its acquisition by Google, joined as Chief Operating Officer, bringing manufacturing and supply chain experience that Sesame would need when scaling hardware production.
Angela Gayles, a longtime Facebook and Meta executive, joined in a senior leadership role. Johan Schalkwyk, previously Meta's Voice Lead at its Super Intelligence Lab, joined as ML Lead in 2024, bringing applied research experience in exactly the domain Sesame was building in.
The Conversational Speech Model is Sesame's core technical contribution. The company describes it as an end-to-end multimodal speech generation system that treats conversation as a single integrated learning task rather than a pipeline of discrete stages.
The standard architecture for voice AI before CSM was a three-step cascade. A speech recognition model transcribes the user's words to text. A large language model reads that text, reasons about a response, and produces output text. A text-to-speech synthesis model converts that text to audio. Each stage introduces latency. More importantly, each stage loses information: the acoustic character of the speaker's voice, the emotional tone they were using, the pace and rhythm of their speech, the hesitations that signal uncertainty. By the time the LLM sees the user's words, the voice is gone. By the time the TTS model produces output, it is working from clean text with no memory of how the conversation has sounded.
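A minimal sketch of that cascade (hypothetical stub functions standing in for real ASR, LLM, and TTS services, not any particular vendor's API) makes the information loss concrete: the only thing that survives each hand-off is a plain string.

```python
# Hypothetical sketch of the classic three-stage voice pipeline.
# The stubs stand in for real ASR, LLM, and TTS components; the point
# is the text-only interfaces between stages, not the implementations.

def transcribe(user_audio: bytes) -> str:
    """ASR: audio in, text out. Tone, pace, and hesitation are discarded here."""
    return "could you say that again, please"

def respond(transcript: str) -> str:
    """LLM: sees only the transcript, never how it was spoken."""
    return "Sure, happy to repeat that."

def synthesize(reply_text: str) -> bytes:
    """TTS: sees only clean reply text, with no memory of how the conversation sounded."""
    return b"\x00" * 48_000  # placeholder PCM bytes

def pipeline_turn(user_audio: bytes) -> bytes:
    transcript = transcribe(user_audio)   # acoustic detail lost
    reply_text = respond(transcript)      # reasoning over text only
    return synthesize(reply_text)         # audio regenerated from scratch

if __name__ == "__main__":
    reply_audio = pipeline_turn(b"\x00" * 16_000)
    print(f"{len(reply_audio)} bytes of audio produced from text-only hand-offs")
```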
Sesame's research post from February 2025 framed the problem clearly: speech generation needs to go beyond producing high-quality audio and understand and adapt to context in real time. The model needs to leverage the history of the conversation to produce speech that is natural and coherent in that specific exchange, not just technically fluent in isolation.
CSM addresses this by processing interleaved sequences of text tokens and audio tokens in a single model. The conversational history, including the actual audio of previous utterances by both parties, is part of the model's context window. The model conditions its output not just on what should be said but on how the conversation has sounded: the pace, the warmth, the hesitations, the breathing patterns. This is why CSM can produce speech that sounds like it belongs in a specific conversation rather than speech that is technically correct but tonally detached.
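The difference shows up in the shape of the model's input. The sketch below is illustrative only, with invented token values and a simplified segment structure rather than Sesame's actual tokenization; it builds one interleaved sequence in which each prior turn contributes both its text tokens and its audio tokens, so the generator can attend to how earlier turns sounded as well as what they said.

```python
from dataclasses import dataclass
from typing import List

# Illustrative only: token values are invented, and real CSM interleaving is
# handled by the model's processor, not assembled by hand like this.

@dataclass
class Segment:
    speaker: int
    text_tokens: List[int]    # tokenized words of the turn
    audio_tokens: List[int]   # codec tokens capturing how the turn sounded

def build_context(history: List[Segment], new_text_tokens: List[int]) -> List[int]:
    """Flatten prior turns (text and audio) plus the text of the utterance to be spoken."""
    sequence: List[int] = []
    for turn in history:
        sequence += turn.text_tokens    # what was said
        sequence += turn.audio_tokens   # how it was said
    sequence += new_text_tokens         # next utterance; its audio is what gets generated
    return sequence

history = [
    Segment(speaker=0, text_tokens=[14, 92, 7], audio_tokens=[801, 42, 13, 58]),
    Segment(speaker=1, text_tokens=[3, 77, 21], audio_tokens=[209, 31, 677, 20]),
]
print(build_context(history, new_text_tokens=[14, 55, 8]))
```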
CSM uses a two-stage transformer architecture. The backbone is a large autoregressive transformer based on LLaMA that processes interleaved text and audio tokens and generates a prediction for the first (zeroth) codebook token of each audio frame. A smaller audio decoder transformer takes that zeroth codebook prediction and generates the remaining codebook levels needed to reconstruct full-fidelity audio. Both transformers are LLaMA variants sharing the same underlying architecture, with different sizes calibrated to their respective roles.
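The per-frame control flow implied by that split can be sketched as follows. The linear layers below are toy stand-ins for the two transformers, and the codebook count of 32 is an assumption for illustration; only the hand-off, the backbone predicting codebook zero and the decoder filling in the remaining codebooks for the same frame, reflects the architecture described above.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real CSM uses LLaMA-style transformers (1B-8B backbone,
# 100M-300M decoder); the codebook count here is an illustrative assumption.
HIDDEN = 64
VOCAB = 2048            # per-codebook audio token vocabulary
NUM_CODEBOOKS = 32      # assumed number of RVQ levels per frame

backbone = nn.Linear(HIDDEN, VOCAB)        # predicts codebook 0 for each frame
decoder = nn.Linear(HIDDEN + 1, VOCAB)     # predicts codebooks 1..N-1, given codebook 0

def generate_frame(context_state: torch.Tensor) -> torch.Tensor:
    """Produce one audio frame: one token per codebook."""
    cb0 = backbone(context_state).argmax(-1, keepdim=True)      # semantic codebook
    frame = [cb0]
    for _ in range(NUM_CODEBOOKS - 1):                          # acoustic codebooks
        dec_in = torch.cat([context_state, cb0.float()], dim=-1)
        frame.append(decoder(dec_in).argmax(-1, keepdim=True))
    return torch.cat(frame, dim=-1)

state = torch.randn(HIDDEN)
frame_tokens = generate_frame(state)
print(frame_tokens.shape)   # torch.Size([32]): one complete frame, ready for the codec decoder
```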
Audio is represented using Mimi, a split residual vector quantization (RVQ) tokenizer developed by the French lab Kyutai. Mimi produces one semantic codebook (capturing speaker-invariant linguistic content) and several acoustic codebooks (capturing speaker-specific characteristics such as voice timbre, prosody, and microvariation) at 12.5 Hz. The split between the semantic zeroth codebook and the acoustic higher codebooks is a key design decision: it means the backbone can focus on capturing the meaning and conversational context while the decoder handles the acoustic fidelity of the final output.
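Mimi itself is published on Hugging Face as kyutai/mimi, and the Transformers library includes a MimiModel class for it. The snippet below is a sketch that assumes the documented EnCodec-style encode/decode interface of that integration; it round-trips one second of audio through the codec to show the token layout, one row of tokens per codebook at roughly 12.5 frames per second.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Sketch assuming the Transformers Mimi integration and its documented
# EnCodec-style interface; exact output fields may vary across versions.
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sampling rate.
one_second = np.zeros(24_000, dtype=np.float32)
inputs = feature_extractor(raw_audio=one_second,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"])    # discrete RVQ codes
    decoded = model.decode(encoded.audio_codes)       # reconstructed waveform

# Roughly 12-13 frames for one second of audio (12.5 Hz frame rate),
# with one row of tokens per codebook.
print(encoded.audio_codes.shape)    # (batch, num_codebooks, num_frames)
print(decoded.audio_values.shape)   # (batch, channels, num_samples)
```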
The choice to split the task at the zeroth codebook boundary has a practical motivation. Because the backbone only needs to predict one token per audio frame before handing off to the decoder, the first audio bytes can begin streaming before the full utterance has been generated. This structure enables lower latency than architectures that require the full audio sequence to be generated before any output is produced.
Sesame trained three model sizes internally:
| Model name | Backbone | Audio decoder |
|---|---|---|
| Tiny | 1B parameters | 100M parameters |
| Small | 3B parameters | 250M parameters |
| Medium | 8B parameters | 300M parameters |
The publicly released CSM-1B corresponds to the Tiny configuration: a 1B LLaMA backbone paired with a 100M decoder. The larger Small and Medium variants, which Sesame uses internally for Maya and Miles, were not released.
Sesame uses the term "voice presence" to describe the quality that CSM is designed to produce: the sensation that the voice on the other end of the conversation belongs to an entity that is actually paying attention, that is responding to this specific moment rather than producing context-free speech. The company defines it as the magical quality that makes spoken interactions feel real, understood, and valued.
Voice presence in CSM manifests through several behaviors that traditional TTS systems cannot produce:
Contextual expressivity means the model adjusts tone, pace, and energy level based on the emotional register of the conversation. A tense exchange produces different speech characteristics than a light casual chat, even if the words being said are superficially similar.
Prosodic intelligence is the model's ability to vary intonation, pause timing, and rhythm in ways that match the conversational moment. When a human says something that warrants a thoughtful reply, the model pauses. When the conversation is quick and playful, the model matches that pace.
Disfluency production is perhaps the most striking feature to listeners encountering CSM for the first time. The model generates natural hesitations, soft filler sounds, and breathing in positions where a human speaker would produce them. These are not errors; they are features. A voice that never hesitates sounds robotic even when the words are perfect.
Interruptibility means the model can handle being cut off mid-sentence and respond to the interruption naturally rather than completing its previous utterance and then addressing the interruption as a separate turn.
Personality consistency means the voice maintains coherent individual characteristics across a long conversation rather than shifting register or personality in ways that feel arbitrary.
Sesame trained CSM on roughly one million hours of publicly available English audio. The audio was processed through transcription, speaker diarization, and filtering before use. Training used long sequence lengths of up to 2,048 tokens, representing approximately two minutes of conversational audio, so the model could learn dependencies that only become apparent when looking at several conversational turns together rather than individual sentences.
Training the audio decoder presented a memory challenge. Because the decoder must model every residual codebook at every audio frame, its memory footprint scales with batch size, sequence length, and the number of codebooks, a much heavier burden than the equivalent text-only case. Sesame addressed this through a technique they call compute amortization. The audio decoder trains on a randomly sampled 1/16 subset of audio frames per step, while the zeroth codebook (handled by the backbone) trains on every frame. The company reports that this approximation produces no measurable loss in output quality while substantially reducing the memory footprint of training.
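A rough sketch of the amortized loss, on toy tensors with placeholder modules rather than Sesame's training code: the codebook-zero loss is computed on every frame, while the decoder is only run, and its loss only computed, on a random 1/16 of frames each step.

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration; real training uses transformer losses over
# much longer interleaved sequences, not these placeholder modules.
batch, frames, codebooks, vocab, hidden = 4, 256, 32, 2048, 64
AMORTIZE_FRACTION = 1 / 16

backbone_head = nn.Linear(hidden, vocab)                   # codebook 0
decoder_head = nn.Linear(hidden, (codebooks - 1) * vocab)  # codebooks 1..N-1

hidden_states = torch.randn(batch, frames, hidden)
targets = torch.randint(0, vocab, (batch, frames, codebooks))

# Backbone (semantic codebook 0): runs and takes a loss on every frame.
backbone_logits = backbone_head(hidden_states)
backbone_loss = nn.functional.cross_entropy(
    backbone_logits.reshape(-1, vocab), targets[..., 0].reshape(-1))

# Decoder (acoustic codebooks): only runs on a random 1/16 of frames per step,
# so its activations are a fraction of what a full pass would require.
num_sampled = max(1, int(frames * AMORTIZE_FRACTION))
frame_idx = torch.randperm(frames)[:num_sampled]
sampled_hidden = hidden_states[:, frame_idx]               # (batch, 16, hidden)
decoder_logits = decoder_head(sampled_hidden).reshape(batch, num_sampled, codebooks - 1, vocab)
decoder_loss = nn.functional.cross_entropy(
    decoder_logits.reshape(-1, vocab), targets[:, frame_idx, 1:].reshape(-1))

loss = backbone_loss + decoder_loss
print(f"backbone frames per step: {frames}, decoder frames per step: {num_sampled}")
```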
Model performance was found to improve with scale across all three sizes. The 8B Medium variant that powers Sesame's internal Maya and Miles demos exhibits noticeably greater naturalness than the 1B Tiny variant available in the open source release.
On March 13, 2025, Sesame released the CSM-1B checkpoint under the Apache 2.0 license. The checkpoint is hosted on Hugging Face at sesame/csm-1b. As of Hugging Face Transformers version 4.52.1, released in May 2025, native support for CSM was integrated directly into the Transformers library, allowing users to load and run the model with standard Hugging Face tooling without requiring custom code.
The 1B release generates 24 kHz audio waveforms. It accepts text with speaker ID annotations (for example, `[0]Hello from Sesame.` for speaker zero) and optional audio context from prior conversational turns. When prior conversation audio is provided, the model conditions its output speech on it, producing responses that acoustically fit the ongoing exchange. It performs best when full conversation context is supplied.
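A minimal generation sketch following the usage documented for the Transformers CSM integration; class and argument names such as CsmForConditionalGeneration, output_audio, and save_audio are taken from that documentation and may shift between library versions.

```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

# Sketch based on the documented Transformers CSM integration (>= 4.52);
# exact class and argument names may differ across library versions.
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id).to(device)

# Text is prefixed with a speaker ID, e.g. "[0]" for speaker zero. With no
# prior-turn audio supplied, the base model picks an arbitrary voice.
text = "[0]Hello from Sesame."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)

# Generate 24 kHz audio and write it to disk via the processor.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "hello_from_sesame.wav")
```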
Sesame described the released model as a base generation model capable of producing a variety of voices but not fine-tuned to any specific speaker. The fine-tuned models powering Maya and Miles were not included in the open source release. The base model has some exposure to non-English languages through training data contamination, but Sesame does not expect reliable non-English performance from it.
By May 2026, the model had accumulated more than 185,000 monthly downloads on Hugging Face and had spawned 26 community fine-tunes, 96 Spaces, and 3 adapters. Speechmatics published a community guide to fine-tuning CSM-1B on new languages and custom voices in 2025.
The two voice personas Sesame introduced in its February 2025 demo are named Maya and Miles. Both are fine-tuned variants of CSM conditioned on specific voice profiles and personality characteristics. Maya is warm, curious, and tends toward expressiveness. Miles is slightly more measured. Both were described in Sesame's research documentation as optimized for friendliness and expressivity.
What separated Maya and Miles from previous AI voices was a cluster of behaviors rooted in the model's full-conversation context conditioning. The voices take audible breaths before longer utterances. They produce natural disfluencies, including soft hesitations and brief pauses, at positions where a human speaker would use them when formulating a reply to an unexpected question. They modulate pace and tone across the emotional arc of the conversation. They respond to interruptions naturally rather than completing a sentence as if the interruption had not happened.
Because CSM conditions on the entire conversation history, the voices also exhibit a quality that users repeatedly described as acoustic memory. The voice sounds more relaxed in familiar territory than in an unfamiliar one. It picks up on the emotional coloring of the exchange and reflects it back in ways that feel less like a computed adjustment and more like natural responsiveness. One early user example that circulated on Reddit showed a multi-turn interaction where Maya maintained consistent reference to earlier parts of the conversation, adjusting her tone as the topic shifted from casual to more serious.
Neither Maya nor Miles is a standalone conversational AI. In the Sesame demo, CSM functions as the speech layer on top of a language model backbone that handles reasoning and response generation. CSM provides the voice; the underlying LLM provides the content. The combination was what users were responding to when they described the experience as qualitatively different from prior AI voice demos.
In mid-2025, Sesame deployed an updated voice model to the web demo with improved multilingual support, adding Spanish, French, German, Italian, Chinese, Japanese, and Korean to the Maya and Miles personas.
Sesame published its research post titled "Crossing the uncanny valley of conversational voice" on February 27, 2025. The post introduced the concept of voice presence and described CSM's architecture at a high level. It also opened a free interactive web demo where users could talk to Maya or Miles directly through a browser with no download or sign-up required.
The demo spread across social media platforms, Reddit, Hacker News, and YouTube within hours of publication. Users shared clips and transcripts of conversations that highlighted the uncanny fluency of the exchanges: the breathing, the pace recovery after an interruption, the subtle shift in affect when the conversation turned to something serious. On Hacker News, the post became one of the most-discussed AI submissions of early 2025. The discussion touched on both the impressiveness of the technology and its unsettling qualities.
Reddit threads showed users running extended tests, trying to find the mechanical tells that would reveal the AI, and frequently concluding that they could not find them in short interactions. Some users reported genuine discomfort with the experience, describing it as qualitatively different from talking to any previous AI voice system. Tech outlets including The Verge and BGR covered the demo with descriptions that emphasized the human-likeness of the interaction. The Verge's headline referenced Iribe specifically as an Oculus co-founder, drawing a line between the Oculus VR headset era and this new chapter.
The engagement numbers bore out the reaction. Within the first few weeks of the demo going live, more than one million people had used it, generating over five million minutes of conversation. These were not casual drive-bys; the median session was substantial, suggesting users were spending extended time in conversation with Maya or Miles.
The viral moment had an immediate business consequence. Sesame closed its $47.5 million Series A on the same day the demo went live, with the Andreessen Horowitz deal having been in progress for some time before the public launch. The funding announcement and the viral demo amplified each other, generating the kind of momentum that made Sesame one of the most-discussed AI companies of early 2025.
Not all the reaction was enthusiastic. A number of commentators wrote about the existential implications of AI voices that were nearly indistinguishable from humans in short conversations. Questions arose about emotional manipulation, parasocial relationships with AI companions, and the ease with which the voice could be repurposed for fraud.
Beyond the voice software, Sesame's stated goal from the beginning has been to embed its AI into lightweight eyewear designed to be worn throughout the day. The company has described this ambition in its research materials, in funding announcements, and in hiring decisions that reflect a serious hardware development track.
The concept is to give the AI ambient presence: always available for a voice exchange, able to observe the user's environment through cameras and microphones, and therefore capable of contextual awareness that a phone-based assistant cannot match. The glasses would know where the user is, what they are looking at, what conversations they have already had, and what information would be relevant to surface without being asked.
Sesame emphasized from early on that the glasses would need to be fashion-forward. The company described wanting eyewear that users would choose to wear even if it contained no AI. Nate Mitchell, who had focused heavily on the consumer design of the Oculus Rift, joined as Chief Product Officer in June 2025 specifically to lead the hardware program. Hans Hartmann was likewise recruited in part for his manufacturing and supply chain background from Fitbit and Oculus, directly relevant to hardware scale-up.
As of late 2025, Sesame had shared prototype images but no hardware availability date. The $250 million Series B was described in part as funding hardware production readiness. Sequoia's investment note acknowledged the timeline directly: "hardware takes time."
The glasses concept places Sesame in a competitive space alongside Meta's Ray-Ban smart glasses, Brilliant Labs' Frame AR glasses, and earlier attempts at AI wearables including the Humane AI Pin and the Rabbit R1. Sesame's differentiator, as the company frames it, is that the AI companion inside the glasses will have a voice that does not sound like a machine.
Sesame CSM occupies a specific position among voice AI approaches because of its end-to-end multimodal architecture. Most production voice AI systems are pipelines: separate speech recognition, language model, and speech synthesis components chained together. CSM is one of the few publicly available models that processes audio and text in a single unified system.
| System | Architecture | Open source | Key characteristic |
|---|---|---|---|
| Sesame CSM | Multimodal LLaMA backbone + audio decoder | Yes (Apache 2.0) | Conversational naturalness, full-context conditioning |
| OpenAI Realtime API | End-to-end audio-in/audio-out with GPT-4o | No | Full duplex, multimodal context, production grade |
| ElevenLabs | Specialist TTS with voice cloning | No | Studio audio quality, low latency, large voice library |
| Cartesia | Sonic architecture TTS | No | Extremely low latency (40 ms TTFB), production reliability |
| Kyutai Moshi | Full duplex end-to-end speech LLM | Yes | Open source full duplex, overlapping speech |
OpenAI's Realtime API, introduced in 2024 and built on GPT-4o, is the closest commercial analogue to what Sesame has built. Both process audio end-to-end rather than through a cascade pipeline. Both achieve natural conversational qualities that pipeline systems cannot match. The key differences are that OpenAI's system supports full duplex conversation and handles multimodal inputs including images, while Sesame's approach generates audio that many users describe as more natural and emotionally textured. OpenAI's system is not open source.
ElevenLabs and Cartesia are specialist text-to-speech services rather than conversational models. ElevenLabs produces studio-grade audio output with latency of around 75 milliseconds for its Flash v2.5 model. Cartesia's Sonic 2 architecture benchmarks at roughly 40 milliseconds to first audio byte. Both are widely used in production voice agent applications. The fundamental difference from CSM is that these systems take text as input; they do not model the conversation itself. In a production voice agent, developers typically pair ElevenLabs or Cartesia synthesis with a separate speech recognition model and a separate LLM, recreating the same pipeline architecture that Sesame argues loses information.
Kyutai's Moshi, released as open source in late 2024, is the closest architectural parallel to CSM in the open source space. Both are end-to-end speech models that process audio directly without a text-only intermediate stage. Moshi is fully duplex, handling overlapping speech from multiple parties simultaneously, a capability CSM does not yet support.
Sesame has claimed in its research that CSM outperforms ElevenLabs, Play.ht, and OpenAI on novel evaluation metrics measuring pronunciation consistency and context-dependent disambiguation. Independent third-party benchmarks with standardized methodology comparing all four systems were not publicly available as of early 2026.
Sesame has positioned CSM and its voice personas primarily for companionship and ambient assistance rather than task-completion agents. The voice quality that makes Maya and Miles impressive is especially valuable in use cases where the emotional register of the conversation matters, not just the accuracy of the information exchanged.
Conversational companion applications are the primary stated use case. Sesame's iOS app beta, launched in October 2025, focused on enabling users to have extended back-and-forth conversations with an AI companion accessible through voice.
Education and tutoring represent another area where CSM's contextual sensitivity is relevant. A tutor that responds to a student's tone of frustration differently than a tone of curiosity, and that maintains awareness of earlier parts of a session, is more useful than one that processes each question in isolation.
Customer service applications, particularly for sensitive conversations involving complaints, distress, or frustration, are a domain where the difference between a robotic voice and a present one is commercially meaningful. Companies dealing with high-emotion customer interactions have an incentive to use voices that do not add to the friction.
Smart glasses and always-on wearable devices are Sesame's long-term target hardware platform. A voice that feels natural in a continuous ambient interaction is different from a voice that works for a discrete query; the former requires the kind of conversation-long conditioning that CSM provides.
The open source CSM-1B has been used by the research community for fine-tuning on custom voices, training on new languages, and studying the architecture's approach to conversational speech generation.
CSM-1B has several documented limitations that apply to the publicly released model.
Language support is primarily English. The model has incidental exposure to other languages through training data contamination, but Sesame does not expect reliable non-English performance from the base 1B release. Sesame added multilingual capabilities to its hosted Maya and Miles demos in mid-2025, but these run on internal fine-tuned variants, not the base open source model.
The base model does not have a fixed voice or persona. It produces varied output voices depending on context. Reproducing a specific, consistent character requires additional fine-tuning on voice-specific data.
Full duplex capability is absent. CSM handles turn-taking conversation but does not model simultaneous speech from both parties or the backchanneling and overlapping patterns that characterize natural human conversation. Sesame identified full duplex as a future development priority in its research post.
GPU requirements limit deployment. The model requires CUDA 12.4 or higher. There is no official CPU inference path, which makes deployment on edge devices without a discrete GPU difficult.
The model is a speech generation system only. It cannot reason, retrieve information, or hold a full conversation on its own. It needs an LLM backbone to provide the text that it converts to speech. The CSM release is the speech layer; the conversational intelligence layer is separate.
Context length constraints apply as with any transformer. Very long conversations may exceed the effective context window, at which point the model cannot condition on the earliest parts of the conversation.
The open source release of CSM-1B prompted significant discussion about misuse potential. The model can clone a voice from a short audio sample with minimal friction. TechCrunch testing found that voice cloning worked in under a minute with no restrictions on what the cloned voice could be made to say. The model ships with no built-in technical safeguards such as watermarking, output detection, or rate limiting. Sesame relies entirely on an acceptable use policy that prohibits impersonation without consent, creating misinformation, making fraudulent calls, and generating harmful content.
Consumer Reports and security researchers flagged the gap between the policy and the technical reality. Voice phishing (vishing) attacks, in which callers impersonate trusted individuals to extract money or information from victims, were already a growing problem before CSM's release. A model capable of producing natural-sounding cloned voices at low cost and with no built-in barriers increases the risk.
Sesame's position is that the company strongly condemns misuse and declines liability for violations of its terms. The company has not committed publicly to a specific timeline for adding technical detection or watermarking features, though it has not ruled them out.
The broader voice AI industry faces the same tension between open access and misuse risk. ElevenLabs faced similar criticism for voice cloning capabilities after early releases and subsequently introduced voice verification requirements and audio watermarking for generated content. Whether Sesame pursues comparable technical controls as it scales remains an open question.
There is also a separate concern about the companionship use case itself. Some researchers and commentators have argued that AI voices capable of forming the kind of emotional connection that Maya and Miles demonstrate may have effects on users' social lives and emotional health that are not yet well understood.