Sesame (formally Sesame AI Labs) is a San Francisco-based artificial intelligence company founded in June 2023, best known for developing the Conversational Speech Model (CSM) and the Maya and Miles voice personas that went viral in early 2025. The company, co-founded by Oculus VR co-founder Brendan Iribe and former Discord AI lead Ankit Kumar, focuses on what it calls "voice presence": the quality of spoken AI interaction that feels genuinely alive rather than synthesized. Sesame emerged from stealth in February 2025 with a web demo that attracted over one million users in its first few weeks, generated more than five million minutes of conversation, and prompted widespread claims that it had crossed the uncanny valley of voice AI. The company has raised over $307 million in total funding, including a $250 million Series B announced in October 2025 co-led by Sequoia Capital and Spark Capital.
Brendan Iribe is best known for co-founding Oculus VR in 2012, alongside Palmer Luckey, Nate Mitchell, and Michael Antonov. Oculus, which built the Rift virtual reality headset, was acquired by Facebook (now Meta) for approximately $2 billion in 2014. Iribe served as CEO of Oculus through the early years of its integration into Facebook before departing in 2018. The experience of building a hardware computing platform from scratch, navigating the acquisition, and watching the company scale from startup to a major division of one of the world's largest technology companies gave Iribe a particular perspective on what it took to create a new computing interface that people would actually use every day.
After leaving Oculus, Iribe spent several years as an angel investor. One of those investments was in Ubiquity6, a startup building augmented reality tools for spatial computing. Ubiquity6 was co-founded by Ankit Kumar, who had been working in the AR and shared-experience space since at least 2017. Ubiquity6 was acquired by Discord in 2021. Kumar stayed on through the acquisition, first as CTO of the Ubiquity6 team within Discord and then as AI engineering lead for Discord's Clyde chatbot, giving him substantial hands-on experience training conversational language and speech models at production scale. It was during the Clyde work that Kumar developed deep expertise in the specific challenge of making speech models feel natural in real-time conversation.
Iribe and Kumar had kept in contact through the Ubiquity6 investment, and by mid-2023 they had aligned around a shared conviction: the next significant computing interface was not a screen but an ear. The most transformative thing an AI could do was to hold a conversation that felt real. They co-founded Sesame in June 2023. Ryan Brown, who had worked at Oculus from 2013 to 2019 and then at Meta Reality Labs as a director of Research Engineering, joined as the third founding member the same month.
The company operated in stealth for roughly twenty months, hiring researchers and engineers and beginning the large-scale data collection and training work that would eventually produce CSM.
Sesame raised a $10.1 million seed round in September 2023 from undisclosed investors. This capital funded the early team build-out and the first phases of training data collection and model development.
The Series A came quickly after Sesame's public debut. On February 27, 2025, the same day the company published its research blog post and opened its interactive demo to the public, Sesame closed a $47.5 million Series A. Andreessen Horowitz led the round, with Spark Capital, Matrix Partners, and BIG Ventures also participating. Anjney Midha, a general partner at a16z who had been following the company since stealth, championed the investment inside the firm. The a16z announcement described Sesame as pursuing "an ambitious and important vision: to create a voice AI that crosses the uncanny valley."
In April 2025, Bloomberg reported that Sesame was in discussions for a significantly larger round, with Sequoia Capital and Spark Capital reportedly eyeing a $200 million investment at a valuation above $1 billion. The formal Series B closed in October 2025 at $250 million, co-led by Sequoia and Spark. Total funding raised reached $307.6 million.
Sequoia's investment note, authored in connection with the Series B announcement, described voice as "the next great shift, where voice becomes a primary interface to AI" and positioned Sesame as the company best placed to make that shift happen. The note cited the Maya and Miles demo as the first real evidence that AI voices could create genuine emotional presence rather than functional utility.
| Round | Amount | Date | Lead investor(s) |
|---|---|---|---|
| Seed | $10.1 million | September 2023 | Undisclosed |
| Series A | $47.5 million | February 2025 | Andreessen Horowitz |
| Series B | $250 million | October 2025 | Sequoia Capital, Spark Capital |
Alongside the three co-founders, Sesame assembled a leadership team with strong roots in the Oculus and Meta Reality Labs lineage.
Nate Mitchell, the fourth Oculus co-founder, joined as Chief Product Officer in June 2025. Mitchell had been one of the key product architects of the Oculus Rift and had shaped the consumer experience of early VR hardware. His arrival at Sesame completed an unusual reunion: three of Oculus's four co-founders were now working together again on the company's smart glasses hardware program.
Hans Hartmann, who had served as COO at Oculus and had held executive roles at Fitbit following its acquisition by Google, joined as Chief Operating Officer, bringing manufacturing and supply chain experience that Sesame would need when scaling hardware production.
Angela Gayles, a longtime Facebook and Meta executive, joined in a senior leadership role. Johan Schalkwyk, previously Meta's Voice Lead at its Super Intelligence Lab, joined as ML Lead in 2024, bringing applied research experience in exactly the domain Sesame was building in.
The Conversational Speech Model is Sesame's core technical contribution. The company describes it as an end-to-end multimodal speech generation system that treats conversation as a single integrated learning task rather than a pipeline of discrete stages.
The standard architecture for voice AI before CSM was a three-step cascade. A speech recognition model transcribes the user's words to text. A large language model reads that text, reasons about a response, and produces output text. A text-to-speech synthesis model converts that text to audio. Each stage introduces latency. More importantly, each stage loses information: the acoustic character of the speaker's voice, the emotional tone they were using, the pace and rhythm of their speech, the hesitations that signal uncertainty. By the time the LLM sees the user's words, the voice is gone. By the time the TTS model produces output, it is working from clean text with no memory of how the conversation has sounded.
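A minimal sketch of that cascade (hypothetical stub functions standing in for real ASR, LLM, and TTS services, not any particular vendor's API) makes the information loss concrete: the only thing that survives each hand-off is a plain string.

```python
# Hypothetical sketch of the classic three-stage voice pipeline.
# The stubs stand in for real ASR, LLM, and TTS components; the point
# is the text-only interfaces between stages, not the implementations.

def transcribe(user_audio: bytes) -> str:
    """ASR: audio in, text out. Tone, pace, and hesitation are discarded here."""
    return "could you say that again, please"

def respond(transcript: str) -> str:
    """LLM: sees only the transcript, never how it was spoken."""
    return "Sure, happy to repeat that."

def synthesize(reply_text: str) -> bytes:
    """TTS: sees only clean reply text, with no memory of how the conversation sounded."""
    return b"\x00" * 48_000  # placeholder PCM bytes

def pipeline_turn(user_audio: bytes) -> bytes:
    transcript = transcribe(user_audio)   # acoustic detail lost
    reply_text = respond(transcript)      # reasoning over text only
    return synthesize(reply_text)         # audio regenerated from scratch

if __name__ == "__main__":
    reply_audio = pipeline_turn(b"\x00" * 16_000)
    print(f"{len(reply_audio)} bytes of audio produced from text-only hand-offs")
```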
Sesame's research post from February 2025 framed the problem clearly: speech generation needs to go beyond producing high-quality audio and understand and adapt to context in real time. The model needs to leverage the history of the conversation to produce speech that is natural and coherent in that specific exchange, not just technically fluent in isolation.
CSM addresses this by processing interleaved sequences of text tokens and audio tokens in a single model. The conversational history, including the actual audio of previous utterances by both parties, is part of the model's context window. The model conditions its output not just on what should be said but on how the conversation has sounded: the pace, the warmth, the hesitations, the breathing patterns. This is why CSM can produce speech that sounds like it belongs in a specific conversation rather than speech that is technically correct but tonally detached.
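The difference shows up in the shape of the model's input. The sketch below is illustrative only, with invented token values and a simplified segment structure rather than Sesame's actual tokenization; it builds one interleaved sequence in which each prior turn contributes both its text tokens and its audio tokens, so the generator can attend to how earlier turns sounded as well as what they said.

```python
from dataclasses import dataclass
from typing import List

# Illustrative only: token values are invented, and real CSM interleaving is
# handled by the model's processor, not assembled by hand like this.

@dataclass
class Segment:
    speaker: int
    text_tokens: List[int]    # tokenized words of the turn
    audio_tokens: List[int]   # codec tokens capturing how the turn sounded

def build_context(history: List[Segment], new_text_tokens: List[int]) -> List[int]:
    """Flatten prior turns (text and audio) plus the text of the utterance to be spoken."""
    sequence: List[int] = []
    for turn in history:
        sequence += turn.text_tokens    # what was said
        sequence += turn.audio_tokens   # how it was said
    sequence += new_text_tokens         # next utterance; its audio is what gets generated
    return sequence

history = [
    Segment(speaker=0, text_tokens=[14, 92, 7], audio_tokens=[801, 42, 13, 58]),
    Segment(speaker=1, text_tokens=[3, 77, 21], audio_tokens=[209, 31, 677, 20]),
]
print(build_context(history, new_text_tokens=[14, 55, 8]))
```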
CSM uses a two-stage transformer architecture. The backbone is a large autoregressive transformer based on LLaMA that processes interleaved text and audio tokens and generates a prediction for the first (zeroth) codebook token of each audio frame. A smaller audio decoder transformer takes that zeroth codebook prediction and generates the remaining codebook levels needed to reconstruct full-fidelity audio. Both transformers are LLaMA variants sharing the same underlying architecture, with different sizes calibrated to their respective roles.
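The per-frame control flow implied by that split can be sketched as follows. The linear layers below are toy stand-ins for the two transformers, and the codebook count of 32 is an assumption for illustration; only the hand-off, the backbone predicting codebook zero and the decoder filling in the remaining codebooks for the same frame, reflects the architecture described above.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real CSM uses LLaMA-style transformers (1B-8B backbone,
# 100M-300M decoder); the codebook count here is an illustrative assumption.
HIDDEN = 64
VOCAB = 2048            # per-codebook audio token vocabulary
NUM_CODEBOOKS = 32      # assumed number of RVQ levels per frame

backbone = nn.Linear(HIDDEN, VOCAB)        # predicts codebook 0 for each frame
decoder = nn.Linear(HIDDEN + 1, VOCAB)     # predicts codebooks 1..N-1, given codebook 0

def generate_frame(context_state: torch.Tensor) -> torch.Tensor:
    """Produce one audio frame: one token per codebook."""
    cb0 = backbone(context_state).argmax(-1, keepdim=True)      # semantic codebook
    frame = [cb0]
    for _ in range(NUM_CODEBOOKS - 1):                          # acoustic codebooks
        dec_in = torch.cat([context_state, cb0.float()], dim=-1)
        frame.append(decoder(dec_in).argmax(-1, keepdim=True))
    return torch.cat(frame, dim=-1)

state = torch.randn(HIDDEN)
frame_tokens = generate_frame(state)
print(frame_tokens.shape)   # torch.Size([32]): one complete frame, ready for the codec decoder
```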
Audio is represented using Mimi, a split residual vector quantization (RVQ) tokenizer developed by the French lab Kyutai. Mimi produces one semantic codebook (capturing speaker-invariant linguistic content) and several acoustic codebooks (capturing speaker-specific characteristics such as voice timbre, prosody, and microvariation) at 12.5 Hz. The split between the semantic zeroth codebook and the acoustic higher codebooks is a key design decision: it means the backbone can focus on capturing the meaning and conversational context while the decoder handles the acoustic fidelity of the final output.
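Mimi itself is published on Hugging Face as kyutai/mimi, and the Transformers library includes a MimiModel class for it. The snippet below is a sketch that assumes the documented EnCodec-style encode/decode interface of that integration; it round-trips one second of audio through the codec to show the token layout, one row of tokens per codebook at roughly 12.5 frames per second.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Sketch assuming the Transformers Mimi integration and its documented
# EnCodec-style interface; exact output fields may vary across versions.
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sampling rate.
one_second = np.zeros(24_000, dtype=np.float32)
inputs = feature_extractor(raw_audio=one_second,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"])    # discrete RVQ codes
    decoded = model.decode(encoded.audio_codes)       # reconstructed waveform

# Roughly 12-13 frames for one second of audio (12.5 Hz frame rate),
# with one row of tokens per codebook.
print(encoded.audio_codes.shape)    # (batch, num_codebooks, num_frames)
print(decoded.audio_values.shape)   # (batch, channels, num_samples)
```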
The choice to split the task at the zeroth codebook boundary has a practical motivation. Because the backbone only needs to predict one token per audio frame before handing off to the decoder, the first audio bytes can begin streaming before the full utterance has been generated. This structure enables lower latency than architectures that require the full audio sequence to be generated before any output is produced.
Sesame trained three model sizes internally:
| Model name | Backbone | Audio decoder |
|---|---|---|
| Tiny | 1B parameters | 100M parameters |
| Small | 3B parameters | 250M parameters |
| Medium | 8B parameters | 300M parameters |
The publicly released CSM-1B corresponds to the Tiny configuration: a 1B LLaMA backbone paired with a 100M decoder. The larger Small and Medium variants, which Sesame uses internally for Maya and Miles, were not released.
Sesame uses the term "voice presence" to describe the quality that CSM is designed to produce: the sensation that the voice on the other end of the conversation belongs to an entity that is actually paying attention, that is responding to this specific moment rather than producing context-free speech. The company defines it as the magical quality that makes spoken interactions feel real, understood, and valued.
Voice presence in CSM manifests through several behaviors that traditional TTS systems cannot produce:
Contextual expressivity means the model adjusts tone, pace, and energy level based on the emotional register of the conversation. A tense exchange produces different speech characteristics than a light casual chat, even if the words being said are superficially similar.
Prosodic intelligence is the model's ability to vary intonation, pause timing, and rhythm in ways that match the conversational moment. When a human says something that warrants a thoughtful reply, the model pauses. When the conversation is quick and playful, the model matches that pace.
Disfluency production is perhaps the most striking feature to listeners encountering CSM for the first time. The model generates natural hesitations, soft filler sounds, and breathing in positions where a human speaker would produce them. These are not errors; they are features. A voice that never hesitates sounds robotic even when the words are perfect.
Interruptibility means the model can handle being cut off mid-sentence and respond to the interruption naturally rather than completing its previous utterance and then addressing the interruption as a separate turn.
Personality consistency means the voice maintains coherent individual characteristics across a long conversation rather than shifting register or personality in ways that feel arbitrary.
Sesame trained CSM on roughly one million hours of publicly available English audio. The audio was processed through transcription, speaker diarization, and filtering before use. Training used long sequence lengths of up to 2,048 tokens, representing approximately two minutes of conversational audio, so the model could learn dependencies that only become apparent when looking at several conversational turns together rather than individual sentences.
Training the audio decoder presented a memory challenge. Because the decoder must model every residual codebook at every audio frame, its memory footprint scales with batch size, sequence length, and the number of codebooks, a much heavier burden than the equivalent text-only case. Sesame addressed this through a technique they call compute amortization. The audio decoder trains on a randomly sampled 1/16 subset of audio frames per step, while the zeroth codebook (handled by the backbone) trains on every frame. The company reports that this approximation produces no measurable loss in output quality while substantially reducing the memory footprint of training.
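A rough sketch of the amortized loss, on toy tensors with placeholder modules rather than Sesame's training code: the codebook-zero loss is computed on every frame, while the decoder is only run, and its loss only computed, on a random 1/16 of frames each step.

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration; real training uses transformer losses over
# much longer interleaved sequences, not these placeholder modules.
batch, frames, codebooks, vocab, hidden = 4, 256, 32, 2048, 64
AMORTIZE_FRACTION = 1 / 16

backbone_head = nn.Linear(hidden, vocab)                   # codebook 0
decoder_head = nn.Linear(hidden, (codebooks - 1) * vocab)  # codebooks 1..N-1

hidden_states = torch.randn(batch, frames, hidden)
targets = torch.randint(0, vocab, (batch, frames, codebooks))

# Backbone (semantic codebook 0): runs and takes a loss on every frame.
backbone_logits = backbone_head(hidden_states)
backbone_loss = nn.functional.cross_entropy(
    backbone_logits.reshape(-1, vocab), targets[..., 0].reshape(-1))

# Decoder (acoustic codebooks): only runs on a random 1/16 of frames per step,
# so its activations are a fraction of what a full pass would require.
num_sampled = max(1, int(frames * AMORTIZE_FRACTION))
frame_idx = torch.randperm(frames)[:num_sampled]
sampled_hidden = hidden_states[:, frame_idx]               # (batch, 16, hidden)
decoder_logits = decoder_head(sampled_hidden).reshape(batch, num_sampled, codebooks - 1, vocab)
decoder_loss = nn.functional.cross_entropy(
    decoder_logits.reshape(-1, vocab), targets[:, frame_idx, 1:].reshape(-1))

loss = backbone_loss + decoder_loss
print(f"backbone frames per step: {frames}, decoder frames per step: {num_sampled}")
```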
Model performance was found to improve with scale across all three sizes. The 8B Medium variant that powers Sesame's internal Maya and Miles demos exhibits noticeably greater naturalness than the 1B Tiny variant available in the open source release.
On March 13, 2025, Sesame released the CSM-1B checkpoint under the Apache 2.0 license. The checkpoint is hosted on Hugging Face at sesame/csm-1b. As of Hugging Face Transformers version 4.52.1, released in May 2025, native support for CSM was integrated directly into the Transformers library, allowing users to load and run the model with standard Hugging Face tooling without requiring custom code.
The 1B release generates 24 kHz audio waveforms. It accepts text with speaker ID annotations (for example, `[0]Hello from Sesame.` for speaker zero) and optional audio context from prior conversational turns. When prior conversation audio is provided, the model conditions its output speech on it, producing responses that acoustically fit the ongoing exchange. It performs best when full conversation context is supplied.
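A minimal generation sketch following the usage documented for the Transformers CSM integration; class and argument names such as CsmForConditionalGeneration, output_audio, and save_audio are taken from that documentation and may shift between library versions.

```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

# Sketch based on the documented Transformers CSM integration (>= 4.52);
# exact class and argument names may differ across library versions.
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id).to(device)

# Text is prefixed with a speaker ID, e.g. "[0]" for speaker zero. With no
# prior-turn audio supplied, the base model picks an arbitrary voice.
text = "[0]Hello from Sesame."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)

# Generate 24 kHz audio and write it to disk via the processor.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "hello_from_sesame.wav")
```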
Sesame described the released model as a base generation model capable of producing a variety of voices but not fine-tuned to any specific speaker. The fine-tuned models powering Maya and Miles were not included in the open source release. The base model has some exposure to non-English languages through training data contamination, but Sesame does not expect reliable non-English performance from it.
By May 2026, the model had accumulated more than 185,000 monthly downloads on Hugging Face and had spawned 26 community fine-tunes, 96 Spaces, and 3 adapters. Speechmatics published a community guide to fine-tuning CSM-1B on new languages and custom voices in 2025.
The two voice personas Sesame introduced in its February 2025 demo are named Maya and Miles. Both are fine-tuned variants of CSM conditioned on specific voice profiles and personality characteristics. Maya is warm, curious, and tends toward expressiveness. Miles is slightly more measured. Both were described in Sesame's research documentation as optimized for friendliness and expressivity.
What separated Maya and Miles from previous AI voices was a cluster of behaviors rooted in the model's full-conversation context conditioning. The voices take audible breaths before longer utterances. They produce natural disfluencies, including soft hesitations and brief pauses, at positions where a human speaker would use them when formulating a reply to an unexpected question. They modulate pace and tone across the emotional arc of the conversation. They respond to interruptions naturally rather than completing a sentence as if the interruption had not happened.
Because CSM conditions on the entire conversation history, the voices also exhibit a quality that users repeatedly described as acoustic memory. The voice sounds more relaxed in familiar territory than in an unfamiliar one. It picks up on the emotional coloring of the exchange and reflects it back in ways that feel less like a computed adjustment and more like natural responsiveness. One early user example that circulated on Reddit showed a multi-turn interaction where Maya maintained consistent reference to earlier parts of the conversation, adjusting her tone as the topic shifted from casual to more serious.
Neither Maya nor Miles is a standalone conversational AI. In the Sesame demo, CSM functions as the speech layer on top of a language model backbone that handles reasoning and response generation. CSM provides the voice; the underlying LLM provides the content. The combination was what users were responding to when they described the experience as qualitatively different from prior AI voice demos.
In mid-2025, Sesame deployed an updated voice model to the web demo with improved multilingual support, adding Spanish, French, German, Italian, Chinese, Japanese, and Korean to the Maya and Miles personas.
Sesame published its research post titled "Crossing the uncanny valley of conversational voice" on February 27, 2025. The post introduced the concept of voice presence and described CSM's architecture at a high level. It also opened a free interactive web demo where users could talk to Maya or Miles directly through a browser with no download or sign-up required.
The demo spread across social media platforms, Reddit, Hacker News, and YouTube within hours of publication. Users shared clips and transcripts of conversations that highlighted the uncanny fluency of the exchanges: the breathing, the pace recovery after an interruption, the subtle shift in affect when the conversation turned to something serious. On Hacker News, the post became one of the most-discussed AI submissions of early 2025. The discussion touched on both the impressiveness of the technology and its unsettling qualities.
Reddit threads showed users running extended tests, trying to find the mechanical tells that would reveal the AI, and frequently concluding that they could not find them in short interactions. Some users reported genuine discomfort with the experience, describing it as qualitatively different from talking to any previous AI voice system. Tech outlets including The Verge and BGR covered the demo with descriptions that emphasized the human-likeness of the interaction. The Verge's headline referenced Iribe specifically as an Oculus co-founder, drawing a line between the Oculus VR headset era and this new chapter.
The engagement numbers bore out the reaction. Within the first few weeks of the demo going live, more than one million people had used it, generating over five million minutes of conversation. These were not casual drive-bys; the median session was substantial, suggesting users were spending extended time in conversation with Maya or Miles.
The viral moment had an immediate business consequence. Sesame closed its $47.5 million Series A on the same day the demo went live, with the Andreessen Horowitz deal having been in progress for some time before the public launch. The funding announcement and the viral demo amplified each other, generating the kind of momentum that made Sesame one of the most-discussed AI companies of early 2025.
Not all the reaction was enthusiastic. A number of commentators wrote about the existential implications of AI voices that were nearly indistinguishable from humans in short conversations. Questions arose about emotional manipulation, parasocial relationships with AI companions, and the ease with which the voice could be repurposed for fraud.
Beyond the voice software, Sesame's stated goal from the beginning has been to embed its AI into lightweight eyewear designed to be worn throughout the day. The company has described this ambition in its research materials, in funding announcements, and in hiring decisions that reflect a serious hardware development track.
The concept is to give the AI ambient presence: always available for a voice exchange, able to observe the user's environment through cameras and microphones, and therefore capable of contextual awareness that a phone-based assistant cannot match. The glasses would know where the user is, what they are looking at, what conversations they have already had, and what information would be relevant to surface without being asked.
Sesame emphasized from early on that the glasses would need to be fashion-forward. The company described wanting eyewear that users would choose to wear even if it contained no AI. Nate Mitchell, who had focused heavily on the consumer design of the Oculus Rift, joined as Chief Product Officer in June 2025 specifically to lead the hardware program. Hans Hartmann was likewise recruited in part for his manufacturing and supply chain background from Fitbit and Oculus, directly relevant to hardware scale-up.
As of late 2025, Sesame had shared prototype images but no hardware availability date. The $250 million Series B was described in part as funding hardware production readiness. Sequoia's investment note acknowledged the timeline directly: "hardware takes time."
The glasses concept places Sesame in a competitive space alongside Meta's Ray-Ban smart glasses, Brilliant Labs' Frame AR glasses, and earlier attempts at AI wearables including the Humane AI Pin and the Rabbit R1. Sesame's differentiator, as the company frames it, is that the AI companion inside the glasses will have a voice that does not sound like a machine.
Sesame CSM occupies a specific position among voice AI approaches because of its end-to-end multimodal architecture. Most production voice AI systems are pipelines: separate speech recognition, language model, and speech synthesis components chained together. CSM is one of the few publicly available models that processes audio and text in a single unified system.
| System | Architecture | Open source | Key characteristic |
|---|---|---|---|
| Sesame CSM | Multimodal LLaMA backbone + audio decoder | Yes (Apache 2.0) | Conversational naturalness, full-context conditioning |
| OpenAI Realtime API | End-to-end audio-in/audio-out with GPT-4o | No | Full duplex, multimodal context, production grade |
| ElevenLabs | Specialist TTS with voice cloning | No | Studio audio quality, low latency, large voice library |
| Cartesia | Sonic architecture TTS | No | Extremely low latency (40 ms TTFB), production reliability |
| Kyutai Moshi | Full duplex end-to-end speech LLM | Yes | Open source full duplex, overlapping speech |
OpenAI's Realtime API, introduced in 2024 and built on GPT-4o, is the closest commercial analogue to what Sesame has built. Both process audio end-to-end rather than through a cascade pipeline. Both achieve natural conversational qualities that pipeline systems cannot match. The key differences are that OpenAI's system supports full duplex conversation and handles multimodal inputs including images, while Sesame's approach generates audio that many users describe as more natural and emotionally textured. OpenAI's system is not open source.
ElevenLabs and Cartesia are specialist text-to-speech services rather than conversational models. ElevenLabs produces studio-grade audio output with latency of around 75 milliseconds for its Flash v2.5 model. Cartesia's Sonic 2 architecture benchmarks at roughly 40 milliseconds to first audio byte. Both are widely used in production voice agent applications. The fundamental difference from CSM is that these systems take text as input; they do not model the conversation itself. In a production voice agent, developers typically pair ElevenLabs or Cartesia synthesis with a separate speech recognition model and a separate LLM, recreating the same pipeline architecture that Sesame argues loses information.
Kyutai's Moshi, released as open source in late 2024, is the closest architectural parallel to CSM in the open source space. Both are end-to-end speech models that process audio directly without a text-only intermediate stage. Moshi is fully duplex, handling overlapping speech from multiple parties simultaneously, a capability CSM does not yet support.
Sesame has claimed in its research that CSM outperforms ElevenLabs, Play.ht, and OpenAI on novel evaluation metrics measuring pronunciation consistency and context-dependent disambiguation. Independent third-party benchmarks with standardized methodology comparing all four systems were not publicly available as of early 2026.
Sesame has positioned CSM and its voice personas primarily for companionship and ambient assistance rather than task-completion agents. The voice quality that makes Maya and Miles impressive is especially valuable in use cases where the emotional register of the conversation matters, not just the accuracy of the information exchanged.
Conversational companion applications are the primary stated use case. Sesame's iOS app beta, launched in October 2025, focused on enabling users to have extended back-and-forth conversations with an AI companion accessible through voice.
Education and tutoring represent another area where CSM's contextual sensitivity is relevant. A tutor that responds to a student's tone of frustration differently than a tone of curiosity, and that maintains awareness of earlier parts of a session, is more useful than one that processes each question in isolation.
Customer service applications, particularly for sensitive conversations involving complaints, distress, or frustration, are a domain where the difference between a robotic voice and a present one is commercially meaningful. Companies dealing with high-emotion customer interactions have an incentive to use voices that do not add to the friction.
Smart glasses and always-on wearable devices are Sesame's long-term target hardware platform. A voice that feels natural in a continuous ambient interaction is different from a voice that works for a discrete query; the former requires the kind of conversation-long conditioning that CSM provides.
The open source CSM-1B has been used by the research community for fine-tuning on custom voices, training on new languages, and studying the architecture's approach to conversational speech generation.
CSM-1B has several documented limitations that apply to the publicly released model.
Language support is primarily English. The model has incidental exposure to other languages through training data contamination, but Sesame does not expect reliable non-English performance from the base 1B release. Sesame added multilingual capabilities to its hosted Maya and Miles demos in mid-2025, but these run on internal fine-tuned variants, not the base open source model.
The base model does not have a fixed voice or persona. It produces varied output voices depending on context. Reproducing a specific, consistent character requires additional fine-tuning on voice-specific data.
Full duplex capability is absent. CSM handles turn-taking conversation but does not model simultaneous speech from both parties or the backchanneling and overlapping patterns that characterize natural human conversation. Sesame identified full duplex as a future development priority in its research post.
GPU requirements limit deployment. The model requires CUDA 12.4 or higher. There is no official CPU inference path, which makes deployment on edge devices without a discrete GPU difficult.
The model is a speech generation system only. It cannot reason, retrieve information, or hold a full conversation on its own. It needs an LLM backbone to provide the text that it converts to speech. The CSM release is the speech layer; the conversational intelligence layer is separate.
Context length constraints apply as with any transformer. Very long conversations may exceed the effective context window, at which point the model cannot condition on the earliest parts of the conversation.
The open source release of CSM-1B prompted significant discussion about misuse potential. The model can clone a voice from a short audio sample with minimal friction. TechCrunch testing found that voice cloning worked in under a minute with no restrictions on what the cloned voice could be made to say. The model ships with no built-in technical safeguards such as watermarking, output detection, or rate limiting. Sesame relies entirely on an acceptable use policy that prohibits impersonation without consent, creating misinformation, making fraudulent calls, and generating harmful content.
Consumer Reports and security researchers flagged the gap between the policy and the technical reality. Voice phishing (vishing) attacks, in which callers impersonate trusted individuals to extract money or information from victims, were already a growing problem before CSM's release. A model capable of producing natural-sounding cloned voices at low cost and with no built-in barriers increases the risk.
Sesame's position is that the company strongly condemns misuse and declines liability for violations of its terms. The company has not committed publicly to a specific timeline for adding technical detection or watermarking features, though it has not ruled them out.
The broader voice AI industry faces the same tension between open access and misuse risk. ElevenLabs faced similar criticism for voice cloning capabilities after early releases and subsequently introduced voice verification requirements and audio watermarking for generated content. Whether Sesame pursues comparable technical controls as it scales remains an open question.
There is also a separate concern about the companionship use case itself. Some researchers and commentators have argued that AI voices capable of forming the kind of emotional connection that Maya and Miles demonstrate may have effects on users' social lives and emotional health that are not yet well understood.