Sesame CSM
Last reviewed
May 16, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 3,811 words
Sesame CSM (Conversational Speech Model) is an open weights speech generation model developed by Sesame AI, the San Francisco startup founded by former Oculus co-founder Brendan Iribe and Ankit Kumar. The 1 billion parameter checkpoint, called CSM-1B, was released on March 13, 2025 under the Apache License 2.0 after the company's hosted voice companions, Maya and Miles, went viral in late February 2025 for sounding closer to a human conversational partner than any prior commercial text-to-speech system. CSM uses a two-transformer architecture built on a LLaMA backbone and a smaller audio decoder that emits Mimi codec tokens, generating speech directly from interleaved text and audio history rather than from text alone.
The model is one of the more closely watched open-source AI audio releases of 2025 because it ties three threads together at once. It is the first public artifact from a team led by the same people who built consumer virtual reality at Oculus before Facebook's 2014 acquisition. It anchors a hardware roadmap that includes a planned line of always-on Sesame smart glasses. And it competes head to head with proprietary expressive voice systems such as ElevenLabs v3 and Hume Octave 2 while shipping its base weights for free.
This article covers the company background, the Maya viral moment of February 2025, the published CSM architecture, the open release on Hugging Face, the capabilities and limits of the 1B checkpoint, the Sesame glasses hardware project, the comparison with competing voice models, and the critical reception during 2025 and into 2026.
Sesame AI was founded in 2023 in San Francisco. Brendan Iribe, who co-founded Oculus VR in 2012 and was its CEO when Facebook acquired the company in 2014 for roughly $2 billion, started Sesame with Ankit Kumar, formerly the chief technology officer of the augmented reality startup Ubiquity6. The early team pulled heavily from Iribe's Oculus and Meta Reality Labs network. By the time the company announced its $250 million Series B in October 2025, the senior team also included Nate Mitchell as chief product officer, Hans Hartmann as chief operating officer, Ryan Brown as a director of engineering, and Angela Gayles, a former long term Facebook and Meta executive.
The founding thesis is that voice, not text, will be the dominant interface for general purpose AI, and that the hardware to deliver that interface should look more like a pair of glasses than a phone or a headset. The combination of voice model plus wearable explains why a software startup is also designing eyewear, and why the public research has focused so heavily on conversational realism rather than on broader speech synthesis tasks like audiobook narration.
Sequoia Capital and Spark Capital led the October 2025 Series B, with Andreessen Horowitz and Matrix Partners participating from the previous round. The round size was confirmed at $250 million. The company did not disclose its post-money valuation, though earlier coverage in April 2025 had reported that Sequoia and Spark were eyeing a deal that would place Sesame near a $1 billion valuation, an outcome that the October round appears to have surpassed.
In late February 2025, Sesame opened a public web demo at sesame.com that let visitors talk to two voice agents named Maya and Miles. Maya was given a slightly raspy, warm female voice and an informal personality. Miles was male, a little drier, and more even tempered. Neither model was branded as an assistant in the productivity sense. They were presented as voice companions, designed to hold an open ended conversation, listen, interrupt, and respond with audible breaths, hesitations, and laughter.
The demo spread rapidly on X (formerly Twitter), Reddit, and YouTube. Clips showed users testing Maya with prompts about emotional topics, philosophical questions, and intentionally awkward conversational gambits to try to break the illusion of a human speaker. Coverage from TechCrunch, Beebom, Dataconomy, and others used the phrase "uncanny valley" repeatedly, and many reviewers said they had to remind themselves that the speaker on the other end was a model. Sesame later disclosed that more than 1 million unique users tried the public demo during this period and generated over 5 million minutes of conversation, a level of organic engagement that is unusual for a research demo from a company most consumers had never heard of.
The viral moment is the immediate context for the CSM open release. Sesame had effectively shown the world a finished product weeks before publishing any research and weeks before shipping any code. The 1B base model that arrived in March was the publicly releasable foundation underneath Maya, with a much smaller parameter count than the model running on the demo site and without any of the company specific fine tuning, persona prompts, or backend orchestration.
The published architecture is documented in Sesame's research note "Crossing the uncanny valley of conversational voice," first posted on February 27, 2025. CSM is a multimodal text and speech model that operates directly on residual vector quantization (RVQ) audio tokens, not on raw waveforms or mel spectrograms. Two autoregressive transformers, both based on the LLaMA family, sit at the heart of the system.
The first transformer, called the backbone, ingests an interleaved stream of text tokens (encoded with a standard LLaMA tokenizer) and audio tokens. It produces the zeroth codebook of the Mimi audio tokenizer, which carries the bulk of the semantic information in the speech signal. The second transformer, called the audio decoder, is smaller and faster. It takes the backbone's hidden state and emits the remaining N − 1 acoustic codebooks, where N is the number of codebooks per frame, needed to reconstruct intelligible audio at a 12.5 Hz frame rate. Mimi itself is a split RVQ codec released by Kyutai that produces one semantic codebook and several acoustic codebooks per audio frame, which makes it well suited to this split transformer design.
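The division of labor between the two transformers can be illustrated with a toy control-flow sketch. It is not Sesame's implementation: `backbone_step` and `decoder_step` are hypothetical stand-ins that return random tokens, and the values of `NUM_CODEBOOKS` and `CODEBOOK_SIZE` are illustrative assumptions rather than figures taken from the research note. The only point is the ordering: one backbone call per frame yields codebook 0, then the decoder fills in codebooks 1 through N − 1 conditioned on the backbone's hidden state.

```python
import random

NUM_CODEBOOKS = 32     # illustrative N (assumption, not stated in the article)
CODEBOOK_SIZE = 2048   # illustrative token vocabulary per codebook (assumption)


def backbone_step(history, rng):
    """Hypothetical stand-in for the backbone.

    A real backbone would attend over the interleaved text/audio history;
    here we just return a fake hidden state and a random codebook-0 token.
    """
    hidden = [rng.random() for _ in range(4)]
    return hidden, rng.randrange(CODEBOOK_SIZE)


def decoder_step(hidden, previous_codes, rng):
    """Hypothetical stand-in for the audio decoder: one acoustic token."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(history, rng):
    """One audio frame: codebook 0 from the backbone, 1..N-1 from the decoder."""
    hidden, semantic_token = backbone_step(history, rng)
    codes = [semantic_token]
    for _ in range(NUM_CODEBOOKS - 1):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(num_frames, seed=0):
    """Autoregressively produce `num_frames` frames of N codebook tokens each."""
    rng = random.Random(seed)
    history = []  # a real system would also hold interleaved text tokens here
    for _ in range(num_frames):
        history.append(generate_frame(history, rng))
    return history
```

In a real system each frame of N tokens would then be passed to the Mimi decoder to reconstruct a short slice of waveform.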
The split is not just an engineering optimization. By giving the backbone access to past audio and past text simultaneously, CSM can ground its next utterance in the actual acoustic style of the conversation so far, including the speaker's pitch contour, pace, and emotional state. This is the mechanism that lets Maya match a user's energy when they sound tired, sound excited when they sound excited, and pause more often when the conversation slows down. Most prior production voice systems instead generate from a fixed reference embedding and a text string, which is why their prosody often feels flat across long exchanges.
Sesame trained three model sizes for the research paper: a Tiny configuration with a 1B backbone and a 100M decoder, a Small configuration with a 3B backbone and a 250M decoder, and a Medium configuration with an 8B backbone and a 300M decoder. All three were trained on 2048 token sequences, which corresponds to roughly two minutes of audio per training example, over five epochs. The released open weights are the Tiny configuration, hence the CSM-1B name.
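As a rough sanity check on the sequence length, the arithmetic below assumes that each audio frame occupies exactly one backbone position and that frames arrive at the 12.5 Hz Mimi rate. Both are assumptions for illustration. Under them, 2048 positions filled entirely with audio would give an upper bound; transcript tokens share the same budget, so the usable audio per example would be lower than this bound.

```python
def max_audio_seconds(sequence_length: int, frame_rate_hz: float) -> float:
    """Upper bound on audio duration if every position holds one audio frame."""
    return sequence_length / frame_rate_hz


upper_bound = max_audio_seconds(2048, 12.5)
print(f"{upper_bound:.2f} s, about {upper_bound / 60:.1f} min")
# 163.84 s, about 2.7 min
```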
To make training tractable on long sequences, the team used a compute amortization scheme. The audio decoder was trained on only a randomly sampled 1/16 of audio frames per training step, while the backbone saw every frame. Sesame reported that this preserved the fidelity of the full RVQ reconstruction while substantially cutting peak memory.
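The frame subsampling idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Sesame's training code: the function name `sample_decoder_frames` is hypothetical, and the sketch assumes uniform sampling without replacement, which the research note as summarized here does not specify.

```python
import random


def sample_decoder_frames(num_frames, fraction=1 / 16, seed=None):
    """Pick the subset of frame indices on which the decoder loss is computed.

    The backbone would still be trained on every frame in range(num_frames);
    only the audio decoder sees this reduced subset.
    """
    rng = random.Random(seed)
    k = max(1, int(num_frames * fraction))  # always keep at least one frame
    return sorted(rng.sample(range(num_frames), k))


# Example: a 1600-frame sequence yields 100 decoder training frames.
decoder_frames = sample_decoder_frames(1600, seed=0)
print(len(decoder_frames))
```

Because the decoder has to process N − 1 codebooks for every frame it sees, cutting its frame count is where most of the memory saving would come from in a scheme like this.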
The training corpus was described as approximately one million hours of publicly available audio that was transcribed, diarized, and segmented. The dataset was predominantly English, which is consistent with the model's stronger English performance and weaker results in other languages.
For evaluation, the team argued that the standard automatic speech metrics (word error rate, speaker similarity) had saturated for the latest generation of speech models and no longer separated good from great. They introduced two new objective benchmarks. The first, Homograph Disambiguation, tests whether the model pronounces words like "lead" (the metal versus the verb) correctly given context. The second, Pronunciation Consistency, checks whether the model holds a specific pronunciation of a name or unusual word stable across a multi turn dialogue. They also ran subjective Comparative Mean Opinion Score (CMOS) studies with 80 raters on the Expresso dataset, comparing CSM both in isolation and with realistic conversational context.
CSM-1B was published on Hugging Face under sesame/csm-1b on March 13, 2025, with the model card and a reference repository at github.com/SesameAILabs/csm. The license is Apache 2.0, which permits commercial use, modification, and redistribution with attribution. The repository ships inference code, a watermarking module, and example notebooks; it does not include training code or the larger Small and Medium checkpoints from the research paper, both of which remain proprietary.
About two months after release, Hugging Face shipped native support for CSM in the Transformers library starting in version 4.52.1 (May 20, 2025), exposing a CsmForConditionalGeneration class and a matching AutoProcessor that converts text plus optional audio context into Mimi tokens and back. The integration supports batched inference, torch.compile with full CUDA graphs for low latency, static cache for repeated short prompts, and gradient checkpointing for fine tuning. Within the first month after release, the Hugging Face page reported more than 200,000 monthly downloads, 98 community Spaces (interactive demos, including head to head arenas with other open speech models), and over 30 derivative checkpoints (fine tunes, adapters, quantizations, and one merge).
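A minimal sketch of single-utterance generation through the Transformers integration might look like the following. It assumes a Transformers version with CSM support (4.52.1 or later) and follows the call pattern in the Transformers CSM documentation as best understood here; the `generate(..., output_audio=True)` and `processor.save_audio` calls should be checked against the installed version. The helper names `speaker_prompt` and `synthesize` are hypothetical.

```python
def speaker_prompt(speaker_id: int, text: str) -> str:
    """Prefix text with a bracketed speaker ID, e.g. 0 -> "[0]Hello"."""
    return f"[{speaker_id}]{text}"


def synthesize(text: str, speaker_id: int = 0, out_path: str = "out.wav") -> None:
    """Generate one utterance with CSM-1B and write it to a WAV file."""
    # Heavy imports live inside the function so the helper above can be
    # used without torch or transformers installed.
    import torch
    from transformers import AutoProcessor, CsmForConditionalGeneration

    model_id = "sesame/csm-1b"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    # Tokenize the speaker-tagged text and move the tensors to the device.
    inputs = processor(
        speaker_prompt(speaker_id, text), add_special_tokens=True
    ).to(device)

    # output_audio=True asks the model to decode Mimi tokens to a waveform.
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, out_path)
```

Calling `synthesize("Hello from Sesame.")` downloads the weights on first use, which may require accepting the model's terms on Hugging Face, and writes `out.wav`. Without audio context the voice will vary between runs.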
The model card is explicit about what CSM-1B is not. It is a base speech generation model, not a finished voice product. It cannot generate text; developers are expected to pair it with a separate language model for any system that needs to plan replies or answer questions. It does not ship with named voices. The default sample script feeds the model text prefixed with a speaker ID, such as "[0]Hello from Sesame.", and the resulting voice is essentially random because the base model never met that speaker during training. Coherent voice identity requires audio context: a short snippet of the desired speaker, included as part of the prompt, biases generation toward that timbre and accent.
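Supplying audio context through the Transformers integration amounts to building a conversation in which earlier turns carry both a transcript and audio, and the final turn carries only the text to be spoken. The sketch below shows one way to assemble that structure. The helper name `build_conversation` is hypothetical, and the dictionary keys (`role`, `type`, `text`, `path`) follow the chat-template format shown in the Transformers CSM documentation as best understood here, so they should be verified against the installed version.

```python
def build_conversation(context, next_speaker, next_text):
    """Assemble a CSM-style conversation.

    context: list of (speaker_id, transcript, audio) tuples for past turns,
             where `audio` is a file path or a waveform array.
    next_speaker / next_text: the turn the model should voice.
    """
    conversation = []
    for speaker_id, transcript, audio in context:
        conversation.append({
            "role": str(speaker_id),
            "content": [
                {"type": "text", "text": transcript},
                {"type": "audio", "path": audio},
            ],
        })
    # The final turn has text only; the model generates its audio.
    conversation.append({
        "role": str(next_speaker),
        "content": [{"type": "text", "text": next_text}],
    })
    return conversation


conversation = build_conversation(
    context=[(0, "Hi there, this is my reference clip.", "reference.wav")],
    next_speaker=0,
    next_text="Nice to meet you.",
)
```

The result would then be passed to `processor.apply_chat_template(conversation, tokenize=True, return_dict=True)` and on to `model.generate`. Reusing the reference clip's speaker ID for the final turn is what biases the output toward that voice.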
The model card also imposes an honor system on use. The license itself is permissive, but the README asks developers and operators not to use CSM for impersonation without consent, for fraud, or for misinformation. Sesame ships an audio watermarker with the repository, but critics have pointed out that the watermark is not cryptographically enforced and that nothing in the open weights themselves prevents removal of the watermarking step. A TechCrunch piece at release time noted bluntly that the model "has no real safeguards to speak of" and relies on goodwill from the people running it.
For non-English use the model card is more cautious. The training data is overwhelmingly English. The model can produce speech in other languages because there is some contamination in the corpus, but the README does not recommend it, and fine tunes for other languages have been a major focus of the open source derivative community during 2025 and 2026. Speechmatics, among others, has published a public guide on how to fine tune CSM-1B for new languages and voice profiles using the Transformers Trainer interface.
The following table summarizes the documented capabilities and limits of the released CSM-1B weights, based on the official model card and the research note. Capability claims for the larger Sesame internal models that power Maya and Miles are not included in this table because their weights are not public.
| Capability | Status in CSM-1B | Notes |
|---|---|---|
| English speech generation from text | Supported | Primary trained capability, used by every reference notebook |
| Multi speaker synthesis via speaker ID tags | Supported | Format is [0]text or [1]text; identity is random without audio context |
| Audio prompted voice consistency | Supported | A short reference clip biases timbre, accent, and pace |
| Conversational context grounding | Supported | Past Segment objects (speaker, transcript, audio) are concatenated into the prompt |
| Disfluencies, breaths, hesitations | Supported | Emerge naturally from training data, not from explicit tags |
| Long form audio (over two minutes per call) | Limited | Training sequences capped at roughly two minutes; longer clips require stitching |
| Languages other than English | Limited | README explicitly does not recommend it; community fine tunes exist for some languages |
| Voice cloning from one minute of audio | Possible | Demonstrated in community projects; not an official Sesame feature |
| Text generation or reasoning | Not supported | Model produces audio tokens only, needs an external LLM for dialogue planning |
| Built in safety filtering | Not enforced | Watermarker included, but the license is permissive and there is no content classifier |
| Fine tuning | Supported | Native integration with the Transformers Trainer, gradient checkpointing supported |
| Batched inference | Supported | Hugging Face integration ships batch and static cache support |
| CPU inference | Supported but slow | Reference repository targets CUDA 12.4 or 12.6 for usable latency |
| Streaming or real time output | Partial | Possible with torch.compile and static cache; not a turnkey feature in the released code |
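
For the long form row in the table above, stitching typically means splitting a script into short chunks, generating each chunk separately, and concatenating the waveforms. The sketch below covers only the text splitting and the concatenation. Both helper names are hypothetical, and the `max_chars` budget is an illustrative proxy for audio length, not a value from the model card.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most roughly `max_chars`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stitch(clips):
    """Concatenate per-chunk sample sequences into one waveform."""
    stitched = []
    for clip in clips:
        stitched.extend(clip)
    return stitched


chunks = split_into_chunks("One. Two! Three? Four.", max_chars=10)
print(chunks)  # ['One. Two!', 'Three?', 'Four.']
```

In practice each chunk would be generated with the previous chunk's audio supplied as context so that the voice stays consistent across the joins.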
The practical envelope for CSM-1B is something like the following. With a modern consumer GPU (16 GB of VRAM is comfortable) a developer can generate 30 seconds of natural English speech in conversational style, with a chosen voice cloned from a one minute reference clip, noticeably faster than real time. Output is intelligible, expressive, and almost always free of obvious robotic artefacts. The model occasionally mispronounces unusual proper nouns and can stumble on long lists of numbers, but it handles common homographs, sarcasm, questions, and interruptions in a way that earlier open systems usually could not.
What CSM-1B is not good at is generating speech that sounds like a polished broadcast voice reading scripted copy. It was trained primarily on conversational data and it tends to keep that flavor even when the prompt is formal. Audiobook narration, news reading, or long form podcast monologues with a single fixed voice are still better suited to systems like ElevenLabs v3, which were trained explicitly for that style.
From the start, Sesame has described itself as a hardware company that needed to build the speech model first. The eyewear project, often called Sesame glasses in coverage and the Sesame companion in company materials, is the product that the conversational voice work is meant to inhabit.
Details on the device remain partial, but a public picture has come together across 2025 and into 2026. Iribe and his co-founders have said the device will be lightweight, designed for all day wear, and aimed at fashion sensibilities first rather than at obvious technology aesthetics. There is no integrated display in the descriptions that have been shared publicly. Instead, the glasses are meant to carry high quality audio, microphones, and an AI companion that, in the company's words, "observes the world alongside you." The companion is expected to draw on a successor to CSM for its voice, and on an as yet undisclosed language model for its reasoning.
The October 2025 funding announcement was paired with an invite only iOS app beta. That app, for which beta testers signed confidentiality agreements, is described as letting users "search, text and think" through the Sesame voice agent without requiring the glasses hardware. The pattern follows a familiar consumer hardware playbook: ship the software experience first on a phone so that the eventual hardware launch has a real user base from day one.
No retail availability date for the glasses themselves has been announced. The company has consistently said that hardware takes time and has not committed to a window. Coverage from PYMNTS and TechCrunch describes the device as a multi year project.
The table below compares CSM-1B against two of the most visible expressive voice systems released around the same period, ElevenLabs v3 and Hume Octave 2. The comparison is restricted to publicly documented attributes; model size and weights availability are presented as of mid 2026.
| Attribute | Sesame CSM-1B | ElevenLabs v3 | Hume Octave 2 |
|---|---|---|---|
| Provider | Sesame AI | ElevenLabs | Hume AI |
| First release | March 13, 2025 (open) | 2025 (closed) | 2025 (closed) |
| Open weights | Yes, Apache 2.0 | No, hosted API only | No, hosted API only |
| Parameter count | 1 billion (Tiny configuration) | Not disclosed | Not disclosed |
| Architecture | Two LLaMA transformers, Mimi codec output | Proprietary, not disclosed | Proprietary expressive TTS, emotional control |
| Primary input | Text plus optional audio history | Text plus voice ID and style controls | Text plus emotional intent description |
| Conversational context grounding | Yes, audio plus text history | Limited, single utterance focus | Limited to recent turn cues |
| Native multilingual coverage | English only officially | 30 plus languages | Broad multilingual support |
| Voice cloning | Possible from short reference, community driven | Yes, instant and professional clones | Yes, voice design tools |
| Built in conversational agent | No, base model | No, separate Conversational AI product | Optional, paired with Hume's empathic LLM |
| Hardware target | Sesame smart glasses (planned) | Cloud API and SDKs | Cloud API and SDKs |
| Best at | Realistic two way dialogue with context | Polished long form narration and high fidelity speech | Emotionally expressive speech with explicit emotional control |
| Notable limitation | English centric, no built in voices, no content filter | Closed weights, no on device option | Closed weights, narrower public footprint |
The differences are easier to see when looking at what each system does first. ElevenLabs v3 is optimized for narrators and creators who want a controllable, stable voice that can deliver a long script with clean pronunciation in many languages. Hume Octave 2 is optimized for emotional steering, letting a developer describe how a line should sound and having the model deliver that emotion. CSM is optimized for live conversation, where the model needs to hear the previous turn and respond in a register that matches it. None of the three is strictly better than the others; they are answers to different problems.
Where CSM has been most disruptive is in the open weights category. Until its release, the strongest open speech models, including XTTS from Coqui and Bark from Suno, sounded clearly machine generated in conversational settings. CSM closed enough of that gap that hobbyists, indie game studios, and academic groups began building real products on top of the open checkpoint within weeks. The 32 derivative checkpoints visible on Hugging Face by early 2026 reflect that activity.
The initial reception of CSM and the Maya demo was overwhelmingly positive on the experiential axis and noticeably nervous on the safety axis. TechCrunch's release coverage, Beebom's first hands on, and Dataconomy's writeup all described the experience of talking to Maya as the closest a freely accessible model had come to passing a casual Turing style audio test. Beebom's reviewer wrote that the conversation "felt like talking to a real person" and noted natural pauses, audible thinking sounds, and contextually appropriate humor.
Researchers and engineers in the open source audio community were similarly positive about the open release. Posts on Hugging Face, Towards Data Science, and DigitalOcean's community blog walked through the architecture and praised the compute amortization trick that let Sesame train a long context speech model on a constrained budget. Speechmatics published a fine tuning recipe for new languages within weeks. ComfyUI added node support so visual workflow users could pipe text through CSM.
The nervous coverage focused on three points. The first was impersonation. The TechCrunch piece quoted developers who reported cloning a colleague's voice from a single short voicemail and producing convincing speech, with nothing in the model or the license preventing that workflow. The second was the lack of content filtering. CSM will generate whatever it is told to generate, and the watermarker is easy to bypass for anyone willing to retrain or to use the model offline. The third was the broader pattern: a company releasing a strong open speech model while simultaneously running a hosted product that is much more capable, in effect outsourcing the harder safety work to the open source ecosystem.
Sesame's own response has been to keep the public weights at the Tiny scale, to keep the Maya backing model proprietary, and to ship a watermarker without claiming it is a security boundary. The model card asks users not to misuse the model and does not claim to enforce that ask. Critics of this approach have argued that it offloads risk; defenders have argued that the Tiny model is no more capable than several closed offerings already on the market and that holding it back would not have meaningfully changed the threat model.
In the year following the release, CSM was widely used as a baseline in academic comparisons for conversational speech, both as an open reference point and as a target to beat. By early 2026, several derivative models trained on the CSM-1B checkpoint had appeared, including multilingual fine tunes and quantized variants targeted at edge devices. The viral Maya demo, meanwhile, continued to operate at sesame.com through 2025 and was eventually moved into the invite only iOS beta announced alongside the Series B round in October 2025.