Sesame CSM

Updated Jun 27, 2026

Sesame CSM (Conversational Speech Model) is an open-weights speech generation model from Sesame AI, a San Francisco startup co-founded by former Oculus chief executive Brendan Iribe. Released as a 1-billion-parameter checkpoint called CSM-1B on March 13, 2025 under the Apache License 2.0, CSM generates expressive, context-aware speech directly from interleaved text and audio history rather than from text alone, using a two-transformer architecture with a LLaMA backbone and a smaller audio decoder that emits Mimi codec tokens.^[1]^[2] The model is the public foundation underneath Maya and Miles, the hosted voice companions whose late February 2025 demo went viral for sounding closer to a human conversational partner than any prior commercial text-to-speech system.^[4]

Sesame frames the goal of CSM as achieving what it calls voice presence, defined as "the magical quality that makes spoken interactions feel real, understood, and valued."^[1] In its research note the company writes that it is "creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time."^[1] CSM-1B is the smallest of three model sizes trained for that research and the only one whose weights were released; the larger models that power the public Maya demo remain proprietary.

The model is one of the more closely watched open-source AI audio releases of 2025 because it ties three threads together at once. It is the first public artifact from a team led by the same people who built consumer virtual reality at Oculus before Facebook's 2014 acquisition. It anchors a hardware roadmap that includes a planned line of always-on Sesame smart glasses. And it competes head to head with proprietary expressive voice systems such as ElevenLabs v3 and Hume Octave 2 while shipping its base weights for free.

This article covers the company background, the Maya viral moment of February 2025, the published CSM architecture, the open release on Hugging Face, the capabilities and limits of the 1B checkpoint, the Sesame glasses hardware project, the comparison with competing voice models, and the critical reception during 2025 and into 2026.

What is Sesame AI?

Sesame AI was founded in 2023 in San Francisco. Brendan Iribe, who co-founded Oculus VR in 2012 and ran the company as CEO until it was acquired by Facebook in 2014 for roughly $2 billion, started Sesame with Ankit Kumar, formerly the chief technology officer of the augmented reality startup Ubiquity6, and Ryan Brown, a former Oculus hardware architect.^[5] The early team pulled heavily from Iribe's Oculus and Meta Reality Labs network. By the time the company announced its $250 million Series B in October 2025, the senior team also included Oculus co-founder Nate Mitchell, who joined as chief product officer in June 2025; Hans Hartmann as chief operating officer; and Angela Gayles, a longtime former Facebook and Meta executive.^[5]

The founding thesis is that voice, not text, will be the dominant interface for general purpose AI, and that the hardware to deliver that interface should look more like a pair of glasses than a phone or a headset. The combination of voice model plus wearable explains why a software startup is also designing eyewear, and why the public research has focused so heavily on conversational realism rather than on broader speech synthesis tasks like audiobook narration.

Sequoia Capital and Spark Capital led the October 2025 Series B, with Andreessen Horowitz and Matrix Partners participating from the previous round. The round size was confirmed at $250 million, bringing Sesame's total funding to roughly $307.6 million.^[5]^[6] The company did not disclose its post-money valuation, though earlier coverage in April 2025 had reported that Sequoia and Spark were eyeing a deal that would place Sesame near a $1 billion valuation, an outcome that the October round appears to have surpassed.^[12]

Sequoia, announcing its investment, described the technology as one that "doesn't just translate LLM output into audio, it generates speech directly, capturing the rhythm, emotion, and expressiveness of real dialogue," and said the firm's partners found that in testing "the voices felt alive, engaging, witty, even surprising."^[14]

Why did the Maya demo go viral?

In late February 2025, Sesame opened a public web demo at sesame.com that let visitors talk to two voice agents named Maya and Miles. Maya was given a slightly raspy, warm female voice and an informal personality. Miles was male, a little drier, and more even tempered. Neither model was branded as an assistant in the productivity sense. They were presented as voice companions, designed to hold an open ended conversation, listen, interrupt, and respond with audible breaths, hesitations, and laughter.

The demo spread rapidly on X (formerly Twitter), Reddit, and YouTube. Clips showed users testing Maya with prompts about emotional topics, philosophical questions, and intentionally awkward conversational gambits to try to break the illusion of a human speaker. Coverage from TechCrunch, Beebom, Dataconomy, and others used the phrase "uncanny valley" repeatedly, and many reviewers said they had to remind themselves that the speaker on the other end was a model. Sesame later disclosed that more than 1 million unique users tried the public demo during this period and generated over 5 million minutes of conversation, a level of organic engagement that is unusual for a research demo from a company most consumers had never heard of.^[5]

The viral moment is the immediate context for the CSM open release. Sesame had effectively shown the world a finished product weeks before publishing any research and weeks before shipping any code. The 1B base model that arrived in March was the publicly releasable foundation underneath Maya, with a much smaller parameter count than the model running on the demo site and without any of the company specific fine tuning, persona prompts, or backend orchestration.^[4]

How does the CSM architecture work?

The published architecture is documented in Sesame's research note "Crossing the uncanny valley of conversational voice," first posted on February 27, 2025.^[1] CSM is a multimodal text and speech model that operates directly on residual vector quantization (RVQ) audio tokens, not on raw waveforms or mel spectrograms. Two autoregressive transformers, both based on the LLaMA family, sit at the heart of the system.^[1]

The first transformer, called the backbone, ingests an interleaved stream of text tokens (encoded with a standard LLaMA tokenizer) and audio tokens. It produces the zeroth codebook of the Mimi audio tokenizer, which carries the bulk of the semantic information in the speech signal. The second transformer, called the audio decoder, is smaller and faster. It takes the backbone's hidden state and emits the remaining N minus 1 acoustic codebooks needed to reconstruct intelligible audio at a 12.5 Hz frame rate.^[1] Mimi itself is a split RVQ codec released by Kyutai that produces one semantic codebook and several acoustic codebooks per audio frame, which makes it well suited to this split transformer design. Sesame describes CSM as a single-stage model, in contrast to traditional two-stage pipelines that fully decouple semantic and acoustic token generation, an approach the company argues improves both efficiency and expressivity.^[1]

The split is not just an engineering optimization. By giving the backbone access to past audio and past text simultaneously, CSM can ground its next utterance in the actual acoustic style of the conversation so far, including the speaker's pitch contour, pace, and emotional state. This is the mechanism that lets Maya match a user's energy when they sound tired, sound excited when they sound excited, and pause more often when the conversation slows down. Most prior production voice systems instead generate from a fixed reference embedding and a text string, which is why their prosody often feels flat across long exchanges.
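
The division of labor between the two transformers can be illustrated with a toy generation loop. This is a minimal sketch, not Sesame's implementation: `backbone_step` and `decoder_step` are hypothetical stand-ins that return deterministic integers instead of running neural networks, and the codebook count of 8 is an illustrative value chosen for the example (the research note is summarized here only as "N").

```python
from typing import List, Tuple

FRAME_RATE_HZ = 12.5   # Mimi frame rate cited in the research note
NUM_CODEBOOKS = 8      # illustrative N; an assumption, not a documented value
VOCAB_SIZE = 1024      # illustrative codebook size; also an assumption


def backbone_step(history: List[List[int]]) -> Tuple[int, int]:
    """Hypothetical stand-in for the backbone transformer.

    A real backbone attends over interleaved text and audio tokens. Here the
    "hidden state" is just an integer derived from the history length.
    Returns (codebook-0 token, hidden state).
    """
    hidden = len(history) + 1
    return hidden % VOCAB_SIZE, hidden


def decoder_step(hidden: int, codes_so_far: List[int]) -> int:
    """Hypothetical stand-in for the audio decoder.

    Conditions on the backbone hidden state and on the codes already
    produced for this frame, then returns the next acoustic code.
    """
    return (hidden * 31 + sum(codes_so_far) + len(codes_so_far)) % VOCAB_SIZE


def generate_frames(seconds: float) -> List[List[int]]:
    """Produce one list of NUM_CODEBOOKS codes per audio frame."""
    num_frames = int(seconds * FRAME_RATE_HZ)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        code0, hidden = backbone_step(frames)   # backbone: codebook 0
        codes = [code0]
        while len(codes) < NUM_CODEBOOKS:       # decoder: remaining N - 1
            codes.append(decoder_step(hidden, codes))
        frames.append(codes)
    return frames


frames = generate_frames(2.0)
print(len(frames), len(frames[0]))
```

The point of the sketch is the control flow: the expensive model runs once per frame and yields only the first code, while the cheaper model fills in the rest of the frame.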

Sesame trained three model sizes for the research paper: a Tiny configuration with a 1B backbone and a 100M decoder, a Small configuration with a 3B backbone and a 250M decoder, and a Medium configuration with an 8B backbone and a 300M decoder.^[1] All three were trained on 2048 token sequences, which corresponds to roughly two minutes of audio per training example, over five epochs. The released open weights are the Tiny configuration, hence the CSM-1B name.
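
As a rough sanity check on the "roughly two minutes" figure, the sequence length can be converted to seconds. The sketch below assumes that each audio frame occupies one sequence position at the 12.5 Hz Mimi frame rate; that is an assumption made for this illustration, and because text tokens share the same sequence, the result is an upper bound rather than the actual audio duration per example.

```python
FRAME_RATE_HZ = 12.5      # frames per second, from the research note
SEQUENCE_LENGTH = 2048    # training sequence length, from the research note


def max_audio_seconds(positions: int, frame_rate_hz: float) -> float:
    """Upper bound on audio duration if every position were an audio frame."""
    return positions / frame_rate_hz


upper_bound = max_audio_seconds(SEQUENCE_LENGTH, FRAME_RATE_HZ)
print(f"{upper_bound:.2f} s = {upper_bound / 60:.2f} min")
```

Under that assumption the ceiling is about 2.7 minutes, so a figure of roughly two minutes of audio is plausible once interleaved text tokens take their share of the sequence.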

Configuration	Backbone	Audio decoder	Public weights
Tiny (CSM-1B)	1B	100M	Yes, Apache 2.0
Small	3B	250M	No, proprietary
Medium	8B	300M	No, proprietary

To make training tractable on long sequences, the team used a compute amortization scheme. The audio decoder was trained on only a randomly sampled 1/16 of audio frames per training step, while the backbone saw every frame. Sesame reported that this preserved the fidelity of the full RVQ reconstruction while substantially cutting peak memory.^[1]
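
The sampling step behind this amortization scheme might look like the following. It is a simplified sketch under stated assumptions: the function name and the fixed seed are hypothetical, and it shows only how a 1/16 subset of frame indices could be drawn for the decoder while the backbone keeps every frame. It does not model the losses or the networks themselves.

```python
import random
from typing import List


def sample_decoder_frames(num_frames: int, fraction: float = 1 / 16,
                          seed: int = 0) -> List[int]:
    """Pick the frame indices on which the decoder would be trained.

    The backbone still sees all `num_frames` frames; only the decoder is
    restricted to the sampled subset. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    k = max(1, int(num_frames * fraction))
    return sorted(rng.sample(range(num_frames), k))


backbone_frames = list(range(2048))           # backbone: every frame
decoder_frames = sample_decoder_frames(2048)  # decoder: 1/16 of the frames
print(len(backbone_frames), len(decoder_frames))
```

In a real training loop the subset would presumably be redrawn at every step (a different seed each time), so that over many steps the decoder still sees all positions.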

The training corpus was described as approximately one million hours of predominantly English audio that was transcribed, diarized, and segmented.^[1] The dataset's English bias is consistent with the model's stronger English performance and weaker results in other languages.

For evaluation, the team argued that the standard automatic speech metrics (word error rate, speaker similarity) had saturated for the latest generation of speech models and no longer separated good from great. They introduced two new objective benchmarks. The first, Homograph Disambiguation, tests whether the model pronounces words like "lead" (the metal versus the verb) correctly given context. The second, Pronunciation Continuation Consistency, checks whether the model holds a specific pronunciation of a name or unusual word stable across a multi turn dialogue.^[1] They also ran subjective Comparative Mean Opinion Score (CMOS) studies with 80 raters on the Expresso dataset, comparing CSM both in isolation and with realistic conversational context.
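
For readers unfamiliar with CMOS, the score is simply the average of per-comparison preference ratings. The sketch below assumes the common seven-point scale from -3 (strongly prefer B) to +3 (strongly prefer A); the scale and the ratings are illustrative assumptions, not data from Sesame's study.

```python
from statistics import mean
from typing import Sequence


def cmos(ratings: Sequence[int]) -> float:
    """Mean of comparative ratings; a positive value favors system A."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < -3 or r > 3 for r in ratings):
        raise ValueError("ratings must lie in [-3, 3]")
    return mean(ratings)


example_ratings = [1, -1, 2, 0, 0, 1]   # made-up values, not study data
print(cmos(example_ratings))
```

A score near zero means raters could not tell the two systems apart, which is the outcome of interest when one side of the comparison is a human recording.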

Is Sesame CSM open source?

CSM-1B was published on Hugging Face under sesame/csm-1b on March 13, 2025, with the model card and a reference repository at github.com/SesameAILabs/csm.^[2]^[3] The license is Apache 2.0, which permits commercial use, modification, and redistribution with attribution. The repository ships inference code, a watermarking module, and example notebooks; it does not include training code or the larger Small and Medium checkpoints from the research paper, both of which remain proprietary.^[3]

About two months after release, Hugging Face shipped native support for CSM in the Transformers library starting in version 4.52.1 (May 20, 2025), exposing a CsmForConditionalGeneration class and a matching AutoProcessor that converts text plus optional audio context into Mimi tokens and back.^[13] The integration supports batched inference, torch.compile with full CUDA graphs for low latency, static cache for repeated short prompts, and gradient checkpointing for fine-tuning. Within the first month after release, the Hugging Face page reported more than 200,000 monthly downloads, 98 community Spaces (interactive demos, including head-to-head arenas with other open speech models), and over 30 derivative checkpoints (fine-tunes, adapters, quantizations, and one merge).^[2]

The model card is explicit about what CSM-1B is not. It is a base speech generation model, not a finished voice product. It cannot generate text; developers are expected to pair it with a separate language model for any system that needs to plan replies or answer questions. It does not ship with named voices. The default sample script feeds the model a speaker ID such as `[0]Hello from Sesame.` and the resulting voice is essentially random because the base model never met that speaker during training. Coherent voice identity requires audio context: a short snippet of the desired speaker, included as part of the prompt, biases generation toward that timbre and accent.^[2]
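
A single-turn generation through the Transformers integration might look roughly like the sketch below. `CsmForConditionalGeneration` and `AutoProcessor` are the classes named above; the `output_audio=True` argument and the `save_audio` helper follow the pattern in the Transformers documentation as best understood here and should be checked against the installed version. `format_turn` and `synthesize` are hypothetical helper names, and the model call is left commented out because it downloads the checkpoint.

```python
def format_turn(speaker_id: int, text: str) -> str:
    """Prefix text with a bracketed speaker ID, e.g. "[0]Hello"."""
    return f"[{speaker_id}]{text}"


def synthesize(text: str, out_path: str = "csm_example.wav") -> None:
    """Generate speech for one turn and write it to a WAV file.

    Imports are local so this file loads without torch or transformers
    installed; calling the function requires both, plus access to the
    sesame/csm-1b weights on Hugging Face.
    """
    import torch
    from transformers import AutoProcessor, CsmForConditionalGeneration

    model_id = "sesame/csm-1b"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    # No audio context is supplied, so the voice identity is effectively random.
    inputs = processor(text, add_special_tokens=True).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, out_path)


prompt = format_turn(0, "Hello from Sesame.")
print(prompt)
# To run the model itself (downloads the checkpoint):
# synthesize(prompt)
```

Because no reference audio is included, repeated runs of this sketch would be expected to produce different voices, which matches the model card's description of the base model.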

The model card also imposes an honor system on use. The license itself is permissive, but the README asks developers and operators not to use CSM for impersonation without consent, for fraud, or for misinformation. Sesame ships an audio watermarker with the repository, but critics have pointed out that the watermark is not cryptographically enforced and that nothing in the open weights themselves prevents removal of the watermarking step. A TechCrunch piece at release time noted bluntly that the model "has no real safeguards to speak of" and relies on goodwill from the people running it.^[4]

For non-English use the model card is more cautious. The training data is overwhelmingly English. The model can produce speech in other languages because there is some contamination in the corpus, but the README does not recommend it, and fine tunes for other languages have been a major focus of the open source derivative community during 2025 and 2026. Speechmatics, among others, has published a public guide on how to fine tune CSM-1B for new languages and voice profiles using the Transformers Trainer interface.^[10]

What can CSM-1B do?

The following table summarizes the documented capabilities and limits of the released CSM-1B weights, based on the official model card and the research note. Capability claims for the larger Sesame internal models that power Maya and Miles are not included in this table because their weights are not public.

Capability	Status in CSM-1B	Notes
English speech generation from text	Supported	Primary trained capability, used by every reference notebook
Multi speaker synthesis via speaker ID tags	Supported	Format is `[0]text` or `[1]text`; identity is random without audio context
Audio prompted voice consistency	Supported	A short reference clip biases timbre, accent, and pace
Conversational context grounding	Supported	Past `Segment` objects (speaker, transcript, audio) are concatenated into the prompt
Disfluencies, breaths, hesitations	Supported	Emerge naturally from training data, not from explicit tags
Long form audio (over two minutes per call)	Limited	Training sequences capped at roughly two minutes; longer clips require stitching
Languages other than English	Limited	README explicitly does not recommend it; community fine tunes exist for some languages
Voice cloning from one minute of audio	Possible	Demonstrated in community projects; not an official Sesame feature
Text generation or reasoning	Not supported	Model produces audio tokens only, needs an external LLM for dialogue planning
Built in safety filtering	Not enforced	Watermarker included, but the license is permissive and there is no content classifier
Fine tuning	Supported	Native integration with the Transformers Trainer, gradient checkpointing supported
Batched inference	Supported	Hugging Face integration ships batch and static cache support
CPU inference	Supported but slow	Reference repository targets CUDA 12.4 or 12.6 for usable latency
Streaming or real time output	Partial	Possible with `torch.compile` and static cache; not a turnkey feature in the released code
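
Because context is concatenated into the prompt and training sequences were capped at roughly two minutes, an application has to decide how much history to keep. One simple policy, sketched below, keeps the most recent turns that fit a duration budget. The `Turn` class is a hypothetical stand-in for the repository's `Segment` objects (it stores a duration instead of an audio tensor so the example runs without audio libraries), and the 100-second budget is an illustrative value that leaves room for the new utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """Hypothetical stand-in for a conversation segment."""
    speaker: int
    text: str
    duration_s: float


def trim_context(turns: List[Turn], budget_s: float) -> List[Turn]:
    """Keep the most recent turns whose total duration fits the budget."""
    kept: List[Turn] = []
    total = 0.0
    for turn in reversed(turns):          # walk from newest to oldest
        if total + turn.duration_s > budget_s:
            break
        kept.append(turn)
        total += turn.duration_s
    kept.reverse()                        # restore chronological order
    return kept


history = [
    Turn(0, "How was the trip?", 40.0),
    Turn(1, "Long, but worth it.", 50.0),
    Turn(0, "Tell me everything.", 45.0),
]
context = trim_context(history, budget_s=100.0)
print([t.text for t in context])
```

Dropping the oldest turns first is only one option; an application that needs a stable cloned voice might instead pin the reference clip and trim the turns after it.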

The practical envelope for CSM-1B is something like the following. With a modern consumer GPU (16 GB of VRAM is comfortable) a developer can generate 30 seconds of natural English speech in conversational style, with a chosen voice cloned from a one-minute reference clip, faster than real time (that is, in noticeably less time than the clip takes to play back). Output is intelligible, expressive, and almost always free of obvious robotic artifacts. The model occasionally mispronounces unusual proper nouns and can stumble on long lists of numbers, but it handles common homographs, sarcasm, questions, and interruptions in a way that earlier open systems usually could not.
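
Generation speed relative to playback is usually expressed as a real-time factor (RTF): compute time divided by audio duration, with values below 1 meaning faster than real time. The helper below is a minimal sketch; the timing numbers are hypothetical and are not measurements of CSM-1B.

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Wall-clock generation time divided by audio duration."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return generation_s / audio_s


# Hypothetical timing: 12 s of compute for a 30 s clip.
rtf = real_time_factor(generation_s=12.0, audio_s=30.0)
print(rtf)
```

For live conversation the RTF alone is not sufficient, since the delay before the first audio frame matters as much as total throughput.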

What CSM-1B is not good at is generating speech that sounds like a polished broadcast voice reading scripted copy. It was trained primarily on conversational data and it tends to keep that flavor even when the prompt is formal. Audiobook narration, news reading, or long form podcast monologues with a single fixed voice are still better suited to systems like ElevenLabs v3, which were trained explicitly for that style.

What are the Sesame glasses?

From the start, Sesame has described itself as a hardware company that needed to build the speech model first. The eyewear project, often called Sesame glasses in coverage and the Sesame companion in company materials, is the product that the conversational voice work is meant to inhabit.

Details on the device remain partial, but a public picture has come together across 2025 and into 2026. Iribe and his co-founders have said the device will be lightweight, designed for all-day wear, and aimed at fashion sensibilities first rather than at obvious technology aesthetics. There is no integrated display in the descriptions that have been shared publicly. Instead, the glasses are meant to carry high-quality audio, microphones, and an AI companion that, in the company's words, "observes the world alongside you." The companion is expected to draw on a successor to CSM for its voice, and on an as-yet-undisclosed language model for its reasoning.

The October 2025 funding announcement was paired with an invite-only iOS app beta. The app, for which beta testers signed confidentiality agreements, is described as letting users "search, text and think" through the Sesame voice agent without requiring the glasses hardware.^[5] The pattern follows a familiar consumer hardware playbook: ship the software experience first on a phone so that the eventual hardware launch has a real user base from day one.

No retail availability date for the glasses themselves has been announced. The company has consistently said that hardware takes time and has not committed to a window. Coverage from PYMNTS and TechCrunch describes the device as a multi year project.^[6]

How does CSM compare to ElevenLabs and Hume?

The table below compares CSM-1B against two of the most visible expressive voice systems released around the same period, ElevenLabs v3 and Hume Octave 2. The comparison is restricted to publicly documented attributes; pricing, model size, and weights availability are presented as of mid 2026.

Attribute	Sesame CSM-1B	ElevenLabs v3	Hume Octave 2
Provider	Sesame AI	ElevenLabs	Hume AI
First release	March 13, 2025 (open)	2025 (closed)	2025 (closed)
Open weights	Yes, Apache 2.0	No, hosted API only	No, hosted API only
Parameter count	1 billion (Tiny configuration)	Not disclosed	Not disclosed
Architecture	Two LLaMA transformers, Mimi codec output	Proprietary, not disclosed	Proprietary expressive TTS, emotional control
Primary input	Text plus optional audio history	Text plus voice ID and style controls	Text plus emotional intent description
Conversational context grounding	Yes, audio plus text history	Limited, single utterance focus	Limited to recent turn cues
Native multilingual coverage	English only officially	30 plus languages	Broad multilingual support
Voice cloning	Possible from short reference, community driven	Yes, instant and professional clones	Yes, voice design tools
Built in conversational agent	No, base model	No, separate Conversational AI product	Optional, paired with Hume's empathic LLM
Hardware target	Sesame smart glasses (planned)	Cloud API and SDKs	Cloud API and SDKs
Best at	Realistic two way dialogue with context	Polished long form narration and high fidelity speech	Emotionally expressive speech with explicit emotional control
Notable limitation	English centric, no built in voices, no content filter	Closed weights, no on device option	Closed weights, narrower public footprint

The differences are easier to see when looking at what each system does first. ElevenLabs v3 is optimized for narrators and creators who want a controllable, stable voice that can deliver a long script with clean pronunciation in many languages. Hume Octave 2 is optimized for emotional steering, letting a developer describe how a line should sound and having the model deliver that emotion. CSM is optimized for live conversational AI, where the model needs to hear the previous turn and respond in a register that matches it. None of the three is strictly better than the others; they are answers to different problems.

Where CSM has been most disruptive is in the open weights category. Until its release, the strongest open speech models, including XTTS from Coqui and Bark from Suno, sounded clearly machine generated in conversational settings. CSM closed enough of that gap that hobbyists, indie game studios, and academic groups began building real products on top of the open checkpoint within weeks. The 32 derivative checkpoints visible on Hugging Face by early 2026 reflect that activity.

How was Sesame CSM received?

The initial reception of CSM and the Maya demo was overwhelmingly positive on the experiential axis and noticeably nervous on the safety axis. TechCrunch's release coverage, Beebom's first hands on, and Dataconomy's writeup all described the experience of talking to Maya as the closest a freely accessible model had come to passing a casual Turing style audio test. Beebom's reviewer wrote that the conversation "felt like talking to a real person" and noted natural pauses, audible thinking sounds, and contextually appropriate humor.^[7] Dataconomy similarly headlined its coverage "You Can Now Try The AI That Made Maya Go Viral," capturing how the open release let the wider public reproduce the demo that had circulated weeks earlier.^[8]

Researchers and engineers in the open source audio community were similarly positive about the open release. Posts on Hugging Face, Towards Data Science, and DigitalOcean's community blog walked through the architecture and praised the compute amortization trick that let Sesame train a long context speech model on a constrained budget.^[9]^[11] Speechmatics published a fine tuning recipe for new languages within weeks.^[10] ComfyUI added node support so visual workflow users could pipe text through CSM.

The nervous coverage focused on three points. The first was impersonation. The TechCrunch piece quoted developers who reported cloning a colleague's voice from a single short voicemail and producing convincing speech, with nothing in the model or the license preventing that workflow. The second was the lack of content filtering. CSM will generate whatever it is told to generate, and the watermarker is easy to bypass for anyone willing to retrain or to use the model offline. The third was the broader pattern: a company releasing a strong open speech model while simultaneously running a hosted product that is much more capable, in effect outsourcing the harder safety work to the open source ecosystem.^[4]

Sesame's own response has been to keep the public weights at the Tiny scale, to keep the Maya backing model proprietary, and to ship a watermarker without claiming it is a security boundary. The model card asks users not to misuse the model and does not claim to enforce that ask. Critics of this approach have argued that it offloads risk; defenders have argued that the Tiny model is no more capable than several closed offerings already on the market and that holding it back would not have meaningfully changed the threat model.

In the year following the release, CSM was widely used as a baseline in academic comparisons for conversational speech, both as an open reference point and as a target to beat. By early 2026, several derivative models trained on the CSM-1B checkpoint had appeared, including multilingual fine tunes and quantized variants targeted at edge devices. The viral Maya demo, meanwhile, continued to operate at sesame.com through 2025 and was eventually moved into the invite only iOS beta announced alongside the Series B round in October 2025.^[5]

ELI5: what is Sesame CSM?

Imagine a computer voice that does not just read words out loud but actually listens to how you are talking and answers in a matching mood. If you sound sleepy, it slows down. If you sound excited, it perks up. It even breathes, pauses, and laughs like a person. That is Sesame CSM. The company Sesame built a small version of this voice and gave it away for free so anyone can use it in their own apps, while keeping the bigger, even more lifelike version (the one behind its famous Maya demo) for itself. Sesame eventually wants this voice to live inside a pair of smart glasses you wear all day.

References

1. Sesame AI. "Crossing the uncanny valley of conversational voice." Research note, February 27, 2025. https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
2. Sesame AI. "sesame/csm-1b." Model card on Hugging Face, March 13, 2025. https://huggingface.co/sesame/csm-1b
3. SesameAILabs. "csm: A Conversational Speech Generation Model." GitHub repository. https://github.com/SesameAILabs/csm
4. Wiggers, Kyle. "Sesame, the startup behind the viral virtual assistant Maya, releases its base AI model." TechCrunch, March 13, 2025. https://techcrunch.com/2025/03/13/sesame-the-startup-behind-the-viral-virtual-assistant-maya-releases-its-base-ai-model/
5. Wiggers, Kyle. "Sesame, the conversational AI startup from Oculus founders, raises $250M and launches beta." TechCrunch, October 21, 2025. https://techcrunch.com/2025/10/21/sesame-the-conversational-ai-startup-from-oculus-founders-raises-250m-and-launches-beta/
6. PYMNTS. "Sesame Attracts $250 Million in Funding to Advance Voice-Driven AI Wearables." October 21, 2025. https://www.pymnts.com/artificial-intelligence-2/2025/sesame-attracts-250-million-in-funding-to-advance-voice-driven-ai-wearables/
7. Beebom. "I Tried Sesame AI's Voice Companion, and It Was Like Talking to a Real Person." March 2025. https://beebom.com/sesame-ai-voice-companion-maya-experience-like-talking-to-real-person/
8. Dataconomy. "You Can Now Try The AI That Made Maya Go Viral." March 14, 2025. https://dataconomy.com/2025/03/14/you-can-now-try-the-ai-that-made-maya-go-viral/
9. DigitalOcean Community. "An Overview of Sesame's Conversational Speech Model." Tutorial. https://www.digitalocean.com/community/tutorials/sesame-csm
10. Speechmatics. "How to Finetune Sesame AI's Speech Model on New Languages and Voices." Blog post, 2025. https://blog.speechmatics.com/sesame-finetune
11. Towards Data Science. "Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech." 2025. https://towardsdatascience.com/sesame-speech-model-how-this-viral-ai-model-generates-human-like-speech/
12. Winbuzzer. "Sesame AI's Hyper-Realistic Voice Assistant Nears $1B Valuation as Sequoia, Spark Eye $200M Investment." April 1, 2025. https://winbuzzer.com/2025/04/01/sesame-ais-hyper-realistic-voice-assistant-nears-1b-valuation-as-sequoia-spark-eye-200m-investment-xcxwbn/
13. Hugging Face. "Transformers v4.52.1 release notes." May 20, 2025. https://github.com/huggingface/transformers/releases
14. Sequoia Capital. "Partnering with Sesame: A New Era for Voice." October 2025. https://www.sequoiacap.com/article/partnering-with-sesame-a-new-era-for-voice/
