Moshi is a full-duplex speech-to-speech foundation model developed by Kyutai, a French nonprofit artificial intelligence research laboratory. Announced on July 3, 2024, and fully open-sourced on September 17, 2024, Moshi was the first publicly available AI system capable of listening and speaking at the same time, with a theoretical response latency of 160 milliseconds and a practical latency of around 200 milliseconds in production. The model operates without breaking conversation into discrete turns; it processes the user's audio and generates its own speech in parallel, removing the start-stop structure that characterized earlier voice assistants.
Built on a 7-billion-parameter text language model called Helium and a streaming neural audio codec called Mimi, Moshi introduced an architectural technique called Inner Monologue, in which the model generates text tokens aligned with its audio output to improve linguistic quality. Its code is available under MIT and Apache 2.0 licenses, and model weights are released under CC-BY 4.0, permitting both commercial and noncommercial use. Subsequent work from Kyutai extended the architecture into simultaneous speech translation (Hibiki, released February 2025), a vision-augmented variant (MoshiVis, released March 2025), and a cascaded voice pipeline wrapping any text LLM (Unmute, demonstrated in May 2025 and open-sourced in July 2025).
Kyutai is a Paris-based nonprofit AI research laboratory announced in September 2023 and formally unveiled at the ai-PULSE conference in November 2023. The lab received approximately 300 million euros at launch, with 100 million euros from Xavier Niel's Iliad Group, 100 million euros from Rodolphe Saade's CMA CGM shipping conglomerate, and additional contributions from Eric Schmidt's Schmidt Sciences and other donors. Niel, the founder of French telecommunications company Free and a prominent technology investor, described the lab as a response to concerns that large technology companies were increasingly restricting publication by their researchers and reducing open access to AI research.
The scientific leadership at Kyutai was drawn largely from Meta's Fundamental AI Research (FAIR) division and Google DeepMind. Patrick Perez, who had previously worked at Valeo and Inria, became chief executive. Edouard Grave, a former Meta FAIR researcher specializing in natural language processing, joined as chief language officer. Alexandre Defossez, who led audio research at Meta FAIR and is known for work on the EnCodec neural audio codec, joined as head of audio research. Laurent Mazare, previously at DeepMind, became head of engineering. Neil Zeghidour, a former Google DeepMind researcher with expertise in generative audio, joined as audio research advisor. Hervé Jegou, a co-founder with a background in computer vision research at FAIR, later left the team and is listed among its alumni.
Scaleway, the cloud computing subsidiary of Niel's Iliad Group, committed 1,000 Nvidia H100 GPUs to the lab at cost. Kyutai's founding commitment was to release all models, training code, and data openly, distinguishing it from most well-funded AI laboratories, which retain proprietary control over their systems.
Voice interfaces have existed in consumer technology since Apple's Siri in 2011 and Amazon's Alexa in 2014, but both relied on a cascaded pipeline architecture: a speech-to-text component transcribed audio into words, a language model processed the text, and a text-to-speech component rendered the response as audio. Each step added latency, and the pipeline required explicit turn-taking. Users had to stop speaking before the system would process the utterance, and the system had to finish generating before speaking.
OpenAI demonstrated an alternative approach with GPT-4o in May 2024, showing a model that could listen and respond naturally without artificial pauses. However, OpenAI did not release the voice capabilities immediately for general use, and it did not make the underlying model available as open-source software. Kyutai's Moshi was designed to fill that gap: a research system with comparable full-duplex capabilities that researchers, developers, and companies could run, study, and modify directly.
At the July 3 announcement in Paris, Kyutai made the model accessible at moshi.chat within hours, which was described at the time as the first publicly testable generative voice AI of its kind. Yann LeCun, who serves as a scientific advisor to Kyutai, shared the demo publicly on the same day.
Moshi's architecture combines three main components: the Helium text language model, the Mimi neural audio codec, and a hierarchical audio generation module called the RQ-Transformer. The system processes two parallel audio streams simultaneously, one representing the model's own speech and one representing the user's speech, enabling continuous conversation without enforcing speaker turns.
Helium is a 7-billion-parameter decoder-only text language model trained from scratch by Kyutai. Its architecture uses 32 transformer layers with a model dimension of 4,096 and 32 attention heads. The context length is 4,096 tokens. It incorporates RMS normalization, Rotary Position Embeddings (RoPE), FlashAttention, and Gated Linear Units, which are design choices found in contemporary open-weight models such as LLaMA.
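The reported hyperparameters can be summarized in a small configuration sketch. The following is illustrative only; the field names are not Kyutai's actual configuration keys.

```python
from dataclasses import dataclass

@dataclass
class HeliumConfig:
    """Helium hyperparameters as reported; field names are illustrative,
    not Kyutai's actual configuration keys."""
    num_layers: int = 32          # transformer decoder layers
    d_model: int = 4096           # model (hidden) dimension
    num_heads: int = 32           # attention heads
    context_length: int = 4096    # maximum context in tokens
    norm: str = "rmsnorm"         # RMS normalization
    positional: str = "rope"      # Rotary Position Embeddings
    ffn: str = "gated_linear_unit"
    attention: str = "flash_attention"

config = HeliumConfig()
print(config.d_model // config.num_heads)  # per-head dimension: 128
```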
Helium was trained on 2.1 trillion tokens of English text. The training corpus was composed of 12.5 percent curated high-quality sources including Wikipedia, StackExchange, and scientific articles, and 87.5 percent filtered CommonCrawl web data. The filtering pipeline used FNV-1a hashing for deduplication, fastText classifiers for language identification, and quality scoring to remove low-quality web content. Training ran for 500,000 steps with a batch size of 4.2 million tokens and a learning rate of 3 × 10⁻⁴.
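These figures are mutually consistent, as a quick check shows:

```python
# Quick consistency check of the reported pre-training figures.
steps = 500_000            # optimizer steps
batch_tokens = 4.2e6       # tokens processed per step
print(steps * batch_tokens / 1e12)  # 2.1 (trillion tokens), matching the corpus size
```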
On standard reasoning and knowledge benchmarks, Helium achieved 79.6 on ARC-Easy, 55.9 on ARC-Challenge, and 54.3 on MMLU, placing it in the competitive range for 7-billion-parameter models trained on similar data volumes. Kyutai later released Helium as a standalone text model.
Mimi is a streaming neural audio codec that compresses 24 kHz audio into a sequence of discrete tokens at 12.5 frames per second, with a bitrate of 1.1 kilobits per second. The codec is built on a SeaNet autoencoder with causal convolutions, which allows it to operate in streaming mode with only 80 milliseconds of latency, meaning it can begin encoding and decoding before a full audio segment is available.
The encoder uses convolutional blocks with stride factors of 4, 5, 6, and 8, followed by a final downsampling step that brings the frame rate to 12.5 Hz. A transformer bottleneck with 8 layers, 8 attention heads, and a context window of 250 frames (20 seconds) is inserted between the encoder and decoder to capture long-range dependencies in the audio. The decoder uses symmetric transposed convolutions to reconstruct the waveform.
Quantization uses 8 residual vector quantizers (RVQ), each with a codebook of 2,048 entries. Mimi splits the quantization into two stages: the first quantizer uses semantic supervision distilled from WavLM to encode linguistically meaningful content, and the remaining 7 quantizers handle acoustic reconstruction. This split allows the first codebook to carry phonetic and semantic information, while the others refine audio quality. Mimi was trained with adversarial objectives, using a discriminator and feature-matching loss without a direct reconstruction loss, which the Kyutai team found produced better perceptual quality.
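The reported 1.1 kbps bitrate follows directly from this configuration, since each of the 8 codebooks contributes log2(2048) = 11 bits per 12.5 Hz frame. A quick check, for illustration:

```python
import math

codebooks = 8                 # residual vector quantizers
codebook_size = 2048          # entries per codebook -> 11 bits each
frame_rate = 12.5             # frames per second

bits_per_frame = codebooks * math.log2(codebook_size)   # 88 bits
print(bits_per_frame * frame_rate / 1000)                # 1.1 kbps
```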
Compared with earlier neural codecs such as SpeechTokenizer and SemantiCodec, Mimi achieves lower bitrates while maintaining higher intelligibility and audio quality, owing to its streaming architecture and the semantic-acoustic split in the quantizer.
The core generative module in Moshi is a two-level transformer designed to predict sequences of audio tokens at multiple timescales. It consists of a Temporal Transformer and a Depth Transformer.
The Temporal Transformer has the same architecture as Helium: 32 layers, 4,096 dimensions, 32 attention heads. At each 12.5 Hz timestep, it processes the full context of previous tokens and produces a hidden state that feeds into the Depth Transformer.
The Depth Transformer is a smaller model with 6 layers, 1,024 dimensions, and 16 attention heads. Given the Temporal Transformer's hidden state for a single timestep, the Depth Transformer predicts that timestep's tokens one codebook at a time, covering the 8 RVQ codebooks of each of the two audio streams. The Depth Transformer uses per-codebook parameters, meaning each codebook level has its own learned projections, which the Kyutai team found improved generation quality in ablation studies.
Moshi models two parallel audio streams jointly. Each stream contains one semantic token at 12.5 Hz and 7 acoustic tokens from the acoustic quantizers. With two streams, the total number of token sequences per timestep is 17 (2 streams times 8 codebooks per stream, plus 1 for the Inner Monologue text token). The two streams are modeled without any speaker turn supervision; the model learns to coordinate speaking and listening from data alone.
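The hierarchy can be sketched as follows. This is a toy illustration of the Temporal/Depth split and the 17 per-frame token sequences, not Kyutai's implementation; dimensions and module names are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class DepthStep(nn.Module):
    """Toy stand-in for the Depth Transformer with per-codebook parameters."""
    def __init__(self, d_model=1024, vocab=2048, num_codebooks=17):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(num_codebooks))
        self.embeds = nn.ModuleList(nn.Embedding(vocab, d_model) for _ in range(num_codebooks))

    def forward(self, hidden):
        # Predict this frame's tokens one codebook level at a time, conditioning
        # each level on the previous level's token plus the temporal context.
        tokens, context = [], hidden
        for head, embed in zip(self.heads, self.embeds):
            token = head(context).argmax(dim=-1)   # greedy sampling for illustration
            tokens.append(token)
            context = hidden + embed(token)
        return tokens                               # 1 text + 2 x 8 audio tokens

to_depth = nn.Linear(4096, 1024)                    # project Temporal Transformer state
hidden = to_depth(torch.randn(1, 4096))             # pretend per-frame hidden state
print(len(DepthStep()(hidden)))                     # 17 token sequences for one 80 ms frame
```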
An acoustic delay of 1 to 2 steps between the semantic token and the acoustic tokens was found in ablations to significantly improve audio quality, allowing the model time to commit to a semantic direction before generating the detailed acoustic representation.
Inner Monologue is Moshi's mechanism for generating text tokens aligned with its own audio output. Rather than operating in a fully audio-only mode, Moshi predicts a text token as a prefix to each frame's audio tokens, representing what Moshi is about to say at that moment. These text tokens are time-aligned using Whisper word-level timestamps mapped to the 12.5 Hz framerate of Mimi.
In English conversational speech, the Inner Monologue tokens consist of approximately 35 percent actual word tokens and 65 percent padding tokens that fill the space between words. The padding structure was designed to maintain temporal alignment even during pauses and between words.
Inner Monologue improves linguistic quality substantially, because the text prediction provides the model with an explicit intermediate representation of its intended speech content, which then constrains the audio generation. The same mechanism can also be used for streaming automatic speech recognition and text-to-speech: by adjusting the delay between text and audio, the model can operate in ASR mode (text lags audio) or TTS mode (text precedes audio), all within the same weights.
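The delay trick can be illustrated with a small alignment sketch. This is conceptual only, with assumed token placeholders rather than Kyutai's data format: shifting the text stream relative to the audio frames is what switches the same weights between conversational, ASR-like, and TTS-like behavior.

```python
PAD = "<pad>"

def align_streams(text, audio, text_delay):
    """Pad one stream so the other leads by `text_delay` frames (80 ms each).
    Positive delay: audio leads, text follows (streaming ASR).
    Negative delay: text leads, audio follows (streaming TTS)."""
    if text_delay >= 0:
        text = [PAD] * text_delay + text
    else:
        audio = ["<silence>"] * (-text_delay) + audio
    return list(zip(text, audio))

text = ["hi", PAD, "there"]                 # one text token per frame, padded between words
audio = ["frame0", "frame1", "frame2"]      # stand-ins for Mimi audio tokens
print(align_streams(text, audio, +2))       # ASR-style: transcript trails the audio by 160 ms
print(align_streams(text, audio, -2))       # TTS-style: text commits 160 ms ahead of the audio
```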
Moshi's training proceeds in five stages.
In stage one, Helium is pre-trained on 2.1 trillion text tokens over 500,000 steps.
In stage two, the audio generation capability is added through unsupervised pre-training on a dataset of 7 million hours of English speech transcribed by Whisper large-v3. Training runs for 1 million steps on single-stream audio, with 30 percent of text token positions masked at random, a random delay of up to ±0.6 seconds between text and audio, and half of the training steps drawn from text-only data to prevent catastrophic forgetting of the language model's knowledge.
In stage three, multi-stream post-training introduces the two-speaker setting over 100,000 steps. Kyutai applied pyAnnote speaker diarization to the unsupervised audio data to separate speaker segments and simulate dual streams, teaching the model to handle two simultaneous audio streams.
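For reference, a typical diarization call with pyannote looks roughly like the following; the checkpoint name and authentication handling depend on the installed version, and Kyutai's exact preprocessing is not reproduced here.

```python
from pyannote.audio import Pipeline

# Gated checkpoint: requires accepting the model terms and a Hugging Face token.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("conversation.wav")

# Speaker-labeled segments, which can be used to split a recording into
# per-speaker streams approximating Moshi's two-stream setting.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.1f}s - {segment.end:.1f}s")
```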
In stage four, Moshi is fine-tuned for 10,000 steps on the Fisher corpus, a dataset of approximately 2,000 hours of telephone conversations with separate audio channels per speaker, recorded at 8 kHz and upsampled to 24 kHz using AudioSR. This stage gives Moshi genuine exposure to natural overlapping conversation.
In stage five, instruction fine-tuning on a custom dataset of over 20,000 hours of synthetic speech trains Moshi to behave as a helpful conversational assistant. User inputs were generated by Helium and converted to speech using a text-to-speech system, while Moshi's responses were synthesized from a single voice actor across over 70 speaking styles and emotional registers. Robustness augmentation during this stage included gain variation between −24 dB and +15 dB, background noise addition, and echo and reverberation simulation.
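A minimal augmentation sketch in the spirit of these perturbations might look like the following; this is not Kyutai's pipeline, and the SNR value and function names are assumptions.

```python
import torch

def augment(waveform, noise, gain_db_range=(-24.0, 15.0), snr_db=20.0):
    """Apply a random gain in dB and mix in background noise at a target SNR."""
    gain_db = torch.empty(1).uniform_(*gain_db_range)
    out = waveform * (10.0 ** (gain_db / 20.0))

    sig_power = out.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return out + scale * noise[: out.shape[-1]]

speech = torch.randn(24_000)   # one second of 24 kHz audio (placeholder)
noise = torch.randn(24_000)
augmented = augment(speech, noise)
```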
Full-duplex operation is the defining characteristic that separates Moshi from all voice AI systems that preceded it. In a conventional voice pipeline, the conversation proceeds in half-duplex: either the user speaks or the system speaks, but not both at once. The system must detect when the user has stopped speaking (end-of-turn detection), transcribe the utterance, process it, generate a response, and then speak. Any interruption requires complex state management.
Moshi eliminates this structure entirely. At every 12.5 Hz timestep (every 80 milliseconds), the model simultaneously processes an incoming audio frame from the user and generates an outgoing audio frame of its own. The two streams are independent in the sense that the model does not enforce turns, but they are jointly conditioned in the Temporal Transformer, so Moshi's output takes the user's concurrent speech into account.
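Conceptually, the runtime reduces to a single loop over 80 millisecond frames. The sketch below uses stand-in functions rather than the real Mimi and Moshi components, purely to show where listening and speaking happen within the same step.

```python
import time

FRAME_SECONDS = 0.080   # one Mimi frame at 12.5 Hz

# Stand-ins; a real deployment would call the Mimi codec and Moshi model
# exposed by one of the released backends.
def encode_user_frame(pcm):      return ["user_audio_tokens"]
def model_step(user_tokens):     return ["moshi_audio_tokens"]
def decode_moshi_frame(tokens):  return [0.0] * 1920   # 80 ms of 24 kHz samples

def duplex_loop(microphone_frames):
    """At every step the model both consumes a user frame and emits its own
    frame; there is no turn-taking logic anywhere in the loop."""
    for pcm in microphone_frames:
        start = time.monotonic()
        user_tokens = encode_user_frame(pcm)       # listen
        moshi_tokens = model_step(user_tokens)     # condition on both streams
        yield decode_moshi_frame(moshi_tokens)     # speak, concurrently
        # Each step must finish within one frame to stay real time.
        assert time.monotonic() - start < FRAME_SECONDS
```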
This architecture enables natural conversational behaviors that cascaded systems cannot reproduce: interrupting and being interrupted without losing context, trailing off when the user starts speaking, picking up a thought after an interruption, and reacting to emotional content in the user's voice in real time.
The practical latency of 200 milliseconds is below the threshold at which humans typically perceive a conversational delay as unnatural. The Kyutai paper notes that the average human response latency across ten languages in human-to-human telephone conversations is approximately 230 milliseconds, making Moshi's response speed comparable to natural human dialogue.
Kyutai made the model weights and inference code publicly available on September 17, 2024, following the July 3 announcement and several weeks of access via moshi.chat. The release included inference code in three separate backends designed for different deployment contexts.
The PyTorch backend is intended for research and experimentation. It requires a GPU with at least 24 gigabytes of VRAM and supports standard development workflows.
The MLX backend targets Apple Silicon, enabling local inference on macOS and, with quantization, on an iPhone 15 Pro. This represented one of the first demonstrations of a real-time speech-to-speech model running on consumer mobile hardware.
The Rust backend using the Candle framework is designed for production deployment. It supports CUDA on Linux and Metal on macOS and is optimized for low-latency serving.
Model weights are available in two voice variants: Moshika, which uses a female synthetic voice, and Moshiko, which uses a male synthetic voice. All variants are available in bf16 precision, with int8 and int4 quantized formats available for some backends.
The Mimi codec weights are released separately on Hugging Face, allowing developers to use the codec independently for tasks such as streaming speech tokenization.
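A round trip through the codec via the Hugging Face Transformers integration looks roughly like the following. This assumes the kyutai/mimi checkpoint and the Encodec-style API the integration follows; argument and attribute names should be checked against the installed transformers version.

```python
import torch
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

speech = torch.zeros(24_000)   # one second of silence at 24 kHz (placeholder input)
inputs = feature_extractor(raw_audio=speech.numpy(), sampling_rate=24_000,
                           return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes   # discrete Mimi tokens
    audio = model.decode(codes).audio_values                    # reconstructed waveform

print(codes.shape)   # (batch, codebooks, frames): 12.5 frames per second of input
```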
Licensing uses a two-layer structure. Python code is under MIT, Rust code is under Apache 2.0, and model weights are under CC-BY 4.0. The CC-BY 4.0 license permits commercial use with attribution, making the weights usable in production applications without a proprietary license.
Moshi was added to the Hugging Face Transformers library on October 16, 2024, providing integration with the standard model hub infrastructure and the transformers API.
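Loading the converted weights through that integration is a one-liner; the checkpoint identifier below is illustrative, and the available repositories should be checked on the Hugging Face Hub.

```python
from transformers import MoshiForConditionalGeneration

# Illustrative checkpoint name; see the Kyutai organization on the Hub for
# the transformers-compatible Moshiko/Moshika weights.
model = MoshiForConditionalGeneration.from_pretrained("kyutai/moshiko-pytorch-bf16")
print(sum(p.numel() for p in model.parameters()) / 1e9, "billion parameters")
```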
When Moshi was announced in July 2024, OpenAI's GPT-4o voice mode had been demonstrated but not yet made available for public use. OpenAI's Realtime API, which provides programmatic access to the same voice capabilities, was released in October 2024. This made Moshi and the OpenAI Realtime API the two primary options for developers building full-duplex voice applications as of late 2024.
The table below summarizes the main differences between the two systems at the time of the open-source release.
| Feature | Moshi (Kyutai) | OpenAI Realtime API |
|---|---|---|
| Architecture | End-to-end audio-native | Audio-native (GPT-4o) |
| Model size | 7.7B parameters | Undisclosed (GPT-4o scale) |
| Practical latency | ~200ms | 232-320ms |
| Full-duplex | Yes | Yes |
| Open source | Yes (CC-BY 4.0 weights) | No |
| Self-hosting | Yes (GPU or Apple Silicon) | No (API only) |
| Audio input cost | Free (self-hosted) | $0.06/min |
| Audio output cost | Free (self-hosted) | $0.24/min |
| Language support | Primarily English | Multiple languages |
| Reasoning quality | 7B-scale | GPT-4o-scale |
| Function calling | No | Yes |
| Context length | 4,096 tokens (roughly 5 minutes of audio) | 128,000 tokens (GPT-4o) |
| Commercial use | Yes (with attribution) | Yes (usage-based API pricing) |
The primary trade-off between the two systems is model capability versus cost and openness. GPT-4o is a much larger model with stronger reasoning, broader language support, function calling, and integration with OpenAI's broader ecosystem. Moshi operates at 7-billion-parameter scale, which limits the depth of its knowledge and its ability to handle complex multi-step questions. In informal testing by reviewers at Tom's Guide and Odisha TV, Moshi's responses were less coherent in long conversations than GPT-4o's, and it could become repetitive when corrected.
However, Moshi's openness has practical advantages that the Realtime API cannot offer. Developers can fine-tune Moshi for specific domains, run it on private infrastructure without sending audio to a cloud provider, deploy it on Apple Silicon hardware without an API subscription, and inspect and modify the model weights directly. For applications in healthcare, legal services, or any domain where audio data must remain on-premises, self-hosting Moshi is a credible option that the Realtime API does not support.
Kyutai later introduced Unmute (demonstrated May 2025, open-sourced July 2025) as a complementary product that wraps any text language model, including larger and more capable ones, with Kyutai's speech-to-text and text-to-speech technology. Unmute preserves the function calling, tool use, and reasoning capabilities of the underlying text model while adding the low-latency voice interface. Kyutai positioned Unmute as the solution for users who need stronger reasoning than Moshi provides, while Moshi remains the solution for applications where audio-native naturalness, emotion, and sub-200-millisecond latency are the highest priorities.
Voice assistants and companions. Moshi's architecture enables voice interfaces that respond without the half-duplex pauses that made earlier systems feel robotic. Developers have used it as a base for prototyping conversational agents where naturalness of interaction is valued over encyclopedic knowledge depth.
On-device and privacy-sensitive applications. The MLX backend for Apple Silicon and the ability to run quantized models on an iPhone 15 Pro make Moshi suitable for applications where audio cannot leave the device. Healthcare applications, personal journaling tools, and accessibility applications are areas where on-device voice AI has clear user trust advantages.
Customer support automation. At 7-billion-parameter scale, Moshi can handle defined question-and-answer domains effectively when fine-tuned on domain-specific conversational data. The low latency reduces the delay that makes automated phone systems frustrating, and full-duplex operation allows agents to interrupt and correct themselves in the way human agents do.
Research on spoken dialogue. As an open-weight model with a published technical paper, Moshi has been used in academic research on full-duplex dialogue, audio codec design, and the training of multimodal foundation models. The release of Helium separately has supported research on text language models for low-resource training regimes.
Multilingual speech processing via Hibiki. The Mimi codec and multi-stream architecture developed for Moshi underpin Hibiki, Kyutai's simultaneous speech translation model. Hibiki can translate French speech to English in real time while preserving the speaker's voice, with applications in live interpretation for conferences and media.
Streaming speech recognition. Moshi's Inner Monologue mechanism produces time-aligned transcripts of both the user and the model as a byproduct of the main generation process. This can serve as a streaming ASR system, and Kyutai later released Kyutai STT as a standalone speech-to-text product derived from the same technology.
Moshi has several documented limitations that users and developers should take into account.
The model is primarily trained on English and performs substantially worse on other languages. While the Mimi codec handles multilingual audio, the Helium backbone and the instruction fine-tuning data are English-focused. French performance is better than other non-English languages given Kyutai's location and team composition, but still below the English baseline.
At 7-billion-parameter scale, Moshi's factual knowledge and reasoning depth are limited compared with larger proprietary models. In reviews, testers found that Moshi could answer general knowledge questions adequately but struggled with detailed technical questions, current events after its training cutoff, and multi-step reasoning chains.
Long conversations degrade in coherence. The 4,096-token context limit, which at the 12.5 Hz frame rate corresponds to roughly five minutes of audio, means that information from early in a conversation becomes unavailable to the model. In extended conversations, Moshi can lose track of prior context and become repetitive or inconsistent.
Moshi has no function calling or tool use capability. It cannot query external databases, execute code, or call APIs during inference. For applications that require these capabilities, Kyutai's Unmute product wrapping a capable text model is the recommended alternative.
The model has no explicit understanding of emotion in the user's speech beyond what is implicit in the acoustic signal. While the multi-stream architecture means Moshi receives the user's audio rather than a transcript, and can in principle respond to prosody and tone, the instruction fine-tuning data was not specifically designed to train nuanced emotional responsiveness.
Safety evaluation by Kyutai identified toxicity and data memorization as areas requiring ongoing monitoring. The team explored both signal-based and generative audio watermarking approaches to help detect Moshi-generated audio, but these are not mandatory in the released model.
Hibiki is a simultaneous speech translation model released by Kyutai on February 5, 2025. It uses the same multi-stream architecture as Moshi and the Mimi codec to process source speech in one stream and generate translated target speech in another. Unlike offline translation systems that wait for a complete utterance before translating, Hibiki translates in chunks as the speaker continues, accumulating enough context to produce a correct translation without waiting for sentence boundaries.
At release, Hibiki supported French-to-English translation only. The model is available in two sizes: Hibiki 2B, a 2-billion-parameter variant intended for server deployment, and Hibiki 1B (also called Hibiki-M), a 1-billion-parameter variant designed for on-device inference on smartphones. The model preserves the speaker's voice characteristics in the translated output through classifier-free guidance for voice similarity, a feature that distinguishes it from text-based translation systems.
Kyutai described Hibiki as the first model to provide an experience of interpretation close to human professional interpreters in quality and pace. The weights and code carry the same licensing terms as Moshi: MIT and Apache 2.0 for code, CC-BY 4.0 for model weights.
MoshiVis was released in March 2025 as a vision-augmented extension of Moshi. It adds a frozen 400-million-parameter PaliGemma2 vision encoder and approximately 206 million cross-attention adapter parameters, allowing Moshi to take an image as an additional input and discuss it in natural spoken conversation. The added latency is approximately 7 milliseconds per inference step on a Mac mini with an M4 Pro chip, keeping the total inference step time at around 55 milliseconds and well within real-time constraints.
MoshiVis is the first open-source model to support real-time speech-to-speech conversation about visual inputs. Applications include spoken description of images for visually impaired users, conversational product inspection, and live scene description.
Unmute, released as open source on July 3, 2025, is a voice layer for text language models. It wraps Kyutai's speech-to-text and text-to-speech components around any text-based LLM, converting it to a voice interface without requiring audio-native retraining. The speech-to-text component uses semantic voice activity detection that avoids cutting off mid-sentence, and the text-to-speech component begins speaking before the full response has been generated to minimize perceived latency.
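The cascade can be sketched as a short pipeline. The stand-in functions below are placeholders, not Kyutai's components; the point is that speech synthesis starts on the first generated tokens rather than after the full response.

```python
def streaming_stt(audio_frames):
    # Stand-in for Kyutai STT with semantic voice activity detection.
    yield from ["What", "is", "full", "duplex?"]

def llm_stream(transcript):
    # Stand-in for any text LLM streaming its answer token by token.
    yield from ["Full", "duplex", "means", "both", "sides", "talk", "at", "once."]

def tts_speak(text_chunk):
    print(f"[speaking] {text_chunk}")     # stand-in for streaming TTS

def unmute_style_pipeline(audio_frames, min_chunk=3):
    transcript = list(streaming_stt(audio_frames))   # end of turn decided semantically
    buffer = []
    for token in llm_stream(transcript):
        buffer.append(token)
        if len(buffer) >= min_chunk:                 # start speaking before the answer ends
            tts_speak(" ".join(buffer))
            buffer = []
    if buffer:
        tts_speak(" ".join(buffer))

unmute_style_pipeline(audio_frames=[])
```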
Unmute allows developers to combine the reasoning and tool-use capabilities of large text models with the audio quality and low latency of Kyutai's speech technology. Kyutai positioned it as the practical complement to Moshi: Moshi for audio-native naturalness, Unmute for complex task completion.