Kyutai
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,753 words
Kyutai is a privately funded nonprofit artificial intelligence research laboratory based in Paris, France, founded in November 2023 with an initial commitment of approximately 300 million euros pledged over five years.[^1][^2] The lab was created jointly by the French telecommunications group Iliad (Xavier Niel), the shipping and logistics group CMA CGM (Rodolphe Saadé), and the philanthropic foundation Schmidt Futures / Schmidt Sciences (Eric Schmidt), and was announced at the Scaleway ai-PULSE conference on 17 November 2023.[^1][^2][^3] Kyutai positions itself as an open-science laboratory: every major release is accompanied by model weights, inference code, and a technical report, in contrast with the closed-weights approach taken by most American frontier laboratories.[^3][^4] Since 2024, the lab has concentrated on speech and audio: it has shipped the full-duplex spoken-dialogue model Moshi, the streaming neural audio codec Mimi, the speech-to-speech translation model Hibiki, the small multilingual language model Helium, and the modular voice stack Unmute, among other artifacts.[^4][^5][^6][^7][^8]
| Field | Value |
|---|---|
| Type | Privately funded nonprofit research laboratory |
| Founded | November 2023 (announced 17 November 2023, Paris)[^1] |
| Headquarters | Paris, France[^1] |
| Founding backers | Iliad (Xavier Niel), CMA CGM (Rodolphe Saadé), Schmidt Futures (Eric Schmidt)[^1][^2] |
| Initial funding | Approximately 300 million euros over five years[^1][^2] |
| Chief Executive | Patrick Pérez[^9] |
| Scientific advisors | Yann LeCun, Yejin Choi, Bernhard Schölkopf[^3] |
| Compute | Access to a 1,000-GPU NVIDIA H100 cluster at Scaleway[^3] |
| Headline releases | Moshi (2024-09-18), Mimi (2024-09), Helium-1 (2025-01 preview, 2025-04 full release), Hibiki (2025-02), Unmute (2025-05), Pocket TTS (2026-01)[^4][^5][^6][^7][^8][^10] |
| Default license | Code under MIT / Apache 2.0, weights under CC-BY 4.0[^4][^7] |
Kyutai was unveiled on 17 November 2023 at Scaleway's ai-PULSE conference in Paris, in front of an audience that included French president Emmanuel Macron.[^2][^3] The announcement framed Kyutai as Europe's first independent privately funded nonprofit dedicated specifically to open AI research, modelled in spirit on the early Mila and FAIR labs but operating without a parent corporation.[^2][^3] Iliad founder Xavier Niel committed 100 million euros through his telecoms group, CMA CGM head Rodolphe Saadé committed a further 100 million euros through the shipping group, and Eric Schmidt contributed an undisclosed amount via Schmidt Futures (later renamed Schmidt Sciences), bringing the publicly reported envelope to roughly 300 million euros over five years.[^1][^2][^3] Press coverage at launch frequently quoted the dollar-equivalent figure of about 330 million USD.[^1]
A central decision at founding was to release every artifact, including weights, inference code, training recipes, and a written technical report, under permissive licenses.[^3][^4] Kyutai's leadership has repeatedly described this approach as "open science" rather than merely "open source", because the laboratory commits not only to releasing artifacts but also to documenting the experimental decisions that produced them.[^3][^11]
The six scientists named at launch were Patrick Pérez (formerly scientific director of Valeo.ai), Hervé Jégou (formerly Director of Research at Meta FAIR), Edouard Grave (formerly Meta FAIR), Alexandre Défossez (formerly Meta FAIR), Neil Zeghidour (formerly Google DeepMind), and Laurent Mazaré (formerly Google DeepMind).[^9][^11][^12] Several of these researchers were already well known in the speech and audio community: Défossez was a lead author of EnCodec, Zeghidour was the lead author of SoundStream, and Jégou had been involved with FAISS and large-scale retrieval work at Meta. Pérez took the role of CEO, and the lab named Yann LeCun (Meta), Yejin Choi (then at the University of Washington), and Bernhard Schölkopf (Max Planck Institute) as scientific advisors.[^3]
Because Iliad also controls the French cloud provider Scaleway, Kyutai negotiated at-cost access to a dedicated cluster of 1,000 NVIDIA H100 GPUs hosted by Scaleway in Paris.[^3] This arrangement gave the laboratory training capacity comparable, at founding, to a mid-sized commercial laboratory and allowed it to pretrain multibillion-parameter speech and language models in-house rather than renting hyperscaler capacity.[^3][^13]
Kyutai's first headline release came in July 2024, when it held a public demonstration of a real-time spoken conversational model in Paris.[^4] The model, named Moshi, was released with code and a technical report on 18 September 2024 (arXiv:2410.00037, dated 17 September with a minor revision on 2 October).[^4][^13] Moshi was the first full-duplex spoken-dialogue model released with open weights, with a self-reported theoretical end-to-end latency of 160 milliseconds and a measured practical latency of approximately 200 milliseconds.[^4][^13] Released alongside Moshi was the streaming neural audio codec Mimi, which Moshi uses to tokenize its input and output audio at 12.5 Hz.[^4][^5][^13]
Through 2025 the laboratory broadened its release portfolio. In January 2025 it released a preview of Helium-1, a 2-billion-parameter multilingual text model, with the full model card and a paper on the modular variant published in April 2025.[^6][^14] In February 2025 it released Hibiki, a streaming French-to-English speech-to-speech translation model.[^7][^15] In May 2025 the laboratory unveiled Unmute, a modular pipeline that wraps any text-only large language model with streaming speech-to-text and text-to-speech components.[^8][^16] In January 2026 it released Pocket TTS, a 100-million-parameter text-to-speech model designed to run in real time on a CPU; multilingual support for six languages (English, French, German, Spanish, Portuguese and Italian) followed on 4 May 2026.[^10][^17]
Kyutai is organised as a privately funded nonprofit under French law and does not have shareholders or commercial customers.[^1][^2] Its founding documents describe the mission as "building and democratizing artificial general intelligence through open science", with an emphasis on releasing every artifact, including training data and code where copyright permits, alongside the model weights themselves.[^11][^3]
The 300-million-euro envelope reported at launch is structured as a five-year commitment.[^1][^2] Of that, Iliad and CMA CGM each pledged 100 million euros, while Schmidt Futures contributed an undisclosed remainder.[^1][^2] Because there is no commercial product or customer revenue, the laboratory is not under pressure to ship a paid service, and its outputs (papers, weights, code, datasets) are explicitly framed as a public-research deliverable.[^3][^11] The board includes representatives from each of the three backers, and operational independence is documented in the founding statutes (as described in Iliad's launch press release).[^2]
The laboratory operates with a relatively small permanent staff: its public team page in May 2026 lists six leadership roles, six technical-staff researchers, several postdoctoral researchers and PhD students, and a small operations team.[^9] In addition to Pérez as CEO, the published organisation chart names Edouard Grave as Chief Language Officer, Laurent Mazaré as Head of Engineering, Alexandre Défossez as Head of Audio Research, Sarah Hôte as Head of Operations, and Jennifer Coscas as Head of Legal and Partnerships.[^9] Neil Zeghidour is listed as Audio Research Advisor, and Hervé Jégou, who co-founded the laboratory and contributed to early Moshi work, is listed in May 2026 as a notable alumnus.[^9]
The following sections summarise the principal artifacts released or co-released by Kyutai through May 2026. All releases are distributed from the kyutai-labs organisation on GitHub and the kyutai organisation on Hugging Face.[^18][^19]
Mimi is the neural audio codec at the centre of Kyutai's audio stack. It processes 24 kHz mono audio, produces a discrete-token representation at 12.5 Hz, and operates at a bitrate of 1.1 kbps.[^5][^13] Architecturally, Mimi pairs a convolutional encoder-decoder with a residual vector quantiser of 8 codebooks (one semantic codebook trained via distillation plus seven acoustic codebooks), and includes a small transformer block in the bottleneck.[^5][^20] The default published configuration uses a 24 kHz sampling rate, a hidden size of 512, a codebook size of 2,048, and 32 quantisers.[^20]
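As a rough illustration of residual vector quantisation in general (not Mimi's trained quantiser), the sketch below quantises a vector in stages, with each stage encoding the residual left by the previous one. The toy codebooks, the two-dimensional vectors, and the helper names are all made up for illustration; a real codec learns much larger codebooks over high-dimensional latents.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantise x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical hand-written codebooks: a coarse stage followed by a finer stage.
TOY_CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

codes = rvq_encode((1.2, 0.9), TOY_CODEBOOKS)
reconstruction = rvq_decode(codes, TOY_CODEBOOKS)
```

Each additional stage refines the reconstruction, which is why a decoder can trade quality for bitrate by reading only the first few codebooks.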
A key engineering property of Mimi is that it is fully streaming and causal: the encoder uses causal convolutions and can be fed one frame at a time, with an 80-millisecond frame size matching Moshi's 12.5 Hz token rate.[^5][^13] In the Moshi paper Kyutai reports that Mimi outperforms prior streaming codecs such as EnCodec and SoundStream in subjective listening quality at comparable bitrate, while also serving as a semantic tokeniser of the kind required for downstream speech language modelling.[^4][^13] In the technical report the semantic-token stream is described as being obtained via distillation from a self-supervised speech encoder, an approach inspired by SpeechTokenizer.[^4][^21]
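The figures quoted for Mimi can be cross-checked with a few lines of arithmetic. The sketch below derives the frame duration, the number of samples per frame, and the bitrate from the sampling rate, frame rate, and codebook size given in the text. The codebook count of 8 is an assumption chosen because it reproduces the quoted 1.1 kbps, and `iter_frames` is a hypothetical helper showing frame-by-frame chunking, not part of any Kyutai API.

```python
import math
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 24_000   # from the text
FRAME_RATE_HZ = 12.5      # from the text
CODEBOOK_SIZE = 2_048     # from the published configuration
NUM_CODEBOOKS = 8         # assumption: the count that reproduces 1.1 kbps

frame_ms = 1000 / FRAME_RATE_HZ                           # milliseconds per frame
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)   # audio samples per frame
bits_per_code = math.log2(CODEBOOK_SIZE)                  # bits needed per codebook index
bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code


def iter_frames(samples: Sequence[float],
                frame_len: int = samples_per_frame) -> Iterator[List[float]]:
    """Yield consecutive full frames; a trailing partial frame is held back."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield list(samples[start:start + frame_len])


one_second = [0.0] * SAMPLE_RATE_HZ
full_frames = list(iter_frames(one_second))
```

One second of audio yields 12 full frames with half a frame left over, which a streaming encoder would buffer until the next 960 samples arrive.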
Mimi was released in September 2024 alongside Moshi, with weights under CC-BY 4.0 and code under MIT (Python) and Apache 2.0 (Rust).[^4][^18] It has since been integrated as a first-class model in the Hugging Face transformers library under the MimiModel class.[^20]
Moshi is Kyutai's full-duplex spoken-dialogue model. It was publicly released on 18 September 2024; the corresponding technical report (arXiv:2410.00037) was authored by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.[^4][^13]
Architecturally, Moshi consists of a 7-billion-parameter temporal "backbone" Transformer that predicts the next set of multistream tokens at each 80-millisecond frame, paired with a smaller depth Transformer that models inter-codebook dependencies within a frame.[^4][^18] At each timestep the model jointly predicts: (i) the next text token of its own internal monologue, (ii) the next set of acoustic Mimi tokens corresponding to its own speech, and (iii) the next set of Mimi tokens for the user's input speech, which lets the model perform implicit voice-activity detection and simultaneous listening and speaking.[^4][^13] Kyutai calls this construction "Inner Monologue".[^4]
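To make the multistream layout concrete, the following sketch models one 80-millisecond frame as a small data structure holding the three kinds of token described above. It is an illustration only: the class and field names are hypothetical, and the figure of 8 codebooks per audio stream is an assumption rather than something stated in this article.

```python
from dataclasses import dataclass
from typing import List

FRAME_RATE_HZ = 12.5        # from the text
CODEBOOKS_PER_STREAM = 8    # assumption for illustration


@dataclass
class MultistreamFrame:
    """All tokens associated with a single frame (hypothetical layout)."""
    text_token: int           # inner-monologue text token
    model_audio: List[int]    # Mimi codes for the model's own speech
    user_audio: List[int]     # Mimi codes for the user's speech

    def flatten(self) -> List[int]:
        """Order the streams as text first, then model audio, then user audio."""
        return [self.text_token, *self.model_audio, *self.user_audio]


def tokens_per_second(frame: MultistreamFrame,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Total tokens handled per second of dialogue across all streams."""
    return len(frame.flatten()) * frame_rate_hz


frame = MultistreamFrame(
    text_token=42,
    model_audio=list(range(CODEBOOKS_PER_STREAM)),
    user_audio=list(range(CODEBOOKS_PER_STREAM)),
)
```

Under these assumptions each frame carries 17 tokens. A two-level design of this kind lets the large temporal Transformer run once per frame while the smaller depth Transformer handles the tokens within it.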
The Moshi paper reports a theoretical latency of 160 ms (80 ms for one Mimi frame plus 80 ms of acoustic delay) and a measured latency of approximately 200 ms on a single NVIDIA L4 GPU.[^4][^13] Two voice-tuned variants were released: Moshiko, fine-tuned on a male synthetic voice, and Moshika, fine-tuned on a female synthetic voice; both are available in bfloat16, int8, and (via MLX) int4 quantisations through PyTorch, MLX, and Rust/Candle backends.[^18] Code is licensed under MIT (Python) and Apache 2.0 (Rust), and weights under CC-BY 4.0.[^18]
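The practical effect of the three quantisation levels can be estimated with back-of-the-envelope arithmetic. The sketch below counts only the bytes needed to store a 7-billion-parameter backbone at each precision. It is an approximation under stated assumptions: it ignores the depth Transformer, the Mimi codec, activations, the key-value cache, and any per-block scale factors that real quantisation formats add.

```python
# Nominal storage cost per parameter for each precision (assumed, ignoring overheads).
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

BACKBONE_PARAMS = 7e9  # 7-billion-parameter temporal Transformer, from the text


def weight_gigabytes(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


footprint = {dtype: weight_gigabytes(BACKBONE_PARAMS, dtype)
             for dtype in BYTES_PER_PARAM}
```

On this rough estimate the weights shrink from about 14 GB in bfloat16 to about 3.5 GB in int4, which is the difference between needing a data-centre GPU and fitting on a laptop.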
The Moshi release was widely reported in French and international press as the first credible open-research alternative to closed real-time voice systems such as the GPT-4o voice mode and the OpenAI Realtime API.[^4][^22]
Helium-1 is a 2-billion-parameter small language model aimed at edge and mobile deployments. A preview was released on 13 January 2025; a full release accompanied by an updated technical write-up appeared on 30 April 2025.[^6][^14][^23]
Helium-1 is a Transformer-based decoder model with a context window of 4,096 tokens, trained on roughly 2.5 trillion tokens drawn from a filtered subset of Common Crawl using Kyutai's open-source dactory data pipeline.[^6][^23] Training was performed on 64 NVIDIA H100 GPUs using JAX.[^14][^23] It supports all 24 official languages of the European Union and is distributed under a CC-BY licence on Hugging Face as kyutai/helium-1-2b.[^23][^24] Kyutai reports that Helium-1 performs at parity with or above similarly sized open baselines such as Qwen 2.5 (1.5B), Gemma 2 (2B), and Llama 3.2 (3B) on multilingual benchmarks.[^23] A token-level distillation from a 7-billion-parameter teacher is reported as part of the training recipe.[^23]
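For a sense of scale, the parameter and token counts above can be turned into a rough training-compute estimate. The sketch uses the common rule of thumb that training a dense Transformer costs about 6 floating-point operations per parameter per token. This heuristic is an assumption brought in for illustration and is not a figure reported by Kyutai; it also ignores the extra cost of running the teacher model during distillation.

```python
NUM_PARAMS = 2_000_000_000          # 2 billion parameters, from the text
NUM_TOKENS = 2_500_000_000_000      # roughly 2.5 trillion tokens, from the text


def approx_training_flops(num_params: int, num_tokens: int) -> int:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens


tokens_per_param = NUM_TOKENS / NUM_PARAMS
training_flops = approx_training_flops(NUM_PARAMS, NUM_TOKENS)
```

The resulting ratio of 1,250 tokens per parameter shows that the model is trained on far more data than its size alone would suggest, a common choice for small models intended for cheap inference.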
Beyond its standalone use, Helium-1 is also the language-model backbone of the CASA-Helium1-VL-2B vision-language model and of the speech-text releases that followed.[^25] In the April 2025 update Kyutai described Helium-1 as part of a wider research programme to build modular language models, where small specialised modules are combined at inference time rather than scaling a single dense monolith.[^14]
Hibiki is Kyutai's simultaneous speech-to-speech and speech-to-text translation model, released on 5 February 2025 with a press release the following day and a paper on arXiv (2502.03382) the same week.[^7][^15][^26] The authors are Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez and Neil Zeghidour.[^26]
Hibiki uses the multistream architecture pioneered in Moshi to model source and target speech jointly, producing aligned text and audio tokens for both languages at 12.5 Hz.[^7][^15][^26] The released model supports French-to-English only, although the architecture is language-pair agnostic.[^7][^15] The full Hibiki model is a 2.7-billion-parameter decoder-only Transformer using 16 RVQ streams; a distilled variant called Hibiki-M (1.7 billion parameters, 8 RVQ streams) targets on-device deployment on smartphones.[^7][^15] Training data consisted of approximately 7 million hours of English audio and 450,000 hours of French audio, supplemented with around 40,000 hours of synthetic parallel speech generated by aligning a text translation system with text-to-speech.[^15]
Kyutai reports an ASR-BLEU score of 30.5 on its French-to-English benchmark, a naturalness mean opinion score of 3.73/5 (compared with 4.12/5 for professional human interpreters), and a speaker-similarity score of 0.52 against a reference of 0.43 for Meta's Seamless translation model.[^15] The system supports batch inference of up to 320 concurrent sequences on a single H100 GPU.[^15] Weights are released under CC-BY 4.0 with code in PyTorch, MLX, MLX-Swift (iOS), and Rust.[^27]
Unmute is Kyutai's modular voice stack, introduced on 22 May 2025 and fully open-sourced on 3 July 2025.[^8][^16] Where Moshi is an end-to-end speech-text foundation model that handles dialogue itself, Unmute is designed to wrap any existing text-only large language model with low-latency streaming speech-to-text (STT) and text-to-speech (TTS) modules, plus a voice activity detection and turn-taking layer.[^8][^16] Kyutai reports end-to-end response times in the 200 to 350 millisecond range and the ability to clone a target voice from a 10-second audio sample.[^16] Unmute is distributed under MIT/Apache 2.0 with weights under CC-BY 4.0.[^8][^16]
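The cascaded design described above can be sketched as three pluggable stages. The code below is a minimal illustration, not Unmute's actual interface: `voice_turn` and the three toy stand-in functions are hypothetical, audio is represented as plain bytes, and the voice-activity and turn-taking layer is replaced by the simplifying assumption that the user's turn ends when the input stream ends.

```python
from typing import Callable, Iterable, Iterator, List

AudioChunk = bytes


def voice_turn(audio_chunks: Iterable[AudioChunk],
               stt: Callable[[AudioChunk], List[str]],
               llm: Callable[[str], Iterator[str]],
               tts: Callable[[str], AudioChunk]) -> Iterator[AudioChunk]:
    """Run one user turn through speech-to-text, a text LLM, and text-to-speech."""
    words: List[str] = []
    for chunk in audio_chunks:
        words.extend(stt(chunk))          # transcribe incrementally as audio arrives
    # Simplification: end of input stands in for a real turn-taking detector.
    for text_piece in llm(" ".join(words)):
        yield tts(text_piece)             # synthesise each piece as soon as it exists


# Toy stand-ins so the sketch runs without any models.
def toy_stt(chunk: AudioChunk) -> List[str]:
    return chunk.decode("utf-8").split()


def toy_llm(prompt: str) -> Iterator[str]:
    for word in f"you said: {prompt}".split():
        yield word


def toy_tts(text: str) -> AudioChunk:
    return text.encode("utf-8")


reply_audio = list(voice_turn([b"hello ", b"there"], toy_stt, toy_llm, toy_tts))
```

Because the language model is passed in as an ordinary callable, it can be swapped without touching the speech components, which is the property that distinguishes this design from an end-to-end model such as Moshi.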
Underpinning Unmute is the "Delayed Streams Modeling" framework, released as a separate repository (delayed-streams-modeling) on GitHub.[^19] Delayed Streams Modeling generalises Moshi's multistream design to arbitrary collections of input and output streams with explicit per-stream delays, and is used as the training framework for Kyutai's later TTS and STT models.[^19]
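The idea of explicit per-stream delays can be illustrated by shifting time-aligned token streams against one another. The sketch below is a simplified reading of the concept and does not reproduce the repository's code: `apply_delays`, the stream names, and the use of `None` as padding are all hypothetical.

```python
from typing import Dict, List, Optional

PAD = None  # placeholder for positions where a stream has no token yet


def apply_delays(streams: Dict[str, List[int]],
                 delays: Dict[str, int]) -> Dict[str, List[Optional[int]]]:
    """Shift each stream right by its delay (in frames) and pad to equal length."""
    total_len = max(len(tokens) for tokens in streams.values()) + max(delays.values())
    aligned: Dict[str, List[Optional[int]]] = {}
    for name, tokens in streams.items():
        shifted: List[Optional[int]] = [PAD] * delays[name] + list(tokens)
        shifted += [PAD] * (total_len - len(shifted))
        aligned[name] = shifted
    return aligned


# Speech-to-text style setup: the text stream lags the audio by two frames,
# so each text token is predicted after the matching audio has been seen.
aligned = apply_delays(
    streams={"audio": [1, 2, 3], "text": [10, 11, 12]},
    delays={"audio": 0, "text": 2},
)
```

Swapping the delays, so that audio lags text, turns the same arrangement into a text-to-speech setup, which is how one framework can cover both directions.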
Pocket TTS is Kyutai's CPU-targeted text-to-speech model, introduced on 13 January 2026 and extended to a multilingual configuration on 4 May 2026.[^10][^17] The model has approximately 100 million parameters and was designed to run in real time without a GPU, with reported sub-50-millisecond latency on a modern CPU.[^17] The May 2026 release adds support for six languages (English, French, German, Spanish, Portuguese, and Italian) while preserving the original 100-million-parameter footprint and improving the English quality.[^10] Pocket TTS is distributed open-source under MIT and CC-BY 4.0 from the kyutai-labs GitHub organisation.[^19]
Beyond the headline models, the kyutai-labs GitHub organisation hosts several auxiliary projects: moshi-finetune (training scripts for fine-tuning Moshi), moshi-rag (a retrieval-augmented variant of Moshi in Rust), tts_longeval (an evaluation toolkit for long-form text-to-speech), flashy (a training-loop framework), dactory (the Common Crawl filtering and data preparation pipeline used for Helium), and flash-attn3-jax (a JAX binding for FlashAttention 3).[^19] In late 2025 the organisation also published invincible-voice (a voice restoration project) and ovie (a monocular novel-view synthesis experiment) as part of a broader exploration beyond pure audio modelling.[^19]
A consistent set of technical commitments runs across Kyutai's releases.
Streaming and low-latency inference. Mimi, Moshi, Hibiki, and Unmute are all explicitly designed to consume audio frame-by-frame and produce output frame-by-frame, with documented end-to-end latency budgets in the few-hundred-millisecond range.[^4][^7][^16] This contrasts with batch encoders and decoders used by many open speech models such as Whisper.
Joint semantic and acoustic tokenisation. Mimi's design adopts the SpeechTokenizer pattern of distilling a self-supervised semantic encoder into the first codebook of a residual vector quantiser while training the remaining codebooks on acoustic reconstruction.[^4][^5] This dual representation is critical for Moshi's Inner Monologue construction, where text tokens are interleaved with acoustic tokens.[^4][^13]
Multistream language modelling. Moshi's contribution of jointly predicting text, the model's own audio, and the user's audio at every frame is generalised in the Delayed Streams Modeling repository to arbitrary numbers of streams.[^19] This framework supports tasks such as simultaneous translation (Hibiki) and streaming STT/TTS (Unmute) using the same underlying machinery.[^7][^16][^19]
Small, specialised, edge-deployable models. Helium-1 (2B parameters), Hibiki-M (1.7B), and Pocket TTS (100M) are explicitly designed for resource-constrained deployment.[^6][^7][^17] In the April 2025 Helium write-up, Kyutai frames this as part of a research programme of moving away from monolithic giant models toward modular composition of small specialist models.[^14]
Full open release including training pipelines. Beyond model weights, Kyutai has released dactory (Helium's data pipeline), flashy (a generic training loop), and the Delayed Streams Modeling training framework, which together make several of the released artifacts reproducible in principle.[^14][^19]
Kyutai's models have been adopted as building blocks in third-party projects across speech recognition, translation, and conversational interfaces. The Moshi GitHub repository accumulated more than 10,000 stars within months of release, and the Mimi codec was upstreamed into the Hugging Face transformers library, where it is exposed under the MimiModel class.[^18][^20]
In the speech recognition space, Moshi-derived components have been promoted as low-latency open alternatives to OpenAI Whisper for streaming dictation and live captioning, with macOS-focused tutorials describing self-hosted installations of the Moshi STT stack.[^28] In the translation space, Hibiki has been characterised in independent industry coverage (for example by the localisation-industry publication Slator) as the first open-source simultaneous speech-to-speech translation system to approach the quality of commercial offerings such as Meta's Seamless.[^29] Unmute has been described by independent commentators as a way to add ultra-low-latency voice to any existing text LLM without retraining the LLM itself, with reported response times in the 200-350 ms range.[^16][^30]
Within Kyutai itself, the Helium-1 language model is reused as the backbone for CASA-Helium1-VL-2B, a vision-language model built on Helium-1 and a fine-tuned Qwen2.5-VL-3B image encoder using a cross-attention fusion strategy called CASA (Cross-Attention via Self-Attention).[^25] CASA-Qwen2_5-VL-3B and a live-captioning variant, CASA-Qwen2_5-VL-3B-LiveCC, have also been published on Hugging Face.[^25]
Independent coverage and the laboratory's own technical reports identify several limitations.
Limited language coverage in speech models. Moshi at launch was English-only, and Hibiki at launch supports only French-to-English translation, despite an architecture that is in principle language-pair agnostic.[^4][^7][^15] Coverage in the Slator article on Hibiki noted that this limits practical applicability for many users.[^29]
Latency vs. quality trade-off. Moshi's 200 ms latency on an L4 GPU is achieved with a 7-billion-parameter backbone running in bfloat16; users on more modest hardware report latencies several times higher. The quality of the generated speech, while strong for an open model of its class, remains below that of closed systems such as GPT-4o voice on certain prosodic dimensions.[^4][^13][^22]
Scale. Helium-1 at 2 billion parameters is significantly smaller than the frontier models released by Mistral AI, Meta AI, or OpenAI during the same period, reflecting Kyutai's deliberate focus on edge-deployable models rather than direct competition for top general-knowledge benchmark scores.[^6][^14] Coverage at Helium-1's release flagged that the 2B-parameter scale, while competitive within its size class, would not match the absolute quality of models an order of magnitude larger.[^23]
Funding runway. The 300 million euro envelope was reported as a five-year commitment, leaving open questions about Kyutai's long-term financial sustainability after 2028 in the absence of commercial revenue.[^1][^2] Press coverage at launch raised this point explicitly when comparing Kyutai to commercial European AI laboratories such as Mistral AI and Aleph Alpha.[^31]
Voice cloning risks. Like other recent voice-cloning systems built on neural codecs (for example those in the lineage of VALL-E), Pocket TTS and Unmute can produce a synthetic voice from a 10-second sample, raising the usual concerns about impersonation and consent.[^16][^17] Kyutai documents these risks in its release notes and attaches a CC-BY 4.0 licence and acceptable-use language to the weights, but no technical watermarking is documented at the time of writing.[^16][^17]
Kyutai is one of a small group of Paris-headquartered AI research organisations that emerged after the 2022-2023 wave of large language model commercialisation. The comparison below summarises how it relates to its closest neighbours.
| Organisation | HQ | Structure | Headline focus | Open weights? | Founded |
|---|---|---|---|---|---|
| Kyutai | Paris | Private nonprofit | Speech, audio, modular small LMs | Yes (CC-BY 4.0)[^4][^7] | November 2023[^1] |
| Mistral AI | Paris | For-profit company | General LLMs, code, OCR | Mixed (some open, some commercial)[^32] | April 2023[^32] |
| Hugging Face | New York / Paris | For-profit company | Hub, libraries, hosting | Distributor of open models[^33] | 2016[^33] |
| Meta FAIR (Paris) | Paris | Corporate research arm | LLMs, vision, audio (Llama, AudioCraft, MusicGen) | Mixed (Llama open-weights)[^34] | 2015 (Paris site)[^34] |
| Google DeepMind (Paris) | Paris / London | Corporate research arm | Frontier LLMs, robotics | Selective (e.g. Gemma)[^35] | 2014 (DeepMind acquisition)[^35] |
Relative to Mistral AI, Kyutai differs both in structure (nonprofit vs. for-profit) and in research scope (audio-first vs. general text-LLM-first).[^31][^32] Relative to Hugging Face, Kyutai is a research producer rather than a distribution platform, although in practice it relies on Hugging Face to host its weights.[^19][^33] Relative to Meta FAIR's Paris office, with which it shares several former employees including Jégou, Grave and Défossez, Kyutai operates independently of any commercial product roadmap.[^9][^11] And relative to Google DeepMind, where Zeghidour and Mazaré previously worked on streaming codecs and audio language models, Kyutai's distinctive trait is its commitment to fully releasing the resulting artifacts.[^9][^11]
A 2026 TechCrunch survey of European AI startups described Kyutai as "the leading European open-research lab on speech", with Mistral AI and a younger generation of vertical companies (legal, biomedical, robotics) building atop the same continental AI ecosystem.[^36]
Kyutai matters for three intertwined reasons.
First, it offers a concrete demonstration that a privately funded nonprofit can ship state-of-the-art research artifacts in a competitive domain (real-time speech), at a scale (multibillion-parameter models trained on 1,000-GPU clusters) usually associated with commercial laboratories.[^3][^4]
Second, its full-disclosure release pattern (weights, inference code, training code where possible, paper) has set a higher reference bar for "open" releases in the speech and audio community, alongside contemporary efforts such as Meta FAIR's AudioCraft release and the Hugging Face Transformers reference implementations.[^4][^7][^20]
Third, Kyutai has helped to anchor a Paris-based concentration of speech and audio expertise, drawing on senior alumni of Meta FAIR Paris and Google DeepMind Paris, and producing junior researchers who train at the laboratory and then move into the broader French and European AI ecosystem.[^9][^11][^36]
Through 2025 and the first half of 2026, the most consequential developments at Kyutai have been the steady expansion of the speech and audio stack and the start of work beyond pure audio.
The KE:SAI announcement is notable because it represents Kyutai's first substantial extension beyond pure audio and language research, into embodied and physical AI in collaboration with a European partner institution.[^11]
Kyutai's releases sit within a wider lineage of work on streaming neural audio codecs and speech-text foundation models, including SoundStream (the earlier 2021 Google streaming codec, co-authored by Kyutai's later co-founder Zeghidour), EnCodec (Meta's 2022 streaming codec, co-authored by Kyutai's later co-founder Défossez), Meta's AudioCraft toolkit and MusicGen release, and Microsoft's VALL-E (a neural codec language model for zero-shot TTS).[^5][^21][^37]
In language modelling, Helium-1 belongs to the family of small language models that includes Mistral 7B, Gemma 2B, and Llama 3.2 3B; comparisons against these are reported in the Helium-1 release notes.[^23][^32]
In streaming translation, Hibiki is best understood in comparison with Meta's Seamless model family and with prior cascaded approaches to simultaneous interpretation; Hibiki's distinguishing claim is the joint preservation of voice characteristics and timing alongside translation quality.[^15][^29]