CosyVoice
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,664 words
CosyVoice is a family of large-scale neural text-to-speech (TTS) and voice cloning models developed by the Tongyi Speech Lab at Alibaba Group. The first version was introduced in July 2024 in the paper "CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens".[1] The system pairs an autoregressive language model that maps text to discrete speech tokens with a conditional flow matching acoustic decoder that converts those tokens into mel spectrograms, which are then rendered to waveforms by a HiFi-GAN-derived vocoder.[1] The design's distinguishing feature is its use of supervised semantic tokens derived from an automatic speech recognition (ASR) encoder rather than the unsupervised acoustic tokens used by codec-LM systems such as VALL-E.[1] Three numbered releases have appeared in quick succession: CosyVoice (July 2024); CosyVoice 2 (December 2024), with a streaming-first redesign; and CosyVoice 3 (paper in May 2025, open-sourced in December 2025), with a 1-million-hour training corpus and an expanded set of supported languages and dialects.[2][3][4]
| Field | Value |
|---|---|
| Developer | Tongyi SpeechTeam, Alibaba Group (FunAudioLLM project)[5] |
| First release | July 2024 (arXiv:2407.05407)[1] |
| Latest open release | Fun-CosyVoice3-0.5B-2512, December 2025[4] |
| Architecture | Autoregressive text-to-token LM + conditional flow matching + HiFT/HiFi-GAN vocoder[1] |
| Code license | Apache License 2.0[6] |
| Open repository | github.com/FunAudioLLM/CosyVoice[7] |
| Checkpoints | Hugging Face and ModelScope[7][8] |
| Languages (v3) | Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian, plus 18+ Chinese dialects[4][8] |
CosyVoice was released by the Tongyi Speech Lab (also referred to as the Tongyi SpeechTeam) within Alibaba Group as part of the broader FunAudioLLM project, an umbrella effort that paired two complementary foundation models: a speech-understanding model called SenseVoice and a speech-generation model called CosyVoice.[5] A companion technical report, "FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs", was posted to arXiv on July 4, 2024, three days before the CosyVoice paper itself.[5] FunAudioLLM positioned the two models as building blocks for applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.[5]
The CosyVoice paper (arXiv:2407.05407) was submitted on July 7, 2024 and revised on July 9, 2024, with Zhihao Du as lead author and a 12-author byline including Qian Chen, Shiliang Zhang, and Zhijie Yan.[1] At the time, the dominant paradigm for large-model TTS in the open research literature was Microsoft's VALL-E, which treated speech as a sequence of discrete tokens emitted by a neural audio codec (VALL-E used Encodec-style tokens).[9] The CosyVoice authors argued that codec-style tokens "lack explicit semantic information and alignment to the text," motivating a tokenizer trained with ASR supervision so that the resulting discrete units carry phonetic and linguistic content directly.[1]
CosyVoice 2, "Scalable Streaming Speech Synthesis with Large Language Models", was posted on December 13, 2024 and revised through December 25, 2024 (v3 on arXiv).[2] The second version redesigned the architecture around streaming and real-time chat: a single unified language model that can emit speech either offline or in interleaved streaming mode, a chunk-aware causal flow matching decoder, and finite-scalar quantization (FSQ) in place of the original vector quantizer.[2] CosyVoice 2 also removed the explicit speaker embedding and the dedicated text encoder used in version 1, relying instead on a pretrained Qwen 2.5 LM backbone with simplified conditioning.[2]
CosyVoice 3, "Towards In-the-wild Speech Generation via Scaling-up and Post-training", appeared on arXiv on May 23, 2025 (revised May 27, 2025) with Zhihao Du as lead author and a 21-author byline.[3] The paper documents a 100x expansion of training data from 10,000 hours to 1 million hours, a scaling of the language model from 0.5B to 1.5B parameters, a new MinMo-based tokenizer trained on multi-task speech understanding objectives, and a post-training procedure called Differentiable Reward Optimization (DiffRO).[3] An accompanying open-source checkpoint, Fun-CosyVoice3-0.5B-2512, was published on December 15, 2025 to Hugging Face and ModelScope under the same Apache 2.0 code license as previous versions, and the larger 1.5B configuration is documented in the paper.[4][8] The hosted version of the model is also offered via Alibaba Cloud's Model Studio (Bailian) speech APIs.[10]
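At a high level, every CosyVoice release follows the same three-stage pipeline introduced in version 1: an autoregressive language model maps text to discrete speech tokens, a conditional flow matching decoder converts those tokens into mel spectrograms, and a vocoder renders the spectrograms to waveforms.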
The three releases differ in how each stage is parameterized and how they are wired together for streaming.
The original CosyVoice paper introduces a tokenizer the authors abbreviate as S3 (Supervised Semantic Speech tokenizer).[11] The construction is straightforward: take a multilingual ASR encoder, split it after an early layer, insert a vector quantization (VQ) bottleneck with a single codebook of 4,096 entries, and continue training end-to-end with the CTC/ASR loss applied at the encoder output.[1][11] Because the model is supervised by the ASR objective, the quantized indices retain phonetic and linguistic content. The paper uses an ESPnet Conformer ASR model as the backbone for small-scale single-language experiments and SenseVoice-Large for multilingual experiments, with the VQ layer inserted after the first six encoder layers.[1][11] The token rate is 25 Hz (25 tokens per second of speech).[12]
This design choice is what most clearly distinguishes CosyVoice from codec-LM systems. Unsupervised neural audio codecs such as EnCodec and SoundStream optimize for reconstruction quality; their tokens compress acoustic information densely but do not align cleanly with text. Codec-LM TTS systems such as VALL-E therefore have to learn the text-to-acoustics mapping end-to-end through the language model, which is data-hungry and can be unstable for long sequences.[9] CosyVoice's S3 tokens, by contrast, are essentially a discretization of an ASR encoder's phonetic representation. Because each token already corresponds to something close to a phonetic state, the downstream language model has an easier job and can be trained on less data per language.[1]
CosyVoice 2 keeps the supervised-tokenizer idea but replaces VQ with finite-scalar quantization (FSQ). Encoder activations are projected into a low-rank D-dimensional space and each dimension is quantized independently into a small integer range [-K, K], yielding an implicit codebook size of (2K+1)^D.[2] FSQ is significantly easier to train than VQ, avoids the dead-codes problem, and in CosyVoice 2 reaches 100% codebook utilization (6,561 effective tokens) compared with only 23% for the VQ tokenizer's 4,096-entry codebook in version 1.[2] The CosyVoice 2 tokenizer is built atop SenseVoice-Large, with six transformer blocks using rotary position embeddings (RoPE).[12]
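The quantization step described above can be illustrated with a minimal pure-Python sketch. It assumes the activations have already been projected to D dimensions; the tanh bounding, the shift of each digit by K, and the function names are illustrative assumptions rather than details taken from the paper. With K = 1 and D = 8 the implicit codebook has 3^8 = 6,561 entries, which matches the figure reported for CosyVoice 2.

```python
import math

def fsq_quantize(h, K=1):
    """Quantize each dimension of h independently to an integer in [-K, K]."""
    # Assumption: tanh bounds each value to (-1, 1) before scaling and rounding.
    return [int(round(K * math.tanh(x))) for x in h]

def fsq_index(q, K=1):
    """Map the quantized digits to one token index, reading them in base 2K+1."""
    base = 2 * K + 1
    index = 0
    for j, digit in enumerate(q):
        # Shift each digit from [-K, K] to [0, 2K] so the index is non-negative.
        index += (digit + K) * base ** j
    return index

def codebook_size(K, D):
    """Size of the implicit codebook: (2K+1)^D."""
    return (2 * K + 1) ** D
```

Because every combination of digits is a valid code, there is no learned codebook that can collapse, which is one way to read the reported 100% utilization.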
CosyVoice 3's tokenizer was rebuilt on top of a different speech foundation model, MinMo, which itself was pretrained on 1.4 million hours of speech across multi-task objectives (ASR, language identification, speech emotion recognition, audio event detection, and speaker analysis).[3] The FSQ module is again inserted into the encoder, but because MinMo was trained on more diverse data and on prosodically rich objectives such as emotion recognition, the resulting tokens carry more paralinguistic information than the v1/v2 tokens. The paper credits this richer tokenizer with improved prosody naturalness, especially on emotion-conditioned generation.[3] Token rate remains 25 Hz.[3]
In the original CosyVoice, the text-to-token model is an autoregressive transformer that takes a sequence [S, v, {ȳ_u}, T, {μ_l}, E], where S and E mark the start and end of the sequence, v is a speaker embedding, ȳ_u are encoded text tokens, T is a "turn of speech" token that separates text from speech, and μ_l are speech tokens.[11] Training uses teacher forcing, with cross-entropy applied only to the predicted speech tokens and the end-of-sequence marker.[11] A separate text encoder (initialized from a small BPE LM) projects the input text into the LM's representation space.
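The sequence layout and the masked loss can be sketched as follows. This is a toy illustration, not the project's training code: tokens are plain Python values, the `IGNORE` marker and the function name are hypothetical, and T is read here as the marker that separates the text from the speech tokens.

```python
IGNORE = -100  # hypothetical marker for positions excluded from the loss

def build_training_pair(speaker_emb, text_tokens, speech_tokens):
    """Return (inputs, targets) for the layout [S, v, text..., T, speech..., E]."""
    seq = ["<S>", speaker_emb, *text_tokens, "<T>", *speech_tokens, "<E>"]
    # Teacher forcing: the model sees every item except the last one and is
    # asked to predict the next item at each position.
    inputs = seq[:-1]
    first_speech = 2 + len(text_tokens) + 1  # index of the first speech token
    targets = [
        seq[pos] if pos >= first_speech else IGNORE
        for pos in range(1, len(seq))
    ]
    return inputs, targets
```

Only the positions whose target is a speech token or the end marker contribute to the cross-entropy; the prefix (S, v, text, T) acts purely as conditioning.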
CosyVoice 2 simplifies this in two important ways. First, the dedicated text encoder is removed and the LM is initialized directly from a pretrained large language model: the v2 release uses Qwen 2.5-0.5B as the backbone, so the speech-token vocabulary is appended to the existing LLM vocabulary and the rest of the model can be fine-tuned on text-plus-speech data.[2] Second, the explicit speaker embedding v is dropped because the authors observed it was leaking content information into the speaker channel; the model instead conditions on a short reference utterance encoded by the same tokenizer.[2] These two changes shorten the conditioning prefix and let the LLM's pretrained linguistic knowledge transfer more cleanly to the TTS task.
For streaming generation, CosyVoice 2 interleaves text and speech tokens at a fixed N:M ratio (N=5 text tokens, M=15 speech tokens by default) so that the model alternates between absorbing text and emitting speech.[2] In the offline mode the sequence is simply text-then-speech, but the same weights are used in both modes, so the v2 model can be deployed as either a streaming or a non-streaming TTS without retraining.[2]
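A simplified sketch of the N:M interleaving is shown below. It is an illustration under stated assumptions: the function name and the `("text", …)` / `("speech", …)` tagging are hypothetical, and any special boundary or filler tokens the real model may use are omitted.

```python
def interleave(text_tokens, speech_tokens, n=5, m=15):
    """Alternate blocks of n text tokens and m speech tokens."""
    out = []
    ti = si = 0
    while ti < len(text_tokens):
        # Absorb the next block of text...
        out.extend(("text", t) for t in text_tokens[ti:ti + n])
        ti += n
        # ...then emit the next block of speech.
        out.extend(("speech", s) for s in speech_tokens[si:si + m])
        si += m
    # Once the text is exhausted, the remaining speech tokens follow in one run.
    out.extend(("speech", s) for s in speech_tokens[si:])
    return out
```

For example, 7 text tokens and 40 speech tokens yield 5 text, 15 speech, 2 text, and then the remaining 25 speech tokens.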
CosyVoice 3 scales the LM from 0.5B to 1.5B parameters and adopts a similar Qwen-style backbone.[3] The paper notes that the larger model is particularly beneficial in low-resource languages, where the broader linguistic priors of a pretrained LLM help compensate for limited speech training data.[3]
Rather than a diffusion model, CosyVoice uses optimal-transport conditional flow matching (OT-CFM) to learn a deterministic vector field that transports a noise distribution into the distribution of mel spectrograms conditioned on speech tokens.[1][11] Flow matching is closely related to diffusion but trains a continuous-time velocity field directly using a regression loss along straight-line transport paths, which converges faster and uses fewer sampling steps at inference.[11] The authors apply classifier-free guidance by dropping conditions with probability 0.2 during training, use a cosine timestep schedule at inference, and provide a masked mel spectrogram as additional conditioning so the model can fill in only the missing portions of a partially specified target.[11]
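The ingredients named above can be sketched in a few lines of dependency-free Python. This is a toy illustration on short lists of floats rather than mel spectrograms. The exact path parameterization, the `sigma_min` constant, the guidance formula and its `beta` value, and the form of the cosine schedule are assumptions drawn from common OT-CFM practice, not details confirmed by the text above.

```python
import math
import random

def ot_cfm_pair(x0, x1, t, sigma_min=0.0):
    """Return the point x_t on the straight-line path and its target velocity."""
    xt = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return xt, ut

def cfm_loss(predicted, target):
    """Regression loss: mean squared error between velocities."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)

def maybe_drop_condition(cond, rng, p_drop=0.2):
    """Training-time condition dropout used for classifier-free guidance."""
    return None if rng.random() < p_drop else cond

def cosine_schedule(n_steps):
    """Timesteps t_i = 1 - cos(pi/2 * i/n): small steps early, larger steps late."""
    return [1 - math.cos(0.5 * math.pi * i / n_steps) for i in range(n_steps + 1)]

def guided_velocity(v_cond, v_uncond, beta=0.7):
    """Blend conditional and unconditional predictions (beta is illustrative)."""
    return [(1 + beta) * c - beta * u for c, u in zip(v_cond, v_uncond)]

def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    ts = cosine_schedule(n_steps)
    x = list(x0)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur)
        x = [xi + (t_next - t_cur) * vi for xi, vi in zip(x, v)]
    return x
```

With an oracle field that always returns x1 - x0, `euler_sample` lands on x1 regardless of the number of steps, which illustrates why straight-line paths need few sampling steps.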
CosyVoice 2 generalizes this to a chunk-aware causal flow matching model that supports four different attention masks for different latency/quality tradeoffs:[2]
| Mask | Future context | Use case |
|---|---|---|
| Non-causal | All frames | Highest quality, offline |
| Chunk-2M | 2M future frames | Near-offline quality |
| Chunk-M | M future frames | Balanced latency/quality |
| Full-causal | None (past only) | Lowest latency streaming |
By making the unfolded U-Net causal, CosyVoice 2 can run flow matching incrementally as new speech tokens arrive, which is what lets the system deliver the advertised first-package latency of approximately 150 ms in bi-streaming mode.[2][7]
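The four masks in the table can be sketched as boolean matrices in which entry (i, j) says whether frame i may attend to frame j. This is a simplified reading of the table, with lookahead counted per frame; the actual implementation may instead align the lookahead to chunk boundaries, and the function names are hypothetical.

```python
def attention_mask(n_frames, lookahead):
    """mask[i][j] is True when frame i may attend to frame j."""
    if lookahead is None:
        # Non-causal: every frame sees every other frame.
        return [[True] * n_frames for _ in range(n_frames)]
    # Otherwise frame i sees all past frames plus `lookahead` future frames.
    return [[j <= i + lookahead for j in range(n_frames)] for i in range(n_frames)]

def masks_for(n_frames, M):
    """Build the four masks for a chunk size of M frames."""
    return {
        "non-causal": attention_mask(n_frames, None),
        "chunk-2M": attention_mask(n_frames, 2 * M),
        "chunk-M": attention_mask(n_frames, M),
        "full-causal": attention_mask(n_frames, 0),
    }
```

Less lookahead means the decoder can start earlier but sees less context, which is the latency/quality tradeoff the table summarizes.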
CosyVoice 3 keeps the chunk-aware causal flow-matching framework but rebuilds the decoder around a Diffusion Transformer (DiT) backbone and scales it from approximately 100M to 300M parameters.[3] The larger decoder is given more freedom to render prosodic detail and accommodates the broader range of languages and dialects in the v3 training set.[3]
The final stage converts a mel spectrogram into a 24 kHz waveform. The original CosyVoice and CosyVoice 2 use a HiFi-GAN-style generator with multi-receptive-field (MRF) fusion (the FunAudioLLM team describes the deployed variant as a HiFTNet vocoder), with four transposed-convolution upsampling blocks of strides 4, 4, 4, 2 for a total upsampling factor of 128.[5][13] Because the bulk of the perceptual content is already encoded in the mel spectrogram, the vocoder can run efficiently in real time and is not the latency bottleneck for streaming.[2]
The first release was a research preview targeting zero-shot multilingual TTS and voice cloning. The training corpus consisted of approximately 130,000 hours of Mandarin Chinese, 30,000 hours of English, 5,000 hours of Cantonese, 4,600 hours of Japanese, and 2,200 hours of Korean.[1] Three checkpoints were released (CosyVoice-300M, CosyVoice-300M-Instruct, and CosyVoice-300M-SFT), each at roughly 300M parameters.[7] On the LibriTTS test-clean English split the v1 paper reports a Whisper-based word error rate of 2.89% with a speaker-similarity score of 74.3% (cosine similarity of ERes2Net embeddings between prompt and generated speech).[1] On the AISHELL-3 Chinese set CosyVoice reports 3.82% character error rate and 81.58% speaker similarity.[1] The paper benchmarks against VALL-E (reported at 18.7% WER under matched conditions) and UniAudio (8.74% WER), and CosyVoice achieves a substantially lower error rate while maintaining high speaker similarity.[1] Capabilities listed in the original release include zero-shot voice cloning from a roughly 3-second reference, cross-lingual cloning (for example, using a Mandarin reference to synthesize English speech in the same voice), and instruction-following for limited prosody and style control.[5][7]
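The two metrics quoted in these results can be sketched in plain Python. This is a generic illustration, not the evaluation code used in the paper: in practice the hypothesis text would come from an ASR system such as Whisper and the vectors from a speaker-verification model, and text normalization (casing, punctuation) is omitted here.

```python
import math

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic programme, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)  # assumes a non-empty reference

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Character error rate follows the same recipe with characters in place of words, which is why it is the usual choice for Chinese test sets.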
CosyVoice 2 redesigned the system to operate in both streaming and offline modes from a single set of weights, with the goal of supporting interactive voice agents and LLM-driven chat applications. The reported first-package latency in bi-streaming mode is approximately 150 ms, achieved by interleaving text and speech tokens at a 5:15 ratio and running the flow-matching decoder in chunk-aware causal mode.[2][7] The 0.5B-parameter checkpoint is widely cited as the canonical reference: on the SEED-TTS-Eval test sets the model card reports test-zh CER 1.45% with speaker similarity 75.7%, test-en WER 2.57% with similarity 65.9%, and test-hard CER 6.83% with similarity 72.4%.[8] On LibriSpeech test-clean, the CosyVoice 2 paper reports WER 2.47% and NMOS 3.96, slightly exceeding the human reference values of 2.66% WER and 3.84 NMOS, which the authors describe as human-parity quality.[2]
CosyVoice 2 also significantly expanded instruction-following. The release supports 29 instruction categories spanning eight emotions, multiple speaking rates, Chinese dialects, role-playing styles, and fine-grained markers such as [laughter] and [breath].[2] The instruction-following and zero-shot voice-cloning capabilities are integrated into a single model rather than separate checkpoints.[2]
The CosyVoice 2 checkpoint is hosted at FunAudioLLM/CosyVoice2-0.5B on Hugging Face and reports more than 250,000 monthly downloads at the time of writing.[8]
CosyVoice 3 expanded the system in scale, languages, and post-training. Documented changes from version 2 include:[3][4]
The 1.5B configuration described in the paper is the basis for the hosted CosyVoice 3 Plus offering on Alibaba Cloud's Model Studio (Bailian), while the open-source checkpoint Fun-CosyVoice3-0.5B-2512, published on December 15, 2025, is the 0.5B sibling intended for self-hosted use.[4][10] The CosyVoice 3 model card reports test-zh CER 1.21%, test-en WER 2.24%, and test-hard CER 6.71% for the base 0.5B variant, improving to test-zh CER 0.81% and test-en WER 1.68% for the reinforcement-learned variant (Fun-CosyVoice3-0.5B-2512_RL).[4] The 1.5B configuration with DiffRO reports a SEED-TTS-Eval test-zh CER of 0.71%, an English WER of 1.45%, a hard-case CER of 5.66%, and a WavLM-based speaker similarity of 0.836.[3]
All three CosyVoice releases support zero-shot voice cloning from a short reference utterance of approximately 3 seconds, in which the model conditions on a tokenized prompt and generates a target sentence in the speaker's voice without any speaker-specific fine-tuning.[1][4] The cloning interface in the open-source repository exposes both a standard inference_zero_shot call (give text, give reference audio, get audio out) and an inference_cross_lingual variant that explicitly tags the language of the target.[7]
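The calling pattern can be sketched without the real package by using a stand-in object. Everything beyond the method name `inference_zero_shot` is an assumption here: the argument order (target text, prompt transcript, prompt audio, `stream` flag) and the `"tts_speech"` key follow the repository README as best recalled and may differ between versions, and `FakeModel` and `clone_voice` are hypothetical names.

```python
class FakeModel:
    """Stand-in so the sketch runs without the real package or weights."""

    def inference_zero_shot(self, tts_text, prompt_text, prompt_speech, stream=False):
        # A real model would yield audio; this yields one dummy chunk per sentence.
        for sentence in tts_text.split("."):
            sentence = sentence.strip()
            if sentence:
                yield {"tts_speech": [0.0] * len(sentence)}

def clone_voice(model, tts_text, prompt_text, prompt_speech):
    """Collect every audio chunk the model yields for the target text."""
    chunks = []
    for out in model.inference_zero_shot(tts_text, prompt_text, prompt_speech, stream=False):
        chunks.append(out["tts_speech"])
    return chunks
```

The generator-style interface is what allows the same call to serve both offline use (collect everything, then save) and streaming use (play each chunk as it arrives).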
A reference utterance in one language can be used to synthesize text in another. The CosyVoice 3 paper benchmarks cross-lingual Mandarin-to-English cloning at 5.09% WER with 0.669 speaker similarity, and the v3 demo page includes examples in which a German reference is used to read Chinese text and vice versa.[3] This is one of the capabilities most directly enabled by the supervised semantic tokenizer: because the tokens are essentially language-agnostic phonetic units, the model can map text in one language onto the speaker characteristics extracted from a reference in another.[11]
CosyVoice 2 and 3 accept natural-language instructions that adjust emotion, speaking style, dialect, and role. CosyVoice 2 supports 29 instruction types spanning happy/sad/angry/excited and other emotional categories, speaking rate, dialect tags such as Cantonese, and structured markers such as [laughter], [breath], and emphasis tokens.[2] CosyVoice 3 expands this to 100+ speaking styles and incorporates emotion, accent, and role-play instructions trained on 1,500 hours of dedicated instructed data.[3] The CosyVoice 3 model card demonstrates the instructed interface with Cantonese examples, character-voice tags such as <peppa>, and emotion markers.[4]
The bi-streaming architecture in CosyVoice 2 and 3 supports both streaming text input and streaming audio output, with a default first-package latency of approximately 150 ms and per-chunk latency that is bounded by the chunk size of the causal flow-matching decoder.[2][7] The published latency formula is L_TTS = M*(d_lm + d_fm + d_voc), where M is the chunk size in speech tokens and d_lm, d_fm, d_voc are the per-token computation times for the language model, flow-matching decoder, and vocoder, respectively.[2] This makes CosyVoice suitable as the TTS leg of a real-time voice agent, and the project ships an integration with NVIDIA TensorRT-LLM and vLLM for accelerated serving.[7]
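The latency formula is simple enough to evaluate directly. In the sketch below the per-token timings are invented for illustration only (they are not measurements from the paper or the repository) and were chosen so that a 15-token chunk lands on the advertised 150 ms.

```python
def first_package_latency(M, d_lm, d_fm, d_voc):
    """L_TTS = M * (d_lm + d_fm + d_voc), with per-token times in milliseconds."""
    return M * (d_lm + d_fm + d_voc)

# Hypothetical per-token costs in ms: language model, flow matching, vocoder.
example_ms = first_package_latency(M=15, d_lm=6, d_fm=3, d_voc=1)
```

Because the latency scales linearly with M, halving the chunk size halves the time to first audio, at the cost of less context per chunk for the flow-matching decoder.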
The CosyVoice repository documents pronunciation inpainting in which mixed sequences of words and explicit phonemes (Pinyin for Chinese, CMU phonemes for English) can be used to control the pronunciation of difficult or polyphonic characters without retraining the model.[4][7] Long-form synthesis is supported by streaming generation in chunks; the official model card example demonstrates generation directly from a Python text generator.[8]
The supported set of languages and dialects has grown across releases:
| Release | Languages | Dialects |
|---|---|---|
| CosyVoice 1 (Jul 2024) | Mandarin, English, Cantonese, Japanese, Korean[1] | Limited, not core focus |
| CosyVoice 2 (Dec 2024) | Mandarin, English, Japanese, Korean[2] | Initial dialect support |
| Fun-CosyVoice3 0.5B (2025) | Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian[4] | 18+ Chinese dialects including Cantonese, Minnan, Sichuanese, Shanghainese, Dongbei, Tianjin, Shandong[4] |
The reference implementation is hosted at github.com/FunAudioLLM/CosyVoice and is distributed under the Apache License 2.0; the repository contains training, inference, and deployment code and at the time of writing is one of the most-starred open-source TTS projects on GitHub, with roughly 21,000 stars.[6][7] Model weights are distributed through both Hugging Face (e.g., FunAudioLLM/CosyVoice2-0.5B, FunAudioLLM/Fun-CosyVoice3-0.5B-2512) and ModelScope.[7][8] The repository also ships a Docker image with a gRPC and FastAPI server and supports accelerated inference through TensorRT-LLM and vLLM.[7]
Alibaba Cloud exposes CosyVoice as a managed service through its Model Studio (the platform also referred to as Bailian, 百炼). Documented endpoints include a REST-style speech synthesis API, a WebSocket streaming API, and a voice-cloning API that lets developers upload a reference utterance and register a custom voice.[10] The hosted CosyVoice 3 Plus offering is positioned as the 1.5B configuration of the model with the latest tokenizer and DiffRO post-training.[3][10]
CosyVoice and CosyVoice 2 are routinely used as baselines in subsequent zero-shot TTS papers. The CosyVoice 3 evaluation compares against ten contemporary baselines including F5-TTS, Spark-TTS, MaskGCT, SEED-TTS, and various LLM-based codec-token approaches; CosyVoice 3 reports state-of-the-art content consistency (lowest WER/CER) on most multilingual benchmarks while remaining competitive on speaker similarity.[3] CV3-Eval, the multilingual benchmark introduced alongside CosyVoice 3, has been adopted by other speech-generation researchers as a standard for in-the-wild evaluation.[3]
The CosyVoice repository documents integrations with NVIDIA TensorRT-LLM (for roughly 4x inference acceleration) and with vLLM versions 0.9.0 and 0.11.x, making the models deployable on the same serving stack used for large language models.[7] Because the streaming interface is text-in / audio-out with a low first-package latency, CosyVoice has been used as a drop-in TTS leg in real-time voice agent pipelines built with open-source frameworks such as LiveKit Agents and Pipecat.[7]
The combination of low-latency streaming, multilingual coverage, expressive control, and open-source weights has made CosyVoice a common choice for a range of speech-generation workloads documented either in the official papers or in downstream community projects:[3][7]
The CosyVoice 3 paper explicitly acknowledges several remaining weaknesses, and additional limitations are documented in the official voice-cloning user guide for the hosted service:[3][14]
Like all high-fidelity zero-shot voice-cloning systems, CosyVoice raises consent and impersonation concerns: a 3-second reference is enough to clone a target voice, and there is no built-in mechanism to verify that the user holds rights to the reference recording. The Alibaba Cloud documentation places the responsibility for legal rights to a cloned voice on the user.[14] In addition, the open-weights checkpoints are not audited for safety in the way that managed APIs are, a concern common to open-source audio models capable of producing deepfakes.
CosyVoice sits at the intersection of two design traditions: codec-LM TTS (VALL-E, AudioLM, UniAudio) and flow- or diffusion-based TTS (NaturalSpeech series, F5-TTS, Voicebox). Its defining design choice is to use ASR-supervised semantic tokens between the two stages, rather than either acoustic codec tokens or continuous representations end-to-end. Useful points of comparison include:
| System | Token type | Decoder | Streaming | Notable scope |
|---|---|---|---|---|
| CosyVoice / 2 / 3 | Supervised semantic (S3 / FSQ)[1][2] | Conditional flow matching + HiFi-GAN[1] | Yes (v2+, ~150 ms)[2] | 9 languages + 18+ Chinese dialects (v3)[4] |
| VALL-E (Microsoft) | EnCodec acoustic[9] | Codec decoder[9] | Limited | English-first, with multilingual follow-ups[9] |
| Suno (Bark + Suno v3-v5) | Hybrid semantic + coarse / fine acoustic | Codec | Limited | Music-first; multilingual[15] |
| ElevenLabs / ElevenLabs v3 | Proprietary | Proprietary | Yes | Commercial closed-source[16] |
| Sesame CSM | Discrete tokens (RVQ)[17] | Codec | Yes | Conversational speech model[17] |
| OpenVoice (MyShell) | Tone-color extractor over base TTS | Base TTS | Yes | Voice cloning emphasis |
| XTTS v2 (Coqui) | VQ-VAE | Diffusion | Limited | Multilingual zero-shot |
| F5-TTS | Char-level flow matching | Flow matching | Limited | Flow-matching baseline |
| Tortoise | Discrete autoregressive | Diffusion | No | High-quality but slow |
| GPT-SoVITS | VQ-VAE + SoVITS | SoVITS | Limited | Community-driven |
CosyVoice is most often contrasted with VALL-E (the canonical codec-LM TTS) and with F5-TTS (a recent flow-matching-only baseline). The CosyVoice 3 paper reports lower error rates than F5-TTS, Spark-TTS, MaskGCT, and SEED-TTS on most subsets of its CV3-Eval multilingual benchmark, with the closest competitor on speaker similarity being SEED-TTS, a closed-source commercial system from ByteDance.[3] Third-party reviews of open-source TTS in 2025-2026 generally place CosyVoice 2 at or near the top of the open-source field for combined naturalness, speaker similarity, and streaming latency, with F5-TTS preferred for resource-constrained settings and XTTS v2 still common for real-time pipelines that need permissive licensing across all components.[18]
The CosyVoice family belongs to the broader landscape of large-model TTS that emerged after 2022: