# CosyVoice

> Source: https://aiwiki.ai/wiki/cosyvoice
> Updated: 2026-07-13
> Categories: Chinese AI, Speech & Audio AI, Voice AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**CosyVoice** is a family of open-source multilingual neural [text-to-speech](/wiki/text_to_speech_ai) (TTS) and [voice cloning](/wiki/voice_cloning) models developed by the Tongyi Speech Lab (Tongyi SpeechTeam) at [Alibaba Group](/wiki/alibaba) and released under the Apache 2.0 license. First introduced in July 2024, CosyVoice clones a target speaker's voice from roughly 3 seconds of reference audio and, from CosyVoice 2 onward, streams speech with a first-package latency of about 150 milliseconds; its latest version, CosyVoice 3, generates natural, expressive speech across 9 languages and more than 18 Chinese dialects.[1][2][4] Its reference implementation has drawn more than 22,000 GitHub stars, making it one of the most widely used open-source speech-generation projects.[7]

The first version was introduced in the paper *"CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens"*.[1] The system pairs an autoregressive [language model](/wiki/language_model) that maps text to discrete speech tokens with a conditional [flow matching](/wiki/flow_matching) acoustic decoder that converts those tokens into mel spectrograms, which are then rendered to waveforms by a HiFi-GAN-derived vocoder.[1] The design's distinguishing feature is its use of *supervised* semantic tokens derived from an automatic speech recognition (ASR) encoder rather than the unsupervised acoustic tokens used by codec-LM systems such as VALL-E; the authors report that these "supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning."[1] Three numbered releases have appeared in quick succession: CosyVoice (July 2024), CosyVoice 2 (December 2024) with a streaming-first redesign, and CosyVoice 3 (open-sourced in December 2025) with a 1-million-hour training corpus and an expanded set of supported languages and dialects.[2][3][4]

| Field | Value |
| --- | --- |
| Developer | Tongyi SpeechTeam, Alibaba Group (FunAudioLLM project)[5] |
| First release | July 2024 (arXiv:2407.05407)[1] |
| Latest open release | Fun-CosyVoice3-0.5B-2512, December 2025[4] |
| Architecture | Autoregressive text-to-token LM + conditional flow matching + HiFT/HiFi-GAN vocoder[1] |
| Code license | Apache License 2.0[6] |
| Open repository | github.com/FunAudioLLM/CosyVoice[7] |
| Checkpoints | [Hugging Face](/wiki/hugging_face) and [ModelScope](/wiki/modelscope)[7][8] |
| Languages (v3) | Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian, plus 18+ Chinese dialects[4][8] |

## Background and history

CosyVoice was released by the Tongyi Speech Lab (also referred to as the Tongyi SpeechTeam) within Alibaba Group as part of the broader **FunAudioLLM** project, an umbrella effort that paired two complementary foundation models: a speech-understanding model called *SenseVoice* and a speech-generation model called *CosyVoice*.[5] A companion technical report, *"FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs"*, was posted to arXiv on July 4, 2024, three days before the CosyVoice paper itself.[5] FunAudioLLM positioned the two models as building blocks for applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.[5]

The CosyVoice paper (arXiv:2407.05407) was submitted on July 7, 2024 and revised on July 9, 2024, with Zhihao Du as lead author and a 12-author byline including Qian Chen, Shiliang Zhang, and Zhijie Yan.[1] At the time, the dominant paradigm for large-model TTS in the open research literature was Microsoft's [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e), which treated speech as a sequence of discrete tokens emitted by a neural audio codec (EnCodec-style tokens).[9] The CosyVoice authors argued that codec-style tokens "lack explicit semantic information and alignment to the text," motivating a tokenizer trained with ASR supervision so that the resulting discrete units carry phonetic and linguistic content directly.[1]

CosyVoice 2, *"Scalable Streaming Speech Synthesis with Large Language Models"*, was posted on December 13, 2024 and revised through December 25, 2024 (v3 on arXiv).[2] The second version redesigned the architecture around streaming and real-time chat: a single unified language model that can emit speech either offline or in interleaved streaming mode, a *chunk-aware causal flow matching* decoder, and finite-scalar quantization (FSQ) in place of the original vector quantizer.[2] CosyVoice 2 also removed the explicit speaker embedding and the dedicated text encoder used in version 1, relying instead on a pretrained Qwen 2.5 LM backbone with simplified conditioning.[2]

CosyVoice 3, *"Towards In-the-wild Speech Generation via Scaling-up and Post-training"*, appeared on arXiv on May 23, 2025 (revised May 27, 2025) with Zhihao Du as lead author and a 21-author byline.[3] The paper documents a 100x expansion of training data from 10,000 hours to 1 million hours, a scaling of the language model from 0.5B to 1.5B parameters, a new MinMo-based tokenizer trained on multi-task speech understanding objectives, and a post-training procedure called Differentiable Reward Optimization (DiffRO).[3] An accompanying open-source checkpoint, **Fun-CosyVoice3-0.5B-2512**, was published on December 15, 2025 to Hugging Face and ModelScope under the same Apache 2.0 code license as previous versions, and the larger 1.5B configuration is documented in the paper.[4][8] The hosted version of the model is also offered via Alibaba Cloud's Model Studio (Bailian) speech APIs.[10]

## How does CosyVoice work?

At a high level, every CosyVoice release follows the same three-stage pipeline introduced in version 1:

1. A **speech tokenizer** that compresses 24 kHz audio into a discrete sequence of low-bitrate semantic tokens.
2. A **text-to-token language model** that autoregressively emits speech tokens conditioned on input text (and optionally a short reference utterance for zero-shot cloning).
3. An **acoustic decoder** based on conditional flow matching that turns predicted tokens into a mel spectrogram, followed by a vocoder that renders the spectrogram to a waveform.[1]

The three releases differ in how each stage is parameterized and how they are wired together for streaming.

### Supervised semantic tokens

The original CosyVoice paper introduces a tokenizer the authors abbreviate as **S3** (Supervised Semantic Speech tokenizer).[11] The construction is straightforward: take a multilingual ASR encoder, split it after an early layer, insert a vector quantization (VQ) bottleneck with a single codebook of 4,096 entries, and continue training end-to-end with the CTC/ASR loss applied at the encoder output.[1][11] Because the model is supervised by the ASR objective, the quantized indices retain phonetic and linguistic content. The paper uses an ESPnet Conformer ASR model as the backbone for small-scale single-language experiments and SenseVoice-Large for multilingual experiments, with the VQ layer inserted after the first six encoder layers.[1][11] The token rate is 25 Hz (25 tokens per second of speech).[12]

This is the design choice that defines CosyVoice's identity. Unsupervised neural audio codecs such as EnCodec and SoundStream optimize for reconstruction quality; their tokens compress *acoustic* information densely but do not align cleanly with text. Codec-LM TTS systems such as VALL-E therefore have to learn the text-to-acoustics mapping end-to-end through the language model, which is data-hungry and can be unstable for long sequences.[9] CosyVoice's S3 tokens, by contrast, are essentially a discretization of an ASR encoder's phonetic representation. Because each token already corresponds to something close to a phonetic state, the downstream language model has an easier job and can be trained on less data per language.[1]

CosyVoice 2 keeps the supervised-tokenizer idea but replaces VQ with **finite-scalar quantization (FSQ)**. Encoder activations are projected into a low-rank D-dimensional space and each dimension is quantized independently into a small integer range [-K, K], yielding an implicit codebook size of (2K+1)^D.[2] FSQ is significantly easier to train than VQ, avoids the dead-codes problem, and in CosyVoice 2 reaches 100% codebook utilization (6,561 effective tokens) compared with only 23% for the VQ tokenizer's 4,096-entry codebook in version 1.[2] The CosyVoice 2 tokenizer is built atop SenseVoice-Large, with six transformer blocks using rotary position embeddings (RoPE).[12]
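
The quantization step can be sketched in a few lines of plain Python. This is an illustration only, not the released tokenizer: it assumes K = 1 and D = 8 (one reading consistent with the 6,561-token figure, since 3^8 = 6,561), uses simple clipping as a stand-in for the bounding function, and shifts each digit by K so that token ids start at zero. The function names are hypothetical.

```python
def fsq_quantize(values, k=1):
    """Bound each projected value to [-k, k] and round to the nearest integer."""
    quantized = []
    for v in values:
        clipped = max(-k, min(k, v))  # clipping stands in for the bounding step (assumption)
        quantized.append(int(round(clipped)))
    return quantized


def fsq_index(quantized, k=1):
    """Read the quantized vector as a base-(2k+1) number to get one token id."""
    base = 2 * k + 1
    index = 0
    for j, q in enumerate(quantized):
        index += (q + k) * base ** j  # shift by k so each digit lies in 0..2k (assumption)
    return index


D, K = 8, 1                          # assumed values, see the lead-in
codebook_size = (2 * K + 1) ** D     # implicit codebook size (2K+1)^D

h = [0.9, -0.2, -1.7, 0.4, 0.6, -0.8, 0.1, 1.3]  # made-up low-rank projection
q = fsq_quantize(h, K)
token = fsq_index(q, K)
print(codebook_size, q, token)
```

Because every combination of per-dimension values is a valid code, there is no learned codebook that can collapse, which is the intuition behind the full utilization figure quoted above.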

CosyVoice 3's tokenizer was rebuilt on top of a different speech foundation model, **MinMo**, which itself was pretrained on 1.4 million hours of speech across multi-task objectives (ASR, language identification, speech emotion recognition, audio event detection, and speaker analysis).[3] The FSQ module is again inserted into the encoder, but because MinMo was trained on more diverse data and on prosodically rich objectives such as emotion recognition, the resulting tokens carry more paralinguistic information than the v1/v2 tokens. The paper credits this richer tokenizer with improved prosody naturalness, especially on emotion-conditioned generation.[3] Token rate remains 25 Hz.[3]

### Text-to-token language model

In the original CosyVoice, the text-to-token model is an autoregressive [transformer](/wiki/transformer) that takes a sequence `[S, v, {ȳ_u}, T, {μ_l}, E]`, where `S` and `E` mark the start and end of the sequence, `v` is a speaker embedding, `ȳ_u` are encoded text tokens, `T` is a "turn of speech" token that separates the text from the speech segment, and `μ_l` are speech tokens.[11] Training uses teacher forcing with cross-entropy applied only to predicted speech tokens and the end-of-sequence marker.[11] A separate text encoder (initialized from a small BPE LM) projects the input text into the LM's representation space.
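
A minimal sketch of that sequence layout and the corresponding loss mask follows. It uses string placeholders instead of embeddings, and the function name and marker strings are hypothetical; the only thing it is meant to show is which positions contribute to the cross-entropy loss.

```python
def build_lm_sequence(speaker_emb, text_tokens, speech_tokens):
    """Return (sequence, loss_mask) for the layout [S, v, text..., T, speech..., E]."""
    S, T, E = "<S>", "<T>", "<E>"  # placeholder marker strings (hypothetical)
    sequence = [S, speaker_emb] + list(text_tokens) + [T] + list(speech_tokens) + [E]

    # Prefix positions (S, v, text, T) are conditioning only and carry no loss.
    n_prefix = 2 + len(text_tokens) + 1
    # Speech tokens and the end marker are the prediction targets.
    loss_mask = [False] * n_prefix + [True] * (len(speech_tokens) + 1)
    return sequence, loss_mask


seq, mask = build_lm_sequence("v", ["y1", "y2", "y3"], ["mu1", "mu2"])
for item, is_target in zip(seq, mask):
    print(f"{item:>4}  loss={'yes' if is_target else 'no'}")
```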

CosyVoice 2 simplifies this in two important ways. First, the dedicated text encoder is removed and the LM is initialized directly from a pretrained large language model: the v2 release uses **Qwen 2.5-0.5B** as the backbone, so the speech-token vocabulary is appended to the existing LLM vocabulary and the rest of the model can be fine-tuned on text-plus-speech data.[2] Second, the explicit speaker embedding `v` is dropped because the authors observed it was leaking content information into the speaker channel; the model instead conditions on a short reference utterance encoded by the same tokenizer.[2] These two changes shorten the conditioning prefix and let the LLM's pretrained linguistic knowledge transfer more cleanly to the TTS task.

For streaming generation, CosyVoice 2 interleaves text and speech tokens at a fixed N:M ratio (N=5 text tokens, M=15 speech tokens by default) so that the model alternates between absorbing text and emitting speech.[2] In the offline mode the sequence is simply text-then-speech, but the same weights are used in both modes, so the v2 model can be deployed as either a streaming or a non-streaming TTS without retraining.[2]
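
The interleaving pattern can be illustrated as follows. This is a simplified sketch: it alternates blocks of N text tokens and M speech tokens and appends any remaining speech once the text runs out, and it deliberately omits the special filler and boundary tokens that a real training sequence would need. The function name is hypothetical.

```python
def interleave(text_tokens, speech_tokens, n=5, m=15):
    """Alternate n text tokens with m speech tokens; trailing speech goes last."""
    mixed = []
    t = s = 0
    while t < len(text_tokens):
        mixed.extend(text_tokens[t:t + n])    # absorb the next block of text
        t += n
        mixed.extend(speech_tokens[s:s + m])  # then emit a block of speech
        s += m
    mixed.extend(speech_tokens[s:])           # speech left over after the text ends
    return mixed


text = [f"t{i}" for i in range(7)]
speech = [f"s{i}" for i in range(40)]
mixed = interleave(text, speech)
print(mixed[:22])
```

With 7 text tokens and 40 speech tokens the result is 5 text, 15 speech, 2 text, then the remaining 25 speech tokens, so the model can start emitting audio after seeing only the first 5 text tokens.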

CosyVoice 3 scales the LM from 0.5B to 1.5B parameters and adopts a similar Qwen-style backbone.[3] The paper notes that the larger model is particularly beneficial in low-resource languages, where the broader linguistic priors of a pretrained LLM help compensate for limited speech training data.[3]

### Flow-matching acoustic decoder

Rather than a diffusion model, CosyVoice uses **optimal-transport conditional flow matching (OT-CFM)** to learn a deterministic vector field that transports a noise distribution into the distribution of mel spectrograms conditioned on speech tokens.[1][11] Flow matching is closely related to diffusion but trains a continuous-time velocity field directly using a regression loss along straight-line transport paths, which converges faster and uses fewer sampling steps at inference.[11] The authors apply classifier-free guidance by dropping conditions with probability 0.2 during training, use a cosine timestep schedule at inference, and provide a masked mel spectrogram as additional conditioning so the model can fill in only the missing portions of a partially specified target.[11]
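
The following sketch shows the standard OT-CFM training pair and a cosine-spaced Euler sampler on toy vectors. It is an illustration under stated assumptions, not the released decoder: the path `x_t = (1 - (1 - sigma) t) x0 + t x1` with regression target `x1 - (1 - sigma) x0` is the usual OT-CFM formulation, the cosine schedule is written as `1 - cos(pi t / 2)`, the value of `sigma` is arbitrary, and the conditioning inputs (speech tokens, masked mel, guidance) are left out. All function names are hypothetical.

```python
import math


def ot_cfm_pair(x0, x1, t, sigma=1e-4):
    """Return (x_t, target velocity) on the straight path from noise x0 to data x1."""
    a = 1.0 - (1.0 - sigma) * t
    x_t = [a * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - sigma) * n for n, d in zip(x0, x1)]
    return x_t, target


def cosine_schedule(t):
    """Warp uniform t in [0, 1] so that early steps are smaller."""
    return 1.0 - math.cos(0.5 * math.pi * t)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 on a cosine-spaced grid."""
    x = list(x0)
    ts = [cosine_schedule(i / steps) for i in range(steps + 1)]
    for t_cur, t_next in zip(ts, ts[1:]):
        v = velocity_fn(x, t_cur)
        x = [xi + (t_next - t_cur) * vi for xi, vi in zip(x, v)]
    return x


noise, data, sigma = [1.0, -2.0], [0.5, 3.0], 1e-4
_, target = ot_cfm_pair(noise, data, t=0.3, sigma=sigma)

# An oracle that already knows the target velocity stands in for the trained network.
sample = euler_sample(lambda x, t: target, noise, steps=10)
print(sample)
```

In training, a network would be regressed onto `target` given `x_t`, `t`, and the conditioning; here the oracle simply shows that following the straight-line velocity carries the noise sample to (almost exactly) the data point.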

CosyVoice 2 generalizes this to a **chunk-aware causal flow matching** model that supports four different attention masks for different latency/quality tradeoffs:[2]

| Mask | Future context | Use case |
| --- | --- | --- |
| Non-causal | All frames | Highest quality, offline |
| Chunk-2M | 2M future frames | Near-offline quality |
| Chunk-M | M future frames | Balanced latency/quality |
| Full-causal | None (past only) | Lowest latency streaming |

By making the unfolded U-Net causal, CosyVoice 2 can run flow matching incrementally as new speech tokens arrive, which is what lets the system deliver the advertised first-package latency of approximately 150 ms in bi-streaming mode.[2][7]
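
The four masks can be sketched as boolean matrices. The version below takes the table literally, letting each frame see the whole past plus a fixed number of future frames; an actual implementation may instead align visibility to chunk boundaries, so treat this as an assumption. The function name and the mode strings are hypothetical.

```python
def attention_mask(num_frames, mode, m=2):
    """mask[i][j] is True when frame i may attend to frame j."""
    if mode == "non-causal":
        lookahead = num_frames      # every frame sees every other frame
    elif mode == "chunk-2M":
        lookahead = 2 * m           # past plus 2M future frames
    elif mode == "chunk-M":
        lookahead = m               # past plus M future frames
    elif mode == "full-causal":
        lookahead = 0               # past and current frame only
    else:
        raise ValueError(f"unknown mask mode: {mode}")
    return [[j <= i + lookahead for j in range(num_frames)]
            for i in range(num_frames)]


for mode in ("non-causal", "chunk-2M", "chunk-M", "full-causal"):
    first_row = attention_mask(6, mode, m=2)[0]
    print(f"{mode:>11}: " + "".join("x" if ok else "." for ok in first_row))
```

Less lookahead means the decoder can start rendering sooner after a token arrives, at the cost of less context for each frame.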

CosyVoice 3 keeps the chunk-aware causal flow-matching framework but rebuilds the decoder around a [Diffusion Transformer (DiT)](/wiki/diffusion_transformer) backbone and scales it from approximately 100M to 300M parameters.[3] The larger decoder is given more freedom to render prosodic detail and accommodates the broader range of languages and dialects in the v3 training set.[3]

### Vocoder

The final stage converts a mel spectrogram into a 24 kHz waveform. The original CosyVoice and CosyVoice 2 use a HiFi-GAN-style generator with multi-receptive-field (MRF) fusion (the FunAudioLLM team describes the deployed variant as a HiFTNet vocoder), with four transposed-convolution upsampling blocks of strides 4, 4, 4, 2 for a total upsampling factor of 128.[5][13] Because the bulk of the perceptual content is already encoded in the mel spectrogram, the vocoder can run efficiently in real time and is not the latency bottleneck for streaming.[2]

## What are the main versions of CosyVoice?

### CosyVoice (July 2024)

The first release was a research preview targeting zero-shot multilingual TTS and voice cloning. The training corpus consisted of approximately 130,000 hours of Mandarin Chinese, 30,000 hours of English, 5,000 hours of Cantonese, 4,600 hours of Japanese, and 2,200 hours of Korean.[1] Three model variants were released (CosyVoice-300M, CosyVoice-300M-Instruct, and CosyVoice-300M-SFT), each at roughly 300M parameters.[7] On the LibriTTS *test-clean* English split the v1 paper reports a [Whisper](/wiki/whisper)-based word error rate of 2.89% with a speaker-similarity score of 74.3% (cosine similarity of ERes2Net embeddings between prompt and generated speech).[1] On the AISHELL-3 Chinese set CosyVoice reports 3.82% character error rate and 81.58% speaker similarity.[1] The paper benchmarks against VALL-E (reported at 18.7% WER under matched conditions) and UniAudio (8.74% WER), and CosyVoice achieves a substantially lower error rate while maintaining high speaker similarity.[1] Capabilities listed in the original release include zero-shot voice cloning from a roughly 3-second reference, cross-lingual cloning (for example, using a Mandarin reference to synthesize English speech in the same voice), and instruction-following for limited prosody and style control.[5][7]
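
For readers unfamiliar with the metrics, the sketch below shows how a word or character error rate and a cosine speaker-similarity score are typically computed. It is a generic illustration, not the papers' evaluation code: the real pipelines first transcribe the synthesized audio with an ASR model and extract speaker embeddings with a speaker-verification model, both of which are replaced here by hand-written inputs. The function names are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """WER when units are words, CER when units are characters."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
cer = error_rate(list("你好世界"), list("你好视界"))
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(f"WER={wer:.3f} CER={cer:.2f} SIM={sim:.2f}")
```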

### CosyVoice 2 (December 2024)

CosyVoice 2 redesigned the system to operate in both streaming and offline modes from a single set of weights, with the goal of supporting interactive voice agents and LLM-driven chat applications. The reported first-package latency in bi-streaming mode is approximately 150 ms, achieved by interleaving text and speech tokens at a 5:15 ratio and running the flow-matching decoder in chunk-aware causal mode.[2][7] The 0.5B-parameter checkpoint is widely cited as the canonical reference: on the SEED-TTS-Eval test sets the model card reports test-zh CER 1.45% with speaker similarity 75.7%, test-en WER 2.57% with similarity 65.9%, and test-hard CER 6.83% with similarity 72.4%.[8] On LibriSpeech *test-clean*, the CosyVoice 2 paper reports WER 2.47% and NMOS 3.96, slightly better than the human reference values of 2.66% WER and 3.84 NMOS (lower WER and higher NMOS are better). The authors state that CosyVoice 2 "achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode."[2]

CosyVoice 2 also significantly expanded instruction-following. The release supports 29 instruction categories spanning eight emotions, multiple speaking rates, Chinese dialects, role-playing styles, and fine-grained markers such as `[laughter]` and `[breath]`.[2] The instruction-following and zero-shot voice-cloning capabilities are integrated into a single model rather than separate checkpoints.[2]

The CosyVoice 2 checkpoint is hosted at `FunAudioLLM/CosyVoice2-0.5B` on Hugging Face under the Apache 2.0 license and is the family's most widely used open checkpoint.[8]

### CosyVoice 3 (2025)

CosyVoice 3 expanded the system in scale, languages, and post-training. The paper describes it as "an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness."[3] Documented changes from version 2 include:[3][4]

- **Training data:** Expanded from 10,000 hours to roughly 1 million hours, sourced from broader in-the-wild data and a wider set of text genres.
- **Language coverage:** Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian (9 languages), with 18+ Chinese dialects and accents including Cantonese, Sichuanese, Shanghainese, and others.
- **Model size:** Language model scaled from 0.5B to 1.5B parameters; flow-matching decoder scaled from approximately 100M to 300M parameters with a Diffusion Transformer backbone.
- **Tokenizer:** Rebuilt on top of MinMo, a multi-task speech foundation model trained on 1.4M hours including emotion recognition and audio event detection, which the authors credit with improved prosodic richness.
- **Post-training:** A novel Differentiable Reward Optimization (DiffRO) procedure that uses the Gumbel-Softmax operation to sample LLM-predicted tokens and then directly backpropagates from an ASR-based reward (and auxiliary rewards for emotion and audio events) into the speech-token policy.
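
The Gumbel-Softmax relaxation mentioned in the post-training item can be sketched as follows. This shows only the generic technique, not the DiffRO procedure itself: Gumbel noise is added to the token logits and a temperature-scaled softmax turns the result into a probability vector, so a downstream reward model can consume a weighted mix of token embeddings and gradients can flow back to the logits in an autodiff framework. The logits, the tiny embedding table, and the function names are all made up for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over the token vocabulary."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))                # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                                   # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(probs, table):
    """Probability-weighted mix of embedding rows, the input a reward model would see."""
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.1]                               # made-up LM outputs
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # made-up embeddings

probs = gumbel_softmax(logits, tau=0.5, rng=rng)
mixed = soft_embedding(probs, embedding_table)
print(probs, mixed)
```

Lower temperatures push `probs` toward a one-hot vector (closer to real sampling), while higher temperatures give smoother, lower-variance gradients.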

The 1.5B configuration described in the paper is the basis for the hosted *CosyVoice 3 Plus* offering on Alibaba Cloud's Model Studio (Bailian), while the open-source checkpoint **Fun-CosyVoice3-0.5B-2512**, published on December 15, 2025, is the 0.5B sibling intended for self-hosted use.[4][10] The CosyVoice 3 model card reports test-zh CER 1.21%, test-en WER 2.24%, and test-hard CER 6.71% for the base 0.5B variant, improving to test-zh CER 0.81%, test-en WER 1.68%, and test-hard CER 5.44% for the reinforcement-learned variant (Fun-CosyVoice3-0.5B-2512_RL).[4] The 1.5B configuration with DiffRO reports a SEED-TTS-Eval test-zh CER of 0.71%, an English WER of 1.45%, a hard-case CER of 5.66%, and a WavLM-based speaker similarity of 0.775 (0.836 on the ERes2Net metric).[3]

## What can CosyVoice do?

### Zero-shot voice cloning

All three CosyVoice releases support zero-shot voice cloning from a short reference utterance of approximately 3 seconds, in which the model conditions on a tokenized prompt and generates a target sentence in the speaker's voice without any speaker-specific fine-tuning.[1][4] The cloning interface in the open-source repository exposes both a standard `inference_zero_shot` call (give text, give reference audio, get audio out) and an `inference_cross_lingual` variant that explicitly tags the language of the target.[7]

### Cross-lingual voice cloning

A reference utterance in one language can be used to synthesize text in another. On its CV3-Eval benchmark, the CosyVoice 3 paper reports Chinese-to-English cross-lingual cloning (the 1.5B model) at 4.32% WER with 0.664 speaker similarity, and the v3 demo page includes examples in which a German reference is used to read Chinese text and vice versa.[3] This is one of the capabilities most directly enabled by the supervised semantic tokenizer: because the tokens are essentially language-agnostic phonetic units, the model can map text in one language onto the speaker characteristics extracted from a reference in another.[11]

### Instruction control

CosyVoice 2 and 3 accept natural-language instructions that adjust emotion, speaking style, dialect, and role. CosyVoice 2 supports 29 instruction types spanning happy/sad/angry/excited and other emotional categories, speaking rate, dialect tags such as Cantonese, and structured markers such as `[laughter]`, `[breath]`, and emphasis tokens.[2] CosyVoice 3 expands this to more than 100 speaking styles and incorporates emotion, accent, and role-play instructions, with the high-quality instruction-following data expanded from 1,500 hours in CosyVoice 2 to 5,000 hours.[3] The CosyVoice 3 model card demonstrates the instructed interface with Cantonese dialect instructions, `<peppa>` (character voice) tags, and emotion markers.[4]

### Streaming TTS

The bi-streaming architecture in CosyVoice 2 and 3 supports both *streaming text input* and *streaming audio output*, with a default first-package latency of approximately 150 ms and per-chunk latency that is bounded by the chunk size of the causal flow-matching decoder.[2][7] The published latency formula is `L_TTS = M*(d_lm + d_fm + d_voc)`, where M is the chunk size in speech tokens and `d_lm`, `d_fm`, `d_voc` are the per-token computation times for the language model, flow-matching decoder, and vocoder, respectively.[2] This makes CosyVoice suitable as the TTS leg of a real-time voice agent, and the project ships an integration with NVIDIA TensorRT-LLM and [vLLM](/wiki/vllm) for accelerated serving.[7]
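
As a worked example of the latency formula, the snippet below plugs in per-token costs. The numbers are hypothetical placeholders chosen to make the arithmetic easy to follow, not measurements of any CosyVoice release, and the function name is an assumption.

```python
def first_package_latency_ms(chunk_tokens, d_lm_ms, d_fm_ms, d_voc_ms):
    """L_TTS = M * (d_lm + d_fm + d_voc), with per-token times in milliseconds."""
    return chunk_tokens * (d_lm_ms + d_fm_ms + d_voc_ms)


# Hypothetical per-token costs (not measured values).
latency = first_package_latency_ms(chunk_tokens=15, d_lm_ms=4, d_fm_ms=3, d_voc_ms=1)
print(f"{latency} ms")  # 15 * (4 + 3 + 1) = 120 ms
```

The formula makes the tradeoff explicit: halving the chunk size M halves the first-package latency, while each of the three stages contributes linearly through its per-token cost.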

### Long-form synthesis and pronunciation control

The CosyVoice repository documents *pronunciation inpainting* in which mixed sequences of words and explicit phonemes (Pinyin for Chinese, CMU phonemes for English) can be used to control the pronunciation of difficult or polyphonic characters without retraining the model.[4][7] Long-form synthesis is supported by streaming generation in chunks; the official model card example demonstrates generation directly from a Python text generator.[8]

### Languages and dialects

The supported set of languages and dialects has grown across releases:

| Release | Languages | Dialects |
| --- | --- | --- |
| CosyVoice 1 (Jul 2024) | Mandarin, English, Cantonese, Japanese, Korean[1] | Limited, not core focus |
| CosyVoice 2 (Dec 2024) | Mandarin, English, Japanese, Korean[2] | Initial dialect support |
| Fun-CosyVoice3 0.5B (2025) | Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian[4] | 18+ Chinese dialects including Cantonese, Minnan, Sichuanese, Shanghainese, Dongbei, Tianjin, Shandong[4] |

## Implementations and adoption

### Is CosyVoice open source?

The reference implementation is hosted at `github.com/FunAudioLLM/CosyVoice` and is distributed under the Apache License 2.0; the repository contains training, inference, and deployment code and is one of the most-starred open-source TTS projects on GitHub, with more than 22,000 stars as of mid-2026.[6][7] Model weights are distributed through both [Hugging Face](/wiki/hugging_face) (e.g., `FunAudioLLM/CosyVoice2-0.5B`, `FunAudioLLM/Fun-CosyVoice3-0.5B-2512`) and [ModelScope](/wiki/modelscope).[7][8] The repository also ships a Docker image with a gRPC and FastAPI server and supports accelerated inference through [TensorRT](/wiki/tensorrt)-LLM and vLLM.[7]

### Hosted offerings

Alibaba Cloud exposes CosyVoice as a managed service through its Model Studio (the platform also referred to as Bailian, 百炼). Documented endpoints include a REST-style speech synthesis API, a WebSocket streaming API, and a voice-cloning API that lets developers upload a reference utterance and register a custom voice.[10] The hosted CosyVoice 3 Plus offering is positioned as the 1.5B configuration of the model with the latest tokenizer and DiffRO post-training.[3][10]

### Adoption as a research baseline

CosyVoice and CosyVoice 2 are routinely used as baselines in subsequent zero-shot TTS papers. The CosyVoice 3 evaluation compares against ten contemporary baselines including F5-TTS, Spark-TTS, MaskGCT, SEED-TTS, and various LLM-based codec-token approaches; CosyVoice 3 reports state-of-the-art content consistency (lowest WER/CER) on most multilingual benchmarks while remaining competitive on speaker similarity.[3] CV3-Eval, the multilingual benchmark introduced alongside CosyVoice 3, has been adopted by other speech-generation researchers as a standard for in-the-wild evaluation.[3]

### Ecosystem integrations

The CosyVoice repository documents integrations with NVIDIA TensorRT-LLM (for roughly 4x inference acceleration) and with vLLM versions 0.9.0 and 0.11.x, making the models deployable on the same serving stack used for large language models.[7] Because the streaming interface is text-in / audio-out with a low first-package latency, CosyVoice has been used as a drop-in TTS leg in real-time voice agent pipelines built with open-source frameworks such as LiveKit Agents and Pipecat.[7]

## What is CosyVoice used for?

The combination of low-latency streaming, multilingual coverage, expressive control, and open-source weights has made CosyVoice a common choice for a range of speech-generation workloads documented either in the official papers or in downstream community projects:[3][7]

- **Voice agents and conversational AI.** Real-time chat applications can pair CosyVoice 2 or 3 with a separate ASR system (often SenseVoice, from the same FunAudioLLM family) to build a fully open-source speech-to-speech stack.[5][7]
- **Audiobook and podcast narration.** Long-form streaming generation combined with cross-lingual cloning supports multi-voice narration in multiple languages from a small library of voice prompts.[5]
- **Localization and dubbing.** Cross-lingual voice cloning enables the same speaker's voice to read text in a different language, which is useful for video localization where consistent vocal identity across languages is desirable.[3]
- **Speech-to-speech translation.** FunAudioLLM's umbrella architecture pairs SenseVoice (ASR + LID + emotion) with CosyVoice (TTS) to support translation that preserves emotional and paralinguistic features.[5]
- **Customer service and IVR.** Alibaba Cloud markets the Bailian-hosted CosyVoice service for voice-driven customer service automation in Mandarin, Cantonese, and English.[10]

## What are CosyVoice's limitations?

The CosyVoice 3 paper explicitly acknowledges several remaining weaknesses, and additional limitations are documented in the official voice-cloning user guide for the hosted service:[3][14]

- **No timbre control by text instruction.** The authors note that CosyVoice 3 "cannot control acoustic characteristics, such as timbre, through textual instructions"; timbre is only set by the reference utterance.[3]
- **Weak singing voice generation.** The model does not perform well on singing or other strongly musical vocal styles.[3]
- **Sensitivity to noisy reference audio.** Background noise, reverberation, overlapping voices, and long silences in the reference utterance degrade cloning similarity and naturalness; the Alibaba Cloud guide recommends at least 60% active speech in the reference and avoiding silences longer than 2 seconds.[14]
- **Character-set overlap.** The CosyVoice 2 paper observes that the heavy overlap between the Japanese and Chinese character sets degrades performance on Japanese benchmarks relative to languages with disjoint scripts.[2]
- **Hard cases.** Even the strongest 1.5B v3 configuration with reinforcement learning reports CER above 5% on the SEED-TTS-Eval test-hard subset, indicating that complex prosodic and content cases (e.g., dense numerals, rare names) are still challenging.[3]

Like all high-fidelity zero-shot voice-cloning systems, CosyVoice raises consent and impersonation concerns: a 3-second reference is enough to clone a target voice, and there is no built-in mechanism to verify that the user holds rights to the reference recording. The Alibaba Cloud documentation places the responsibility for legal rights to a cloned voice on the user.[14] As with other open-source voice models, the open-weights checkpoints are not subject to the safety auditing applied to managed APIs, which is the standard concern around open-source [deepfake](/wiki/deepfake)-capable audio models.

## How does CosyVoice compare to other TTS systems?

CosyVoice sits at the intersection of two design traditions: codec-LM TTS (VALL-E, AudioLM, UniAudio) and flow- or diffusion-based TTS (NaturalSpeech series, F5-TTS, Voicebox). Its defining design choice is to place ASR-supervised semantic tokens between the language-model and acoustic-decoder stages, rather than using either acoustic codec tokens or continuous representations end-to-end. Useful points of comparison include:

| System | Token type | Decoder | Streaming | Notable scope |
| --- | --- | --- | --- | --- |
| CosyVoice / 2 / 3 | Supervised semantic (S3 / FSQ)[1][2] | Conditional flow matching + HiFi-GAN[1] | Yes (v2+, ~150 ms)[2] | 9 languages + 18+ Chinese dialects (v3)[4] |
| [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e) | EnCodec acoustic[9] | Codec decoder[9] | Limited | English-first, with multilingual follow-ups[9] |
| [Suno](/wiki/suno) (Bark + Suno v3-v5) | Hybrid semantic + coarse / fine acoustic | Codec | Limited | Music-first; multilingual[15] |
| [ElevenLabs](/wiki/elevenlabs) / [ElevenLabs v3](/wiki/elevenlabs_v3) | Proprietary | Proprietary | Yes | Commercial closed-source[16] |
| [Sesame](/wiki/sesame) [CSM](/wiki/sesame_csm) | Discrete tokens (RVQ)[17] | Codec | Yes | Conversational speech model[17] |
| OpenVoice (MyShell) | Tone-color extractor over base TTS | Base TTS | Yes | Voice cloning emphasis |
| XTTS v2 (Coqui) | VQ-VAE | Diffusion | Limited | Multilingual zero-shot |
| F5-TTS | Char-level flow matching | Flow matching | Limited | Flow-matching baseline |
| Tortoise | Discrete autoregressive | Diffusion | No | High-quality but slow |
| GPT-SoVITS | VQ-VAE + SoVITS | SoVITS | Limited | Community-driven |

CosyVoice is most often contrasted with VALL-E (the canonical codec-LM TTS) and with F5-TTS (a recent flow-matching-only baseline). The CosyVoice 3 paper reports lower error rates than F5-TTS, Spark-TTS, MaskGCT, and SEED-TTS on most subsets of its CV3-Eval multilingual benchmark, with the closest competitor on speaker similarity being SEED-TTS, a closed-source commercial system from ByteDance.[3] Third-party reviews of open-source TTS in 2025-2026 generally place CosyVoice 2 at or near the top of the open-source field for combined naturalness, speaker similarity, and streaming latency, with F5-TTS preferred for resource-constrained settings and XTTS v2 still common for real-time pipelines that need permissive licensing across all components.[18]

## Related work

The CosyVoice family belongs to the broader landscape of large-model TTS that emerged after 2022:

- **VALL-E** (2023): the canonical codec-LM TTS that established the "speech as tokens for an LLM" paradigm. CosyVoice deliberately replaces unsupervised codec tokens with supervised semantic tokens.[9]
- **Flow matching**: the mathematical framework underlying the CosyVoice acoustic decoder; broadly adopted across modern TTS systems such as Voicebox and F5-TTS.[11]
- **Diffusion Transformers**: CosyVoice 3 uses a [DiT](/wiki/diffusion_transformer)-style backbone in its flow-matching decoder.[3]
- **FunAudioLLM siblings**: SenseVoice (ASR, language identification, and emotion recognition) and Fun-ASR provide the speech-understanding half of the FunAudioLLM project; the speech tokenizer in CosyVoice 1 and 2 is built on the SenseVoice-Large ASR encoder.[5]
- **Qwen LLM family**: CosyVoice 2 initializes its text-to-token model from [Qwen](/wiki/qwen) 2.5-0.5B, leveraging the broader Tongyi LLM lineage.[2]
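
To make the flow-matching item above more concrete, here is a minimal sketch in pure Python of how a flow-matching decoder turns noise into an output frame: it integrates a velocity field from t = 0 to t = 1 with fixed-step Euler updates. It is not CosyVoice's implementation. It assumes the straight-line (optimal-transport) path x_t = (1 - t)·x0 + t·x1, and it replaces the trained, token-conditioned network with a hypothetical `oracle_velocity` function that already knows the target. The names `ot_path`, `target_velocity`, `euler_sample`, and the 3-value `target_frame` are illustrative only.

```python
import random


def ot_path(x0, x1, t):
    """Point at time t on the straight-line path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Training target along that path: d/dt x_t = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-bin "mel frame" standing in for real decoder output.
target_frame = [0.5, -1.0, 2.0]


def oracle_velocity(x, t):
    """Assumption: stand-in for a trained network. Returns the exact velocity
    of the straight-line path that ends at target_frame."""
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, target_frame)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_frame]
generated = euler_sample(oracle_velocity, noise, num_steps=10)
```

In a real system the velocity function would be a neural network conditioned on speech tokens and speaker information, trained to regress `target_velocity` at points sampled from `ot_path`. The number of Euler steps then trades synthesis speed against fidelity.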

## See also

- [Text-to-Speech](/wiki/text_to_speech_ai)
- [Voice cloning](/wiki/voice_cloning)
- [Flow matching](/wiki/flow_matching)
- [Diffusion Transformer (DiT)](/wiki/diffusion_transformer)
- [Alibaba Group](/wiki/alibaba)
- [Tongyi Qianwen](/wiki/tongyi_qianwen)
- [Qwen](/wiki/qwen)
- [Qwen3](/wiki/qwen_3)
- [ModelScope](/wiki/modelscope)
- [Hugging Face](/wiki/hugging_face)
- [vLLM](/wiki/vllm)
- [TensorRT](/wiki/tensorrt)
- [ElevenLabs](/wiki/elevenlabs)
- [ElevenLabs v3](/wiki/elevenlabs_v3)
- [Sesame CSM](/wiki/sesame_csm)
- [Suno](/wiki/suno)
- [Suno v5](/wiki/suno_v5)
- [Whisper](/wiki/whisper)
- [Deepfake](/wiki/deepfake)

## References

[1] Du, Zhihao et al., "CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens", arXiv, 2024-07-07. https://arxiv.org/abs/2407.05407. Accessed 2026-05-20.
[2] Du, Zhihao et al., "CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models", arXiv, 2024-12-13. https://arxiv.org/abs/2412.10117. Accessed 2026-05-20.
[3] Du, Zhihao et al., "CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training", arXiv, 2025-05-23. https://arxiv.org/abs/2505.17589. Accessed 2026-05-20.
[4] FunAudioLLM, "Fun-CosyVoice3-0.5B-2512 Model Card", Hugging Face, 2025-12-15. https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512. Accessed 2026-05-20.
[5] Tongyi SpeechTeam, "FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs", arXiv, 2024-07-04. https://arxiv.org/abs/2407.04051. Accessed 2026-05-20.
[6] FunAudioLLM, "CosyVoice LICENSE (Apache 2.0)", GitHub, 2024. https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE. Accessed 2026-05-20.
[7] FunAudioLLM, "CosyVoice GitHub repository", GitHub, 2024-2025. https://github.com/FunAudioLLM/CosyVoice. Accessed 2026-05-20.
[8] FunAudioLLM, "CosyVoice2-0.5B Model Card", Hugging Face, 2024. https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B. Accessed 2026-05-20.
[9] Wang, Chengyi et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers", arXiv, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-05-20.
[10] Alibaba Cloud, "CosyVoice speech synthesis Python SDK", Alibaba Cloud Model Studio Documentation, 2025. https://www.alibabacloud.com/help/en/model-studio/cosyvoice-python-sdk. Accessed 2026-05-20.
[11] Du, Zhihao et al., "CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens (HTML)", arXiv, 2024-07-09. https://arxiv.org/html/2407.05407v1. Accessed 2026-05-20.
[12] Du, Zhihao et al., "CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models (HTML)", arXiv, 2024-12-25. https://arxiv.org/html/2412.10117v3. Accessed 2026-05-20.
[13] Tongyi SpeechTeam, "FunAudioLLM homepage and CosyVoice technical demos", FunAudioLLM, 2024. https://funaudiollm.github.io/. Accessed 2026-05-20.
[14] Alibaba Cloud, "Voice cloning user guide", Alibaba Cloud Model Studio Documentation, 2025. https://www.alibabacloud.com/help/en/model-studio/voice-cloning-user-guide. Accessed 2026-05-20.
[15] FunAudioLLM team, "CosyVoice 3 demo page", FunAudioLLM, 2025. https://funaudiollm.github.io/cosyvoice3/. Accessed 2026-05-20.
[16] Alizila staff, "News Roundup: Multilingual CosyVoice 3, Upgraded AgentScope for Production-Grade AI Agents, Enterprise-Ready AI Coding", Alizila (Alibaba), 2025-12. https://www.alizila.com/news-roundup-multilingual-cosyvoice-3-upgraded-agentscope-for-production-grade-ai-agents-enterprise-ready-ai-coding/. Accessed 2026-05-20.
[17] StableLearn, "CosyVoice 3.0 Tech Guide: Next-Gen Zero-Shot Speech Generation", StableLearn, 2025-12-15. https://stable-learn.com/en/cosyvoice3-tech-guide/. Accessed 2026-05-20.
[18] DataRoot Labs, "Which Open Source Text-to-Speech Model Should You Use?", DataRoot Labs, 2026. https://datarootlabs.com/blog/text-to-speech-models. Accessed 2026-05-20.