XTTS (Coqui XTTS)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,192 words
XTTS (sometimes stylized ⓍTTS, short for "cross-lingual text-to-speech") is an open-weights multilingual text-to-speech model developed by Coqui AI that performs zero-shot voice cloning from short reference audio prompts of roughly six seconds.[1][2] The model couples a GPT-style autoregressive transformer over discrete speech tokens with a HiFi-GAN-style decoder, and it was first released to the public on 30 September 2023 in collaboration with Hugging Face.[3] A refined second release, XTTS v2, followed in November 2023 and extended language coverage to seventeen languages (English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and later Hindi).[1][4] The weights are distributed under the non-commercial Coqui Public Model License (CPML), and although Coqui AI shut down in January 2024, the model and its training toolkit remain widely used through community forks on Hugging Face and GitHub.[5][6][7]
| Field | Value |
|---|---|
| Developer | Coqui AI |
| Initial release | XTTS v1, 30 September 2023[3] |
| Latest release | XTTS v2 (model card finalized November 2023; minor patch releases in December 2023)[1][8] |
| Parameters | ~750M (commonly reported model size)[9] |
| Sample rate | 24 kHz output; 22 kHz conditioning input[2] |
| Reference clip length | ~6 seconds for voice cloning[1] |
| Streaming latency | ~200 ms round trip to first chunk, <100 ms inference on GPU[10] |
| Languages (v2) | 17: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi[1] |
| License | Coqui Public Model License 1.0.0 (non-commercial; commercial license was sold separately while Coqui operated)[5][11] |
| Code repository | github.com/coqui-ai/TTS (original, unmaintained) and github.com/idiap/coqui-ai-TTS (community fork)[6][7] |
| Model weights | huggingface.co/coqui/XTTS-v2[1] |
Coqui AI was founded in 2021 by alumni of Mozilla's machine learning group, including Josh Meyer, Eren Gölge, Reuben Morais, and Kelly Davis.[12][7] The company started by maintaining and expanding the open-source speech research stack that some of the same engineers had begun building inside Mozilla under the Common Voice and Mozilla TTS projects. The flagship public artifact of this effort was the coqui-ai/TTS library on GitHub, a PyTorch toolkit covering Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of Suno's Bark, distributed under the MPL-2.0 license.[6] By late 2023 the repository carried tens of thousands of GitHub stars and the company was offering a commercial product, Coqui Studio, layered on top of the same models.[6][12]
Coqui's research direction in 2022 and 2023 shifted toward zero-shot, multilingual, language-model-style speech synthesis, in line with broader industry developments such as VALL-E, SoundStream, and the Tortoise-TTS system that James Betker had released on GitHub in 2022.[13][14] Tortoise itself trained a GPT-style autoregressive model to predict mel-spectrogram codebook tokens and then decoded those tokens with a diffusion model and a vocoder; it became the conceptual template that Coqui's team extended into XTTS.[14]
XTTS v1 was unveiled on 30 September 2023 through a joint announcement with Hugging Face.[3] At launch the model supported thirteen languages: English, Spanish, French, German, Italian, Brazilian Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, and Mandarin Chinese.[3] The original announcement framed XTTS as "the first generative voice AI foundation model" trained for cross-lingual cloning, and it advertised the ability to clone a speaker's voice from a short audio sample of a few seconds and have that voice speak in any of the thirteen supported languages.[3] The official model card on Hugging Face confirmed the 24 kHz output sampling rate, the six-second reference-clip requirement, and the use of the Coqui Public Model License for the weights.[15] In the days after release the repository was reported as the top trending project on GitHub and the top trending Space on Hugging Face.[3]
A Japanese language code was added to the same v1 family shortly after the initial release, bringing the early count to fourteen languages on the v1 model card.[15]
The launch was deliberately positioned against the proprietary voice cloning incumbents. Coqui's announcement copy described XTTS as a "foundation model for generative voice" and contrasted the six-second reference-clip requirement with the multi-minute or multi-hour data requirements typical of earlier speaker-adaptive TTS systems.[3] Coqui co-founder Josh Meyer wrote in the press release that the partnership with Hugging Face was intended to make the new foundation model widely accessible to researchers and hobbyists, and Hugging Face CTO Julien Chaumond described the launch as part of an ongoing collaboration between the two companies on speech models.[3] In the public discussion that followed the release, developers reported success cloning their own voices from clips of three to ten seconds and demonstrated cross-lingual transfer of English reference speakers into Spanish, German, and Mandarin.[15][22]
XTTS v2 was released in November 2023, within two months of v1. The v2 model card adds Hungarian and Korean (bringing the total to sixteen languages at launch), along with revised speaker conditioning that supports multiple reference clips and interpolation between reference speakers, generally improved stability, and reportedly better prosody and audio quality across languages.[1] Patch releases continued through December 2023; version 2.0.2 of the v2 weights, shipped alongside Coqui TTS library v0.22.0 on 12 December 2023, is the last release published under the original Coqui organization.[8] An additional Hindi language code was rolled into the model card after launch, taking the language count to seventeen as listed on the current Hugging Face page.[1]
The interpolation-style speaker conditioning in v2 is exposed in the Coqui TTS API as a get_conditioning_latents() method that returns two tensors: a GPT conditioning latent that biases the autoregressive token predictor and a speaker embedding consumed by the decoder.[2] Multiple reference files can be passed at once; because the latents and embeddings are aggregated into that single pair before generation begins, the cost of inference itself is unchanged.[2]
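The call pattern described above can be sketched as follows. This is a minimal illustration, not the library's own example: the helper name `clone_and_speak` is hypothetical, `model` is assumed to be an already-loaded XTTS model object, and the `audio_path` keyword and the positional argument order of `inference()` are assumptions that should be checked against the installed library version.

```python
def clone_and_speak(model, text, language, reference_paths):
    """Compute speaker conditioning once, then synthesize one utterance.

    `model` is assumed to expose get_conditioning_latents() and inference()
    in the style of the Coqui TTS XTTS model class. Nothing is imported here,
    so any object with those two methods will work.
    """
    # One call covers every reference clip and yields a single pair of tensors.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=list(reference_paths)
    )
    # The GPT latent steers the token predictor; the embedding steers the decoder.
    return model.inference(text, language, gpt_cond_latent, speaker_embedding)
```

Passing several clips changes only the conditioning step; the synthesis call receives the same two arguments regardless of how many references were supplied.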
Beyond the new languages and conditioning mechanism, the v2 model card lists "stability improvements" and "better prosody and audio quality" as the headline improvements.[1] In practical terms, users reported fewer instances of the autoregressive transformer entering pathological loops on long inputs (a behavior that occasionally caused v1 to repeat syllables or stop generating before the input text was exhausted), more consistent emotional intonation across sentences in a paragraph, and somewhat cleaner consonant articulation. These improvements were not accompanied by a corresponding paper at the time of release; the formal academic write-up of the system came later, in the Interspeech 2024 paper that documents the model trained on sixteen languages.[9]
On 3 January 2024 Coqui co-founder Josh Meyer announced that the company was winding down operations, writing "Coqui is shutting down" on social media and pointing the community to the open-source repository as the canonical home for XTTS going forward.[12][16] The decision followed a wave of consolidation in the generative-voice market in late 2023, where well-funded incumbents such as ElevenLabs dominated paid voice cloning while open-weight alternatives proliferated. The closure of Coqui Studio and the Coqui API followed shortly afterwards, and the official coqui-ai/TTS GitHub repository has had no new releases since v0.22.0 on 12 December 2023.[6][8]
The shutdown immediately raised questions in the developer community about the future of the XTTS license, with users on the repository's discussion board asking Coqui to relicense the weights under a permissive license such as Apache 2.0 or MIT so that the model could be used commercially without negotiating with a now-defunct company.[17] No relicensing occurred, and the CPML notice on the Hugging Face page for coqui/XTTS-v2 remained in place. The Idiap Research Institute in Switzerland created an actively maintained community fork at github.com/idiap/coqui-ai-TTS, publishing a new PyPI package named coqui-tts and continuing to ship feature releases through 2025 and into early 2026.[7]
XTTS is an end-to-end zero-shot voice cloning text-to-speech system whose runtime data flow can be summarized in four stages: a tokenizer turns input text into BPE-style symbol tokens, a GPT-style autoregressive transformer predicts a sequence of discrete audio tokens conditioned on those text tokens and on speaker latents extracted from a reference clip, a HiFi-GAN-style decoder converts the predicted token sequence (and a separate speaker embedding) into a 24 kHz waveform, and an optional streaming wrapper exposes the system as a low-latency real-time interface.[2][9][18]
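The data flow in that summary can be sketched schematically. Every name below (`synthesize`, the `stages` dictionary and its keys) is a hypothetical placeholder rather than part of the Coqui TTS API, and the optional streaming wrapper is left out; the point is only to show which stage consumes which conditioning signal.

```python
def synthesize(text, language, reference_audio, stages):
    """Run the runtime stages in order; `stages` maps stage names to callables."""
    # Text frontend: language-tagged text -> symbol tokens.
    text_tokens = stages["tokenize"](text, language)
    # Conditioning: reference clip -> (latents for the LM, embedding for the decoder).
    gpt_latents, speaker_embedding = stages["condition"](reference_audio)
    # Autoregressive LM: text tokens + latents -> discrete audio tokens.
    audio_tokens = stages["predict_tokens"](text_tokens, gpt_latents)
    # Decoder: audio tokens + speaker embedding -> waveform samples.
    return stages["decode"](audio_tokens, speaker_embedding)
```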
The architecture inherits the broad pattern that James Betker introduced in Tortoise TTS: a transformer language model over discrete speech codes serves as the prosody and phonetic predictor, while a separate non-autoregressive neural network reconstructs the waveform from those codes.[14][18] XTTS keeps that two-stage structure but swaps several components and adds explicit multilingual and multi-speaker conditioning.
Text inputs are normalized per language and tokenized with a byte-pair-encoded (BPE) vocabulary referred to in the code as VoiceBpeTokenizer.[18] The tokenizer is multilingual: a single shared vocabulary handles all of the supported languages, and the model receives a language-id token at the start of each generation so it knows which phonological and prosodic distribution to sample from.[2][9] This design is what allows the v2 model to perform cross-language transfer in which a six-second English reference clip is used to synthesize, for example, Korean or Hindi speech in the same speaker's voice.[1]
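As a toy illustration of language-id conditioning, the sketch below prepends a language tag to normalized text before it would reach a shared tokenizer. The function name `prepare_text` and the bracketed tag format are assumptions made for illustration; the real tokenizer's normalization rules and tag syntax are not reproduced here. The language codes are those listed for v2.

```python
SUPPORTED_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)

def prepare_text(text, language):
    """Return the tagged string a shared multilingual tokenizer would receive."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # Collapse runs of whitespace (a stand-in for per-language normalization).
    normalized = " ".join(text.split())
    # The tag format below is illustrative only.
    return f"[{language}] {normalized}"
```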
The core of XTTS is a decoder-only transformer modeled on the GPT-2 architecture and configured for speech.[18] The transformer's vocabulary contains a mixture of text BPE tokens (consumed as context) and discrete audio tokens (the output target). During training the audio tokens come from a discrete variational autoencoder (a DVAE, conceptually equivalent to a VQ-VAE) that compresses 24 kHz mel-spectrograms into a small codebook; the transformer then learns to predict the next audio token given the preceding text tokens, language id, and audio tokens already emitted.[18][9] This is the same pattern as Tortoise and as autoregressive neural-codec language models such as VALL-E and AudioLM, with the practical difference that the codebook in XTTS encodes mel-spectrogram patches rather than EnCodec residual codes.[14][18]
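The discretization idea behind a VQ-style codebook can be shown with a nearest-neighbour lookup. This is a generic vector-quantization sketch on toy two-dimensional vectors, not XTTS code: in the real system both the encoder and the codebook are learned, and the inputs are mel-spectrogram features rather than hand-written tuples.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]
```

The resulting list of integer indices plays the role of the discrete audio tokens that the transformer is trained to predict one at a time.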
The XTTS v2 paper and accompanying documentation also describe a perceiver resampler that maps the variable-length speaker reference into a fixed-size set of conditioning vectors prepended to the transformer input.[9][18] This is the mechanism that turns a several-second reference clip into a stable speaker prior even when the clip has different length and prosody from the target utterance, and it is one of the components that the v2 release explicitly changed relative to v1.[9]
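The key property of a perceiver-style resampler, a fixed-size output regardless of input length, can be demonstrated with a single cross-attention step. The sketch below is a deliberately simplified stand-in: it uses one head, no learned projections, and treats the reference frames as both keys and values, whereas a real resampler stacks learned attention layers. The function name `resample` is hypothetical.

```python
import math

def resample(reference_frames, latent_queries):
    """Cross-attend a fixed set of queries over variable-length frames.

    Returns one vector per query, so the output length depends only on
    len(latent_queries), never on len(reference_frames).
    """
    dim = len(latent_queries[0])
    outputs = []
    for query in latent_queries:
        # Scaled dot-product score between this query and every frame.
        scores = [
            sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
            for frame in reference_frames
        ]
        # Numerically stable softmax over the frames.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Attention-weighted average of the frames.
        outputs.append([
            sum(w * frame[d] for w, frame in zip(weights, reference_frames)) / total
            for d in range(dim)
        ])
    return outputs
```

A three-frame clip and a thirty-frame clip both yield exactly `len(latent_queries)` conditioning vectors, which is what lets the transformer accept references of differing length.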
The predicted audio tokens are converted back into a 24 kHz waveform by a HiFi-GAN-style decoder.[18] The decoder is conditioned on a speaker embedding extracted from the reference clip in parallel with the GPT conditioning latents, which lets the timbre of the cloned speaker influence not just the token sequence but also the spectral details of the synthesized waveform.[2][18] Deep-learning toolkit documentation covering XTTS internals reports that the HiFi-GAN-style decoder uses a multi-scale generator architecture and is trained with adversarial and reconstruction losses, consistent with the original HiFi-GAN design.[18]
Older Coqui documentation references diffusion-based decoders inherited from Tortoise; in practice the production XTTS pipeline shipped with the v2 weights uses the HiFi-GAN-style waveform generator described in the v2 paper and code, not a diffusion decoder.[9][18]
Voice cloning in XTTS is zero-shot: the model is not fine-tuned per speaker, and adding a new voice does not require gradient updates. Instead, the reference audio is run through a speaker encoder that produces a fixed-length embedding (consumed by the decoder) and through the perceiver resampler that produces the GPT conditioning latent (consumed by the transformer).[2][9] Both representations can be cached, so repeated inference for the same speaker reuses cached latents and avoids re-encoding the reference clip.[2] In practice this caching, plus the comparatively small size of the autoregressive token sequence, is what enables the model's roughly 200-millisecond streaming first-chunk latency on a single GPU.[10]
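A minimal sketch of the caching idea, assuming a model object with a `get_conditioning_latents(audio_path=...)` method as described earlier. The `LatentCache` class is hypothetical and is not the caching layer shipped by any Coqui TTS release; it only shows that the reference clips need to be encoded once per speaker.

```python
class LatentCache:
    """Memoize conditioning latents per ordered set of reference clip paths."""

    def __init__(self, model):
        self._model = model
        self._store = {}

    def get(self, reference_paths):
        # Keyed on paths only: a file edited in place will NOT be re-encoded.
        key = tuple(reference_paths)
        if key not in self._store:
            self._store[key] = self._model.get_conditioning_latents(
                audio_path=list(key)
            )
        return self._store[key]
```

A production variant would likely key on a content hash of the audio rather than on file paths, and would bound the cache size.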
The official documentation states that XTTS can operate with reference clips as short as three seconds and that quality plateaus around six seconds, with multiple shorter clips often outperforming a single longer one because the model can average over different prosodic states.[2]
The conditioning pipeline imposes important practical constraints. Because the speaker embedding and the GPT conditioning latent are computed from the raw waveform, the quality of the reference clip dominates the quality of the clone: clean studio-quality recordings produce significantly more faithful clones than telephone-band or noisy field recordings, and clips containing music, overlapping voices, or strong room reverberation can leak those characteristics into the synthesized output.[2][20] The documentation recommends using mono 24 kHz reference audio cropped to clean speech segments, and the community fork's v0.27 release added a caching layer specifically to let production systems amortize the cost of cleaning and encoding reference clips across many synthesis requests for the same speaker.[7]
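Some of these constraints can be checked mechanically before a clip is used. The sketch below relies only on Python's standard `wave` module and inspects channel count, sample rate, and duration of an uncompressed WAV file. The function name `check_reference_clip` is hypothetical, the three-second default comes from the minimum length quoted above, and the check says nothing about noise, music, or reverberation, which still need to be judged by ear or with separate tooling.

```python
import wave

def check_reference_clip(path, min_seconds=3.0):
    """Return (ok, details) for a WAV file intended as a cloning reference."""
    with wave.open(path, "rb") as clip:
        channels = clip.getnchannels()
        sample_rate = clip.getframerate()
        seconds = clip.getnframes() / sample_rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if seconds < min_seconds:
        problems.append(f"clip is {seconds:.2f}s, shorter than {min_seconds}s")

    details = {
        "channels": channels,
        "sample_rate": sample_rate,
        "seconds": seconds,
        "problems": problems,
    }
    return (not problems, details)
```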
A separate streaming wrapper in the Coqui TTS library lets the model emit audio chunks as the transformer generates tokens, rather than waiting for the full utterance.[10] On a T4-class GPU the documented round trip time to the first audio chunk is roughly 200 milliseconds, with under 100 milliseconds of that spent inside the transformer itself.[10] The library exposes DeepSpeed-accelerated inference paths for users who want to push throughput higher on commodity hardware.[2]
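Time to first chunk can be measured by timing the first item produced by the streaming generator. The sketch below assumes a model object with an `inference_stream()` method that yields audio chunks and takes the same four arguments as the non-streaming call; that signature is an assumption to verify against the installed library, and the helper name `time_to_first_chunk` is hypothetical.

```python
import time

def time_to_first_chunk(model, text, language, gpt_cond_latent, speaker_embedding):
    """Consume a streaming generator and report latency to the first chunk."""
    start = time.perf_counter()
    first_chunk_latency = None
    chunks = []
    stream = model.inference_stream(text, language, gpt_cond_latent, speaker_embedding)
    for chunk in stream:
        if first_chunk_latency is None:
            # Seconds elapsed between the call and the first audio chunk.
            first_chunk_latency = time.perf_counter() - start
        chunks.append(chunk)
    return first_chunk_latency, chunks
```

In a real deployment the first chunk would be handed to the audio output immediately rather than collected in a list; the list is kept here only so the full result can be inspected.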
The XTTS paper accepted at Interspeech 2024 describes training on a multilingual corpus covering sixteen languages, including low- and medium-resource languages such as Hungarian, Korean, Czech, and Arabic.[9] The authors emphasize that XTTS was the first massively multilingual zero-shot TTS to cover this breadth, going beyond earlier multilingual systems such as YourTTS and VALL-E X that supported only a handful of high-resource languages.[9] Exact per-language hour counts are reported in the paper's experimental section. The released v2 model on Hugging Face is the production version of this system, with Hindi added through additional training after the paper submission.[1][9]
A small number of design choices in the data pipeline are documented in the paper and in third-party reproductions: the audio is resampled to a unified rate before mel-spectrogram extraction, language ids are injected at the start of the transformer context window, and the DVAE codebook is trained jointly with the autoregressive model so that the discrete audio token distribution adapts to the multilingual distribution rather than being frozen from a monolingual pretraining stage.[9][18] The choice to use a single shared multilingual BPE vocabulary (rather than language-specific tokenizers) was justified in the paper as a way to share representations across related languages, which the authors argue is what allows XTTS to perform reasonably on the lower-resource languages in the training set.[9]
| Stage | Component | Role |
|---|---|---|
| Frontend | VoiceBpeTokenizer (BPE) | Tokenizes multilingual text input[18] |
| Conditioning extractor | Speaker encoder + perceiver resampler | Produces a speaker embedding and GPT conditioning latents from a ~6s reference clip[2][9] |
| Core LM | Decoder-only transformer (~GPT-2 style) | Predicts discrete audio tokens autoregressively from text, language id, and speaker latents[18][9] |
| Codebook | DVAE / VQ-style audio codebook | Defines the discrete audio token vocabulary that the LM emits[18][9] |
| Decoder | HiFi-GAN-style generator | Reconstructs 24 kHz waveform from predicted tokens conditioned on the speaker embedding[18][9] |
| Optional | DeepSpeed and streaming wrappers | Lower-latency inference, real-time chunked output[2][10] |
| Aspect | XTTS v1 | XTTS v2 |
|---|---|---|
| Initial release | 30 September 2023[3] | November 2023 (model card finalized; v0.22.0 patch on 12 December 2023)[1][8] |
| Languages at release | 13 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh)[3] | 16 at launch, 17 with later Hindi addition (adds hu, ko, hi to the v1 set; ja was added to v1 after its release)[1][15] |
| Speaker conditioning | Single-reference speaker encoder + Tortoise-style latents[15] | Perceiver resampler + multi-reference and interpolation support[1][9] |
| Audio quality | Production-grade but with reported stability issues on long prompts | "Better prosody and audio quality" plus general stability improvements per the v2 model card[1] |
| Streaming | Not officially supported at launch | Supported with ~200 ms time to first chunk[10] |
The reference implementation lives in the Coqui TTS library, originally at github.com/coqui-ai/TTS and now actively maintained at github.com/idiap/coqui-ai-TTS.[6][7] The library is the same toolkit that also implements Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of Bark, and that previously hosted around 1,100 Fairseq Massively Multilingual Speech (MMS) models for low-resource languages.[6] After the Idiap fork the new PyPI distribution is named coqui-tts, separate from the original TTS package.[7] Public release notes from the Idiap fork track new features such as cached cloned-voice latents added in v0.27.0 and continued maintenance into early 2026.[7]
The canonical model weights are published as coqui/XTTS-v2 on Hugging Face, with coqui/XTTS-v1 retained for the earlier release.[1][15] As of May 2026 the v2 page reports millions of monthly downloads, making it one of the most-downloaded open-weight speech models on the platform.[1] A demo Space at huggingface.co/spaces/coqui/xtts provides an interactive playground for cloning a voice from an uploaded reference clip.[1]
The Coqui TTS library ships fine-tuning recipes for the GPT portion of XTTS, including a Gradio-based interface that walks through dataset preparation, GPT fine-tuning, and inference with the fine-tuned weights.[2] The official documentation reports that adapting XTTS to a new voice or stylistic register typically requires around ten minutes of clean speech from the target speaker.[2][15] Because the HiFi-GAN-style decoder is shared across speakers, fine-tuning is generally restricted to the transformer block.
Independent vendors have packaged XTTS as a managed inference endpoint, including Baseten and Eachlabs, both of which document the streaming endpoint and a real-time factor of roughly 0.3 on consumer-grade GPUs.[10][19] The model has also been wrapped by community projects for use inside open-source pipelines such as local AI agents, developer-tool integrations, and audiobook generators. Vendors and forum guides routinely describe XTTS v2 as the leading open-weight option for cross-lingual voice cloning available in 2026, alongside newer entrants such as F5-TTS and various flow-matching systems.[20][21]
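Real-time factor is conventionally the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. The short sketch below makes the arithmetic explicit; the function name is hypothetical and the numbers in the usage note are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by the duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

At a real-time factor of 0.3, for example, ten seconds of speech would take about three seconds to generate.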
XTTS gained traction in the months between its September 2023 release and the Coqui shutdown because it combined three properties that were rare in 2023: zero-shot voice cloning that produced recognizable output from a short reference, broad multilingual coverage, and openly downloadable weights.[3][1] Common deployment scenarios documented in third-party guides and product write-ups include localized voice cloning for content creators, dubbing and accessibility audio for video games, audiobook narration in languages that lack established commercial voices, real-time agentic interactions in voice-enabled assistants, and personal-use voice synthesis where uploading speech to a commercial API is undesirable.[20][22]
The model is also frequently used as a baseline or comparison target in academic and industrial work on neural TTS, where it serves as the canonical open-weight zero-shot multilingual baseline alongside VALL-E-style systems.[9]
The defining practical limitation of XTTS in 2026 is licensing. The weights ship under the Coqui Public Model License 1.0.0, which explicitly forbids any "direct or indirect payment arising from the use of the model or its output," prohibits using XTTS to train other models for commercial use, and reserves the right to grant separate commercial licenses to Coqui.[11] While Coqui operated, that separate commercial license existed and was reportedly priced at modest annual fees for small companies.[5] After the January 2024 shutdown no party is in a position to sell or sign a CPML commercial license, leaving an unresolved gap in which the technical capability to use XTTS commercially exists, but the legal mechanism is dormant.[5][11][17] Independent commentary has warned would-be commercial users that the absence of an active rights holder does not in itself create a permissive license, and several have recommended switching to alternative models with permissive licenses for production deployments.[5][21]
In comparative evaluations XTTS v2 is consistently described as a strong open-weight system whose subjective quality approaches but does not quite match the best proprietary services such as ElevenLabs.[20][21] Reported failure modes include occasional mispronunciations on rare proper nouns, prosody drift on very long utterances, and quality degradation when the reference clip is noisy or contains overlapping speech.[21][22] Newer open systems released after XTTS v2 (such as F5-TTS, OpenVoice v2, and various flow-matching architectures) have begun to outperform it on individual axes such as real-time factor and time to first audio chunk, even when overall naturalness remains close.[21]
Like all expressive zero-shot voice cloning systems, XTTS can be used to fabricate convincing audio of real people without their consent, which is the central abuse vector that motivated regulators and platform operators to focus on voice deepfakes from 2023 onwards. The CPML does not contain a use-based behavioral clause comparable to RAIL-style licenses; it controls commercial use rather than misuse, leaving deepfake risk mitigation to downstream users and platforms.[11] Coqui acknowledged this risk in its original September 2023 announcement by foregrounding research and creative use cases and by emphasizing the non-commercial license as a friction layer against large-scale abuse.[3]
Because the original coqui-ai/TTS repository has not shipped a release since December 2023, users who depend on XTTS in long-lived stacks must rely on the Idiap community fork or maintain their own forks.[6][7] The Idiap fork has so far tracked Python and PyTorch ecosystem changes (including new CUDA versions and updated DeepSpeed builds), but its long-term governance depends on a single research institution rather than a commercial sponsor.[7]
| System | Released | Open weights? | Languages | Zero-shot voice cloning | License |
|---|---|---|---|---|---|
| Tortoise TTS | 2022 | Yes (Apache 2.0) | English only | Yes | Apache 2.0[14] |
| XTTS v2 | 2023 | Yes | 17 | Yes, ~6s ref clip | CPML (non-commercial)[1][11] |
| Bark | 2023 | Yes | Several (incl. non-speech sounds) | Limited (preset speakers) | MIT[6] |
| VALL-E | 2023 | No (Microsoft Research paper) | Originally English; VALL-E X added cross-lingual | Yes, 3s ref clip | Closed[13] |
| ElevenLabs | 2022– | No (proprietary API) | 32+ | Yes, paid API | Commercial[20] |
| F5-TTS | 2024 | Yes | Several | Yes | More permissive[21] |
XTTS occupies a specific niche in this landscape: it has broader language coverage than English-only Tortoise and than Bark, openly published weights unlike VALL-E and ElevenLabs, and a stricter license than Tortoise or F5-TTS. The Interspeech 2024 paper positions XTTS as "the first massively multilingual ZS-TTS model supporting low/medium resource languages" and credits the perceiver resampler and the unified BPE vocabulary as the components that make this scale of multilingual coverage tractable.[9]