XTTS (Coqui XTTS)

Updated Jul 16, 2026

XTTS (sometimes stylized ⓍTTS, short for "cross-lingual text-to-speech") is an open-weights multilingual text-to-speech model developed by Coqui AI that performs zero-shot voice cloning from short reference audio prompts of roughly six seconds.^[1]^[2] The model couples a GPT-style autoregressive transformer over discrete speech tokens with a HiFi-GAN-style decoder, and it was first released to the public on 30 September 2023 in collaboration with Hugging Face.^[3] A refined second release, XTTS v2, followed in November 2023 and extended language coverage to seventeen languages (English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and later Hindi).^[1]^[4] The weights are distributed under the non-commercial Coqui Public Model License (CPML), and although Coqui AI shut down in January 2024, the model and its training toolkit remain widely used through community forks on Hugging Face and GitHub.^[5]^[6]^[7]

Infobox

Field	Value
Developer	Coqui AI
Initial release	XTTS v1, 30 September 2023^[3]
Latest release	XTTS v2 (model card finalized November 2023; minor patch releases in December 2023)^[1]^[8]
Parameters	~750M (commonly reported model size)^[9]
Sample rate	24 kHz output; 22 kHz conditioning input^[2]
Reference clip length	~6 seconds for voice cloning^[1]
Streaming latency	~200 ms round trip to first chunk, <100 ms inference on GPU^[10]
Languages (v2)	17: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi^[1]
License	Coqui Public Model License 1.0.0 (non-commercial; commercial license was sold separately while Coqui operated)^[5]^[11]
Code repository	github.com/coqui-ai/TTS (original, unmaintained) and github.com/idiap/coqui-ai-TTS (community fork)^[6]^[7]
Model weights	huggingface.co/coqui/XTTS-v2^[1]

History

Coqui AI and the road to XTTS

Coqui AI was founded in 2021 by alumni of Mozilla's machine learning group, including Josh Meyer, Eren Gölge, Reuben Morais, and Kelly Davis.^[12]^[7] The company started by maintaining and expanding the open-source speech research stack that some of the same engineers had begun building inside Mozilla under the Common Voice and Mozilla TTS projects. The flagship public artifact of this effort was the coqui-ai/TTS library on GitHub, a PyTorch toolkit covering Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of Suno's Bark, distributed under the MPL-2.0 license.^[6] By late 2023 the repository carried tens of thousands of GitHub stars and the company was offering a commercial product, Coqui Studio, layered on top of the same models.^[6]^[12]

Coqui's research direction in 2022 and 2023 shifted toward zero-shot, multilingual, language-model-style speech synthesis, in line with broader industry developments such as VALL-E, SoundStream, and the Tortoise-TTS system that James Betker had released on GitHub in 2022.^[13]^[14] Tortoise itself trained a GPT-style autoregressive model to predict mel-spectrogram codebook tokens and then decoded those tokens with a diffusion model and a vocoder; it became the conceptual template that Coqui's team extended into XTTS.^[14]

XTTS v1 (September 2023)

XTTS v1 was unveiled on 30 September 2023 through a joint announcement with Hugging Face.^[3] At launch the model supported thirteen languages: English, Spanish, French, German, Italian, Brazilian Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, and Mandarin Chinese.^[3] The original announcement framed XTTS as "the first generative voice AI foundation model" trained for cross-lingual cloning, and it advertised the ability to clone a speaker's voice from a short audio sample of a few seconds and have that voice speak in any of the thirteen supported languages.^[3] The official model card on Hugging Face confirmed the 24 kHz output sampling rate, the six-second reference-clip requirement, and the use of the Coqui Public Model License for the weights.^[15] In the days after release the repository was reported as the top trending project on GitHub and the top trending Space on Hugging Face.^[3]

A Japanese language code was added to the same v1 family shortly after the initial release, bringing the early count to fourteen languages on the v1 model card.^[15]

The launch positioning was deliberately framed against the proprietary voice cloning incumbents. Coqui's announcement copy described XTTS as a "foundation model for generative voice" and contrasted the six-second reference-clip requirement with the multi-minute or multi-hour data requirements typical of earlier speaker-adaptive TTS systems.^[3] Coqui co-founder Joshua Meyer wrote in the press release that the partnership with Hugging Face was intended to ensure that the new foundation model was widely accessible to researchers and hobbyists, and Hugging Face CTO Julien Chaumond described the launch as part of an ongoing collaboration between the two companies on speech models.^[3] In the public discussion that followed the release, developers reported success cloning their own voices from clips of three to ten seconds and demonstrated cross-lingual transfer of English reference speakers into Spanish, German, and Mandarin.^[15]^[22]

XTTS v2 (November 2023)

XTTS v2 was released about two months after v1. The v2 model card adds Hungarian and Korean (bringing the total to sixteen languages at launch), along with revised speaker conditioning that supports multiple reference clips and interpolation between reference speakers, generally improved stability, and reportedly better prosody and audio quality across languages.^[1] Patch releases continued through December 2023; version 2.0.2 of the v2 weights, shipped alongside Coqui TTS library v0.22.0 on 12 December 2023, is the last release published under the original Coqui organization.^[8] An additional Hindi language code was rolled into the model card after launch, taking the language count to seventeen as listed on the current Hugging Face page.^[1]

The interpolation-style speaker conditioning in v2 is exposed in the Coqui TTS API as a get_conditioning_latents() function that returns two tensors per reference: a GPT conditioning latent that biases the autoregressive token predictor and a speaker embedding consumed by the decoder.^[2] Multiple reference files can be passed at once, with the cost of inference unchanged because the latents and embeddings are aggregated before generation begins.^[2]
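Why extra reference clips need not slow generation can be illustrated with a toy aggregation step. The sketch below assumes, purely for illustration, that per-clip speaker embeddings are combined by an element-wise mean; the function name `aggregate_embeddings` and the three-dimensional vectors are hypothetical and do not reproduce the library's implementation.

```python
from typing import List, Sequence


def aggregate_embeddings(per_clip: Sequence[Sequence[float]]) -> List[float]:
    """Combine one embedding per reference clip into a single vector.

    Assumption: an element-wise mean. The real library may aggregate
    differently; the point is only that generation sees one vector no
    matter how many clips were supplied.
    """
    if not per_clip:
        raise ValueError("at least one reference embedding is required")
    dim = len(per_clip[0])
    if any(len(vec) != dim for vec in per_clip):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(vec[i] for vec in per_clip) / len(per_clip) for i in range(dim)]


# Three hypothetical clips of the same speaker, each already encoded.
clips = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [2.0, 3.0, 2.0]]
speaker_vector = aggregate_embeddings(clips)
```

Because the aggregation happens once, before any tokens are generated, the per-utterance cost depends on the length of the text and not on the number of clips.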

Beyond the new languages and conditioning mechanism, the v2 model card lists "stability improvements" and "better prosody and audio quality" as the headline improvements.^[1] In practical terms, users reported fewer instances of the autoregressive transformer entering pathological loops on long inputs (a behavior that occasionally caused v1 to repeat syllables or stop generating before the input text was exhausted), more consistent emotional intonation across sentences in a paragraph, and somewhat cleaner consonant articulation. These improvements were not accompanied by a corresponding paper at the time of release; the formal academic write-up of the system came later, in the Interspeech 2024 paper that documents the model trained on sixteen languages.^[9]

The Coqui shutdown

On 3 January 2024 Coqui co-founder Josh Meyer announced that the company was winding down operations, writing "Coqui is shutting down" on social media and pointing the community to the open-source repository as the canonical home for XTTS going forward.^[12]^[16] The decision followed a wave of consolidation in the generative-voice market in late 2023, where well-funded incumbents such as ElevenLabs dominated paid voice cloning while open-weight alternatives proliferated. The closure of Coqui Studio and the Coqui API followed shortly afterwards, and the official coqui-ai/TTS GitHub repository has had no new releases since v0.22.0 on 12 December 2023.^[6]^[8]

The shutdown immediately raised questions in the developer community about the future of the XTTS license, with users on the repository's discussion board asking Coqui to relicense the weights under a permissive license such as Apache 2.0 or MIT so that the model could be used commercially without negotiating with a now-defunct company.^[17] No relicensing occurred, and the CPML notice on the Hugging Face page for coqui/XTTS-v2 remained in place. The Idiap Research Institute in Switzerland created an actively maintained community fork at github.com/idiap/coqui-ai-TTS, publishing a new PyPI package named coqui-tts and continuing to ship feature releases through 2025 and into early 2026.^[7]

How XTTS works

Overall pipeline

XTTS is an end-to-end zero-shot voice cloning text-to-speech system whose runtime data flow can be summarized in four stages: a tokenizer turns input text into BPE-style symbol tokens, a GPT-style autoregressive transformer predicts a sequence of discrete audio tokens conditioned on those text tokens and on speaker latents extracted from a reference clip, a HiFi-GAN-style decoder converts the predicted token sequence (and a separate speaker embedding) into a 24 kHz waveform, and an optional streaming wrapper exposes the system as a low-latency real-time interface.^[2]^[9]^[18]
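The data flow through the first three stages can be sketched as plain function composition. Everything in the block is a stand-in: `run_pipeline` and the three `toy_*` functions are hypothetical names, and the real stages operate on tensors rather than Python lists.

```python
from typing import Callable, List

TextTokens = List[int]
AudioTokens = List[int]
Vector = List[float]


def run_pipeline(
    text: str,
    language: str,
    gpt_latent: Vector,
    speaker_embedding: Vector,
    tokenize: Callable[[str, str], TextTokens],
    predict: Callable[[TextTokens, Vector], AudioTokens],
    decode: Callable[[AudioTokens, Vector], List[float]],
) -> List[float]:
    """Text -> text tokens -> audio tokens -> waveform samples."""
    text_tokens = tokenize(text, language)            # stage 1: tokenizer
    audio_tokens = predict(text_tokens, gpt_latent)   # stage 2: autoregressive LM
    return decode(audio_tokens, speaker_embedding)    # stage 3: waveform decoder


# Toy stand-ins so the sketch runs without any model weights.
def toy_tokenize(text: str, language: str) -> TextTokens:
    return [ord(ch) for ch in text]


def toy_predict(text_tokens: TextTokens, gpt_latent: Vector) -> AudioTokens:
    return [t % 8 for t in text_tokens]


def toy_decode(audio_tokens: AudioTokens, speaker_embedding: Vector) -> List[float]:
    return [t / 8.0 for t in audio_tokens]


waveform = run_pipeline("hi", "en", [0.0], [0.0], toy_tokenize, toy_predict, toy_decode)
```

Note that the two speaker representations enter at different stages: the GPT latent conditions the token predictor, while the speaker embedding conditions the decoder.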

The architecture inherits the broad pattern that James Betker introduced in Tortoise TTS: a transformer language model over discrete speech codes serves as the prosody and phonetic predictor, while a separate non-autoregressive neural network reconstructs the waveform from those codes.^[14]^[18] XTTS keeps that two-stage structure but swaps several components and adds explicit multilingual and multi-speaker conditioning.

Text frontend and tokenizer

Text inputs are normalized per language and tokenized with a byte-pair-encoded (BPE) vocabulary referred to in the code as VoiceBpeTokenizer.^[18] The tokenizer is multilingual: a single shared vocabulary handles all of the supported languages, and the model receives a language-id token at the start of each generation so it knows which phonological and prosodic distribution to sample from.^[2]^[9] This design is what allows the v2 model to perform cross-language transfer in which a six-second English reference clip is used to synthesize, for example, Korean or Hindi speech in the same speaker's voice.^[1]
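As an illustration of the language-id mechanism, the sketch below prepends a language token to a token sequence drawn from one shared vocabulary. It is a toy: the vocabulary is character-level rather than BPE, and the `[en]`-style token spelling, the `encode_with_language` name, and the tiny vocabulary are assumptions made for the example.

```python
from typing import Dict, List


def encode_with_language(text: str, language: str, vocab: Dict[str, int]) -> List[int]:
    """Return token ids with a language-id token in first position."""
    lang_token = f"[{language}]"
    if lang_token not in vocab:
        raise ValueError(f"unsupported language: {language}")
    ids = [vocab[lang_token]]
    for ch in text.lower():
        # Unknown characters fall back to a shared [UNK] id.
        ids.append(vocab.get(ch, vocab["[UNK]"]))
    return ids


# One shared vocabulary serves every language; only the first id differs.
toy_vocab = {"[UNK]": 0, "[en]": 1, "[ko]": 2, "h": 3, "i": 4}
english_ids = encode_with_language("Hi", "en", toy_vocab)
korean_ids = encode_with_language("Hi", "ko", toy_vocab)
```

The same text yields the same ids after the first position, so the language token is the only signal telling the model which language's pronunciation and prosody to produce.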

GPT-style autoregressive token predictor

The core of XTTS is a decoder-only transformer modeled on the GPT-2 architecture and configured for speech.^[18] The transformer's vocabulary contains a mixture of text BPE tokens (consumed as context) and discrete audio tokens (the output target). During training the audio tokens come from a discrete variational autoencoder (a DVAE, conceptually equivalent to a VQ-VAE) that compresses 24 kHz mel-spectrograms into a small codebook; the transformer then learns to predict the next audio token given the preceding text tokens, language id, and audio tokens already emitted.^[18]^[9] This is the same pattern as Tortoise and as autoregressive neural-codec language models such as VALL-E and AudioLM, with the practical difference that the codebook in XTTS encodes mel-spectrogram patches rather than EnCodec residual codes.^[14]^[18]
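The next-token loop described above can be sketched in a few lines. The `toy_next_token` function stands in for the transformer and is entirely hypothetical, as are the stop-token id and the `max_new_tokens` cap; a real model would sample from a probability distribution instead of computing a deterministic value.

```python
from typing import Callable, List, Sequence

STOP_TOKEN = 0  # hypothetical id that ends generation


def generate_audio_tokens(
    text_tokens: Sequence[int],
    next_token: Callable[[Sequence[int], Sequence[int]], int],
    max_new_tokens: int = 50,
) -> List[int]:
    """Emit audio tokens one at a time, each conditioned on all earlier ones."""
    audio_tokens: List[int] = []
    for _ in range(max_new_tokens):  # hard cap guards against runaway loops
        token = next_token(text_tokens, audio_tokens)
        if token == STOP_TOKEN:
            break
        audio_tokens.append(token)
    return audio_tokens


def toy_next_token(text_tokens: Sequence[int], audio_tokens: Sequence[int]) -> int:
    """Stand-in for the transformer: two audio tokens per text token, then stop."""
    step = len(audio_tokens)
    if step >= 2 * len(text_tokens):
        return STOP_TOKEN
    return 1 + (text_tokens[step // 2] + step) % 7


tokens = generate_audio_tokens([3, 4], toy_next_token)
```

The cap on generated tokens matters in practice: an autoregressive predictor that never emits the stop token would otherwise loop indefinitely.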

The XTTS v2 paper and accompanying documentation also describe a perceiver resampler that maps the variable-length speaker reference into a fixed-size set of conditioning vectors prepended to the transformer input.^[9]^[18] This is the mechanism that turns a several-second reference clip into a stable speaker prior even when the clip has different length and prosody from the target utterance, and it is one of the components that the v2 release explicitly changed relative to v1.^[9]

HiFi-GAN-style decoder

The predicted audio tokens are converted back into a 24 kHz waveform by a HiFi-GAN-style decoder.^[18] The decoder is conditioned on a speaker embedding extracted from the reference clip in parallel with the GPT conditioning latents, which lets the timbre of the cloned speaker influence not just the token sequence but also the spectral details of the synthesized waveform.^[2]^[18] Third-party documentation of the XTTS internals reports that the HiFi-GAN decoder uses a multi-scale generator architecture and is trained with adversarial and reconstruction losses, consistent with the original HiFi-GAN design.^[18]

Older Coqui documentation references diffusion-based decoders inherited from Tortoise; in practice the production XTTS pipeline shipped with the v2 weights uses the HiFi-GAN-style waveform generator described in the v2 paper and code, not a diffusion decoder.^[9]^[18]

Voice cloning conditioning

Voice cloning in XTTS is zero-shot: the model is not fine-tuned per speaker, and adding a new voice does not require gradient updates. Instead, the reference audio is run through a speaker encoder that produces a fixed-length embedding (consumed by the decoder) and through the perceiver resampler that produces the GPT conditioning latent (consumed by the transformer).^[2]^[9] Both representations can be cached, so repeated inference for the same speaker reuses cached latents and avoids re-encoding the reference clip.^[2] In practice this caching, plus the comparatively small size of the autoregressive token sequence, is what enables the model's streaming first-chunk latency of roughly 200 milliseconds on a single GPU.^[10]
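A minimal sketch of the caching idea, assuming the reference audio is available as raw bytes and that some encoder returns a pair of latents. The `LatentCache` class, its `misses` counter, and the toy encoder are hypothetical and exist only to show the reuse pattern.

```python
import hashlib
from typing import Callable, Dict, Tuple

# (GPT conditioning latent, speaker embedding), both as plain tuples here.
Latents = Tuple[Tuple[float, ...], Tuple[float, ...]]


class LatentCache:
    """Encode each distinct reference clip once and reuse the result."""

    def __init__(self, encode: Callable[[bytes], Latents]) -> None:
        self._encode = encode
        self._store: Dict[str, Latents] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, reference_audio: bytes) -> Latents:
        # Key on content, so a re-uploaded identical clip still hits the cache.
        key = hashlib.sha256(reference_audio).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(reference_audio)
        return self._store[key]


def toy_encode(audio: bytes) -> Latents:
    """Stand-in for the speaker encoder and perceiver resampler."""
    return ((float(len(audio)),), (float(sum(audio) % 256),))


cache = LatentCache(toy_encode)
first = cache.get(b"reference-clip")
second = cache.get(b"reference-clip")  # served from the cache
```

Hashing the clip content rather than its filename is a design choice for the example; a production system might instead key on a speaker identifier.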

The official documentation states that XTTS can operate with reference clips as short as three seconds and that quality plateaus around six seconds, with multiple shorter clips often outperforming a single longer one because the model can average over different prosodic states.^[2]

The conditioning pipeline imposes important practical constraints. Because the speaker embedding and the GPT conditioning latent are computed from the raw waveform, the quality of the reference clip dominates the quality of the clone: clean studio-quality recordings produce significantly more faithful clones than telephone-band or noisy field recordings, and clips containing music, overlapping voices, or strong room reverberation can leak those characteristics into the synthesized output.^[2]^[20] The documentation recommends using mono 24 kHz reference audio cropped to clean speech segments, and the community fork's v0.27 release added a caching layer specifically to let production systems amortize the cost of cleaning and encoding reference clips across many synthesis requests for the same speaker.^[7]
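A quick pre-flight check on a reference clip can catch the most basic problems (stereo files, very short clips) before any model is loaded. The sketch uses only Python's standard `wave` module and therefore handles uncompressed WAV files only; the `describe_reference` name and the three-second default threshold are assumptions for the example, and the check says nothing about noise, music, or reverberation.

```python
import wave
from typing import Dict, Union


def describe_reference(path: str, min_seconds: float = 3.0) -> Dict[str, Union[bool, int, float]]:
    """Report channel count, sample rate, and duration of a WAV reference clip."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    return {
        "mono": channels == 1,
        "sample_rate": rate,
        "seconds": seconds,
        "long_enough": seconds >= min_seconds,
    }
```

A caller could reject or resample clips whose report does not match the recommended format before passing them to the conditioning step.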

Streaming and inference

A separate streaming wrapper in the Coqui TTS library lets the model emit audio chunks as the transformer generates tokens, rather than waiting for the full utterance.^[10] On a T4-class GPU the documented round trip time to the first audio chunk is roughly 200 milliseconds, with under 100 milliseconds of that spent inside the transformer itself.^[10] The library exposes DeepSpeed-accelerated inference paths for users who want to push throughput higher on commodity hardware.^[2]
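The chunked-output pattern can be sketched with a Python generator. This is not the library's streaming wrapper: `stream_chunks`, `time_to_first_chunk`, and the chunk size of 20 tokens are hypothetical, and the decoder is passed in as a plain function.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Chunk = TypeVar("Chunk")


def stream_chunks(
    audio_tokens: Iterable[int],
    decode_chunk: Callable[[List[int]], Chunk],
    chunk_size: int = 20,
) -> Iterator[Chunk]:
    """Yield decoded audio as soon as each group of tokens is available."""
    buffer: List[int] = []
    for token in audio_tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield decode_chunk(buffer)
            buffer = []
    if buffer:  # flush the final, possibly shorter, chunk
        yield decode_chunk(buffer)


def time_to_first_chunk(chunks: Iterator[Chunk]) -> Tuple[Chunk, float]:
    """Measure how long the consumer waits before the first chunk arrives."""
    start = time.perf_counter()
    first = next(chunks)
    return first, time.perf_counter() - start


chunk_lengths = [len(c) for c in stream_chunks(range(45), list, chunk_size=20)]
```

Because the generator is lazy, playback can begin after the first chunk is decoded, which is why time to first chunk, and not total synthesis time, is the figure that matters for interactive use.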

Training data

The XTTS paper accepted at Interspeech 2024 describes training on a multilingual corpus covering sixteen languages, including low- and medium-resource languages such as Hungarian, Korean, Czech, and Arabic.^[9] The authors emphasize that XTTS was the first massively multilingual zero-shot TTS to cover this breadth, going beyond earlier multilingual systems such as YourTTS and VALL-E X that supported only a handful of high-resource languages.^[9] Exact per-language hour counts are reported in the paper's experimental section. The released v2 model on Hugging Face is the production version of this system, with Hindi added through additional training after the paper submission.^[1]^[9]

A small number of design choices in the data pipeline are documented in the paper and in third-party reproductions: the audio is resampled to a unified rate before mel-spectrogram extraction, language ids are injected at the start of the transformer context window, and the DVAE codebook is trained jointly with the autoregressive model so that the discrete audio token distribution adapts to the multilingual distribution rather than being frozen from a monolingual pretraining stage.^[9]^[18] The choice to use a single shared multilingual BPE vocabulary (rather than language-specific tokenizers) was justified in the paper as a way to share representations across related languages, which the authors argue is what allows XTTS to perform reasonably on the lower-resource languages in the training set.^[9]

Architecture summary table

Stage	Component	Role
Frontend	`VoiceBpeTokenizer` (BPE)	Tokenizes multilingual text input^[18]
Conditioning extractor	Speaker encoder + perceiver resampler	Produces a speaker embedding and GPT conditioning latents from a ~6s reference clip^[2]^[9]
Core LM	Decoder-only transformer (~GPT-2 style)	Predicts discrete audio tokens autoregressively from text, language id, and speaker latents^[18]^[9]
Codebook	DVAE / VQ-style audio codebook	Defines the discrete audio token vocabulary that the LM emits^[18]^[9]
Decoder	HiFi-GAN-style generator	Reconstructs 24 kHz waveform from predicted tokens conditioned on the speaker embedding^[18]^[9]
Optional	DeepSpeed and streaming wrappers	Lower-latency inference, real-time chunked output^[2]^[10]

XTTS v1 versus XTTS v2

Aspect	XTTS v1	XTTS v2
Initial release	30 September 2023^[3]	November 2023 (model card finalized; v0.22.0 patch on 12 December 2023)^[1]^[8]
Languages at release	13 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh)^[3]	16 at launch, 17 with later Hindi addition (adds hu, ko, hi to the v1 set; ja was added to v1 after its release)^[1]^[15]
Speaker conditioning	Single-reference speaker encoder + Tortoise-style latents^[15]	Perceiver resampler + multi-reference and interpolation support^[1]^[9]
Audio quality	Production-grade but with reported stability issues on long prompts	"Better prosody and audio quality" plus general stability improvements per the v2 model card^[1]
Streaming	Not officially supported at launch	Supported with ~200 ms time to first chunk^[10]

Implementations and ecosystem

Coqui TTS toolkit

The reference implementation lives in the Coqui TTS library, originally at github.com/coqui-ai/TTS and now actively maintained at github.com/idiap/coqui-ai-TTS.^[6]^[7] The library is the same toolkit that also implements Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of Bark, and that previously hosted around 1,100 Fairseq Massively Multilingual Speech (MMS) models for low-resource languages.^[6] After the Idiap fork the new PyPI distribution is named coqui-tts, separate from the original TTS package.^[7] Public release notes from the Idiap fork track new features such as cached cloned-voice latents added in v0.27.0 and continued maintenance into early 2026.^[7]
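A minimal usage sketch of the toolkit's high-level Python API, assuming the package is installed (for example as `coqui-tts`) and that the fork keeps the `TTS.api` interface of the original library. The wrapper name `clone_to_file` and the up-front language check are additions for the example; the language set is the one listed on the v2 model card. The import is deferred so that the validation logic can run without the library present.

```python
from typing import Set

# Language codes listed for XTTS v2.
SUPPORTED_LANGUAGES: Set[str] = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_to_file(text: str, reference_wav: str, language: str,
                  out_path: str, device: str = "cpu") -> str:
    """Synthesize `text` in the voice of `reference_wav` and write a WAV file."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language}")
    # Deferred import: requires the Coqui TTS package; the model weights
    # are downloaded on first use.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME).to(device)
    tts.tts_to_file(text=text, speaker_wav=reference_wav,
                    language=language, file_path=out_path)
    return out_path
```

A call such as `clone_to_file("Hello there.", "reference.wav", "en", "out.wav", device="cuda")` would then write the cloned speech to `out.wav`, subject to the license terms of the weights.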

Hugging Face

The canonical model weights are published as coqui/XTTS-v2 on Hugging Face, with coqui/XTTS-v1 retained for the earlier release.^[1]^[15] As of May 2026 the v2 page reports millions of monthly downloads, making it one of the most-downloaded open-weight speech models on the platform.^[1] A demo Space at huggingface.co/spaces/coqui/xtts provides an interactive playground for cloning a voice from an uploaded reference clip.^[1]

Fine-tuning and adapters

The Coqui TTS library ships fine-tuning recipes for the GPT portion of XTTS, including a Gradio-based interface that walks through dataset preparation, GPT fine-tuning, and inference with the fine-tuned weights.^[2] The official documentation reports that adapting XTTS to a new voice or stylistic register typically requires around ten minutes of clean speech from the target speaker.^[2]^[15] Because the HiFi-GAN-style decoder is shared across speakers, fine-tuning is generally restricted to the transformer block.

Third-party hosting and integrations

Independent vendors have packaged XTTS as a managed inference endpoint, including Baseten and Eachlabs, both of which document the streaming endpoint and a real-time factor of roughly 0.3 on consumer-grade GPUs.^[10]^[19] The model has also been wrapped by community projects for use inside open-source pipelines such as local AI agents, developer-tool integrations, and audiobook generators. Vendors and forum guides routinely describe XTTS v2 as the leading open-weight option for cross-lingual voice cloning available in 2026, alongside newer entrants such as F5-TTS and various flow-matching systems.^[20]^[21]
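For readers unfamiliar with the metric, real-time factor (RTF) is the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. The helper below is a sketch of that arithmetic; the function names are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


# At an RTF of 0.3, ten seconds of speech take about three seconds to generate.
estimate = synthesis_time(10.0, 0.3)
```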

Adoption and applications

XTTS gained traction in the months between its September 2023 release and the Coqui shutdown because it combined three properties that were rare in 2023: zero-shot voice cloning that produced recognizable output from a short reference, broad multilingual coverage, and openly downloadable weights.^[3]^[1] Common deployment scenarios documented in third-party guides and product write-ups include localized voice cloning for content creators, dubbing and accessibility audio for video games, audiobook narration in languages that lack established commercial voices, real-time agentic interactions in voice-enabled assistants, and personal-use voice synthesis where uploading speech to a commercial API is undesirable.^[20]^[22]

The model is also frequently used as a baseline or comparison target in academic and industrial work on neural TTS, where it serves as the canonical open-weight zero-shot multilingual baseline alongside VALL-E-style systems.^[9]

Limitations and criticisms

License and the post-shutdown problem

The defining practical limitation of XTTS in 2026 is licensing. The weights ship under the Coqui Public Model License 1.0.0, which explicitly forbids any "direct or indirect payment arising from the use of the model or its output," prohibits using XTTS to train other models for commercial use, and reserves to Coqui the right to grant separate commercial licenses.^[11] While Coqui operated, that separate commercial license existed and was reportedly priced at modest annual fees for small companies.^[5] After the January 2024 shutdown no party is in a position to sell or sign a CPML commercial license, leaving an unresolved gap in which the technical capability to use XTTS commercially exists, but the legal mechanism is dormant.^[5]^[11]^[17] Independent commentary has warned would-be commercial users that the absence of an active rights holder does not in itself create a permissive license, and several commentators have recommended switching to alternative models with permissive licenses for production deployments.^[5]^[21]

Quality and robustness

In comparative evaluations XTTS v2 is consistently described as a strong open-weight system whose subjective quality approaches but does not quite match the best proprietary services such as ElevenLabs.^[20]^[21] Reported failure modes include occasional mispronunciations on rare proper nouns, prosody drift on very long utterances, and quality degradation when the reference clip is noisy or contains overlapping speech.^[21]^[22] Newer open systems released after XTTS v2 (such as F5-TTS, OpenVoice v2, and various flow-matching architectures) have begun to outperform it on individual axes such as real-time factor and time to first audio chunk, even when overall naturalness remains close.^[21]

Safety and abuse

Like all expressive zero-shot voice cloning systems, XTTS can be used to fabricate convincing audio of real people without their consent, which is the central abuse vector that motivated regulators and platform operators to focus on voice deepfakes from 2023 onwards. The CPML does not contain a use-based behavioral clause comparable to RAIL-style licenses; it controls commercial use rather than misuse, leaving deepfake risk mitigation to downstream users and platforms.^[11] Coqui acknowledged this risk in their original September 2023 announcement by foregrounding research and creative use cases and by emphasizing the non-commercial license as a friction layer against large-scale abuse.^[3]

Maintenance and forward compatibility

Because the original coqui-ai/TTS repository has not shipped a release since December 2023, users who depend on XTTS in long-lived stacks must rely on the Idiap community fork or maintain their own forks.^[6]^[7] The Idiap fork has so far tracked Python and PyTorch ecosystem changes (including new CUDA versions and updated DeepSpeed builds), but its long-term governance depends on a single research institution rather than a commercial sponsor.^[7]

Comparison with related systems

System	Released	Open weights?	Languages	Zero-shot voice cloning	License
Tortoise TTS	2022	Yes (Apache 2.0)	English only	Yes	Apache 2.0^[14]
XTTS v2	2023	Yes	17	Yes, ~6s ref clip	CPML (non-commercial)^[1]^[11]
Bark	2023	Yes	Several (incl. non-speech sounds)	Limited (preset speakers)	MIT^[6]
VALL-E	2023	No (Microsoft Research paper)	Originally English; VALL-E X added cross-lingual	Yes, 3s ref clip	Closed^[13]
ElevenLabs	2022-	No (proprietary API)	32+	Yes, paid API	Commercial^[20]
F5-TTS	2024	Yes	Several	Yes	More permissive^[21]

XTTS occupies a specific niche in this landscape: it has broader language coverage than the English-only Tortoise and than Bark, openly published weights unlike VALL-E and ElevenLabs, and a stricter license than Tortoise or F5-TTS. The Interspeech 2024 paper positions XTTS as "the first massively multilingual ZS-TTS model supporting low/medium resource languages" and credits the perceiver resampler and the unified BPE vocabulary as the components that make this scale of multilingual coverage tractable.^[9]

References

1. Coqui AI, "XTTS-v2 model card", Hugging Face, 2023-11. https://huggingface.co/coqui/XTTS-v2. Accessed 2026-05-20.
2. Coqui AI, "ⓍTTS - TTS 0.22.0 documentation", Coqui Docs, 2023-12-12. https://docs.coqui.ai/en/latest/models/xtts.html. Accessed 2026-05-20.
3. Coqui AI and Hugging Face, "Coqui and Hugging Face Partner to Revolutionize Voice AI with New Open-Access XTTS Model", NewsFileCorp press release, 2023-09-30. https://www.newsfilecorp.com/release/182483/Coqui-and-Hugging-Face-Partner-to-Revolutionize-Voice-AI-with-New-OpenAccess-XTTS-Model. Accessed 2026-05-20.
4. Coqui AI, "XTTS model documentation (dev branch)", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS/blob/dev/docs/source/models/xtts.md. Accessed 2026-05-20.
5. AI Models Blog, "Coqui XTTS and the Coqui Public Model License: A Close Look at Non-Commercial Use and Copyright Law", aimodels.org, 2024-08-15. https://aimodels.org/ai-blog/coqui-xtts-license-cpml-open-source/. Accessed 2026-05-20.
6. Coqui AI, "coqui-ai/TTS repository", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS. Accessed 2026-05-20.
7. Idiap Research Institute, "idiap/coqui-ai-TTS (community fork)", GitHub, 2026-01-26. https://github.com/idiap/coqui-ai-TTS. Accessed 2026-05-20.
8. Coqui AI, "Releases - coqui-ai/TTS (v0.22.0)", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS/releases. Accessed 2026-05-20.
9. Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber, "XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model", arXiv:2406.04904 (Interspeech 2024), 2024-06-07. https://arxiv.org/abs/2406.04904. Accessed 2026-05-20.
10. Baseten, "Streaming real-time text to speech with XTTS V2", Baseten Blog, 2024-02-20. https://www.baseten.co/blog/streaming-real-time-text-to-speech-with-xtts-v2/. Accessed 2026-05-20.
11. Coqui AI, "Coqui Public Model License 1.0.0 (LICENSE.txt)", Hugging Face, 2023-09-26. https://huggingface.co/coqui/XTTS-v2/blob/main/LICENSE.txt. Accessed 2026-05-20.
12. DeClom, "Coqui AI: A Post-Mortem on the Speech Tech Startup", DeClom Blog, 2024-02-10. https://declom.com/coqui. Accessed 2026-05-20.
13. Chengyi Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)", arXiv:2301.02111, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-05-20.
14. James Betker, "neonbjb/tortoise-tts repository", GitHub, 2022-04-28. https://github.com/neonbjb/tortoise-tts. Accessed 2026-05-20.
15. Coqui AI, "XTTS-v1 model card", Hugging Face, 2023-09-30. https://huggingface.co/coqui/XTTS-v1. Accessed 2026-05-20.
16. Coqui AI community, "Coqui is shutting down (discussion #3489)", GitHub Discussions, 2024-01-03. https://github.com/coqui-ai/TTS/discussions/3489. Accessed 2026-05-20.
17. fakerybakery, "XTTS License After Shutdown (issue #3490)", GitHub Issues, 2024-01-04. https://github.com/coqui-ai/TTS/issues/3490. Accessed 2026-05-20.
18. DeepWiki, "XTTS Model (coqui-ai/TTS section 4.2)", deepwiki.com, 2024-09-14. https://deepwiki.com/coqui-ai/TTS/4.2-xtts-model. Accessed 2026-05-20.
19. Eachlabs, "XTTS AI Model", eachlabs.ai, 2024-05-12. https://www.eachlabs.ai/coqui/xtts/xtts-v2. Accessed 2026-05-20.
20. Local AI Master, "XTTS v2 Voice Cloning Guide (2026): Coqui TTS for 17 Languages", localaimaster.com, 2026-02-04. https://localaimaster.com/blog/xtts-v2-voice-cloning-guide. Accessed 2026-05-20.
21. FindSkill.ai, "Best Open-Source TTS in 2026: 5 Models, Ranked by Quality", findskill.ai, 2026-03-18. https://findskill.ai/blog/best-open-source-tts-2026/. Accessed 2026-05-20.
22. BentoML, "The Best Open-Source Text-to-Speech Models in 2026", bentoml.com, 2026-01-22. https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models. Accessed 2026-05-20.
