# XTTS (Coqui XTTS)

> Source: https://aiwiki.ai/wiki/xtts
> Updated: 2026-06-09
> Categories: Open Source AI, Speech & Audio AI, Voice AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**XTTS** (sometimes stylized ⓍTTS, short for "cross-lingual text-to-speech") is an open-weights multilingual text-to-speech model developed by Coqui AI that performs zero-shot voice cloning from short reference audio prompts of roughly six seconds.[^1][^2] The model couples a [GPT-style](/wiki/gpt-2) autoregressive [transformer](/wiki/transformer) over discrete speech tokens with a HiFi-GAN-style decoder, and it was first released to the public on 30 September 2023 in collaboration with [Hugging Face](/wiki/hugging_face).[^3] A refined second release, XTTS v2, followed in November 2023 and extended language coverage to seventeen languages (English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and later Hindi).[^1][^4] The weights are distributed under the non-commercial Coqui Public Model License (CPML), and although Coqui AI shut down in January 2024, the model and its training toolkit remain widely used through community forks on [Hugging Face](/wiki/hugging_face) and GitHub.[^5][^6][^7]

## Infobox

| Field | Value |
|---|---|
| Developer | Coqui AI |
| Initial release | XTTS v1, 30 September 2023[^3] |
| Latest release | XTTS v2 (model card finalized November 2023; minor patch releases in December 2023)[^1][^8] |
| Parameters | ~750M (commonly reported model size)[^9] |
| Sample rate | 24 kHz output; 22 kHz conditioning input[^2] |
| Reference clip length | ~6 seconds for voice cloning[^1] |
| Streaming latency | ~200 ms round trip to first chunk, <100 ms inference on GPU[^10] |
| Languages (v2) | 17: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi[^1] |
| License | Coqui Public Model License 1.0.0 (non-commercial; commercial license was sold separately while Coqui operated)[^5][^11] |
| Code repository | github.com/coqui-ai/TTS (original, unmaintained) and github.com/idiap/coqui-ai-TTS (community fork)[^6][^7] |
| Model weights | huggingface.co/coqui/XTTS-v2[^1] |

## History

### Coqui AI and the road to XTTS

Coqui AI was founded in 2021 by alumni of Mozilla's machine learning group, including Josh Meyer, Eren Gölge, Reuben Morais, and Kelly Davis.[^12][^7] The company started by maintaining and expanding the open-source speech research stack that some of the same engineers had begun building inside Mozilla under the Common Voice and Mozilla TTS projects. The flagship public artifact of this effort was the `coqui-ai/TTS` library on GitHub, a [PyTorch](/wiki/pytorch) toolkit covering Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of [Suno](/wiki/suno)'s Bark, distributed under the MPL-2.0 license.[^6] By late 2023 the repository carried tens of thousands of GitHub stars and the company was offering a commercial product, Coqui Studio, layered on top of the same models.[^6][^12]

Coqui's research direction in 2022 and 2023 shifted toward zero-shot, multilingual, language-model-style speech synthesis, in line with broader industry developments such as [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e), [SoundStream](/wiki/soundstream), and the Tortoise-TTS system that James Betker had released on GitHub in 2022.[^13][^14] Tortoise itself trained a [GPT-style](/wiki/gpt-2) autoregressive model to predict mel-spectrogram codebook tokens and then decoded those tokens with a diffusion model and a vocoder; it became the conceptual template that Coqui's team extended into XTTS.[^14]

### XTTS v1 (September 2023)

XTTS v1 was unveiled on 30 September 2023 through a joint announcement with [Hugging Face](/wiki/hugging_face).[^3] At launch the model supported thirteen languages: English, Spanish, French, German, Italian, Brazilian Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, and Mandarin Chinese.[^3] The original announcement framed XTTS as "the first generative voice AI foundation model" trained for cross-lingual cloning, and it advertised the ability to clone a speaker's voice from a short audio sample of a few seconds and have that voice speak in any of the thirteen supported languages.[^3] The official model card on [Hugging Face](/wiki/hugging_face) confirmed the 24 kHz output sampling rate, the six-second reference-clip requirement, and the use of the Coqui Public Model License for the weights.[^15] In the days after release the repository was reported as the top trending project on GitHub and the top trending Space on [Hugging Face](/wiki/hugging_face).[^3]

A Japanese language code was added to the same v1 family shortly after the initial release, bringing the early count to fourteen languages on the v1 model card.[^15]

The launch positioning was deliberately framed against the proprietary voice cloning incumbents. Coqui's announcement copy described XTTS as a "foundation model for generative voice" and contrasted the six-second reference-clip requirement with the multi-minute or multi-hour data requirements typical of earlier speaker-adaptive TTS systems.[^3] Coqui co-founder Joshua Meyer wrote in the press release that the partnership with [Hugging Face](/wiki/hugging_face) was intended to ensure that the new foundation model was widely accessible to researchers and hobbyists, and Hugging Face CTO Julien Chaumond described the launch as part of an ongoing collaboration between the two companies on speech models.[^3] In the public discussion that followed the release, developers reported success cloning their own voices from clips of three to ten seconds and demonstrated cross-lingual transfer of English reference speakers into Spanish, German, and Mandarin.[^15][^22]

### XTTS v2 (November 2023)

XTTS v2 was released about two months after v1. The v2 model card adds Hungarian and Korean (bringing the total to sixteen languages at launch), along with revised speaker conditioning that supports multiple reference clips and interpolation between reference speakers, generally improved stability, and reportedly better prosody and audio quality across languages.[^1] Patch releases continued through December 2023; version 2.0.2 of the v2 weights, shipped alongside Coqui TTS library v0.22.0 on 12 December 2023, is the last release published under the original Coqui organization.[^8] An additional Hindi language code was rolled into the model card after launch, taking the language count to seventeen as listed on the current Hugging Face page.[^1]

The interpolation-style speaker conditioning in v2 is exposed in the Coqui TTS API as a `get_conditioning_latents()` function that returns two tensors per reference: a GPT conditioning latent that biases the autoregressive token predictor and a speaker embedding consumed by the decoder.[^2] Multiple reference files can be passed at once, with the cost of inference unchanged because the latents and embeddings are aggregated before generation begins.[^2]
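The aggregation step can be illustrated with a minimal sketch. It assumes that per-clip embeddings are combined by an element-wise mean, which is only one plausible reading of "aggregated"; the function name `aggregate_embeddings` and the toy vectors are illustrative and not part of the Coqui TTS API.

```python
from statistics import fmean
from typing import Sequence


def aggregate_embeddings(per_clip: Sequence[Sequence[float]]) -> list[float]:
    """Average one embedding per reference clip into a single vector.

    Assumption: aggregation is an element-wise mean. The result has the
    same size as a single-clip embedding, so the cost of generation does
    not depend on how many reference clips were supplied.
    """
    if not per_clip:
        raise ValueError("at least one reference embedding is required")
    dim = len(per_clip[0])
    if any(len(vec) != dim for vec in per_clip):
        raise ValueError("all embeddings must have the same dimension")
    return [fmean(vec[i] for vec in per_clip) for i in range(dim)]


# Three hypothetical reference clips, each already encoded to 4 dimensions.
clips = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 3.0],
    [4.0, 1.0, 4.0, 3.0],
]
speaker_embedding = aggregate_embeddings(clips)
print(speaker_embedding)  # [2.0, 1.0, 2.0, 3.0]
```

Whatever the exact pooling rule, the point is that the generator sees one fixed-size conditioning vector regardless of how many clips went in.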

Beyond the new languages and conditioning mechanism, the v2 model card lists "stability improvements" and "better prosody and audio quality" as the headline improvements.[^1] In practical terms, users reported fewer instances of the autoregressive [transformer](/wiki/transformer) entering pathological loops on long inputs (a behavior that occasionally caused v1 to repeat syllables or stop generating before the input text was exhausted), more consistent emotional intonation across sentences in a paragraph, and somewhat cleaner consonant articulation. These improvements were not accompanied by a corresponding paper at the time of release; the formal academic write-up of the system came later, in the Interspeech 2024 paper that documents the model trained on sixteen languages.[^9]

### The Coqui shutdown

On 3 January 2024 Coqui co-founder Josh Meyer announced that the company was winding down operations, writing "Coqui is shutting down" on social media and pointing the community to the open-source repository as the canonical home for XTTS going forward.[^12][^16] The decision followed a wave of consolidation in the generative-voice market in late 2023, where well-funded incumbents such as [ElevenLabs](/wiki/elevenlabs) dominated paid voice cloning while open-weight alternatives proliferated. The closure of Coqui Studio and the Coqui API followed shortly afterwards, and the official `coqui-ai/TTS` GitHub repository has had no new releases since v0.22.0 on 12 December 2023.[^6][^8]

The shutdown immediately raised questions in the developer community about the future of the XTTS license, with users on the repository's discussion board asking Coqui to relicense the weights under a permissive license such as Apache 2.0 or MIT so that the model could be used commercially without negotiating with a now-defunct company.[^17] No relicensing occurred, and the CPML notice on the [Hugging Face](/wiki/hugging_face) page for `coqui/XTTS-v2` remained in place. The Idiap Research Institute in Switzerland created an actively maintained community fork at `github.com/idiap/coqui-ai-TTS`, publishing a new PyPI package named `coqui-tts` and continuing to ship feature releases through 2025 and into early 2026.[^7]

## How XTTS works

### Overall pipeline

XTTS is an end-to-end zero-shot voice cloning text-to-speech system whose runtime data flow can be summarized in four stages. First, a tokenizer turns input text into BPE-style symbol tokens. Second, a [GPT-style](/wiki/gpt-2) autoregressive transformer predicts a sequence of discrete audio tokens conditioned on those text tokens and on speaker latents extracted from a reference clip. Third, a HiFi-GAN-style decoder converts the predicted token sequence (and a separate speaker embedding) into a 24 kHz waveform. Finally, an optional streaming wrapper exposes the system as a low-latency real-time interface.[^2][^9][^18]

The architecture inherits the broad pattern that James Betker introduced in Tortoise TTS: a transformer language model over discrete speech codes serves as the prosody and phonetic predictor, while a separate non-autoregressive neural network reconstructs the waveform from those codes.[^14][^18] XTTS keeps that two-stage structure but swaps several components and adds explicit multilingual and multi-speaker conditioning.

### Text frontend and tokenizer

Text inputs are normalized per language and tokenized with a byte-pair-encoded (BPE) vocabulary referred to in the code as `VoiceBpeTokenizer`.[^18] The tokenizer is multilingual: a single shared vocabulary handles all of the supported languages, and the model receives a language-id token at the start of each generation so it knows which phonological and prosodic distribution to sample from.[^2][^9] This design is what allows the v2 model to perform cross-language transfer in which a six-second English reference clip is used to synthesize, for example, Korean or Hindi speech in the same speaker's voice.[^1]
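The role of the language-id token can be shown with a toy sketch. It uses a character-level vocabulary as a stand-in for real BPE merges, and the class `ToyMultilingualTokenizer`, the bracketed tag format, and the `[UNK]` fallback are all assumptions made for illustration rather than the behavior of `VoiceBpeTokenizer`.

```python
from typing import Iterable


class ToyMultilingualTokenizer:
    """Character-level stand-in for a shared multilingual BPE vocabulary."""

    UNKNOWN = "[UNK]"

    def __init__(self, languages: Iterable[str], alphabet: str) -> None:
        language_tags = [f"[{code}]" for code in languages]
        symbols = [self.UNKNOWN, *language_tags, *sorted(set(alphabet))]
        # One id space shared by every language, as with a single shared vocabulary.
        self.token_to_id = {symbol: index for index, symbol in enumerate(symbols)}

    def encode(self, text: str, language: str) -> list[int]:
        tag = f"[{language}]"
        if tag not in self.token_to_id:
            raise ValueError(f"unsupported language code: {language!r}")
        unknown_id = self.token_to_id[self.UNKNOWN]
        ids = [self.token_to_id[tag]]  # language-id token comes first
        ids.extend(self.token_to_id.get(ch, unknown_id) for ch in text.lower())
        return ids


tokenizer = ToyMultilingualTokenizer(["en", "ko", "hi"], "abcdefghijklmnopqrstuvwxyz ")
english = tokenizer.encode("hello", "en")
korean = tokenizer.encode("hello", "ko")
# Same text tokens, different leading language-id token.
print(english[0] != korean[0], english[1:] == korean[1:])  # True True
```

Because only the leading token changes, the speaker conditioning stays independent of the target language, which is the property cross-language cloning relies on.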

### GPT-style autoregressive token predictor

The core of XTTS is a decoder-only [transformer](/wiki/transformer) modeled on the [GPT-2](/wiki/gpt-2) architecture and configured for speech.[^18] The transformer's vocabulary contains a mixture of text BPE tokens (consumed as context) and discrete audio tokens (the output target). During training the audio tokens come from a discrete variational autoencoder (a DVAE, conceptually equivalent to a VQ-VAE) that compresses 24 kHz mel-spectrograms into a small codebook; the [transformer](/wiki/transformer) then learns to predict the next audio token given the preceding text tokens, language id, and audio tokens already emitted.[^18][^9] This is the same pattern as Tortoise and as autoregressive neural-codec language models such as [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e) and AudioLM, with the practical difference that the codebook in XTTS encodes mel-spectrogram patches rather than [EnCodec](/wiki/encodec) residual codes.[^14][^18]
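The next-token loop described above can be sketched without any neural network. In the example below `toy_model` is a hypothetical stand-in that returns sampling weights over a tiny codebook; the stop-token convention, the codebook size, and the rule that forces a stop after a fixed length are assumptions chosen only to keep the sketch deterministic in length.

```python
import random
from typing import Callable, Sequence

STOP_TOKEN = 0      # hypothetical id reserved for "end of audio"
CODEBOOK_SIZE = 8   # hypothetical; real codebooks are much larger


def toy_model(prefix: Sequence[int], audio: Sequence[int]) -> list[float]:
    """Stand-in predictor: returns one sampling weight per codebook entry."""
    target_length = 2 * (len(prefix) - 1)  # two audio tokens per text token
    if len(audio) >= target_length:
        return [1.0] + [0.0] * (CODEBOOK_SIZE - 1)  # force the stop token
    return [0.0] + [1.0] * (CODEBOOK_SIZE - 1)      # any non-stop token


def generate_audio_tokens(
    text_tokens: Sequence[int],
    language_token: int,
    next_token_weights: Callable[[Sequence[int], Sequence[int]], list[float]],
    max_new_tokens: int = 50,
    seed: int = 0,
) -> list[int]:
    rng = random.Random(seed)
    prefix = [language_token, *text_tokens]  # context the model conditions on
    audio: list[int] = []
    for _ in range(max_new_tokens):
        weights = next_token_weights(prefix, audio)
        token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if token == STOP_TOKEN:
            break
        audio.append(token)  # each emitted token becomes context for the next
    return audio


tokens = generate_audio_tokens([5, 6, 7], language_token=1, next_token_weights=toy_model)
print(len(tokens))  # 6
```

The `max_new_tokens` cap is the kind of guard that keeps a sampler from looping indefinitely if the stop token is never drawn.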

The XTTS v2 paper and accompanying documentation also describe a perceiver resampler that maps the variable-length speaker reference into a fixed-size set of conditioning vectors prepended to the [transformer](/wiki/transformer) input.[^9][^18] This is the mechanism that turns a several-second reference clip into a stable speaker prior even when the clip has different length and prosody from the target utterance, and it is one of the components that the v2 release explicitly changed relative to v1.[^9]
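The idea of mapping a variable-length reference to a fixed number of conditioning vectors can be sketched with plain cross-attention. The version below is heavily simplified and is not the XTTS implementation: it uses a single head, no learned key or value projections, and a single layer, and the sizes `DIM` and `NUM_LATENTS` are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(frames, queries):
    """Cross-attend a fixed set of query vectors over variable-length frames.

    Simplification: single head, no learned key/value projections, one layer.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    latents = []
    for query in queries:
        scores = [scale * sum(q * f for q, f in zip(query, frame)) for frame in frames]
        weights = softmax(scores)
        latents.append(
            [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]
        )
    return latents


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # hypothetical sizes


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_LATENTS)  # would be learned parameters in a real model
short_clip = random_vectors(30)        # features from a shorter reference
long_clip = random_vectors(300)        # features from a longer reference

print(len(resample(short_clip, queries)), len(resample(long_clip, queries)))  # 4 4
```

The output length is set by the number of queries, not by the clip, which is why the transformer input has the same shape for any reference duration.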

### HiFi-GAN-style decoder

The predicted audio tokens are converted back into a 24 kHz waveform by a HiFi-GAN-style decoder.[^18] The decoder is conditioned on a speaker embedding extracted from the reference clip in parallel with the GPT conditioning latents, which lets the timbre of the cloned speaker influence not just the token sequence but also the spectral details of the synthesized waveform.[^2][^18] Third-party documentation of XTTS internals reports that the decoder uses a multi-scale generator architecture and is trained with adversarial and reconstruction losses, consistent with the original HiFi-GAN design.[^18]

Older Coqui documentation references diffusion-based decoders inherited from Tortoise; in practice the production XTTS pipeline shipped with the v2 weights uses the HiFi-GAN-style waveform generator described in the v2 paper and code, not a diffusion decoder.[^9][^18]

### Voice cloning conditioning

Voice cloning in XTTS is zero-shot: the model is not fine-tuned per speaker, and adding a new voice does not require gradient updates. Instead, the reference audio is run through a speaker encoder that produces a fixed-length embedding (consumed by the decoder) and through the perceiver resampler that produces the GPT conditioning latent (consumed by the [transformer](/wiki/transformer)).[^2][^9] Both representations can be cached, so repeated inference for the same speaker reuses cached latents and avoids re-encoding the reference clip.[^2] In practice this caching, plus the comparatively small size of the autoregressive token sequence, is what enables the model's roughly 200-millisecond time to first streamed chunk on a single GPU.[^10]
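A minimal sketch of the caching idea follows. The class `LatentCache`, the content-hash key, and the `fake_encode` function are hypothetical; a real deployment would replace `fake_encode` with the model's own conditioning extraction and might key the cache by speaker id instead.

```python
import hashlib


class LatentCache:
    """Memoize (gpt_latent, speaker_embedding) pairs per reference clip."""

    def __init__(self, encode):
        self._encode = encode
        self._store = {}
        self.encoder_calls = 0

    def get(self, reference_audio: bytes):
        # Key on the clip's content so identical audio is encoded only once.
        key = hashlib.sha256(reference_audio).hexdigest()
        if key not in self._store:
            self.encoder_calls += 1
            self._store[key] = self._encode(reference_audio)
        return self._store[key]


def fake_encode(reference_audio: bytes):
    """Stand-in for the real encoders; returns (gpt_latent, speaker_embedding)."""
    digest = hashlib.sha256(reference_audio).digest()
    gpt_latent = [b / 255 for b in digest[:4]]
    speaker_embedding = [b / 255 for b in digest[4:8]]
    return gpt_latent, speaker_embedding


cache = LatentCache(fake_encode)
clip = b"\x00\x01" * 1000  # placeholder bytes standing in for a reference clip
first = cache.get(clip)
second = cache.get(clip)
print(cache.encoder_calls, first is second)  # 1 True
```

Only the first request for a speaker pays the encoding cost; later requests go straight to token generation.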

The official documentation states that XTTS can operate with reference clips as short as three seconds and that quality plateaus around six seconds, with multiple shorter clips often outperforming a single longer one because the model can average over different prosodic states.[^2]

The conditioning pipeline imposes important practical constraints. Because the speaker embedding and the GPT conditioning latent are computed from the raw waveform, the quality of the reference clip dominates the quality of the clone: clean studio-quality recordings produce significantly more faithful clones than telephone-band or noisy field recordings, and clips containing music, overlapping voices, or strong room reverberation can leak those characteristics into the synthesized output.[^2][^20] The documentation recommends using mono 24 kHz reference audio cropped to clean speech segments, and the community fork's v0.27 release added a caching layer specifically to let production systems amortize the cost of cleaning and encoding reference clips across many synthesis requests for the same speaker.[^7]
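The format-level part of these recommendations can be checked before a clip ever reaches the model. The sketch below uses only Python's standard `wave` module; the function name `check_reference_clip`, the default 24 kHz target, and the three-second minimum are taken from the figures quoted in this section or chosen for illustration. A header check of this kind cannot detect noise, music, or reverberation.

```python
import os
import tempfile
import wave


def check_reference_clip(path, expected_rate=24000, min_seconds=3.0):
    """Return a list of format problems found in a WAV header (empty if none)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if seconds < min_seconds:
        problems.append(f"clip is {seconds:.1f} s, shorter than {min_seconds:.1f} s")
    return problems


def write_silence(path, channels, rate, seconds):
    """Write a silent 16-bit PCM WAV file, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)


with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.wav")
    bad = os.path.join(folder, "bad.wav")
    write_silence(good, channels=1, rate=24000, seconds=4)
    write_silence(bad, channels=2, rate=16000, seconds=1)
    print(check_reference_clip(good))      # []
    print(len(check_reference_clip(bad)))  # 3
```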

### Streaming and inference

A separate streaming wrapper in the Coqui TTS library lets the model emit audio chunks as the [transformer](/wiki/transformer) generates tokens, rather than waiting for the full utterance.[^10] On a T4-class GPU the documented round trip time to the first audio chunk is roughly 200 milliseconds, with under 100 milliseconds of that spent inside the [transformer](/wiki/transformer) itself.[^10] The library exposes DeepSpeed-accelerated inference paths for users who want to push throughput higher on commodity hardware.[^2]
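Chunked emission can be sketched as a Python generator that decodes audio as soon as enough tokens have accumulated. Everything here is a stand-in: `fake_decode` replaces the vocoder, `token_source` replaces the transformer, and the chunk size and samples-per-token values are hypothetical.

```python
SAMPLES_PER_TOKEN = 4  # hypothetical; real decoders emit far more samples per token


def fake_decode(tokens):
    """Stand-in vocoder: expands each token into a few identical samples."""
    return [float(tok) for tok in tokens for _ in range(SAMPLES_PER_TOKEN)]


def stream_audio(tokens, decode, chunk_tokens=20):
    """Yield decoded audio every `chunk_tokens` tokens instead of at the end."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_tokens:
            yield decode(buffer)
            buffer = []
    if buffer:  # flush the final partial chunk
        yield decode(buffer)


def token_source(count, produced):
    """Stand-in token generator that records how many tokens it has emitted."""
    for index in range(count):
        produced.append(index)
        yield index


produced = []
stream = stream_audio(token_source(45, produced), fake_decode)
first_chunk = next(stream)
print(len(produced), len(first_chunk))  # 20 80
remaining = [len(chunk) for chunk in stream]
print(remaining)                        # [80, 20]
```

The first chunk is available after only 20 of the 45 tokens have been generated, which is the mechanism behind a short time to first audio even when the full utterance takes longer.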

### Training data

The XTTS paper accepted at Interspeech 2024 describes training on a multilingual corpus covering sixteen languages, including low- and medium-resource languages such as Hungarian, Korean, Czech, and Arabic.[^9] The authors emphasize that XTTS was the first massively multilingual zero-shot TTS to cover this breadth, going beyond earlier multilingual systems such as YourTTS and VALL-E X that supported only a handful of high-resource languages.[^9] Exact per-language hour counts are reported in the paper's experimental section. The released v2 model on [Hugging Face](/wiki/hugging_face) is the production version of this system, with Hindi added through additional training after the paper submission.[^1][^9]

A small number of design choices in the data pipeline are documented in the paper and in third-party reproductions: the audio is resampled to a unified rate before mel-spectrogram extraction, language ids are injected at the start of the [transformer](/wiki/transformer) context window, and the DVAE codebook is trained jointly with the autoregressive model so that the discrete audio token distribution adapts to the multilingual distribution rather than being frozen from a monolingual pretraining stage.[^9][^18] The choice to use a single shared multilingual BPE vocabulary (rather than language-specific tokenizers) was justified in the paper as a way to share representations across related languages, which the authors argue is what allows XTTS to perform reasonably on the lower-resource languages in the training set.[^9]

## Architecture summary table

| Stage | Component | Role |
|---|---|---|
| Frontend | `VoiceBpeTokenizer` (BPE) | Tokenizes multilingual text input[^18] |
| Conditioning extractor | Speaker encoder + perceiver resampler | Produces a speaker embedding and GPT conditioning latents from a ~6s reference clip[^2][^9] |
| Core LM | Decoder-only [transformer](/wiki/transformer) (~GPT-2 style) | Predicts discrete audio tokens autoregressively from text, language id, and speaker latents[^18][^9] |
| Codebook | DVAE / VQ-style audio codebook | Defines the discrete audio token vocabulary that the LM emits[^18][^9] |
| Decoder | HiFi-GAN-style generator | Reconstructs 24 kHz waveform from predicted tokens conditioned on the speaker embedding[^18][^9] |
| Optional | DeepSpeed and streaming wrappers | Lower-latency inference, real-time chunked output[^2][^10] |

## XTTS v1 versus XTTS v2

| Aspect | XTTS v1 | XTTS v2 |
|---|---|---|
| Initial release | 30 September 2023[^3] | November 2023 (model card finalized; v0.22.0 patch on 12 December 2023)[^1][^8] |
| Languages at release | 13 (en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh)[^3] | 16 at launch, 17 with later Hindi addition (adds hu, ko, hi to the v1 set; ja was added to v1 after its release)[^1][^15] |
| Speaker conditioning | Single-reference speaker encoder + Tortoise-style latents[^15] | Perceiver resampler + multi-reference and interpolation support[^1][^9] |
| Audio quality | Production-grade but with reported stability issues on long prompts | "Better prosody and audio quality" plus general stability improvements per the v2 model card[^1] |
| Streaming | Not officially supported at launch | Supported with ~200 ms time to first chunk[^10] |

## Implementations and ecosystem

### Coqui TTS toolkit

The reference implementation lives in the Coqui TTS library, originally at `github.com/coqui-ai/TTS` and now actively maintained at `github.com/idiap/coqui-ai-TTS`.[^6][^7] The library is the same toolkit that also implements Tacotron 2, FastSpeech 2, Glow-TTS, VITS, and a port of Bark, and that previously hosted around 1,100 Fairseq Massively Multilingual Speech (MMS) models for low-resource languages.[^6] After the Idiap fork the new PyPI distribution is named `coqui-tts`, separate from the original `TTS` package.[^7] Public release notes from the Idiap fork track new features such as cached cloned-voice latents added in v0.27.0 and continued maintenance into early 2026.[^7]
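A typical call through the toolkit's high-level Python interface looks roughly like the sketch below. It assumes the `TTS.api.TTS` class, its `tts_to_file` method, and the model identifier string as they appear in the library's published examples; these should be verified against the installed version. The wrapper `clone_to_file` and the language check are illustrative additions, with the language codes taken from the infobox above. The first run downloads the weights and may ask for acceptance of the CPML terms.

```python
XTTS_V2_LANGUAGES = frozenset(
    "en es fr de it pt pl tr ru nl cs ar zh-cn ja hu ko hi".split()
)
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"  # assumed model id


def clone_to_file(text: str, speaker_wav: str, language: str, out_path: str) -> str:
    """Synthesize `text` in the voice of `speaker_wav` and write a WAV file."""
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # Imported lazily so the validation above works without the package installed.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
    )
    return out_path


# Example call (requires the library, the downloaded weights, and a reference clip):
# clone_to_file("Hello from a cloned voice.", "reference.wav", "en", "output.wav")
```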

### Hugging Face

The canonical model weights are published as `coqui/XTTS-v2` on [Hugging Face](/wiki/hugging_face), with `coqui/XTTS-v1` retained for the earlier release.[^1][^15] As of May 2026 the v2 page reports millions of monthly downloads, making it one of the most-downloaded open-weight speech models on the platform.[^1] A demo Space at `huggingface.co/spaces/coqui/xtts` provides an interactive playground for cloning a voice from an uploaded reference clip.[^1]

### Fine-tuning and adapters

The Coqui TTS library ships fine-tuning recipes for the GPT portion of XTTS, including a Gradio-based interface that walks through dataset preparation, GPT fine-tuning, and inference with the fine-tuned weights.[^2] The official documentation reports that adapting XTTS to a new voice or stylistic register typically requires around ten minutes of clean speech from the target speaker.[^2][^15] Because the HiFi-GAN-style decoder is shared across speakers, fine-tuning is generally restricted to the [transformer](/wiki/transformer) block.

### Third-party hosting and integrations

Independent vendors have packaged XTTS as a managed inference endpoint, including Baseten and Eachlabs, both of which document the streaming endpoint and a real-time factor of roughly 0.3 on consumer-grade GPUs.[^10][^19] The model has also been wrapped by community projects for use inside open-source pipelines such as local AI agents, developer tools, and audiobook generators. Vendors and forum guides routinely describe XTTS v2 as the leading open-weight option for cross-lingual voice cloning available in 2026, alongside newer entrants such as [F5-TTS](/wiki/f5_tts) and various flow-matching systems.[^20][^21]
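The real-time factor quoted above is the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. A short worked example, with helper names chosen here for illustration:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Time needed to synthesize a clip of the given length at a given RTF."""
    return audio_seconds * rtf


# At an RTF of about 0.3, ten seconds of speech takes about three seconds to generate.
print(round(estimated_synthesis_seconds(10.0, 0.3), 2))  # 3.0
print(round(real_time_factor(3.0, 10.0), 2))             # 0.3
```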

## Adoption and applications

XTTS gained traction in the months between its September 2023 release and the Coqui shutdown because it combined three properties that were rare in 2023: zero-shot voice cloning that produced recognizable output from a short reference, broad multilingual coverage, and openly downloadable weights.[^3][^1] Common deployment scenarios documented in third-party guides and product write-ups include localized [voice cloning](/wiki/voice_cloning) for content creators, dubbing and accessibility audio for video games, audiobook narration in languages that lack established commercial voices, real-time agentic interactions in voice-enabled assistants, and personal-use voice synthesis where uploading speech to a commercial API is undesirable.[^20][^22]

The model is also frequently used as a baseline or comparison target in academic and industrial work on neural TTS, where it serves as the canonical open-weight zero-shot multilingual baseline alongside [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e)-style systems.[^9]

## Limitations and criticisms

### License and the post-shutdown problem

The defining practical limitation of XTTS in 2026 is licensing. The weights ship under the Coqui Public Model License 1.0.0, which explicitly forbids any "direct or indirect payment arising from the use of the model or its output," prohibits using XTTS to train other models for commercial use, and reserves the right to grant separate commercial licenses to Coqui.[^11] While Coqui operated, that separate commercial license existed and was reportedly priced at modest annual fees for small companies.[^5] After the January 2024 shutdown no party is in a position to sell or sign a CPML commercial license, leaving an unresolved gap in which the technical capability to use XTTS commercially exists, but the legal mechanism is dormant.[^5][^11][^17] Independent commentary has warned would-be commercial users that the absence of an active rights holder does not in itself create a permissive license, and several have recommended switching to alternative models with permissive licenses for production deployments.[^5][^21]

### Quality and robustness

In comparative evaluations XTTS v2 is consistently described as a strong open-weight system whose subjective quality approaches but does not quite match the best proprietary services such as [ElevenLabs](/wiki/elevenlabs).[^20][^21] Reported failure modes include occasional mispronunciations on rare proper nouns, prosody drift on very long utterances, and quality degradation when the reference clip is noisy or contains overlapping speech.[^21][^22] Newer open systems released after XTTS v2 (such as [F5-TTS](/wiki/f5_tts), OpenVoice v2, and various flow-matching architectures) have begun to outperform it on individual axes such as real-time factor and time to first audio chunk, even when overall naturalness remains close.[^21]

### Safety and abuse

Like all expressive zero-shot voice cloning systems, XTTS can be used to fabricate convincing audio of real people without their consent, which is the central abuse vector that motivated regulators and platform operators to focus on voice [deepfakes](/wiki/deepfake) from 2023 onwards. The CPML does not contain a use-based behavioral clause comparable to RAIL-style licenses; it controls commercial use rather than misuse, leaving deepfake risk mitigation to downstream users and platforms.[^11] Coqui acknowledged this risk in their original September 2023 announcement by foregrounding research and creative use cases and by emphasizing the non-commercial license as a friction layer against large-scale abuse.[^3]

### Maintenance and forward compatibility

Because the original `coqui-ai/TTS` repository has not shipped a release since December 2023, users who depend on XTTS in long-lived stacks must rely on the Idiap community fork or maintain their own forks.[^6][^7] The Idiap fork has so far tracked Python and [PyTorch](/wiki/pytorch) ecosystem changes (including new CUDA versions and updated DeepSpeed builds), but its long-term governance depends on a single research institution rather than a commercial sponsor.[^7]

## Comparison with related systems

| System | Released | Open weights? | Languages | Zero-shot voice cloning | License |
|---|---|---|---|---|---|
| Tortoise TTS | 2022 | Yes (Apache 2.0) | English only | Yes | Apache 2.0[^14] |
| XTTS v2 | 2023 | Yes | 17 | Yes, ~6s ref clip | CPML (non-commercial)[^1][^11] |
| Bark | 2023 | Yes | Several (incl. non-speech sounds) | Limited (preset speakers) | MIT[^6] |
| [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e) | 2023 | No (Microsoft Research paper) | Originally English; VALL-E X added cross-lingual | Yes, 3s ref clip | Closed[^13] |
| [ElevenLabs](/wiki/elevenlabs) | 2022- | No (proprietary API) | 32+ | Yes, paid API | Commercial[^20] |
| [F5-TTS](/wiki/f5_tts) | 2024 | Yes | Several | Yes | More permissive than CPML[^21] |

XTTS occupies a specific niche in this landscape: it has broader language coverage than English-only Tortoise and than Bark, openly published weights unlike [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e) and [ElevenLabs](/wiki/elevenlabs), and a stricter license than Tortoise or [F5-TTS](/wiki/f5_tts). The Interspeech 2024 paper positions XTTS as "the first massively multilingual ZS-TTS model supporting low/medium resource languages" and credits the perceiver resampler and the unified BPE vocabulary as the components that make this scale of multilingual coverage tractable.[^9]

## See also

- [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e)
- [ElevenLabs](/wiki/elevenlabs)
- [F5-TTS](/wiki/f5_tts)
- [Voice cloning](/wiki/voice_cloning)
- [Deepfake](/wiki/deepfake)
- [EnCodec](/wiki/encodec)
- [SoundStream](/wiki/soundstream)
- [GPT-2](/wiki/gpt-2)
- [Transformer](/wiki/transformer)
- [Hugging Face](/wiki/hugging_face)
- [PyTorch](/wiki/pytorch)
- [Suno](/wiki/suno)

## References

[^1]: Coqui AI, "XTTS-v2 model card", Hugging Face, 2023-11. https://huggingface.co/coqui/XTTS-v2. Accessed 2026-05-20.
[^2]: Coqui AI, "ⓍTTS - TTS 0.22.0 documentation", Coqui Docs, 2023-12-12. https://docs.coqui.ai/en/latest/models/xtts.html. Accessed 2026-05-20.
[^3]: Coqui AI and Hugging Face, "Coqui and Hugging Face Partner to Revolutionize Voice AI with New Open-Access XTTS Model", NewsFileCorp press release, 2023-09-30. https://www.newsfilecorp.com/release/182483/Coqui-and-Hugging-Face-Partner-to-Revolutionize-Voice-AI-with-New-OpenAccess-XTTS-Model. Accessed 2026-05-20.
[^4]: Coqui AI, "XTTS model documentation (dev branch)", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS/blob/dev/docs/source/models/xtts.md. Accessed 2026-05-20.
[^5]: AI Models Blog, "Coqui XTTS and the Coqui Public Model License: A Close Look at Non-Commercial Use and Copyright Law", aimodels.org, 2024-08-15. https://aimodels.org/ai-blog/coqui-xtts-license-cpml-open-source/. Accessed 2026-05-20.
[^6]: Coqui AI, "coqui-ai/TTS repository", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS. Accessed 2026-05-20.
[^7]: Idiap Research Institute, "idiap/coqui-ai-TTS (community fork)", GitHub, 2026-01-26. https://github.com/idiap/coqui-ai-TTS. Accessed 2026-05-20.
[^8]: Coqui AI, "Releases - coqui-ai/TTS (v0.22.0)", GitHub, 2023-12-12. https://github.com/coqui-ai/TTS/releases. Accessed 2026-05-20.
[^9]: Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber, "XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model", arXiv:2406.04904 (Interspeech 2024), 2024-06-07. https://arxiv.org/abs/2406.04904. Accessed 2026-05-20.
[^10]: Baseten, "Streaming real-time text to speech with XTTS V2", Baseten Blog, 2024-02-20. https://www.baseten.co/blog/streaming-real-time-text-to-speech-with-xtts-v2/. Accessed 2026-05-20.
[^11]: Coqui AI, "Coqui Public Model License 1.0.0 (LICENSE.txt)", Hugging Face, 2023-09-26. https://huggingface.co/coqui/XTTS-v2/blob/main/LICENSE.txt. Accessed 2026-05-20.
[^12]: DeClom, "Coqui AI: A Post-Mortem on the Speech Tech Startup", DeClom Blog, 2024-02-10. https://declom.com/coqui. Accessed 2026-05-20.
[^13]: Chengyi Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)", arXiv:2301.02111, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-05-20.
[^14]: James Betker, "neonbjb/tortoise-tts repository", GitHub, 2022-04-28. https://github.com/neonbjb/tortoise-tts. Accessed 2026-05-20.
[^15]: Coqui AI, "XTTS-v1 model card", Hugging Face, 2023-09-30. https://huggingface.co/coqui/XTTS-v1. Accessed 2026-05-20.
[^16]: Coqui AI community, "Coqui is shutting down (discussion #3489)", GitHub Discussions, 2024-01-03. https://github.com/coqui-ai/TTS/discussions/3489. Accessed 2026-05-20.
[^17]: fakerybakery, "XTTS License After Shutdown (issue #3490)", GitHub Issues, 2024-01-04. https://github.com/coqui-ai/TTS/issues/3490. Accessed 2026-05-20.
[^18]: DeepWiki, "XTTS Model (coqui-ai/TTS section 4.2)", deepwiki.com, 2024-09-14. https://deepwiki.com/coqui-ai/TTS/4.2-xtts-model. Accessed 2026-05-20.
[^19]: Eachlabs, "XTTS AI Model", eachlabs.ai, 2024-05-12. https://www.eachlabs.ai/coqui/xtts/xtts-v2. Accessed 2026-05-20.
[^20]: Local AI Master, "XTTS v2 Voice Cloning Guide (2026): Coqui TTS for 17 Languages", localaimaster.com, 2026-02-04. https://localaimaster.com/blog/xtts-v2-voice-cloning-guide. Accessed 2026-05-20.
[^21]: FindSkill.ai, "Best Open-Source TTS in 2026: 5 Models, Ranked by Quality", findskill.ai, 2026-03-18. https://findskill.ai/blog/best-open-source-tts-2026/. Accessed 2026-05-20.
[^22]: BentoML, "The Best Open-Source Text-to-Speech Models in 2026", bentoml.com, 2026-01-22. https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models. Accessed 2026-05-20.

