F5-TTS

Updated Jun 28, 2026

F5-TTS (short for "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching") is an open-source text-to-speech and zero-shot voice cloning model released in October 2024 by researchers from Shanghai Jiao Tong University's X-LANCE Lab, the University of Cambridge, and the Geely Automobile Research Institute.^[1] It is a fully non-autoregressive synthesizer built on flow matching with a Diffusion Transformer (DiT) backbone and a ConvNeXt V2 module that refines text embeddings so they align with mel-spectrogram frames by simple zero-padding, removing the duration model, phoneme aligner, and separate text encoder used by earlier systems.^[2] Trained on a public 100,000-hour multilingual (English and Mandarin) dataset, the 335.8-million-parameter model performs zero-shot voice cloning from a reference clip of only a few seconds, supports code-switching, and reaches an inference real-time factor (RTF) of about 0.15 using an inference-time technique the authors call "Sway Sampling".^[1]^[2] The code is released under the MIT License, while the pretrained checkpoints are governed by a Creative Commons CC-BY-NC license that inherits restrictions from the Emilia training corpus.^[3]^[4] After its release on Hugging Face and GitHub, F5-TTS became one of the most-starred open-source speech projects of 2024 and 2025, accumulating roughly 14,800 GitHub stars and triggering a wave of community fine-tunes and a reinforcement-learning successor, F5R-TTS.^[3]^[5]

Field	Value
Full name	F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Type	Non-autoregressive zero-shot TTS
Backbone	Diffusion Transformer (DiT) with ConvNeXt V2 text refiner
Vocoder	Vocos (pretrained, 24 kHz)
Parameters	335.8 million (base)
Training data	Emilia (~95K hours English and Mandarin after filtering, from a public 100K-hour corpus)
Audio features	24 kHz sampling rate, 256-sample hop length, 100 mel bins
Inference RTF	~0.15 on A100 GPU
arXiv preprint	2410.06885 (first posted 9 October 2024)
ACL 2025 paper	Long Papers volume, pages 6255-6271
Code license	MIT
Checkpoint license	CC-BY-NC-4.0
GitHub stars	~14,800 (mid-2025)
Official repository	github.com/SWivid/F5-TTS

What is F5-TTS?

F5-TTS is a non-autoregressive, flow-matching text-to-speech model that clones an unseen speaker's voice from a few seconds of reference audio and synthesizes new speech in that voice, without any per-speaker training. Its central design claim is radical simplicity: it drops the duration predictor, phoneme aligner, and dedicated text encoder that comparable diffusion-based systems rely on, and instead pads the input text with filler tokens to the length of the target speech before denoising. The paper's abstract states it directly: "F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation."^[1] The authors report that the design "allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models" and that, "trained on a public 100K hours multilingual dataset," F5-TTS "exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency."^[1]

When was F5-TTS released and who built it?

By 2024, zero-shot voice cloning had become a focal point of speech research. A wave of generative TTS systems showed that a network trained on diverse multilingual audio could imitate an unseen speaker after hearing only a short reference clip. Two contrasting design philosophies emerged. One descended from VALL-E and its kin, which framed TTS as next-token prediction over neural-codec tokens; these autoregressive codec language models inherited the slow inference of large language models and were prone to occasional hallucinations of repeated or skipped phonemes.^[6] The other philosophy used continuous-time diffusion or flow-matching objectives over mel-spectrogram frames, which were attractive for their parallel decoding but often required carefully designed duration models, phoneme aligners, and text encoders. Microsoft's Voicebox (2023) and the closely related E2 TTS (2024) explored the second route. E2 TTS in particular showed that a single Flat-UNet Transformer trained with the conditional flow-matching objective on text-guided speech infilling could match autoregressive quality without an explicit duration predictor, if the text input was padded with filler tokens to the length of the target speech.^[2]

F5-TTS, posted on arXiv on 9 October 2024, was built directly on top of the E2 TTS recipe and aimed to remove what its authors saw as two remaining inefficiencies. The first was the slow convergence of E2 TTS, which the authors attributed to the difficulty the network had in learning to align padded characters with speech frames using a flat trunk that was identical for text and audio. The second was the high number of function evaluations needed for high-quality sampling, which made diffusion-style systems expensive at inference time.^[2] The fixes described in the paper were to (a) refine the text representation with a small ConvNeXt V2 module before concatenating it with the masked mel-spectrogram, (b) replace the flat trunk with a Diffusion Transformer (DiT) trunk equipped with adaptive Layer Norm and rotary position embeddings, and (c) introduce an inference-time non-linear schedule for the flow matching time variable, branded as "Sway Sampling", that prioritizes steps near the noisy end of the trajectory.^[2]

The paper lists eight authors: Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng (Cambridge), Chunhui Wang and Jian Zhao (Geely), Kai Yu, and corresponding author Xie Chen of Shanghai Jiao Tong University.^[7] The first version of the manuscript was uploaded to arXiv on 9 October 2024, followed by a second on 15 October 2024 and a third on 20 May 2025.^[1] The work was accepted to the Annual Meeting of the Association for Computational Linguistics (ACL) 2025 and appears in the Long Papers volume of the proceedings, pages 6255-6271.^[7] Alongside the paper, the team published model weights on Hugging Face and Replicate, source code on GitHub, and demos on a dedicated project page. An unofficial Hugging Face Space hosted by community member "mrfakename" became a popular way to try the system without a local installation.^[3]^[8]

How does F5-TTS work?

F5-TTS is, at its highest level, a conditional flow-matching network that learns a velocity field over 100-dimensional log mel-spectrogram features sampled at 24 kHz with a 256-sample hop length, paired with a pretrained Vocos vocoder for waveform reconstruction.^[2] The network parameterizes a probability flow ordinary differential equation (ODE) from Gaussian noise at flow time t=0 to mel features matching the target audio at t=1, conditioned on a text sequence padded to the speech length. The total parameter count is 335.8 million, almost entirely concentrated in the DiT trunk; the ConvNeXt V2 refiner adds a few million parameters more.^[2]

How does text-conditioned flow matching work in F5-TTS?

Following the E2 TTS framing, F5-TTS trains a single network on the text-guided speech-infilling task. During training, the system samples a random mask over 70 to 100 percent of the mel frames of a clean audio clip, replaces the masked region with a noised version interpolated toward Gaussian noise, and asks the network to predict the velocity that points toward the clean features given (a) the unmasked context, (b) the noisy frames, and (c) a text sequence whose length has been padded with filler tokens to match the speech length.^[2] At inference, the entire audio is treated as masked except for a short reference clip provided by the user, and the unmasked region carries the speaker timbre of the reference. The flow ODE is solved by torchdiffeq, with the number of function evaluations (NFE) typically set to 32 for high-quality synthesis and 16 for fast preview.^[2]
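The infilling setup described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's training code: the function names are hypothetical, the mask is assumed to be one contiguous span, and the text branch is omitted so that only the masking, the straight-line interpolation between noise and clean features, and the velocity target are shown.

```python
import numpy as np

def make_infilling_example(mel, rng, min_frac=0.7, max_frac=1.0):
    """Build one flow-matching training example from clean mel frames.

    mel: array of shape (frames, n_mels) holding the clean features (x1).
    Returns the noisy input x_t, the visible context, the mask, the flow
    time t, and the regression target (the velocity x1 - x0).
    """
    n_frames = mel.shape[0]
    # Mask a span covering 70-100% of the frames.
    # (A single contiguous span is an assumption of this sketch.)
    frac = rng.uniform(min_frac, max_frac)
    span = max(1, int(round(frac * n_frames)))
    start = rng.integers(0, n_frames - span + 1)
    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True

    x0 = rng.standard_normal(mel.shape)          # Gaussian noise at t = 0
    t = rng.uniform(0.0, 1.0)                    # flow time
    x_t = (1.0 - t) * x0 + t * mel               # point on the straight path
    context = np.where(mask[:, None], 0.0, mel)  # masked frames are zeroed out
    target_velocity = mel - x0                   # derivative of x_t w.r.t. t
    return x_t, context, mask, t, target_velocity

def masked_velocity_loss(pred, target, mask):
    """Mean squared error computed over the masked frames only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))
```

A network trained this way receives x_t, the context, the padded text, and t, and is scored with masked_velocity_loss against target_velocity.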

The objective is the optimal-transport conditional flow matching (OT-CFM) loss. Unlike score-based diffusion, the loss is a simple regression on velocities. Classifier-free guidance is applied at inference time by combining the conditional and unconditional velocities at each step, pushing the result further toward the conditional prediction.^[2]
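Written out, the velocity regression and the guidance rule take the following standard form. The notation is an assumption of this sketch rather than a quotation from the paper: x_0 is Gaussian noise, x_1 the clean mel target, c the conditioning (masked speech plus padded text), v_\theta the network, and \alpha the guidance strength.

```latex
% x_0 ~ N(0, I), x_1 = clean mel features, t ~ U[0, 1]
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta\big((1-t)\,x_0 + t\,x_1,\; c,\; t\big) - (x_1 - x_0) \right\|^2

% classifier-free guidance at inference, with x_t the current ODE state
v_{\mathrm{CFG}}
  = v_\theta(x_t, c, t)
  + \alpha\,\big( v_\theta(x_t, c, t) - v_\theta(x_t, \varnothing, t) \big)
```

With \alpha = 0 the guided velocity reduces to the plain conditional prediction; larger values move it further away from the unconditional one.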

What does the ConvNeXt V2 text refiner do?

The text input is first converted into a character sequence (the official base checkpoint uses character tokens rather than phonemes, which keeps the model language-flexible) and embedded into 512-dimensional vectors. These embeddings are padded with filler tokens to match the speech length and passed through a four-layer 1D ConvNeXt V2 block stack with 512 hidden dimensions and 1024-dimensional feed-forward layers. The refined text features are then concatenated with the masked mel features along the channel dimension before being fed into the DiT trunk.^[2] The authors describe this refiner as the key change that makes the simple "pad and align" strategy work in fewer training updates than the original E2 TTS Flat-UNet, by giving the text branch its own inductive bias for capturing local prosodic structure.^[2]
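The "pad and align" step can be illustrated with a short sketch. The filler symbol and function name below are hypothetical; the point is only that the character sequence is extended to exactly one token per mel frame before it is embedded and refined.

```python
FILLER = "<F>"  # hypothetical filler symbol; the real token id is an implementation detail

def pad_text_to_frames(text, n_frames, filler=FILLER):
    """Return n_frames tokens: the characters of `text`, then filler tokens."""
    tokens = list(text)  # character-level tokens, no phonemizer
    if len(tokens) > n_frames:
        raise ValueError("text is longer than the target number of mel frames")
    return tokens + [filler] * (n_frames - len(tokens))

# Mixed Chinese and Latin characters share one character-level sequence.
padded = pad_text_to_frames("你好, world", n_frames=12)
```

Because the padded sequence and the mel sequence have the same length, the two can be joined frame by frame without a duration model.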

What is the Diffusion Transformer (DiT) backbone?

The trunk of F5-TTS is a 22-layer Diffusion Transformer with 16 attention heads, 1024-dimensional embeddings, and 2048-dimensional feed-forward sublayers. It uses zero-initialized adaptive Layer Norm (adaLN-zero) so that the network starts close to the identity at the beginning of training, and rotary position embeddings (RoPE) in self-attention to handle variable lengths up to roughly 30 seconds of audio at the chosen hop length.^[2] Dropout of 0.1 is applied to attention and feed-forward sublayers. Unlike multimodal DiTs such as MMDiT, which separate text and image streams into different trunks that interact through joint attention, F5-TTS concatenates the refined text features with the mel features frame by frame along the channel dimension and lets a single DiT trunk process them together.
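A back-of-the-envelope count shows how these dimensions relate to the reported 335.8 million parameters. The block layout below (four attention projections, a two-matrix feed-forward sublayer, and one adaLN modulation layer producing six vectors per block) is the generic DiT design and an assumption of this sketch, not the paper's own accounting; biases, norms, and embeddings are ignored.

```python
def dit_block_weights(dim, ffn_dim):
    """Approximate weight count of one generic DiT block (biases ignored)."""
    attention = 4 * dim * dim          # query, key, value, and output projections
    feed_forward = 2 * dim * ffn_dim   # up and down projections
    adaln = dim * 6 * dim              # shift, scale, gate for each of two sublayers
    return attention + feed_forward + adaln

LAYERS, DIM, FFN_DIM = 22, 1024, 2048
trunk = LAYERS * dit_block_weights(DIM, FFN_DIM)
share = trunk / 335.8e6  # fraction of the reported total

print(f"trunk weights: {trunk / 1e6:.1f}M, share of total: {share:.0%}")
```

Under these assumptions the trunk comes to roughly 323 million weights, which is consistent with the statement that the parameters are concentrated almost entirely in the DiT.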

Which vocoder does F5-TTS use?

Mel spectrograms produced by the flow ODE are converted to waveforms with the pretrained Vocos vocoder, which uses an inverse short-time Fourier transform head to operate in the time-frequency domain and produces 24 kHz audio. The vocoder is not retrained; F5-TTS simply matches its mel configuration (100 mels, 256 hop) to the public Vocos checkpoint.^[2] An Apple Silicon port, f5-tts-mlx, uses the vocos-mlx library for the same purpose on Mac hardware.^[9]

What is Sway Sampling?

The inference-time contribution of the paper is "Sway Sampling," a non-uniform schedule for the flow ODE time variable. Rather than spacing the NFE steps uniformly between 0 and 1, F5-TTS warps the schedule so that more steps are spent in the noisy half of the trajectory (small t), where alignment with the text is decided, and fewer steps in the clean half. The amount of warping is controlled by a single scalar parameter. The authors report that Sway Sampling can be applied to any pretrained flow-matching TTS model without retraining and that it improves both alignment and audio quality at fixed NFE, in some cases letting the system reach the quality of 32-step uniform sampling in only 16 steps.^[2]
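A minimal sketch of such a warped schedule follows. It assumes the warp takes the cosine form t = u + s(cos(πu/2) - 1 + u) with a coefficient s, where negative s pushes steps toward small t; the function name and the default s = -1 are choices of this example.

```python
import numpy as np

def sway_schedule(nfe, s=-1.0):
    """Return nfe + 1 flow times in [0, 1], warped by coefficient s.

    s = 0 gives uniform spacing; s < 0 concentrates steps near t = 0,
    the noisy end of the trajectory.
    """
    u = np.linspace(0.0, 1.0, nfe + 1)
    return u + s * (np.cos(np.pi / 2 * u) - 1.0 + u)

uniform = sway_schedule(16, s=0.0)
swayed = sway_schedule(16, s=-1.0)

# Count how many time points fall in the noisy half of the trajectory.
print((uniform < 0.5).sum(), (swayed < 0.5).sum())
```

The endpoints stay at 0 and 1, so the ODE still integrates over the full trajectory; only the placement of intermediate steps changes, which is why no retraining is involved.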

What data was F5-TTS trained on?

F5-TTS is trained on the Emilia in-the-wild speech dataset, an extensive multilingual corpus introduced by Shanghai Jiao Tong University collaborators in 2024 that aggregates publicly sourced internet audio with automatic preprocessing for speaker diarization, noise filtering, and transcription. The official base checkpoint uses the English and Mandarin partitions, which after filtering total roughly 95,000 hours of speech drawn from the public 100,000-hour corpus the paper reports.^[1]^[2]^[3] The training configuration uses a batch size of 307,200 mel frames (about 0.91 hours of audio per step), the AdamW optimizer with a peak learning rate of 7.5e-5 and cosine decay, 1.2 million updates for the base checkpoint, and roughly one week of wall-clock time on a node of 8 NVIDIA A100 80 GB GPUs.^[2] An updated "v1" base checkpoint released in 2025 trained for 1.25 million updates.^[3]
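The conversion from mel frames to hours of audio follows directly from the sampling rate and hop length. The short calculation below reproduces the 0.91-hour figure and also shows how many frames a 30-second sequence occupies; the variable names are illustrative.

```python
SAMPLE_RATE = 24_000   # audio samples per second
HOP_LENGTH = 256       # audio samples between consecutive mel frames

frames_per_second = SAMPLE_RATE / HOP_LENGTH        # 93.75 frames per second
batch_frames = 307_200
batch_seconds = batch_frames / frames_per_second    # seconds of audio per step
batch_hours = batch_seconds / 3600

frames_in_30_s = int(30 * frames_per_second)        # frames in a 30-second clip

print(f"{frames_per_second} frames/s, {batch_hours:.2f} h per batch, "
      f"{frames_in_30_s} frames in 30 s")
```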

The dataset choice is consequential for licensing. Because Emilia is sourced from publicly available recordings without explicit redistribution clearance for commercial generative use, derivative model weights are released under the non-commercial CC-BY-NC-4.0 license. The MIT-licensed source code can be used to train new checkpoints on any compatible corpus.^[4]

What can F5-TTS do?

To synthesize speech with F5-TTS, the user provides a reference audio clip (typically 5 to 15 seconds, in mono WAV, MP3, or FLAC), the transcript of that reference, and the text to synthesize.^[10] The reference clip is converted to mel features, and the text sequence is padded with filler tokens to the total output length (reference plus new speech). Both are fed into the network, and the flow ODE denoises the masked region. The vocoder converts the resulting mel features into a waveform, which is sliced to the new-text portion before being returned to the user. The official command-line interface and Gradio web demo wrap this pipeline; a Python package, f5-tts, can be installed from PyPI with pip install f5-tts.^[10]^[11]

How good is F5-TTS at zero-shot voice cloning?

The system is explicitly trained for zero-shot synthesis: there is no per-speaker fine-tuning step, and the model has never seen the reference speaker at training time. The official demo page presents English, Mandarin, and code-switched samples synthesized from reference clips only a few seconds long, including direct-quote sentences, tongue twisters, and emotional reads such as calm, angry, happy, sad, fearful, and disgusted variants.^[8] Community reports note that voice fidelity is highest for same-language references and degrades when cloning across languages absent from training (for instance, an English reference used to synthesize Japanese tends to lose timbre).^[5]

Which languages does F5-TTS support?

The base checkpoint covers English and Mandarin and supports code-switching within a single utterance, because the Emilia corpus contains both languages and a character-level vocabulary covers both Latin and Chinese scripts.^[2] Community fine-tunes have extended the model to Japanese, Korean, Spanish, Brazilian Portuguese, Sinhala, and several other languages, with corresponding checkpoints listed in the project's SHARED.md registry.^[3]^[5]

Can F5-TTS control speed and emotion?

By choosing the length of the output mel sequence, the user can directly control the speaking rate. The demo page reports stable behavior between 0.7x and 1.3x of the natural speed implied by character count.^[8] Emotion is inherited from the reference clip; a clip in a sad register tends to bias the synthesis toward similarly emotive prosody without any explicit emotion conditioning input.^[8]
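Length-based speed control can be sketched with a simple proportional heuristic: assume the new text is spoken at the same frames-per-character rate as the reference, then divide by a speed factor. The function name and the use of UTF-8 byte length as the text length are assumptions of this example, not a documented API.

```python
def estimate_total_frames(ref_frames, ref_text, gen_text, speed=1.0):
    """Estimate the output length in mel frames (reference plus new speech).

    ref_frames: number of mel frames in the reference clip.
    speed: values above 1 shorten the generated part (faster speech),
           values below 1 lengthen it (slower speech).
    """
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    if ref_len == 0:
        raise ValueError("reference transcript must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames_per_unit = ref_frames / ref_len           # speaking rate of the reference
    gen_frames = round(frames_per_unit * gen_len / speed)
    return ref_frames + gen_frames
```

Requesting a shorter total length for the same text forces the model to fit the words into fewer frames, which is heard as faster speech.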

What hardware does F5-TTS need?

The model can run on consumer GPUs. Independent benchmarks measured a peak GPU memory of about 2,994 MB for F5-TTS at 32 NFE inference, which is lower than most diffusion-based zero-shot TTS systems and competitive with several smaller models.^[12] Apple Silicon support is provided by the f5-tts-mlx port, which runs on Apple's MLX framework.^[9] Triton and TensorRT-LLM deployment recipes are included in the official repository.^[3]

How does F5-TTS perform on benchmarks?

Table 1 summarizes the headline numbers from the F5-TTS paper, evaluated on the LibriSpeech-PC test-clean continuation subset and the Seed-TTS English and Chinese test sets.^[2] Reported metrics are word error rate (WER) from an ASR model on the synthesized speech, speaker similarity (SIM-o) measured by a verification model against the reference, real-time factor (RTF) on an A100, and, where available, comparative mean opinion score (CMOS) judged by listeners. The F5-TTS rows use 32 NFE and therefore show an RTF of 0.31; the headline figure of about 0.15 corresponds to the faster 16-NFE setting.

Test set	System	NFE	WER (%)	SIM-o	RTF
LibriSpeech-PC test-clean	F5-TTS	32	2.42	0.66	0.31
LibriSpeech-PC test-clean	E2 TTS	32	2.95	0.69	0.68
LibriSpeech-PC test-clean	Voicebox	64	2.03	0.64	0.64
Seed-TTS test-en	F5-TTS	32	1.83	0.67	0.31
Seed-TTS test-en	E2 TTS	32	2.19	0.71	0.68
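
Two of the metrics in the table are simple to compute once an ASR transcript and timing measurements are available. The sketch below uses hypothetical helper names and plain whitespace tokenization; published evaluations typically normalize text before scoring, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Compute time spent per second of generated audio (lower is faster)."""
    return synthesis_seconds / audio_seconds
```

For example, generating 10 seconds of audio in 1.5 seconds gives an RTF of 0.15, meaning synthesis runs several times faster than playback.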

Independent comparisons published in 2025 placed F5-TTS in the top group of open-source TTS systems on a mix of quality, controllability, and resource use. A multi-model survey from Inferless found that F5-TTS and Sesame's csm-1b were the best-rounded performers across synthesized quality and controllability axes, with Zonos-v0.1-transformer distinguishing itself on per-attribute controllability.^[12] The same survey highlighted that F5-TTS's strict non-autoregressive design inherently limits low-latency streaming, while autoregressive systems such as XTTS-v2 can emit a first audio chunk within a few hundred milliseconds.^[12]

What variants and ports of F5-TTS exist?

Reference implementations and packaging

The canonical reference implementation lives in the SWivid/F5-TTS repository on GitHub. The codebase shipped with full training scripts, a Gradio web app, a CLI, and an optional voice-chat mode powered by an external Qwen2.5-3B-Instruct LLM for end-to-end speech interaction.^[3] The official package is also distributed on PyPI as f5-tts, on Hugging Face as SWivid/F5-TTS, on ModelScope, and on Wisemodel.^[3]^[11] A Replicate deployment under x-lance/f5-tts makes the model accessible through a hosted REST API.^[13] An MLX port, lucasnewman/f5-tts-mlx, runs the same model on Apple Silicon and is distributed through PyPI as f5-tts-mlx.^[9]

Community apps and integrations

The unofficial mrfakename/E2-F5-TTS Space on Hugging Face has been a popular entry point for casual users, exposing both the F5-TTS and E2 TTS checkpoints behind a Gradio interface that runs on ZeroGPU A100 instances. Free-tier usage is metered to about five GPU minutes per day per user.^[14] A widely circulated ComfyUI custom node, niknah/ComfyUI-F5-TTS, lets visual workflows drive F5-TTS directly, combining LoadAudio inputs, Whisper transcription, and F5TTSAudioInputs nodes into reusable voice-cloning graphs that have been shipped in tutorials and a TTS-Audio-Suite multi-engine bundle.^[15] Independent ports include "F5-TTS-Plus" by gjnave (a community fork with quality-of-life additions) and Windows desktop installers from third-party shops.^[3]

Commercial off-shoots

Because the official checkpoints are CC-BY-NC, commercial deployments are restricted to permissive reimplementations. The most prominent of these is OpenF5-TTS-Base by mrfakename, an Apache 2.0 reimplementation that retrains the F5-TTS architecture on permissively licensed audio. Its documentation describes the model as still in alpha and "inferior to the official NC-licensed F5-TTS model," but it allows commercial use.^[4] A separate ecosystem of commercial cloud TTS providers, including Murf, ElevenLabs, PlayHT, and HeyGen, targets the same zero-shot voice-cloning use case as F5-TTS but ships its own proprietary models.^[16] G2 and similar review sites regularly group F5-TTS alongside CosyVoice, Synthesia, VEED, and Murf in "best alternatives" lists, with the open-source nature of F5-TTS framing it as a self-hostable option for research and prototyping.^[16]

F5R-TTS: reinforcement-learning variant

In April 2025, a team from Frontier Labs published F5R-TTS, an extension of the F5-TTS architecture that integrates GRPO (Group Relative Policy Optimization), a policy-gradient reinforcement-learning algorithm originally developed for fine-tuning language models, into the flow-matching training loop.^[17] The authors reformulate the deterministic velocity field as a Gaussian distribution over velocities so that the model can be treated as a stochastic policy. A reward model combines two automatic signals, ASR-based WER and embedding-based speaker similarity (SIM), and GRPO updates the policy toward higher-reward generations. F5R-TTS reports a 29.5 percent relative reduction in WER and a 4.6 percent relative SIM improvement over a vanilla F5-TTS baseline on zero-shot voice cloning, evaluated under the same protocol.^[17]
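The group-relative part of GRPO can be illustrated briefly: several outputs are sampled for the same prompt, each receives a scalar reward, and every sample's advantage is its reward standardized within the group. The reward weighting below is hypothetical and is not taken from the F5R-TTS paper; only the standardization step reflects the general GRPO recipe.

```python
import numpy as np

def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Toy reward: higher speaker similarity is good, higher WER is bad.

    The weights are illustrative assumptions, not published values.
    """
    return sim_weight * sim - wer_weight * wer

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Three hypothetical generations for one prompt, as (WER, SIM) pairs.
group = [(0.05, 0.70), (0.02, 0.72), (0.10, 0.60)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)
```

Samples with positive advantage are reinforced and those with negative advantage are discouraged, so no separate value network is needed to provide a baseline.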

Cross-Lingual F5-TTS

In September 2025, a separate paper, "Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis," proposed a training recipe that removes the requirement for a transcript of the reference audio. The approach uses the Massively Multilingual Speech (MMS) forced aligner to extract phoneme or syllable boundaries from training data and trains transformer-based speaking-rate predictors at phoneme, syllable, and word granularity. This makes prompt-based cross-lingual cloning practical even when the reference language is not in the model's text vocabulary.^[18]

Community fine-tunes

The SHARED.md registry maintained in the SWivid repository tracks community checkpoints. As of mid-2025 it included Japanese fine-tunes contributed by community member Jmica, Korean fine-tunes, Spanish (F5-Spanish by jpgallegoar), Brazilian Portuguese (ModelsLab/F5-tts-brazilian), Sinhala (tharindumihi/tts-si-F5-TTS), and additional European languages.^[3]^[19] F5-TTS uses full fine-tuning by default rather than LoRA, meaning that fine-tuned checkpoints are full-size replacements rather than additive adapters.^[5]

What is F5-TTS used for?

The most discussed application of F5-TTS is zero-shot voice cloning for video and audio production. The ComfyUI ecosystem in particular has used F5-TTS as a drop-in for voice work in generative video pipelines, pairing it with image and motion models for character voiceover.^[15] Open-source dubbing pipelines have integrated F5-TTS to translate a source audio track into a different language while preserving speaker timbre, often using a Whisper-based pipeline for the source transcript and then routing the translation through F5-TTS with the original audio as the reference clip.^[20] Hobbyist projects use the model for audiobook narration, podcast generation, and accessibility tools that need a personalized voice. Researchers use the model as a strong open baseline against which to compare new TTS architectures, and the F5R-TTS work is one explicit example of using F5-TTS as a starting checkpoint for reinforcement-learning extensions.^[17]

Because the model can synthesize from a short, easily harvested reference clip, F5-TTS also figures in broader discussions of voice deepfakes and consent. The non-commercial license on the checkpoint and the model card's content notice ask users to obtain consent from the speakers whose voices they reproduce, but the underlying architecture does not include any speaker-identity watermark or technical safeguard. Several commercial TTS vendors have referenced F5-TTS as a representative open baseline when arguing for their own provenance and watermark schemes.^[16]

What are the limitations of F5-TTS?

The model has known limitations. Reference clips longer than about 15 seconds are silently truncated, because the network was trained for total sequence lengths corresponding to roughly 30 seconds of audio, and quality degrades when a very long target text is synthesized in one pass rather than split into shorter chunks.^[15] The non-autoregressive design produces the entire utterance in parallel, which prevents the low-latency first-chunk streaming offered by autoregressive systems such as XTTS-v2. The inference-time RTF of 0.15 is measured on an A100 GPU; on lower-end consumer GPUs the same configuration runs noticeably slower.^[12]
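One common way to stay within the sequence-length limit is to split long input at sentence boundaries and synthesize each piece separately. The helper below is a hypothetical sketch of that idea; the function name and the 200-character budget are assumptions, and a single sentence longer than the budget is passed through unsplit.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Split after Latin or Chinese sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in parts if p.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized with the same reference clip and the resulting waveforms joined, at the cost of possible prosody discontinuities at the seams.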

Audio quality is generally competitive with closed-source commercial systems, but user reports describe occasional artifacts on edge cases (very fast speech, heavily accented references, noisy reference audio). Some users have noted that the model's voice fidelity is not yet at parity with the best commercial cloning APIs and that artifacts can be audible in production-grade workflows.^[4] The official base checkpoint, like other models trained on in-the-wild data, can reproduce biases of the training corpus; the Emilia partitions overrepresent certain English and Mandarin accents, and community reports note that uncommon dialects sometimes shift toward more familiar variants.^[3]

The non-commercial license on the checkpoints is the most-discussed structural limitation. Because the weights inherit a CC-BY-NC restriction from Emilia, they cannot legally be used in monetized YouTube videos, commercial dubbing, or other revenue-generating settings, even after community fine-tuning of the CC-BY-NC base.^[4] The repository discussion forums include long threads on this point, with maintainers reaffirming the constraint and pointing commercial users to the permissively licensed OpenF5-TTS-Base reimplementation as a workaround.^[4]

A final criticism is one shared across the zero-shot voice cloning literature: the technical ease with which the model can imitate any voice from a short clip raises misuse risks that the project does not fully solve, and the model card explicitly relies on users to follow ethical guidelines rather than imposing technical limits.^[3]

How does F5-TTS compare to other TTS models?

System	Year	Architecture	Open weights	License (weights)	Real-time factor	Notes
F5-TTS	2024	Flow matching + DiT + ConvNeXt	Yes	CC-BY-NC-4.0	~0.15 (A100)	Reference for this article
E2 TTS	2024	Flow matching + Flat-UNet	Reproduced in F5-TTS repo	CC-BY-NC-4.0	~0.68 (A100, 32 NFE)	Direct precursor^[2]
CosyVoice	2024	Codec LM + flow matching	Yes	Apache 2.0 (varies)	Streaming-capable	Strong speaker similarity^[12]
VALL-E (and successors)	2023	Neural codec LM	Reimplementations only	Research	Autoregressive	Token-by-token generation^[6]
XTTS-v2	2023	GPT-style codec LM	Yes	Coqui CPML	RTF 0.48 with ~3 s latency	Streaming, multilingual^[12]
Tortoise-TTS	2022	Autoregressive + diffusion refiner	Yes	Apache 2.0	High latency	Limited input length^[12]
MaskGCT	2024	Masked generative codec transformer	Yes	Research	Parallel	Competing non-AR approach
ChatTTS	2024	Conversational AR model	Yes	CC-BY-NC	Moderate	Dialogue-focused
Sesame CSM (csm-1b)	2025	Conversational speech model	Yes	Apache 2.0	Competitive	Cited as peer of F5-TTS in surveys^[12]
F5R-TTS	2025	F5-TTS + GRPO RL	Yes (FrontierLabs)	Research	Same as F5-TTS	RLHF variant, better WER/SIM^[17]
ElevenLabs / ElevenLabs v3	2023+	Proprietary	No	Closed	Streaming	Dominant commercial baseline^[16]
Murf	Commercial	Proprietary	No	Closed	Streaming	Template-driven workflow^[16]

In open-source surveys published throughout 2025, F5-TTS consistently appears in the top group on naturalness and speaker similarity while distinguishing itself by a relatively low GPU memory footprint (roughly 3 GB for typical inference) and an absence of the rare hallucinations seen in some autoregressive codec language models.^[12]

Is F5-TTS open source and free to use?

F5-TTS is open source, but its licensing is split. The source code in the SWivid/F5-TTS repository is released under the permissive MIT License, so the architecture and training scripts can be reused, modified, and redistributed freely, including in commercial products.^[3] The pretrained model checkpoints are a different matter: they carry a Creative Commons CC-BY-NC-4.0 (attribution, non-commercial) license, inherited from the Emilia training corpus, which is assembled from in-the-wild internet audio without commercial-use clearance.^[4] In practice this means the official weights can be used freely for research, prototyping, and personal projects but not in monetized or revenue-generating deployments. Developers who need commercial use either train a fresh checkpoint on a permissively licensed corpus using the MIT code, or use the Apache 2.0 OpenF5-TTS-Base reimplementation that the maintainers point commercial users toward.^[4]

Why is F5-TTS significant?

F5-TTS occupies a particular niche in the post-2023 TTS landscape. It demonstrated that a clean conditional flow-matching objective, paired with a Diffusion Transformer trunk and a small ConvNeXt V2 text refiner, could match or surpass more elaborate diffusion- and codec-language-model designs on the standard zero-shot benchmarks while remaining short on moving parts: no duration predictor, no phoneme aligner, no per-language text encoder, and an inference-time speed-up (Sway Sampling) that can be applied as a drop-in to other flow-matching TTS models.^[2] As an artifact, the project has acted as a base for academic follow-ups (F5R-TTS, Cross-Lingual F5-TTS), as a baseline against which new commercial systems advertise themselves, and as a popular open-source tool for hobbyists and small studios that need self-hostable voice cloning. As of mid-2025 it is one of the most-starred open-source TTS repositories on GitHub, with roughly 14,800 stars and an active community of fine-tuners and integrators.^[3] The combination of permissive code licensing, a clean architecture amenable to extension, and an actively maintained ecosystem of ports (MLX, Triton, TensorRT-LLM, ComfyUI) accounts for its outsized visibility relative to closely related research models that did not see comparable community uptake.^[3]^[9]^[15]

Related concepts

Flow matching (Lipman et al., 2022) is the generative modeling framework used by F5-TTS.
Diffusion Transformer (DiT) (Peebles and Xie, 2022) is the backbone class instantiated in F5-TTS, originally proposed for image diffusion and adapted here to mel-spectrogram features.
MMDiT is a related multimodal DiT variant for text-and-image diffusion that separates modality streams in the trunk; F5-TTS instead concatenates them.
ConvNeXt is the convolutional family used (in its ConvNeXt V2 incarnation) as the text refiner.
Voice cloning is the general capability F5-TTS targets.
CosyVoice is a contemporaneous open-source TTS system pairing codec language modeling with flow-matching refinement.
VALL-E is a representative codec language-model TTS in the autoregressive tradition.
Sesame CSM is a 2025 conversational speech model frequently grouped with F5-TTS in open-source TTS surveys.
GRPO is the reinforcement-learning algorithm used by F5R-TTS to fine-tune F5-TTS against ASR and speaker-similarity rewards.
RLHF is the broader paradigm to which the F5R-TTS extension belongs.

References

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen, "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching", arXiv (preprint history v1 2024-10-09, v2 2024-10-15, v3 2025-05-20). https://arxiv.org/abs/2410.06885. Accessed 2026-06-28.
Yushen Chen et al., "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching" (full text), arXiv HTML version v3, 2025-05-20. https://arxiv.org/html/2410.06885v3. Accessed 2026-06-28.
SWivid, "F5-TTS official repository README and SHARED.md", GitHub, 2024-2025. https://github.com/SWivid/F5-TTS. Accessed 2026-06-28.
SWivid, "Clarification on Training Data, Licensing, and Building a Commercial Base Model", GitHub Discussion #997, SWivid/F5-TTS, 2024-2025. https://github.com/SWivid/F5-TTS/discussions/997. Accessed 2026-06-28.
Local AI Master, "F5-TTS Setup Guide (2026): The Best Open-Source Voice Cloning Model", 2026. https://localaimaster.com/blog/f5-tts-setup-guide. Accessed 2026-06-28.
Chengyi Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)", Microsoft Research, arXiv:2301.02111, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-06-28.
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen, "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching", Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6255-6271, ACL Anthology, 2025-07. https://aclanthology.org/2025.acl-long.313/. Accessed 2026-06-28.
SWivid, "F5-TTS demo page", 2024-10. https://swivid.github.io/F5-TTS/. Accessed 2026-06-28.
Lucas Newman, "f5-tts-mlx: F5-TTS port for Apple Silicon", PyPI / GitHub, 2024-2025. https://pypi.org/project/f5-tts-mlx/. Accessed 2026-06-28.
SWivid, "F5-TTS model card", Hugging Face, 2024-2025. https://huggingface.co/SWivid/F5-TTS. Accessed 2026-06-28.
SWivid, "f5-tts package on PyPI", PyPI, 2024-2025. https://pypi.org/project/f5-tts/. Accessed 2026-06-28.
Inferless, "12 Best Open-Source TTS Models Compared (2025): Latency, Quality, Voice Cloning and More", 2025. https://www.inferless.com/learn/comparing-different-text-to-speech---tts--models-part-2. Accessed 2026-06-28.
X-LANCE, "f5-tts on Replicate", Replicate, 2024-2025. https://replicate.com/x-lance/f5-tts/readme. Accessed 2026-06-28.
mrfakename, "E2/F5 TTS Hugging Face Space", Hugging Face, 2024-2025. https://huggingface.co/spaces/mrfakename/E2-F5-TTS. Accessed 2026-06-28.
niknah, "ComfyUI-F5-TTS custom node", GitHub, 2024-2025. https://github.com/niknah/ComfyUI-F5-TTS. Accessed 2026-06-28.
G2, "Top 10 F5-TTS Alternatives and Competitors in 2026", G2.com, 2026. https://www.g2.com/products/f5-tts/competitors/alternatives. Accessed 2026-06-28.
Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, and Baoxun Wang, "F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization", arXiv:2504.02407, 2025-04. https://arxiv.org/abs/2504.02407. Accessed 2026-06-28.
"Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis", arXiv:2509.14579, 2025-09. https://arxiv.org/abs/2509.14579. Accessed 2026-06-28.
jpgallegoar, "F5-Spanish: F5-TTS Spanish fine-tune", Hugging Face / aimodels.fyi listing, 2025. https://www.aimodels.fyi/models/huggingFace/f5-spanish-jpgallegoar. Accessed 2026-06-28.
GhostFork, "Demystifying AI dubbing - Part 3: Text-To-Speech with F5-TTS", 2025. https://ghostfork.me/blog/demystifying-dubbing-part3-text-to-speech/. Accessed 2026-06-28.


