F5-TTS
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,668 words
F5-TTS (short for "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching") is an open-source text-to-speech system introduced in October 2024 by researchers from Shanghai Jiao Tong University's X-LANCE Lab, the University of Cambridge, and the Geely Automobile Research Institute.[1] The system is a fully non-autoregressive synthesizer built on flow matching with a Diffusion Transformer (DiT) backbone, augmented with a ConvNeXt V2 module that refines text embeddings so that they can be aligned with mel-spectrogram frames simply by padding the text sequence with filler tokens to the speech length.[2] Trained on roughly 100,000 hours of English and Mandarin audio drawn from the Emilia in-the-wild dataset (about 95,000 hours after filtering), the 335.8-million-parameter model performs zero-shot voice cloning from a reference clip of only a few seconds and supports code-switching between languages; the authors report an inference real-time factor of about 0.15 when it is combined with an inference-time technique they call "Sway Sampling".[1][2] The code is released under the MIT License, while the pretrained checkpoints are governed by a Creative Commons BY-NC license that inherits restrictions from the Emilia corpus.[3][4] Following its release on Hugging Face and GitHub, F5-TTS became one of the most-starred open-source speech projects of 2024 and 2025, accumulating well over ten thousand stars and prompting a wave of community fine-tunes and a reinforcement-learning successor, F5R-TTS.[3][5]
| Field | Value |
|---|---|
| Full name | F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching |
| Type | Non-autoregressive zero-shot TTS |
| Backbone | Diffusion Transformer (DiT) with ConvNeXt V2 text refiner |
| Vocoder | Vocos (pretrained, 24 kHz) |
| Parameters | 335.8 million (base) |
| Training data | Emilia (~95K hours English and Mandarin after filtering) |
| Audio features | 24 kHz sampling rate, 256-sample hop length, 100 mel bins |
| arXiv preprint | 2410.06885 (first posted 9 October 2024) |
| ACL 2025 paper | Long Papers volume, pages 6255-6271 |
| Code license | MIT |
| Checkpoint license | CC-BY-NC-4.0 |
| Official repository | github.com/SWivid/F5-TTS |
| Inference RTF | ~0.15 on A100 GPU (16 NFE); ~0.31 at 32 NFE |
By 2024, zero-shot voice cloning had become a focal point of speech research. A wave of generative TTS systems showed that a network trained on diverse multilingual audio could imitate an unseen speaker after hearing only a short reference clip. Two contrasting design philosophies emerged. One descended from VALL-E and its kin, which framed TTS as next-token prediction over neural-codec tokens; these autoregressive codec language models inherited the slow inference of large language models and were prone to occasional hallucinations of repeated or skipped phonemes.[6] The other philosophy used continuous-time diffusion or flow-matching objectives over mel-spectrogram frames, which were attractive for their parallel decoding but often required carefully designed duration models, phoneme aligners, and text encoders. Microsoft's Voicebox (2023) and the closely related E2 TTS (2024) explored the second route. E2 TTS in particular showed that a single Flat-UNet Transformer trained with the conditional flow-matching objective on text-guided speech infilling could match autoregressive quality without an explicit duration predictor, if the text input was padded with filler tokens to the length of the target speech.[2]
F5-TTS, posted on arXiv on 9 October 2024, was built directly on top of the E2 TTS recipe and aimed to remove what its authors saw as two remaining inefficiencies. The first was the slow convergence of E2 TTS, which the authors attributed to the network's difficulty in learning to align padded characters with speech frames when a single flat trunk processes text and audio identically. The second was the high number of function evaluations needed for high-quality sampling, which made diffusion-style systems expensive at inference time.[2] The fixes described in the paper were to (a) refine the text representation with a small ConvNeXt V2 module before concatenating it with the masked mel-spectrogram, (b) replace the flat trunk with a Diffusion Transformer (DiT) trunk equipped with adaptive Layer Norm and rotary position embeddings, and (c) introduce an inference-time non-linear schedule for the flow matching time variable, branded as "Sway Sampling", that prioritizes steps near the noisy end of the trajectory.[2]
The paper credits a team led by Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng (Cambridge), Chunhui Wang and Jian Zhao (Geely), Kai Yu, and corresponding author Xie Chen at Shanghai Jiao Tong University.[7] A first version of the manuscript was uploaded to arXiv on 9 October 2024, a second on 15 October 2024, and a third on 20 May 2025.[1] The work was accepted to the Annual Meeting of the Association for Computational Linguistics (ACL) 2025 and appears in the Long Papers volume of the proceedings, pages 6255-6271.[7] In parallel with the paper, the team published model weights on Hugging Face and Replicate, source code on GitHub, demos on a dedicated project page, and an unofficial Hugging Face Space hosted by community member "mrfakename" that became a popular way to try the system without a local installation.[3][8]
F5-TTS is, at its highest level, a conditional flow-matching network that learns a velocity field over 100-dimensional log mel-spectrogram features sampled at 24 kHz with a 256-sample hop length, paired with a pretrained Vocos vocoder for waveform reconstruction.[2] The network parameterizes a probability flow ordinary differential equation (ODE) from Gaussian noise at flow time t=0 to mel features matching the target audio at t=1, conditioned on a text sequence padded to the speech length. The total parameter count is 335.8 million, almost entirely concentrated in the DiT trunk; the ConvNeXt V2 refiner adds a few million parameters more.[2]
Following the E2 TTS framing, F5-TTS trains a single network on the text-guided speech-infilling task. During training, the system samples a random mask over 70 to 100 percent of the mel frames of a clean audio clip, replaces the masked region with a noised version interpolated toward Gaussian noise, and asks the network to predict the velocity that points toward the clean features given (a) the unmasked context, (b) the noisy frames, and (c) a text sequence whose length has been padded with filler tokens to match the speech length.[2] At inference, the entire audio is treated as masked except for a short reference clip provided by the user, and the unmasked region carries the speaker timbre of the reference. The flow ODE is solved by torchdiffeq, with the number of function evaluations (NFE) typically set to 32 for high-quality synthesis and 16 for fast preview.[2]
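The construction of one training example described above can be sketched in NumPy. This is a minimal illustration, not the project's training code: the function names are hypothetical, the mask is assumed to be a single contiguous span, and the masked context is assumed to be zero-filled.

```python
import numpy as np

def make_infilling_example(mel, rng, mask_frac_range=(0.7, 1.0)):
    """Build one flow-matching training example from a clean mel clip.

    mel has shape (frames, n_mels) and plays the role of the data point x1.
    """
    n_frames = mel.shape[0]
    # Assumption: mask one contiguous span covering 70-100% of the frames.
    frac = rng.uniform(*mask_frac_range)
    span = max(1, int(round(frac * n_frames)))
    start = rng.integers(0, n_frames - span + 1)
    mask = np.zeros(n_frames, dtype=bool)
    mask[start:start + span] = True

    x1 = mel
    x0 = rng.standard_normal(mel.shape)  # Gaussian noise sample
    t = rng.uniform(0.0, 1.0)            # flow time: 0 = noise, 1 = data
    x_t = (1.0 - t) * x0 + t * x1        # straight-line interpolation
    target_velocity = x1 - x0            # time derivative of x_t

    # Only the unmasked frames are visible to the network as context.
    context = np.where(mask[:, None], 0.0, x1)
    return x_t, context, target_velocity, mask, t

def masked_velocity_loss(pred, target, mask):
    """Mean squared error on the masked frames only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

A network would receive `x_t`, `context`, the padded text, and `t`, and be trained so that its output minimizes `masked_velocity_loss` against `target_velocity`.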
The objective is the optimal-transport conditional flow matching (OT-CFM) loss. Unlike score-based diffusion objectives, it is a simple regression onto the velocity of a straight-line path between noise and data. Classifier-free guidance is applied at inference time by extrapolating from the unconditional velocity past the conditional velocity at each step, with a guidance scale controlling how far.[2]
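In one common notation, the objective and the guidance rule can be written as follows. The symbols are chosen here for illustration: x_0 is a noise sample, x_1 the clean mel features, c the conditioning (unmasked context plus padded text), v_θ the network, and α the guidance scale. In practice the regression is restricted to the masked frames.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\ t \sim \mathcal{U}[0, 1] \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1} \big\| v_\theta(x_t, c, t) - (x_1 - x_0) \big\|^2 \\
\hat{v}(x_t, c, t) &= v_\theta(x_t, c, t) + \alpha \big( v_\theta(x_t, c, t) - v_\theta(x_t, \varnothing, t) \big)
\end{aligned}
```

Setting α = 0 recovers the purely conditional velocity; larger values push the sample further from what the unconditional model would produce.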
The text input is first converted into a character sequence (the official base checkpoint uses character tokens rather than phonemes, which keeps the model language-flexible) and embedded into 512-dimensional vectors. These embeddings are padded with filler tokens to match the speech length and passed through a four-layer 1D ConvNeXt V2 block stack with 512 hidden dimensions and 1024-dimensional feed-forward layers. The refined text features are then concatenated with the masked mel features along the channel dimension before being fed into the DiT trunk.[2] The authors describe this refiner as the key change that makes the simple "pad and align" strategy work in fewer training updates than the original E2 TTS Flat-UNet, by giving the text branch its own inductive bias for capturing local prosodic structure.[2]
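The "pad and align" step can be illustrated with a short sketch. It assumes a hypothetical character vocabulary in which id 0 is reserved for the filler token, and it omits the ConvNeXt V2 refinement that sits between the embedding lookup and the concatenation; the helper names are illustrative, not taken from the official code.

```python
import numpy as np

FILLER_ID = 0  # hypothetical id reserved for the filler token

def build_vocab(texts):
    """Map each distinct character to an id starting at 1."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def pad_text_to_frames(text, n_frames, vocab):
    """Convert text to character ids and pad with fillers to the mel length."""
    ids = [vocab[ch] for ch in text]
    if len(ids) > n_frames:
        raise ValueError("text is longer than the mel sequence")
    return ids + [FILLER_ID] * (n_frames - len(ids))

def fuse_text_and_mel(token_ids, masked_mel, embedding):
    """Look up text embeddings and concatenate them with mel along channels."""
    text_feat = embedding[np.asarray(token_ids)]  # (frames, text_dim)
    return np.concatenate([masked_mel, text_feat], axis=-1)
```

With 100 mel bins and 512-dimensional text embeddings, each frame of the fused input in this sketch carries 612 channels; no duration model or aligner decides which character belongs to which frame.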
The trunk of F5-TTS is a 22-layer Diffusion Transformer with 16 attention heads, 1024-dimensional embeddings, and 2048-dimensional feed-forward sublayers. It uses zero-initialized adaptive Layer Norm (adaLN-zero) so that the network starts close to the identity at the beginning of training, and rotary position embeddings (RoPE) in self-attention to handle variable lengths up to roughly 30 seconds of audio at the chosen hop length.[2] Dropout of 0.1 is applied to attention and feed-forward sublayers. Unlike multimodal DiTs such as MMDiT, which separate text and image streams into different trunks that interact through joint attention, F5-TTS concatenates the refined text features and the mel features channel-wise, frame by frame, and lets a single DiT trunk process the fused sequence.
Mel spectrograms produced by the flow ODE are converted to waveforms with the pretrained Vocos vocoder, which uses an inverse short-time Fourier transform head to operate in the time-frequency domain and produces 24 kHz audio. The vocoder is not retrained; F5-TTS simply matches its mel configuration (100 mels, 256 hop) to the public Vocos checkpoint.[2] An Apple Silicon port, f5-tts-mlx, uses the vocos-mlx library for the same purpose on Mac hardware.[9]
The inference-time contribution of the paper is "Sway Sampling," a non-uniform schedule for the flow ODE time variable. Rather than spacing the NFE steps uniformly between 0 and 1, F5-TTS warps the schedule so that more steps are spent in the noisy half of the trajectory (small t), where alignment with the text is decided, and fewer steps in the clean half. The amount of warping is controlled by a single scalar parameter. The authors report that Sway Sampling can be applied to any pretrained flow-matching TTS model without retraining and that it improves both alignment and audio quality at fixed NFE, in some cases letting the system reach the quality of 32-step uniform sampling in only 16 steps.[2]
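A minimal sketch of such a warped schedule follows. It assumes the warp has the form t = u + s(cos(πu/2) − 1 + u) for uniformly spaced u, with the scalar s restricted to roughly [−1, 2/(π − 2)] so that the schedule stays monotonic; the function name and the default s = −1 are illustrative choices rather than confirmed project defaults.

```python
import numpy as np

def sway_schedule(nfe, s=-1.0):
    """Return nfe + 1 flow times in [0, 1], warped by the sway coefficient s.

    s = 0 gives the uniform grid; s < 0 moves steps toward t = 0 (noisy end).
    """
    u = np.linspace(0.0, 1.0, nfe + 1)
    return u + s * (np.cos(np.pi / 2.0 * u) - 1.0 + u)
```

With `nfe=16` and `s=-1.0`, eleven of the seventeen grid points fall below t = 0.5, compared with eight on the uniform grid. The resulting array can be passed as the time grid of a fixed-step ODE solver in place of a uniform one.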
F5-TTS is trained on the Emilia in-the-wild speech dataset, an extensive multilingual corpus introduced by Shanghai Jiao Tong University collaborators in 2024 that aggregates publicly sourced internet audio with automatic preprocessing for speaker diarization, noise filtering, and transcription. The official base checkpoint uses the English and Mandarin partitions, which after filtering total roughly 95,000 hours of speech.[2][3] The training configuration uses a batch size of 307,200 mel frames (about 0.91 hours of audio per step), the AdamW optimizer with a peak learning rate of 7.5e-5 and cosine decay, 1.2 million updates for the base checkpoint, and roughly one week of wall-clock time on a node of 8 NVIDIA A100 80 GB GPUs.[2] An updated "v1" base checkpoint released in 2025 trained for 1.25 million updates.[3]
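The batch-size figure can be checked with a few lines of arithmetic from the stated mel configuration (24 kHz audio, 256-sample hop). The estimate of total audio seen assumes every batch is completely full, which is an idealization.

```python
SAMPLE_RATE_HZ = 24_000
HOP_LENGTH = 256

def frames_to_seconds(n_frames):
    """Duration covered by n_frames mel frames at the stated hop length."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE_HZ

frames_per_second = SAMPLE_RATE_HZ / HOP_LENGTH   # 93.75 frames per second
batch_hours = frames_to_seconds(307_200) / 3600   # about 0.91 hours per step
audio_seen_hours = batch_hours * 1_200_000        # about 1.09 million hours
approx_passes = audio_seen_hours / 95_000         # about 11.5 passes over the data
max_frames_30s = 30 * frames_per_second           # 2812.5 frames for 30 seconds

print(f"{batch_hours:.2f} h per step, ~{approx_passes:.1f} passes over 95K hours")
```

Under that assumption, 1.2 million updates correspond to roughly eleven to twelve passes over the filtered corpus.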
The dataset choice is consequential for licensing. Because Emilia is sourced from publicly available recordings without explicit redistribution clearance for commercial generative use, derivative model weights are released under the non-commercial CC-BY-NC-4.0 license. The MIT-licensed source code can be used to train new checkpoints on any compatible corpus.[4]
To synthesize speech with F5-TTS, the user provides a reference audio clip (typically 5 to 15 seconds, in mono WAV, MP3, or FLAC), the transcript of that reference, and the text to synthesize.[10] The reference clip is converted to mel features and paired with a text sequence (the reference transcript followed by the new text) that is padded with filler tokens to the total output length; the network then fills in the masked frames by integrating the flow ODE. The vocoder converts the resulting mel features into a waveform, which is sliced to the new-text portion before being returned to the user. The official command-line interface and Gradio web demo wrap this pipeline; a Python package, f5-tts, can be installed from PyPI with pip install f5-tts.[10][11]
The system is explicitly trained for zero-shot synthesis: there is no per-speaker fine-tuning step, and the model has never seen the reference speaker at training time. The official demo page presents English, Mandarin, and code-switched samples synthesized from clips of seconds-long duration, including direct-quote sentences, tongue twisters, and emotional reads such as calm, angry, happy, sad, fearful, and disgusted variants.[8] Community reports note that voice fidelity is highest for same-language references and degrades when cloning across languages absent from training (for instance, an English reference used to synthesize Japanese tends to lose timbre).[5]
The base checkpoint covers English and Mandarin and supports code-switching within a single utterance, because the Emilia corpus contains both languages and a character-level vocabulary covers both Latin and Chinese scripts.[2] Community fine-tunes have extended the model to Japanese, Korean, Spanish, Brazilian Portuguese, Sinhala, and several other languages, with corresponding checkpoints listed in the project's SHARED.md registry.[3][5]
By choosing the length of the output mel sequence, the user can directly control the speaking rate. The demo page reports stable behavior between 0.7x and 1.3x of the natural speed implied by character count.[8] Emotion is inherited from the reference clip; a clip in a sad register tends to bias the synthesis toward similarly emotive prosody without any explicit emotion conditioning input.[8]
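One way to turn a speed setting into an output length is to scale the reference clip's frames-per-character ratio, as sketched below. The heuristic and the function name are assumptions for illustration; text length is counted in UTF-8 bytes here so that Chinese characters weigh more than Latin letters, which is also an assumption.

```python
def estimate_total_frames(ref_frames, ref_text, gen_text, speed=1.0):
    """Estimate mel frames for reference plus generated speech.

    The generated part borrows the reference's frames-per-byte rate and
    divides it by the requested speed (speed > 1 means faster speech).
    """
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    if ref_len == 0:
        raise ValueError("reference transcript must not be empty")
    return ref_frames + int(ref_frames / ref_len * gen_len / speed)
```

A request at `speed=2.0` allots half as many frames to the new text as a request at `speed=1.0`, so the same words are spoken in half the time.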
The model can run on consumer GPUs. Independent benchmarks measured a peak GPU memory of about 2,994 MB for F5-TTS at 32 NFE inference, which is lower than most diffusion-based zero-shot TTS systems and competitive with several smaller models.[12] Apple Silicon support is provided by the f5-tts-mlx port, which runs on Apple's MLX framework and its Metal GPU backend.[9] Triton and TensorRT-LLM deployment recipes are included in the official repository.[3]
Table 1 summarizes the headline numbers from the F5-TTS paper, evaluated on the LibriSpeech-PC test-clean continuation subset and the Seed-TTS English and Chinese test sets.[2] Reported metrics are word error rate (WER) from an ASR model on the synthesized speech, speaker similarity (SIM-o) measured by a verification model against the reference, real-time factor (RTF) on an A100, and, where available, comparative mean opinion score (CMOS) judged by listeners.
| Test set | System | NFE | WER (%) | SIM-o | RTF |
|---|---|---|---|---|---|
| LibriSpeech-PC test-clean | F5-TTS | 32 | 2.42 | 0.66 | 0.31 |
| LibriSpeech-PC test-clean | E2 TTS | 32 | 2.95 | 0.69 | 0.68 |
| LibriSpeech-PC test-clean | Voicebox | 64 | 2.03 | 0.64 | 0.64 |
| Seed-TTS test-en | F5-TTS | 32 | 1.83 | 0.67 | 0.31 |
| Seed-TTS test-en | E2 TTS | 32 | 2.19 | 0.71 | 0.68 |
Independent comparisons published in 2025 placed F5-TTS in the top group of open-source TTS systems on a mix of quality, controllability, and resource use. A multi-model survey from Inferless found that F5-TTS and Sesame's csm-1b were the best-rounded performers across synthesized quality and controllability axes, with Zonos-v0.1-transformer distinguishing itself on per-attribute controllability.[12] The same survey highlighted that F5-TTS's strict non-autoregressive design inherently limits low-latency streaming, while autoregressive systems such as XTTS-v2 can emit a first audio chunk within a few hundred milliseconds.[12]
The canonical reference implementation lives in the SWivid/F5-TTS repository on GitHub. The codebase ships with full training scripts, a Gradio web app, a CLI, and an optional voice-chat mode powered by an external Qwen2.5-3B-Instruct LLM for end-to-end speech interaction.[3] The official package is also distributed on PyPI as f5-tts, on Hugging Face as SWivid/F5-TTS, on ModelScope, and on Wisemodel.[3][11] A Replicate deployment under x-lance/f5-tts makes the model accessible through a hosted REST API.[13] An MLX port, lucasnewman/f5-tts-mlx, runs the same model on Apple Silicon and is distributed through PyPI as f5-tts-mlx.[9]
The unofficial mrfakename/E2-F5-TTS Space on Hugging Face has been a popular entry point for casual users, exposing both the F5-TTS and E2 TTS checkpoints behind a Gradio interface that runs on ZeroGPU A100 instances. Free-tier usage is metered to about five GPU minutes per day per user.[14] A widely circulated ComfyUI custom node, niknah/ComfyUI-F5-TTS, lets visual workflows drive F5-TTS directly, combining LoadAudio inputs, Whisper transcription, and F5TTSAudioInputs nodes into reusable voice-cloning graphs that have been shipped in tutorials and a TTS-Audio-Suite multi-engine bundle.[15] Independent ports include "F5-TTS-Plus" by gjnave (a community fork with quality-of-life additions) and Windows desktop installers from third-party shops.[3]
Because the official checkpoints are CC-BY-NC, commercial deployments must rely on permissively licensed reimplementations. The most prominent of these is OpenF5-TTS-Base by mrfakename, an Apache 2.0 reimplementation that retrains the F5-TTS architecture on permissively licensed audio. Its documentation describes the model as still in alpha and "inferior to the official NC-licensed F5-TTS model," but it allows commercial use.[4] A separate ecosystem of commercial cloud TTS providers, including Murf, ElevenLabs, PlayHT, and HeyGen, targets the same zero-shot voice-cloning use case as F5-TTS but ships its own proprietary models.[16] G2 and similar review sites regularly group F5-TTS alongside CosyVoice, Synthesia, VEED, and Murf in "best alternatives" lists, with the open-source nature of F5-TTS framing it as a self-hostable option for research and prototyping.[16]
In April 2025, a team from Frontier Labs published F5R-TTS, an extension of the F5-TTS architecture that integrates GRPO (Group Relative Policy Optimization), a policy-gradient reinforcement learning algorithm, into the flow-matching training loop.[17] The authors reformulate the deterministic velocity field as a Gaussian distribution over velocities so that the model can be treated as a stochastic policy. A reward model combines two automatic signals, ASR-based WER and embedding-based speaker similarity (SIM), and GRPO updates the policy toward higher-reward generations. F5R-TTS reports a 29.5 percent relative reduction in WER and a 4.6 percent relative SIM improvement over a vanilla F5-TTS baseline on zero-shot voice cloning, evaluated under the same protocol.[17]
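The group-relative part of GRPO can be illustrated with a small sketch: several candidate utterances are generated for the same prompt, each is scored, and each score is normalized against its own group. The linear reward below and its weights are hypothetical; the F5R-TTS paper's exact reward formulation may differ.

```python
import numpy as np

def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: higher similarity is good, higher WER is bad."""
    return sim_weight * sim - wer_weight * wer

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Samples that score above their group's mean receive a positive advantage and are reinforced, while below-average samples are discouraged, so no separate value network is needed to provide a baseline.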
In September 2025, a separate paper, "Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis," proposed a training recipe that removes the requirement for a transcript of the reference audio. The approach uses the Massively Multilingual Speech (MMS) forced aligner to extract phoneme or syllable boundaries from training data and trains transformer-based speaking-rate predictors at phoneme, syllable, and word granularity. This makes prompt-based cross-lingual cloning practical even when the reference language is not in the model's text vocabulary.[18]
The SHARED.md registry maintained in the SWivid repository tracks community checkpoints. As of mid-2025 it included Japanese fine-tunes contributed by community member Jmica, Korean fine-tunes, Spanish (F5-Spanish by jpgallegoar), Brazilian Portuguese (ModelsLab/F5-tts-brazilian), Sinhala (tharindumihi/tts-si-F5-TTS), and additional European languages.[3][19] F5-TTS uses full fine-tuning by default rather than LoRA, meaning that fine-tuned checkpoints are full-size replacements rather than additive adapters.[5]
The most discussed application of F5-TTS is zero-shot voice cloning for video and audio production. The ComfyUI ecosystem in particular has used F5-TTS as a drop-in for voice work in generative video pipelines, pairing it with image and motion models for character voiceover.[15] Open-source dubbing pipelines have integrated F5-TTS to translate a source audio track into a different language while preserving speaker timbre, often using a Whisper-based pipeline for the source transcript and then routing the translation through F5-TTS with the original audio as the reference clip.[20] Hobbyist projects use the model for audiobook narration, podcast generation, and accessibility tools that need a personalized voice. Researchers use the model as a strong open baseline against which to compare new TTS architectures, and the F5R-TTS work is one explicit example of using F5-TTS as a starting checkpoint for reinforcement-learning extensions.[17]
Because the model can synthesize from a short, easily harvested reference clip, F5-TTS also figures in broader discussions of voice deepfakes and consent. The non-commercial license on the checkpoint and the model card's content notice ask users to obtain consent from the speakers whose voices they reproduce, but the underlying architecture does not include any speaker-identity watermark or technical safeguard. Several commercial TTS vendors have referenced F5-TTS as a representative open baseline when arguing for their own provenance and watermark schemes.[16]
The model has known limitations. Reference clips longer than about 15 seconds are silently truncated, because the network was trained for total sequence lengths corresponding to roughly 30 seconds of audio, and quality degrades when very long target syntheses are concatenated rather than batched in shorter chunks.[15] The non-autoregressive design produces the entire utterance in parallel, which prevents the kind of low-latency streaming first-chunk emission offered by autoregressive systems such as XTTS-v2, and the inference-time RTF of 0.15 is measured on an A100 GPU; on lower-end consumer GPUs the same configuration runs noticeably slower.[12]
Audio quality is generally competitive with closed-source commercial systems, but user reports describe occasional artifacts on edge cases (very fast speech, heavily accented references, noisy reference audio). Some users have noted that the model's voice fidelity is not yet at parity with the best commercial cloning APIs and that artifacts can be audible in production-grade workflows.[4] The official base checkpoint, like other models trained on in-the-wild data, can reproduce biases of the training corpus; the Emilia partitions overrepresent certain English and Mandarin accents, and community reports note that uncommon dialects sometimes shift toward more familiar variants.[3]
The non-commercial license on the checkpoints is the most-discussed structural limitation. Because the weights inherit a CC-BY-NC restriction from Emilia, they cannot legally be used in monetized YouTube videos, commercial dubbing, or other revenue-generating settings, even after community fine-tuning of the CC-BY-NC base.[4] The repository discussion forums include long threads on this point, with maintainers reaffirming the constraint and pointing commercial users to the permissively licensed OpenF5-TTS-Base reimplementation as a workaround.[4]
A final criticism is one shared across the zero-shot voice cloning literature: the technical ease with which the model can imitate any voice from a short clip raises misuse risks that the project does not fully solve, and the model card explicitly relies on users to follow ethical guidelines rather than imposing technical limits.[3]
| System | Year | Architecture | Open weights | License (weights) | Real-time factor | Notes |
|---|---|---|---|---|---|---|
| F5-TTS | 2024 | Flow matching + DiT + ConvNeXt | Yes | CC-BY-NC-4.0 | ~0.15 (A100) | Reference for this article |
| E2 TTS | 2024 | Flow matching + Flat-UNet | Reproduced in F5-TTS repo | CC-BY-NC-4.0 | ~0.68 (A100, 32 NFE) | Direct precursor[2] |
| CosyVoice | 2024 | Codec LM + flow matching | Yes | Apache 2.0 (varies) | Streaming-capable | Strong speaker similarity[12] |
| VALL-E (and successors) | 2023 | Neural codec LM | Reimplementations only | Research | Autoregressive | Token-by-token generation[6] |
| XTTS-v2 | 2023 | GPT-style codec LM | Yes | Coqui CPML | RTF 0.48 with ~3 s latency | Streaming, multilingual[12] |
| Tortoise-TTS | 2022 | Autoregressive + diffusion refiner | Yes | Apache 2.0 | High latency | Limited input length[12] |
| MaskGCT | 2024 | Masked generative codec transformer | Yes | Research | Parallel | Competing non-AR approach |
| ChatTTS | 2024 | Conversational AR model | Yes | CC-BY-NC | Moderate | Dialogue-focused |
| Sesame CSM (csm-1b) | 2025 | Conversational speech model | Yes | Apache 2.0 | Competitive | Cited as peer of F5-TTS in surveys[12] |
| F5R-TTS | 2025 | F5-TTS + GRPO RL | Yes (FrontierLabs) | Research | Same as F5-TTS | RL variant with automatic rewards, better WER/SIM[17] |
| ElevenLabs / ElevenLabs v3 | 2023+ | Proprietary | No | Closed | Streaming | Dominant commercial baseline[16] |
| Murf | Commercial | Proprietary | No | Closed | Streaming | Template-driven workflow[16] |
In open-source surveys published throughout 2025, F5-TTS consistently appears in the top group on naturalness and speaker similarity while distinguishing itself by a relatively low GPU memory footprint (roughly 3 GB for typical inference) and an absence of the rare hallucinations seen in some autoregressive codec language models.[12]
F5-TTS occupies a particular niche in the post-2023 TTS landscape. It demonstrated that a clean conditional flow-matching objective, paired with a Diffusion Transformer trunk and a small ConvNeXt V2 text refiner, could match or surpass more elaborate diffusion- and codec-language-model designs on the standard zero-shot benchmarks while keeping few moving parts: no duration predictor, no phoneme aligner, no per-language text encoder, and an inference-time speed-up (Sway Sampling) that can be applied as a drop-in to other flow-matching TTS models.[2] As an artifact, the project has served as a base for academic follow-ups (F5R-TTS, Cross-Lingual F5-TTS), as a baseline against which new commercial systems advertise themselves, and as a popular open-source tool for hobbyists and small studios that need self-hostable voice cloning. As of mid-2025 it is one of the most-starred open-source TTS repositories on GitHub, with on the order of 14,000 stars and an active community of fine-tuners and integrators.[3] The combination of permissive code licensing, a clean architecture amenable to extension, and an actively maintained ecosystem of ports (MLX, Triton, TensorRT-LLM, ComfyUI) accounts for its outsized visibility relative to closely related research models that did not see comparable community uptake.[3][9][15]