Voicebox
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,405 words
Voicebox is a non-autoregressive, text-conditioned generative model for speech developed by Meta AI Research and announced in June 2023. It is trained with a conditional flow-matching objective on an infill-style audio prediction task, which lets a single set of weights perform zero-shot text-to-speech (TTS), voice editing, noise removal, content editing, style transfer, and multilingual TTS without task-specific fine-tuning.[1][2] Voicebox was one of the first speech models to scale flow matching as the generative objective, and it demonstrated state-of-the-art zero-shot TTS quality at the time of release. Meta withheld the model from public release on safety grounds and published only a research paper, audio samples, and a detection classifier.[2][3] The system is the direct predecessor of Audiobox and a methodological influence on later flow-matching speech systems such as E2 TTS and F5-TTS.[4][5]
| Field | Value |
|---|---|
| Developer | Meta AI Research (FAIR) |
| Announced | June 16, 2023[2] |
| arXiv paper | 2306.15687, submitted June 23, 2023[1] |
| Venue | NeurIPS 2023[6] |
| Lead authors | Matthew Le, Apoorv Vyas, Bowen Shi, Wei-Ning Hsu (corresponding)[1] |
| Audio backbone | 24-layer Transformer, 16 heads, 1024 dim, 4096 FFN[7] |
| Audio model size | approximately 330M parameters[7] |
| Duration model | 28M (English) / 34M (multilingual) parameters[7] |
| Training objective | Conditional flow matching with optimal-transport path[1][7] |
| Audio representation | 80-dim log mel spectrogram at 100 Hz[7] |
| Training data (English) | approximately 60,000 hours of audiobooks[7] |
| Training data (multilingual) | approximately 50,000 hours across six languages[7] |
| Languages | English, French, German, Spanish, Polish, Portuguese[7] |
| Public weights | Not released[2][3] |
| Demo URL | voicebox.metademolab.com[2] |
| Successor | Audiobox (December 2023)[4] |
Modern neural text-to-speech moved through several paradigms in the decade before Voicebox. Concatenative and statistical-parametric systems were largely replaced by neural sequence-to-sequence models in the mid-2010s, and subsequently by neural codec language models and diffusion-based approaches. By 2022 and 2023 two distinct directions dominated state-of-the-art zero-shot TTS: discrete-token autoregressive models such as VALL-E, which framed TTS as next-token prediction over neural audio codec tokens, and continuous-feature diffusion or flow models that predicted mel spectrograms or codec latents directly.[8][9] Voicebox sits firmly in the second category but distinguishes itself by training the model to fill in masked regions of an audio sequence given surrounding audio context, rather than to generate audio left-to-right.[1]
The shift toward generalist speech systems mirrored a broader change in machine-learning practice. Where text and image generation had moved decisively to large pretrained generalists capable of in-context learning by 2022, speech generation in the same period was still dominated by narrowly task-specific systems: one model for TTS, another for voice conversion, a third for denoising, and so on. The Voicebox authors open their paper by drawing this contrast explicitly, noting that "speech generative models are still primitive in terms of scale and task generalization" relative to systems such as GPT and DALL-E, and that the Voicebox project was conceived as a deliberate attempt to close that gap by training a single non-task-specific generative model on a much larger and less curated speech corpus than previous work.[1] In that respect Voicebox should be read alongside contemporary efforts such as VALL-E, Bark, and NaturalSpeech 2 as part of a 2022 to 2023 wave of zero-shot speech models that all aimed for in-context generalisation, rather than as an isolated paper.[8][9]
Voicebox originated in Meta's Fundamental AI Research (FAIR) speech and audio team. The project was led by Matthew Le and Apoorv Vyas with senior contributions from Wei-Ning Hsu, the paper's corresponding author, who had previously led the HuBERT self-supervised speech representation work.[1] Other authors are Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, and Jay Mahadeokar.[1] The team's motivation, stated explicitly in the paper's introduction, was to bring to speech the in-context, generalist behaviour that large language models such as GPT had demonstrated for text and that DALL-E had demonstrated for images.[1]
Meta announced Voicebox publicly on June 16, 2023, with a blog post on the AI at Meta site and a curated demo gallery at voicebox.metademolab.com.[2] The accompanying arXiv preprint, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale," was submitted on June 23, 2023.[1] The paper was subsequently accepted to the Neural Information Processing Systems (NeurIPS) 2023 conference, where it appeared in the main proceedings.[6] In its announcement Meta explicitly framed Voicebox as "the first model that can generalize to speech-generation tasks it was not specifically trained to accomplish with state-of-the-art performance."[2]
Unusually for a high-profile FAIR speech paper, Meta declined to publish either model weights or training code. The blog post stated, "There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time," and described the company's policy of trying to "strike the right balance between openness with responsibility" by releasing the paper, audio samples, and a detection classifier instead of the model.[2][10][11] The withholding decision generated substantial press coverage, with outlets including AI Business, Entrepreneur, and Reworked focusing on Voicebox's ability to clone a voice from approximately two seconds of reference audio and the resulting deepfake risk.[10][11][12]
The contrast with Meta's roughly simultaneous open release of Llama 2 in July 2023 was widely commented on at the time, and the company's stated reasoning hinged on the differential risk profile of high-quality voice cloning compared with text generation. The blog post specifically frames the Voicebox detection classifier as part of a longer-term plan: Meta said it intended to study evaluations and tooling for distinguishing real and synthetic speech before considering any broader release. As of the publication of the Audiobox successor in December 2023, no Voicebox weights had been released to the public, and Meta's external API offerings in 2023 and 2024 did not expose Voicebox itself.[2][4][10]
In December 2023, Meta released Audiobox as a successor system that generalised Voicebox's flow-matching backbone from speech-only to a broader audio-generation setting including sound effects and music, with natural-language prompts as conditioning. The Audiobox paper explicitly describes itself as built on Voicebox and on Meta's SpeechFlow self-supervised pre-training work.[4] Voicebox's flow-matching-on-mel-spectrograms design has since been adopted, adapted, or simplified in several other zero-shot TTS systems, including E2 TTS from Microsoft (2024) and the open F5-TTS model (2024), both of which compare directly with Voicebox as a primary baseline.[5][13]
Voicebox formulates speech generation as a masked-infill problem rather than as next-token prediction. Given a (text, audio) pair, the training procedure samples a binary mask m over audio frames, exposes the unmasked audio frames x_ctx and the full text z to the model, and asks the model to predict the masked frames.[1][7] Because the mask can cover any subset of frames, the same trained model supports many downstream behaviours simply by choosing which frames to mask at inference time: masking everything after a speaker prompt yields zero-shot TTS, masking a span inside an utterance yields speech editing, and masking noise-corrupted spans yields noise removal.
The authors describe this as making Voicebox "more flexible" than autoregressive systems such as GPT, "as it can also condition on future context."[1]
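How the choice of mask selects the task can be sketched in a few lines of NumPy. This is an illustrative sketch rather than Meta's code: the helper names `tts_mask` and `edit_mask` and the frame counts are hypothetical, and a mask value of `True` is assumed to mean "frame to be generated".

```python
import numpy as np

FRAME_RATE_HZ = 100  # mel frames per second, matching the 100 Hz features

def tts_mask(prompt_frames: int, target_frames: int) -> np.ndarray:
    """Zero-shot TTS: keep the prompt as context, generate everything after it."""
    mask = np.zeros(prompt_frames + target_frames, dtype=bool)
    mask[prompt_frames:] = True
    return mask

def edit_mask(total_frames: int, start_s: float, end_s: float) -> np.ndarray:
    """Editing or denoising: regenerate only the span from start_s to end_s."""
    mask = np.zeros(total_frames, dtype=bool)
    start = int(round(start_s * FRAME_RATE_HZ))
    end = int(round(end_s * FRAME_RATE_HZ))
    mask[start:end] = True
    return mask

# A 3 s prompt followed by 5 s of new speech: 300 context frames, 500 generated.
m_tts = tts_mask(prompt_frames=300, target_frames=500)

# Replace 0.5 s in the middle of a 4 s utterance; both sides stay as context,
# which is the "future context" an autoregressive model cannot use.
m_edit = edit_mask(total_frames=400, start_s=1.5, end_s=2.0)
```

In both cases the model itself is unchanged; only the mask and the accompanying transcript differ.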
Voicebox is trained with conditional flow matching (CFM), the simulation-free continuous-normalising-flow technique introduced by Lipman and colleagues at ICLR 2023.[14] The high-level idea of flow matching is to learn a time-dependent vector field v_t such that the ordinary differential equation dx/dt = v_t(x) transports a simple noise distribution at t=0 to the data distribution at t=1. Rather than fitting v_t by maximum-likelihood training of a continuous normalising flow, which requires costly ODE simulation at every update, flow matching regresses v_t directly against the analytic vector field of a chosen conditional probability path, which makes training a straightforward regression problem.[14] Voicebox uses the optimal-transport conditional path proposed in the same paper, which interpolates linearly between Gaussian noise x_0 and a data sample x_1.[1][7]
In practice this turns audio synthesis into a familiar regression problem at training time: for each training audio sample x_1 and a randomly sampled noise x_0 of the same shape, a time step t is drawn uniformly in [0, 1], the interpolant w = (1 - (1 - sigma_min) t) x_0 + t x_1 is constructed, and the Transformer is asked to predict the constant-in-time target vector x_1 - (1 - sigma_min) x_0 from w, the unmasked audio context, and the phoneme conditioning. The same Transformer is reused at all values of t; the flow step is supplied as an explicit positional embedding so that the model can adapt its behaviour smoothly between very noisy and nearly clean interpolants. This formulation is simulation-free in the sense that no actual ODE integration is required during training, which is what makes flow matching tractable at the scale of 330 million parameters and tens of thousands of hours of audio.[14][7]
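The construction of one training example can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names `ot_interpolant` and `ot_target` are hypothetical, random arrays stand in for real log-mel features, and no Transformer is involved. The finite-difference check at the end confirms that the regression target is the time derivative of the interpolant, which is why the target is constant in t.

```python
import numpy as np

SIGMA_MIN = 1e-5

def ot_interpolant(x0, x1, t, sigma_min=SIGMA_MIN):
    """Noisy interpolant w on the optimal-transport path at flow step t."""
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1

def ot_target(x0, x1, sigma_min=SIGMA_MIN):
    """Regression target for the vector field; it does not depend on t."""
    return x1 - (1.0 - sigma_min) * x0

rng = np.random.default_rng(0)
x1 = rng.standard_normal((200, 80))   # stand-in for 2 s of 80-bin log-mel frames
x0 = rng.standard_normal(x1.shape)    # Gaussian noise of the same shape
t = rng.uniform(0.0, 1.0)             # flow step drawn uniformly from [0, 1]

w = ot_interpolant(x0, x1, t)         # model input
u = ot_target(x0, x1)                 # model target

# d w / d t, estimated by a finite difference, matches the target.
eps = 1e-3
finite_diff = (ot_interpolant(x0, x1, t + eps) - w) / eps
```

At t=0 the interpolant is pure noise, and at t=1 it is the data sample plus a residual of scale sigma_min times the noise.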
The training loss is a masked CFM loss restricted to the masked frames, written in the paper as
L_audio-CFM-m(theta) = E[ || m * ( ( x_1 - (1 - sigma_min) x_0 ) - v_t(w, x_ctx, z; theta) ) ||^2 ]
with sigma_min approximately 1e-5.[7] Here w is the noisy interpolant, x_ctx is the unmasked audio context, z is the frame-aligned phone transcript, and theta are the Transformer parameters parameterising v_t. The duration model uses an analogous CFM loss to predict phone durations from masked durations.[7]
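The effect of the mask on the loss can be illustrated with a small NumPy sketch. The function name `masked_cfm_loss` is hypothetical, x1 denotes the clean data sample, and averaging over the masked elements is a normalisation chosen here for readability; the formula above is written as an expectation of a squared norm, and the paper's exact reduction is not assumed.

```python
import numpy as np

def masked_cfm_loss(v_pred, x0, x1, mask, sigma_min=1e-5):
    """Squared error between predicted and target fields on masked frames only.

    v_pred, x0, x1: arrays of shape (frames, mel_bins).
    mask: array of shape (frames,), 1 for frames to generate, 0 for context.
    """
    target = x1 - (1.0 - sigma_min) * x0
    diff = mask[:, None] * (target - v_pred)
    n_masked_elems = max(float(mask.sum()) * x1.shape[1], 1.0)
    return float(np.sum(diff ** 2) / n_masked_elems)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((10, 80))
x0 = rng.standard_normal((10, 80))
mask = np.zeros(10)
mask[4:] = 1.0                        # frames 4..9 are to be generated

target = x1 - (1.0 - 1e-5) * x0
v_context_wrong = target.copy()
v_context_wrong[:4] += 100.0          # large errors, but only on context frames
loss_context = masked_cfm_loss(v_context_wrong, x0, x1, mask)

v_masked_wrong = target.copy()
v_masked_wrong[4:] += 1.0             # unit error on every masked element
loss_masked = masked_cfm_loss(v_masked_wrong, x0, x1, mask)
```

Errors on context frames contribute nothing, so the model is only ever penalised on the frames it is asked to fill in.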
At inference time Voicebox uses classifier-free guidance in the style introduced for diffusion models, with an unconditional pass that drops both the audio context x_ctx and the text z. The guided vector field is
v_tilde_t = (1 + alpha) * v_t(w, x_ctx, z; theta) - alpha * v_t(w; theta)
The model is trained with an unconditional drop probability p_uncond = 0.2, and the guidance scale alpha is tuned per task, typically around 0.7 for zero-shot TTS.[7] Inference is then a standard ODE integration: Voicebox's default is the midpoint solver with step size 0.0625, which is 16 steps and 32 evaluations of the guided field (64 Transformer forward passes when the conditional and unconditional branches of classifier-free guidance are counted separately), and the authors report that under classifier-free guidance 32 NFEs suffice to recover the best quality. At the low end the model can synthesise ten seconds of audio in roughly 0.31 seconds at NFE=2 on the authors' hardware, which the paper characterises as up to twenty times faster than the autoregressive VALL-E baseline.[7]
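The sampling loop can be sketched with a fixed-step midpoint solver wrapped around a classifier-free-guided field. This is a toy illustration, not the Voicebox implementation: `make_toy_field`, `GuidedField`, and `midpoint_solve` are hypothetical names, and closed-form optimal-transport fields stand in for the conditional and unconditional Transformer passes. The counters show that a step size of 0.0625 corresponds to 16 midpoint steps, 32 evaluations of the guided field, and 64 underlying forward passes.

```python
import numpy as np

SIGMA_MIN = 1e-5

def make_toy_field(target):
    """Closed-form OT-path field that transports any start point toward target."""
    def field(w, t):
        return (target - (1.0 - SIGMA_MIN) * w) / (1.0 - (1.0 - SIGMA_MIN) * t)
    return field

class GuidedField:
    """Classifier-free guidance: (1 + alpha) * conditional - alpha * unconditional."""

    def __init__(self, cond_field, uncond_field, alpha):
        self.cond_field = cond_field
        self.uncond_field = uncond_field
        self.alpha = alpha
        self.field_evals = 0      # evaluations of the guided field (NFE)
        self.forward_passes = 0   # underlying model calls, two per evaluation

    def __call__(self, w, t):
        self.field_evals += 1
        self.forward_passes += 2
        v_cond = self.cond_field(w, t)
        v_uncond = self.uncond_field(w, t)
        return (1.0 + self.alpha) * v_cond - self.alpha * v_uncond

def midpoint_solve(field, x0, step=0.0625):
    """Integrate dw/dt = field(w, t) from t=0 to t=1 with the midpoint rule."""
    n_steps = int(round(1.0 / step))
    w, t = x0.copy(), 0.0
    for _ in range(n_steps):
        k1 = field(w, t)                                   # slope at step start
        k2 = field(w + 0.5 * step * k1, t + 0.5 * step)    # slope at midpoint
        w = w + step * k2
        t += step
    return w

rng = np.random.default_rng(0)
x0 = rng.standard_normal((50, 80))   # noise for 0.5 s of 80-bin frames
x1 = rng.standard_normal((50, 80))   # toy stand-in for the conditional target
guided = GuidedField(make_toy_field(x1), make_toy_field(np.zeros_like(x1)), alpha=0.7)
sample = midpoint_solve(guided, x0, step=0.0625)
print(guided.field_evals, guided.forward_passes)  # 32 64
```

Reducing the number of steps trades quality for speed; a single midpoint step over the whole interval corresponds to the NFE=2 setting quoted above.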
The audio backbone is a 24-layer Transformer with 16 attention heads, 1024-dimensional embeddings, and 4096-dimensional feed-forward layers (approximately 330M parameters in the audio model). Positional information is provided by convolutional positional embeddings, and self-attention uses symmetric bi-directional ALiBi biases instead of fixed sinusoidal encodings.[7] The Transformer is wrapped in U-Net-style skip connections that link the first layer to the last layer, the second to the second-to-last, and so on, giving the network the long-range residual flow that has become common in modern denoiser networks.[7]
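A back-of-the-envelope check shows how the stated dimensions lead to roughly 330M parameters, and how the skip connections pair up. The sketch rests on assumptions that are not taken from the text above: each skip connection is assumed to concatenate two 1024-dimensional states and project them back with one linear layer, and biases, layer norms, and input and output embeddings are ignored. The function names are hypothetical and layers are zero-indexed.

```python
def skip_pairs(n_layers: int):
    """Pair layer i with layer n_layers - 1 - i (first with last, and so on)."""
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]

def estimate_weights(n_layers: int = 24, d_model: int = 1024, d_ffn: int = 4096):
    """Count weight-matrix entries only; biases, norms, embeddings are ignored."""
    attn = 4 * d_model * d_model            # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn               # up-projection and down-projection
    core = n_layers * (attn + ffn)
    # Assumption: one (2 * d_model) -> d_model linear layer per skip connection.
    skips = (n_layers // 2) * (2 * d_model) * d_model
    return core, skips

pairs = skip_pairs(24)
core, skips = estimate_weights()
print(pairs[0], pairs[-1])        # (0, 23) (11, 12)
print(core, skips, core + skips)  # 301989888 25165824 327155712
```

Under these assumptions the weight matrices alone account for about 327M parameters, which is consistent with the approximately 330M figure once the ignored terms are added.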
Audio is represented as an 80-dimensional log-mel spectrogram sampled at 100 Hz, which is converted back to a waveform by a HiFi-GAN vocoder trained separately on the same data.[7] The duration model is a much smaller Transformer (28M parameters in 8 layers for English, 34M parameters in 10 layers for multilingual) that predicts the masked per-phoneme durations conditioned on the unmasked duration context and the phoneme sequence.[7] At inference, the duration model is sampled first to determine the alignment between phonemes and frames, and then the audio model fills in the frame-level mel spectrogram conditioned on the resulting frame-aligned phone transcript.[7]
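The step from predicted durations to a frame-aligned phone transcript amounts to repeating each phone for its number of frames. The sketch below assumes durations are expressed in frames at 100 Hz; the function name `frame_align`, the phone symbols, and the duration values are made up for illustration.

```python
import numpy as np

FRAME_RATE_HZ = 100  # one frame every 10 ms

def frame_align(phones, durations):
    """Repeat each phone for its predicted number of frames."""
    if len(phones) != len(durations):
        raise ValueError("need exactly one duration per phone")
    return np.repeat(np.asarray(phones), np.asarray(durations, dtype=int))

phones = ["HH", "AH", "L", "OW"]   # hypothetical phone sequence
durations = [8, 6, 7, 12]          # hypothetical durations, in frames
z = frame_align(phones, durations)

n_frames = len(z)                      # 33 frames
seconds = n_frames / FRAME_RATE_HZ     # 0.33 s of audio to generate
```

The length of `z` fixes the number of mel frames the audio model has to fill in, so the duration model effectively decides how long the output will be.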
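Voicebox is trained on transcribed audiobook speech in two configurations: the English-only VB-En, trained on approximately 60,000 hours of audiobooks, and the multilingual VB-Multi, trained on approximately 50,000 hours across six languages (English, French, German, Spanish, Polish, and Portuguese).[7]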
Crucially, neither configuration applies the heavy filtering or enhancement used in many speech corpora; the paper repeatedly emphasises that Voicebox is "trained on over 50K hours of speech that are not filtered or enhanced," which it identifies as a key reason for the model's strong robustness on out-of-domain inputs.[1]
The audio model is trained for 500,000 updates in the English setting and 750,000 updates in the multilingual setting, with the duration model trained for 600,000 updates. Optimisation uses Adam at peak learning rate 1e-4, with a 5,000-step linear warm-up followed by linear decay, gradient norm clipping at 0.2, and FP16 mixed precision. The effective batch size is 240,000 frames for the audio model and 60,000 for the duration model. During training, the entire sequence is masked with probability 0.3; otherwise a contiguous segment covering r% of the frames is masked, where r is sampled uniformly from [70, 100].[7]
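The training-time masking rule can be sketched as a small sampler. The function name `sample_training_mask` is hypothetical, and placing the contiguous segment at a uniformly random start position is an assumption, since the description above does not say where the segment begins.

```python
import numpy as np

def sample_training_mask(n_frames, rng, p_full=0.3, r_low=70.0, r_high=100.0):
    """Return a boolean mask where True marks frames the model must predict."""
    mask = np.zeros(n_frames, dtype=bool)
    if rng.random() < p_full:
        mask[:] = True                        # whole sequence masked
        return mask
    r = rng.uniform(r_low, r_high)            # percentage of frames to mask
    span = min(n_frames, max(1, int(round(n_frames * r / 100.0))))
    # Assumption: the segment start is uniform over all valid positions.
    start = int(rng.integers(0, n_frames - span + 1))
    mask[start:start + span] = True
    return mask

rng = np.random.default_rng(0)
masks = [sample_training_mask(1000, rng) for _ in range(200)]
masked_fraction = float(np.mean([m.mean() for m in masks]))
```

Because at least 70% of the frames are always hidden, the model sees little audio context during training, which matches the short prompts used for zero-shot TTS at inference time.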
Because Voicebox is a single masked-infill model, the same checkpoint is used across very different generative tasks; only the inference-time mask and the input conditioning change.
In the canonical zero-shot TTS configuration, the model receives an unmasked audio prompt of a target speaker plus the text to synthesise. The mask covers all frames after the prompt, and Voicebox is asked to generate the corresponding mel spectrogram in the prompt speaker's voice.[1][7] The paper reports that on LibriSpeech test-clean with three-second prompts, Voicebox achieves 1.9% word error rate and 0.681 speaker similarity (SIM-r), compared with 5.9% WER and 0.580 SIM-r for VALL-E under similar evaluation conditions.[1][7]
By masking a middle region of an existing utterance and supplying both the surrounding audio and a corrected transcript, Voicebox can perform localised speech editing: replacing a single word or phrase while preserving the speaker identity, room acoustics, and prosody of the surrounding context.[1][2] The Meta demo gallery and supplementary materials emphasise this capability, including replacing mis-spoken words and removing background noise via infill, both of which are framed as instances of the same masked-prediction task.[2]
Voicebox can be conditioned on a noisy reference together with a mask covering only the noise-corrupted spans, in which case it generates clean speech in the same speaker's voice for those spans while keeping the clean unmasked context intact. The paper presents this as a free side benefit of the infilling formulation rather than a separately fine-tuned mode.[1][7]
In the multilingual configuration the same model handles all six training languages. For cross-lingual style transfer the audio prompt is in one language and the requested transcript is in another; the model is expected to produce the new language's content in a voice resembling the prompt. The paper reports an average WER of 5.2% and SIM-r of 0.481 in this cross-lingual setting, against 10.9% WER and 0.335 SIM-r for the YourTTS baseline.[7]
The cross-lingual numbers should be read with the upsampling regime in mind. Because the multilingual corpus is heavily skewed toward English, the multinomial upsampling with exponent beta = 0.25 makes the effective sampling distribution much flatter than the raw hour counts. The paper reports that this upsampling is essential for the low-resource languages in the set, including Polish, and that without it the model's per-language similarity scores degrade sharply. The authors do not extend the multilingual evaluation beyond the six training languages, so Voicebox's behaviour on truly out-of-distribution languages such as Mandarin Chinese or Japanese is not characterised in the original paper.[7]
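The flattening effect of the exponent can be illustrated with a common form of multinomial upsampling, in which each language is sampled with probability proportional to its data share raised to the power beta. Both the exact formula and the per-language hours below are assumptions made for illustration; the hours are hypothetical and are not the corpus statistics from the paper.

```python
import numpy as np

def upsampled_probs(hours, beta=0.25):
    """Sampling probabilities proportional to (data share) ** beta."""
    hours = np.asarray(hours, dtype=float)
    share = hours / hours.sum()
    weights = share ** beta
    return weights / weights.sum()

# Hypothetical hours for six languages, largest first (sums to 50,000).
hours = [30000, 8000, 6000, 4000, 1500, 500]
raw = np.asarray(hours, dtype=float) / sum(hours)
flat = upsampled_probs(hours, beta=0.25)

# The largest-to-smallest ratio shrinks from 60 to 60 ** 0.25, about 2.8.
raw_ratio = raw.max() / raw.min()
flat_ratio = flat.max() / flat.min()
```

With beta = 1 the sampler reproduces the raw data shares, and as beta approaches 0 it approaches a uniform distribution over languages.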
Voicebox can sample multiple diverse continuations of the same text or use unconditional generation as a data-augmentation engine. The paper shows that a speech recogniser trained on Voicebox-generated speech rather than real LibriSpeech audio loses only 0.4% and 1.7% absolute WER on test-other and test-clean respectively, suggesting that the model's samples preserve enough acoustic and linguistic variability to substitute for real audio in speech recognition training.[7]
Meta did not release Voicebox weights or code, so the only fully sanctioned variants are the two configurations reported in the paper: the English-only VB-En and the multilingual VB-Multi.[1][7] Third parties have re-implemented the architecture from the paper, including SpeechifyInc's "Meta-voicebox" repository and lucidrains' "voicebox-pytorch" PyTorch reference, though neither distribution constitutes an official release and reproduction quality depends on the specific dataset and vocoder used by the re-implementer.[15][16]
Voicebox is the first widely cited demonstration that conditional flow matching, in particular Lipman et al.'s optimal-transport variant, can scale to large speech models and yield state-of-the-art zero-shot TTS while remaining substantially faster at inference than autoregressive codec language models.[1][14] The Audiobox successor explicitly retains the same flow-matching mel-spectrogram backbone for its speech component.[4] Later non-autoregressive zero-shot TTS systems including Microsoft's E2 TTS and the open F5-TTS use Voicebox as a primary point of comparison and adopt very similar flow-matching-on-mel-spectrograms architectures, simplifying only the alignment supervision.[5][13] The continuous, non-token approach has thus become one of the two dominant paradigms in modern zero-shot TTS, alongside discrete codec autoregression in the VALL-E line.[8]
Beyond the specific flow-matching choice, Voicebox is significant for arguing that a single non-task-specific generative model can serve as a "universal" speech model. The paper opens by contrasting the in-context generalist behaviour of large language and image models with the much narrower task scope of contemporary speech models, and presents Voicebox as the speech-domain analogue.[1] This framing prefigures Meta's broader 2023-2024 audio strategy, including Audiobox's natural-language-prompted unified audio generation and the SpeechFlow self-supervised audio backbone.[4]
The decision to publish a NeurIPS paper and audio samples but withhold weights drew significant attention to the question of how high-fidelity voice-cloning systems should be released. Press coverage emphasised that Voicebox could imitate a target voice from approximately two seconds of audio, well below the requirements of earlier systems, and several articles framed Meta's restraint as a notable departure from the company's then-recent strategy of openly releasing large models such as LLaMA and Llama 2.[10][11][12] In the blog post Meta indicated that it had released a detection classifier intended to distinguish authentic from Voicebox-generated speech, and described future work on "evaluations and tools" before any broader release.[2]
The table below summarises the headline quantitative comparisons reported in the Voicebox paper.
| Task | Setting | Metric | Voicebox | Baseline | Baseline name |
|---|---|---|---|---|---|
| English zero-shot TTS | LibriSpeech test-clean | WER (lower is better) | 1.9% | 5.9% | VALL-E[1][7] |
| English zero-shot TTS | LibriSpeech test-clean | SIM-r (higher is better) | 0.681 | 0.580 | VALL-E[1][7] |
| Inference speed | 10 s audio, NFE=2 | wall-clock | approximately 0.31 s | approximately 20x slower | VALL-E[7] |
| Cross-lingual TTS | Average across six languages | WER | 5.2% | 10.9% | YourTTS[7] |
| Cross-lingual TTS | Average across six languages | SIM-r | 0.481 | 0.335 | YourTTS[7] |
| ASR training data | LibriSpeech test-other | absolute WER increase vs real | 0.4% | n/a | n/a[7] |
| ASR training data | LibriSpeech test-clean | absolute WER increase vs real | 1.7% | n/a | n/a[7] |
These numbers are drawn from the published version of the paper and reflect the authors' own evaluation; later third-party benchmarks under different prompt durations or test sets do not always reproduce them exactly.[5][13]
Voicebox inherits several limitations that the authors and later work have identified.
Reliance on frame-aligned phone transcripts. Voicebox requires a frame-level phoneme alignment as conditioning, which in turn requires a separate forced-alignment model at training time. Later systems such as E2 TTS were motivated in part by removing this requirement, arguing that an end-to-end flow-matching Transformer plus a vocoder-style decoder can achieve comparable quality without phoneme alignment.[5]
Audiobook-only training distribution. Both VB-En and VB-Multi are trained on transcribed audiobook recordings. While the authors do not filter or enhance the data, the recording conditions remain those of audiobook narration: largely single-speaker and clean, with expressive but contained prosody. Generalisation to telephony, conversational speech, and noisy real-world environments is not the focus of the original evaluation.[7]
Mel spectrogram bottleneck. Voicebox operates in mel-spectrogram space rather than directly on waveform or on a learned neural audio codec. Audio quality is therefore upper-bounded by the quality of the separately trained HiFi-GAN vocoder, and certain fine acoustic details (such as breath or transient noise) are filtered through the vocoder rather than modelled by the flow.[7]
Closed weights. Because Meta did not release weights, independent reproduction is constrained by the availability of audiobook corpora and the engineering choices in re-implementations such as lucidrains' open-source PyTorch port; reproduction parity with Meta's reported numbers is not guaranteed.[16]
Misuse risk. Meta's own announcement and the surrounding coverage acknowledge that voice cloning from approximately two seconds of audio creates a clear deepfake risk; this is the explicit reason for the closed release.[2][10][11]
Public reception focused primarily on two threads. Technically, the speech-AI community received Voicebox as a strong step beyond VALL-E both in quality and in inference speed, and as a convincing demonstration that conditional flow matching was a competitive alternative to discrete-token autoregression for speech generation.[9][17] The NeurIPS 2023 acceptance and the rapid uptake of similar flow-matching designs in 2024 corroborated this technical reception.[5][6][13]
Less technically, Meta drew a mixed response for its non-release. Some commentators welcomed the restraint as a model for responsible release, particularly given the existing prevalence of voice-cloning scams in 2023, while others pointed out the tension between Meta's stated openness policy and the closed Voicebox decision, particularly while Meta was simultaneously open-sourcing Llama 2.[10][11][12] As of the publication of the Audiobox successor in late 2023, Meta's promise of evaluations and tooling before any broader release had not translated into open weights for Voicebox itself.[4]
Voicebox sits at the intersection of several active research lines: