Voicebox

Updated Jul 12, 2026

Voicebox is a non-autoregressive, text-conditioned generative model for speech developed by Meta AI Research and announced on June 16, 2023. It is trained with a conditional flow-matching objective on an infill-style audio prediction task, which lets a single set of weights perform zero-shot text-to-speech (TTS), voice editing, noise removal, content editing, style transfer, and multilingual TTS without task-specific fine-tuning.^[1]^[2] Meta's paper describes it as "the most versatile text-guided generative model for speech at scale," and reports that it cuts the English zero-shot TTS word error rate to 1.9% (against 5.9% for VALL-E) and runs up to 20 times faster than that autoregressive baseline; Meta's announcement adds that it can match a target voice from an audio prompt as short as two seconds.^[1]^[2] Voicebox was one of the first speech models to scale flow matching as the generative objective, and it demonstrated state-of-the-art zero-shot TTS quality at the time of release. Meta withheld the model from public release on safety grounds and published only audio samples and a research paper, which also describes a classifier for detecting Voicebox-generated speech.^[2]^[3] The system is the direct predecessor of Audiobox and a methodological influence on later flow-matching speech systems such as E2 TTS and F5-TTS.^[4]^[5]

This article is about Meta's speech-generation model. Voicebox Technologies Corporation, a separate natural-language-understanding and speech-recognition company founded in Bellevue, Washington in 2001 and acquired by Nuance Communications in 2018, is unrelated.^[18]

Infobox

Field	Value
Developer	Meta AI Research (FAIR)
Announced	June 16, 2023^[2]
arXiv paper	2306.15687, submitted June 23, 2023^[1]
Venue	NeurIPS 2023^[6]
Lead authors	Matthew Le, Apoorv Vyas, Bowen Shi, Wei-Ning Hsu (corresponding)^[1]
Audio backbone	24-layer Transformer, 16 heads, 1024 dim, 4096 FFN^[7]
Audio model size	approximately 330M parameters^[7]
Duration model	28M (English) / 34M (multilingual) parameters^[7]
Training objective	Conditional flow matching with optimal-transport path^[1]^[7]
Audio representation	80-dim log mel spectrogram at 100 Hz^[7]
Training data (English)	approximately 60,000 hours of audiobooks^[7]
Training data (multilingual)	approximately 50,000 hours across six languages^[7]
Languages	English, French, German, Spanish, Polish, Portuguese^[7]
Voice prompt length	approximately 2 seconds of reference audio^[2]
Inference speed	up to 20x faster than VALL-E^[1]
Public weights	Not released^[2]^[3]
Demo URL	voicebox.metademolab.com^[2]
Successor	Audiobox (December 2023)^[4]

How was Voicebox developed?

speech generation before Voicebox

Modern neural text-to-speech moved through several paradigms in the decade before Voicebox. Concatenative and statistical-parametric systems were largely replaced by neural sequence-to-sequence models in the mid-2010s, and subsequently by neural codec language models and diffusion-based approaches. By 2022 and 2023 two distinct directions dominated state-of-the-art zero-shot TTS: discrete-token autoregressive models such as VALL-E, which framed TTS as next-token prediction over neural audio codec tokens, and continuous-feature diffusion or flow models that predicted mel spectrograms or codec latents directly.^[8]^[9] Voicebox sits firmly in the second category but distinguishes itself by training the model to fill in masked regions of an audio sequence given surrounding audio context, rather than to generate audio left-to-right.^[1]

The shift toward generalist speech systems mirrored a broader change in machine-learning practice. Where text and image generation had moved decisively to large pretrained generalists capable of in-context learning by 2022, speech generation in the same period was still dominated by narrowly task-specific systems: one model for TTS, another for voice conversion, a third for denoising, and so on. The Voicebox authors open their paper by drawing this contrast explicitly, noting that "speech generative models are still primitive in terms of scale and task generalization" relative to systems such as GPT and DALL-E, and that the Voicebox project was conceived as a deliberate attempt to close that gap by training a single non-task-specific generative model on a much larger and less curated speech corpus than previous work.^[1] In that respect Voicebox should be read alongside contemporary efforts such as VALL-E, Bark, and NaturalSpeech 2 as part of a 2022 to 2023 wave of zero-shot speech models that all aimed for in-context generalisation, rather than as an isolated paper.^[8]^[9]

conception at FAIR

Voicebox originated in Meta's Fundamental AI Research (FAIR) speech and audio team. The project was led by Matthew Le and Apoorv Vyas with senior contributions from Wei-Ning Hsu, the paper's corresponding author, who had previously led the HuBERT self-supervised speech representation work.^[1] The other authors are Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, and Jay Mahadeokar; the paper lists eleven authors in total.^[1] The team's motivation, stated explicitly in the paper's introduction, was to bring to speech the in-context, generalist behaviour that large language models such as GPT had demonstrated for text and that DALL-E had demonstrated for images.^[1]

public announcement and arXiv release

Meta announced Voicebox publicly on June 16, 2023, with a blog post on the AI at Meta site and a curated demo gallery at voicebox.metademolab.com.^[2] The accompanying arXiv preprint, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale," was submitted on June 23, 2023.^[1] The paper was subsequently accepted to the Neural Information Processing Systems (NeurIPS) 2023 conference, where it appeared in the main proceedings.^[6] In its announcement Meta explicitly framed Voicebox as "the first model that can generalize to speech-generation tasks it was not specifically trained to accomplish with state-of-the-art performance."^[2]

the decision not to release model weights

Unusually for a high-profile FAIR speech paper, Meta declined to publish either model weights or training code. The blog post stated, "There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time," and described the company's policy of trying to "strike the right balance between openness with responsibility" by sharing the paper and audio samples, and by describing a detection classifier, instead of releasing the model.^[2]^[10]^[11] The withholding decision generated substantial press coverage, with outlets including AI Business, Entrepreneur, and gHacks Tech News focusing on Voicebox's ability to clone a voice from approximately two seconds of reference audio and the resulting deepfake risk.^[10]^[11]^[12]

The contrast with Meta's roughly simultaneous open release of Llama 2 in July 2023 was widely commented on at the time, and the company's stated reasoning hinged on the differential risk profile of high-quality voice cloning compared with text generation. The blog post specifically frames the Voicebox detection classifier as part of a longer-term plan: Meta said it intended to study evaluations and tooling for distinguishing real and synthetic speech before considering any broader release. As of the publication of the Audiobox successor in December 2023, no Voicebox weights had been released to the public, and Meta's external API offerings in 2023 and 2024 did not expose Voicebox itself.^[2]^[4]^[10]

successor work

In December 2023, Meta released Audiobox as a successor system that generalised Voicebox's flow-matching backbone from speech-only to a broader audio-generation setting including sound effects and music, with natural-language prompts as conditioning. The Audiobox paper explicitly describes itself as built on Voicebox and on Meta's SpeechFlow self-supervised pre-training work.^[4] Voicebox's flow-matching-on-mel-spectrograms design has since been adopted, adapted, or simplified in several other zero-shot TTS systems, including E2 TTS from Microsoft (2024) and the open F5-TTS model (2024), both of which compare directly with Voicebox as a primary baseline.^[5]^[13]

How does Voicebox work?

problem formulation: text-guided audio infilling

Voicebox formulates speech generation as a masked-infill problem rather than as next-token prediction. Given a (text, audio) pair, the training procedure samples a binary mask m over audio frames, exposes the unmasked audio frames x_ctx and the full text z to the model, and asks the model to predict the masked frames.^[1]^[7] Because the mask can cover any subset of frames, the same trained model supports many downstream behaviours simply by choosing which frames to mask at inference time:

Masking every frame after a short audio prompt yields zero-shot text-to-speech, where the unmasked prompt supplies the speaker's voice.
Masking a contiguous middle region yields speech editing or content editing.
Masking high-noise regions yields denoising and infill.
Conditioning the masked region on text in a different language yields cross-lingual style transfer.^[1]^[2]

The authors describe this as making Voicebox "more flexible" than autoregressive systems such as GPT, "as it can also condition on future context."^[1]
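The task-by-masking idea above can be sketched as two small mask builders. This is an illustrative sketch only: the function names, the boolean convention (True means "generate this frame"), and the example durations are assumptions, and the 100 Hz frame rate is taken from the feature description elsewhere in this article.

```python
import numpy as np

FRAME_RATE_HZ = 100  # mel frames per second (assumed from the article's feature description)

def tts_mask(prompt_seconds: float, target_seconds: float) -> np.ndarray:
    """Zero-shot TTS: keep the prompt frames, mask everything after them."""
    n_prompt = int(round(prompt_seconds * FRAME_RATE_HZ))
    n_target = int(round(target_seconds * FRAME_RATE_HZ))
    mask = np.zeros(n_prompt + n_target, dtype=bool)
    mask[n_prompt:] = True  # True = frame must be generated
    return mask

def edit_mask(total_frames: int, start_s: float, end_s: float) -> np.ndarray:
    """Editing or denoising: mask only the span that should be regenerated."""
    mask = np.zeros(total_frames, dtype=bool)
    start = int(round(start_s * FRAME_RATE_HZ))
    end = int(round(end_s * FRAME_RATE_HZ))
    mask[start:end] = True
    return mask

# Hypothetical usage: a 3 s prompt followed by 5 s of new speech,
# and a half-second edit inside an 8 s utterance.
tts = tts_mask(prompt_seconds=3.0, target_seconds=5.0)
edit = edit_mask(total_frames=800, start_s=2.0, end_s=2.5)
print(int(tts.sum()), int(edit.sum()))
```

Only the mask and the conditioning differ between the two calls, which is the sense in which one checkpoint serves several tasks.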

conditional flow matching objective

Voicebox is trained with conditional flow matching (CFM), the simulation-free continuous-normalising-flow technique introduced by Lipman and colleagues at ICLR 2023.^[14] The high-level idea of flow matching is to learn a time-dependent vector field v_t such that the ordinary differential equation dx/dt = v_t(x) transports a simple noise distribution at t=0 to the data distribution at t=1. Rather than fitting v_t by intractable maximum-likelihood training of a diffusion-like continuous normalising flow, flow matching regresses v_t directly against the analytic vector field of a chosen conditional probability path, which makes training a straightforward regression problem.^[14] Voicebox uses the optimal-transport conditional path proposed in the same paper, which interpolates linearly between Gaussian noise x_0 and a data sample x_1.^[1]^[7]

In practice this turns audio synthesis into a familiar regression problem at training time: for each training audio sample x_1 and a randomly sampled noise x_0 of the same shape, a time step t is drawn uniformly in [0, 1], the interpolant w = (1 - (1 - sigma_min) t) x_0 + t x_1 is constructed, and the Transformer is asked to predict the constant-in-time target vector x_1 - (1 - sigma_min) x_0 from w, the unmasked audio context, and the phoneme conditioning. The same Transformer is reused at all values of t; the flow step is supplied as an explicit positional embedding so that the model can adapt its behaviour smoothly between very noisy and nearly clean interpolants. This formulation is simulation-free in the sense that no actual ODE integration is required during training, which is what makes flow matching tractable at the scale of 330 million parameters and tens of thousands of hours of audio.^[14]^[7]

The training loss is a masked CFM loss restricted to the masked frames, written in the paper as

L_audio-CFM-m(theta) = E[ || m * ( ( x_1 - (1 - sigma_min) x_0 ) - v_t(w, x_ctx, z; theta) ) ||^2 ]

with sigma_min approximately 1e-5.^[7] Here w is the noisy interpolant, x_ctx is the unmasked audio context, z is the frame-aligned phone transcript, and theta are the Transformer parameters parameterising v_t. The duration model uses an analogous CFM loss to predict phone durations from masked durations.^[7]
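The interpolant, regression target, and masked loss described above can be written out in a few lines of NumPy. This is a minimal sketch, not Meta's training code: the function names are hypothetical, random arrays stand in for mel spectrograms, a zero array stands in for the Transformer output, and averaging over masked entries is an assumed normalisation.

```python
import numpy as np

SIGMA_MIN = 1e-5

def cfm_training_pair(x1, x0, t):
    """Return the noisy interpolant w and the regression target for flow step t."""
    w = (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1
    target = x1 - (1.0 - SIGMA_MIN) * x0  # does not depend on t
    return w, target

def masked_cfm_loss(pred, target, mask):
    """Squared error counted only on masked frames (mask is 1 where masked)."""
    sq_err = ((pred - target) ** 2) * mask[:, None]
    # Assumed normalisation: mean over masked frame-bin entries.
    return sq_err.sum() / max(mask.sum() * pred.shape[1], 1)

rng = np.random.default_rng(0)
frames, mel_bins = 200, 80
x1 = rng.standard_normal((frames, mel_bins))  # stand-in for a mel spectrogram
x0 = rng.standard_normal((frames, mel_bins))  # Gaussian noise of the same shape
mask = np.zeros(frames)
mask[50:] = 1.0                               # frames 50.. must be predicted
t = rng.uniform()                             # flow step drawn from [0, 1)
w, target = cfm_training_pair(x1, x0, t)
pred = np.zeros_like(target)                  # placeholder for the model output
loss = masked_cfm_loss(pred, target, mask)
```

No ODE solver appears anywhere in this step, which is what "simulation-free" refers to.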

classifier-free guidance and inference

At inference time Voicebox uses classifier-free guidance in the style introduced for diffusion models, with an unconditional pass that drops both the audio context x_ctx and the text z. The guided vector field is

v_tilde_t = (1 + alpha) * v_t(w, x_ctx, z; theta) - alpha * v_t(w; theta)

The model is trained with an unconditional drop probability p_uncond = 0.2, and the guidance scale alpha is tuned per task, typically around 0.7 for zero-shot TTS.^[7] Inference is then a standard ODE integration. Voicebox's default is the midpoint solver with step size 0.0625, which the paper notes yields 64 function evaluations of v with classifier-free guidance and 32 without, since guidance adds a second unconditional forward pass at each step.^[7] Quality holds up at far fewer evaluations: at NFE=2 the model can synthesise ten seconds of audio in roughly 0.31 seconds on the authors' hardware, which the paper reports is about twenty times faster than the autoregressive VALL-E baseline, while at the higher-quality NFE=64 setting Voicebox is only about 4% slower than VALL-E.^[7]
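The evaluation counts quoted above follow from the solver arithmetic: a step size of 0.0625 gives 16 midpoint steps, each step calls the vector field twice, and guidance doubles the forward passes. The sketch below illustrates this with a toy vector field; `toy_field`, `guided_field`, and `midpoint_sample` are hypothetical names, and the toy field is a stand-in for the trained Transformer, not an approximation of it.

```python
import numpy as np

def guided_field(v, w, t, cond, alpha):
    """Classifier-free guidance; returns the field and the number of forward passes."""
    if alpha == 0.0:
        return v(w, t, cond), 1
    return (1.0 + alpha) * v(w, t, cond) - alpha * v(w, t, None), 2

def midpoint_sample(v, x0, cond, step=0.0625, alpha=0.7):
    """Integrate dx/dt = v from t=0 to t=1 with the explicit midpoint method."""
    n_steps = int(round(1.0 / step))
    w, passes = x0.copy(), 0
    for i in range(n_steps):
        t = i * step
        k1, n1 = guided_field(v, w, t, cond, alpha)
        k2, n2 = guided_field(v, w + 0.5 * step * k1, t + 0.5 * step, cond, alpha)
        w = w + step * k2
        passes += n1 + n2
    return w, passes

def toy_field(w, t, cond):
    """Toy stand-in: constant velocity 1 when conditioned, 0 when unconditional."""
    return np.ones_like(w) if cond is not None else np.zeros_like(w)

x0 = np.zeros(4)
_, passes_guided = midpoint_sample(toy_field, x0, cond="text", alpha=0.7)
_, passes_plain = midpoint_sample(toy_field, x0, cond="text", alpha=0.0)
print(passes_guided, passes_plain)
```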

architecture

The audio backbone is a 24-layer Transformer with 16 attention heads, 1024-dimensional embeddings, and 4096-dimensional feed-forward layers (approximately 330M parameters in the audio model). Positional information is provided by convolutional positional embeddings, and self-attention uses symmetric bi-directional ALiBi biases instead of fixed sinusoidal encodings.^[7] The Transformer is wrapped in U-Net-style skip connections that link the first layer to the last layer, the second to the second-to-last, and so on, giving the network the long-range residual flow that has become common in modern denoiser networks.^[7]
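As a rough sanity check on the parameter figure, the layer dimensions above can be multiplied out. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it counts only the attention and feed-forward weight matrices, assumes each of the 12 skip connections concatenates two 1024-dimensional states and projects them back with one linear layer, and ignores biases, layer norms, embeddings, and input or output projections.

```python
def transformer_params(layers=24, d_model=1024, d_ffn=4096, skip_pairs=12):
    """Approximate weight count for the audio backbone (assumptions in the lead-in)."""
    attn = 4 * d_model * d_model                   # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn                      # two feed-forward matrices
    skips = skip_pairs * (2 * d_model) * d_model   # assumed concat-then-project skips
    return layers * (attn + ffn) + skips

total = transformer_params()
print(f"{total / 1e6:.0f}M")  # prints "327M"
```

The result lands close to the approximately 330M quoted for the audio model, with the omitted terms plausibly accounting for the remainder.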

Audio is represented as an 80-dimensional log-mel-spectrogram sampled at 100 Hz, which is converted back to waveform by a HiFi-GAN vocoder trained separately on the same data.^[7] The duration model is a much smaller Transformer (28M parameters in 8 layers for English, 34M parameters in 10 layers for multilingual), and predicts per-phoneme durations conditioned on the masked duration target and the phoneme sequence.^[7] At inference, the duration model is sampled first to determine the alignment between phonemes and frames, and then the audio model fills in the frame-level mel spectrogram conditioned on the resulting frame-aligned phone transcript.^[7]
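The hand-off from the duration model to the audio model amounts to repeating each phone for its predicted number of frames. A minimal sketch, with a hypothetical phone sequence and hypothetical durations (at 100 Hz each frame covers 10 ms):

```python
def frame_align(phones, durations_frames):
    """Repeat each phone for its predicted number of 10 ms frames."""
    if len(phones) != len(durations_frames):
        raise ValueError("need exactly one duration per phone")
    aligned = []
    for phone, n_frames in zip(phones, durations_frames):
        aligned.extend([phone] * n_frames)
    return aligned

phones = ["HH", "AH", "L", "OW"]   # hypothetical phone sequence
durations = [6, 9, 7, 14]          # hypothetical durations in frames
z = frame_align(phones, durations)
print(len(z), len(z) / 100.0)      # 36 frames, i.e. 0.36 s of audio
```

The length of the aligned transcript fixes the number of mel frames the audio model then has to fill in.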

training data

Voicebox is trained on transcribed audiobook speech in two configurations.^[7]

VB-En (monolingual English). Approximately 60,000 hours of English audiobook recordings.
VB-Multi (multilingual). Approximately 50,000 hours of audiobook recordings spread across English, French, German, Spanish, Polish, and Portuguese. To balance the heavy English skew, the multilingual data is upsampled with a multinomial distribution with temperature exponent beta = 0.25.^[7]
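The effect of the beta = 0.25 exponent can be seen with a small calculation. This sketch assumes the common formulation in which a language's sampling probability is proportional to its share of the data raised to the power beta; the per-language hours below are invented for illustration and are not the paper's figures.

```python
def sampling_probs(hours_by_lang, beta=0.25):
    """Temperature-style upsampling: p_i proportional to (n_i / N) ** beta."""
    total = sum(hours_by_lang.values())
    weights = {lang: (h / total) ** beta for lang, h in hours_by_lang.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

hours = {"en": 40000, "de": 4000, "pl": 400}  # hypothetical split
raw = {lang: h / sum(hours.values()) for lang, h in hours.items()}
probs = sampling_probs(hours)
for lang in hours:
    print(lang, f"raw={raw[lang]:.3f}", f"sampled={probs[lang]:.3f}")
```

A tenfold gap in hours shrinks to a gap of 10 ** 0.25, roughly 1.8 times, in sampling probability, so low-resource languages are seen far more often than their raw share would suggest.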

Crucially, neither configuration applies the heavy filtering or enhancement used in many speech corpora; the paper repeatedly emphasises that Voicebox is "trained on over 50K hours of speech that are not filtered or enhanced," which it identifies as a key reason for the model's strong robustness on out-of-domain inputs.^[1]

training configuration

The audio model is trained for 500,000 updates in the English setting and 750,000 updates in the multilingual setting, and the duration model for 600,000 updates. Optimisation uses Adam at a peak learning rate of 1e-4, with a 5,000-step linear warm-up followed by linear decay, gradient-norm clipping at 0.2, and FP16 mixed precision. The effective batch size is 240,000 frames for the audio model and 60,000 for the duration model. During training the whole sequence is masked with probability 0.3; otherwise a contiguous segment covering r% of the frames is masked, where r is sampled uniformly from [70, 100].^[7]
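The training-time masking rule described above can be sketched as follows. The function name and the choice of a uniformly random start position for the contiguous segment are assumptions made for illustration; the source states only the probabilities and the range of r.

```python
import random

def sample_training_mask(n_frames, p_full=0.3, r_range=(70.0, 100.0), rng=random):
    """Return a list of 0/1 flags, with 1 marking frames the model must predict."""
    if rng.random() < p_full:
        return [1] * n_frames                      # mask the whole sequence
    r = rng.uniform(*r_range)                      # percentage of frames to mask
    span = min(n_frames, max(1, round(n_frames * r / 100.0)))
    start = rng.randint(0, n_frames - span)        # assumed: uniform start position
    return [1 if start <= i < start + span else 0 for i in range(n_frames)]

mask = sample_training_mask(500, rng=random.Random(0))
print(sum(mask), "of", len(mask), "frames masked")
```

Because at least 70% of the frames are always masked, the model sees little clean context during training, which matches the short-prompt regime used for zero-shot TTS at inference.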

What can Voicebox do?

Because Voicebox is a single masked-infill model, the same checkpoint is used across very different generative tasks; only the inference-time mask and the input conditioning change.

zero-shot text-to-speech

In the canonical zero-shot TTS configuration, the model receives an unmasked audio prompt of a target speaker plus the text to synthesise. The mask covers all frames after the prompt, and Voicebox is asked to generate the corresponding mel spectrogram in the prompt speaker's voice.^[1]^[7] Meta stresses how little reference audio this requires: "Using an input audio sample just two seconds in length, Voicebox can match the sample's audio style and use it for text-to-speech generation."^[2] The paper reports that on LibriSpeech test-clean with three-second prompts, Voicebox achieves a 1.9% word error rate and 0.681 speaker similarity (SIM-r), compared with 5.9% WER and 0.580 SIM-r for VALL-E under similar evaluation conditions.^[1]^[7]

voice editing and infill

By masking a middle region of an existing utterance and supplying both the surrounding audio and a corrected transcript, Voicebox can perform localised speech editing: replacing a single word or phrase while preserving the speaker identity, room acoustics, and prosody of the surrounding context.^[1]^[2] The Meta demo gallery and supplementary materials emphasise this capability, including replacing mis-spoken words and removing background noise via infill, both of which are framed as instances of the same masked-prediction task.^[2]

noise removal

Voicebox can be conditioned on a noisy reference together with a mask covering only the noise-corrupted spans, in which case it generates clean speech in the same speaker's voice for those spans while keeping the clean unmasked context intact. The paper presents this as a free side benefit of the infilling formulation rather than a separately fine-tuned mode.^[1]^[7]

multilingual and cross-lingual TTS

In the multilingual configuration the same model handles all six training languages. For cross-lingual style transfer the audio prompt is in one language and the requested transcript is in another; the model is expected to produce the new language's content in a voice resembling the prompt. The paper reports an average WER of 5.2% and SIM-r of 0.481 in this cross-lingual setting, against 10.9% WER and 0.335 SIM-r for the YourTTS baseline.^[7]

The cross-lingual numbers should be read with the upsampling regime in mind. Because the multilingual corpus is heavily skewed toward English, the multinomial upsampling with exponent beta = 0.25 makes the effective sampling distribution much flatter than the raw hour counts. The paper reports that this upsampling is essential for the low-resource languages in the set, including Polish, and that without it the model's per-language similarity scores degrade sharply. The authors do not extend the multilingual evaluation beyond the six training languages, so Voicebox's behaviour on truly out-of-distribution languages such as Mandarin Chinese or Japanese is not characterised in the original paper.^[7]

diverse and ASR-training sampling

Voicebox can sample multiple diverse continuations of the same text or use unconditional generation as a data-augmentation engine. The paper shows that a speech recogniser trained on Voicebox-generated speech rather than real LibriSpeech audio incurs absolute WER increases of only 0.4% and 1.7% on test-other and test-clean respectively, suggesting that the model's samples preserve enough acoustic and linguistic variability to substitute for real audio in speech recognition training.^[7]

Is Voicebox open source?

No. Meta did not release Voicebox weights or code, so the only official variants are the two configurations reported in the paper: the English-only VB-En and the multilingual VB-Multi.^[1]^[7] Third parties have re-implemented the architecture from the paper, including SpeechifyInc's "Meta-voicebox" repository and lucidrains' "voicebox-pytorch" PyTorch reference. Neither is an official release, and reproduction quality depends on the specific dataset and vocoder used by the re-implementer.^[15]^[16]

Why is Voicebox significant?

flow matching at scale for speech

Voicebox is the first widely cited demonstration that conditional flow matching, in particular Lipman et al.'s optimal-transport variant, can scale to large speech models and yield state-of-the-art zero-shot TTS while remaining substantially faster at inference than autoregressive codec language models.^[1]^[14] The Audiobox successor explicitly retains the same flow-matching mel-spectrogram backbone for its speech component.^[4] Later non-autoregressive zero-shot TTS systems including Microsoft's E2 TTS and the open F5-TTS use Voicebox as a primary point of comparison and adopt very similar flow-matching-on-mel-spectrograms architectures, simplifying only the alignment supervision.^[5]^[13] The continuous, non-token approach has thus become one of the two dominant paradigms in modern zero-shot TTS, alongside discrete codec autoregression in the VALL-E line.^[8]

universal generalist behaviour

Beyond the specific flow-matching choice, Voicebox is significant for arguing that a single non-task-specific generative model can serve as a "universal" speech model. The paper opens by contrasting the in-context generalist behaviour of large language and image models with the much narrower task scope of contemporary speech models, and presents Voicebox as the speech-domain analogue.^[1] This framing prefigures Meta's broader 2023-2024 audio strategy, including Audiobox's natural-language-prompted unified audio generation and the SpeechFlow self-supervised audio backbone.^[4]

influence on the responsible-release debate

The decision to publish a NeurIPS paper and audio samples but withhold weights drew significant attention to the question of how high-fidelity voice-cloning systems should be released. Press coverage emphasised that Voicebox could imitate a target voice from approximately two seconds of audio, well below the requirements of earlier systems, and several articles framed Meta's restraint as a notable departure from the company's then-recent strategy of openly releasing large models such as LLaMA and Llama 2.^[10]^[11]^[12] In the blog post Meta indicated that it had built a classifier intended to distinguish authentic from Voicebox-generated speech, and described future work on "evaluations and tools" before any broader release.^[2]

What are Voicebox's benchmark results?

The table below summarises the headline quantitative comparisons reported in the Voicebox paper. The paper's abstract states the core result plainly: Voicebox "outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster," where the first figure in each pair is VALL-E's and the second is Voicebox's.^[1]

Task	Setting	Metric	Voicebox	Baseline	Baseline name
English zero-shot TTS	LibriSpeech test-clean	WER (lower is better)	1.9%	5.9%	VALL-E^[1]^[7]
English zero-shot TTS	LibriSpeech test-clean	SIM-r (higher is better)	0.681	0.580	VALL-E^[1]^[7]
Inference speed	10 s audio, NFE=2	wall-clock	approximately 0.31 s	approximately 20x slower	VALL-E^[7]
Cross-lingual TTS	Average across six languages	WER	5.2%	10.9%	YourTTS^[7]
Cross-lingual TTS	Average across six languages	SIM-r	0.481	0.335	YourTTS^[7]
ASR training data	LibriSpeech test-other	absolute WER increase vs real	0.4%	n/a	n/a^[7]
ASR training data	LibriSpeech test-clean	absolute WER increase vs real	1.7%	n/a	n/a^[7]

These numbers are drawn from the published version of the paper and reflect the authors' own evaluation; later third-party benchmarks under different prompt durations or test sets do not always reproduce them exactly.^[5]^[13]
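For readers comparing the rows, the relative differences implied by the table can be computed directly. The sketch below only rearranges figures already quoted in this article; the implied VALL-E wall-clock time is derived from the stated "approximately 20x" ratio and is not a separately reported measurement.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline

en_wer_cut = relative_reduction(baseline=5.9, value=1.9)    # VALL-E vs Voicebox
xl_wer_cut = relative_reduction(baseline=10.9, value=5.2)   # YourTTS vs Voicebox
en_sim_gain = 0.681 - 0.580                                 # absolute SIM-r gain
implied_baseline_seconds = 0.31 * 20                        # from the quoted ratio

print(f"{en_wer_cut:.1%}", f"{xl_wer_cut:.1%}",
      f"{en_sim_gain:.3f}", f"{implied_baseline_seconds:.1f}")
```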

What are the limitations of Voicebox?

Voicebox inherits several limitations that the authors and later work have identified.

Reliance on frame-aligned phone transcripts. Voicebox requires a frame-level phoneme alignment as conditioning, which in turn requires a separate forced-alignment model at training time. Later systems such as E2 TTS were motivated in part by removing this requirement, arguing that an end-to-end flow-matching Transformer plus a vocoder-style decoder can achieve comparable quality without phoneme alignment.^[5]

Audiobook-only training distribution. Both VB-En and VB-Multi are trained on transcribed audiobook recordings. While the authors do not filter or enhance the data, the recording conditions remain those of audiobook narration: largely single-speaker, clean, expressive but contained prosody. Generalisation to telephony, conversational speech, and noisy real-world environments is not the focus of the original evaluation.^[7]

Mel spectrogram bottleneck. Voicebox operates in mel-spectrogram space rather than directly on waveform or on a learned neural audio codec. Audio quality is therefore upper-bounded by the quality of the separately trained HiFi-GAN vocoder, and certain fine acoustic details (such as breath or transient noise) are filtered through the vocoder rather than modelled by the flow.^[7]

Closed weights. Because Meta did not release weights, independent reproduction is constrained by the availability of audiobook corpora and the engineering choices in re-implementations such as lucidrains' open-source PyTorch port; reproduction parity with Meta's reported numbers is not guaranteed.^[16]

Misuse risk. Meta's own announcement and the surrounding coverage acknowledge that voice cloning from approximately two seconds of audio creates a clear deepfake risk; this is the explicit reason for the closed release.^[2]^[10]^[11]

How was Voicebox received?

Public reception focused primarily on two threads. Technically, the speech-AI community received Voicebox as a strong step beyond VALL-E both in quality and in inference speed, and as a convincing demonstration that conditional flow matching was a competitive alternative to discrete-token autoregression for speech generation.^[9]^[17] The NeurIPS 2023 acceptance and the rapid uptake of similar flow-matching designs in 2024 corroborated this technical reception.^[5]^[6]^[13]

On the policy side, Meta drew a mixed response for its non-release. Some commentators welcomed the restraint as a model for responsible release, particularly given the existing prevalence of voice-cloning scams in 2023, while others pointed out the tension between Meta's stated openness policy and the closed Voicebox decision, particularly while Meta was simultaneously open-sourcing Llama 2.^[10]^[11]^[12] Meta's promise of evaluations and tooling before any broader release had not, as of the publication of the Audiobox successor in late 2023, translated into open weights for Voicebox itself.^[4]

Related work

Voicebox sits at the intersection of several active research lines:

Flow matching and continuous normalising flows. Voicebox's training objective is a direct application of Lipman et al.'s 2023 conditional flow matching formulation with the optimal-transport probability path.^[14] The same family of techniques underlies rectified flow and many subsequent image and video generation systems.
Zero-shot text-to-speech. The most direct comparison point is VALL-E, which uses a neural codec language model rather than mel-spectrogram flow matching.^[8] Voicebox's flow-matching design has been carried forward in F5-TTS and in Microsoft's E2 TTS.^[5]^[13]
Universal speech and audio generation. Audiobox, also from Meta AI, extends Voicebox's flow-matching backbone from speech to broader audio with natural-language prompts.^[4]
Self-supervised speech representation. Several Voicebox authors, including Wei-Ning Hsu, previously worked on self-supervised representation learning for speech (HuBERT, wav2vec family); Voicebox draws on this lineage although it is itself a supervised text-to-speech model rather than a representation learner.^[7]
Responsible release of generative speech. Voicebox's closed-weights decision and the accompanying detection classifier fit into a broader 2023 debate about voice-cloning deepfake risk that also encompassed commercial systems such as ElevenLabs.^[2]^[10]

References

1. Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", arXiv preprint 2306.15687, 2023-06-23. https://arxiv.org/abs/2306.15687. Accessed 2026-05-20.
2. Meta AI, "Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance", AI at Meta blog, 2023-06-16. https://ai.meta.com/blog/voicebox-generative-ai-model-speech/. Accessed 2026-05-20.
3. Meta AI Research, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", AI at Meta publications page, 2023-06-16. https://ai.meta.com/research/publications/voicebox-text-guided-multilingual-universal-speech-generation-at-scale/. Accessed 2026-05-20.
4. Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu, "Audiobox: Unified Audio Generation with Natural Language Prompts", arXiv preprint 2312.15821, 2023-12-25. https://arxiv.org/abs/2312.15821. Accessed 2026-05-20.
5. Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda, "E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", arXiv preprint 2406.18009, 2024-06-26. https://arxiv.org/abs/2406.18009. Accessed 2026-05-20.
6. OpenReview, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", NeurIPS 2023 main conference page, 2023-09-22. https://openreview.net/forum?id=gzCS252hCO. Accessed 2026-05-20.
7. Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", arXiv HTML preprint 2306.15687v1, 2023-06-23. https://arxiv.org/html/2306.15687v1. Accessed 2026-05-20.
8. Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei, "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers", arXiv preprint 2301.02111, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-05-20.
9. Andrey Lukyanenko, "Paper Review: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", andlukyane.com, 2023-07-06. https://andlukyane.com/blog/paper-review-voicebox. Accessed 2026-05-20.
10. Ben Wodecki, "Meta Shows Off Revolutionary Audio AI But Won't Release It", AI Business, 2023-06-19. https://aibusiness.com/companies/meta-unveils-revolutionary-audio-ai-but-won-t-release-it. Accessed 2026-05-20.
11. Entrepreneur Staff, "Meta Decides Not to Release AI That Can Mimic the Voices of Everyone You Know", Entrepreneur, 2023-06-21. https://www.entrepreneur.com/business-news/meta-decides-not-to-release-ai-that-mimics-peoples-voices/454441. Accessed 2026-05-20.
12. Martin Brinkmann, "What is Meta Voicebox?", gHacks Tech News, 2023-06-20. https://www.ghacks.net/2023/06/20/what-is-meta-voicebox-deepfake/. Accessed 2026-05-20.
13. Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen, "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching", arXiv preprint 2410.06885, 2024-10-09. https://arxiv.org/abs/2410.06885. Accessed 2026-05-20.
14. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le, "Flow Matching for Generative Modeling", arXiv preprint 2210.02747, 2022-10-06. https://arxiv.org/abs/2210.02747. Accessed 2026-05-20.
15. Speechify Inc., "SpeechifyInc/Meta-voicebox", GitHub repository, 2023-07-15. https://github.com/SpeechifyInc/Meta-voicebox. Accessed 2026-05-20.
16. Phil Wang (lucidrains), "voicebox-pytorch", GitHub repository, 2023-06-26. https://github.com/lucidrains/voicebox-pytorch/blob/main/README.md. Accessed 2026-05-20.
17. NeurIPS, "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale", NeurIPS 2023 proceedings PDF, 2023-12-10. https://proceedings.neurips.cc/paper_files/paper/2023/file/2d8911db9ecedf866015091b28946e15-Paper-Conference.pdf. Accessed 2026-05-20.
18. Taylor Soper, "Nuance buys Voicebox Technologies, scooping up speech-recognition and natural-language pioneer", GeekWire, 2018-05-18. https://www.geekwire.com/2018/nuance-communications-buys-voicebox-technologies-scooping-another-seattle-area-company/. Accessed 2026-07-12.
