Audiobox

Updated Jul 16, 2026

Audiobox is a foundation research model for audio generation developed by Meta AI and its Fundamental AI Research (FAIR) group. Announced in late 2023 and presented as the successor to Voicebox, it generates speech, sound effects, and soundscapes from natural language text descriptions, from audio or voice example prompts, or from a combination of the two. Its defining feature is dual prompting: a user can provide a recorded voice sample together with a written description of a style or environment, and the model synthesizes that voice under the described conditions. Meta released an interactive demo with safety measures including automatic audio watermarking, and distributed research artifacts under a research-only license.^[1]^[2]^[3]

Audiobox builds on the generative AI line of work at Meta that also includes the AudioCraft and MusicGen projects, but it targets a broader unification of audio modalities than those music-focused systems. The accompanying paper, "Audiobox: Unified Audio Generation with Natural Language Prompts," was submitted to arXiv on 25 December 2023 with 24 authors led by Apoorv Vyas and Wei-Ning Hsu.^[3]

Background and lineage

Voicebox, introduced by Meta in mid-2023, was a text-to-speech model trained with a flow-matching objective on a text-guided infilling task, which let it perform zero-shot speech synthesis, editing, and noise removal. Meta declined to release the Voicebox model publicly, citing the risk of misuse for impersonation and deepfakes.^[1]^[2]

Audiobox inherits Voicebox's flow-matching generative modeling and its guided audio-infilling training objective, then extends them beyond speech to general audio. Where Voicebox was constrained to speech and required structured inputs such as phonetic transcripts and reference audio, Audiobox adds free-form natural language control and the ability to generate non-speech sound. Meta describes Audiobox as advancing the same research program "by unifying generation and editing capabilities for speech, sound effects, and soundscapes."^[1]^[3]

How it works

Audiobox is a flow-matching generative model, part of the same family of continuous normalizing flow methods as Voicebox. Rather than predicting raw waveforms or spectrograms directly, it is trained to predict latent audio features produced by an autoencoder (dense Encodec features taken before quantization).^[3]
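As a rough illustration of the flow-matching objective mentioned above, the sketch below builds a single training example using the optimal-transport conditional path from the general flow-matching literature. It is not Meta's code: the function names, the `sigma_min` value, and the use of plain Python lists in place of Encodec latent tensors are assumptions made for illustration.

```python
import random

def flow_matching_pair(x1, t, sigma_min=1e-5, rng=random):
    """Build one flow-matching training example for a clean latent x1.

    x1: list of floats standing in for a latent audio feature (assumption).
    t:  flow time in [0, 1]; t=0 is pure noise, t=1 is (almost) the data.
    Returns (x_t, target_velocity, x0), where x0 is the sampled noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # Gaussian noise sample
    scale = 1.0 - (1.0 - sigma_min) * t                # weight on the noise
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]  # point on the path
    # Velocity of the path at time t; this is what the model is trained to predict.
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target, x0

def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)
```

In a real system the `predicted` velocity would come from a neural network that sees `x_t`, `t`, and any conditioning such as text or surrounding audio; here it is left as an input.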

The system is built in stages:

Component	Training data	Role
Audiobox SSL	~185K hours of unlabeled audio (~160K hours speech, 20K hours music, 6K hours sound)	Self-supervised pre-training on a masked-infilling objective adapted from SpeechFlow
Audiobox Speech	~100K hours of transcribed speech	Transcript-guided and voice-prompted speech generation
Audiobox Sound	~6K hours of sound (~150 hours captioned, the rest tagged)	Text-to-sound generation
Audiobox (unified)	Speech captions (500 hours human-annotated, the remainder generated by a language model) plus sound data	Single model combining speech and sound under natural language control
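The masked-infilling idea behind the pre-training stage can be sketched as follows: a contiguous span of frames is hidden, the model sees the remaining frames as context, and the loss is measured on the hidden span. This is a simplified sketch, not the paper's implementation. The 50 percent mask fraction, the zero-filling of masked frames, the restriction of the loss to masked frames, and the helper names are all assumptions.

```python
import random

def make_infilling_example(frames, mask_fraction=0.5, rng=random):
    """Split a frame sequence into visible context and a masked target span.

    frames: list of per-frame feature values (stand-ins for latent features).
    Returns (context, mask): masked positions in `context` are zeroed, and
    mask[i] is True for every frame the model must regenerate.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("frames must not be empty")
    span = min(n, max(1, int(n * mask_fraction)))   # length of the hidden span
    start = rng.randrange(0, n - span + 1)          # random span position
    mask = [start <= i < start + span for i in range(n)]
    context = [0.0 if m else f for f, m in zip(frames, mask)]
    return context, mask

def masked_loss(predicted, target, mask):
    """Mean squared error computed only over the masked frames."""
    errors = [(p - y) ** 2 for p, y, m in zip(predicted, target, mask) if m]
    return sum(errors) / len(errors)
```

The same masking mechanism is what makes editing possible at inference time: the region to regenerate plays the role of the masked span, and the untouched audio is the context.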

To turn loosely structured tags and transcripts into rich natural-language captions, the team used a large language model to synthesize descriptions, then trained the unified model to condition on those descriptions. The paper also introduces Joint-CLAP, a custom evaluation model trained on paired speech, sound, and text descriptions, because off-the-shelf CLAP models cannot distinguish fine-grained speaking styles such as accent or emotion.^[3]

A separate contribution, Bespoke Solvers, speeds up the model's ordinary differential equation sampling by more than 25 times compared with the default flow-matching solver, without loss of performance on several tasks. Meta's blog summarizes this as generating audio "more than 25 times faster than" Voicebox.^[1]^[3]
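To see why the choice of solver matters, note that sampling from a flow-matching model means integrating an ordinary differential equation from t=0 to t=1, and each step costs at least one evaluation of the network. The sketch below uses generic fixed-step Euler and midpoint solvers on a toy vector field. It does not implement Bespoke Solvers, which learn solver parameters for a specific model; the function names and the toy field are assumptions.

```python
import math

def integrate(velocity, x0, steps, method="euler"):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed steps.

    Returns (x_final, n_evals), where n_evals counts calls to `velocity`,
    the quantity that dominates sampling cost for a neural vector field.
    """
    x, t, h, n_evals = x0, 0.0, 1.0 / steps, 0
    for _ in range(steps):
        if method == "euler":
            x = x + h * velocity(x, t)
            n_evals += 1
        elif method == "midpoint":
            x_mid = x + 0.5 * h * velocity(x, t)        # half step
            x = x + h * velocity(x_mid, t + 0.5 * h)    # full step from midpoint slope
            n_evals += 2
        else:
            raise ValueError(f"unknown method: {method}")
        t += h
    return x, n_evals

def toy_velocity(x, t):
    """Toy field with a known solution: dx/dt = -x, so x(1) = x(0) * exp(-1)."""
    return -x

if __name__ == "__main__":
    exact = math.exp(-1.0)
    for method, steps in (("euler", 8), ("midpoint", 4)):
        x, n = integrate(toy_velocity, 1.0, steps, method)
        print(f"{method}: {n} evaluations, error {abs(x - exact):.4f}")
```

Both runs spend the same number of function evaluations, yet the midpoint rule lands closer to the exact answer on this toy problem. Getting more accuracy out of a fixed evaluation budget is the general goal that faster solvers pursue.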

Capabilities

Audiobox unifies several generation and editing tasks that earlier systems handled with separate models.^[1]^[3]

Capability	Description	Example prompt
Description-based speech	Generate a voice from a written description plus a transcript	"A young woman speaks with a high pitch and fast pace"
Voice-prompted speech	Reproduce a voice supplied as an audio sample (zero-shot TTS)	A recorded voice clip plus text to narrate
Voice restyling (dual prompt)	Combine a voice sample with a text description of environment or emotion	A voice clip rendered "in a cathedral" or "speaks sadly and slowly"
Text-to-sound	Generate sound effects and ambient audio from a description	"A running river and birds chirping"
Infilling / restyling	Crop part of an audio clip and regenerate it from a description	Replace background audio while keeping the speech

Dual-prompt voice restyling, which conditions on a voice sample and a free-form text style description at the same time, is the feature Meta highlights as new in Audiobox. Independent control over transcript, vocal identity, and acoustic style lets a user, for example, take one person's voice and place it in an arbitrary described setting.^[1]^[2]
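The way the capabilities in the table differ by which inputs are supplied can be made concrete with a small sketch. Everything here is hypothetical: `GenerationRequest` and `classify` are illustrative names, not part of any published Audiobox interface, and infilling is omitted because it also needs a source clip and an edit region.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical request object; not a real Audiobox API."""
    transcript: Optional[str] = None      # words to be spoken
    description: Optional[str] = None     # free-form style or sound description
    voice_sample: Optional[bytes] = None  # recorded audio used as a voice prompt

def classify(request: GenerationRequest) -> str:
    """Map the supplied inputs to the capability names used in the table."""
    has_text = bool(request.transcript)
    has_desc = bool(request.description)
    has_voice = request.voice_sample is not None
    if has_text:
        if has_voice and has_desc:
            return "voice restyling (dual prompt)"
        if has_voice:
            return "voice-prompted speech"
        if has_desc:
            return "description-based speech"
        raise ValueError("speech needs a voice sample, a description, or both")
    if has_desc and not has_voice:
        return "text-to-sound"
    raise ValueError("unsupported combination of inputs")
```

For example, `classify(GenerationRequest(transcript="Hello", description="in a cathedral", voice_sample=b"..."))` falls into the dual-prompt case, while a request carrying only a description is treated as text-to-sound.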

Reported performance

On zero-shot text-to-speech evaluated on LibriSpeech, the paper reports a style-similarity score of 0.745 for Audiobox Speech against 0.696 for Voicebox. Its word error rate is about 3.2 percent, slightly worse than Voicebox's roughly 2.6 percent (lower is better). On other speech domains the absolute similarity gain over Voicebox ranges from about 0.096 to 0.156. Meta's blog frames the overall improvement as outperforming Voicebox on style similarity "by over 30 percent" across a range of speech styles.^[1]^[3]
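For readers unfamiliar with the two speech metrics, the sketch below shows how each is commonly computed. Word error rate is the word-level edit distance between a reference text and a transcription of the generated audio, divided by the reference length. The similarity function assumes the score is a cosine similarity between embedding vectors, which this article does not spell out. The speech recognizer and the embedding model are out of scope here, and the lowercase normalization is an assumption.

```python
import math

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]                             # distance against an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,       # deletion
                           cur[j - 1] + 1,    # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

Because the error count is divided by the reference length, a hypothesis with many inserted words can push the word error rate above 1.0.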

For text-to-sound on AudioCaps, the paper reports a Fréchet Audio Distance (FAD) of 0.77 for Audiobox Sound, well below baselines such as AudioLDM2-Full (1.89) and TANGO (1.57), along with a CLAP similarity of 0.71 and a subjective overall-quality rating of about 3.43 out of 5. The blog states that Audiobox "significantly surpasses prior best models (AudioLDM2, VoiceLDM, and TANGO) on quality and relevance."^[1]^[3]

Task / metric	Audiobox	Best prior baseline
LibriSpeech zero-shot TTS, style similarity (higher is better)	0.745	Voicebox 0.696
LibriSpeech zero-shot TTS, word error rate (lower is better)	~3.2%	Voicebox ~2.6%
AudioCaps text-to-sound, FAD (lower is better)	0.77	TANGO 1.57
AudioCaps text-to-sound, CLAP similarity (higher is better)	0.71	0.43 to 0.56
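The size of the differences in the table can be worked out directly from the reported figures. The short script below does only that arithmetic; the dictionary keys and helper names are illustrative, and the numbers are copied from the table rather than re-measured.

```python
# Figures as reported in the table above: (Audiobox, listed baseline).
REPORTED = {
    "librispeech_similarity": (0.745, 0.696),   # higher is better
    "librispeech_wer_percent": (3.2, 2.6),      # lower is better
    "audiocaps_fad": (0.77, 1.57),              # lower is better
}

def absolute_change(audiobox: float, baseline: float) -> float:
    """Difference on the metric's own scale."""
    return audiobox - baseline

def relative_change(audiobox: float, baseline: float) -> float:
    """Difference expressed as a fraction of the baseline value."""
    return (audiobox - baseline) / baseline

if __name__ == "__main__":
    for name, (audiobox, baseline) in REPORTED.items():
        print(f"{name}: {absolute_change(audiobox, baseline):+.3f} absolute, "
              f"{relative_change(audiobox, baseline):+.1%} relative")
```

These are single-benchmark comparisons. The blog's "over 30 percent" figure refers to a range of speech styles, so it is not expected to match the LibriSpeech number alone.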

Safety measures and watermarking

Both the Audiobox model and its public demo apply automatic audio watermarking so that audio created with Audiobox can be traced to its origin. The watermark embeds a signal that is imperceptible to the human ear but detectable down to the frame level, allowing AI-generated segments within a longer clip to be identified.^[1]^[2]

This localized watermarking is the approach later published by Meta as AudioSeal, described in "Proactive Detection of Voice Cloning with Localized Watermarking" (accepted at ICML 2024). AudioSeal uses a jointly trained generator and detector with a localization loss, predicting at each time step whether a watermark is present, which makes detection fast enough for real-time use. Meta has stated that earlier versions of this watermarking were used in its public demos including Audiobox and Seamless, serving on the order of 100,000 users daily.^[4]^[5]
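Frame-level detection implies a simple post-processing step: given one detector score per time step, consecutive high-scoring frames can be merged into time ranges that mark the AI-generated portions of a clip. The sketch below shows only that step. It does not call the AudioSeal library, and the function name, the 0.5 threshold, and the frame duration parameter are assumptions.

```python
def watermarked_segments(frame_probs, frame_seconds, threshold=0.5):
    """Merge consecutive above-threshold frames into (start_s, end_s) ranges.

    frame_probs:   per-frame watermark scores in [0, 1] from some detector.
    frame_seconds: duration covered by one frame, in seconds.
    """
    segments, start = [], None
    for i, p in enumerate(frame_probs):
        if p > threshold and start is None:
            start = i                                   # a watermarked run begins
        elif p <= threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None                                # the run has ended
    if start is not None:                               # run reaches the end of the clip
        segments.append((start * frame_seconds, len(frame_probs) * frame_seconds))
    return segments
```

With scores `[0.1, 0.9, 0.8, 0.2, 0.7]` and half-second frames, the function returns `[(0.5, 1.5), (2.0, 2.5)]`, flagging two separate generated regions inside one clip.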

The interactive demo also included a voice-authentication step: to use a person's voice, a user had to speak a verification prompt in their own voice, with the prompts changing at rapid intervals to discourage uploading someone else's recording. Meta additionally reported testing for fairness across demographic groups, drawing on data from speakers in more than 150 countries speaking over 200 primary languages.^[1]^[3]

Release and access

Meta announced Audiobox as a foundation research model and opened an interactive demo on the project site (audiobox.metademolab.com) on 11 December 2023. The underlying model was made available under a research-only license to a limited, hand-selected set of researchers and institutions rather than as an open or commercial release; the move was consistent with Meta's earlier decision to withhold Voicebox over misuse concerns.^[1]^[2]^[3]

Alongside the demo, Meta announced the Audiobox Responsible Generation Grant, offering research teams funding and access to study safety, fairness, and ethics in generative audio. The FAIR group accepted applications for up to 10 grants of up to $50,000 each.^[6]

As of early 2026, the public Audiobox demo is no longer available.^[1]

References

1. Meta AI, "Audiobox: Generating audio from voice and natural language prompts," AI at Meta Blog. https://ai.meta.com/blog/audiobox-generating-audio-voice-natural-language-prompts/
2. M. Wright, "A Sound Decision: Meta Rolls Out AI-Powered Audiobox," Decrypt, 11 December 2023. https://decrypt.co/209347/meta-rolls-out-audiobox-for-ai-powered-sound-generation
3. A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, et al., "Audiobox: Unified Audio Generation with Natural Language Prompts," arXiv:2312.15821, 25 December 2023. https://arxiv.org/abs/2312.15821
4. R. San Roman, P. Fernandez, A. Defossez, T. Furon, T. Tran, H. Elsahar, "Proactive Detection of Voice Cloning with Localized Watermarking," arXiv:2401.17264 (ICML 2024). https://arxiv.org/abs/2401.17264
5. Meta AI, "Proactive Detection of Voice Cloning with Localized Watermarking," AI at Meta Research. https://ai.meta.com/research/publications/proactive-detection-of-voice-cloning-with-localized-watermarking/
6. Meta AI, "Audiobox Responsible Generation Grant," AI at Meta Research. https://ai.meta.com/research/audiobox-responsible-generation-grant/
