Audiobox
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,418 words
Audiobox is a foundation research model for audio generation developed by Meta AI and its Fundamental AI Research (FAIR) group. Announced in late 2023 and presented as the successor to Voicebox, it generates speech, sound effects, and soundscapes from natural language text descriptions, from audio or voice example prompts, or from a combination of the two. Its defining feature is dual prompting: a user can provide a recorded voice sample together with a written description of a style or environment, and the model synthesizes that voice under the described conditions. Meta released an interactive demo with safety measures including automatic audio watermarking, and distributed research artifacts under a research-only license.[1][2][3]
Audiobox builds on the generative AI line of work at Meta that also includes the AudioCraft and MusicGen projects, but it targets a broader unification of audio modalities than those music-focused systems. The accompanying paper, "Audiobox: Unified Audio Generation with Natural Language Prompts," was submitted to arXiv on 25 December 2023 with 24 authors led by Apoorv Vyas and Wei-Ning Hsu.[3]
Voicebox, released by Meta in mid-2023, was a text-to-speech model trained with a flow-matching objective on a text-guided infilling task, which let it perform zero-shot speech synthesis, editing, and noise removal. Meta declined to publicly release the Voicebox model itself, citing the risk of misuse for impersonation and deepfakes.[1][2]
Audiobox inherits Voicebox's flow-matching generative modeling and its guided audio-infilling training objective, then extends them beyond speech to general audio. Where Voicebox was constrained to speech and required structured inputs such as phonetic transcripts and reference audio, Audiobox adds free-form natural language control and the ability to generate non-speech sound. Meta describes Audiobox as advancing the same research program "by unifying generation and editing capabilities for speech, sound effects, and soundscapes."[1][3]
Audiobox is a flow-matching generative model, part of the same family of continuous normalizing flow methods as Voicebox. Rather than predicting raw waveforms or spectrograms directly, it is trained to predict latent audio features produced by an autoencoder (dense Encodec features taken before quantization).[3]
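The flow-matching objective can be illustrated with a short sketch. It assumes the optimal-transport conditional path commonly used in the flow-matching literature; the `SIGMA_MIN` value, the array shapes, and the `predict_velocity` callable are illustrative placeholders, not details taken from the Audiobox paper.

```python
import numpy as np

SIGMA_MIN = 1e-5  # assumed small constant; illustrative only


def flow_matching_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the interpolated point x_t and its regression target.

    x0 is Gaussian noise, x1 is a clean latent feature sequence, and t is
    a time in [0, 1]. The path moves in a straight line from x0 to x1.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    return x_t, target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocities."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1, 1))
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
# Stand-in for dense autoencoder features: (batch, frames, feature dim).
latents = rng.standard_normal((4, 50, 128))
# Hypothetical untrained model that always predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = flow_matching_loss(zero_model, latents, rng)
print(f"loss of the zero model: {loss:.3f}")
```

A trained network would replace `zero_model` and would also receive conditioning inputs such as masked audio context, transcripts, or text descriptions, which this sketch leaves out.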
The system is built in stages:
| Component | Training data | Role |
|---|---|---|
| Audiobox SSL | ~185K hours of unlabeled audio (~160K hours speech, 20K hours music, 6K hours sound) | Self-supervised pre-training on a masked-infilling objective adapted from SpeechFlow |
| Audiobox Speech | ~100K hours of transcribed speech | Transcript-guided and voice-prompted speech generation |
| Audiobox Sound | ~6K hours of sound (~150 hours captioned, the rest tagged) | Text-to-sound generation |
| Audiobox (unified) | Speech captions (500 hours human-annotated, the remainder generated by a language model) plus sound data | Single model combining speech and sound under natural language control |
To turn loosely structured tags and transcripts into rich natural-language captions, the team used a large language model to synthesize descriptions, then trained the unified model to condition on those descriptions. The paper also introduces Joint-CLAP, a custom evaluation model trained on paired speech, sound, and text descriptions, because off-the-shelf CLAP models cannot distinguish fine-grained speaking styles such as accent or emotion.[3]
A separate contribution, Bespoke Solvers, speeds up the model's ordinary differential equation sampling by more than 25 times compared with the default flow-matching solver, without loss of performance on several tasks. Meta's blog summarizes this as generating audio "more than 25 times faster than" Voicebox.[1][3]
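To see why the choice of solver matters, the following sketch compares two generic fixed-step ODE solvers on a toy vector field and counts how many times each one evaluates the field. It does not implement Bespoke Solvers; `toy_field` and the step counts are assumptions chosen so that the exact answer is known. In a flow-matching model, each field evaluation would be one forward pass of the network.

```python
import math


def make_counted(field):
    """Wrap a vector field so the number of evaluations can be read back."""
    calls = {"n": 0}

    def wrapped(x, t):
        calls["n"] += 1
        return field(x, t)

    return wrapped, calls


def euler(field, x, steps):
    """First-order solver: one field evaluation per step."""
    h = 1.0 / steps
    for k in range(steps):
        x = x + h * field(x, k * h)
    return x


def midpoint(field, x, steps):
    """Second-order solver: two field evaluations per step."""
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        x_half = x + 0.5 * h * field(x, t)
        x = x + h * field(x_half, t + 0.5 * h)
    return x


def toy_field(x, t):
    # Toy dynamics dx/dt = -2x, whose exact solution is x(1) = x(0) * exp(-2).
    return -2.0 * x


exact = math.exp(-2.0)
results = {}
for name, solver, steps in [
    ("euler-8", euler, 8),
    ("euler-64", euler, 64),
    ("midpoint-8", midpoint, 8),
]:
    field, calls = make_counted(toy_field)
    estimate = solver(field, 1.0, steps)
    results[name] = (calls["n"], abs(estimate - exact))

for name, (evaluations, error) in results.items():
    print(f"{name}: {evaluations} evaluations, error {error:.4f}")
```

On this toy problem the midpoint solver reaches a smaller error with 16 evaluations than Euler does with 64, which is the kind of cost-versus-accuracy trade-off that a solver tuned to a specific model aims to improve.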
Audiobox unifies several generation and editing tasks that earlier systems handled with separate models.[1][3]
| Capability | Description | Example prompt |
|---|---|---|
| Description-based speech | Generate a voice from a written description plus a transcript | "A young woman speaks with a high pitch and fast pace" |
| Voice-prompted speech | Reproduce a voice supplied as an audio sample (zero-shot TTS) | A recorded voice clip plus text to narrate |
| Voice restyling (dual prompt) | Combine a voice sample with a text description of environment or emotion | A voice clip rendered "in a cathedral" or "speaks sadly and slowly" |
| Text-to-sound | Generate sound effects and ambient audio from a description | "A running river and birds chirping" |
| Infilling / restyling | Crop part of an audio clip and regenerate it from a description | Replace background audio while keeping the speech |
Dual-prompt voice restyling, which conditions on a voice sample and a free-form text style description at the same time, is the feature Meta highlights as new in Audiobox. Independent control over transcript, vocal identity, and acoustic style lets a user, for example, take one person's voice and place it in an arbitrary described setting.[1][2]
On zero-shot text-to-speech evaluated on LibriSpeech, the paper reports a style-similarity score of 0.745 for Audiobox Speech against 0.696 for Voicebox, with a word error rate of about 3.2 percent, slightly worse than Voicebox's roughly 2.6 percent (lower is better). On other speech domains the similarity gain over Voicebox ranges from about 0.096 to 0.156. Meta's blog frames the overall improvement as outperforming Voicebox on style similarity "by over 30 percent" across a range of speech styles.[1][3]
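The LibriSpeech figures can be turned into absolute and relative differences with a few lines of arithmetic. This is only a worked calculation on the numbers quoted here; the `relative_change` helper is an illustrative name. The article does not give the per-domain baseline scores, so the blog's "over 30 percent" figure cannot be reproduced from these values.

```python
def relative_change(new, old):
    """Change of `new` relative to `old`, as a fraction of `old`."""
    return (new - old) / old


# Style similarity on LibriSpeech (higher is better).
sim_abs = 0.745 - 0.696
sim_rel = relative_change(0.745, 0.696)

# Word error rate on LibriSpeech in percent (lower is better).
wer_rel = relative_change(3.2, 2.6)

print(f"similarity: +{sim_abs:.3f} absolute, +{sim_rel * 100:.1f}% relative")
print(f"word error rate: +{wer_rel * 100:.1f}% relative")
```

On LibriSpeech alone the similarity gain works out to roughly 7 percent in relative terms, so the larger headline figure presumably reflects the other speech styles, where the quoted gains are bigger.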
For text-to-sound on AudioCaps, Audiobox Sound reports a Fréchet Audio Distance of 0.77, well below baselines such as AudioLDM2-Full (1.89) and TANGO (1.57), with a CLAP similarity of 0.71 and a subjective overall-quality rating of about 3.43 out of 5. The blog states that Audiobox "significantly surpasses prior best models (AudioLDM2, VoiceLDM, and TANGO) on quality and relevance."[1][3]
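Fréchet Audio Distance compares the statistics of embeddings from reference audio with those from generated audio, treating each set as a Gaussian. The sketch below computes that distance with NumPy; the random vectors stand in for embeddings from a pretrained audio model, and the embedding model used in the Audiobox evaluation is not specified here.

```python
import numpy as np


def frechet_distance(ref, gen):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    ref and gen are arrays of shape (num_clips, embedding_dim).
    """
    mu_r, mu_g = ref.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(ref, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    # The trace of the matrix square root of cov_r @ cov_g equals the sum of
    # the square roots of that product's eigenvalues, which are real and
    # non-negative for covariance matrices (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
reference = rng.standard_normal((500, 8))  # stand-in reference embeddings
shifted = reference + 1.0                  # same spread, different mean

same_score = frechet_distance(reference, reference)
shifted_score = frechet_distance(reference, shifted)
print(f"identical sets: {same_score:.6f}")
print(f"mean shifted by 1 in 8 dims: {shifted_score:.6f}")
```

Identical sets score approximately zero, and shifting every dimension by 1 adds the squared length of the shift, which is why lower values indicate generated audio whose embedding statistics are closer to the reference set.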
| Task / metric | Audiobox | Best prior baseline |
|---|---|---|
| LibriSpeech zero-shot TTS, style similarity | 0.745 | Voicebox 0.696 |
| LibriSpeech zero-shot TTS, word error rate (lower is better) | ~3.2% | Voicebox ~2.6% |
| AudioCaps text-to-sound, FAD (lower is better) | 0.77 | TANGO 1.57 |
| AudioCaps text-to-sound, CLAP similarity | 0.71 | Prior baselines 0.43 to 0.56 |
Both the Audiobox model and its public demo apply automatic audio watermarking so that audio created with Audiobox can be traced to its origin. The watermark embeds a signal that is imperceptible to the human ear but detectable down to the frame level, allowing AI-generated segments within a longer clip to be identified.[1][2]
This localized watermarking is the approach later published by Meta as AudioSeal, described in "Proactive Detection of Voice Cloning with Localized Watermarking" (accepted at ICML 2024). AudioSeal uses a jointly trained generator and detector with a localization loss, predicting at each time step whether a watermark is present, which makes detection fast enough for real-time use. Meta has stated that earlier versions of this watermarking were used in its public demos including Audiobox and Seamless, serving on the order of 100,000 users daily.[4][5]
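Localized detection can be pictured as post-processing a sequence of per-time-step watermark probabilities into time ranges. The sketch below is not the AudioSeal API: the `watermarked_segments` function, the 0.5 threshold, and the half-second step length are all hypothetical choices made for illustration.

```python
def watermarked_segments(frame_probs, frame_seconds, threshold=0.5):
    """Group consecutive above-threshold time steps into (start, end) spans.

    frame_probs holds one watermark probability per time step, and
    frame_seconds is the duration of one step (a hypothetical value here).
    """
    segments = []
    start = None
    for i, prob in enumerate(frame_probs):
        if prob >= threshold and start is None:
            start = i  # a flagged run begins
        elif prob < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None  # the flagged run has ended
    if start is not None:
        # The clip ended while still inside a flagged run.
        segments.append((start * frame_seconds, len(frame_probs) * frame_seconds))
    return segments


# Made-up detector output for a clip with two generated regions.
probs = [0.02, 0.10, 0.91, 0.97, 0.88, 0.05, 0.03, 0.95, 0.99]
segments = watermarked_segments(probs, frame_seconds=0.5)
print(segments)
```

With these made-up probabilities, the function reports two spans, which mirrors how a localized detector can flag AI-generated segments inside an otherwise unmarked recording.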
The interactive demo also included a voice-authentication step: to use a person's voice, a user had to speak a verification prompt in their own voice, with the prompts changing at rapid intervals to discourage uploading someone else's recording. Meta additionally reported testing for fairness across demographic groups, drawing on data from speakers in more than 150 countries speaking over 200 primary languages.[1][3]
Meta announced Audiobox as a foundation research model and opened an interactive demo on the project site (audiobox.metademolab.com) on 11 December 2023. The underlying model was made available under a research-only license to a limited, hand-selected set of researchers and institutions rather than as an open or commercial release; the move was consistent with Meta's earlier decision to withhold Voicebox over misuse concerns.[1][2][3]
Alongside the demo, Meta announced the Audiobox Responsible Generation Grant, offering research teams funding and access to study safety, fairness, and ethics in generative audio. The FAIR group accepted applications for up to 10 grants of up to $50,000 each.[6]
As of early 2026, the public Audiobox demo is no longer available.[1]