See also: Generative AI, Meta AI, and Deep Learning
AudioCraft is an open-source framework developed by Meta AI for generating high-quality audio and music using deep learning. Released in August 2023, AudioCraft provides a unified codebase for audio processing and generation that bundles several state-of-the-art models under a single library. The framework operates on raw audio signals rather than symbolic representations like MIDI or piano rolls, which allows it to capture the full expressive range of sound, including tonal nuance, timbre, and recording conditions.
The core AudioCraft suite originally consisted of three models: MusicGen for text-to-music generation, AudioGen for text-to-sound-effect generation, and EnCodec for neural audio compression. Meta later expanded the framework to include MAGNeT (a non-autoregressive masked generative model) and Multi-Band Diffusion (an alternative diffusion-based decoder). All of these components share the same underlying tokenization approach based on EnCodec, which converts continuous audio waveforms into sequences of discrete tokens that language-model-style architectures can process.
AudioCraft's code is released under the MIT license, while the pretrained model weights are distributed under a CC-BY-NC 4.0 license. The framework is available on GitHub at facebookresearch/audiocraft and has been integrated into Hugging Face Transformers since version 4.31.0.
AudioCraft's design revolves around a shared pipeline: encode raw audio into discrete tokens using EnCodec, model those token sequences with a Transformer-based language model, and decode the predicted tokens back into audio. This approach draws on advances in both neural audio compression and autoregressive sequence modeling.
A key innovation in AudioCraft is its token interleaving strategy. EnCodec produces multiple parallel streams of tokens (codebooks) for each audio frame. Rather than requiring separate models to handle each codebook (as in hierarchical or cascaded approaches), AudioCraft introduces a delay pattern that staggers the codebooks in time. This allows a single Transformer decoder to predict all codebook tokens in one forward pass, with each codebook offset by a small delay relative to the previous one. The result is that the model needs only 50 autoregressive steps per second of audio, regardless of the number of codebooks.
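To make the delay pattern concrete, the following sketch staggers a grid of EnCodec tokens by codebook index. It is a simplified illustration rather than AudioCraft's actual implementation; the padding value and the 2048-entry codebook size are assumptions chosen for demonstration.

```python
import numpy as np

def build_delay_pattern(tokens: np.ndarray, pad: int = -1) -> np.ndarray:
    """Stagger each codebook by its index so one decoder step predicts all codebooks.

    tokens: array of shape [num_codebooks, seq_len] from EnCodec.
    Returns an array of shape [num_codebooks, seq_len + num_codebooks - 1]
    in which codebook k is shifted k steps to the right.
    """
    num_codebooks, seq_len = tokens.shape
    out = np.full((num_codebooks, seq_len + num_codebooks - 1), pad, dtype=tokens.dtype)
    for k in range(num_codebooks):
        out[k, k:k + seq_len] = tokens[k]
    return out

# 4 codebooks at 50 Hz: one second of audio is a [4, 50] token grid.
codes = np.random.randint(0, 2048, size=(4, 50))
delayed = build_delay_pattern(codes)
print(delayed.shape)  # (4, 53): still roughly 50 autoregressive steps per second
```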
AudioCraft models accept various forms of conditioning input. Text descriptions are encoded using a pretrained T5 text encoder, and the resulting embeddings are fed into the Transformer through cross-attention layers. MusicGen also supports melody conditioning through chromagram extraction (described in detail below). Classifier-free guidance is used during inference to improve the fidelity of generated audio relative to the text prompt, where the model is trained to sometimes drop the conditioning signal so that it can learn both conditional and unconditional distributions.
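The classifier-free guidance step can be written compactly. The sketch below shows the standard formulation; the function name is illustrative, and the guidance scale of 3.0 matches audiocraft's default `cfg_coef` but is otherwise an assumption.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the conditional prediction away from
    the unconditional one by the guidance scale."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy example: logits over a 2048-entry codebook for one decoding step.
cond = torch.randn(1, 2048)
uncond = torch.randn(1, 2048)
probs = cfg_logits(cond, uncond).softmax(dim=-1)
next_token = torch.multinomial(probs, num_samples=1)  # sample the next codebook token
```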
MusicGen is AudioCraft's text-to-music generation model. It was developed between April and May 2023 by researchers at Meta's Fundamental AI Research (FAIR) lab, including Jade Copet, Felix Kreuk, and Itai Gat. The corresponding paper, "Simple and Controllable Music Generation," was presented at NeurIPS 2023.
MusicGen is a single-stage autoregressive Transformer decoder that generates music directly from text prompts. Unlike prior systems such as Google's MusicLM, MusicGen does not rely on a separate self-supervised semantic representation stage. Instead, it operates over four codebooks produced by a 32 kHz EnCodec tokenizer sampled at 50 Hz. The delay-based interleaving pattern allows all four codebooks to be generated in parallel within a single model pass.
During training, the model learns to predict the next set of audio tokens given the previous tokens and a text conditioning signal. At inference time, tokens are generated autoregressively and then decoded back to a waveform using the EnCodec decoder.
MusicGen is available in several configurations that trade off quality against computational cost:
| Model | Parameters | Description | Conditioning |
|---|---|---|---|
| MusicGen Small | 300M | Lightweight model suitable for experimentation and fast inference | Text |
| MusicGen Medium | 1.5B | Balanced quality and speed; recommended starting point | Text |
| MusicGen Large | 3.3B | Highest quality text-to-music generation | Text |
| MusicGen Melody | 1.5B | Supports melody-guided generation using chromagram conditioning | Text + Melody |
| MusicGen Melody-Large | 3.3B | Large-scale melody-guided generation | Text + Melody |
| MusicGen Stereo (Small) | 300M | Generates stereophonic audio | Text |
| MusicGen Stereo (Medium) | 1.5B | Generates stereophonic audio | Text |
| MusicGen Stereo (Large) | 3.3B | Generates stereophonic audio | Text |
| MusicGen Stereo Melody | 1.5B | Stereo with melody conditioning | Text + Melody |
| MusicGen Stereo Melody-Large | 3.3B | Stereo with melody conditioning | Text + Melody |
The Medium and Melody variants are generally considered the best trade-off between quality and computational requirements for most use cases.
One of MusicGen's distinctive features is its ability to condition generation on a melody extracted from a reference audio track. A chromagram is computed from the reference waveform and reduced to its dominant time-frequency bin in each frame, which discourages the model from simply reconstructing the reference; the resulting sequence is fed to the Transformer as an additional conditioning signal alongside the text embedding.
This approach allows users to hum a melody or provide a reference track, and MusicGen will produce a new composition in a different style or instrumentation that follows the same melodic shape.
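A minimal sketch of the chromagram step, using librosa as a stand-in for AudioCraft's internal feature extraction; the function name, sample rate, and one-hot reduction below are illustrative assumptions consistent with the description above.

```python
import librosa
import numpy as np

def dominant_chroma(path: str, sr: int = 32000) -> np.ndarray:
    """Compute a chromagram and keep only the dominant pitch class per frame."""
    wav, _ = librosa.load(path, sr=sr, mono=True)
    chroma = librosa.feature.chroma_stft(y=wav, sr=sr)  # shape [12, frames]
    one_hot = np.zeros_like(chroma)
    one_hot[chroma.argmax(axis=0), np.arange(chroma.shape[1])] = 1.0
    return one_hot  # conditioning signal that encodes melodic shape only
```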
MusicGen's architecture naturally extends to stereophonic music. The stereo models obtain two streams of tokens from a stereo-capable EnCodec model and interleave them using a variant of the delay pattern that alternates between left and right channel codebooks. Stereo generation adds no extra computational cost at either training or inference time compared to a mono model of the same parameter count. The stereo models were fine-tuned for 200,000 update steps starting from the pretrained mono checkpoints.
MusicGen was trained on approximately 20,000 hours of licensed music comprising:
- the Meta Music Initiative Sound Collection, an internal library of high-quality instrumental tracks;
- instrument-only tracks licensed from the ShutterStock stock-music catalogue;
- instrument-only tracks licensed from Pond5.
All training data was either owned by Meta or specifically licensed for this purpose. The dataset included text descriptions and metadata for each track, amounting to roughly 400,000 recordings when counting individual segments. The music was predominantly instrumental and sampled at 32 kHz.
AudioGen is AudioCraft's text-to-audio generation model, focused on environmental sounds and sound effects rather than music. The original AudioGen paper, "AudioGen: Textually Guided Audio Generation" by Felix Kreuk and colleagues, was published in September 2022 and later integrated into the AudioCraft framework.
AudioGen follows the same general architecture as MusicGen but operates on a 16 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. The released model has 1.5 billion parameters. Like MusicGen, it uses a T5 text encoder for conditioning and employs classifier-free guidance during inference.
Unlike MusicGen, which relies on a curated licensed music dataset, AudioGen was trained on a combination of publicly available and licensed audio datasets. The training corpus includes:
| Dataset | Type |
|---|---|
| AudioSet | Multi-label audio event classification |
| AudioCaps | Audio captioning with natural language descriptions |
| Clotho v2 | Audio captioning dataset |
| VGG-Sound | Audio-visual dataset |
| FSD50K | Freesound Dataset 50K, general-purpose audio events |
| Sonniss Game Effects | Game sound effects |
| WeSoundEffects | Sound effects library |
| Paramount Motion Odeon Cinematic Sound Effects | Cinematic sound effects |
| Free To Use Sounds | Freely licensed sound effects |
| BBC Sound Effects | Public sound effects archive |
All audio files were resampled to 16 kHz. The paper introduced a data augmentation technique that mixes different audio samples together during training, which encourages the model to learn how to internally separate multiple sound sources. This improves the model's ability to generate complex acoustic scenes containing multiple overlapping sounds.
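The mixing augmentation can be sketched as follows; the gain range and the way captions are merged are illustrative assumptions, not the exact recipe from the paper.

```python
import random
import torch

def mix_examples(wav_a: torch.Tensor, caption_a: str,
                 wav_b: torch.Tensor, caption_b: str):
    """Overlay two training examples so the model must handle overlapping sources."""
    n = min(wav_a.shape[-1], wav_b.shape[-1])
    gain = random.uniform(0.3, 0.7)                       # assumed mixing ratio
    mixed = gain * wav_a[..., :n] + (1.0 - gain) * wav_b[..., :n]
    caption = f"{caption_a} and {caption_b}"              # merged text description
    return mixed, caption

# Toy usage with random waveforms standing in for real clips.
mixed, caption = mix_examples(torch.randn(16000), "a dog barking",
                              torch.randn(16000), "rain falling on a roof")
```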
AudioGen can produce a wide range of environmental sounds from text descriptions, such as barking dogs, honking cars, footsteps, and other everyday acoustic events, as well as more complex scenes combining several of these sources.
EnCodec is the neural audio codec that serves as the foundation for all other models in AudioCraft. Introduced in the paper "High Fidelity Neural Audio Compression" by Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi (published on arXiv in October 2022 and later in Transactions on Machine Learning Research in September 2023), EnCodec provides real-time, high-fidelity audio compression and reconstruction using deep learning.
EnCodec follows an encoder-quantizer-decoder paradigm:
Encoder: The encoder is built on the SEANet convolutional architecture, consisting of a series of 1D convolutional residual blocks with strided downsampling. A two-layer LSTM follows the convolutional stack to improve sequence-level modeling. The encoder takes a raw audio waveform as input and produces a continuous latent representation.
Residual Vector Quantization (RVQ): The continuous latent representation is discretized using Residual Vector Quantization. RVQ works by applying a sequence of vector quantization steps, where each step quantizes the residual (error) left by the previous step. This produces multiple parallel streams of discrete tokens (codebooks), each capturing a different level of detail. The first codebook captures the coarsest information, while subsequent codebooks refine the representation progressively. A key advantage of RVQ is that the embedding dimensionality remains constant regardless of the target bitrate; the number of active codebooks simply changes.
Decoder: The decoder mirrors the encoder with a symmetric stack of transposed convolutional layers and residual blocks that upsample the quantized latent representation back into a waveform.
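The residual quantization stage described above can be illustrated with a minimal sketch; the random codebooks and nearest-neighbour lookup are illustrative, not EnCodec's trained quantizer.

```python
import torch

def rvq_encode(latents: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Quantize each frame with a cascade of codebooks, each modelling the
    residual error left by the previous one.

    latents:   [frames, dim] continuous encoder output
    codebooks: list of [codebook_size, dim] embedding tables
    Returns token indices of shape [num_codebooks, frames].
    """
    residual = latents
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # [frames, codebook_size]
        idx = dists.argmin(dim=-1)          # nearest code per frame
        residual = residual - cb[idx]       # pass the error to the next level
        indices.append(idx)
    return torch.stack(indices)

# Toy setup: 4 codebooks of 1024 entries over a 128-dimensional latent space.
codebooks = [torch.randn(1024, 128) for _ in range(4)]
tokens = rvq_encode(torch.randn(50, 128), codebooks)  # shape [4, 50]
```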
EnCodec supports two primary sampling rates:
| Configuration | Sample Rate | Channels | Bitrates | Codebooks | Use Case |
|---|---|---|---|---|---|
| EnCodec 24kHz | 24,000 Hz | Mono | 1.5, 3, 6, 12, 24 kbps | Up to 32 | Speech, general audio |
| EnCodec 48kHz | 48,000 Hz | Stereo | 3, 6, 12, 24 kbps | Up to 16 | High-fidelity music |
These two checkpoints are the general-purpose EnCodec releases; within AudioCraft, the generative models use dedicated EnCodec tokenizers, trained at 32 kHz for MusicGen and at 16 kHz for AudioGen. Structured quantization dropout is applied during training, where a random number of codebooks is masked for each example, enabling a single model to operate at multiple bitrates without architectural changes.
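The bitrates in the table follow directly from the frame rate and codebook size. The short calculation below assumes 1024-entry (10-bit) codebooks and the 24 kHz model's 75 Hz frame rate (24,000 Hz with 320x downsampling).

```python
# Bitrate = frame_rate * bits_per_codebook * num_codebooks
frame_rate = 75          # 24,000 Hz / 320x downsampling
bits_per_codebook = 10   # 1024-entry codebooks
for num_codebooks in (2, 4, 8, 16, 32):
    kbps = frame_rate * bits_per_codebook * num_codebooks / 1000
    print(f"{num_codebooks:2d} codebooks -> {kbps} kbps")
# 2 -> 1.5, 4 -> 3.0, 8 -> 6.0, 16 -> 12.0, 32 -> 24.0 kbps
```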
EnCodec is trained end-to-end with a composite loss function that combines:
- a reconstruction loss over both the time domain (L1 distance between waveforms) and the frequency domain (L1 and L2 distances over multi-scale mel spectrograms);
- an adversarial loss from a multi-scale STFT-based discriminator, together with a feature-matching loss over the discriminator's intermediate activations;
- a commitment loss for the residual vector quantizer.
A loss balancer weights the gradients of the individual terms to stabilize training.
EnCodec achieves better perceptual quality than classical codecs such as Opus and EVS, as well as earlier neural codecs like Google's SoundStream, across a range of bitrates and audio types.
MAGNeT (Masked Audio Generation using a Single Non-Autoregressive Transformer) is a newer addition to AudioCraft, introduced in January 2024 by Alon Ziv, Itai Gat, and colleagues. The paper was presented at ICLR 2024.
Unlike MusicGen and AudioGen, which generate tokens autoregressively (one step at a time from left to right), MAGNeT uses a non-autoregressive approach based on masked token prediction. During training, spans of tokens are masked according to a masking scheduler, and the model learns to predict the masked tokens given the surrounding context. During inference, the model starts from a fully masked sequence and gradually fills in tokens over several decoding steps, refining its predictions iteratively.
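The iterative decoding loop can be sketched as follows. The cosine schedule, confidence-based selection, and step count are simplified assumptions capturing the masked-generation idea rather than MAGNeT's exact scheduler (which also uses span masking and rescoring).

```python
import math
import torch

def masked_decode(model, seq_len: int, num_steps: int = 20, mask_id: int = 0):
    """Start from a fully masked sequence; at each step commit the most
    confident predictions at still-masked positions and leave the rest masked."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    decoded = torch.zeros(seq_len, dtype=torch.bool)
    for step in range(1, num_steps + 1):
        logits = model(tokens)                           # [seq_len, vocab]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # confidence per position
        conf[decoded] = -1.0                             # never re-select committed tokens
        # Cosine schedule: fraction of positions decoded after this step.
        target = int(seq_len * (1 - math.cos(math.pi / 2 * step / num_steps)))
        num_new = max(target - int(decoded.sum()), 0)
        picks = conf.argsort(descending=True)[:num_new]
        tokens[picks] = pred[picks]
        decoded[picks] = True
    return tokens

# Toy stand-in model: random logits over a 2048-entry vocabulary.
def toy_model(toks):
    return torch.randn(toks.shape[0], 2048)

out = masked_decode(toy_model, seq_len=50)
```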
MAGNeT operates over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz, the same tokenization scheme used by MusicGen.
| Model | Parameters | Duration | Domain |
|---|---|---|---|
| MAGNeT Small (music) | 300M | 10 seconds | Text-to-music |
| MAGNeT Medium (music) | 1.5B | 30 seconds | Text-to-music |
| MAGNeT Small (audio) | 300M | 10 seconds | Text-to-audio |
| MAGNeT Medium (audio) | 1.5B | 30 seconds | Text-to-audio |
MAGNeT generates audio up to 7 times faster than autoregressive baselines of comparable quality. In benchmarks, its output quality is comparable to MusicGen and AudioGen while requiring significantly fewer decoding steps. The speed advantage makes MAGNeT particularly suitable for interactive applications where low latency matters.
MAGNeT was trained between November 2023 and January 2024 on the same 20,000 hours of licensed music used for MusicGen (Meta Music Initiative Sound Collection, ShutterStock, and Pond5), sampled at 32 kHz.
Multi-Band Diffusion is an alternative decoder for EnCodec tokens that uses a diffusion model approach instead of the standard convolutional decoder. The paper, "From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion" by Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, and Alexandre Defossez, was presented at NeurIPS 2023.
MBD consists of a collection of four diffusion models, each responsible for generating audio in a different frequency band. The models are conditioned on the embeddings extracted from a pretrained EnCodec model. By splitting the generation task across multiple frequency bands, each sub-model can specialize in reconstructing a particular range of the audio spectrum, leading to higher overall fidelity.
The primary advantage of MBD over the standard EnCodec decoder is reduced audio artifacts. Music and sound effects generated through MBD exhibit fewer metallic or buzzy artifacts that can occur with the purely convolutional decoder, especially at lower bitrates. The trade-off is increased computational cost, since running four diffusion models is more expensive than a single decoder pass. MBD is available as a toggle in AudioCraft's MusicGen API and demo interface.
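A usage sketch of swapping in the diffusion decoder: the MultiBandDiffusion helper, its get_mbd_musicgen constructor, the tokens_to_wav method, and the return_tokens flag follow the audiocraft demo code in recent releases, but should be checked against the repository before use.

```python
from audiocraft.models import MusicGen, MultiBandDiffusion

model = MusicGen.get_pretrained('facebook/musicgen-medium')
mbd = MultiBandDiffusion.get_mbd_musicgen()   # diffusion decoder for MusicGen tokens
model.set_generation_params(duration=8)

# Generate tokens once, then decode them with either decoder for comparison.
wav_encodec, tokens = model.generate(['calm piano melody'], return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)
```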
At equal bitrate, MBD outperforms other generative decoding approaches in perceptual quality metrics across speech, music, and environmental sound modalities.
Meta took a deliberate approach to training data for AudioCraft, relying on licensed and owned content rather than scraped data. This distinguishes AudioCraft from several competing platforms that have faced legal challenges over training data provenance.
| Component | Training Data Sources | Hours | Sample Rate |
|---|---|---|---|
| MusicGen | Meta Music Initiative, ShutterStock, Pond5 | ~20,000 | 32 kHz |
| AudioGen | AudioSet, AudioCaps, Clotho, VGG-Sound, FSD50K, BBC Sound Effects, and others | Varies per dataset | 16 kHz |
| EnCodec | Diverse audio corpora (speech, music, environmental sound) | Not publicly disclosed | 24 kHz / 48 kHz |
| MAGNeT | Meta Music Initiative, ShutterStock, Pond5 (music); AudioSet-based (audio) | ~20,000 (music) | 32 kHz |
The Meta Music Initiative Sound Collection is an internal library of high-quality instrumental tracks. ShutterStock and Pond5 are commercial stock music platforms from which Meta obtained explicit licenses for training. The training data is acknowledged to have limitations, including a bias toward Western-style music and English-language metadata.
The AI music generation landscape has grown significantly since AudioCraft's release. The following table compares AudioCraft's MusicGen with other prominent systems:
| Feature | MusicGen (AudioCraft) | Suno | Udio | Stable Audio (Stability AI) |
|---|---|---|---|---|
| Developer | Meta AI (FAIR) | Suno Inc. | Udio Inc. | Stability AI |
| Release | June 2023 | December 2023 | April 2024 | March 2024 |
| Open Source | Yes (MIT license for code) | No | No | Partially (Stable Audio Open) |
| Vocals | No (instrumental only) | Yes (full songs with lyrics) | Yes (full songs with lyrics) | Limited (Stable Audio Open: no) |
| Max Duration | 30 seconds (default) | 4 minutes+ | 2+ minutes | 47 seconds (Open) / longer (commercial) |
| Architecture | Autoregressive Transformer | Proprietary | Proprietary | Latent Diffusion (DiT) |
| Melody Conditioning | Yes (chromagram-based) | No | No | No |
| Training Data | Licensed (ShutterStock, Pond5, Meta-owned) | Disputed (RIAA lawsuit filed July 2024) | Disputed (RIAA lawsuit filed July 2024) | Licensed (AudioSparx, Freesound for Open) |
| Local Deployment | Yes (full offline use) | No (cloud API only) | No (cloud API only) | Yes (Open version) |
| Stereo Output | Yes | Yes | Yes | Yes |
| Sample Rate | 32 kHz | Proprietary | Proprietary | 44.1 kHz |
MusicGen's primary strengths are its open-source availability, melody conditioning capability, and clean training data provenance. Its main limitations relative to commercial platforms like Suno and Udio are the lack of vocal generation and shorter default output duration. Suno and Udio produce full songs with vocals and lyrics, but both faced lawsuits from the Recording Industry Association of America (RIAA) in July 2024 alleging unauthorized use of copyrighted music in training.
Stable Audio Open, released by Stability AI in July 2024, uses a latent diffusion architecture rather than an autoregressive Transformer and generates variable-length stereo audio at 44.1 kHz. Its training data comes from Freesound and the Free Music Archive under Creative Commons licenses.
AudioCraft was released as an open-source project to encourage research and community development. The codebase is hosted at github.com/facebookresearch/audiocraft under the MIT license for code and CC-BY-NC 4.0 for model weights.
Since version 4.31.0 of the Hugging Face Transformers library, MusicGen and EnCodec have been available as first-class model implementations. This integration provides:
- Pretrained checkpoints hosted under the facebook/ namespace (e.g., facebook/musicgen-small, facebook/musicgen-medium, facebook/musicgen-large, facebook/musicgen-melody)
- Generation through the standard pipeline and AutoModel interfaces

The AUDIOCRAFT_CACHE_DIR environment variable can be set to control where model weights are cached locally.
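A minimal example of the Transformers integration, following the Hugging Face model card for facebook/musicgen-small; max_new_tokens controls output length in audio tokens (roughly 50 per second of audio).

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["lo-fi hip hop beat with mellow piano"],
    padding=True,
    return_tensors="pt",
)
# ~50 tokens per second of audio, so 256 tokens is roughly 5 seconds.
audio_values = model.generate(**inputs, max_new_tokens=256)
sampling_rate = model.config.audio_encoder.sampling_rate  # 32000 for MusicGen
```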
AudioCraft can be installed via pip:
```bash
pip install audiocraft
```
Basic usage for text-to-music generation with MusicGen:
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write  # helper for writing normalized audio files

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # seconds of audio to generate
descriptions = ['upbeat electronic dance track with heavy bass']
wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]
for idx, one_wav in enumerate(wav):  # save each clip with loudness normalization
    audio_write(f'output_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
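Melody conditioning uses the same model with a reference waveform loaded via torchaudio; the file path below is a placeholder.

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

# Reference track whose melodic shape should be followed (placeholder path).
melody, sr = torchaudio.load('reference_melody.wav')
wav = model.generate_with_chroma(
    ['lo-fi hip hop rework of the reference melody'],  # text descriptions
    melody[None],                                      # reference waveform(s), batched
    sr,                                                # sample rate of the reference
)
```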
The following papers describe the models and techniques within AudioCraft:
| Paper | Authors | Year | Venue |
|---|---|---|---|
| High Fidelity Neural Audio Compression (EnCodec) | Defossez, Copet, Synnaeve, Adi | 2022 | TMLR 2023 |
| AudioGen: Textually Guided Audio Generation | Kreuk, Synnaeve, Polyak, Singer, Defossez, Copet, Parikh, Taigman, Adi | 2022 | ICLR 2023 |
| Simple and Controllable Music Generation (MusicGen) | Copet, Kreuk, Gat, Remez, Kant, Synnaeve, Adi, Defossez | 2023 | NeurIPS 2023 |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | San Roman, Adi, Deleforge, Serizel, Synnaeve, Defossez | 2023 | NeurIPS 2023 |
| Masked Audio Generation using a Single Non-Autoregressive Transformer (MAGNeT) | Ziv, Gat, Le Lan, Remez, Kreuk, Defossez, Copet, Synnaeve, Adi | 2024 | ICLR 2024 |
Meta's research team has indicated several areas of ongoing and future work.
The MusicGen-Style variant, trained on 16,000 hours of licensed music, explores style-conditioned generation where users can provide a reference track to guide the overall musical style rather than just melody.