See also: Generative AI, Meta AI, and Deep Learning
AudioCraft is an open-source framework developed by Meta AI for generating high-quality audio and music using deep learning. Released in August 2023, AudioCraft provides a unified codebase for audio processing and generation that bundles several state-of-the-art models under a single library. The framework operates on raw audio signals rather than symbolic representations like MIDI or piano rolls, which allows it to capture the full expressive range of sound, including tonal nuance, timbre, and recording conditions.
The core AudioCraft suite originally consisted of three models: MusicGen for text-to-music generation, AudioGen for text-to-sound-effect generation, and EnCodec for neural audio compression. Meta later expanded the framework to include MAGNeT (a non-autoregressive masked generative model) and Multi-Band Diffusion (an alternative diffusion-based decoder). All of these components share the same underlying tokenization approach based on EnCodec, which converts continuous audio waveforms into sequences of discrete tokens that language-model-style architectures can process.
AudioCraft's code is released under the MIT license, while the pretrained model weights are distributed under a CC-BY-NC 4.0 license. The framework is available on GitHub at facebookresearch/audiocraft and has been integrated into Hugging Face Transformers since version 4.31.0.
AudioCraft's design revolves around a shared pipeline: encode raw audio into discrete tokens using EnCodec, model those token sequences with a Transformer-based language model, and decode the predicted tokens back into audio. This approach draws on advances in both neural audio compression and autoregressive sequence modeling.
A key innovation in AudioCraft is its token interleaving strategy. EnCodec produces multiple parallel streams of tokens (codebooks) for each audio frame. Rather than requiring separate models to handle each codebook (as in hierarchical or cascaded approaches), AudioCraft introduces a delay pattern that staggers the codebooks in time. This allows a single Transformer decoder to predict all codebook tokens in one forward pass, with each codebook offset by a small delay relative to the previous one. The result is that the model needs only 50 autoregressive steps per second of audio, regardless of the number of codebooks.
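To make the delay pattern concrete, the following sketch staggers a grid of EnCodec tokens by codebook index. It is a simplified illustration rather than AudioCraft's actual implementation; the padding value and the 2048-entry codebook size are assumptions chosen for demonstration.

```python
import numpy as np

def build_delay_pattern(tokens: np.ndarray, pad: int = -1) -> np.ndarray:
    """Stagger each codebook by its index so one decoder step predicts all codebooks.

    tokens: array of shape [num_codebooks, seq_len] from EnCodec.
    Returns an array of shape [num_codebooks, seq_len + num_codebooks - 1]
    in which codebook k is shifted k steps to the right.
    """
    num_codebooks, seq_len = tokens.shape
    out = np.full((num_codebooks, seq_len + num_codebooks - 1), pad, dtype=tokens.dtype)
    for k in range(num_codebooks):
        out[k, k:k + seq_len] = tokens[k]
    return out

# 4 codebooks at 50 Hz: one second of audio is a [4, 50] token grid.
codes = np.random.randint(0, 2048, size=(4, 50))
delayed = build_delay_pattern(codes)
print(delayed.shape)  # (4, 53): still roughly 50 autoregressive steps per second
```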
AudioCraft models accept various forms of conditioning input. Text descriptions are encoded using a pretrained T5 text encoder, and the resulting embeddings are fed into the Transformer through cross-attention layers. MusicGen also supports melody conditioning through chromagram extraction (described in detail below). Classifier-free guidance is used during inference to improve the fidelity of generated audio relative to the text prompt, where the model is trained to sometimes drop the conditioning signal so that it can learn both conditional and unconditional distributions.
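The classifier-free guidance step can be written compactly. The sketch below shows the standard formulation; the function name is illustrative, and the guidance scale of 3.0 matches audiocraft's default `cfg_coef` but is otherwise an assumption.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the conditional prediction away from
    the unconditional one by the guidance scale."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy example: logits over a 2048-entry codebook for one decoding step.
cond = torch.randn(1, 2048)
uncond = torch.randn(1, 2048)
probs = cfg_logits(cond, uncond).softmax(dim=-1)
next_token = torch.multinomial(probs, num_samples=1)  # sample the next codebook token
```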
MusicGen is AudioCraft's text-to-music generation model. It was developed between April and May 2023 by researchers at Meta's Fundamental AI Research (FAIR) lab, including Jade Copet, Felix Kreuk, and Itai Gat. The corresponding paper, "Simple and Controllable Music Generation," was presented at NeurIPS 2023.
MusicGen is a single-stage autoregressive Transformer decoder that generates music directly from text prompts. Unlike prior systems such as Google's MusicLM, MusicGen does not rely on a separate self-supervised semantic representation stage. Instead, it operates over four codebooks produced by a 32 kHz EnCodec tokenizer sampled at 50 Hz. The delay-based interleaving pattern allows all four codebooks to be generated in parallel within a single model pass.
During training, the model learns to predict the next set of audio tokens given the previous tokens and a text conditioning signal. At inference time, tokens are generated autoregressively and then decoded back to a waveform using the EnCodec decoder.
MusicGen is available in several configurations that trade off quality against computational cost:
| Model | Parameters | Description | Conditioning |
|---|---|---|---|
| MusicGen Small | 300M | Lightweight model suitable for experimentation and fast inference | Text |
| MusicGen Medium | 1.5B | Balanced quality and speed; recommended starting point | Text |
| MusicGen Large | 3.3B | Highest quality text-to-music generation | Text |
| MusicGen Melody | 1.5B | Supports melody-guided generation using chromagram conditioning | Text + Melody |
| MusicGen Melody-Large | 3.3B | Large-scale melody-guided generation | Text + Melody |
| MusicGen Stereo (Small) | 300M | Generates stereophonic audio | Text |
| MusicGen Stereo (Medium) | 1.5B | Generates stereophonic audio | Text |
| MusicGen Stereo (Large) | 3.3B | Generates stereophonic audio | Text |
| MusicGen Stereo Melody | 1.5B | Stereo with melody conditioning | Text + Melody |
| MusicGen Stereo Melody-Large | 3.3B | Stereo with melody conditioning | Text + Melody |
The Medium and Melody variants are generally considered the best trade-off between quality and computational requirements for most use cases.
One of MusicGen's distinctive features is its ability to condition generation on a melody extracted from a reference audio track. A chromagram is computed from the reference waveform and reduced to its dominant time-frequency bin in each frame, which discourages the model from simply reconstructing the reference; the resulting sequence is fed to the Transformer as an additional conditioning signal alongside the text embedding.
This approach allows users to hum a melody or provide a reference track, and MusicGen will produce a new composition in a different style or instrumentation that follows the same melodic shape.
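A minimal sketch of the chromagram step, using librosa as a stand-in for AudioCraft's internal feature extraction; the function name, sample rate, and one-hot reduction below are illustrative assumptions consistent with the description above.

```python
import librosa
import numpy as np

def dominant_chroma(path: str, sr: int = 32000) -> np.ndarray:
    """Compute a chromagram and keep only the dominant pitch class per frame."""
    wav, _ = librosa.load(path, sr=sr, mono=True)
    chroma = librosa.feature.chroma_stft(y=wav, sr=sr)  # shape [12, frames]
    one_hot = np.zeros_like(chroma)
    one_hot[chroma.argmax(axis=0), np.arange(chroma.shape[1])] = 1.0
    return one_hot  # conditioning signal that encodes melodic shape only
```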
MusicGen's architecture naturally extends to stereophonic music. The stereo models obtain two streams of tokens from a stereo-capable EnCodec model and interleave them using a variant of the delay pattern that alternates between left and right channel codebooks. Stereo generation adds no extra computational cost at either training or inference time compared to a mono model of the same parameter count. The stereo models were fine-tuned for 200,000 update steps starting from the pretrained mono checkpoints.
MusicGen was trained on approximately 20,000 hours of licensed music comprising:
- the Meta Music Initiative Sound Collection, an internal library of high-quality instrumental tracks;
- instrument-only tracks licensed from the ShutterStock stock-music catalogue;
- instrument-only tracks licensed from Pond5.
All training data was either owned by Meta or specifically licensed for this purpose. The dataset included text descriptions and metadata for each track, amounting to roughly 400,000 recordings when counting individual segments. The music was predominantly instrumental and sampled at 32 kHz.
AudioGen is AudioCraft's text-to-audio generation model, focused on environmental sounds and sound effects rather than music. The original AudioGen paper, "AudioGen: Textually Guided Audio Generation" by Felix Kreuk and colleagues, was published in September 2022 and later integrated into the AudioCraft framework.
AudioGen follows the same general architecture as MusicGen but operates on a 16 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. The released model has 1.5 billion parameters. Like MusicGen, it uses a T5 text encoder for conditioning and employs classifier-free guidance during inference.
Unlike MusicGen, which relies on a curated licensed music dataset, AudioGen was trained on a combination of publicly available and licensed audio datasets. The training corpus includes:
| Dataset | Type |
|---|---|
| AudioSet | Multi-label audio event classification |
| AudioCaps | Audio captioning with natural language descriptions |
| Clotho v2 | Audio captioning dataset |
| VGG-Sound | Audio-visual dataset |
| FSD50K | Freesound Dataset 50K, general-purpose audio events |
| Sonniss Game Effects | Game sound effects |
| WeSoundEffects | Sound effects library |
| Paramount Motion Odeon Cinematic Sound Effects | Cinematic sound effects |
| Free To Use Sounds | Freely licensed sound effects |
| BBC Sound Effects | Public sound effects archive |
All audio files were resampled to 16 kHz. The paper introduced a data augmentation technique that mixes different audio samples together during training, which encourages the model to learn how to internally separate multiple sound sources. This improves the model's ability to generate complex acoustic scenes containing multiple overlapping sounds.
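The mixing augmentation can be sketched as follows; the gain range and the way captions are merged are illustrative assumptions, not the exact recipe from the paper.

```python
import random
import torch

def mix_examples(wav_a: torch.Tensor, caption_a: str,
                 wav_b: torch.Tensor, caption_b: str):
    """Overlay two training examples so the model must handle overlapping sources."""
    n = min(wav_a.shape[-1], wav_b.shape[-1])
    gain = random.uniform(0.3, 0.7)                       # assumed mixing ratio
    mixed = gain * wav_a[..., :n] + (1.0 - gain) * wav_b[..., :n]
    caption = f"{caption_a} and {caption_b}"              # merged text description
    return mixed, caption

# Toy usage with random waveforms standing in for real clips.
mixed, caption = mix_examples(torch.randn(16000), "a dog barking",
                              torch.randn(16000), "rain falling on a roof")
```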
AudioGen can produce a wide range of environmental sounds from text descriptions, such as barking dogs, honking cars, footsteps, and other everyday acoustic events, as well as more complex scenes combining several of these sources.
EnCodec is the neural audio codec that serves as the foundation for all other models in AudioCraft. Introduced in the paper "High Fidelity Neural Audio Compression" by Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi (published on arXiv in October 2022 and later in Transactions on Machine Learning Research in September 2023), EnCodec provides real-time, high-fidelity audio compression and reconstruction using deep learning.
EnCodec follows an encoder-quantizer-decoder paradigm:
Encoder: The encoder is built on the SEANet convolutional architecture, consisting of a series of 1D convolutional residual blocks with strided downsampling. A two-layer LSTM follows the convolutional stack to improve sequence-level modeling. The encoder takes a raw audio waveform as input and produces a continuous latent representation.
Residual Vector Quantization (RVQ): The continuous latent representation is discretized using Residual Vector Quantization. RVQ works by applying a sequence of vector quantization steps, where each step quantizes the residual (error) left by the previous step. This produces multiple parallel streams of discrete tokens (codebooks), each capturing a different level of detail. The first codebook captures the coarsest information, while subsequent codebooks refine the representation progressively. A key advantage of RVQ is that the embedding dimensionality remains constant regardless of the target bitrate; the number of active codebooks simply changes.
Decoder: The decoder mirrors the encoder with a symmetric stack of transposed convolutional layers and residual blocks that upsample the quantized latent representation back into a waveform.
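The residual quantization stage described above can be illustrated with a minimal sketch; the random codebooks and nearest-neighbour lookup are illustrative, not EnCodec's trained quantizer.

```python
import torch

def rvq_encode(latents: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Quantize each frame with a cascade of codebooks, each modelling the
    residual error left by the previous one.

    latents:   [frames, dim] continuous encoder output
    codebooks: list of [codebook_size, dim] embedding tables
    Returns token indices of shape [num_codebooks, frames].
    """
    residual = latents
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # [frames, codebook_size]
        idx = dists.argmin(dim=-1)          # nearest code per frame
        residual = residual - cb[idx]       # pass the error to the next level
        indices.append(idx)
    return torch.stack(indices)

# Toy setup: 4 codebooks of 1024 entries over a 128-dimensional latent space.
codebooks = [torch.randn(1024, 128) for _ in range(4)]
tokens = rvq_encode(torch.randn(50, 128), codebooks)  # shape [4, 50]
```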
EnCodec supports two primary sampling rates:
| Configuration | Sample Rate | Channels | Bitrates | Codebooks | Use Case |
|---|---|---|---|---|---|
| EnCodec 24kHz | 24,000 Hz | Mono | 1.5, 3, 6, 12, 24 kbps | Up to 32 | Speech, general audio |
| EnCodec 48kHz | 48,000 Hz | Stereo | 3, 6, 12, 24 kbps | Up to 16 | High-fidelity music |
These two checkpoints are the general-purpose EnCodec releases; within AudioCraft, the generative models use dedicated EnCodec tokenizers, trained at 32 kHz for MusicGen and at 16 kHz for AudioGen. Structured quantization dropout is applied during training, where a random number of codebooks is masked for each example, enabling a single model to operate at multiple bitrates without architectural changes.
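The bitrates in the table follow directly from the frame rate and codebook size. The short calculation below assumes 1024-entry (10-bit) codebooks and the 24 kHz model's 75 Hz frame rate (24,000 Hz with 320x downsampling).

```python
# Bitrate = frame_rate * bits_per_codebook * num_codebooks
frame_rate = 75          # 24,000 Hz / 320x downsampling
bits_per_codebook = 10   # 1024-entry codebooks
for num_codebooks in (2, 4, 8, 16, 32):
    kbps = frame_rate * bits_per_codebook * num_codebooks / 1000
    print(f"{num_codebooks:2d} codebooks -> {kbps} kbps")
# 2 -> 1.5, 4 -> 3.0, 8 -> 6.0, 16 -> 12.0, 32 -> 24.0 kbps
```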
EnCodec is trained end-to-end with a composite loss function that combines:
- a reconstruction loss over both the time domain (L1 distance between waveforms) and the frequency domain (L1 and L2 distances over multi-scale mel spectrograms);
- an adversarial loss from a multi-scale STFT-based discriminator, together with a feature-matching loss over the discriminator's intermediate activations;
- a commitment loss for the residual vector quantizer.
A loss balancer weights the gradients of the individual terms to stabilize training.
EnCodec achieves better perceptual quality than classical codecs such as Opus and EVS, as well as earlier neural codecs like Google's SoundStream, across a range of bitrates and audio types.
MAGNeT (Masked Audio Generation using a Single Non-Autoregressive Transformer) is a newer addition to AudioCraft, introduced in January 2024 by Alon Ziv, Itai Gat, and colleagues. The paper was presented at ICLR 2024.
Unlike MusicGen and AudioGen, which generate tokens autoregressively (one step at a time from left to right), MAGNeT uses a non-autoregressive approach based on masked token prediction. During training, spans of tokens are masked according to a masking scheduler, and the model learns to predict the masked tokens given the surrounding context. During inference, the model starts from a fully masked sequence and gradually fills in tokens over several decoding steps, refining its predictions iteratively.
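The iterative decoding loop can be sketched as follows. The cosine schedule, confidence-based selection, and step count are simplified assumptions capturing the masked-generation idea rather than MAGNeT's exact scheduler (which also uses span masking and rescoring).

```python
import math
import torch

def masked_decode(model, seq_len: int, num_steps: int = 20, mask_id: int = 0):
    """Start from a fully masked sequence; at each step commit the most
    confident predictions at still-masked positions and leave the rest masked."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    decoded = torch.zeros(seq_len, dtype=torch.bool)
    for step in range(1, num_steps + 1):
        logits = model(tokens)                           # [seq_len, vocab]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # confidence per position
        conf[decoded] = -1.0                             # never re-select committed tokens
        # Cosine schedule: fraction of positions decoded after this step.
        target = int(seq_len * (1 - math.cos(math.pi / 2 * step / num_steps)))
        num_new = max(target - int(decoded.sum()), 0)
        picks = conf.argsort(descending=True)[:num_new]
        tokens[picks] = pred[picks]
        decoded[picks] = True
    return tokens

# Toy stand-in model: random logits over a 2048-entry vocabulary.
def toy_model(toks):
    return torch.randn(toks.shape[0], 2048)

out = masked_decode(toy_model, seq_len=50)
```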
MAGNeT operates over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz, the same tokenization scheme used by MusicGen.
| Model | Parameters | Duration | Domain |
|---|---|---|---|
| MAGNeT Small (music) | 300M | 10 seconds | Text-to-music |
| MAGNeT Medium (music) | 1.5B | 30 seconds | Text-to-music |
| MAGNeT Small (audio) | 300M | 10 seconds | Text-to-audio |
| MAGNeT Medium (audio) | 1.5B | 30 seconds | Text-to-audio |
MAGNeT generates audio up to 7 times faster than autoregressive baselines of comparable quality. In benchmarks, its output quality is comparable to MusicGen and AudioGen while requiring significantly fewer decoding steps. The speed advantage makes MAGNeT particularly suitable for interactive applications where low latency matters.
MAGNeT was trained between November 2023 and January 2024 on the same 20,000 hours of licensed music used for MusicGen (Meta Music Initiative Sound Collection, ShutterStock, and Pond5), sampled at 32 kHz.
Multi-Band Diffusion is an alternative decoder for EnCodec tokens that uses a diffusion model approach instead of the standard convolutional decoder. The paper, "From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion" by Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, and Alexandre Defossez, was presented at NeurIPS 2023.
MBD consists of a collection of four diffusion models, each responsible for generating audio in a different frequency band. The models are conditioned on the embeddings extracted from a pretrained EnCodec model. By splitting the generation task across multiple frequency bands, each sub-model can specialize in reconstructing a particular range of the audio spectrum, leading to higher overall fidelity.
The primary advantage of MBD over the standard EnCodec decoder is reduced audio artifacts. Music and sound effects generated through MBD exhibit fewer metallic or buzzy artifacts that can occur with the purely convolutional decoder, especially at lower bitrates. The trade-off is increased computational cost, since running four diffusion models is more expensive than a single decoder pass. MBD is available as a toggle in AudioCraft's MusicGen API and demo interface.
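A usage sketch of swapping in the diffusion decoder: the MultiBandDiffusion helper, its get_mbd_musicgen constructor, the tokens_to_wav method, and the return_tokens flag follow the audiocraft demo code in recent releases, but should be checked against the repository before use.

```python
from audiocraft.models import MusicGen, MultiBandDiffusion

model = MusicGen.get_pretrained('facebook/musicgen-medium')
mbd = MultiBandDiffusion.get_mbd_musicgen()   # diffusion decoder for MusicGen tokens
model.set_generation_params(duration=8)

# Generate tokens once, then decode them with either decoder for comparison.
wav_encodec, tokens = model.generate(['calm piano melody'], return_tokens=True)
wav_diffusion = mbd.tokens_to_wav(tokens)
```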
At equal bitrate, MBD outperforms other generative decoding approaches in perceptual quality metrics across speech, music, and environmental sound modalities.
Meta took a deliberate approach to training data for AudioCraft, relying on licensed and owned content rather than scraped data. This distinguishes AudioCraft from several competing platforms that have faced legal challenges over training data provenance.
| Component | Training Data Sources | Hours | Sample Rate |
|---|---|---|---|
| MusicGen | Meta Music Initiative, ShutterStock, Pond5 | ~20,000 | 32 kHz |
| AudioGen | AudioSet, AudioCaps, Clotho, VGG-Sound, FSD50K, BBC Sound Effects, and others | Varies per dataset | 16 kHz |
| EnCodec | Diverse audio corpora (speech, music, environmental sound) | Not publicly disclosed | 24 kHz / 48 kHz |
| MAGNeT | Meta Music Initiative, ShutterStock, Pond5 (music); AudioSet-based (audio) | ~20,000 (music) | 32 kHz |
The Meta Music Initiative Sound Collection is an internal library of high-quality instrumental tracks. ShutterStock and Pond5 are commercial stock music platforms from which Meta obtained explicit licenses for training. The training data is acknowledged to have limitations, including a bias toward Western-style music and English-language metadata.
The AI music generation landscape has grown significantly since AudioCraft's release. The following table compares AudioCraft's MusicGen with other prominent systems:
| Feature | MusicGen (AudioCraft) | Suno | Udio | Stable Audio (Stability AI) |
|---|---|---|---|---|
| Developer | Meta AI (FAIR) | Suno Inc. | Udio Inc. | Stability AI |
| Release | June 2023 | December 2023 | April 2024 | March 2024 |
| Open Source | Yes (MIT license for code) | No | No | Partially (Stable Audio Open) |
| Vocals | No (instrumental only) | Yes (full songs with lyrics) | Yes (full songs with lyrics) | Limited (Stable Audio Open: no) |
| Max Duration | 30 seconds (default) | 4 minutes+ | 2+ minutes | 47 seconds (Open) / longer (commercial) |
| Architecture | Autoregressive Transformer | Proprietary | Proprietary | Latent Diffusion (DiT) |
| Melody Conditioning | Yes (chromagram-based) | No | No | No |
| Training Data | Licensed (ShutterStock, Pond5, Meta-owned) | Disputed (RIAA lawsuit filed July 2024) | Disputed (RIAA lawsuit filed July 2024) | Licensed (AudioSparx, Freesound for Open) |
| Local Deployment | Yes (full offline use) | No (cloud API only) | No (cloud API only) | Yes (Open version) |
| Stereo Output | Yes | Yes | Yes | Yes |
| Sample Rate | 32 kHz | Proprietary | Proprietary | 44.1 kHz |
MusicGen's primary strengths are its open-source availability, melody conditioning capability, and clean training data provenance. Its main limitations relative to commercial platforms like Suno and Udio are the lack of vocal generation and shorter default output duration. Suno and Udio produce full songs with vocals and lyrics, but both faced lawsuits from the Recording Industry Association of America (RIAA) in July 2024 alleging unauthorized use of copyrighted music in training.
Stable Audio Open, released by Stability AI in July 2024, uses a latent diffusion architecture rather than an autoregressive Transformer and generates variable-length stereo audio at 44.1 kHz. Its training data comes from Freesound and the Free Music Archive under Creative Commons licenses.
AudioCraft was released as an open-source project to encourage research and community development. The codebase is hosted at github.com/facebookresearch/audiocraft under the MIT license for code and CC-BY-NC 4.0 for model weights.
Since version 4.31.0 of the Hugging Face Transformers library, MusicGen and EnCodec have been available as first-class model implementations. This integration provides:
- Pretrained checkpoints hosted under the facebook/ namespace (e.g., facebook/musicgen-small, facebook/musicgen-medium, facebook/musicgen-large, facebook/musicgen-melody)
- Generation through the standard pipeline and AutoModel interfaces

The AUDIOCRAFT_CACHE_DIR environment variable can be set to control where model weights are cached locally.
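A minimal example of the Transformers integration, following the Hugging Face model card for facebook/musicgen-small; max_new_tokens controls output length in audio tokens (roughly 50 per second of audio).

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["lo-fi hip hop beat with mellow piano"],
    padding=True,
    return_tensors="pt",
)
# ~50 tokens per second of audio, so 256 tokens is roughly 5 seconds.
audio_values = model.generate(**inputs, max_new_tokens=256)
sampling_rate = model.config.audio_encoder.sampling_rate  # 32000 for MusicGen
```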
AudioCraft can be installed via pip:
```bash
pip install audiocraft
```
Basic usage for text-to-music generation with MusicGen:
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write  # helper for writing normalized audio files

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # seconds of audio to generate
descriptions = ['upbeat electronic dance track with heavy bass']
wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]
for idx, one_wav in enumerate(wav):  # save each clip with loudness normalization
    audio_write(f'output_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
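Melody conditioning uses the same model with a reference waveform loaded via torchaudio; the file path below is a placeholder.

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

# Reference track whose melodic shape should be followed (placeholder path).
melody, sr = torchaudio.load('reference_melody.wav')
wav = model.generate_with_chroma(
    ['lo-fi hip hop rework of the reference melody'],  # text descriptions
    melody[None],                                      # reference waveform(s), batched
    sr,                                                # sample rate of the reference
)
```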
The following papers describe the models and techniques within AudioCraft:
| Paper | Authors | Year | Venue |
|---|---|---|---|
| High Fidelity Neural Audio Compression (EnCodec) | Defossez, Copet, Synnaeve, Adi | 2022 | TMLR 2023 |
| AudioGen: Textually Guided Audio Generation | Kreuk, Synnaeve, Polyak, Singer, Defossez, Copet, Parikh, Taigman, Adi | 2022 | ICLR 2023 |
| Simple and Controllable Music Generation (MusicGen) | Copet, Kreuk, Gat, Remez, Kant, Synnaeve, Adi, Defossez | 2023 | NeurIPS 2023 |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | San Roman, Adi, Deleforge, Serizel, Synnaeve, Defossez | 2023 | NeurIPS 2023 |
| Masked Audio Generation using a Single Non-Autoregressive Transformer (MAGNeT) | Ziv, Gat, Le Lan, Remez, Kreuk, Defossez, Copet, Synnaeve, Adi | 2024 | ICLR 2024 |
Meta's research team has indicated several areas of ongoing and future work.
The MusicGen-Style variant, trained on 16,000 hours of licensed music, explores style-conditioned generation where users can provide a reference track to guide the overall musical style rather than just melody.