# AudioCraft

> Source: https://aiwiki.ai/wiki/audiocraft
> Updated: 2026-07-13
> Categories: Deep Learning, Generative AI, Meta AI, Speech & Audio AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Generative AI](/wiki/generative_ai), [Meta AI](/wiki/meta_ai), and [Deep Learning](/wiki/deep_learning)*

AudioCraft is an open-source generative-audio library released by [Meta AI](/wiki/meta_ai) (Fundamental AI Research, FAIR) on August 2, 2023 that generates high-quality music and sound from text prompts using a single deep-learning codebase.[6][9] It bundles three core models, MusicGen for text-to-music, AudioGen for text-to-sound, and the EnCodec neural audio codec that tokenizes audio for both, with the research framework and training code released under the permissive MIT license.[6][9] Meta described AudioCraft as "our simple framework that generates high-quality, realistic audio and music from text-based user inputs," trained on roughly 20,000 hours of licensed and Meta-owned music rather than scraped data.[9][1]

## Overview

AudioCraft is an open-source framework developed by [Meta AI](/wiki/meta_ai) for generating high-quality audio and music using [deep learning](/wiki/deep_learning). Released in August 2023, AudioCraft provides a unified codebase for audio processing and generation that bundles several state-of-the-art models under a single library.[6] The framework operates on raw audio signals rather than symbolic representations like MIDI or piano rolls, which allows it to capture the full expressive range of sound, including tonal nuance, timbre, and recording conditions.[6]

The core AudioCraft suite originally consisted of three models: [MusicGen](/wiki/musicgen) for text-to-music generation, AudioGen for text-to-sound-effect generation, and [EnCodec](/wiki/encodec) for neural audio compression.[6] Meta later expanded the framework to include MAGNeT (a non-autoregressive masked generative model) and Multi-Band Diffusion (an alternative diffusion-based decoder).[5][4] All of these components share the same underlying tokenization approach based on EnCodec, which converts continuous audio waveforms into sequences of discrete tokens that language-model-style architectures can process.[2]

AudioCraft's code is released under the MIT license, while the pretrained model weights are distributed under a CC-BY-NC 4.0 license.[7][9] The framework is available on GitHub at `facebookresearch/audiocraft` and has been integrated into [Hugging Face](/wiki/hugging_face) Transformers since version 4.31.0.[8]

## When was AudioCraft released?

Meta open-sourced AudioCraft on August 2, 2023, announcing it through both the Meta AI research blog and the Meta newsroom.[6][9] At launch, Meta stated: "Our audio research framework and training code is released under the MIT license to enable the broader community to reproduce and build on top of our work."[9] The release packaged models that had been published separately over the prior year (EnCodec in October 2022, AudioGen in September 2022, and MusicGen in June 2023) into one library, alongside an improved EnCodec decoder that, in Meta's words, "allows higher quality music generation with fewer artifacts."[9] Meta positioned the release as a research-oriented "one-stop shop for audio modeling," emphasizing reproducibility and community building over a consumer product.[6]

## Architecture and Core Concepts

AudioCraft's design revolves around a shared pipeline: encode raw audio into discrete tokens using EnCodec, model those token sequences with a [Transformer](/wiki/transformer)-based language model, and decode the predicted tokens back into audio.[1] This approach draws on advances in both neural audio compression and autoregressive sequence modeling.

### Token Interleaving Pattern

A key innovation in AudioCraft is its token interleaving strategy. EnCodec produces multiple parallel streams of tokens (codebooks) for each audio frame.[2] Rather than requiring separate models to handle each codebook (as in hierarchical or cascaded approaches), AudioCraft introduces a delay pattern that staggers the codebooks in time.[1] This allows a single Transformer decoder to predict all codebook tokens in one forward pass, with each codebook offset by a small delay relative to the previous one.[1] The result is that the model needs only 50 autoregressive steps per second of audio, regardless of the number of codebooks.[1]
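The staggering can be illustrated with a small sketch. It assumes the simplest form of the delay pattern, where codebook `k` is shifted right by `k` steps; the `PAD` sentinel and the function names are hypothetical and are not taken from the AudioCraft codebase. With this layout, a clip of `T` frames needs `T + K - 1` decoding steps for `K` codebooks, so the cost stays close to 50 steps per second of audio.

```python
import numpy as np

PAD = -1  # hypothetical sentinel for positions that hold no token


def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps. `codes` has shape (K, T)."""
    num_codebooks, num_frames = codes.shape
    total_steps = num_frames + num_codebooks - 1
    delayed = np.full((num_codebooks, total_steps), PAD, dtype=codes.dtype)
    for k in range(num_codebooks):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed


def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern and recover the (K, T) token grid."""
    num_codebooks, total_steps = delayed.shape
    num_frames = total_steps - num_codebooks + 1
    return np.stack([delayed[k, k:k + num_frames] for k in range(num_codebooks)])


def decoding_steps(seconds: float, frame_rate_hz: int = 50, num_codebooks: int = 4) -> int:
    """Number of autoregressive steps under the delay pattern."""
    return int(seconds * frame_rate_hz) + num_codebooks - 1


# Toy grid: 4 codebooks, 6 frames; each entry is 10 * codebook + frame.
codes = np.array([[10 * k + t for t in range(6)] for k in range(4)])
delayed = apply_delay_pattern(codes)
```

Each column of `delayed` is what a single decoder step would emit: one token per codebook, taken from slightly different frames.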

### Conditioning Mechanisms

AudioCraft models accept various forms of conditioning input. Text descriptions are encoded using a pretrained [T5](/wiki/t5) text encoder, and the resulting embeddings are fed into the Transformer through cross-attention layers.[1] MusicGen also supports melody conditioning through chromagram extraction (described in detail below).[1] Classifier-free guidance is used during inference to improve the fidelity of generated audio relative to the text prompt, where the model is trained to sometimes drop the conditioning signal so that it can learn both conditional and unconditional distributions.[1]
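As a rough sketch of classifier-free guidance, the two halves of the technique can be written in a few lines. The function names, the empty-string convention for a dropped prompt, and the guidance scale of 3.0 are illustrative assumptions, not values confirmed by the text above.

```python
import numpy as np


def drop_condition(text: str, drop_prob: float, rng: np.random.Generator) -> str:
    """Training side: replace the prompt with an empty one with probability drop_prob."""
    return "" if rng.random() < drop_prob else text


def guided_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray, scale: float) -> np.ndarray:
    """Inference side: extrapolate from the unconditional toward the conditional prediction."""
    return uncond_logits + scale * (cond_logits - uncond_logits)


# Toy next-token logits from a conditional and an unconditional forward pass.
cond = np.array([2.0, 1.0, 0.0])
uncond = np.array([1.0, 1.0, 1.0])
guided = guided_logits(cond, uncond, scale=3.0)  # hypothetical scale
```

A scale of 1.0 reproduces the conditional logits unchanged; larger values exaggerate whatever the text prompt changed relative to the unconditional prediction.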

## MusicGen

MusicGen is AudioCraft's text-to-music generation model. It was developed between April and May 2023 by a team at Meta's Fundamental AI Research (FAIR) lab that included Jade Copet, Felix Kreuk, and Itai Gat. The corresponding paper, "Simple and Controllable Music Generation," was presented at [NeurIPS](/wiki/neurips) 2023.[1]

### How does MusicGen work?

MusicGen is a single-stage autoregressive [Transformer](/wiki/transformer) decoder that generates music directly from text prompts.[1] Unlike prior systems such as Google's [MusicLM](/wiki/musiclm), MusicGen does not rely on a separate self-supervised semantic representation stage.[1] Instead, it operates over four codebooks produced by a 32 kHz EnCodec tokenizer sampled at 50 Hz.[1] The delay-based interleaving pattern allows all four codebooks to be generated in parallel within a single model pass.[1] As the paper puts it, MusicGen is "a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens," eliminating "the need for cascading several models."[1]

During training, the model learns to predict the next set of audio tokens given the previous tokens and a text conditioning signal.[1] At inference time, tokens are generated autoregressively and then decoded back to a waveform using the EnCodec decoder.[1]

### Model Variants and Sizes

MusicGen is available in several configurations that trade off quality against computational cost:

| Model | Parameters | Description | Conditioning |
|---|---|---|---|
| MusicGen Small | 300M | Lightweight model suitable for experimentation and fast inference | Text |
| MusicGen Medium | 1.5B | Balanced quality and speed; recommended starting point | Text |
| MusicGen Large | 3.3B | Highest quality text-to-music generation | Text |
| MusicGen Melody | 1.5B | Supports melody-guided generation using chromagram conditioning | Text + Melody |
| MusicGen Melody-Large | 3.3B | Large-scale melody-guided generation | Text + Melody |
| MusicGen Stereo (Small) | 300M | Generates stereophonic audio | Text |
| MusicGen Stereo (Medium) | 1.5B | Generates stereophonic audio | Text |
| MusicGen Stereo (Large) | 3.3B | Generates stereophonic audio | Text |
| MusicGen Stereo Melody | 1.5B | Stereo with melody conditioning | Text + Melody |
| MusicGen Stereo Melody-Large | 3.3B | Stereo with melody conditioning | Text + Melody |

AudioCraft's own documentation identifies the Medium and Melody variants as the best trade-off between quality and computational requirements for most use cases.

### Melody Conditioning

One of MusicGen's distinctive features is its ability to condition generation on a melody extracted from a reference audio track.[1] The process works as follows:

1. **Source separation**: A music source separation model decomposes the reference audio into drums, bass, vocals, and residual components. The drums and bass are discarded, leaving only the melodic content.[1]
2. **Chromagram extraction**: A 12-pitch-class chromagram is computed from the residual waveform using a window size of $$2^{14}$$ and a hop size of $$2^{12}$$. Chromagrams capture harmonic and melodic characteristics while being robust to changes in instrumentation and timbre.[1]
3. **Dominant pitch filtering**: At each time step, the dominant pitch class is extracted from the chromagram via argmax. This filtering step prevents the model from simply reconstructing the original audio and forces it to capture only the broad melodic contour.[1]
4. **Conditioning integration**: The quantized chromagram is fed to the Transformer through dedicated conditioning layers, guiding the generation to follow the extracted melody while remaining faithful to the text description.[1]
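Steps 2 and 3 above can be sketched with plain NumPy. This is a simplified approximation, not the chromagram implementation MusicGen uses: it skips source separation entirely, assigns each FFT bin to its nearest equal-tempered pitch class, and the function names and the 27.5 Hz to 4186 Hz range are assumptions chosen for illustration.

```python
import numpy as np


def dominant_pitch_classes(wave, sample_rate, n_fft=2**14, hop=2**12):
    """Return the argmax pitch class (0 = C, ..., 11 = B) for each analysis frame."""
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Keep bins in a piano-like range (assumed bounds) and map them to pitch classes.
    valid = (freqs >= 27.5) & (freqs <= 4186.0)
    midi = 69 + 12 * np.log2(freqs[valid] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        power = np.abs(np.fft.rfft(wave[start:start + n_fft] * window)) ** 2
        # 12-bin chromagram frame: total power per pitch class.
        chroma = np.bincount(pitch_class, weights=power[valid], minlength=12)
        frames.append(int(np.argmax(chroma)))
    return np.array(frames)


def one_hot_chroma(classes):
    """Quantized chromagram: one active pitch class per frame."""
    return np.eye(12)[classes]


# Two seconds of a 440 Hz sine (the note A) sampled at 32 kHz.
sample_rate = 32000
t = np.arange(2 * sample_rate) / sample_rate
classes = dominant_pitch_classes(np.sin(2 * np.pi * 440.0 * t), sample_rate)
quantized = one_hot_chroma(classes)
```

Because only one pitch class survives per frame, the conditioning signal carries the melodic contour but very little information about timbre or exact audio content.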

This approach allows users to hum a melody or provide a reference track, and MusicGen will produce a new composition in a different style or instrumentation that follows the same melodic shape.

### Stereo Generation

MusicGen's architecture naturally extends to stereophonic music. The stereo models obtain two streams of tokens from a stereo-capable EnCodec model and interleave them using a variant of the delay pattern that alternates between left and right channel codebooks.[1] Stereo generation adds no extra computational cost at either training or inference time compared to a mono model of the same parameter count. The stereo models were fine-tuned for 200,000 update steps starting from the pretrained mono checkpoints.

### Training Data

MusicGen was trained on approximately 20,000 hours of licensed music comprising:[1]

- An internal Meta dataset of 10,000 high-quality music tracks
- Licensed music from the [ShutterStock](https://www.shutterstock.com/music) music collection
- Licensed music from the [Pond5](https://www.pond5.com/) music collection

All training data was either owned by Meta or specifically licensed for this purpose. The dataset included text descriptions and metadata for each track, amounting to roughly 400,000 recordings when counting individual segments. The music was predominantly instrumental and sampled at 32 kHz.[1]

## AudioGen

AudioGen is AudioCraft's text-to-audio generation model, focused on environmental sounds and sound effects rather than music. The original AudioGen paper, "AudioGen: Textually Guided Audio Generation" by Felix Kreuk and colleagues, was published in September 2022 and presented at ICLR 2023, then later integrated into the AudioCraft framework.[3]

### Architecture

AudioGen follows the same general architecture as MusicGen but operates on a 16 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.[3] The released model has 1.5 billion parameters. Like MusicGen, it uses a [T5](/wiki/t5) text encoder for conditioning and employs classifier-free guidance during inference.[3]

### Training Data and Approach

Unlike MusicGen, which relies on a curated licensed music dataset, AudioGen was trained on a combination of publicly available and licensed audio datasets.[3] The training corpus includes:

| Dataset | Type |
|---|---|
| [AudioSet](/wiki/audioset) | Multi-label audio event classification |
| AudioCaps | Audio captioning with natural language descriptions |
| Clotho v2 | Audio captioning dataset |
| VGG-Sound | Audio-visual dataset |
| FSD50K | Freesound Dataset 50K, general-purpose audio events |
| Sonniss Game Effects | Game sound effects |
| WeSoundEffects | Sound effects library |
| Paramount Motion Odeon Cinematic Sound Effects | Cinematic sound effects |
| Free To Use Sounds | Freely licensed sound effects |
| BBC Sound Effects | Public sound effects archive |

All audio files were resampled to 16 kHz. The paper introduced a data augmentation technique that mixes different audio samples together during training, which encourages the model to learn how to internally separate multiple sound sources.[3] This improves the model's ability to generate complex acoustic scenes containing multiple overlapping sounds.[3]
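The mixing augmentation can be sketched as follows. The details here are assumptions made for illustration: the random offset, the relative gain range of plus or minus 5 dB, and joining the two captions with "and" are not specified in the text above, and the function name is hypothetical.

```python
import numpy as np


def mix_examples(wave_a, text_a, wave_b, text_b, rng, gain_db_range=(-5.0, 5.0)):
    """Overlay clip B on clip A at a random offset and relative gain; merge the captions."""
    length = max(len(wave_a), len(wave_b))
    a = np.zeros(length)
    a[:len(wave_a)] = wave_a
    b = np.zeros(length)
    offset = int(rng.integers(0, length - len(wave_b) + 1))
    b[offset:offset + len(wave_b)] = wave_b
    gain = 10 ** (rng.uniform(*gain_db_range) / 20)  # dB to linear amplitude
    return a + gain * b, f"{text_a} and {text_b}"


rng = np.random.default_rng(0)
sample_rate = 16000
dog = np.sin(2 * np.pi * 300.0 * np.arange(sample_rate) / sample_rate)  # 1 s stand-in clip
rain = rng.normal(scale=0.1, size=sample_rate // 2)                     # 0.5 s stand-in clip
mixed_wave, mixed_text = mix_examples(dog, "a dog barking", rain, "rain falling", rng)
```

Training on such pairs exposes the model to prompts that describe more than one source, which is the setting the augmentation is meant to improve.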

### Capabilities

AudioGen can produce a wide range of environmental sounds from text descriptions, including:[3]

- Animal sounds (dogs barking, birds singing, cats meowing)
- Vehicle and traffic noises (car horns, engine sounds, sirens)
- Weather sounds (rain, thunder, wind)
- Domestic sounds (footsteps on wood, door creaking, keyboard typing)
- Urban environments (crowd chatter, construction, street ambience)
- Nature scenes (river flowing, forest ambience, ocean waves)

## EnCodec

EnCodec is the neural audio codec that serves as the foundation for all other models in AudioCraft. Introduced in the paper "High Fidelity Neural Audio Compression" by Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi (published on arXiv in October 2022 and later in Transactions on Machine Learning Research in September 2023), EnCodec provides real-time, high-fidelity audio compression and reconstruction using deep learning.[2]

### Architecture

EnCodec follows an encoder-quantizer-decoder paradigm:[2]

1. **Encoder**: The encoder is built on the SEANet architecture, consisting of a series of 1D convolutional residual blocks with strided downsampling. A two-layer [LSTM](/wiki/rnn) follows the convolutional stack to improve sequence-level modeling. The encoder takes a raw audio waveform as input and produces a continuous latent representation.[2]

2. **Residual Vector [Quantization](/wiki/quantization) (RVQ)**: The continuous latent representation is discretized using Residual Vector Quantization. RVQ works by applying a sequence of vector quantization steps, where each step quantizes the residual (error) left by the previous step. This produces multiple parallel streams of discrete tokens (codebooks), each capturing a different level of detail. The first codebook captures the coarsest information, while subsequent codebooks refine the representation progressively. A key advantage of RVQ is that the embedding dimensionality remains constant regardless of the target bitrate; the number of active codebooks simply changes.[2]

3. **Decoder**: The decoder mirrors the encoder with a symmetric stack of transposed convolutional layers and residual blocks that upsample the quantized latent representation back into a waveform.[2]
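The quantizer step in the list above can be illustrated with a minimal residual vector quantization sketch. The codebooks here are random rather than learned, and their sizes and scales are arbitrary assumptions; a trained codec would fit them to the encoder's output distribution.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x with a cascade of codebooks; each stage encodes the previous residual."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected code vectors; passing fewer indices gives a coarser reconstruction."""
    return sum(codebooks[stage][idx] for stage, idx in enumerate(indices))


rng = np.random.default_rng(0)
dim, num_codes = 8, 16
# Later codebooks use smaller scales so that they model finer residual detail.
codebooks = [rng.normal(scale=0.5 ** stage, size=(num_codes, dim)) for stage in range(4)]
latent = rng.normal(size=dim)

indices, final_residual = rvq_encode(latent, codebooks)
full = rvq_decode(indices, codebooks)
coarse = rvq_decode(indices[:2], codebooks)  # lower bitrate: first two codebooks only
```

The vector dimension never changes as stages are added or dropped, which is why the number of active codebooks can be varied freely to hit a target bitrate.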

### Supported Configurations

EnCodec supports two primary sampling rates:

| Configuration | Sample Rate | Channels | Bitrates | Codebooks | Use Case |
|---|---|---|---|---|---|
| EnCodec 24kHz | 24,000 Hz | Mono | 1.5, 3, 6, 12, 24 kbps | Up to 32 | Speech, general audio |
| EnCodec 48kHz | 48,000 Hz | Stereo | 3, 6, 12, 24 kbps | Up to 32 | High-fidelity music |

These two configurations are the general-purpose checkpoints released with the EnCodec paper. Within AudioCraft, MusicGen and MAGNeT instead use an EnCodec tokenizer operating on 32 kHz audio, and AudioGen uses one operating on 16 kHz audio, each with 4 codebooks at a 50 Hz frame rate.[1][3][5] Structured quantization dropout is applied during training, where codebooks are randomly masked, enabling a single model to operate at multiple bitrates without architectural changes.[2]
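The bitrates in the table follow from simple arithmetic: codebooks per frame, times bits per code, times frames per second. The sketch below assumes a 75 Hz frame rate and 1,024 entries per codebook for the 24 kHz model; both values come from the EnCodec paper rather than from the text above, and the function name is hypothetical.

```python
import math


def encodec_bitrate_kbps(num_codebooks: int, frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of an RVQ token stream in kilobits per second."""
    bits_per_code = math.log2(codebook_size)
    return num_codebooks * bits_per_code * frame_rate_hz / 1000


FRAME_RATE_HZ = 75    # assumed frame rate of the 24 kHz model
CODEBOOK_SIZE = 1024  # assumed entries per codebook (10 bits per code)

bitrates = {n: encodec_bitrate_kbps(n, FRAME_RATE_HZ, CODEBOOK_SIZE) for n in (2, 4, 8, 16, 32)}
```

Under these assumptions, using 2, 4, 8, 16, or 32 codebooks yields 1.5, 3, 6, 12, and 24 kbps, which is how a single model trained with quantization dropout can serve every bitrate in the 24 kHz row.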

### Training Objective

EnCodec is trained end-to-end with a composite loss function:[2]

- **Time-domain reconstruction loss**: Measures the difference between the original and reconstructed waveforms in the time domain.
- **Multi-resolution spectral loss**: Compares the spectrograms of the original and reconstructed audio at multiple resolutions.
- **Adversarial loss**: A multi-scale Short-Time Fourier Transform (STFT) discriminator provides adversarial training signal, pushing the decoder to produce waveforms that are perceptually indistinguishable from real audio.
- **Loss balancer**: A novel mechanism that automatically adjusts the relative weights of different loss components to stabilize training.

EnCodec achieves better perceptual quality than classical codecs such as Opus and EVS, as well as earlier neural codecs like Google's SoundStream, across a range of bitrates and audio types.[2]

## MAGNeT

MAGNeT (Masked Audio Generation using a Single Non-Autoregressive Transformer) is a newer addition to AudioCraft, introduced in January 2024 by Alon Ziv, Itai Gat, and colleagues. The paper was presented at [ICLR](/wiki/iclr) 2024.[5]

### Architecture and Approach

Unlike MusicGen and AudioGen, which generate tokens autoregressively (one step at a time from left to right), MAGNeT uses a non-autoregressive approach based on masked token prediction.[5] During training, spans of tokens are masked according to a masking scheduler, and the model learns to predict the masked tokens given the surrounding context.[5] During inference, the model starts from a fully masked sequence and gradually fills in tokens over several decoding steps, refining its predictions iteratively.[5]

MAGNeT operates over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz, the same tokenization scheme used by MusicGen.[5]
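The inference loop described above can be sketched with a toy stand-in for the Transformer. This is a deliberately simplified, token-level version: the cosine schedule, the confidence-based re-masking rule, and the random `toy_model` are assumptions for illustration, and the real model works on spans and on multiple codebooks.

```python
import math
import numpy as np

MASK = -1  # hypothetical sentinel for a masked position


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the Transformer: random per-position probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def iterative_masked_decode(seq_len, vocab_size, num_steps, rng):
    """Start fully masked; each step predicts all positions, then re-masks the least confident."""
    tokens = np.full(seq_len, MASK)
    for step in range(1, num_steps + 1):
        probs = toy_model(tokens, vocab_size, rng)
        masked = tokens == MASK
        # Fill every masked position in parallel with its most likely token.
        tokens = np.where(masked, probs.argmax(axis=1), tokens)
        # Cosine schedule: how many positions stay masked after this step.
        num_to_mask = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        if num_to_mask > 0:
            # Tokens fixed in earlier steps get infinite confidence and are never re-masked.
            confidence = np.where(masked, probs.max(axis=1), np.inf)
            tokens[np.argsort(confidence)[:num_to_mask]] = MASK
    return tokens


decoded = iterative_masked_decode(seq_len=50, vocab_size=16, num_steps=8,
                                  rng=np.random.default_rng(0))
```

The number of model calls equals `num_steps` rather than the sequence length, which is where the speed advantage over autoregressive decoding comes from.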

### Key Innovations

- **Span masking**: Rather than masking individual tokens, MAGNeT masks contiguous spans, which aligns better with the temporal structure of audio.[5]
- **Hybrid rescoring**: A novel rescoring method leverages an external pretrained model to rank and rescore MAGNeT's predictions, improving output quality.[5]
- **Parallel decoding**: Because the model is non-autoregressive, all tokens in the sequence can be predicted simultaneously in each decoding step, leading to substantial speed improvements.[5]
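Span masking, the first item above, differs from token-level masking only in how the mask is drawn. The sketch below assumes a fixed span length of 3 frames and uniformly sampled span starts; both choices and the function name are illustrative assumptions.

```python
import numpy as np


def span_mask(seq_len: int, span_len: int, num_spans: int, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask built from contiguous spans; overlapping spans merge into longer runs."""
    # Restrict start positions so that every span fits inside the sequence.
    starts = rng.choice(seq_len - span_len + 1, size=num_spans, replace=False)
    mask = np.zeros(seq_len, dtype=bool)
    for start in starts:
        mask[start:start + span_len] = True
    return mask


# One second of tokens at 50 Hz, with five spans of three frames each.
mask = span_mask(seq_len=50, span_len=3, num_spans=5, rng=np.random.default_rng(0))
```

Because neighbouring audio frames are highly correlated, hiding a whole span removes the easy shortcut of copying an adjacent unmasked token.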

### Model Variants

| Model | Parameters | Duration | Domain |
|---|---|---|---|
| MAGNeT Small (music) | 300M | 10 or 30 seconds (separate checkpoints) | Text-to-music |
| MAGNeT Medium (music) | 1.5B | 10 or 30 seconds (separate checkpoints) | Text-to-music |
| Audio-MAGNeT Small | 300M | 10 seconds | Text-to-audio |
| Audio-MAGNeT Medium | 1.5B | 10 seconds | Text-to-audio |

### How much faster is MAGNeT than MusicGen?

MAGNeT generates audio up to 7 times faster than autoregressive baselines of comparable quality.[5][10] In benchmarks, its output quality is comparable to MusicGen and AudioGen while requiring significantly fewer decoding steps.[5] The speed advantage makes MAGNeT particularly suitable for interactive applications where low latency matters, since its parallel decoding can produce 30-second compositions in a fraction of a second.[10]

MAGNeT was trained between November 2023 and January 2024 on roughly 16,000 hours of licensed music (an internal dataset of 10,000 high-quality tracks plus ShutterStock and Pond5 data), sampled at 32 kHz.[5][10]

## Multi-Band Diffusion (MBD)

Multi-Band Diffusion is an alternative decoder for EnCodec tokens that uses a [diffusion model](/wiki/diffusion_models) approach instead of the standard convolutional decoder.[4] The paper, "From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion" by Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, and Alexandre Defossez, was presented at [NeurIPS](/wiki/neurips) 2023.[4]

### How It Works

MBD consists of a collection of four diffusion models, each responsible for generating audio in a different frequency band.[4] The models are conditioned on the embeddings extracted from a pretrained EnCodec model.[4] By splitting the generation task across multiple frequency bands, each sub-model can specialize in reconstructing a particular range of the audio spectrum, leading to higher overall fidelity.[4]
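The band decomposition itself can be illustrated independently of the diffusion models. The sketch below splits a waveform into four bands with ideal FFT masks; the band edges and the function name are arbitrary placeholders, and the filters MBD actually uses may differ.

```python
import numpy as np


def split_into_bands(wave, sample_rate, edges_hz):
    """Split a waveform into len(edges_hz) + 1 frequency bands that sum back to the input."""
    spectrum = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    bounds = [0.0, *edges_hz, sample_rate / 2 + 1.0]
    bands = []
    for low, high in zip(bounds[:-1], bounds[1:]):
        # Keep only the FFT bins that fall inside this band.
        band_spectrum = np.where((freqs >= low) & (freqs < high), spectrum, 0)
        bands.append(np.fft.irfft(band_spectrum, n=len(wave)))
    return bands


rng = np.random.default_rng(0)
sample_rate = 24000
wave = rng.normal(size=sample_rate)  # one second of noise as a stand-in signal
bands = split_into_bands(wave, sample_rate, edges_hz=[500.0, 2000.0, 6000.0])  # placeholder edges
```

In MBD each band would be produced by its own diffusion model conditioned on EnCodec embeddings, and the per-band outputs are summed to form the final waveform.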

### Benefits

The primary advantage of MBD over the standard EnCodec decoder is reduced audio artifacts.[4] Music and sound effects generated through MBD exhibit fewer metallic or buzzy artifacts that can occur with the purely convolutional decoder, especially at lower bitrates.[4] The trade-off is increased computational cost, since running four diffusion models is more expensive than a single decoder pass. MBD is available as a toggle in AudioCraft's MusicGen API and demo interface.

At equal bitrate, MBD outperforms other generative decoding approaches in perceptual quality metrics across speech, music, and environmental sound modalities.[4]

## Training Data and Licensing

Meta took a deliberate approach to training data for AudioCraft, relying on licensed and owned content rather than scraped data.[6] This distinguishes AudioCraft from several competing platforms that have faced legal challenges over training data provenance.

| Component | Training Data Sources | Hours | Sample Rate |
|---|---|---|---|
| [MusicGen](/wiki/musicgen) | Meta Music Initiative, ShutterStock, Pond5 | ~20,000 | 32 kHz |
| AudioGen | AudioSet, AudioCaps, Clotho, VGG-Sound, FSD50K, BBC Sound Effects, and others | Varies per dataset | 16 kHz |
| [EnCodec](/wiki/encodec) | Diverse audio corpora (speech, music, environmental sound) | Not publicly disclosed | 24 kHz / 48 kHz |
| MAGNeT | Meta Music Initiative, ShutterStock, Pond5 (music); AudioSet-based (audio) | ~16,000 (music) | 32 kHz |

The Meta Music Initiative Sound Collection is an internal library of high-quality instrumental tracks. ShutterStock and Pond5 are commercial stock music platforms from which Meta obtained explicit licenses for training. The training data is acknowledged to have limitations, including a bias toward Western-style music and English-language metadata.

### Is AudioCraft open source?

Yes. AudioCraft's research framework and training code are released under the permissive MIT license, while the pretrained model weights are distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) license, which restricts the weights to non-commercial research and development with attribution.[7][9] In its launch announcement Meta wrote that the code was released "under the MIT license to enable the broader community to reproduce and build on top of our work," framing the dual-license split as a balance between open research and responsible use of generative-audio weights.[9] The full library, including model definitions, training recipes, and inference code, is hosted at `github.com/facebookresearch/audiocraft`.[7]

## Comparison with Other AI Music Generation Systems

The AI music generation landscape has grown significantly since AudioCraft's release. The following table compares AudioCraft's MusicGen with other prominent systems:

| Feature | MusicGen (AudioCraft) | [Suno](/wiki/suno) | [Udio](/wiki/udio) | [Stable Audio](/wiki/stable_audio) (Stability AI) |
|---|---|---|---|---|
| Developer | [Meta AI](/wiki/meta_ai) (FAIR) | Suno Inc. | Udio Inc. | [Stability AI](/wiki/stability_ai) |
| Release | June 2023 | December 2023 | April 2024 | March 2024 |
| Open Source | Yes (MIT license for code) | No | No | Partially (Stable Audio Open) |
| Vocals | No (instrumental only) | Yes (full songs with lyrics) | Yes (full songs with lyrics) | Limited (Stable Audio Open: no) |
| Max Duration | 30 seconds (default) | 4 minutes+ | 2+ minutes | 47 seconds (Open) / longer (commercial) |
| Architecture | Autoregressive Transformer | Proprietary | Proprietary | Latent Diffusion (DiT) |
| Melody Conditioning | Yes (chromagram-based) | No | No | No |
| Training Data | Licensed (ShutterStock, Pond5, Meta-owned) | Disputed (RIAA lawsuit filed June 2024) | Disputed (RIAA lawsuit filed June 2024) | Licensed (AudioSparx, Freesound for Open) |
| Local Deployment | Yes (full offline use) | No (cloud API only) | No (cloud API only) | Yes (Open version) |
| Stereo Output | Yes | Yes | Yes | Yes |
| Sample Rate | 32 kHz | Proprietary | Proprietary | 44.1 kHz |

MusicGen's primary strengths are its open-source availability, melody conditioning capability, and clean training data provenance. Its main limitations relative to commercial platforms like Suno and Udio are the lack of vocal generation and shorter default output duration. Suno and Udio produce full songs with vocals and lyrics, but both were sued in June 2024 by major record labels, in actions coordinated by the Recording Industry Association of America (RIAA), alleging unauthorized use of copyrighted music in training.

Stable Audio Open, released by [Stability AI](/wiki/stability_ai) in June 2024, uses a latent diffusion architecture rather than an autoregressive Transformer and generates variable-length stereo audio at 44.1 kHz. Its training data comes from Freesound and the Free Music Archive under Creative Commons licenses.

## Open-Source Release and Hugging Face Integration

AudioCraft was released as an open-source project to encourage research and community development. The codebase is hosted at [github.com/facebookresearch/audiocraft](https://github.com/facebookresearch/audiocraft) under the MIT license for code and CC-BY-NC 4.0 for model weights.[7]

### Hugging Face Support

Since version 4.31.0 of the [Hugging Face](/wiki/hugging_face) Transformers library, MusicGen and EnCodec have been available as first-class model implementations.[8] This integration provides:

- Pretrained model hosting on the Hugging Face Model Hub under the `facebook/` namespace (e.g., `facebook/musicgen-small`, `facebook/musicgen-medium`, `facebook/musicgen-large`, `facebook/musicgen-melody`)
- Standardized APIs compatible with the Transformers `pipeline` and `AutoModel` interfaces
- Easy model loading with minimal dependencies
- Support for inference on both CPU and GPU
- Integration with Hugging Face Spaces for interactive demos
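A minimal sketch of the Transformers route is shown below. It uses the `AutoProcessor` and `MusicgenForConditionalGeneration` classes from the Transformers library; the helper names `tokens_for_duration` and `generate_with_transformers` are hypothetical, and the 50 Hz frame rate is taken from the MusicGen description earlier in this article. Calling the generation helper downloads the checkpoint on first use.

```python
def tokens_for_duration(seconds: float, frame_rate_hz: int = 50) -> int:
    """Convert a target clip length into a token budget for generation."""
    return int(seconds * frame_rate_hz)


def generate_with_transformers(prompt: str, seconds: float = 5.0):
    """Generate one mono clip from a text prompt; returns (samples, sampling_rate)."""
    # Imported here so the helper above works even without transformers installed.
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    audio = model.generate(**inputs, max_new_tokens=tokens_for_duration(seconds))

    sampling_rate = model.config.audio_encoder.sampling_rate
    # Output shape is (batch, channels, samples); take the first clip and channel.
    return audio[0, 0].numpy(), sampling_rate
```

For example, `generate_with_transformers("lo-fi beat with soft piano", seconds=5.0)` requests 250 tokens, which corresponds to roughly five seconds of audio.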

When models are loaded through the `audiocraft` library itself, the `AUDIOCRAFT_CACHE_DIR` environment variable can be set to control where model weights are cached locally.

### Installation and Usage

AudioCraft can be installed via pip:

```
pip install audiocraft
```

Basic usage for text-to-music generation with MusicGen:

```python
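from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained checkpoint; the melody model also accepts text-only prompts.
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # clip length in seconds

descriptions = ['upbeat electronic dance track with heavy bass']
wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]

# Save the first clip as 'output.wav' with loudness normalization.
audio_write('output', wav[0].cpu(), model.sample_rate, strategy='loudness')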
```

## Research Papers

The following papers describe the models and techniques within AudioCraft:

| Paper | Authors | Year | Venue |
|---|---|---|---|
| High Fidelity Neural Audio Compression (EnCodec) | Defossez, Copet, Synnaeve, Adi | 2022 | TMLR 2023 |
| AudioGen: Textually Guided Audio Generation | Kreuk, Synnaeve, Polyak, Singer, Defossez, Copet, Parikh, Taigman, Adi | 2022 | ICLR 2023 |
| Simple and Controllable Music Generation (MusicGen) | Copet, Kreuk, Gat, Remez, Kant, Synnaeve, Adi, Defossez | 2023 | NeurIPS 2023 |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | San Roman, Adi, Deleforge, Serizel, Synnaeve, Defossez | 2023 | NeurIPS 2023 |
| Masked Audio Generation using a Single Non-Autoregressive Transformer (MAGNeT) | Ziv, Gat, Le Lan, Remez, Kreuk, Defossez, Copet, Synnaeve, Adi | 2024 | ICLR 2024 |

## Future Directions

Meta's research team has indicated several areas of ongoing and future work:

- Improving controllability of generative models for audio, including finer-grained conditioning on rhythm, chord progressions, and dynamics
- Exploring additional conditioning modalities beyond text and melody
- Extending models to capture longer-range dependencies for generating full-length compositions
- Increasing generation speed and computational efficiency
- Addressing dataset limitations around diversity and bias toward Western music traditions
- Investigating vocal generation capabilities

The MusicGen-Style variant, trained on 16,000 hours of licensed music, explores style-conditioned generation where users can provide a reference track to guide the overall musical style rather than just melody.

## References

1. Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., & Defossez, A. (2023). "Simple and Controllable Music Generation." *Advances in Neural Information Processing Systems (NeurIPS 2023)*. arXiv:2306.05284.
2. Defossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). "High Fidelity Neural Audio Compression." *Transactions on Machine Learning Research (2023)*. arXiv:2210.13438.
3. Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Defossez, A., Copet, J., Parikh, D., Taigman, Y., & Adi, Y. (2022). "AudioGen: Textually Guided Audio Generation." *International Conference on Learning Representations (ICLR 2023)*. arXiv:2209.15352.
4. San Roman, R., Adi, Y., Deleforge, A., Serizel, R., Synnaeve, G., & Defossez, A. (2023). "From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion." *Advances in Neural Information Processing Systems (NeurIPS 2023)*. arXiv:2308.02560.
5. Ziv, A., Gat, I., Le Lan, G., Remez, T., Kreuk, F., Defossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2024). "Masked Audio Generation using a Single Non-Autoregressive Transformer." *International Conference on Learning Representations (ICLR 2024)*. arXiv:2401.04577.
6. Meta AI. "AudioCraft: A simple one-stop shop for audio modeling." Meta AI Blog, August 2, 2023. https://ai.meta.com/blog/audiocraft-musicgen-audiogen-encodec-generative-ai-audio/
7. AudioCraft GitHub Repository. https://github.com/facebookresearch/audiocraft
8. Hugging Face MusicGen Documentation. https://huggingface.co/docs/transformers/model_doc/musicgen
9. Meta. "Introducing AudioCraft: A Generative AI Tool For Audio and Music." Meta Newsroom, August 2, 2023. https://about.fb.com/news/2023/08/audiocraft-generative-ai-for-music-and-audio/
10. Meta AI. "AudioCraft" (models and libraries page) and MAGNeT model card, facebookresearch/audiocraft. https://ai.meta.com/resources/models-and-libraries/audiocraft/