# MusicGen

> Source: https://aiwiki.ai/wiki/musicgen
> Updated: 2026-06-23
> Categories: Generative AI, Meta AI, Music & Audio Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**MusicGen** is an open-weights text-to-music generation model from [Meta AI](/wiki/meta_ai)'s Fundamental AI Research (FAIR) team, released on 8 June 2023, that generates roughly 30-second music clips from a text prompt, a reference melody, or both. It is a single-stage autoregressive [transformer](/wiki/transformer) [language model](/wiki/language_model) that predicts discrete audio tokens produced by Meta's neural audio codec [EnCodec](/wiki/encodec), and its design contribution is to show that high-quality, controllable [music generation](/wiki/ai_music_generation) needs only one transformer rather than a cascade of models.[^1][^2] The original release shipped three model sizes (300M, 1.5B, and 3.3B parameters), code under the MIT license, and weights under CC-BY-NC 4.0, making it one of the most widely used open music models of 2023-2024.[^5][^6]

MusicGen was described in the paper "Simple and Controllable Music Generation" (Copet et al., arXiv:2306.05284, accepted to [NeurIPS](/wiki/neurips) 2023), whose abstract states that the model "is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models."[^1] It was released alongside model weights on Hugging Face and a demo, and was subsequently incorporated into the [AudioCraft](/wiki/audiocraft) PyTorch library that Meta open-sourced on 2 August 2023.[^3][^4] In contrast to several contemporaneous music systems, including [Google](/wiki/google)'s [MusicLM](/wiki/musiclm), which relied on cascaded models and self-supervised semantic representations, MusicGen achieved comparable quality with a single transformer through an efficient "delay" interleaving pattern over EnCodec's residual codebooks.[^1][^2] Meta emphasised that the music used for training had been licensed from Shutterstock, Pond5, and Meta's own internal sound collection.[^5][^6]

MusicGen became one of the most influential open-weights music generation systems of the 2023-2024 period. It established the template of "EnCodec tokens plus an autoregressive transformer" that subsequent Meta models such as MAGNeT, MusicGen-Style, and JASCO built upon, and the released weights are widely used as a baseline in academic work and as the basis for derivative tools, fine-tunes, and community projects.[^7][^8][^9][^10]

## Key facts

| | |
|---|---|
| **Developer** | [Meta AI](/wiki/meta_ai) (FAIR) |
| **Initial release** | 8 June 2023 (paper and weights); included in AudioCraft library 2 August 2023[^1][^3] |
| **Open weights** | Yes: model weights under CC-BY-NC 4.0; code under MIT[^5][^6] |
| **Architecture** | Single-stage autoregressive [transformer](/wiki/transformer) decoder over discrete EnCodec audio tokens, with text conditioning from a frozen [T5](/wiki/t5) encoder[^1][^11] |
| **Model sizes** | 300M (Small), 1.5B (Medium), 3.3B (Large); 1.5B and 3.3B melody variants; later stereo and style variants[^5][^12] |
| **Audio tokenizer** | EnCodec at 32 kHz, four residual codebooks at 50 Hz (50 autoregressive steps per second of audio)[^1][^5] |
| **Training data** | Approximately 20,000 hours (~400,000 tracks) of licensed music from Shutterstock, Pond5, and the Meta Music Initiative Sound Collection[^4][^5] |
| **MusicCaps (Large)** | Frechet Audio Distance 5.48; KL divergence 1.37; CLAP text consistency 0.28[^5] |
| **Paper** | Copet et al., "Simple and Controllable Music Generation," arXiv:2306.05284, [NeurIPS](/wiki/neurips) 2023[^1] |
| **Repository** | `facebookresearch/audiocraft` on GitHub[^6] |

## What is MusicGen used for?

MusicGen generates short instrumental music clips, typically up to about 30 seconds, from natural-language descriptions of genre, mood, instrumentation, and tempo, and it can additionally follow the harmonic and melodic outline of a reference audio clip.[^1][^2] Typical uses are research on controllable audio generation, prototyping background or stock-style music, building derivative tools and fine-tunes, and serving as a standard baseline against which newer text-to-music models are compared.[^11][^23] Because vocals were removed from its training data, MusicGen produces instrumental music and, per Meta's model card, "is not able to generate realistic vocals."[^5]

## Background

MusicGen sits in a research lineage at Meta that runs from neural audio compression to general-purpose audio generation. The immediate technical predecessor is [EnCodec](/wiki/encodec), a high-fidelity neural audio codec introduced in October 2022 by Alexandre Defossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, all of whom are also authors of MusicGen.[^13] EnCodec uses a streaming convolutional encoder-decoder with residual vector quantisation, achieving roughly ten times the compression of MP3 at comparable perceived quality, and it produces the discrete tokens that MusicGen learns to predict.[^13]

A second strand is AudioGen, Meta's text-to-sound model for environmental and effect sounds, which paired EnCodec tokens with a transformer language model. AudioGen and EnCodec together motivated the question of whether a similar recipe could scale to music, a domain that requires longer-range structure, harmonic coherence, and finer control over the generation process.[^14]

The most immediate external precedent was [MusicLM](/wiki/musiclm), released by [Google](/wiki/google) in January 2023, which generated music from text using a cascade of language models operating at different abstraction levels: semantic tokens from a self-supervised audio model, acoustic tokens from a neural codec, and text conditioning through a separate joint music-text embedding model called MuLan. MusicLM was widely covered, but Google initially declined to release weights, citing copyright and ethical concerns.[^15] MusicGen explicitly framed itself against this design: the authors argued that "simple and controllable" generation could be achieved with one transformer over EnCodec tokens, dispensing with the semantic stage and cascade.[^1] Other contemporaries the paper benchmarks against include Riffusion, Mousai, and Google's Noise2Music.[^1]

## How does MusicGen work?

The MusicGen system has two main components: a frozen EnCodec audio tokenizer and a transformer decoder language model that predicts EnCodec tokens autoregressively.[^1][^2]

### EnCodec tokenization

For music, MusicGen uses a 32 kHz monaural EnCodec model with four codebooks sampled at 50 Hz. Each second of audio is therefore represented as 50 time steps and, at each time step, four discrete tokens (one per residual quantizer codebook), for a total of 200 tokens per second.[^5][^11] At that rate the representation is about 160 times sparser than the raw waveform (32,000 samples per second), yet still far denser than text, where a second of spoken language corresponds to only a handful of tokens.
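The token-rate arithmetic above can be checked with a few lines of Python. This is only a sketch of the bookkeeping; the constant and function names are illustrative and do not come from AudioCraft.

```python
# Token-rate bookkeeping for the 32 kHz EnCodec setup described above.
# All names here are illustrative, not part of any library.
SAMPLE_RATE_HZ = 32_000  # audio samples per second
FRAME_RATE_HZ = 50       # token frames (time steps) per second
NUM_CODEBOOKS = 4        # residual codebooks, one token each per frame


def tokens_per_second(frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Discrete tokens needed to represent one second of audio."""
    return frame_rate * codebooks


def samples_per_frame(sample_rate=SAMPLE_RATE_HZ, frame_rate=FRAME_RATE_HZ):
    """Raw waveform samples summarised by a single token frame."""
    return sample_rate // frame_rate


def clip_token_count(seconds, frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Total tokens in a clip of the given length."""
    return int(seconds * frame_rate) * codebooks


print(tokens_per_second())                   # 200 tokens per second
print(samples_per_frame())                   # 640 samples per frame
print(SAMPLE_RATE_HZ / tokens_per_second())  # 160.0 samples per token
print(clip_token_count(30))                  # 6000 tokens in a 30-second clip
```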

### Single-stage transformer with codebook interleaving

A naive autoregressive language model over the four codebooks would either flatten them (predicting 200 tokens per second sequentially) or rely on a cascade of models predicting one codebook level at a time. The MusicGen paper introduces and systematically compares several "codebook interleaving patterns," including a parallel pattern, a flattened pattern, a coarse-first pattern, and a "delay" pattern that staggers prediction of successive codebooks by one step.[^1][^2]

The delay pattern is MusicGen's default. As the paper puts it, "by introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio."[^1][^5] Offsetting each codebook by one step relative to the one before it means that a single decoding step emits one token from every codebook, each belonging to a different audio frame. Decoding throughput therefore matches a fully parallel scheme, while a dependency between codebooks is preserved: codebook k at step t is conditioned on codebook k-1 at step t and on all preceding time steps. MusicGen thus needs about 50 autoregressive steps per second of generated audio instead of 200, while retaining a single transformer.[^1][^2] The paper presents ablations on these patterns, finding that delay reaches quality similar to the flattened pattern at a fraction of the compute.[^1]
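The staggering can be illustrated with a small, library-free sketch. It assumes zero-indexed codebooks, so codebook k is shifted right by k steps, and it uses a hypothetical `PAD` value for positions that hold no token; AudioCraft's own implementation differs in detail.

```python
PAD = -1  # hypothetical placeholder for positions with no token


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at audio frame t.
    The result has K rows of length T + K - 1; column s holds the
    tokens that a single decoding step s would emit together.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = total_steps - num_frames - k
        delayed.append([pad] * k + list(row) + [pad] * right_pad)
    return delayed


def undo_delay_pattern(delayed, num_frames):
    """Realign the rows so that column t again holds frame t of every codebook."""
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Four codebooks, three audio frames.
codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
for row in apply_delay_pattern(codes):
    print(row)
# [10, 11, 12, -1, -1, -1]
# [-1, 20, 21, 22, -1, -1]
# [-1, -1, 30, 31, 32, -1]
# [-1, -1, -1, 40, 41, 42]
```

With K codebooks and T frames the sequence has T + K - 1 steps rather than T × K, so the K - 1 extra steps are a one-off cost per clip and the per-second rate stays at roughly 50.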

### Conditioning

For text conditioning, MusicGen passes the prompt through a frozen [T5](/wiki/t5) (or Flan-T5) text encoder and uses the resulting hidden states as a key/value source for cross-attention layers inserted between the decoder's self-attention layers.[^11] For melody conditioning, the system extracts a chromagram, a time-frequency representation that captures harmonic and melodic content while abstracting away timbre, from a reference audio clip; the chromagram is quantised and supplied as a conditioning prefix to the transformer.[^16][^11] When both text and melody conditioning are used, the model can be instructed to follow a reference melody while rendering it in a different style described by the text prompt.[^2]
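To make the idea of a chromagram concrete, the sketch below folds spectral peaks from one time frame into 12 pitch classes and then keeps only the strongest one. It is a simplified illustration rather than MusicGen's actual feature extractor: the peak list is made-up input, and using the dominant pitch class as the quantisation step is an assumption made for this example.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class index 0..11 (0 = C), ignoring octave."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_frame(peaks):
    """Fold (frequency_hz, magnitude) pairs from one frame into 12 chroma bins."""
    chroma = [0.0] * 12
    for freq_hz, magnitude in peaks:
        if freq_hz > 0:
            chroma[pitch_class(freq_hz)] += magnitude
    return chroma


def dominant_pitch_class(chroma):
    """Assumed quantisation step: keep only the strongest chroma bin."""
    return max(range(12), key=lambda i: chroma[i])


# Illustrative frame: A4 and A5 land in the same bin because octave is discarded.
peaks = [(440.0, 1.0), (880.0, 0.5), (329.63, 0.8)]
chroma = chroma_frame(peaks)
print(PITCH_CLASSES[dominant_pitch_class(chroma)])  # A
```

Because octave and timbre are discarded, the same chroma sequence can be re-rendered with any instrumentation the text prompt asks for.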

### Generating beyond the training horizon

The MusicGen models were trained on 30-second audio chunks, but inference supports longer outputs through a sliding window: the model generates 30 seconds, then continues from a 20-second overlap, producing a smooth extension up to the limits of compute budget and coherence.[^2]
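The sliding-window schedule can be sketched as follows, using the 30-second window and 20-second overlap given above as defaults. The function is a hypothetical helper for illustration, not AudioCraft's API; it only computes which span each pass covers.

```python
def window_schedule(total_seconds, window=30.0, overlap=20.0):
    """Return one (context_start, end) pair in seconds per generation pass.

    The first pass generates from 0 to `window`. Each later pass is
    prompted with the last `overlap` seconds already generated and
    appends `window - overlap` seconds of new audio.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    end = min(window, total_seconds)
    passes = [(0.0, end)]
    while end < total_seconds:
        new_end = min(end + stride, total_seconds)
        passes.append((end - overlap, new_end))
        end = new_end
    return passes


for context_start, end in window_schedule(60):
    print(context_start, end)
# 0.0 30.0
# 10.0 40.0
# 20.0 50.0
# 30.0 60.0
```

Under these settings each extra pass contributes only 10 seconds of new audio, so a 60-second clip takes four passes; a smaller overlap would be cheaper but gives the model less context to stay coherent.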

## Model variants

MusicGen was initially released in three sizes for text-to-music, plus two melody-conditioned variants, with later additions for stereo output and style conditioning. The base models were trained between April and May 2023, and the stereophonic variants were later fine-tuned for 200,000 updates starting from the corresponding mono models.[^5][^7]

| Variant | Parameters | Conditioning | Notes |
|---|---|---|---|
| MusicGen Small | 300M | Text | Smallest text-to-music model[^5][^12] |
| MusicGen Medium | 1.5B | Text | Frequently cited as the best quality/compute trade-off[^12] |
| MusicGen Large | 3.3B | Text | Highest-quality original variant[^12] |
| MusicGen Melody | 1.5B | Text + melody | Adds chromagram-based melody conditioning[^5] |
| MusicGen Melody Large | 3.3B | Text + melody | Released later; large-scale melody model[^17] |
| MusicGen Stereo (Small/Medium/Large/Melody/Melody Large) | as above | Stereo output | Added late 2023; two EnCodec token streams interleaved with the delay pattern, with a multi-band diffusion decoder[^7] |
| MusicGen-Style | 1.5B | Text + style audio | Allows matching an audio excerpt's style; trained September-December 2023[^9] |

The official model card reports MusicCaps benchmark numbers for the three base text-to-music sizes: Frechet Audio Distance (FAD, lower is better) of 4.88, 5.14, and 5.48 for Small, Medium, and Large respectively; KL divergence (lower is better) of 1.42, 1.38, and 1.37; and CLAP-based text consistency (higher is better) of 0.27, 0.28, and 0.28.[^5] FAD therefore worsens slightly as model size grows even though KL divergence and text consistency improve or hold steady; this counterintuitive trend is usually attributed to limitations of FAD as a perceptual metric rather than to larger models being subjectively worse.[^5]

## When was MusicGen released?

The "Simple and Controllable Music Generation" paper was first posted to arXiv on 8 June 2023, the same day Meta published the model weights on Hugging Face and a public demo.[^1][^2] On 2 August 2023 Meta folded MusicGen, AudioGen, and EnCodec into the open-source [AudioCraft](/wiki/audiocraft) library.[^3][^4] Multi-band-diffusion decoding and stereo variants followed in late 2023, and the paper was presented at [NeurIPS](/wiki/neurips) 2023.[^1][^7]

## Training data

A distinctive aspect of MusicGen, frequently highlighted in news coverage at the time, is that Meta paid to license the music it trained on rather than scraping the open web. The model card states that 20,000 hours of music (roughly 400,000 tracks) were drawn from three sources: an internal "Meta Music Initiative Sound Collection," the Shutterstock music collection, and the Pond5 music collection.[^4][^5]

Vocals were removed from the training data in pre-processing, first by using metadata tags and then by applying the open-source Hybrid Transformer for Music Source Separation (HT-Demucs). Each clip is paired with a textual description derived from metadata such as genre, mood, instrumentation, and tempo tags. The released models therefore cannot generate convincing sung vocals, a property both reflected in the model card and noted by reviewers.[^5]

Meta's licensing approach was singled out by [generative-AI](/wiki/generative_ai) industry observers as a meaningful counterpoint to the prevailing "train first, litigate later" pattern in music AI, although Meta has not published a full breakdown of the licensing terms or per-track payments.[^4][^18] The model card explicitly acknowledges that the training mix skews toward Western, English-described music and that performance is correspondingly uneven across cultures and musical traditions.[^5]

## How is MusicGen evaluated?

The MusicGen paper evaluates models on the MusicCaps benchmark, a 5.5K-clip evaluation set of ten-second musical excerpts paired with expert text captions originally introduced for MusicLM, using both objective metrics and human listening studies.[^1][^19]

The objective metrics are:

- **Frechet Audio Distance (FAD)**, computed on VGGish embeddings, which compares the distribution of generated audio to ground-truth audio.[^1][^5]
- **KL divergence**, computed on PaSST-based audio tag distributions, which measures how closely the labels predicted for a generated clip match those of the reference.[^5]
- **CLAP score**, which measures alignment between the text prompt and the generated audio in the CLAP joint embedding space.[^5]
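For reference, the standard textbook forms of these three metrics are given below. The notation is introduced here for illustration and is not taken from the paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of embeddings for reference and generated audio, $p$ and $q$ are the classifier's label distributions for a reference clip and its generated counterpart, and $e_{\text{text}}$, $e_{\text{audio}}$ are CLAP embeddings. The direction of the KL term and any averaging over clips are assumptions.

```latex
% Frechet distance between two Gaussians fitted to embedding sets
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% KL divergence between label distributions of reference (p) and generated (q) audio
D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}

% CLAP score as cosine similarity in the joint text-audio embedding space
\mathrm{CLAP} = \frac{ e_{\text{text}} \cdot e_{\text{audio}} }
  { \lVert e_{\text{text}} \rVert \, \lVert e_{\text{audio}} \rVert }
```

Lower is better for FAD and KL divergence, while a higher CLAP score indicates closer agreement between prompt and audio.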

In the original benchmarks, MusicGen was competitive with MusicLM, Mousai, Riffusion, and Noise2Music on FAD and produced higher text-relevance scores. The paper notes that Noise2Music achieved the lowest FAD on MusicCaps at the time, with MusicGen-text close behind, while MusicGen led on subjective measures.[^1]

The human studies asked listeners to rate samples on overall quality, text relevance, and, for melody-conditioned models, melody adherence. MusicGen was preferred to MusicLM, Riffusion, and Mousai by raters on quality and text adherence in the original paper's experiments, with the differences widening on the melody-conditioned setting.[^1] Subsequent independent comparisons in the literature, including the 2025 "Benchmarking Music Generation Models and Metrics via Human Preference Studies" paper, found that human ratings and FAD do not always align, and that MusicGen remains a strong baseline.[^20]

## Is MusicGen open source?

Yes. MusicGen's code is released under the permissive MIT license and its model weights under the non-commercial CC-BY-NC 4.0 license, with both distributed through the `facebookresearch/audiocraft` repository.[^5][^6] Announcing the AudioCraft release, Meta wrote that it was "open-sourcing these models, giving researchers and practitioners access so they can train their own models with their own datasets for the first time."[^25] The weights' non-commercial clause means MusicGen is intended for research and experimentation; commercial deployment of the released checkpoints is restricted by that license.[^5]

## AudioCraft framework

On 2 August 2023, Meta released **AudioCraft**, a PyTorch library that consolidates training and inference code for MusicGen, AudioGen, and EnCodec into a single repository.[^3][^4] The library was released under the MIT license, and weights for the bundled models under CC-BY-NC 4.0.[^6] As Meta described it, "AudioCraft consists of three models: MusicGen, AudioGen and EnCodec."[^25]

AudioCraft has since grown to include several additional models from the same FAIR research line:

- **Multi Band Diffusion (MBD)**: an EnCodec-compatible diffusion-based decoder introduced in August 2023 that improves audio quality by generating different frequency bands independently before recombining them.[^7]
- **MAGNeT**: a non-autoregressive masked-prediction transformer over EnCodec tokens, published in January 2024 by Ziv et al. MAGNeT achieves quality comparable to MusicGen while running approximately seven times faster than the autoregressive baseline.[^8][^21]
- **MusicGen-Style**: a 1.5B style-conditioned variant trained between September and December 2023, accompanied by a paper published in mid-2024.[^9]
- **JASCO**: a flow-matching-based text-to-music model with temporal control over chords, melody, and drum tracks, published in June 2024.[^22]
- **AudioSeal**: an audio watermarking system intended to mark generated content.[^6]

The AudioCraft repository, distributed via `facebookresearch/audiocraft` on GitHub, became the canonical home for Meta's open audio-generation research and the practical entry point used by most researchers and hobbyists experimenting with MusicGen.[^6]

## Reception and uptake

MusicGen received broad coverage in technical and music-industry press at launch. TechCrunch, in its August 2023 coverage of the AudioCraft release, described MusicGen as making it possible to "create halfway decent songs in a range of styles without having to play an instrument or read sheet music," while also flagging copyright tensions inherent to any large-scale music model.[^4] Subsequent coverage emphasised three themes: that the weights were openly released, that the training data was licensed, and that the model held up well against MusicLM, which Google had at that point only made available in a closed preview.[^4][^18]

In the research community the model became a standard baseline. The Hugging Face Transformers library added a native MusicGen implementation, the model appears as a baseline in subsequent text-to-music papers, and the CC-BY-NC weights made it the basis for many community fine-tunes (for example for genre-specific generation), wrappers (such as the popular "Replicate" deployments), and pedagogical tutorials.[^11][^23] By 2024, MusicGen-Large remained one of the most-downloaded text-to-music models on the [Meta AI](/wiki/meta_ai) Hugging Face organisation page.[^12]

Meta also acknowledged limitations clearly in its own materials: the model card lists inability to produce realistic vocals, English-only prompt support, uneven performance across musical cultures, and a tendency for generated clips to "collapse to silence" near the end of long generations. The release was made available "for research purposes," with downstream commercial use restricted by the non-commercial weight license.[^5]

## Successor work at Meta

MusicGen's recipe, using EnCodec tokens, a transformer over discrete codes, and conditioning from a frozen text encoder, was carried forward in several subsequent Meta projects.

**MAGNeT** (January 2024) keeps the EnCodec representation and the four-codebook structure but replaces autoregressive prediction with a single non-autoregressive masked transformer that decodes audio in a few parallel passes. The authors include several of the MusicGen authors and report a roughly 7x speed-up at comparable quality.[^8][^21]

**Audiobox** (November 2023) is a broader audio foundation model targeting speech, sound effects, and soundscapes, building on Meta's Voicebox speech model. Audiobox is more focused on speech and general audio than MusicGen, but it represents Meta's push to extend controllable audio generation beyond music; it was released to a hand-picked group of researchers rather than as open weights, with audio watermarking and voice authentication safeguards.[^24]

**JASCO** (June 2024) targets fine-grained temporal control of music using flow-matching rather than autoregressive language modelling, with conditioning on chords, melody, and drum tracks. JASCO is offered as a complement rather than a replacement to MusicGen, and weights for several JASCO variants were released openly via the AudioCraft repository under a CC license.[^22]

**MusicGen-Style** (paper 2024) extends the original model with the ability to follow a short reference audio clip's style in addition to (or instead of) text.[^9]

## How does MusicGen differ from MusicLM, Suno, and Stable Audio?

MusicGen's openness, technical clarity, and licensed training data made it a reference point for both academic and commercial AI music systems that emerged in 2023-2024. The defining contrast with [MusicLM](/wiki/musiclm) is architectural: MusicLM uses a cascade of self-supervised semantic and acoustic models, whereas MusicGen uses one transformer over EnCodec tokens and released its weights openly, where Google kept MusicLM closed at launch.[^1][^15]

The commercial systems that captured the most public attention during this period, notably Suno and Udio, use different architectures (broadly described as hybrid transformer-and-diffusion systems built on latent representations of full songs rather than instrumental clips) and target full songs with vocals, which MusicGen does not.[^25][^26] These products did not derive directly from MusicGen, but they competed in a market in which open-weights MusicGen had already established expectations for fidelity, controllability, and the role of training data licensing.[^18][^27]

[Stable Audio](/wiki/stable_audio), released by [Stability AI](/wiki/stability_ai) in September 2023, is closer in spirit to MusicGen (instrumental music, text conditioning, open-research artefacts) but uses a latent-diffusion architecture and a U-Net (later replaced by a diffusion transformer in Stable Audio 2.0) rather than an autoregressive transformer over discrete codes. Stable Audio's training was also done exclusively on licensed music, in this case via a partnership with the AudioSparx library, mirroring the licensing posture Meta had publicly adopted for MusicGen.[^28]

For research, MusicGen's release effectively cemented the EnCodec-plus-transformer pattern as a baseline of comparison: a long tail of follow-up papers fine-tune MusicGen, use it as a starting point for new conditioning mechanisms (such as MusiConGen for rhythm and chord control or Instruct-MusicGen for instruction-tuned editing), or evaluate new models against MusicGen variants on MusicCaps.[^16][^29]

## Status and updates through 2026

After the initial June 2023 paper and August 2023 AudioCraft release, Meta shipped a series of incremental MusicGen improvements through 2023, most prominently multi-band-diffusion decoding and stereo variants in late 2023, and complemented MusicGen with MAGNeT, MusicGen-Style, and JASCO in 2024.[^7][^8][^9][^22] As of 2026, Meta has not released a "MusicGen 2," and the AudioCraft repository remains the primary distribution channel for the original models and their immediate successors; community projects and downstream products continue to use MusicGen weights as a base model.[^6][^10]

## See also

- [AI Music Generation](/wiki/ai_music_generation)
- [EnCodec](/wiki/encodec)
- [MusicLM](/wiki/musiclm)
- [AudioCraft](/wiki/audiocraft)
- [Audiobox](/wiki/audiobox)
- [Stable Audio](/wiki/stable_audio)

## References

[^1]: Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., Defossez, A. "Simple and Controllable Music Generation." arXiv:2306.05284, submitted 8 June 2023; accepted to NeurIPS 2023. https://arxiv.org/abs/2306.05284

[^2]: AudioCraft project, "MusicGen: Simple and Controllable Music Generation." https://audiocraft.metademolab.com/musicgen.html

[^3]: Meta AI, "AudioCraft: A simple one-stop shop for audio modeling." 2 August 2023. https://ai.meta.com/blog/audiocraft-musicgen-audiogen-encodec-generative-ai-audio/

[^4]: TechCrunch, "Meta open sources framework for generating sounds and music." 2 August 2023. https://techcrunch.com/2023/08/02/meta-open-sources-models-for-generating-sounds-and-music/

[^5]: AudioCraft, "MUSICGEN_MODEL_CARD.md," `facebookresearch/audiocraft` GitHub repository. https://github.com/facebookresearch/audiocraft/blob/main/model_cards/MUSICGEN_MODEL_CARD.md

[^6]: `facebookresearch/audiocraft` GitHub repository. https://github.com/facebookresearch/audiocraft

[^7]: Hugging Face, `facebook/MusicGen` Space, Stereo demo update discussion, November 2023. https://huggingface.co/spaces/facebook/MusicGen

[^8]: Ziv, A., Gat, I., Le Lan, G., Remez, T., Kreuk, F., Defossez, A., Copet, J., Synnaeve, G., Adi, Y. "Masked Audio Generation using a Single Non-Autoregressive Transformer." arXiv:2401.04577, January 2024. https://arxiv.org/abs/2401.04577

[^9]: Hugging Face, `facebook/musicgen-style` model card. https://huggingface.co/facebook/musicgen-style

[^10]: Max Hilsdorf, "MusicGen Reimagined: Meta's Under-the-Radar Advances in AI Music," Towards Data Science archive, 2024. https://medium.com/data-science/musicgen-reimagined-metas-under-the-radar-advances-in-ai-music-36c1adfd13b7

[^11]: Hugging Face Transformers documentation, "MusicGen." https://huggingface.co/docs/transformers/model_doc/musicgen

[^12]: Hugging Face, `facebook/musicgen-large` model card. https://huggingface.co/facebook/musicgen-large

[^13]: Defossez, A., Copet, J., Synnaeve, G., Adi, Y. "High Fidelity Neural Audio Compression." arXiv:2210.13438, October 2022. https://arxiv.org/abs/2210.13438

[^14]: AudioCraft project, "AudioGen." https://audiocraft.metademolab.com/

[^15]: Agostinelli, A. et al. "MusicLM: Generating Music From Text." arXiv:2301.11325, January 2023. https://arxiv.org/abs/2301.11325

[^16]: Hugging Face Transformers documentation, "MusicGen Melody." https://huggingface.co/docs/transformers/en/model_doc/musicgen_melody

[^17]: Hugging Face, `facebook/musicgen-melody-large` model card. https://huggingface.co/facebook/musicgen-melody-large

[^18]: VentureBeat coverage of Meta audio generation releases, 2023-2024. https://venturebeat.com/ai/meta-unveils-audiobox-an-ai-that-clones-voices-and-generates-ambient-sounds/

[^19]: MusicCaps benchmark (Papers With Code SOTA tracking). https://paperswithcode.com/sota/text-to-music-generation-on-musiccaps

[^20]: "Benchmarking Music Generation Models and Metrics via Human Preference Studies." arXiv:2506.19085, 2025. https://arxiv.org/abs/2506.19085

[^21]: AudioCraft, "MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer." https://facebookresearch.github.io/audiocraft/docs/MAGNET.html

[^22]: Tal, O. et al. "Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation" (JASCO). arXiv:2406.10970, June 2024. https://arxiv.org/abs/2406.10970

[^23]: Replicate blog, "Fine-tune MusicGen to generate music in any style." https://replicate.com/blog/fine-tune-musicgen

[^24]: Meta AI, "Audiobox: Generating audio from voice and natural language prompts." November 2023. https://ai.meta.com/blog/audiobox-generating-audio-voice-natural-language-prompts/

[^25]: Meta, "Introducing AudioCraft: A Generative AI Tool For Audio and Music." 2 August 2023. https://about.fb.com/news/2023/08/audiocraft-generative-ai-for-music-and-audio/

[^26]: Music Business Worldwide, coverage of Suno, Udio, and music-industry settlements. https://www.musicbusinessworldwide.com/

[^27]: Stability AI press release, "Introducing Stable Audio 2.0." https://stability.ai/news/stable-audio-2-0

[^28]: Stability AI research, "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion." https://stability.ai/research/stable-audio-efficient-timing-latent-diffusion

[^29]: Lan, Y.-H. et al. "MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation." arXiv:2407.15060, 2024. https://arxiv.org/abs/2407.15060

