MusicGen
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,063 words
MusicGen is a text-to-music generation model developed by Meta AI's Fundamental AI Research (FAIR) team and first described in the paper "Simple and Controllable Music Generation," posted to arXiv on 8 June 2023.[1] The model is a single-stage autoregressive transformer language model that operates over discrete audio tokens produced by Meta's neural audio codec EnCodec, and it can generate music samples up to about thirty seconds long conditioned on a text prompt, an input melody, or both.[1][2]
MusicGen was released alongside a research paper, model weights on Hugging Face, and a demo, and was subsequently incorporated into the AudioCraft PyTorch library that Meta open-sourced on 2 August 2023.[3][4] In contrast to several contemporaneous music systems, including Google's MusicLM, which relied on cascaded models and self-supervised semantic representations, MusicGen demonstrated that comparable quality could be achieved with a single transformer through an efficient "delay" interleaving pattern over EnCodec's residual codebooks.[1][2] The model's code was released under the MIT license and its weights under CC-BY-NC 4.0, while Meta emphasised that the music used for training had been licensed from Shutterstock, Pond5, and Meta's own internal sound collection.[5][6]
MusicGen became one of the most influential open-weights music generation systems of the 2023-2024 period. It established the template of "EnCodec tokens plus an autoregressive transformer" that subsequent Meta models such as MAGNeT, MusicGen-Style, and JASCO built upon, and the released weights are widely used as a baseline in academic work and as the basis for derivative tools, fine-tunes, and community projects.[7][8][9][10]
| Attribute | Details |
|---|---|
| Developer | Meta AI (FAIR) |
| Initial release | 8 June 2023 (paper and weights); included in AudioCraft library 2 August 2023[1][3] |
| Open weights | Yes: model weights under CC-BY-NC 4.0; code under MIT[5][6] |
| Architecture | Single-stage autoregressive transformer decoder over discrete EnCodec audio tokens, with text conditioning from a frozen T5 encoder[1][11] |
| Model sizes | 300M (Small), 1.5B (Medium), 3.3B (Large); 1.5B and 3.3B melody variants; later stereo and style variants[5][12] |
| Audio tokenizer | EnCodec at 32 kHz, four residual codebooks at 50 Hz[5] |
| Training data | Approximately 20,000 hours (~400,000 tracks) of licensed music from Shutterstock, Pond5, and the Meta Music Initiative Sound Collection[4][5] |
| Paper | Copet et al., "Simple and Controllable Music Generation," arXiv:2306.05284, NeurIPS 2023[1] |
| Repository | facebookresearch/audiocraft on GitHub[6] |
MusicGen sits in a research lineage at Meta that runs from neural audio compression to general-purpose audio generation. The immediate technical predecessor is EnCodec, a high-fidelity neural audio codec introduced in October 2022 by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, all of whom are also authors of MusicGen.[13] EnCodec uses a streaming convolutional encoder-decoder with residual vector quantisation, achieving roughly ten times the compression of MP3 at comparable perceived quality, and it produces the discrete tokens that MusicGen learns to predict.[13]
A second strand is AudioGen, Meta's text-to-sound model for environmental and effect sounds, which paired EnCodec tokens with a transformer language model. AudioGen and EnCodec together motivated the question of whether a similar recipe could scale to music, a domain that requires longer-range structure, harmonic coherence, and finer control over the generation process.[14]
The most immediate external precedent was MusicLM, released by Google in January 2023, which generated music from text by combining self-supervised semantic audio tokens, joint text-audio embeddings from a separate model called MuLan, and a cascade of language models operating at different abstraction levels. MusicLM was widely covered, but Google initially declined to release weights, citing copyright and ethical concerns.[15] MusicGen explicitly framed itself against this design: the authors argued that "simple and controllable" generation could be achieved with one transformer over EnCodec tokens, dispensing with the semantic stage and cascade.[1] Other contemporaries the paper benchmarks against include Riffusion, Mousai, and Google's Noise2Music.[1]
The MusicGen system has two main components: a frozen EnCodec audio tokenizer and a transformer decoder language model that predicts EnCodec tokens autoregressively.[1][2]
For music, MusicGen uses a 32 kHz monaural EnCodec model with four codebooks sampled at 50 Hz. Each second of audio is therefore represented as 50 time steps and, at each time step, four discrete tokens (one per residual quantizer codebook), for a total of 200 tokens per second.[5][11] That is 160 times fewer values per second than the 32,000 samples of the raw waveform, roughly two orders of magnitude sparser, yet still far denser than the token sequences a text language model typically handles.
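The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the `CodecConfig` class and `token_budget` function are hypothetical names invented for this example, and the numbers are the ones quoted in the paragraph (32 kHz audio, 50 Hz frames, four codebooks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical container for the figures quoted in the text."""
    sample_rate_hz: int = 32_000  # waveform samples per second
    frame_rate_hz: int = 50       # token frames per second
    num_codebooks: int = 4        # residual codebooks per frame


def token_budget(duration_s: float, cfg: CodecConfig = CodecConfig()) -> dict:
    """Count frames, tokens, and raw samples for a clip of the given length."""
    frames = int(duration_s * cfg.frame_rate_hz)
    tokens = frames * cfg.num_codebooks
    samples = int(duration_s * cfg.sample_rate_hz)
    return {
        "frames": frames,
        "tokens": tokens,
        "samples": samples,
        "samples_per_token": samples / tokens,
    }


budget = token_budget(30.0)
print(budget)
```

For a 30-second clip this gives 1,500 frames and 6,000 tokens against 960,000 waveform samples, which is the 160-to-1 ratio between raw samples and codec tokens.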
A naive autoregressive language model over the four codebooks would either flatten them (predicting 200 tokens per second sequentially) or rely on a cascade of models predicting one codebook level at a time. The MusicGen paper introduces and systematically compares several "codebook interleaving patterns," including a parallel pattern, a flattened pattern, a coarse-first pattern, and a "delay" pattern that staggers prediction of successive codebooks by one step.[1][2]
The delay pattern is MusicGen's default. It offsets codebook k by k-1 decoding steps, so each decoding step emits four tokens in parallel, one per codebook, that belong to four consecutive audio frames. Causality is still respected: when codebook k for frame t is predicted, it is conditioned on everything emitted at earlier decoding steps, which includes codebook k-1 for the same frame. The result is that MusicGen needs roughly fifty autoregressive steps per second of generated audio, plus three extra steps per sequence for the delayed codebooks, rather than two hundred, while retaining a single transformer.[1][2] The paper presents ablations on these patterns, finding that delay matches the quality of the flattened pattern at a fraction of the compute.[1]
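A small sketch may make the delay layout concrete. It is not taken from the AudioCraft code base: `apply_delay` and `undo_delay` are hypothetical helpers, codebooks are 0-indexed here (so codebook k is shifted by k steps), and `None` stands in for whatever padding token a real implementation would use.

```python
from typing import List, Optional

PAD = None  # assumption: placeholder for positions that hold no token


def apply_delay(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift row k right by k steps. codes[k][t] is codebook k at frame t."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    num_steps = num_frames + num_codebooks - 1
    layout: List[List[Optional[int]]] = [
        [PAD] * num_steps for _ in range(num_codebooks)
    ]
    for k, row in enumerate(codes):
        for t, token in enumerate(row):
            layout[k][t + k] = token
    return layout


def undo_delay(layout: List[List[Optional[int]]]) -> List[List[Optional[int]]]:
    """Realign the rows so that column t holds all codebooks of frame t."""
    num_codebooks = len(layout)
    num_frames = len(layout[0]) - (num_codebooks - 1)
    return [layout[k][k:k + num_frames] for k in range(num_codebooks)]


# Toy input: 4 codebooks, 5 frames, token value 10*k + t for readability.
codes = [[10 * k + t for t in range(5)] for k in range(4)]
layout = apply_delay(codes)
for row in layout:
    print(row)

delay_steps = len(layout[0])                 # one decoding step per column
flattened_steps = len(codes) * len(codes[0])  # one step per token
print(delay_steps, flattened_steps)
```

Each column of `layout` corresponds to one decoding step in which up to four tokens are produced at once. For this toy input the delay layout needs 8 steps where a flattened sequence would need 20, and the gap grows with clip length because the delay only adds a constant number of extra steps.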
For text conditioning, MusicGen passes the prompt through a frozen T5 (or Flan-T5) text encoder and uses the resulting hidden states as a key/value source for cross-attention layers inserted between the decoder's self-attention layers.[11] For melody conditioning, the system extracts a chromagram, a time-frequency representation that captures harmonic and melodic content while abstracting away timbre, from a reference audio clip; the chromagram is quantised and supplied as a conditioning prefix to the transformer.[16][11] When both text and melody conditioning are used, the model can be instructed to follow a reference melody while rendering it in a different style described by the text prompt.[2]
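To illustrate why a chromagram abstracts away timbre and octave, the sketch below folds spectral peaks into twelve pitch classes and keeps only the dominant one per frame. It is a simplified illustration, not MusicGen's actual pre-processing: the function names are hypothetical, the input is a hand-written list of `(frequency, magnitude)` peaks rather than a real spectrogram, and taking the arg-max is just one assumed way to quantise a chroma frame.

```python
import math
from typing import List, Sequence, Tuple

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Map a frequency to one of 12 pitch classes (0 = C), ignoring octave."""
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)
    return int(round(midi)) % 12


def chroma_frame(peaks: Sequence[Tuple[float, float]]) -> List[float]:
    """Accumulate peak magnitudes into a 12-bin chroma vector."""
    bins = [0.0] * 12
    for freq_hz, magnitude in peaks:
        bins[pitch_class(freq_hz)] += magnitude
    return bins


def quantise(chroma: List[float]) -> int:
    """Keep only the index of the strongest pitch class (assumed scheme)."""
    return max(range(12), key=lambda i: chroma[i])


# One frame with an A in two octaves and a slightly louder single C.
peaks = [(220.0, 0.5), (440.0, 0.4), (261.63, 0.6)]
chroma = chroma_frame(peaks)
dominant = quantise(chroma)
print(PITCH_CLASSES[dominant])
```

The two A peaks land in the same bin regardless of octave, so the frame is summarised as "A" even though the single loudest peak is the C. A sequence of such per-frame symbols carries the melodic contour while discarding most information about the instrument that played it.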
The MusicGen models were trained on 30-second audio chunks, but inference supports longer outputs through a sliding window: the model generates 30 seconds, then continues from a 20-second overlap, producing a smooth extension up to the limits of compute budget and coherence.[2]
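The sliding-window scheme can be sketched as a schedule of generation calls. The helper below is hypothetical and only does the bookkeeping: it assumes the 30-second window and 20-second overlap described above, so each call after the first adds 10 seconds of new audio.

```python
from typing import List, Tuple


def extension_schedule(
    total_s: float, window_s: float = 30.0, overlap_s: float = 20.0
) -> List[Tuple[float, float, float]]:
    """Return (context_start, context_end, new_end) for each generation call.

    The first call has no audio context, so its context span is empty.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be in [0, window_s)")

    stride_s = window_s - overlap_s
    generated = min(window_s, total_s)
    schedule = [(0.0, 0.0, generated)]
    while generated < total_s:
        context_start = generated - overlap_s  # reuse the last overlap_s seconds
        new_end = min(generated + stride_s, total_s)
        schedule.append((context_start, generated, new_end))
        generated = new_end
    return schedule


plan = extension_schedule(60.0)
for call in plan:
    print(call)
```

Under these assumptions a 60-second output takes four calls: one full window, then three continuations that each condition on the previous 20 seconds. A larger overlap gives the model more context per call at the cost of more calls per minute of audio.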
MusicGen was initially released in three sizes for text-to-music, plus two melody-conditioned variants, with later additions for stereo output and style conditioning.
| Variant | Parameters | Conditioning | Notes |
|---|---|---|---|
| MusicGen Small | 300M | Text | Smallest text-to-music model[5][12] |
| MusicGen Medium | 1.5B | Text | Frequently cited as the best quality/compute trade-off[12] |
| MusicGen Large | 3.3B | Text | Highest-quality original variant[12] |
| MusicGen Melody | 1.5B | Text + melody | Adds chromagram-based melody conditioning[5] |
| MusicGen Melody Large | 3.3B | Text + melody | Released later; large-scale melody model[17] |
| MusicGen Stereo (Small/Medium/Large/Melody/Melody Large) | as above | Stereo output | Added November 2023 with a multi-band diffusion decoder[7] |
| MusicGen-Style | 1.5B | Text + style audio | Allows matching an audio excerpt's style; trained September-December 2023[9] |
The official model card reports MusicCaps benchmark numbers for the three base text-to-music sizes: Fréchet Audio Distance (FAD) of 4.88, 5.14, and 5.48 for Small, Medium, and Large respectively; KL divergence around 1.37-1.42; and CLAP-based text consistency of 0.27-0.28. Lower FAD is better, so the reported FAD gets slightly worse as model size grows; this counter-intuitive trend reflects known limitations of FAD as a perceptual metric rather than larger models being subjectively worse.[5]
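For readers unfamiliar with FAD, the idea is to fit a Gaussian to embeddings of reference audio and another to embeddings of generated audio, then measure the Fréchet distance between the two. The sketch below is a deliberately simplified version and makes several assumptions: it treats the covariance as diagonal (real FAD uses full covariance matrices), it takes toy two-dimensional vectors in place of embeddings from a pretrained audio model, and the function names are invented for this example.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def fit_diag_gaussian(
    embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance of a set of embedding vectors."""
    dims = list(zip(*embeddings))
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diag(
    ref: Sequence[Sequence[float]], gen: Sequence[Sequence[float]]
) -> float:
    """Frechet distance between two diagonal-covariance Gaussians."""
    mu_r, var_r = fit_diag_gaussian(ref)
    mu_g, var_g = fit_diag_gaussian(gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        vr + vg - 2 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy "embeddings": the generated set is the reference shifted by 1 per axis.
ref = [[0.0, 0.0], [2.0, 4.0]]
gen = [[1.0, 1.0], [3.0, 5.0]]
print(frechet_distance_diag(ref, ref))
print(frechet_distance_diag(ref, gen))
```

Identical sets score zero, and the shifted set is penalised only through the difference in means. Because the score depends entirely on the embedding model and on distribution-level statistics, small differences such as those between the three model sizes need not track what listeners prefer.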
A distinctive aspect of MusicGen, frequently highlighted in news coverage at the time, is that Meta paid to license the music it trained on rather than scraping the open web. The model card states that 20,000 hours of music (roughly 400,000 tracks) were drawn from three sources: an internal "Meta Music Initiative Sound Collection," the Shutterstock music collection, and the Pond5 music collection.[4][5]
Vocals were removed in pre-processing, first by using metadata tags to identify tracks with vocals and then by applying the open-source HT-Demucs source-separation tool. Each clip is paired with a textual description derived from metadata such as genre, mood, instrumentation, and tempo tags. The released models therefore cannot generate convincing sung vocals, a property both reflected in the model card and noted by reviewers.[5]
Meta's licensing approach was singled out by generative-AI industry observers as a meaningful counterpoint to the prevailing "train first, litigate later" pattern in music AI, although Meta has not published a full breakdown of the licensing terms or per-track payments.[4][18] The model card explicitly acknowledges that the training mix skews toward Western, English-described music and that performance is correspondingly uneven across cultures and musical traditions.[5]
The MusicGen paper evaluates models on the MusicCaps benchmark, a 5.5K-clip evaluation set of ten-second musical excerpts paired with expert text captions originally introduced for MusicLM, using both objective metrics and human listening studies.[1][19]
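The objective metrics are Fréchet Audio Distance (FAD), which compares the distribution of generated audio with that of reference audio; KL divergence, which compares classifier label distributions for generated and reference clips; and a CLAP-based score that measures how well the audio matches the text prompt.[1][5]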
In the original benchmarks, MusicGen was competitive with MusicLM, Mousai, Riffusion, and Noise2Music on FAD and produced higher text-relevance scores. The paper notes that Noise2Music achieved the lowest FAD on MusicCaps at the time, with MusicGen-text close behind, while MusicGen led on subjective measures.[1]
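The KL divergence reported for MusicGen compares the label probabilities that an audio classifier assigns to a reference clip with those it assigns to the generated clip. The sketch below shows only the divergence itself: the probability vectors are made up, the classifier is left out entirely, and the `eps` smoothing term is an assumption added to avoid division by zero.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


# Hypothetical label probabilities for a reference and a generated clip.
reference_probs = [0.5, 0.5]
generated_probs = [0.9, 0.1]
print(kl_divergence(reference_probs, reference_probs))
print(kl_divergence(reference_probs, generated_probs))
```

A score near zero means the classifier "hears" the same mix of labels in both clips. The metric is asymmetric, so the order of the arguments matters when comparing numbers across papers.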
The human studies asked listeners to rate samples on overall quality, text relevance, and, for melody-conditioned models, melody adherence. MusicGen was preferred to MusicLM, Riffusion, and Mousai by raters on quality and text adherence in the original paper's experiments, with the differences widening on the melody-conditioned setting.[1] Subsequent independent comparisons in the literature, including the 2025 "Benchmarking Music Generation Models and Metrics via Human Preference Studies" paper, found that human ratings and FAD do not always align, and that MusicGen remains a strong baseline.[20]
On 2 August 2023, Meta released AudioCraft, a PyTorch library that consolidates training and inference code for MusicGen, AudioGen, and EnCodec into a single repository.[3][4] The library was released under the MIT license, and weights for the bundled models under CC-BY-NC 4.0.[6]
AudioCraft has since grown to include several additional models from the same FAIR research line, among them MAGNeT, MusicGen-Style, and JASCO.[8][9][22]
The AudioCraft repository, distributed via facebookresearch/audiocraft on GitHub, became the canonical home for Meta's open audio-generation research and the practical entry point used by most researchers and hobbyists experimenting with MusicGen.[6]
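As a rough guide to what that entry point looks like in practice, the sketch below wraps text-to-music generation in a single function. It follows the usage pattern shown in the AudioCraft README as best recalled (`MusicGen.get_pretrained`, `set_generation_params`, `generate`), but the wrapper function and the `CHECKPOINTS` mapping are hypothetical, and the calls should be checked against the installed library version. The import is done lazily so the file can be loaded on a machine without `audiocraft` installed; actually calling the function downloads model weights.

```python
from typing import Dict, List

# Checkpoint names as published on the Hugging Face Hub (text and melody models).
CHECKPOINTS: Dict[str, str] = {
    "small": "facebook/musicgen-small",
    "medium": "facebook/musicgen-medium",
    "large": "facebook/musicgen-large",
    "melody": "facebook/musicgen-melody",
}


def generate_clips(prompts: List[str], size: str = "small", duration_s: float = 8.0):
    """Generate one clip per prompt and return (waveforms, sample_rate)."""
    # Imported here so that importing this module does not require audiocraft.
    from audiocraft.models import MusicGen

    model = MusicGen.get_pretrained(CHECKPOINTS[size])
    model.set_generation_params(duration=duration_s)
    wav = model.generate(prompts)  # tensor shaped [batch, channels, samples]
    return wav, model.sample_rate


if __name__ == "__main__":
    clips, sample_rate = generate_clips(["lo-fi hip hop with warm piano"])
    print(clips.shape, sample_rate)
```

Starting with the small checkpoint and a short duration keeps the first run cheap; the larger checkpoints use the same call sequence but need considerably more memory.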
MusicGen received broad coverage in technical and music-industry press at launch. TechCrunch, in its August 2023 coverage of the AudioCraft release, described MusicGen as making it possible to "create halfway decent songs in a range of styles without having to play an instrument or read sheet music," while also flagging copyright tensions inherent to any large-scale music model.[4] Subsequent coverage emphasised three themes: that the weights were openly released, that the training data was licensed, and that the model held up well against MusicLM, which Google had at that point only made available in a closed preview.[4][18]
In the research community the model became a standard baseline. The Hugging Face Transformers library added a native MusicGen implementation, the model appears as a baseline in subsequent text-to-music papers, and the CC-BY-NC weights made it the basis for many community fine-tunes (for example for genre-specific generation), wrappers (such as the popular "Replicate" deployments), and pedagogical tutorials.[11][23] By 2024, MusicGen-Large remained one of the most-downloaded text-to-music models on the Meta AI Hugging Face organisation page.[12]
Meta also acknowledged limitations clearly in its own materials: the model card lists inability to produce realistic vocals, English-only prompt support, uneven performance across musical cultures, and a tendency for generated clips to "collapse to silence" near the end of long generations. The release was made available "for research purposes," with downstream commercial use restricted by the non-commercial weight license.[5]
MusicGen's recipe, using EnCodec tokens, a transformer over discrete codes, and conditioning from a frozen text encoder, was carried forward in several subsequent Meta projects.
MAGNeT (January 2024) keeps the EnCodec representation and the four-codebook structure but replaces autoregressive prediction with a single non-autoregressive masked transformer that decodes audio in a few parallel passes. The authors include several of the MusicGen authors and report a roughly 7× speed-up at comparable quality.[8][21]
Audiobox (November 2023) is a broader audio foundation model targeting speech, sound effects, and soundscapes, building on Meta's Voicebox speech model. Audiobox is more focused on speech and general audio than MusicGen, but it represents Meta's push to extend controllable audio generation beyond music; it was released to a hand-picked group of researchers rather than as open weights, with audio watermarking and voice authentication safeguards.[24]
JASCO (June 2024) targets fine-grained temporal control of music using flow-matching rather than autoregressive language modelling, with conditioning on chords, melody, and drum tracks. JASCO is offered as a complement rather than a replacement to MusicGen, and weights for several JASCO variants were released openly via the AudioCraft repository under a CC license.[22]
MusicGen-Style (paper 2024) extends the original model with the ability to follow a short reference audio clip's style in addition to (or instead of) text.[9]
MusicGen's openness, technical clarity, and licensed training data made it a reference point for both academic and commercial AI music systems that emerged in 2023-2024. The commercial systems that captured the most public attention during this period, notably Suno and Udio, use different architectures (broadly described as hybrid transformer-and-diffusion systems built on latent representations of full songs rather than instrumental clips) and target full songs with vocals, which MusicGen does not.[25][26] These products did not derive directly from MusicGen, but they competed in a market in which open-weights MusicGen had already established expectations for fidelity, controllability, and the role of training data licensing.[18][27]
Stable Audio, released by Stability AI in September 2023, is closer in spirit to MusicGen (instrumental music, text conditioning, open-research artefacts) but uses a latent-diffusion architecture and a U-Net (later replaced by a diffusion transformer in Stable Audio 2.0) rather than an autoregressive transformer over discrete codes. Stable Audio's training was also done exclusively on licensed music, in this case via a partnership with the AudioSparx library, mirroring the licensing posture Meta had publicly adopted for MusicGen.[28]
For research, MusicGen's release effectively cemented the EnCodec-plus-transformer pattern as a baseline of comparison: a long tail of follow-up papers fine-tune MusicGen, use it as a starting point for new conditioning mechanisms (such as MusiConGen for rhythm and chord control or Instruct-MusicGen for instruction-tuned editing), or evaluate new models against MusicGen variants on MusicCaps.[16][29]
After the initial June 2023 paper and August 2023 AudioCraft release, Meta shipped a series of incremental MusicGen improvements through 2023, most prominently multi-band-diffusion decoding and stereo variants in late 2023, and complemented MusicGen with MAGNeT, MusicGen-Style, and JASCO in 2024.[7][8][9][22] As of 2026, Meta has not released a "MusicGen 2," and the AudioCraft repository remains the primary distribution channel for the original models and their immediate successors; community projects and downstream products continue to use MusicGen weights as a base model.[6][10]