MusicLM
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v2 · 1,547 words
MusicLM is a text-to-music model from Google Research that generates high-fidelity music at 24 kHz from natural language descriptions and keeps that audio consistent over several minutes. Introduced in a paper posted to arXiv on 26 January 2023, it casts conditional music generation as a "hierarchical sequence-to-sequence modeling task" [1][2]. The model learns to produce audio as a sequence of discrete tokens rather than splicing pre-recorded loops, so a prompt like "a calming violin melody backed by a distorted guitar riff" renders as a coherent piece of music [1][2]. MusicLM became one of the most widely cited results in the early wave of AI music generation and reached the public through Google's AI Test Kitchen in May 2023.
By late 2022 and early 2023, generative models had moved well beyond images and text into audio. Google's own AudioLM had shown that speech and piano music could be generated as a language-modeling problem over discrete audio tokens, and a separate line of work on joint music-text embeddings had matured enough to connect written prompts to sound. MusicLM sits at the intersection of those two ideas. The paper was written by a thirteen-person team across Google: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Their stated goal was to generate "high-fidelity music from text descriptions" while keeping the output musically consistent over long durations [1][2].
The core technical claim is that conditional music generation can be cast as a hierarchical sequence-to-sequence modeling task. In plain terms, the model breaks the hard problem of "make music that matches this sentence" into ordered stages, each handled by its own Transformer, and each producing a different layer of audio detail before the layers are combined into sound [1][2].
MusicLM is built on top of AudioLM and inherits its strategy of treating audio as a stream of discrete tokens. The paper describes three additions that turn AudioLM into a text-conditioned music model: conditioning generation on a descriptive text prompt, extending that conditioning to other signals such as a melody, and modeling a wide variety of long music sequences rather than just piano [3].
The bridge from words to sound is MuLan, a joint music-text embedding model published by Google in 2022. MuLan is a two-tower network trained on roughly 44 million music recordings, about 370,000 hours, paired with weakly associated free-form text. It learns to place a piece of music and a description of that music close together in a shared embedding space. Because MuLan can map either text or audio into the same space, MusicLM trains its generative stages on audio alone and only needs text at inference time, when a user's prompt is converted into a MuLan embedding that steers the generation [3][4].
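The shared-embedding idea can be illustrated with a toy example. The sketch below is not MuLan: the three-dimensional vectors, the clip names, and the `nearest_clip` helper are all hypothetical stand-ins chosen for illustration. It only shows why a text embedding can substitute for an audio embedding once both live in the same space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical audio-tower outputs (real embeddings are learned and much larger).
audio_embeddings = {
    "calm_violin_clip": [0.9, 0.1, 0.0],
    "distorted_guitar_clip": [0.1, 0.9, 0.2],
}

# Hypothetical text-tower output for a prompt such as "a calming violin melody".
text_embedding = [0.8, 0.2, 0.1]

def nearest_clip(query, clips):
    """Return the name of the clip whose embedding is most similar to the query."""
    return max(clips, key=lambda name: cosine(query, clips[name]))

# Because text and audio share one space, the text vector lands near the
# matching audio vector and can stand in for it as a conditioning signal.
best = nearest_clip(text_embedding, audio_embeddings)
print(best)
```

In this toy setup the prompt vector sits closest to the violin clip, which mirrors how a generator trained on audio embeddings could plausibly be steered by a text embedding at inference time.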
Two other components fill out the pipeline. Semantic tokens come from a model called w2v-BERT and capture high-level structure such as melody and rhythm. Acoustic tokens come from SoundStream, a neural audio codec that compresses sound at a low bitrate while preserving fidelity; the paper uses a SoundStream configuration that yields 50 Hz embeddings with twelve quantizers at a 6 kbps bitrate. The model then runs in stages: a semantic stage maps MuLan audio tokens to semantic tokens, and an acoustic stage predicts the SoundStream acoustic tokens conditioned on both the MuLan tokens and the semantic tokens. SoundStream finally decodes those acoustic tokens back into a waveform [3].
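The SoundStream figures quoted above fit together arithmetically. The following is a back-of-the-envelope check, assuming every acoustic token is coded with the same fixed number of bits; the implied codebook size is derived from the stated numbers here rather than quoted from this article.

```python
import math

# Figures stated in the text.
SAMPLE_RATE_HZ = 24_000   # output audio sample rate
FRAME_RATE_HZ = 50        # SoundStream embeddings per second
NUM_QUANTIZERS = 12       # quantizers applied to each embedding
BITRATE_BPS = 6_000       # 6 kbps

# Each frame yields one token per quantizer.
tokens_per_second = FRAME_RATE_HZ * NUM_QUANTIZERS      # 600

# Assumption: fixed-length coding, so bits per token is bitrate / token rate.
bits_per_token = BITRATE_BPS / tokens_per_second        # 10.0
implied_codebook_size = 2 ** int(bits_per_token)        # 1024 entries per quantizer

# Each 50 Hz frame summarizes this many waveform samples.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ     # 480

print(tokens_per_second, bits_per_token, implied_codebook_size, samples_per_frame)
```

Under that assumption, one second of 24 kHz audio is represented by 600 acoustic tokens instead of 24,000 waveform samples, which is what makes token-by-token modeling over long durations tractable.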
The pieces and their roles:
| Component | Role in MusicLM |
|---|---|
| MuLan | Joint music-text embedding that turns a prompt into a conditioning signal |
| w2v-BERT | Produces semantic tokens capturing high-level musical structure |
| SoundStream | Neural codec providing acoustic tokens for high-fidelity synthesis |
| AudioLM framework | Hierarchical token-modeling approach MusicLM extends |
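The order in which these components hand data to one another can be sketched as a chain of function calls. Everything below is a placeholder: the function names are hypothetical, and each body is a trivial stand-in for what in the real system is a trained model. The sketch only shows the data flow of the staged design, in which the acoustic stage sees both the MuLan tokens and the semantic tokens.

```python
from typing import List

Tokens = List[int]

def mulan_tokens_from_text(prompt: str) -> Tokens:
    """Placeholder for embedding and quantizing a text prompt."""
    return [ord(ch) % 8 for ch in prompt[:4]]

def semantic_stage(mulan: Tokens) -> Tokens:
    """Placeholder for the stage that maps MuLan tokens to semantic tokens."""
    return [t + 100 for t in mulan]

def acoustic_stage(mulan: Tokens, semantic: Tokens) -> Tokens:
    """Placeholder for the stage conditioned on MuLan and semantic tokens."""
    return [m + s for m, s in zip(mulan, semantic)]

def soundstream_decode(acoustic: Tokens) -> List[float]:
    """Placeholder for decoding acoustic tokens into waveform samples."""
    return [a / 1000.0 for a in acoustic]

def generate(prompt: str) -> List[float]:
    """Run the stages in order: text -> MuLan -> semantic -> acoustic -> audio."""
    mulan = mulan_tokens_from_text(prompt)
    semantic = semantic_stage(mulan)
    acoustic = acoustic_stage(mulan, semantic)
    return soundstream_decode(acoustic)

waveform = generate("jazz")
print(len(waveform))
```

The point of the staging is that coarse musical structure is fixed before fine acoustic detail is filled in, so each stage solves a narrower problem than a single end-to-end model would.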
To support evaluation and future research, the team released MusicCaps, described in the paper as "a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians." The released dataset contains exactly 5,521 examples, each pairing a 10-second music clip drawn from Google's AudioSet with an English caption written by professional musicians. The paper notes that ten musicians wrote the descriptions and that captions run to about four sentences on average, going beyond simple genre labels to describe instrumentation, mood, tempo, and production detail [3][5]. The dataset was published openly, and it has since been reused as a benchmark by many other text-to-music systems, including Meta's MusicGen [1][5].
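For a sense of scale, the figures above imply a fairly small amount of audio. The calculation below assumes every clip is exactly 10 seconds long, and compares the result with the roughly 370,000 hours of recordings reported for MuLan's training data.

```python
# Figures stated in the text.
NUM_EXAMPLES = 5_521          # MusicCaps music-text pairs
CLIP_SECONDS = 10             # assumed exact length of every clip
MULAN_TRAINING_HOURS = 370_000

total_seconds = NUM_EXAMPLES * CLIP_SECONDS          # 55,210 seconds
total_hours = total_seconds / 3600                   # a little over 15 hours

# How many times larger the MuLan training audio is than MusicCaps.
scale_ratio = MULAN_TRAINING_HOURS / total_hours

print(f"{total_hours:.1f} hours of MusicCaps audio")
print(f"MuLan training audio is roughly {scale_ratio:,.0f}x larger")
```

The gap is consistent with MusicCaps serving as a carefully captioned evaluation set rather than as training-scale data.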
MusicLM generates audio at a 24 kHz sample rate and, according to the paper, keeps that output consistent over several minutes, which was a notable departure from earlier systems that tended to drift or lose structure quickly [1][3]. The paper reports that MusicLM "outperforms previous systems both in audio quality and adherence to the text description" [1]. The prompts can be detailed: the model responds to descriptions of genre, instruments, mood, and even an imagined setting.
A second mode is melody conditioning. The paper shows that MusicLM can be conditioned on both a text description and a melody supplied "in the form of humming, singing, whistling, or playing an instrument," so a person can hum a tune and ask the model to render it in a described style [3]. The authors also studied a risk specific to generative audio: memorization of training data. Adapting a methodology developed for text language models, they reported that only a tiny fraction of examples were reproduced exactly, while for about 1 percent of examples they could identify an approximate match [3].
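The distinction between exact and approximate matches can be made concrete with a simplified check over token sequences. This is not the paper's procedure: the histogram-overlap score, the `threshold` value, and the toy token lists are assumptions chosen to illustrate how an approximate-match rate can exceed an exact-match rate.

```python
from collections import Counter

def exact_match(generated, target):
    """True only if the two token sequences are identical."""
    return generated == target

def histogram_overlap(generated, target):
    """Share of tokens the two sequences have in common, ignoring order."""
    longest = max(len(generated), len(target))
    if longest == 0:
        return 0.0
    shared = sum((Counter(generated) & Counter(target)).values())
    return shared / longest

def memorization_rates(pairs, threshold=0.9):
    """Return (exact-match rate, approximate-match rate) over all pairs."""
    if not pairs:
        return 0.0, 0.0
    exact = sum(exact_match(g, t) for g, t in pairs)
    approx = sum(histogram_overlap(g, t) >= threshold for g, t in pairs)
    return exact / len(pairs), approx / len(pairs)

# Toy (generated, training) token pairs.
pairs = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),   # identical
    ([1, 2, 3, 4], [4, 3, 2, 1]),   # same tokens, different order
    ([1, 2, 3, 4], [5, 6, 7, 8]),   # unrelated
    ([1, 1, 2, 2], [1, 2, 9, 9]),   # partial overlap
]

exact_rate, approx_rate = memorization_rates(pairs)
print(exact_rate, approx_rate)
```

A looser similarity criterion necessarily flags at least as many examples as exact matching, which is why the two rates are worth reporting separately.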
When the paper appeared in January 2023, Google said it had no immediate plans to release MusicLM, citing ethical challenges including the model's tendency to incorporate copyrighted material from its training data into generated songs [6].
That position softened a few months later. On 10 May 2023, Google opened a version of MusicLM to the public through its AI Test Kitchen app on the web, Android, and iOS, describing it as "an experimental text-to-music model that can generate unique songs based on your ideas or descriptions" [2]. Users type a prompt, receive two generated versions to compare, and "give a trophy to the track that you like better, which will help improve the model" [2][7]. The public clips are short, reported at 20 seconds each, and can be downloaded as MP3 files [7]. The AI Test Kitchen version also deliberately refuses to produce vocals or to imitate specific artists or named individuals, a guardrail against the copyright issues flagged earlier [6][7].
Key facts:
| Aspect | Detail |
|---|---|
| Developer | Google Research |
| Paper posted | 26 January 2023 (arXiv 2301.11325) |
| Sample rate | 24 kHz |
| Duration (paper) | Consistent over several minutes |
| MusicCaps | 5,521 music-text pairs, 10-second clips, written by 10 musicians |
| Public preview | AI Test Kitchen, 10 May 2023 |
| Public clip length | About 20 seconds, MP3 download |
The AI Test Kitchen experiment was later upgraded and rebranded as MusicFX, which rolled out to select users in December 2023 and reached broader availability in early February 2024. MusicFX was explicitly described as an upgrade to MusicLM, generating clips up to 70 seconds and adding loop and extension features, with audio watermarked using DeepMind's SynthID technology [8]. Google's music-generation work subsequently consolidated under Google DeepMind and its Lyria family of models, with Lyria 2 (announced April 2025) powering newer creation tools and a real-time interface that integrates with the MusicFX product line.
MusicLM mattered less as a finished product than as a demonstration. It showed that a single text prompt could drive minutes of structured, multi-instrument audio at usable fidelity, and it did so by reusing a tokenization-and-Transformer recipe that had already proved itself on speech and images. The decision to release MusicCaps gave the wider field a shared evaluation set, which helped subsequent systems measure themselves against a common reference [1][3][5].
The launch also surfaced the legal and ethical tension that has shadowed generative music ever since. Google's own authors acknowledged the risk that models trained on large music corpora can reproduce copyrighted material. The company's initial reluctance to release MusicLM, followed by a public version stripped of vocals and artist imitation, reflected an attempt to manage that risk before wider deployment [6][7]. Those concerns about training data, attribution, and the rights of working musicians have remained central to debates over later text-to-music tools such as Suno and Udio.