MusicLM

Updated Jun 24, 2026

MusicLM is a text-to-music generation model from Google Research that generates high-fidelity music at 24 kHz from natural language descriptions and keeps that audio consistent over several minutes. Introduced in a paper posted to arXiv on 26 January 2023, MusicLM casts conditional music generation as a "hierarchical sequence-to-sequence modeling task." It learns to produce audio as a sequence of discrete tokens rather than splicing pre-recorded loops, so a prompt like "a calming violin melody backed by a distorted guitar riff" renders as a coherent instrumental piece ^[1]^[2]. The model became one of the most widely cited results in the early wave of AI music generation and reached the public through Google's AI Test Kitchen in May 2023.

Background

By late 2022 and early 2023, generative models had moved well beyond images and text into audio. Google's own AudioLM had shown that speech and piano music could be generated as a language-modeling problem over discrete audio tokens, and a separate line of work on joint music-text embeddings had matured enough to connect written prompts to sound. MusicLM sits at the intersection of those two ideas. The paper was authored by a thirteen-person team across Google: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Their stated goal was to generate "high-fidelity music from text descriptions" while keeping the output musically consistent over long durations ^[1]^[2].

The core technical claim is that conditional music generation can be cast as a hierarchical sequence-to-sequence modeling task. In plain terms, the model breaks the hard problem of "make music that matches this sentence" into ordered stages, each handled by its own Transformer. Earlier stages settle coarse musical structure, and later stages add fine acoustic detail that is then decoded into sound ^[1]^[2].

How does MusicLM work (AudioLM and MuLan)?

MusicLM is built on top of AudioLM and inherits its strategy of treating audio as a stream of discrete tokens. The paper describes three additions that turn AudioLM into a text-conditioned music model: conditioning generation on a descriptive text prompt, extending that conditioning to other signals such as a melody, and modeling a wide variety of long music sequences rather than just piano ^[3].

The bridge from words to sound is MuLan, a joint music-text embedding model published by Google in 2022. MuLan is a two-tower network trained on roughly 44 million music recordings, about 370,000 hours, paired with weakly associated free-form text. It learns to place a piece of music and a description of that music close together in a shared embedding space. Because MuLan can map either text or audio into the same space, MusicLM trains its generative stages on audio alone and only needs text at inference time, when a user's prompt is converted into a MuLan embedding that steers the generation ^[3]^[4].
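
The shared-space idea can be illustrated with a toy example. The sketch below is not MuLan: the three-dimensional vectors and the names `cosine_similarity` and `closest_description` are invented for illustration and stand in for the outputs of trained text and audio towers. It only shows why a text embedding can take the place of an audio embedding once both live in the same space.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text-tower outputs (made-up numbers, not real MuLan embeddings).
text_embeddings = {
    "calming violin melody": [0.9, 0.1, 0.0],
    "distorted guitar riff": [0.1, 0.9, 0.2],
    "fast techno beat": [0.0, 0.2, 0.9],
}

# Hypothetical audio-tower output for a violin recording.
violin_clip_embedding = [0.8, 0.2, 0.1]

def closest_description(audio_embedding, candidates):
    """Return the description whose embedding is most similar to the audio."""
    return max(
        candidates,
        key=lambda text: cosine_similarity(audio_embedding, candidates[text]),
    )

best = closest_description(violin_clip_embedding, text_embeddings)
print(best)  # calming violin melody
```

In this picture, a generative stage trained with audio-side vectors as conditioning can be steered at inference time by the text-side vector for a prompt, to the extent that the two towers place matching music and text close together.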

Two other components fill out the pipeline. Semantic tokens come from a model called w2v-BERT and capture high-level structure such as melody and rhythm. Acoustic tokens come from SoundStream, a neural audio codec that compresses sound at a low bitrate while preserving fidelity; the paper uses a SoundStream configuration that yields 50 Hz embeddings with twelve quantizers at a 6 kbps bitrate. The model then runs in stages: a semantic stage maps MuLan audio tokens to semantic tokens, and an acoustic stage predicts the SoundStream acoustic tokens conditioned on both the MuLan tokens and the semantic tokens. SoundStream finally decodes those acoustic tokens back into a waveform ^[3].
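
The SoundStream figures quoted above fit together arithmetically. The sketch below assumes each quantizer uses a 1,024-entry codebook (10 bits per token), a value that is not stated in this article; the constant and function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 24_000   # output sample rate, from the article
FRAME_RATE_HZ = 50        # SoundStream embeddings per second, from the article
NUM_QUANTIZERS = 12       # quantizers per frame, from the article
CODEBOOK_SIZE = 1024      # assumption: entries in each quantizer's codebook

bits_per_token = int(math.log2(CODEBOOK_SIZE))        # 10 bits
tokens_per_second = FRAME_RATE_HZ * NUM_QUANTIZERS     # 600 tokens
bitrate_bps = tokens_per_second * bits_per_token       # 6000 bits per second
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ    # 480 samples

def acoustic_tokens_for(seconds):
    """Number of acoustic tokens needed to cover `seconds` of audio."""
    return seconds * tokens_per_second

print(bitrate_bps / 1000, "kbps")
print(samples_per_frame, "samples per frame")
print(acoustic_tokens_for(20), "tokens for a 20-second clip")
```

Under that assumption the product is 6,000 bits per second, which matches the 6 kbps figure, and a 20-second clip corresponds to 12,000 acoustic tokens. That token count hints at why long-range structure is carried by a separate, coarser token stream.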

The pieces and their roles:

Component	Role in MusicLM
MuLan	Joint music-text embedding that turns a prompt into a conditioning signal
w2v-BERT	Produces semantic tokens capturing high-level musical structure
SoundStream	Neural codec providing acoustic tokens for high-fidelity synthesis
AudioLM framework	Hierarchical token-modeling approach MusicLM extends
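
The order in which these components hand data to one another can be sketched as a chain of stub functions. This is a dataflow illustration only, not Google's code: no trained model is involved, every function emits placeholder tokens or silence, and the function names, the 25 Hz semantic rate, the conditioning length, and the vocabulary size are assumptions made for the example.

```python
import random
import zlib

FRAME_RATE_HZ = 50        # acoustic frames per second, from the article
NUM_QUANTIZERS = 12       # acoustic tokens per frame, from the article
SAMPLE_RATE_HZ = 24_000   # output sample rate, from the article
SEMANTIC_RATE_HZ = 25     # assumption: semantic tokens per second
NUM_MULAN_TOKENS = 12     # assumption: length of the conditioning sequence
VOCAB = 1024              # assumption: vocabulary size used by every stub

def _rng(*parts):
    """Deterministic random generator seeded from the given values."""
    return random.Random(zlib.crc32(repr(parts).encode("utf-8")))

def mulan_tokens(prompt):
    """Stub for MuLan: text prompt -> conditioning tokens."""
    rng = _rng("mulan", prompt)
    return [rng.randrange(VOCAB) for _ in range(NUM_MULAN_TOKENS)]

def semantic_stage(cond, seconds):
    """Stub for the semantic stage: conditioning -> semantic tokens."""
    rng = _rng("semantic", cond)
    return [rng.randrange(VOCAB) for _ in range(seconds * SEMANTIC_RATE_HZ)]

def acoustic_stage(cond, semantic, seconds):
    """Stub for the acoustic stage: conditioning + semantic -> acoustic frames."""
    rng = _rng("acoustic", cond, semantic)
    return [
        [rng.randrange(VOCAB) for _ in range(NUM_QUANTIZERS)]
        for _ in range(seconds * FRAME_RATE_HZ)
    ]

def soundstream_decode(frames):
    """Stub for the codec decoder: acoustic frames -> silent waveform."""
    samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ
    return [0.0] * (len(frames) * samples_per_frame)

def generate(prompt, seconds):
    """Run the stubs in the order the article describes."""
    cond = mulan_tokens(prompt)
    semantic = semantic_stage(cond, seconds)
    frames = acoustic_stage(cond, semantic, seconds)
    return soundstream_decode(frames)

waveform = generate("a calming violin melody", seconds=2)
print(len(waveform))  # 48000 samples, i.e. 2 seconds at 24 kHz
```

The point of the sketch is the dependency structure: the acoustic stage sees both the conditioning tokens and the semantic tokens, while the semantic stage sees only the conditioning.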

What is the MusicCaps dataset?

To support evaluation and future research, the team released MusicCaps, described in the paper as "a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians." The released dataset contains exactly 5,521 examples, each pairing a 10-second music clip drawn from Google's AudioSet with an English caption written by professional musicians. The paper notes that ten musicians wrote the descriptions and that captions run to about four sentences on average, going beyond simple genre labels to describe instrumentation, mood, tempo, and production detail ^[3]^[5]. The dataset was published openly, and it has since been reused as a benchmark by many other text-to-music systems, including Meta's MusicGen ^[1]^[5].
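
A quick calculation puts the size of MusicCaps in perspective. The sketch uses only figures quoted in this article (5,521 clips of 10 seconds each, and roughly 370,000 hours of MuLan training audio); the variable names are illustrative, and the comparison is about scale only, since the two collections serve different purposes.

```python
NUM_EXAMPLES = 5_521          # MusicCaps examples, from the article
CLIP_SECONDS = 10             # length of each clip, from the article
MULAN_TRAINING_HOURS = 370_000  # approximate MuLan training audio, from the article

total_seconds = NUM_EXAMPLES * CLIP_SECONDS
total_hours = total_seconds / 3600
scale_ratio = MULAN_TRAINING_HOURS / total_hours

print(total_seconds)          # 55210
print(round(total_hours, 1))  # 15.3
print(round(scale_ratio))
```

At about 15 hours of audio, MusicCaps is an evaluation-sized set: MuLan's training audio is larger by a factor of roughly 24,000.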

What can MusicLM do?

MusicLM generates audio at a 24 kHz sample rate and, according to the paper, keeps that output consistent over several minutes, which was a notable departure from earlier systems that tended to drift or lose structure quickly ^[1]^[3]. The paper reports that MusicLM "outperforms previous systems both in audio quality and adherence to the text description" ^[1]. The prompts can be detailed: the model responds to descriptions of genre, instruments, mood, and even an imagined setting.

A second mode is melody conditioning. The paper shows that MusicLM can be conditioned on both a text description and a melody supplied "in the form of humming, singing, whistling, or playing an instrument," so a person can hum a tune and ask the model to render it in a described style ^[3]. The authors also studied a risk specific to generative audio: memorization of training data. Adapting a methodology developed for text language models, they reported that only a tiny fraction of examples were reproduced exactly, while for about 1 percent of examples they could identify an approximate match ^[3].
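
The distinction between exact and approximate matches can be made concrete with a toy check over token sequences. This is a minimal sketch and not the authors' procedure: the position-wise agreement score, the 0.8 threshold, and the function names `classify_match` and `match_rates` are assumptions chosen for illustration.

```python
def classify_match(generated, reference, approx_threshold=0.8):
    """Label a generated token sequence as 'exact', 'approximate', or 'none'.

    approx_threshold is a hypothetical cutoff on the fraction of positions
    that agree; the paper's actual approximate-match criterion may differ.
    """
    if len(generated) != len(reference):
        raise ValueError("sequences must have the same length")
    if generated == reference:
        return "exact"
    agreement = sum(g == r for g, r in zip(generated, reference)) / len(reference)
    return "approximate" if agreement >= approx_threshold else "none"

def match_rates(pairs, approx_threshold=0.8):
    """Fraction of (generated, reference) pairs falling under each label."""
    counts = {"exact": 0, "approximate": 0, "none": 0}
    for generated, reference in pairs:
        counts[classify_match(generated, reference, approx_threshold)] += 1
    return {label: count / len(pairs) for label, count in counts.items()}

# Made-up token sequences standing in for generated and training audio.
pairs = [
    ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),  # identical
    ([1, 2, 3, 4, 9], [1, 2, 3, 4, 5]),  # 4 of 5 positions agree
    ([9, 8, 7, 6, 5], [1, 2, 3, 4, 5]),  # 1 of 5 positions agrees
    ([5, 4, 3, 2, 1], [1, 2, 3, 4, 5]),  # 1 of 5 positions agrees
]

rates = match_rates(pairs)
print(rates)  # {'exact': 0.25, 'approximate': 0.25, 'none': 0.5}
```

Reporting the two rates separately matters because a strict equality test alone would miss near-copies, which is the gap the approximate-match figure is meant to cover.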

When was MusicLM released to the public?

When the paper appeared in January 2023, Google said it had no immediate plans to release MusicLM, citing ethical challenges including the model's tendency to incorporate copyrighted material from its training data into generated songs ^[6].

That position softened a few months later. On 10 May 2023, Google opened a version of MusicLM to the public through its AI Test Kitchen app on the web, Android, and iOS, describing it as "an experimental text-to-music model that can generate unique songs based on your ideas or descriptions" ^[2]. Users type a prompt, receive two generated versions to compare, and "give a trophy to the track that you like better, which will help improve the model" ^[2]^[7]. The public clips are short, reported at 20 seconds each, and can be downloaded as MP3 files. As a guardrail against the copyright issues flagged earlier, the AI Test Kitchen version deliberately refuses to produce vocals or to imitate specific artists or named individuals ^[7]^[6].

Key facts:

Aspect	Detail
Developer	Google Research
Paper posted	26 January 2023 (arXiv 2301.11325)
Sample rate	24 kHz
Duration (paper)	Consistent over several minutes
MusicCaps	5,521 music-text pairs, 10-second clips, written by 10 musicians
Public preview	AI Test Kitchen, 10 May 2023
Public clip length	About 20 seconds, MP3 download

How does MusicLM relate to MusicFX and Lyria?

The AI Test Kitchen experiment was later upgraded and rebranded as MusicFX, which rolled out to select users in December 2023 and reached broader availability in early February 2024. MusicFX was explicitly described as an upgrade to MusicLM, generating clips up to 70 seconds and adding loop and extension features, with audio watermarked using DeepMind's SynthID technology ^[8]. Google's music-generation work subsequently consolidated under Google DeepMind and its Lyria family of models, with Lyria 2 (announced April 2025) powering newer creation tools and a real-time interface that integrates with the MusicFX product line.

Why does MusicLM matter, and what about copyright?

MusicLM mattered less as a finished product than as a demonstration. It showed that a single text prompt could drive minutes of structured, multi-instrument audio at usable fidelity, and it did so by reusing a tokenization-and-Transformer recipe that had already proved itself on speech and images. The decision to release MusicCaps gave the wider field a shared evaluation set, which helped subsequent systems measure themselves against a common reference ^[1]^[3]^[5].

The launch also surfaced the legal and ethical tension that has shadowed generative music ever since. Google's own authors acknowledged the risk that models trained on large music corpora can reproduce copyrighted material. The company's initial reluctance to release MusicLM, followed by a public version stripped of vocals and artist imitation, reflected an attempt to manage that risk before wider deployment ^[6]^[7]. Those concerns about training data, attribution, and the rights of working musicians have remained central to debates over later text-to-music tools such as Suno and Udio.

References

Related Articles

AudioLM

Magenta (project)

Suno

Udio

Stable Audio

Lyria
