AudioLM

Updated Jun 3, 2026

AudioLM is a framework from Google Research for generating high-quality audio by treating the problem as a language-modeling task over discrete tokens. Introduced in 2022, it converts raw audio into sequences of tokens and then learns to predict those sequences much as a text language model predicts words, producing speech and piano music that stay coherent over long stretches without relying on transcripts or musical notation. The same paper that introduced it laid out the semantic-plus-acoustic token recipe that several later Google audio systems, including MusicLM and SoundStorm, would build on directly. ^[1]^[2]

Background

By the early 2020s, neural audio synthesis had advanced considerably. Autoregressive waveform models such as WaveNet could produce natural-sounding speech sample by sample, and neural codecs were learning compact discrete representations of audio. At the same time, large language models had shown that a single objective, next-token prediction over discrete symbols, could capture rich structure in text. AudioLM brought those two threads together. Its authors observed that audio tokenizers force a trade-off: codecs tuned for faithful reconstruction tend to lose long-range structure, while representations that capture high-level meaning discard the fine detail needed for natural-sounding output. The framework's central contribution was a hybrid tokenization scheme designed to get both at once. ^[1]^[2]

The work was published as an arXiv preprint, "AudioLM: a Language Modeling Approach to Audio Generation" (arXiv:2209.03143), submitted on 7 September 2022, with an accompanying Google Research blog post on 6 October 2022. A revised version appeared in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 31, pages 2523 to 2533, in 2023. The authors were Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. ^[1]^[2]^[3]

How AudioLM works (semantic and acoustic tokens)

AudioLM represents each audio clip with two complementary kinds of tokens that target different properties of the signal.

Semantic tokens come from w2v-BERT, a self-supervised speech model. AudioLM takes the activations of a pretrained w2v-BERT XL model (about 0.6 billion parameters), specifically from the seventh layer of its masked-language-modeling module, and discretizes them with k-means clustering into 1,024 clusters. These tokens capture both local dependencies, such as phonetics in speech or local melody in piano, and global long-term structure, such as syntax and semantic content in speech or harmony and rhythm in music. Because they run at roughly 25 Hz (one token every 40 ms), they heavily downsample the signal, which keeps the modeled sequences short enough to express long-range structure. Their weakness is fidelity: reconstructing audio from semantic tokens alone yields poor quality. ^[1]^[2]
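
The discretization step can be illustrated with a minimal sketch. It assumes the k-means centroids have already been fitted and uses random arrays as stand-ins for real w2v-BERT activations; the function name `assign_semantic_tokens` and the embedding width `EMBED_DIM` are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def assign_semantic_tokens(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest k-means centroid."""
    # Squared Euclidean distance between every frame and every centroid, shape (T, K).
    d2 = (
        (embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * embeddings @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
FRAME_RATE_HZ = 25   # semantic token rate given in the text
NUM_CLUSTERS = 1024  # k-means codebook size given in the text
EMBED_DIM = 16       # hypothetical; the real embedding width is much larger

seconds = 4
frames = rng.normal(size=(seconds * FRAME_RATE_HZ, EMBED_DIM))  # stand-in activations
centroids = rng.normal(size=(NUM_CLUSTERS, EMBED_DIM))          # stand-in fitted centroids

tokens = assign_semantic_tokens(frames, centroids)
print(tokens.shape)  # (100,)
```

Four seconds of audio become only 100 integers, which is why a Transformer over semantic tokens can cover long spans of speech or music.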

Acoustic tokens come from SoundStream, a neural audio codec. SoundStream encodes 16 kHz waveforms into embeddings at 50 Hz and quantizes them with a residual vector quantizer of 12 layers, each with a codebook of 1,024 entries, giving roughly 600 tokens per second at about 6,000 bits per second. These tokens preserve the details of the waveform, including speaker characteristics and recording conditions, and so allow high-quality synthesis. On their own, though, they do not enforce long-term coherence. ^[1]^[2]
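
The token and bit rates follow directly from the quantizer layout, and the residual scheme itself is simple to sketch. The example below is an illustration under stated assumptions, not SoundStream's implementation: the codebooks are random rather than learned, the embedding width `DIM` is hypothetical, and the helper names `rvq_encode` and `rvq_decode` are invented for this sketch.

```python
import math
import numpy as np

FRAME_RATE_HZ = 50    # embedding rate given in the text
NUM_LAYERS = 12       # residual quantizer layers given in the text
CODEBOOK_SIZE = 1024  # entries per codebook given in the text
DIM = 8               # hypothetical embedding width

def rvq_encode(x, codebooks):
    """Residual VQ: each layer quantizes what the previous layers left over."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:  # each codebook has shape (entries, dim)
        dists = ((residual[None, :] - codebook) ** 2).sum(axis=1)
        idx = int(dists.argmin())
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct an embedding by summing the selected entry of every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
# Random stand-in codebooks whose scale shrinks with depth (real ones are learned).
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) * 0.5 ** k for k in range(NUM_LAYERS)]

frame = rng.normal(size=DIM)
indices = rvq_encode(frame, codebooks)
approx = rvq_decode(indices, codebooks)

tokens_per_second = FRAME_RATE_HZ * NUM_LAYERS
bits_per_second = int(tokens_per_second * math.log2(CODEBOOK_SIZE))
print(tokens_per_second, bits_per_second)  # 600 6000
```

Each frame costs 12 indices of 10 bits each, so 50 frames per second give 600 tokens and 6,000 bits per second.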

The two token types are complementary: semantic tokens score well on phonetic discriminability but poorly on reconstruction, while acoustic tokens are the reverse. AudioLM exploits this by modeling them together in a hierarchy.

Token type	Source model	Approx. rate	Captures	Limitation
Semantic	w2v-BERT XL (~0.6B params), layer 7, k-means (1,024 clusters)	~25 Hz	Linguistic content, long-term structure, melody/harmony	Low reconstruction fidelity
Acoustic	SoundStream codec (12 RVQ layers, codebook 1,024)	~50 Hz embeddings, ~600 tokens/s	Speaker identity, recording conditions, waveform detail	Weak long-term structure

Generation proceeds through three stages, each handled by a decoder-only Transformer:

1. Semantic modeling. The model autoregressively predicts the sequence of semantic tokens, establishing the high-level structure of the output.
2. Coarse acoustic modeling. Conditioned on the semantic tokens, the model predicts the coarse SoundStream tokens (the first four quantizer layers), which add speaker or instrument characteristics.
3. Fine acoustic modeling. Conditioned on the coarse acoustic tokens, the model predicts the remaining fine quantizer layers, sharpening waveform detail. The fine stage operates on short audio chunks (about three seconds).

After the final stage, the full set of acoustic tokens is fed to the SoundStream decoder, which reconstructs the waveform. ^[1]^[2]
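
The data flow through the three stages can be sketched with stand-in samplers. This is a shape-bookkeeping illustration only: the three `sample_*` functions are hypothetical stubs that draw random tokens instead of running Transformers, they ignore their conditioning inputs, and the acoustic tokens of the prompt are left out for brevity.

```python
import numpy as np

SEMANTIC_RATE_HZ = 25   # semantic tokens per second
ACOUSTIC_RATE_HZ = 50   # SoundStream frames per second
COARSE_LAYERS = 4       # first four quantizer layers
FINE_LAYERS = 8         # remaining layers (12 in total)
FINE_CHUNK_SECONDS = 3  # the fine stage works on short chunks
VOCAB = 1024

rng = np.random.default_rng(0)

def sample_semantic(prompt, num_new):
    """Stage 1 stand-in: extend the semantic token sequence."""
    return np.concatenate([prompt, rng.integers(0, VOCAB, size=num_new)])

def sample_coarse(semantic, num_frames):
    """Stage 2 stand-in: coarse tokens, nominally conditioned on semantic tokens."""
    return rng.integers(0, VOCAB, size=(num_frames, COARSE_LAYERS))

def sample_fine(coarse):
    """Stage 3 stand-in: fine tokens, produced one short chunk at a time."""
    chunk = FINE_CHUNK_SECONDS * ACOUSTIC_RATE_HZ
    pieces = []
    for start in range(0, coarse.shape[0], chunk):
        n = coarse[start:start + chunk].shape[0]
        pieces.append(rng.integers(0, VOCAB, size=(n, FINE_LAYERS)))
    return np.concatenate(pieces, axis=0)

def generate(prompt_semantic, total_seconds):
    total_semantic = total_seconds * SEMANTIC_RATE_HZ
    semantic = sample_semantic(prompt_semantic, total_semantic - len(prompt_semantic))
    coarse = sample_coarse(semantic, total_seconds * ACOUSTIC_RATE_HZ)
    fine = sample_fine(coarse)
    # All 12 layers per frame are what a codec decoder would turn into a waveform.
    return semantic, np.concatenate([coarse, fine], axis=1)

prompt = rng.integers(0, VOCAB, size=3 * SEMANTIC_RATE_HZ)  # a 3-second prompt
semantic, acoustic = generate(prompt, total_seconds=10)
print(semantic.shape, acoustic.shape)  # (250,) (500, 12)
```

Ten seconds of audio need 250 semantic tokens but 6,000 acoustic tokens, which illustrates why the long-range structure is decided in the first stage.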

Capabilities (speech and piano continuation)

Trained on speech alone and given a short spoken prompt, AudioLM generates continuations that are syntactically and semantically plausible while preserving the original speaker's identity, prosody, accent, and recording conditions, even for speakers it had not encountered during training. In the authors' listening tests, raters distinguished real speech from AudioLM continuations only 51.2 percent of the time, close to the 50 percent chance level, indicating the generated speech was nearly indistinguishable from genuine recordings. Crucially, none of this depended on text: the model never saw transcripts. ^[1]^[2]

The same framework generalized beyond speech. Trained on piano recordings without any symbolic score, AudioLM extended short piano prompts into longer passages that kept melody, harmony, rhythm, and a consistent style, despite working only from raw waveforms. This demonstrated that the semantic-plus-acoustic approach was not specific to language and could capture musical structure too. ^[1]^[2]

Because near-perfect synthetic speech raises misuse concerns, the authors also trained a classifier to detect AudioLM-generated audio. It identified synthetic samples with 98.6 percent accuracy, suggesting that such audio remains detectable even when humans cannot tell it apart by ear. ^[2]

Training

For the speech experiments, every component of AudioLM (the SoundStream codec, the w2v-BERT model, the k-means quantizer over the w2v-BERT embeddings, and the three decoder-only Transformers) was trained on the unlab-60k split of Libri-Light, about 60,000 hours of English speech derived from audiobooks. For music, the team used an internal dataset of roughly 40,000 hours of piano recordings spanning players from beginner to expert and a wide range of repertoire and acoustic conditions. In both cases the training signal was audio only, with no transcripts, labels, or notation, which is what makes the model's grasp of linguistic and musical structure notable: it learned that structure purely from listening. ^[1]^[2]

Influence (MusicLM, SoundStorm)

AudioLM's decomposition of audio generation into a semantic stage and an acoustic stage became a template for subsequent Google audio research.

MusicLM, introduced in early 2023, casts text-to-music generation as a hierarchical sequence-to-sequence task built on AudioLM's framework. It keeps the semantic and acoustic modeling stages but conditions generation on text through an additional component (MuLan audio and text embeddings), reusing SoundStream and w2v-BERT tokens to produce music at 24 kHz that stays consistent over minutes. In effect, MusicLM adds text control on top of the AudioLM recipe. ^[4]

SoundStorm, released in 2023, targets the speed of the acoustic stage. AudioLM generates acoustic tokens autoregressively, one at a time, which is slow for long sequences. SoundStorm instead takes AudioLM's semantic tokens as input and produces the SoundStream tokens in parallel using a bidirectional Conformer with confidence-based parallel decoding. It is designed as a drop-in replacement for both of AudioLM's acoustic modeling stages (coarse and fine), matching their quality with improved consistency while generating audio about two orders of magnitude faster. MusicLM in turn adopted SoundStorm to synthesize longer outputs more efficiently. ^[5]
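
The idea behind confidence-based parallel decoding can be sketched in a few lines. This is a heavily simplified illustration, not SoundStorm's algorithm as published: it handles a single token level rather than the full residual hierarchy, picks the most likely token greedily, uses an assumed cosine masking schedule, and replaces the Conformer with a hypothetical `fake_model` that returns random probabilities.

```python
import math
import numpy as np

MASK = -1  # sentinel for positions that have not been decoded yet

def fake_model(tokens, vocab, rng):
    """Stand-in for the bidirectional network: per-position token probabilities."""
    logits = rng.normal(size=(len(tokens), vocab))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def parallel_decode(length, vocab, steps, rng):
    tokens = np.full(length, MASK)
    for step in range(1, steps + 1):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = fake_model(tokens, vocab, rng)
        candidates = probs[masked].argmax(axis=1)
        confidence = probs[masked].max(axis=1)
        # Cosine schedule: how many positions may stay masked after this step.
        keep_masked = int(math.floor(length * math.cos(math.pi / 2 * step / steps)))
        num_to_fix = max(1, masked.size - keep_masked)
        # Commit only the most confident predictions; the rest stay masked.
        best = np.argsort(-confidence)[:num_to_fix]
        tokens[masked[best]] = candidates[best]
    return tokens

rng = np.random.default_rng(0)
decoded = parallel_decode(length=100, vocab=1024, steps=8, rng=rng)
print(int((decoded == MASK).sum()))  # 0
```

Here 100 positions are filled with at most 8 model calls, whereas token-by-token autoregressive decoding would need 100; that gap is the source of the speed-up described above.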

System	Year	Builds on AudioLM by	Key change
MusicLM	2023	Reusing the semantic/acoustic hierarchy	Adds text conditioning for music generation
SoundStorm	2023	Consuming AudioLM's semantic tokens	Replaces both acoustic stages with parallel decoding (~100x faster)

Significance

AudioLM showed that a tokenize-then-predict pipeline, the same paradigm driving large language models, could be applied directly to raw audio and yield coherent, high-fidelity results without text or symbolic supervision. Its hybrid tokenization, pairing semantic tokens for long-term structure with acoustic tokens for fidelity, addressed the trade-off between coherence and reconstruction quality in audio generation and proved general enough to span both speech and music. By establishing the semantic-then-acoustic recipe, it provided the architectural foundation for a family of follow-on systems and helped frame audio as another modality amenable to language-modeling techniques. ^[1]^[2]^[4]^[5]

References

1. Borsos, Z. et al. "AudioLM: a Language Modeling Approach to Audio Generation." arXiv:2209.03143 (2022). https://arxiv.org/abs/2209.03143
2. Borsos, Z. and Zeghidour, N. "AudioLM: a Language Modeling Approach to Audio Generation." Google Research Blog, 6 October 2022. https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/
3. Borsos, Z. et al. "AudioLM: A Language Modeling Approach to Audio Generation." IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2523-2533 (2023). https://dl.acm.org/doi/10.1109/TASLP.2023.3288409
4. Agostinelli, A. et al. "MusicLM: Generating Music From Text." arXiv:2301.11325 (2023). https://arxiv.org/abs/2301.11325
5. Borsos, Z. et al. "SoundStorm: Efficient Parallel Audio Generation." Google Research Blog, 22 June 2023. https://research.google/blog/soundstorm-efficient-parallel-audio-generation/
