AudioLM
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,450 words
AudioLM is a framework from Google Research for generating high-quality audio by treating the problem as a language-modeling task over discrete tokens. Introduced in 2022, it converts raw audio into sequences of tokens and then learns to predict those sequences much as a text language model predicts words, producing speech and piano music that stay coherent over long stretches without relying on transcripts or musical notation. The same paper that introduced it laid out the semantic-plus-acoustic token recipe that several later Google audio systems, including MusicLM and SoundStorm, would build on directly. [1][2]
By the early 2020s, neural audio synthesis had advanced considerably. Autoregressive waveform models such as WaveNet could produce natural-sounding speech sample by sample, and neural codecs were learning compact discrete representations of audio. At the same time, large language models had shown that a single objective, next-token prediction over discrete symbols, could capture rich structure in text. AudioLM brought those two threads together. Its authors observed that audio tokenizers force a trade-off: codecs tuned for faithful reconstruction tend to lose long-range structure, while representations that capture high-level meaning discard the fine detail needed for natural-sounding output. The framework's central contribution was a hybrid tokenization scheme designed to get both at once. [1][2]
The work was published as an arXiv preprint, "AudioLM: a Language Modeling Approach to Audio Generation" (arXiv:2209.03143), submitted on 7 September 2022, with an accompanying Google Research blog post on 6 October 2022. A revised version appeared in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 31, pages 2523 to 2533, in 2023. The authors were Zalan Borsos, Raphael Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. [1][2][3]
AudioLM represents each audio clip with two complementary kinds of tokens that target different properties of the signal.
Semantic tokens come from w2v-BERT, a self-supervised speech model. AudioLM takes the activations of a pretrained w2v-BERT XL model (about 0.6 billion parameters), specifically from the seventh layer of its masked-language-modeling module, and discretizes them with k-means clustering into 1,024 clusters. These tokens capture both local dependencies, such as phonetics in speech or local melody in piano, and global long-term structure, such as syntax and semantic content in speech or harmony and rhythm in music. Because they run at roughly 25 Hz (one token every 40 ms), they heavily downsample the signal, which keeps the modeled sequences short enough to express long-range structure. Their weakness is fidelity: reconstructing audio from semantic tokens alone yields poor quality. [1][2]
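The discretization step can be illustrated with a minimal sketch, assuming the k-means centroids have already been fitted. The function names, the two-dimensional toy vectors, and the three toy centroids are hypothetical stand-ins for the 1,024 centroids over w2v-BERT activations; this is not the authors' code.

```python
import math

def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def semantic_tokens(frames, centroids):
    """Map each embedding frame to one discrete token id."""
    return [nearest_centroid(frame, centroids) for frame in frames]

# ASSUMPTION: toy data. Real frames would be layer-7 activations, one per 40 ms.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.2, 0.1]]

tokens = semantic_tokens(toy_frames, toy_centroids)
print(tokens)  # [0, 1, 2, 0]
```

Each frame collapses to a single integer, which is why fine waveform detail is lost while the sequence stays short.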
Acoustic tokens come from SoundStream, a neural audio codec. SoundStream encodes 16 kHz waveforms into embeddings at 50 Hz and quantizes them with a residual vector quantizer of 12 layers, each with a codebook of 1,024 entries, giving roughly 600 tokens per second at about 6,000 bits per second. These tokens preserve the details of the waveform, including speaker characteristics and recording conditions, and so allow high-quality synthesis. On their own, though, they do not enforce long-term coherence. [1][2]
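Residual vector quantization and the resulting rates can be sketched as follows. This is a generic illustration of the technique, not SoundStream's implementation: the helper names and the two small toy codebooks are assumptions, while the figures in the rate calculation (50 Hz, 12 layers, 1,024 entries) come from the text above.

```python
import math

def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))

def rvq_encode(vector, codebooks):
    """Quantize layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest(residual, codebook)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out

# ASSUMPTION: toy codebooks with 2 layers of 4 entries each.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # coarse layer
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],  # finer layer
]
codes = rvq_encode([1.2, 0.3], toy_codebooks)
approx = rvq_decode(codes, toy_codebooks)
print(codes, approx)  # [1, 3] [1.25, 0.25]

# Rate arithmetic with the figures quoted in the text.
tokens_per_second = 50 * 12                  # frames/s * layers
bits_per_token = int(math.log2(1024))        # 10 bits index a 1,024-entry codebook
bitrate_bps = tokens_per_second * bits_per_token
print(tokens_per_second, bitrate_bps)  # 600 6000
```

Each added layer refines what the earlier layers left unexplained, which is why the early layers carry coarse structure and the later ones carry fine detail.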
The two token types are complementary: semantic tokens score well on phonetic discriminability but poorly on reconstruction, while acoustic tokens are the reverse. AudioLM exploits this by modeling them together in a hierarchy.
| Token type | Source model | Approx. rate | Captures | Limitation |
|---|---|---|---|---|
| Semantic | w2v-BERT XL (~0.6B params), layer 7, k-means (1,024 clusters) | ~25 Hz | Linguistic content, long-term structure, melody/harmony | Low reconstruction fidelity |
| Acoustic | SoundStream codec (12 RVQ layers, codebook 1,024) | ~50 Hz embeddings, ~600 tokens/s | Speaker identity, recording conditions, waveform detail | Weak long-term structure |
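The rates in the table imply very different sequence lengths for the same clip. A short calculation, using only the figures above (the constant and function names are illustrative), makes the gap concrete:

```python
SEMANTIC_RATE_HZ = 25         # semantic tokens per second
ACOUSTIC_FRAME_RATE_HZ = 50   # codec embeddings per second
RVQ_LAYERS = 12               # tokens emitted per codec embedding

def token_counts(seconds):
    """Return (semantic, acoustic) token counts for a clip of `seconds` seconds."""
    semantic = SEMANTIC_RATE_HZ * seconds
    acoustic = ACOUSTIC_FRAME_RATE_HZ * RVQ_LAYERS * seconds
    return semantic, acoustic

semantic_count, acoustic_count = token_counts(10)
ratio = acoustic_count // semantic_count
print(semantic_count, acoustic_count, ratio)  # 250 6000 24
```

A 10-second clip is about 250 semantic tokens but about 6,000 acoustic tokens, so long-range structure is far cheaper to model at the semantic level.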
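Generation proceeds through three stages, each handled by a decoder-only Transformer:
1. Semantic modeling: the first model autoregressively predicts the sequence of semantic tokens, which fixes the long-term structure of the output.
2. Coarse acoustic modeling: the second model predicts the tokens from the first few SoundStream quantizer layers, conditioned on the semantic tokens. These coarse tokens carry properties such as speaker identity and recording conditions.
3. Fine acoustic modeling: the third model predicts the tokens from the remaining quantizer layers, conditioned on the coarse acoustic tokens, adding the fine detail needed for high-quality audio.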
After the final stage, the full set of acoustic tokens is fed to the SoundStream decoder, which reconstructs the waveform. [1][2]
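The flow of tokens through the three-stage hierarchy (semantic, then coarse acoustic, then fine acoustic) can be sketched with stub models. Everything here is a hypothetical illustration: `stub_next_token` stands in for a trained Transformer and simply samples at random, and the split of 4 coarse and 8 fine quantizer layers is an assumption made for the example.

```python
import random

SEMANTIC_VOCAB = 1024   # k-means clusters
CODEBOOK_SIZE = 1024    # entries per quantizer layer
NUM_LAYERS = 12         # residual quantizer layers
COARSE_LAYERS = 4       # ASSUMPTION: illustrative coarse/fine split

def stub_next_token(context, vocab_size, rng):
    """Hypothetical stand-in for one sampling step of a decoder-only Transformer."""
    return rng.randrange(vocab_size)

def generate(conditioning, length, vocab_size, rng):
    """Autoregressively produce `length` new tokens after the conditioning tokens."""
    sequence = list(conditioning)
    for _ in range(length):
        sequence.append(stub_next_token(sequence, vocab_size, rng))
    return sequence[len(conditioning):]

def three_stage_sketch(seconds, rng):
    # Stage 1: semantic tokens at about 25 Hz.
    semantic = generate([], 25 * seconds, SEMANTIC_VOCAB, rng)
    # Stage 2: coarse acoustic tokens, conditioned on the semantic tokens.
    coarse = generate(semantic, 50 * seconds * COARSE_LAYERS, CODEBOOK_SIZE, rng)
    # Stage 3: fine acoustic tokens, conditioned on the coarse tokens.
    fine_layers = NUM_LAYERS - COARSE_LAYERS
    fine = generate(coarse, 50 * seconds * fine_layers, CODEBOOK_SIZE, rng)
    return semantic, coarse, fine

semantic, coarse, fine = three_stage_sketch(seconds=2, rng=random.Random(0))
print(len(semantic), len(coarse), len(fine))  # 50 400 800
```

In a real system the coarse and fine tokens together would be handed to the codec decoder to produce the waveform; the sketch stops at the token level.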
Trained on speech alone and given a short spoken prompt, AudioLM generates continuations that are syntactically and semantically plausible while preserving the original speaker's identity, prosody, accent, and recording conditions, even for speakers it had not encountered during training. In the authors' listening tests, raters distinguished real speech from AudioLM continuations only 51.2 percent of the time, close to the 50 percent chance level, indicating the generated speech was nearly indistinguishable from genuine recordings. Crucially, none of this depended on text: the model never saw transcripts. [1][2]
The same framework generalized beyond speech. Trained on piano recordings without any symbolic score, AudioLM extended short piano prompts into longer passages that kept melody, harmony, rhythm, and a consistent style, despite working only from raw waveforms. This demonstrated that the semantic-plus-acoustic approach was not specific to language and could capture musical structure too. [1][2]
Because near-perfect synthetic speech raises misuse concerns, the authors also trained a classifier to detect AudioLM-generated audio. It identified synthetic samples with 98.6 percent accuracy, suggesting that such audio remains detectable even when humans cannot tell it apart by ear. [2]
For the speech experiments, all components of AudioLM (the SoundStream codec, the w2v-BERT model, the k-means quantizer over the w2v-BERT embeddings, and the three decoder-only Transformers) were trained on the unlab-60k split of Libri-Light, about 60,000 hours of English speech derived from audiobooks. For music, the team used an internal dataset of roughly 40,000 hours of piano recordings spanning players from beginner to expert and a wide range of repertoire and acoustic conditions. In both cases the training signal was audio only, with no transcripts, labels, or notation, which is what makes the model's grasp of linguistic and musical structure notable: it learned that structure purely from listening. [1][2]
AudioLM's decomposition of audio generation into a semantic stage and an acoustic stage became a template for subsequent Google audio research.
MusicLM, introduced in early 2023, casts text-to-music generation as a hierarchical sequence-to-sequence task built on AudioLM's framework. It keeps the semantic and acoustic modeling stages but conditions generation on text through an additional component (MuLan audio and text embeddings), reusing SoundStream and w2v-BERT tokens to produce music at 24 kHz that stays consistent over minutes. In effect, MusicLM adds text control on top of the AudioLM recipe. [4]
SoundStorm, released in 2023, targets the speed of the acoustic stage. AudioLM generates acoustic tokens autoregressively, one at a time, which is slow for long sequences. SoundStorm instead takes AudioLM's semantic tokens as input and produces the SoundStream tokens in parallel using a bidirectional Conformer with confidence-based parallel decoding. It is designed as a drop-in replacement for both of AudioLM's acoustic modeling stages (coarse and fine), matching their quality with improved consistency while generating audio about two orders of magnitude faster. MusicLM in turn adopted SoundStorm to synthesize longer outputs more efficiently. [5]
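The idea of confidence-based parallel decoding can be illustrated with a generic sketch of iterative masked decoding. It is not SoundStorm's implementation: `stub_predict` is a hypothetical model that returns random proposals, and the cosine unmasking schedule, sequence length, and iteration count are assumptions chosen for the example.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet

def stub_predict(tokens, rng, vocab_size=1024):
    """Hypothetical model: proposes (token, confidence) for every masked position."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def parallel_decode(length, iterations, rng):
    """Fill all positions in a fixed number of passes, most confident first."""
    tokens = [MASK] * length
    for step in range(1, iterations + 1):
        proposals = stub_predict(tokens, rng)
        if not proposals:
            break
        # Cosine schedule: how many positions should stay masked after this pass.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / iterations))
        num_to_fill = max(1, len(proposals) - keep_masked)
        # Commit the most confident proposals; the rest are re-predicted next pass.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:num_to_fill]:
            tokens[i] = proposals[i][0]
    return tokens

decoded = parallel_decode(length=100, iterations=8, rng=random.Random(0))
print(MASK in decoded)  # False
```

Here 100 positions are filled in 8 passes rather than 100 sequential steps, which is the source of the speedup over one-token-at-a-time generation.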
| System | Year | Builds on AudioLM by | Key change |
|---|---|---|---|
| MusicLM | 2023 | Reusing the semantic/acoustic hierarchy | Adds text conditioning for music generation |
| SoundStorm | 2023 | Consuming AudioLM's semantic tokens | Replaces both acoustic stages with parallel decoding (~100x faster) |
AudioLM showed that a tokenize-then-predict pipeline, the same paradigm driving large language models, could be applied directly to raw audio and yield coherent, high-fidelity results without text or symbolic supervision. Its hybrid tokenization, pairing semantic tokens for long-term structure with acoustic tokens for fidelity, resolved a long-standing trade-off in audio generation and proved general enough to span both speech and music. By establishing the semantic-then-acoustic recipe, it provided the architectural foundation for a family of follow-on systems and helped frame audio as another modality amenable to language-modeling techniques. [1][2][4][5]