SoundStream

Updated Jul 12, 2026

SoundStream is an end-to-end neural audio codec introduced by Google Research in July 2021 that compresses speech, music, and general audio at low-to-medium bitrates ranging from 3 kbps to 18 kbps on 24 kHz signals.^[1]^[2] The system pairs a fully convolutional encoder and decoder with a residual vector quantizer (RVQ) trained jointly with the rest of the network under a mix of adversarial and reconstruction losses, producing a single model that can operate at multiple bitrates through a "quantizer dropout" training scheme.^[1] In subjective listening tests, SoundStream at 3 kbps outperformed the Opus codec at 12 kbps and approached the Enhanced Voice Services (EVS) codec at 9.6 kbps while running in real time on a single smartphone CPU thread.^[1]^[2] Beyond its standalone use as a codec, SoundStream became the prototype for the modern recipe of "convolutional encoder + RVQ bottleneck + adversarial decoder" that was later adopted by Meta's EnCodec, Descript's DAC, and Google's Lyra v2, and its discrete acoustic tokens underpin downstream Google audio language models such as AudioLM and MusicLM.^[3]^[4]^[5]^[6]^[7]

Infobox

Field	Value
Name	SoundStream
Type	End-to-end neural audio codec
Developer	Google Research
Authors	Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi
First release (arXiv)	7 July 2021
Journal publication	IEEE/ACM TASLP, vol. 30, pp. 495 to 507, 2022 (online 23 Nov 2021)
Sample rate	24 kHz
Bitrate range	3 kbps to 18 kbps (single bitrate-scalable model)
Architecture	SEANet-style convolutional encoder, residual vector quantizer, convolutional decoder, STFT and waveform discriminators
Default striding	(2, 4, 5, 8), giving an embedding every 320 samples (13.3 ms)
Default capacity	C = 32 channels, roughly 8.4 M parameters
Real-time factor	Above 2.3x encoder and decoder on a Pixel 4 CPU thread
Joint denoising	Optional FiLM-conditioned, switchable at inference
Paper	arXiv:2107.03312; DOI 10.1109/TASLP.2021.3129994

What is SoundStream?

SoundStream is a neural audio codec: a deep network that learns, directly from audio waveforms, how to compress sound into a compact stream of discrete tokens and decode those tokens back into a high-quality waveform. It was the first published system to combine a fully learned convolutional encoder and decoder, a learnable residual vector quantizer, and an adversarial (GAN-style) training objective in a single end-to-end model.^[1] In the words of its own abstract, SoundStream "can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs," with "a single model" able to "operate across variable bitrates from 3kbps to 18kbps, with a negligible quality loss when compared with models trained at fixed bitrates."^[1]

The headline quantitative claim is that one SoundStream model, running in real time on a smartphone CPU, can match or beat both a general-purpose waveform codec (Opus) and a dedicated speech codec (EVS) while using a fraction of the bits: SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps.^[1]^[2] That result, established with a crowd-sourced MUSHRA-style listening test, is what made SoundStream the template for the neural codecs and audio language models that followed.^[3]^[4]^[5]^[6]^[7]

When was SoundStream released and who built it?

Audio coding before SoundStream was dominated by two engineering traditions. Waveform codecs such as Opus apply an invertible time-frequency transform to the input, quantize the resulting coefficients under the control of a perceptual model, and entropy-code the result; they preserve general audio faithfully at medium-to-high bitrates but introduce audible artifacts when squeezed below roughly 8 kbps.^[1] Parametric codecs such as Enhanced Voice Services (EVS) instead encode the parameters of a speech production model (typically LPC and CELP residual codebooks), trading sample-level fidelity for perceptual plausibility at low bitrates, but at the cost of strong assumptions about the input signal.^[1] Both traditions rely on hand-engineered signal processing pipelines tuned for psychoacoustics and speech synthesis.^[1]

A first wave of machine-learning audio codecs treated neural networks as bolt-on modules: WaveNet was used as the decoder in low-rate speech coders, WaveRNN powered LPCNet, and Google's first Lyra codec used a WaveGRU decoder driven by quantized mel-spectrogram features.^[1] These systems achieved impressive low-bitrate speech quality but were narrow: they assumed speech input, decoded one sample at a time, and quantized fixed, non-learned features. A separate body of work in image and music modelling, anchored by VQ-VAE and VQ-VAE-2, demonstrated that latent codebooks could be learned jointly with autoencoder networks, but vanilla vector quantization scaled poorly at audio bitrates because a single codebook covering 80 bits per frame would need $2^{80}$ entries.^[1]

SoundStream was the first system to combine three ingredients simultaneously: an end-to-end-learned convolutional encoder and decoder, a learnable residual (multi-stage) vector quantizer trained jointly with the network, and an adversarial training objective borrowed from the generative speech synthesis literature.^[1] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi posted the paper to arXiv on 7 July 2021 and published it formally in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495 to 507, with online publication on 23 November 2021 under DOI 10.1109/TASLP.2021.3129994.^[1]^[8] A companion blog post on the Google AI blog, dated 12 August 2021, summarised the system for a broader audience and showcased subjective comparisons against Opus, EVS, and Lyra.^[2]

The paper's central claim, that one model could outperform both Opus and EVS across a broad range of bitrates and content types while running in real time on a smartphone CPU, was a sharp departure from prior neural codecs and turned out to be the seed of a new sub-field that produced EnCodec, the Descript Audio Codec, Lyra v2, and a wave of "audio language models" that treat codec tokens as the alphabet for generative modelling.^[3]^[4]^[5]^[6]^[7]

How does SoundStream work?

What is the overall architecture?

SoundStream processes a single-channel waveform x sampled at f_s = 24 kHz through three blocks trained jointly end-to-end: an encoder that maps x to a sequence of D-dimensional embeddings at a lower sampling rate, a residual vector quantizer that replaces each embedding by a sum of vectors drawn from a small stack of codebooks, and a decoder that reconstructs an approximation of the original waveform from the quantized embeddings.^[1] One or more discriminators are trained jointly with the generator, both to supply the adversarial loss and to provide an internal feature space in which a perceptual reconstruction loss can be measured.^[1]

The model is fully convolutional and uses only causal convolutions, so the architectural latency is determined solely by the temporal resampling ratio between the input waveform and the encoder output, not by any look-ahead into future samples.^[1] At inference time the same network can run in either streaming or offline mode; in deployment the encoder and quantizer run on the transmitting device and the decoder on the receiving device, with the codebook indices forming the on-wire bitstream.^[1]
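As a rough illustration of what causality means for the convolutions described above, the NumPy sketch below pads the input with past samples only, so each output depends on the current and earlier inputs. The function `causal_conv1d` and its kernel weights are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """1D convolution whose output at time t sees only x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation                 # pad the past only, never the future
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            # kernel[j] weights the sample j * dilation steps in the past
            y[t] += kernel[j] * xp[pad + t - j * dilation]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
kernel = np.array([0.5, 0.3, 0.2])           # hypothetical weights

y = causal_conv1d(x, kernel, dilation=3)

# Perturbing samples from index 40 onward must leave outputs before 40 untouched.
x_future = x.copy()
x_future[40:] += 1.0
y_future = causal_conv1d(x_future, kernel, dilation=3)
unchanged = np.allclose(y[:40], y_future[:40])
```

Because no output looks ahead, a streaming implementation only has to buffer past samples, which is why the latency is governed by the striding alone.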

What does the encoder do?

The encoder follows the streaming SEANet design without skip connections.^[1] It begins with a 1D convolution that lifts the waveform to C_enc channels and then applies B_enc convolution blocks. Each block consists of three residual units, whose dilated convolutions use dilation rates of 1, 3, and 9 respectively, followed by a strided downsampling convolution. The number of channels doubles at every downsampling step, starting from C_enc. A final 1D convolution with kernel length 3 and stride 1 sets the embedding dimension to D.^[1]

In the paper's default configuration B_enc = 4 with strides (2, 4, 5, 8), so the encoder produces one embedding for every $M = 2 \times 4 \times 5 \times 8 = 320$ input samples; at 24 kHz this is one embedding per 13.3 ms.^[1] The encoder uses the ELU activation, no normalisation, and only past padding to preserve causality. The default capacity is C_enc = 32, but the paper also explores asymmetric configurations with C_enc as low as 8, which substantially speeds up the encoder at almost no quality cost.^[1]

What is residual vector quantization (RVQ)?

At 6 kbps and M = 320, each second of audio yields S = 75 frames and each frame must carry $r = 6000 / 75 = 80$ bits.^[1] A flat vector quantizer would need a codebook of $N = 2^{80}$ entries, which is infeasible. SoundStream's residual vector quantizer instead cascades N_q stages of much smaller codebooks: the first stage quantizes the unquantized embedding, subsequent stages each quantize the residual error of the previous stage, and the rate budget is split uniformly so each codebook has size $N = 2^{r / N_q}$.^[1] With N_q = 8 this gives $N = 2^{80 / 8} = 2^{10} = 1024$ entries per codebook, a size comparable to the codebooks used in classical CELP coders.^[1]
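The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the helper name `rvq_budget` is made up for illustration, and it assumes the bitrate divides evenly into frames and stages, as it does for the figures quoted above.

```python
import math

SAMPLE_RATE = 24_000          # Hz
STRIDES = (2, 4, 5, 8)        # default encoder strides

def rvq_budget(bitrate_bps, n_q, sample_rate=SAMPLE_RATE, strides=STRIDES):
    """Return (M, S, r, N) for a target bitrate and number of RVQ stages."""
    hop = math.prod(strides)                           # M: samples per embedding
    frames_per_second = sample_rate // hop             # S
    bits_per_frame = bitrate_bps // frames_per_second  # r
    bits_per_stage = bits_per_frame // n_q             # assumes an even split
    return hop, frames_per_second, bits_per_frame, 2 ** bits_per_stage

hop, frames_per_second, bits_per_frame, codebook_size = rvq_budget(6000, n_q=8)
flat_codebook_size = 2 ** bits_per_frame   # what a single-stage quantizer would need
```

With these inputs the per-stage codebook has 1024 entries, whereas `flat_codebook_size` is a 25-digit integer, which makes the case for the multi-stage design concrete.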

Each codebook is updated with exponential moving averages of the assignments, following VQ-VAE-2.^[1] To stabilise training the codebooks are initialised by running k-means on the first training batch rather than randomly, and any code whose moving-average usage drops below 2 is reseeded with a random embedding from the current batch, an idea borrowed from OpenAI's Jukebox.^[1]
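A minimal NumPy sketch of one such update step is shown below. It is an illustration under stated assumptions, not the authors' implementation: the function name `ema_update`, the decay of 0.99, and the way statistics are reset after reseeding are all choices made here for the example; only the moving-average update and the usage threshold of 2 come from the description above.

```python
import numpy as np

def ema_update(codebook, ema_count, ema_sum, batch, rng, decay=0.99, min_usage=2.0):
    """One EMA codebook step followed by reseeding of rarely used codes.

    codebook: (N, D), ema_count: (N,), ema_sum: (N, D), batch: (B, D).
    Returns updated copies of codebook, ema_count and ema_sum.
    """
    # Assign every batch vector to its nearest code (squared Euclidean distance).
    dists = ((batch[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)
    onehot = np.eye(len(codebook))[assign]                  # (B, N)

    # Exponential moving averages of usage counts and of assigned-vector sums.
    ema_count = decay * ema_count + (1.0 - decay) * onehot.sum(axis=0)
    ema_sum = decay * ema_sum + (1.0 - decay) * (onehot.T @ batch)
    codebook = ema_sum / np.maximum(ema_count, 1e-8)[:, None]

    # Reseed codes whose average usage fell below the threshold.
    dead = ema_count < min_usage
    n_dead = int(dead.sum())
    if n_dead:
        picks = batch[rng.integers(0, len(batch), size=n_dead)]
        codebook[dead] = picks
        ema_count[dead] = min_usage          # assumption: restart stats at the threshold
        ema_sum[dead] = picks * min_usage
    return codebook, ema_count, ema_sum

rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [100.0, 100.0]])
ema_count = np.array([5.0, 5.0, 2.0])        # third code is about to fall below 2
ema_sum = codebook * ema_count[:, None]
batch = np.concatenate([rng.normal(0.0, 0.1, (8, 2)), rng.normal(10.0, 0.1, (8, 2))])

new_codebook, new_count, new_sum = ema_update(codebook, ema_count, ema_sum, batch, rng)
```

In this toy run the third code receives no assignments, its usage average decays below the threshold, and it is replaced by a vector drawn from the batch.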

The residual structure has an important architectural consequence: because each layer adds rather than concatenates its output, the embedding dimension D does not change with the number of quantizers, so the encoder and decoder do not need to be modified to accommodate different bitrates.^[1]
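The cascade can be sketched in a few lines of NumPy. The codebooks below are random stand-ins for trained ones, and the names `rvq_encode` and `rvq_decode` are hypothetical; the point is only to show that each stage quantizes what the previous stages left over and that the decoded vector keeps dimension D however many stages are summed.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize one embedding z of shape (D,) with a cascade of (N, D) codebooks."""
    residual = z.copy()
    indices = []
    for cb in codebooks:
        # Pick the code closest to the current residual.
        i = int(((cb - residual) ** 2).sum(axis=1).argmin())
        indices.append(i)
        residual = residual - cb[i]          # the next stage sees what is left over
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors; zip stops at the number of indices given."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
D, N, N_Q = 4, 16, 8
# Hypothetical random codebooks with shrinking scale, standing in for trained ones.
codebooks = [rng.standard_normal((N, D)) * 0.5 ** k for k in range(N_Q)]
z = rng.standard_normal(D)

indices = rvq_encode(z, codebooks)
z_hat_full = rvq_decode(indices, codebooks)       # all 8 stages
z_hat_half = rvq_decode(indices[:4], codebooks)   # first 4 stages only
```

Both reconstructions have the same shape, so the same decoder input layer can consume either of them.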

How does one model serve multiple bitrates (quantizer dropout)?

Naively, a separate model would be needed for every target bitrate, since the encoder and decoder are co-trained with a fixed set of codebooks. SoundStream solves this with quantizer dropout, a form of structured dropout applied to quantization layers: for each training example, an integer n_q is sampled uniformly at random from {1, ..., N_q} and only the first n_q stages of the RVQ are used in that step.^[1] The model therefore learns to reconstruct audio for every bitrate corresponding to the range n_q = 1 to N_q, and at inference time the user simply chooses how many codebooks to transmit.^[1]

This single trick is what gives SoundStream its variable-bitrate behaviour: one model can switch between 3 kbps, 6 kbps, 12 kbps, and 18 kbps without retraining or changing architecture. The paper measures the cost of this scalability and finds it is essentially zero. In ViSQOL terms the bitrate-scalable model is only slightly worse than a bitrate-specific model at 3 kbps and matches it exactly at 6 kbps and 12 kbps, and in some configurations the dropout-trained model marginally outperforms the bitrate-specific one, suggesting that quantizer dropout acts as a useful regulariser in addition to enabling scalability.^[1]
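The sampling step and the resulting bitrate ladder can be sketched as follows. The frame rate and 10-bit codebooks follow the default configuration described earlier; the value `N_Q = 24` is an assumption made here so that the top of the ladder reaches 18 kbps, and the helper names are hypothetical.

```python
import numpy as np

FRAMES_PER_SECOND = 75       # 24 kHz divided by the 320-sample hop
BITS_PER_STAGE = 10          # 1024-entry codebooks
N_Q = 24                     # assumption: enough stages to reach 18 kbps

def sample_active_stages(rng, n_q_max=N_Q):
    """Quantizer dropout: draw n_q uniformly from {1, ..., N_q} for one example."""
    return int(rng.integers(1, n_q_max + 1))   # upper bound is exclusive

def bitrate_bps(n_q):
    """Bitrate obtained when only the first n_q stages are transmitted."""
    return n_q * BITS_PER_STAGE * FRAMES_PER_SECOND

def stages_for_bitrate(target_bps):
    """Number of leading stages to keep at inference for a target bitrate."""
    return target_bps // (BITS_PER_STAGE * FRAMES_PER_SECOND)

rng = np.random.default_rng(0)
draws = [sample_active_stages(rng) for _ in range(1000)]
ladder = {kbps: stages_for_bitrate(kbps * 1000) for kbps in (3, 6, 12, 18)}
```

Under these assumptions each stage contributes 750 bps, so the four operating points correspond to keeping the first 4, 8, 16, or 24 codebooks.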

What does the decoder do?

The decoder mirrors the encoder.^[1] A 1D convolution lifts the quantized embeddings back to C_dec channels, and B_dec convolution blocks each consist of a transposed convolution for upsampling followed by three residual units. The strides are the encoder strides in reverse order, so the decoder restores the original 24 kHz resolution. The number of channels halves at every upsampling step. A final 1D convolution with a single filter, kernel size 7, and stride 1 projects back into the waveform domain. The paper studies the asymmetric C_enc vs C_dec case in detail and finds, in line with image-compression literature, that a larger decoder benefits quality more than a larger encoder does.^[1]

What are the discriminators for?

To compute the adversarial loss, SoundStream uses two complementary discriminator families.^[1] A waveform-based discriminator follows the multi-scale design from MelGAN, applying three identical convolutional models to the input at original resolution, 2x downsampled, and 4x downsampled. Each single-scale model uses an initial plain convolution followed by four grouped convolutions, each with group size 4, downsampling factor 4, and a channel multiplier of 4 up to a maximum of 1024 channels, and ends with two more plain convolutions that produce the time-distributed logits.^[1]

A second STFT-based discriminator operates on the complex-valued short-time Fourier transform of the input with a window length of 1024 samples and a hop length of 256 samples, treating real and imaginary parts as separate channels.^[1] A 7 x 7 convolution with 32 channels feeds into six residual blocks that alternate between (1, 2) and (2, 2) time-frequency strides. A final 1 x F/2^6 convolution aggregates across the downsampled frequency axis to produce a one-dimensional time-domain logit signal.^[1]

What is the training objective?

The full training objective combines an adversarial term and two reconstruction terms.^[1] Both discriminators are trained with a hinge loss to classify decoded versus original audio, and the generator is trained to fool them with a corresponding hinge term. A feature loss takes the mean absolute difference between intermediate discriminator activations for real and decoded audio, providing a learned perceptual reconstruction signal. A multi-scale spectral reconstruction loss penalises the L1 distance between 64-bin mel-spectrograms of the original and decoded waveforms, computed at six window lengths s from 2^6 to 2^11 samples (each with a hop length of one quarter of the window), plus a log-mel L2 term scaled by $\sqrt{s / 2}$ following the spectral energy distance formulation.^[1] The overall generator loss is a weighted sum

$$L_G = \lambda_{\text{adv}} L_{\text{adv}} + \lambda_{\text{feat}} L_{\text{feat}} + \lambda_{\text{rec}} L_{\text{rec}}$$

with $\lambda_{\text{adv}} = 1$, $\lambda_{\text{feat}} = 100$, and $\lambda_{\text{rec}} = 1$ in all reported experiments.^[1]
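The structure of this objective can be sketched in NumPy as below. It is a simplified illustration, not the paper's training code: the logits and feature maps are random placeholders, the reconstruction term is passed in as a number instead of being computed from mel-spectrograms, and the function names are hypothetical. The hinge form, the feature loss definition, the window schedule, and the three weights follow the description above.

```python
import math
import numpy as np

LAMBDA_ADV, LAMBDA_FEAT, LAMBDA_REC = 1.0, 100.0, 1.0

# (window length s, hop length s / 4, log-mel weight sqrt(s / 2)) for s = 2^6 .. 2^11
SPECTRAL_SCALES = [(2 ** i, 2 ** i // 4, math.sqrt(2 ** i / 2)) for i in range(6, 12)]

def discriminator_hinge_loss(logits_real, logits_fake):
    """Push logits for real audio above +1 and logits for decoded audio below -1."""
    return np.mean(np.maximum(0.0, 1.0 - logits_real)) + np.mean(np.maximum(0.0, 1.0 + logits_fake))

def generator_hinge_loss(logits_fake):
    """The generator is rewarded when the discriminator scores decoded audio highly."""
    return np.mean(np.maximum(0.0, 1.0 - logits_fake))

def feature_loss(feats_real, feats_fake):
    """Mean absolute difference between paired activations, averaged over layers."""
    return float(np.mean([np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake)]))

def generator_loss(l_adv, l_feat, l_rec):
    return LAMBDA_ADV * l_adv + LAMBDA_FEAT * l_feat + LAMBDA_REC * l_rec

rng = np.random.default_rng(0)
logits_real = rng.standard_normal(16) + 1.0        # placeholder discriminator outputs
logits_fake = rng.standard_normal(16) - 1.0
feats_real = [rng.standard_normal((8, 32)) for _ in range(3)]
feats_fake = [f + 0.01 for f in feats_real]        # decoded features, slightly off

l_adv = generator_hinge_loss(logits_fake)
l_feat = feature_loss(feats_real, feats_fake)
l_rec = 0.5                                        # placeholder spectral loss value
l_g = generator_loss(l_adv, l_feat, l_rec)
```

The weight of 100 on the feature term means that, in this toy example, a mean activation mismatch of 0.01 contributes about as much to the total as a unit adversarial loss.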

Can SoundStream denoise while it compresses?

Yes. A distinctive feature of SoundStream is that the same model can optionally perform background-noise suppression as part of compression, without any extra latency.^[1] The paper states that "we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech."^[1] Training data is augmented with tuples (input, target, denoise): when the denoise flag is true and the input is noisy speech, the target is the clean component; otherwise the target equals the input. At inference time the denoise flag is exposed as a Feature-wise Linear Modulation (FiLM) conditioning input that scales and shifts the activations between residual units, and the flag can be flipped on or off at any time, so the codec can encode acoustic scenes with denoising disabled and switch on speech denoising only when desired.^[1] In MUSHRA and ViSQOL evaluations, conditioning at the encoder side achieves quality comparable to conditioning at the decoder side, and both match a model that always denoises, demonstrating that switchable denoising costs nothing in quality.^[1] Compared with a two-model pipeline that chains the codec with a separate SEANet denoiser (compress-then-denoise or denoise-then-compress), the single joint model achieves ViSQOL scores essentially on par while incurring roughly half the compute and no extra architectural latency.^[1]
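The FiLM mechanism itself is small enough to sketch in NumPy. The example below is illustrative only: the function name `film`, the two-element one-hot encoding of the flag, and the hand-picked weights (chosen so that the "off" mode leaves activations untouched) are assumptions for the example, not values from the paper.

```python
import numpy as np

DENOISE_OFF = np.array([1.0, 0.0])   # assumed one-hot encoding of the flag
DENOISE_ON = np.array([0.0, 1.0])

def film(activations, mode, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel of a (C, T) activation map.

    gamma and beta are linear functions of the conditioning vector `mode`.
    """
    gamma = w_gamma @ mode + b_gamma          # (C,)
    beta = w_beta @ mode + b_beta             # (C,)
    return gamma[:, None] * activations + beta[:, None]

C, T = 4, 10
rng = np.random.default_rng(0)
activations = rng.standard_normal((C, T))

# Hypothetical parameters: identity when off, halve-and-shift when on.
w_gamma = np.stack([np.zeros(C), np.full(C, -0.5)], axis=1)   # (C, 2)
b_gamma = np.ones(C)
w_beta = np.stack([np.zeros(C), np.full(C, 0.1)], axis=1)     # (C, 2)
b_beta = np.zeros(C)

out_off = film(activations, DENOISE_OFF, w_gamma, b_gamma, w_beta, b_beta)
out_on = film(activations, DENOISE_ON, w_gamma, b_gamma, w_beta, b_beta)
```

Because the conditioning only changes per-channel scales and offsets, switching modes adds no buffering and therefore no latency, which is consistent with the behaviour described above.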

What configurations and variants does SoundStream support?

The paper documents several internal configurations that have since become the de facto knobs for descendant codecs.

How do capacity trade-offs affect speed and quality?

SoundStream's default model has C_enc = C_dec = 32 channels and roughly 8.4 M parameters and achieves a real-time factor (RTF) above 2.3x for encode and decode on a single thread of a Pixel 4 CPU.^[1] Halving the channel count to C_enc = C_dec = 16 brings the parameter count down to 2.4 M and the RTF up to roughly 7x with essentially no measurable quality loss (ViSQOL of 3.98 vs 4.01 at 6 kbps).^[1] An asymmetric configuration with C_enc = 8 and C_dec = 32 pushes encoder RTF to 18.6x at the cost of a quality drop from 4.01 to 3.99, validating the "small encoder, larger decoder" pattern that DAC and EnCodec would later adopt.^[1]

How many quantizers does SoundStream use?

For a fixed 6 kbps budget, the same total bitrate can be achieved with very different (N_q, N) pairs. The paper measures three: (N_q = 8, N = 1024), (N_q = 16, N = 32), and (N_q = 80, N = 2). All three land within 0.1 ViSQOL of one another, with the 80-stage, 1-bit-per-stage configuration only slightly behind.^[1] This is a notable empirical result: very deep RVQs can be trained end-to-end without optimisation collapse, and codebook size can be traded against codebook depth almost freely.^[1]
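A quick check, assuming the default rate of 75 frames per second, confirms that the three pairs spend the same number of bits per frame:

```python
import math

FRAMES_PER_SECOND = 75                     # default 320-sample hop at 24 kHz
configs = [(8, 1024), (16, 32), (80, 2)]   # (N_q, N) pairs from the text

# Each stage costs log2(N) bits, so a frame costs N_q * log2(N) bits.
bits_per_frame = [n_q * int(math.log2(n)) for n_q, n in configs]
bitrates_bps = [bits * FRAMES_PER_SECOND for bits in bits_per_frame]
```

Each configuration yields 80 bits per frame and therefore 6000 bps.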

What determines architectural latency?

Architectural latency is set by the product of the encoder strides, which fixes how many input samples each embedding spans. The default (2, 4, 5, 8) yields a 13.3 ms frame; the paper also evaluates (1, 4, 5, 8), reported at 7.5 ms, and (4, 4, 5, 8), reported at 26.6 ms. All three settings reach the same ViSQOL at 6 kbps, with the longer-frame configuration running faster because each forward pass amortises over more audio.^[1]

Bitrate-scalable vs bitrate-specific

In the paper's MUSHRA subjective evaluation, the bitrate-scalable model trained with quantizer dropout matches its bitrate-specific counterparts at 6 kbps and 12 kbps and is only marginally worse at 3 kbps, confirming that the practical penalty for serving all bitrates from one model is negligible.^[1]

How well does SoundStream perform?

What did the subjective listening tests show?

The main result, shown in Figure 5 of the paper, comes from a MUSHRA-inspired crowd-sourced listening test on 200 clips spanning clean speech, noisy speech, reverberant speech, and music.^[1] At 3 kbps, SoundStream significantly outperforms both Opus at 6 kbps and EVS at 5.9 kbps, the lowest rates at which those codecs operate, despite SoundStream using half the bits.^[1] To match SoundStream's 3 kbps perceptual quality, EVS needs at least 9.6 kbps and Opus at least 12 kbps, a 3.2x to 4x reduction in bit budget.^[1]^[2] At medium bitrates SoundStream still saves 2.2x to 2.6x, and at high bitrates 1.3x to 1.6x.^[1] At 3 kbps SoundStream also outperforms Lyra v1, the prior Google neural codec.^[1]

A breakdown by content type shows that quality remains consistent across clean speech and noisy speech and demonstrates, for the first time in a published neural codec, that 3 kbps audio is feasible for music while still beating Opus at 12 kbps and EVS at 5.9 kbps on the same material.^[1]

What do the objective (ViSQOL) metrics show?

The paper reports rate-quality curves measured by ViSQOL across 3 kbps to 18 kbps.^[1] ViSQOL stays above 3.7 even at 3 kbps and degrades gracefully as the rate decreases. By estimating the empirical entropy of the quantization symbols (treated as discrete memoryless sources), the authors estimate that an additional 7 percent to 20 percent of bits could be saved by applying entropy coding on top of the fixed-rate stream, although the released model operates at constant bitrate.^[1] Content-wise, ViSQOL is highest on clean speech and lowest on music, reflecting the inherent diversity of musical content.^[1]
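The kind of estimate described here can be sketched as follows: compute the zeroth-order (memoryless) empirical entropy of a stream of codebook indices and compare it with the fixed cost of $\log_2 N$ bits per index. The index stream below is synthetic and deliberately skewed, standing in for real codec output, so the resulting saving is not a reproduction of the paper's figure; the function name is hypothetical.

```python
import numpy as np

def empirical_entropy_bits(symbols, alphabet_size):
    """Zeroth-order entropy in bits per symbol, treating symbols as memoryless."""
    counts = np.bincount(symbols, minlength=alphabet_size)
    p = counts[counts > 0] / counts.sum()     # drop unused symbols to avoid log(0)
    return float(-(p * np.log2(p)).sum())

N = 1024                                      # codebook size, 10 bits at fixed rate
rng = np.random.default_rng(0)

# Hypothetical skewed usage distribution standing in for a real index stream.
weights = 1.0 / np.arange(1, N + 1) ** 0.3
symbols = rng.choice(N, size=50_000, p=weights / weights.sum())

entropy_bits = empirical_entropy_bits(symbols, N)
potential_saving = 1.0 - entropy_bits / np.log2(N)   # fraction of bits an ideal coder could drop
```

A perfectly uniform stream gives exactly 10 bits per index and no saving; any skew in codebook usage lowers the entropy and opens room for entropy coding.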

Why does the learned encoder matter (ablations)?

Replacing SoundStream's learnable encoder with a fixed mel-filterbank in the style of Lyra v1 (while still learning the quantizer and decoder) drops ViSQOL from 3.96 to 3.33 at 6 kbps, far worse than the quality at half the bitrate with a learned encoder (3.76 at 3 kbps).^[1] This single ablation justifies the additional encoder compute and quantifies the perceived-quality cost of the prior generation of fixed-feature neural codecs.^[1]

What models are based on SoundStream?

EnCodec (Meta AI, 2022)

EnCodec, introduced by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi at Meta AI on 24 October 2022 in the arXiv preprint "High Fidelity Neural Audio Compression," is a direct architectural descendant of SoundStream.^[3] It keeps the streaming SEANet-style convolutional encoder and decoder, the residual vector quantizer, and the adversarial-plus-reconstruction objective, and adds a multiscale STFT discriminator, a relative-feature matching loss, and a loss balancer that decouples gradient weight from loss scale.^[3] EnCodec extends the rate range down to 1.5 kbps, adds 48 kHz stereo operation, and comes with optional Transformer-based entropy models that can shave a further 25 percent to 40 percent off the bitstream.^[3] It is the codec that ships inside Meta's AudioCraft toolkit and powers MusicGen.^[9]

Descript Audio Codec (DAC, 2023)

The Descript Audio Codec, presented in "High-Fidelity Audio Compression with Improved RVQGAN" by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs (a SoundStream co-author who joined Descript), Ishaan Kumar, and Kundan Kumar on 11 June 2023 and accepted as a NeurIPS 2023 spotlight, retains the SoundStream blueprint of convolutional encoder, RVQ bottleneck, convolutional decoder, and adversarial training but pushes the operating point to 44.1 kHz audio at 8 kbps, a roughly 90x compression ratio.^[4]^[10] DAC keeps SoundStream's mel-spectrogram reconstruction loss, adds a multi-band discriminator that improves high-frequency reconstruction, and proposes a series of fixes (snake activations, factorised codes, L2-normalised codes) to the codebook-collapse problem that plagued earlier RVQ training.^[4] It is widely used as a universal acoustic tokeniser for downstream generative audio models.^[4]

Lyra v2 (Google, 2022)

Google's Lyra v2 codec, released on 30 September 2022 under the Apache 2.0 license, is explicitly described in Google's announcement as "based on an end-to-end neural audio codec called SoundStream," and uses the SoundStream RVQ to expose three switchable bitrates of 3.2 kbps, 6 kbps, and 9.2 kbps from a single model.^[5]^[11] Lyra v2 cut end-to-end latency from 100 ms in Lyra v1 to 20 ms, ran five times faster than its predecessor, and shipped a TensorFlow Lite export that runs at roughly 0.57 ms per 20 ms frame on a Pixel 6 Pro, about 35x real time.^[5] It is the production-grade engineering descendant of SoundStream and the codec that has reached the largest number of end users through products like Google Duo.^[5]

AudioLM (Google, 2022)

AudioLM, introduced by Zalán Borsos, Neil Zeghidour, and colleagues at Google Research and published in IEEE/ACM TASLP, treats audio generation as language modelling over discrete tokens.^[6] It uses semantic tokens from a w2v-BERT model to capture coarse content and SoundStream "acoustic tokens" (the RVQ indices) to capture fine-grained acoustic detail such as speaker identity and recording conditions.^[6] The fact that SoundStream produces compact, low-bitrate, hierarchical token sequences that can be decoded back to high-quality waveforms is what makes audio language modelling tractable; AudioLM was the proof of concept that codec tokens could serve the same role for audio that subword tokens serve for language models.^[6]

MusicLM (Google, 2023)

MusicLM extends the AudioLM approach to text-conditioned music generation, again using SoundStream acoustic tokens as its synthesis bottleneck.^[7] It demonstrates that the SoundStream representation is rich enough to support 24 kHz music synthesis from text prompts at minute-long durations, a task that would have been intractable with sample-level autoregression.^[7] Other downstream Google audio systems including SoundStorm and the Universal Speech Model family also build on SoundStream-style tokens.^[7]

How did SoundStream influence neural codec language models?

Beyond Google's own stack, SoundStream's "discrete acoustic tokens" paradigm has been adopted across the field. Microsoft's VALL-E framed text-to-speech as conditional language modelling over EnCodec tokens, an architecture line that ultimately traces back to SoundStream's residual codebooks.^[12] Moshi from Kyutai uses a SoundStream/EnCodec-style RVQ codec for real-time full-duplex speech.^[13] The broader pattern, an audio model in which a small RVQ codec sits at the bottom of a larger generative or discriminative stack, is essentially the SoundStream pattern.^[9]

How does SoundStream compare with Opus, EVS, Lyra v1, and EnCodec?

Codec	Sample rate	Bitrate range	Architecture	Latency	Notes
Opus (RFC 6716)	8 to 48 kHz	6 to 510 kbps	SILK plus CELT, traditional	About 5 to 60 ms	Standardised by IETF, 2012; widely deployed in WebRTC^[1]
EVS	8 to 48 kHz	5.9 to 128 kbps	LPC plus CELP plus MDCT, traditional	About 32 ms	3GPP standard for VoLTE^[1]
Lyra v1	16 kHz speech only	3 kbps	Fixed mel features, WaveGRU decoder	About 100 ms	Google, 2021^[1]^[5]
SoundStream	24 kHz	3 to 18 kbps	Learned conv encoder, RVQ, conv decoder, GAN	13.3 ms architectural	Google, 2021; single model multi-bitrate^[1]
Lyra v2	16 kHz speech	3.2, 6, 9.2 kbps	SoundStream-based	20 ms	Google, 2022; production-grade^[5]
EnCodec	24 kHz mono, 48 kHz stereo	1.5 to 24 kbps	Conv encoder, RVQ, conv decoder, multiscale STFT D	Streaming	Meta, 2022; ships in AudioCraft^[3]
DAC	16, 24, 44.1 kHz	About 8 kbps	Improved RVQGAN with codebook-collapse fixes	Streaming	Descript, 2023; NeurIPS 2023 spotlight^[4]

Note that the latency figures listed are architectural rather than total system latency, which also depends on framing and platform-dependent buffering.

What is SoundStream used for?

The most immediate application of SoundStream is bitstream compression for voice and audio over networks: encoder and quantizer on a transmitting device, codebook indices on the wire, decoder on the receiving device.^[1] The 13.3 ms architectural latency and real-time-factor margins on smartphone CPUs make it suitable for two-way communication, and the bitrate-scalable model lets clients adjust on the fly to network conditions without any retraining.^[1] Google's Lyra v2 ships this use case at production scale in WebRTC pipelines.^[5]

The optional joint denoising mode lets the same model perform background-noise suppression for speech without any added latency, useful in noisy mobile or call-centre environments where compressing then denoising would otherwise be two stages each with its own buffering.^[1]

A second application class, foreseen in the original paper only obliquely but soon realised by downstream work, is discrete-token representation learning for generative modelling.^[6] Because SoundStream collapses a 24 kHz audio stream into a few hundred discrete codebook indices per second, those indices can be modelled by the same Transformer architectures used for text, and downstream systems can synthesise novel audio by sampling RVQ tokens autoregressively (AudioLM, MusicLM) or in parallel (SoundStorm).^[6]^[7]

A third application class is dataset compression and search: storing large audio corpora as RVQ tokens reduces storage and bandwidth and enables nearest-neighbour and clustering operations directly in the discrete code space.

What are the limitations of SoundStream?

The SoundStream paper acknowledges several limitations.^[1] The model operates at a fixed constant bitrate per chosen n_q; while the empirical entropy of the quantization symbols suggests 7 percent to 20 percent of additional bits could be removed by entropy coding the index stream, the released configuration does not perform that step.^[1]

Music represents a more challenging content type than clean speech in the rate-quality curves, and the gap between the joint-compression-and-denoising configuration and a two-model compress-plus-denoise pipeline shrinks but does not vanish at higher input SNRs.^[1]

The 24 kHz sampling rate is limiting for some music applications where 44.1 kHz or 48 kHz reproduction is expected; this is part of the gap that DAC at 44.1 kHz and EnCodec at 48 kHz stereo were designed to fill.^[3]^[4] SoundStream itself does not handle stereo audio out of the box.^[1]

The codec was trained on LibriTTS speech, Freesound noise mixtures, and MagnaTagATune music, all in English or instrumental.^[1] Generalisation to other languages, far-field reverberant audio, and unusual content categories (children's speech, singing, environmental sounds with rare timbres) is not exhaustively quantified in the paper.^[1]

A more general criticism levelled at neural codecs as a class is that, unlike standardised traditional codecs, SoundStream is not bit-exact across implementations: small floating-point differences between encoder and decoder builds, or between training and deployment platforms, can produce different decoded waveforms.^[9] Interoperability with hardware codec accelerators, well-defined error concealment behaviour, and formal psychoacoustic guarantees are all areas where neural codecs still lag standardised codecs such as Opus and EVS.^[1]

Related work and intellectual lineage

SoundStream draws on three intellectual streams that predate it. From the GAN vocoder literature it inherits the adversarial training scheme and the multi-scale waveform discriminator; the waveform-based discriminator is taken directly from MelGAN and the broader losses are influenced by HiFi-GAN.^[1] From the autoencoder and self-supervised representation learning tradition it inherits the VQ-VAE and VQ-VAE-2 discrete bottleneck idea and the codebook moving-average update rule.^[1] From traditional speech coding it inherits the residual (multi-stage) vector quantizer structure familiar from CELP-style coders, lifted into a learnable end-to-end setting.^[1]

It is in turn the direct ancestor of every subsequent "RVQGAN" neural codec, including Descript's DAC, Meta's EnCodec, Google's Lyra v2, the Kyutai Mimi codec used in Moshi, and the long line of VALL-E-style codec-token language models.^[3]^[4]^[5]^[12]^[13]

A separate but adjacent line of research uses self-supervised representations such as Wav2Vec for speech-only low-rate coding; SoundStream and its descendants differ in that they make no signal-specific assumptions and quantize spectrally rich features directly from waveforms.^[1]

References

1. Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec", arXiv, 2021-07-07. https://arxiv.org/abs/2107.03312. Accessed 2026-05-20.
2. Neil Zeghidour and Marco Tagliasacchi, "SoundStream: An End-to-End Neural Audio Codec", Google Research Blog, 2021-08-12. https://research.google/blog/soundstream-an-end-to-end-neural-audio-codec/. Accessed 2026-05-20.
3. Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi, "High Fidelity Neural Audio Compression", arXiv, 2022-10-24. https://arxiv.org/abs/2210.13438. Accessed 2026-05-20.
4. Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar, "High-Fidelity Audio Compression with Improved RVQGAN", arXiv, 2023-06-11. https://arxiv.org/abs/2306.06546. Accessed 2026-05-20.
5. Google Open Source, "Lyra V2 - a better, faster, and more versatile speech codec", Google Open Source Blog, 2022-09-30. https://opensource.googleblog.com/2022/09/lyra-v2-a-better-faster-and-more-versatile-speech-codec.html. Accessed 2026-05-20.
6. Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour, "AudioLM: A Language Modeling Approach to Audio Generation", Google Research Blog, 2022-10-07. https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/. Accessed 2026-05-20.
7. Andrea Agostinelli et al., "MusicLM: Generating Music From Text", arXiv, 2023-01-26. https://arxiv.org/abs/2301.11325. Accessed 2026-05-20.
8. IEEE Signal Processing Society, "IEEE/ACM Transactions on Audio, Speech, and Language Processing, Volume 30, 2022", IEEE, 2022. https://signalprocessingsociety.org/publications-resources/ieee-transactions-audio-speech-and-language-processing/2022/01. Accessed 2026-05-20.
9. Facebook AI Research, "EnCodec: High Fidelity Neural Audio Compression", AudioCraft Documentation, 2022. https://facebookresearch.github.io/audiocraft/docs/ENCODEC.html. Accessed 2026-05-20.
10. Descript, "descript-audio-codec on GitHub", GitHub, 2023. https://github.com/descriptinc/descript-audio-codec. Accessed 2026-05-20.
11. Hengchin Yeh, "Lyra V2 open-source audio codec gets faster, higher quality and compatible with more platforms", CNX Software, 2022-10-03. https://www.cnx-software.com/2022/10/03/lyra-v2-open-source-audio-codec-gets-faster-higher-quality-and-compatible-with-more-platforms/. Accessed 2026-05-20.
12. Chengyi Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers", arXiv, 2023-01-05. https://arxiv.org/abs/2301.02111. Accessed 2026-05-20.
13. Kyutai, "Moshi: a speech-text foundation model for real-time dialogue", Kyutai Technical Report, 2024-09. https://kyutai.org/Moshi.pdf. Accessed 2026-05-20.
