EnCodec
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,044 words
EnCodec is a real-time neural audio codec developed by Meta AI's FAIR (Fundamental AI Research) team. It compresses speech, ambient sound, and music into a compact stream of discrete tokens using a streaming convolutional encoder-decoder with a Residual Vector Quantization (RVQ) bottleneck, adversarial training against a multi-scale STFT discriminator, and an optional lightweight Transformer entropy model. The reference implementation, released October 2022 as facebookresearch/encodec under the MIT license, ships two variants: a causal 24 kHz mono model targeting 1.5, 3, 6, 12 or 24 kbps and a non-causal 48 kHz stereo model targeting 3, 6, 12 or 24 kbps.[1][2] Beyond compression, EnCodec became the de facto audio tokenizer for a generation of language models over sound, including MusicGen, AudioGen and Suno's Bark, which all autoregress over its RVQ codes rather than raw waveforms.[3][4][5]
| Field | Value |
|---|---|
| Developer | Meta AI, FAIR Team (Paris and Tel-Aviv) |
| Lead authors | Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi |
| First arXiv release | 2022-10-24 (arXiv:2210.13438) |
| Journal version | Transactions on Machine Learning Research (TMLR), 2023-09-17 |
| Reference repository | github.com/facebookresearch/encodec |
| License (code) | MIT |
| Supported sample rates | 24 kHz mono, 48 kHz stereo (32 kHz mono variant for MusicGen, 16 kHz variant for AudioGen) |
| Bitrate range | 1.5 to 24 kbps |
| Quantizer | Residual Vector Quantization, 1024 entries per codebook, up to 32 codebooks (16 at 48 kHz) |
| Encoder latent rate | 75 Hz at 24 kHz; 150 Hz at 48 kHz |
| Streaming latency | 13 ms (24 kHz model); 1 s (48 kHz model) |
| Entropy model | 5-layer, 8-head Transformer, 200 channels, 800-dim FFN |
By the early 2020s, lossy audio compression had matured around hand-engineered signal-processing pipelines such as Opus, standardised by the IETF in 2012, and Enhanced Voice Services (EVS), standardised by 3GPP in 2014. Both spanned narrowband telephony up through fullband stereo by trading off perceptually weighted components of an audio frame.[1] These traditional codecs scaled gracefully but had hit a quality floor at very low bitrates, particularly for music and noisy or reverberant speech, where parametric models simply could not reconstruct convincing detail below roughly 6 kbps.[1]
A parallel research thread had been exploring learned codecs based on neural vocoders, VQ-VAE bottlenecks, and end-to-end optimisation. WaveNet-based coders from Google demonstrated that an autoregressive generative model could synthesise speech from extremely compressed bitstreams, but inference was orders of magnitude slower than real time.[1] The most direct ancestor of EnCodec was SoundStream, published in 2021 by Neil Zeghidour and colleagues at Google, which introduced a fully convolutional encoder-decoder trained with Residual Vector Quantization, reconstruction losses, and adversarial perceptual losses.[1] SoundStream powered Google's Lyra v2 deployment and showed that neural codecs could outperform Opus at 6 kbps on speech.
EnCodec was conceived as the continuation of this line of work, with three specific contributions: a simplified discriminator design built around a single multi-scale STFT network, a gradient balancer that decouples loss weighting from the natural scale of each loss term, and an optional lightweight Transformer entropy model that can shave a further 25 to 40 percent off the bitstream while remaining faster than real time on a single CPU core for the 24 kHz model.[1] The first arXiv preprint appeared on 24 October 2022 with four core authors: Alexandre Défossez and Jade Copet sharing first authorship, and Gabriel Synnaeve and Yossi Adi serving as senior authors, all at Meta AI FAIR.[1] The journal version was accepted to Transactions on Machine Learning Research and published on 17 September 2023.[6]
The codebase was released on the same day as the preprint at github.com/facebookresearch/encodec under the MIT license, with pretrained weights available via Torch Hub for both the 24 kHz causal model and the 48 kHz non-causal stereo model.[2] Within a year the work was powering follow-on systems: Meta's own MusicGen and the open-source Bark project from Suno both adopted EnCodec as their acoustic token vocabulary, treating audio synthesis as next-token prediction over RVQ codes.[3][5]
EnCodec models an audio signal $\mathbf{x} \in [-1, 1]^{C_a \times T}$ with $C_a$ channels and $T$ samples through three stages: a convolutional encoder $E$ that produces a latent $\mathbf{z}$, a Residual Vector Quantization layer $Q$ that maps $\mathbf{z}$ to a discrete-valued code grid $\mathbf{z}_q$, and a decoder $G$ that reconstructs the waveform $\hat{\mathbf{x}}$ from $\mathbf{z}_q$. The whole system is trained end-to-end against a sum of time-domain, frequency-domain, adversarial, feature-matching, and commitment losses.[1]
The encoder is a streaming, fully convolutional network drawn from the SEANet family of audio enhancement architectures.[1] It opens with a 1D convolution with $C = 32$ channels and kernel size 7, followed by $B = 4$ convolution blocks. Each block contains a residual unit, made of two kernel-3 convolutions with a skip connection, and a strided down-sampling convolution with stride $S$ and kernel size $K = 2S$. The strides used in sequence are $(2, 4, 5, 8)$, and the channel count doubles at each down-sampling step. After the four blocks come a two-layer LSTM for sequence modelling and a final 1D convolution with kernel size 7 producing $D$ output channels. ELU activations and either layer normalization or weight normalization are used throughout.[1]
The decoder mirrors the encoder, replacing each strided convolution with a transposed convolution and reversing the order of the strides. The product of the stride list, $2 \cdot 4 \cdot 5 \cdot 8 = 320$, fixes the temporal downsampling factor of the encoder. At 24 kHz this gives 75 latent steps per second; at 48 kHz it gives 150 steps per second.[1] Because the encoder consumes 320 samples to produce one latent frame, the streaming model can emit its first frame after receiving only 320 samples (about 13.3 ms of audio), which sets the codec's initial algorithmic latency.[1]
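The arithmetic in this paragraph can be checked with a short sketch. It uses only the stride list and sample rates quoted above; the function names are illustrative and not part of the reference implementation.

```python
from math import prod

STRIDES = (2, 4, 5, 8)  # encoder strides quoted in the text


def latent_rate_hz(sample_rate: int, strides=STRIDES) -> float:
    """Latent frames per second for a given input sample rate."""
    return sample_rate / prod(strides)


def first_frame_latency_ms(sample_rate: int, strides=STRIDES) -> float:
    """Milliseconds of audio consumed before the first latent frame exists."""
    return 1000.0 * prod(strides) / sample_rate


hop = prod(STRIDES)  # temporal downsampling factor
print(hop, latent_rate_hz(24_000), latent_rate_hz(48_000))
print(round(first_frame_latency_ms(24_000), 1))
```

The same hop of 320 samples applies at both sample rates, which is why the 48 kHz model has twice the latent frame rate of the 24 kHz model.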
EnCodec exposes two configurations of this architecture. In the streamable setup, all convolutional padding is shifted to before the first time step, transposed convolutions emit only their first $s$ samples per step and buffer the remainder, and layer normalization is replaced with weight normalization, which is better suited to a causal compute pattern.[1] In the non-streamable setup used for the 48 kHz model, padding is split symmetrically around each window, the audio is processed in one-second chunks with a 10 ms overlap, each chunk is normalized to remove level differences, and the normalization scale is transmitted as a small bandwidth overhead so the decoder can invert it.[1]
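As a rough illustration of the non-streamable chunking, the sketch below computes chunk boundaries for one-second segments with a 10 ms overlap. The function name and the way the final partial chunk is handled are assumptions made for this example; per-chunk normalization and the transmitted scale are omitted.

```python
def chunk_offsets(num_samples, sample_rate=48_000, segment_s=1.0, overlap_s=0.010):
    """Return (start, end) sample indices of overlapping chunks.

    Assumption: consecutive chunks start (segment - overlap) samples apart,
    and the last chunk is simply truncated at the end of the signal.
    """
    segment = round(segment_s * sample_rate)
    stride = segment - round(overlap_s * sample_rate)
    return [
        (start, min(start + segment, num_samples))
        for start in range(0, num_samples, stride)
    ]


# 2.5 seconds of 48 kHz audio
for start, end in chunk_offsets(120_000):
    print(start, end)
```

Each pair of neighbouring chunks shares 480 samples, which is the 10 ms overlap at 48 kHz.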
The bottleneck of EnCodec is a Residual Vector Quantization layer following the formulation introduced by SoundStream.[1] A single vector quantizer projects each latent vector onto its nearest entry in a codebook of size $K$. RVQ stacks several such quantizers: the first quantizer produces a code $q_1(\mathbf{z})$, the residual $\mathbf{z} - q_1(\mathbf{z})$ is fed to a second quantizer with its own codebook, and the process repeats.[1] After $N_q$ stages, the reconstructed vector is the sum of the $N_q$ chosen codebook entries.
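The residual scheme described above can be sketched in a few lines of plain Python. This is a toy illustration with two hand-written 2-D codebooks of four entries each, not the reference implementation, and the helper names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, z):
    """Return one index per stage; each stage quantizes the previous residual."""
    residual = list(z)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the chosen entries across stages to rebuild the latent vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse stage and a finer stage for the residual.
coarse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
fine = [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

codes = rvq_encode([coarse, fine], [0.9, 0.3])
print(codes, rvq_decode([coarse, fine], codes))
```

Decoding with only the first index gives a coarser reconstruction, which is the property that lets a decoder use fewer stages at lower bitrates.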
EnCodec uses codebooks of 1024 entries each, which is exactly 10 bits per code. The 24 kHz model has up to 32 codebooks and the 48 kHz stereo model has up to 16.[1] Codebook entries are updated via an exponential moving average with decay 0.99, and entries that are never selected within a batch are replaced with a randomly sampled latent from the current batch, a trick that prevents codebook collapse. Gradients flow through the quantizer using a straight-through estimator.[1]
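A simplified sketch of the codebook maintenance step follows. It is an assumption-laden illustration: each used entry is moved toward the mean of its assigned latents with decay 0.99, whereas practical EMA implementations usually track moving averages of cluster counts and sums separately. The function name and signature are hypothetical.

```python
import random

DECAY = 0.99  # EMA decay quoted in the text


def ema_update(codebook, batch, assignments, decay=DECAY, rng=random):
    """Return an updated copy of the codebook.

    Simplification (assumption): used entries drift toward the mean of the
    latents assigned to them; unused entries are re-seeded from the batch.
    """
    updated = []
    for k, entry in enumerate(codebook):
        members = [z for z, a in zip(batch, assignments) if a == k]
        if not members:
            # Dead entry: replace it with a random latent from the current batch.
            updated.append(list(rng.choice(batch)))
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated.append([decay * e + (1 - decay) * m for e, m in zip(entry, mean)])
    return updated


codebook = [[0.0, 0.0], [5.0, 5.0]]
batch = [[1.0, 1.0], [3.0, 1.0]]
new_codebook = ema_update(codebook, batch, assignments=[0, 0], rng=random.Random(0))
print(new_codebook)
```

In this toy run the second entry receives no assignments, so it is replaced by one of the batch latents rather than left to drift unused.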
A single trained model supports multiple bitrates because the number of active codebooks is varied during training. With a stride product of 320, a 24 kHz signal produces 75 latent frames per second, and each codebook contributes 10 bits per frame; using 2 codebooks yields $2 \cdot 75 \cdot 10 = 1500$ bits per second (1.5 kbps), 4 codebooks yield 3 kbps, 8 yield 6 kbps, 16 yield 12 kbps and 32 yield 24 kbps. Training samples a number of codebooks as a multiple of 4 from this set, so a single set of weights serves the entire grid of supported rates.[1] At 48 kHz the supported set is 3, 6, 12 and 24 kbps.
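The bitrate grid follows directly from the frame rate and the 10 bits carried by each code, as the sketch below shows. The codebook counts used for the 48 kHz model (2, 4, 8, 16) are inferred from the 150 Hz frame rate and the listed bitrates rather than stated in the text.

```python
def bitrate_kbps(num_codebooks, frame_rate_hz, codebook_size=1024):
    """Nominal bitrate before entropy coding."""
    bits_per_code = (codebook_size - 1).bit_length()  # 10 bits for 1024 entries
    return num_codebooks * frame_rate_hz * bits_per_code / 1000


rates_24k = {n: bitrate_kbps(n, 75) for n in (2, 4, 8, 16, 32)}
rates_48k = {n: bitrate_kbps(n, 150) for n in (2, 4, 8, 16)}  # inferred counts
print(rates_24k)
print(rates_48k)
```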
EnCodec optionally trains a small Transformer language model over the RVQ codes to exploit the residual statistical redundancy that the bare quantizer does not remove. The model has 5 layers, 8 heads, 200 channels, and an 800-dimensional feed-forward block; no dropout is used.[1] At each time step it consumes the discrete codes from the previous step, embedded codebook by codebook and summed, and predicts the next step's distribution over each codebook independently via $N_q$ linear heads with 1024 output logits each. Treating the codebooks as conditionally independent at a single time step lets the network predict one entire frame in a single forward pass, accepting a small cross-entropy penalty in exchange for a major speed-up.[1] The attention layers have a causal receptive field of about 3.5 seconds, and the model is trained on 5-second sequences with a randomised offset in the sinusoidal position embedding so it generalises beyond its training horizon.[1]
These probabilities feed a range-based arithmetic coder following Pasco and Rissanen.[1] To make the coder deterministic across hardware and across batched-versus-streaming evaluation, the estimated probabilities are first rounded to a precision of $10^{-6}$, with a total range width of $2^{24}$ and a minimum range width of 2. The resulting bitstream is up to 40 percent shorter; the paper reports the 3 kbps configuration falling to 1.9 kbps with entropy coding, and the stereo 48 kHz model at 6 kbps falling to 4.2 kbps.[1]
The reconstruction loss has a time term, defined as the L1 distance between the original and reconstructed waveforms, $\ell_t(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_1$, and a frequency term that sums L1 and L2 distances over a multi-scale mel-spectrogram. The frequency loss uses 64-bin mel spectrograms with STFT window sizes $2^i$ for $i \in \{5, 6, 7, 8, 9, 10, 11\}$ and matching hop length $2^i / 4$, weighting the L2 component by a fixed coefficient $\alpha_i = 1$.[1] This multi-scale construction lets the model match coarse spectral envelopes and fine harmonic detail simultaneously, which is important for both speech intelligibility and the timbre of music.
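To make the scale grid concrete, the sketch below lists the window and hop sizes implied by the exponents above and evaluates the time-domain L1 term on a toy signal. The mel-spectrogram computation itself is omitted, and the function names are illustrative only.

```python
def time_loss(x, x_hat):
    """L1 distance between two equal-length waveforms (lists of samples)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def stft_scales(exponents=range(5, 12)):
    """(window, hop) pairs: window 2**i with hop 2**i / 4, for i = 5..11."""
    return [(2 ** i, 2 ** i // 4) for i in exponents]


for window, hop_length in stft_scales():
    print(window, hop_length)

print(time_loss([0.5, -0.5], [0.25, 0.0]))
```

The windows span 32 to 2,048 samples, so the shortest scale resolves transients while the longest resolves closely spaced harmonics.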
For the perceptual loss, EnCodec departs from prior work that combined several discriminator families. Multi-Period Discriminators from HiFi-GAN and Multi-Scale Discriminators from MelGAN had been standard in neural vocoders, and SoundStream had stacked an MSD with a Mono-STFT discriminator. EnCodec replaces all of these with a single Multi-Scale STFT Discriminator (MS-STFTD).[1] The discriminator operates on the complex-valued STFT of the audio with real and imaginary parts concatenated as separate channels. Each sub-network is built from a 2D convolution with kernel $3 \times 8$ and 32 channels, followed by 2D convolutions with dilation rates of 1, 2 and 4 in the time dimension and stride 2 on the frequency axis, and a final $3 \times 3$ convolution with stride 1. Five different scales are used, with STFT window lengths $[2048, 1024, 512, 256, 128]$; at 48 kHz each window length is doubled.[1] For stereo audio the discriminator processes the left and right channels independently.
The generator's adversarial loss is the hinge loss $\ell_g(\hat{\mathbf{x}}) = \frac{1}{K}\sum_k \max(0, 1 - D_k(\hat{\mathbf{x}}))$ over the $K$ sub-discriminators, and the discriminators themselves minimise a complementary hinge loss. Because the discriminators tend to overpower the encoder-decoder, EnCodec updates them with probability 2/3 at 24 kHz and 1/2 at 48 kHz.[1] A relative feature-matching loss is added on top, comparing intermediate activations of the discriminator on the real and reconstructed waveforms.[1]
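A minimal sketch of the two hinge losses and the stochastic discriminator update follows. It assumes each sub-discriminator returns a single scalar score, which is a simplification, and all names are hypothetical.

```python
import random


def generator_hinge(fake_scores):
    """Mean over K sub-discriminators of max(0, 1 - D_k(x_hat))."""
    return sum(max(0.0, 1.0 - d) for d in fake_scores) / len(fake_scores)


def discriminator_hinge(real_scores, fake_scores):
    """Complementary hinge: push real scores above +1 and fake scores below -1."""
    real = sum(max(0.0, 1.0 - d) for d in real_scores)
    fake = sum(max(0.0, 1.0 + d) for d in fake_scores)
    return (real + fake) / len(real_scores)


def should_update_discriminator(prob, rng):
    """Update the discriminators only on a random fraction of steps."""
    return rng.random() < prob


print(generator_hinge([0.5, 2.0, -1.0]))
print(discriminator_hinge([2.0, 0.0], [-2.0, 0.5]))

rng = random.Random(0)
updates = sum(should_update_discriminator(2 / 3, rng) for _ in range(10_000))
print(updates / 10_000)  # close to 2/3 for the 24 kHz setting
```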
The commitment loss is the standard VQ-VAE term, applied across every residual stage: for each residual $\mathbf{z}_c$ and its quantization $q_c(\mathbf{z}_c)$,
$$\ell_w = \sum_{c=1}^{C} \|\mathbf{z}_c - q_c(\mathbf{z}_c)\|_2^2,$$
with gradients flowing only into the encoder, never into the codebook entries themselves.[1]
A novel contribution of the paper is the loss balancer, introduced to control the wildly differing gradient scales produced by reconstruction, mel, adversarial, and feature-matching losses. The balancer treats each loss weight $\lambda_i$ as a target fraction of the total gradient norm rather than as a raw multiplier. Given gradients $g_i = \partial \ell_i / \partial \hat{\mathbf{x}}$ and an exponential moving average of their norms with decay $\beta = 0.999$, the balancer backpropagates
$$\tilde{g}_i = R \cdot \frac{\lambda_i}{\sum_j \lambda_j} \cdot \frac{g_i}{\langle \|g_i\|_2 \rangle_\beta}$$
with reference norm $R = 1$.[1] If the weights sum to 1, each $\lambda_i$ becomes the literal fraction of the gradient coming from that loss, making hyper-parameter tuning interpretable irrespective of the natural scale of each term.[1] The commitment loss bypasses the balancer because it is defined directly on the encoder output rather than on $\hat{\mathbf{x}}$. The paper's ablation reports significantly more stable training with the balancer engaged.[1]
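The balancer rule can be sketched as follows. This is a toy version operating on plain lists rather than tensors; seeding the moving average with the first observed norm is an assumption of this sketch, and the class is not the reference implementation.

```python
import math


class LossBalancer:
    """Rescale per-loss gradients so each contributes a fixed share of the norm."""

    def __init__(self, weights, beta=0.999, ref_norm=1.0):
        self.weights = dict(weights)
        self.beta = beta
        self.ref_norm = ref_norm
        self.avg_norm = {}  # moving average of each gradient's L2 norm

    def combine(self, grads):
        """grads maps loss name -> gradient w.r.t. the decoder output."""
        total_weight = sum(self.weights.values())
        dim = len(next(iter(grads.values())))
        combined = [0.0] * dim
        for name, grad in grads.items():
            norm = math.sqrt(sum(v * v for v in grad))
            # Assumption: the moving average starts at the first observed norm.
            previous = self.avg_norm.get(name, norm)
            average = self.beta * previous + (1 - self.beta) * norm
            self.avg_norm[name] = average
            scale = self.ref_norm * self.weights[name] / total_weight / max(average, 1e-12)
            combined = [c + scale * v for c, v in zip(combined, grad)]
        return combined


balancer = LossBalancer({"a": 1.0, "b": 3.0})
balanced = balancer.combine({"a": [100.0, 0.0], "b": [0.0, 0.01]})
print(balanced)
```

Although the two raw gradients differ in magnitude by a factor of 10,000, the balanced contributions have norms of about 0.25 and 0.75, matching the 1:3 weight ratio.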
The published training recipe uses 300 epochs of 2,000 updates each on 8 NVIDIA A100 GPUs with the Adam optimizer, batch size 64 of 1-second clips, learning rate $3 \cdot 10^{-4}$, $\beta_1 = 0.5$ and $\beta_2 = 0.9$.[1] The balancer weights are $\lambda_t = 0.1$, $\lambda_f = 1$, $\lambda_g = 3$, $\lambda_{\text{feat}} = 3$ for the 24 kHz model and $\lambda_g = \lambda_{\text{feat}} = 4$ for the 48 kHz model.[1]
The 24 kHz causal model emits a new packet every 320 samples, which at 24 kHz is 13.3 ms.[1] No future context is required, so EnCodec can be inserted into voice-over-IP, low-latency conferencing, or text-to-speech streaming pipelines. The 48 kHz model, designed for music streaming and archival rather than conversational use, accepts a one-second initial latency in exchange for the symmetric padding and chunk-level normalization that yield better fidelity. Entropy coding adds another 13 ms of latency on the 24 kHz model because the arithmetic coder cannot flush each frame independently without inflating overhead.[1]
The 24 kHz model is trained on a multi-domain mix designed to cover speech, noisy and reverberant speech, music, and general audio. Speech comes from the DNS Challenge 4 corpus and Common Voice; general audio from AudioSet and FSD50K; and music from the MTG-Jamendo dataset along with a proprietary Meta music collection used only for evaluation.[1] On-the-fly augmentations combine sources with probabilities 0.32 for a single Jamendo clip, 0.32 for a single non-music source, 0.24 for a mix of two clips from any source, and 0.12 for a mix of three non-music clips. A random gain between -10 and +6 dB is applied, clipped samples are rejected, and reverberation is added with probability 0.2 using DNS Challenge room impulse responses with RT60 values between 0.3 and 1.3 seconds.[1] The 48 kHz stereo model is trained on music only.[1]
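The sampling side of this augmentation recipe can be sketched with the standard library. The recipe labels are hypothetical, and drawing the gain uniformly in decibels is an assumption, since the text only gives the range.

```python
import random

# (hypothetical label, probability) pairs taken from the text
MIX_RECIPES = [
    ("single_jamendo", 0.32),
    ("single_other", 0.32),
    ("mix_two_any", 0.24),
    ("mix_three_other", 0.12),
]


def sample_recipe(rng):
    """Pick one mixing recipe according to the stated probabilities."""
    names, probs = zip(*MIX_RECIPES)
    return rng.choices(names, weights=probs, k=1)[0]


def sample_gain_db(rng, low=-10.0, high=6.0):
    """Random gain in dB (uniform sampling is an assumption)."""
    return rng.uniform(low, high)


def db_to_linear(gain_db):
    """Convert a gain in dB to an amplitude multiplier."""
    return 10 ** (gain_db / 20)


def use_reverb(rng, prob=0.2):
    """Decide whether to convolve with a room impulse response."""
    return rng.random() < prob


rng = random.Random(0)
gain_db = sample_gain_db(rng)
print(sample_recipe(rng), round(gain_db, 2), round(db_to_linear(gain_db), 3), use_reverb(rng))
```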
For each bitrate, a dedicated discriminator is maintained, so the per-bitrate adversary that is updated on a given batch matches the number of RVQ stages active for that batch. The paper reports this delivers measurable quality gains compared to a single shared discriminator.[1]
The paper relies on MUSHRA listening tests, conducted via a crowdsourcing platform with at least 10 ratings per sample on 5-second excerpts, and screens annotators by their ability to identify the hidden reference and low anchor.[1] Objective metrics include ViSQOL and Scale-Invariant SNR. The major baselines are Opus, EVS, and Google's Lyra v2 implementation of SoundStream.[1]
The headline results at 24 kHz, in the streamable setup, show EnCodec outperforming all baselines at every comparable bitrate.
| Model | Bandwidth | Clean Speech | Noisy Speech | Music Set-1 | Music Set-2 |
|---|---|---|---|---|---|
| Reference (uncompressed) | n/a | 95.5 +/- 1.6 | 93.9 +/- 1.8 | 93.2 +/- 2.5 | 97.1 +/- 1.3 |
| Opus | 6.0 kbps | 30.1 +/- 2.8 | 19.1 +/- 5.9 | 20.6 +/- 5.8 | 17.9 +/- 5.3 |
| Opus | 12.0 kbps | 76.5 +/- 2.3 | 61.9 +/- 2.1 | 77.8 +/- 3.2 | 65.4 +/- 2.7 |
| EVS | 9.6 kbps | 84.4 +/- 2.5 | 80.0 +/- 2.4 | 89.9 +/- 2.3 | 87.7 +/- 2.3 |
| Lyra v2 | 3.0 kbps | 53.1 +/- 1.9 | 52.0 +/- 4.7 | 69.3 +/- 3.3 | 42.3 +/- 3.5 |
| Lyra v2 | 6.0 kbps | 66.2 +/- 2.9 | 59.9 +/- 3.3 | 75.7 +/- 2.6 | 48.6 +/- 2.1 |
| EnCodec | 1.5 kbps (0.9 EC) | 49.2 +/- 2.4 | 41.3 +/- 3.6 | 68.2 +/- 2.2 | 66.5 +/- 2.3 |
| EnCodec | 3.0 kbps (1.9 EC) | 67.0 +/- 1.5 | 62.5 +/- 2.3 | 89.6 +/- 3.1 | 87.8 +/- 2.9 |
| EnCodec | 6.0 kbps (4.1 EC) | 83.1 +/- 2.7 | 69.4 +/- 2.3 | 92.9 +/- 1.8 | 91.3 +/- 2.1 |
| EnCodec | 12.0 kbps (8.9 EC) | 90.6 +/- 2.6 | 80.1 +/- 2.5 | 91.8 +/- 2.5 | 92.9 +/- 1.2 |
Scores are MUSHRA on a 0-100 scale with 95 percent confidence intervals; numbers in parentheses are the entropy-coded average bandwidths.[1] EnCodec at 3 kbps slightly outperforms Lyra v2 at 6 kbps and Opus at 12 kbps in the average over speech and music settings, an effective doubling of compression efficiency.[1]
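The entropy-coded bandwidths in parentheses imply the following relative savings. The sketch uses only the figures from the table above and shows they are consistent with the 25 to 40 percent range quoted for the entropy model.

```python
# nominal kbps -> entropy-coded kbps, from the EnCodec rows of the table
ENTROPY_CODED_KBPS = {1.5: 0.9, 3.0: 1.9, 6.0: 4.1, 12.0: 8.9}


def saving_percent(nominal_kbps, coded_kbps):
    """Relative bitrate reduction from entropy coding, in percent."""
    return 100.0 * (1.0 - coded_kbps / nominal_kbps)


for nominal, coded in ENTROPY_CODED_KBPS.items():
    print(nominal, coded, round(saving_percent(nominal, coded), 1))
```

The relative saving shrinks as the bitrate grows, from about 40 percent at 1.5 kbps to about 26 percent at 12 kbps.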
For stereophonic music at 48 kHz, EnCodec at 6 kbps achieves MUSHRA 82.9, statistically tied with MP3 at 64 kbps (82.7) and far above Opus at 6 kbps (17.7), giving a 256x compression ratio with audible fidelity preserved.[1] At 12 kbps the codec reaches 88.0 and at 24 kbps 87.5; the two scores are effectively tied, suggesting that quality saturates above 12 kbps.[1]
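The 256x figure can be reproduced under the assumption that the uncompressed reference is 16-bit PCM; the text does not state the bit depth, so this is an inference.

```python
def pcm_kbps(sample_rate, channels, bit_depth=16):
    """Bitrate of uncompressed PCM audio (16-bit depth is an assumption)."""
    return sample_rate * channels * bit_depth / 1000


uncompressed = pcm_kbps(48_000, 2)
ratio = uncompressed / 6.0  # EnCodec stereo model at 6 kbps
print(uncompressed, ratio)
```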
The MS-STFT discriminator ablation shows that the single multi-scale STFT family alone reaches MUSHRA 77.5 +/- 1.8 with the best ViSQOL of any tested setup, slightly below MS-STFT + MPD at 79.0 +/- 1.9 (the confidence intervals overlap) and well above the SoundStream-style MSD + Mono-STFT baseline at 62.91 +/- 2.62.[1] The streamable model loses a small but visible amount of fidelity relative to its non-streamable twin (SI-SNR 6.67 versus 7.46 at 6 kbps), confirming the expected latency-quality tradeoff.[1]
On a single thread of a 2019 MacBook Pro CPU, the 24 kHz EnCodec runs at a real-time factor of 9.8x for encoding and 10.4x for decoding at 6 kbps, dropping to 1.6x when entropy coding is included.[1] The 48 kHz model runs at 6.8x for encoding and 5.1x for decoding without entropy coding, and 0.68x with it, suggesting that arithmetic coding becomes the bottleneck at high sample rates with the small Transformer the authors chose.[1] Lyra v2, by comparison, runs at 27.4x for encoding and 67.2x for decoding on the same hardware, illustrating that EnCodec trades raw throughput for fidelity.[1]
The reference release in facebookresearch/encodec ships two models that exactly match the paper: a causal 24 kHz monophonic model and a non-causal 48 kHz stereophonic model, both with optional Transformer language models for entropy coding.[2] The codebase requires Python 3.8 or later and PyTorch 1.11 or later, with Torch Hub used for automatic model download.[2] All code is MIT-licensed.[2]
The broader facebookresearch/audiocraft repository, released in mid-2023 to house MusicGen and AudioGen alongside EnCodec, ships several additional EnCodec checkpoints fine-tuned for different downstream uses. A 32 kHz monophonic model (facebook/encodec_32khz) was trained specifically as the tokenizer for MusicGen, operating at the music-friendly sample rate that matches MusicGen's pre-training corpus.[7] A 16 kHz monophonic model with 4 codebooks at a 50 Hz frame rate was trained for AudioGen.[4] AudioCraft's code is MIT-licensed while the released model weights are licensed CC-BY-NC 4.0, restricting commercial reuse.[7]
Hugging Face Transformers integrated EnCodec inference in 2023, exposing the encoder, decoder, and quantizer as a standard model class that can be used alongside other audio models.[8] Many independent ports exist, including TensorFlow conversions, ONNX exports, and a number of streaming WASM builds for web browsers.
A 32 kHz stereo variant trained on additional music data was released in 2023 to accompany the MusicGen-Stereo follow-on, doubling the channel count of the tokenizer without changing the underlying architecture.[7]
EnCodec's most consequential downstream role has not been compression per se, but discrete tokenization of audio for generative Transformer models. By turning a continuous waveform into a regular grid of integer codes, EnCodec lets any next-token language model trained on text be straightforwardly adapted to audio: the model predicts the next set of codebook indices, and the EnCodec decoder converts them back to sound.
MusicGen, announced by Meta in June 2023, is a text-conditioned music generation model that autoregresses over EnCodec tokens at 32 kHz with 4 codebooks per frame at a 50 Hz frame rate, yielding 200 RVQ codes per second.[3] The MusicGen paper, authored by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez, was accepted at NeurIPS 2023.[3] The codes are arranged using one of several "code interleaving patterns" that flatten the 4-codebook x time grid into a single sequence the Transformer can predict step by step, and conditioning is supplied by a text encoder.[3] EnCodec's decoder reconstructs the final waveform, meaning MusicGen's audible output quality is bounded above by EnCodec's reconstruction fidelity at the corresponding bitrate.
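As an illustration of turning the codebook-by-time grid into one sequence, the sketch below implements the simplest possible arrangement, a time-major flattening. It is only one illustrative pattern and is not claimed to be the one MusicGen uses by default; the function names are hypothetical.

```python
def codes_per_second(num_codebooks=4, frame_rate_hz=50):
    """Number of discrete codes the language model must predict per second."""
    return num_codebooks * frame_rate_hz


def flatten_codes(codes):
    """codes[k][t] -> one sequence ordered by time step, then by codebook."""
    return [codes[k][t] for t in range(len(codes[0])) for k in range(len(codes))]


def unflatten_codes(sequence, num_codebooks):
    """Inverse of flatten_codes: recover one row per codebook."""
    return [sequence[k::num_codebooks] for k in range(num_codebooks)]


grid = [[1, 2, 3], [4, 5, 6]]  # toy grid: 2 codebooks, 3 time steps
flat = flatten_codes(grid)
print(codes_per_second(), flat, unflatten_codes(flat, 2))
```

Flattening keeps every code's dependence on earlier codebooks explicit, at the cost of a sequence that is longer by a factor equal to the number of codebooks.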
AudioGen, originally proposed by Felix Kreuk and colleagues in September 2022, generates environmental and Foley sounds from text descriptions. The audiocraft reimplementation that shipped publicly is "a single stage auto-regressive Transformer model trained over a 16 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz."[4] The same discrete-token recipe used for music transfers to general sound effects with no architectural change to either the codec or the LM.
Bark, released by Suno in 2023, is an open-source text-to-audio system that produces multilingual speech, sound effects, and short music passages from text prompts.[5] It is a three-stage cascade: a Transformer turns text into "semantic tokens" inspired by AudioLM; a second Transformer turns semantic tokens into the first 2 EnCodec codebooks (its "coarse" tokens); and a third Transformer fills in the remaining 6 codebooks (the "fine" tokens) before EnCodec's decoder synthesises the final waveform.[5] All audio is therefore reconstructed via Meta's 24 kHz EnCodec model. Because Bark uses EnCodec at inference, the system inherits both its MIT-licensed code and the underlying acoustic fidelity. Suno later moved its commercial product line to proprietary codecs but Bark retains EnCodec.[5]
The EnCodec recipe became the de facto template for "audio LM" research through 2023 and 2024. VALL-E and VALL-E X from Microsoft framed zero-shot text-to-speech as next-token prediction over EnCodec-style codes; Meta's MAGNeT applied masked non-autoregressive generation to the same token grid; and the audiocraft library exposed EnCodec as a swappable tokenizer for community follow-on work.[3][7] The pattern of treating audio as a sequence of RVQ codes, popularised by EnCodec and SoundStream, became a foundation for the discrete-audio generative-model paradigm that competes with continuous diffusion approaches in Stable Audio and related systems.[9]
The intended application domain is general-purpose audio coding for streaming, archival, and constrained-bandwidth communications. The 24 kHz mono codec is suitable for podcasts, voice messaging, video-conferencing, and audiobook delivery at bitrates several times lower than Opus for comparable perceptual quality; in the paper's MUSHRA results, EnCodec at 3 kbps is on average roughly on par with Opus at 12 kbps.[1] The 48 kHz stereo codec is targeted at music streaming and music archival; the paper points to 6 to 12 kbps stereo as a viable replacement for MP3 at 64 kbps.[1]
In generative modelling, EnCodec provides a small, finite token vocabulary for autoregressive language models over sound. This is particularly useful because language models trained with cross-entropy on RVQ codes inherit standard tooling: KV caching, beam search, top-k or top-p sampling, classifier-free guidance, and so on, all transfer directly from text generation to audio generation when the input modality is integers rather than continuous spectrograms. EnCodec's broad public release made it possible for academic groups and small companies to build text-to-music, text-to-sound, and zero-shot voice cloning systems without retraining a codec from scratch.
EnCodec has also been used as a learned, perceptually motivated front-end for downstream audio understanding tasks, replacing log-mel filterbanks as the input to classifiers or self-supervised models. The quantized codes form a compact alphabet that is more amenable to symbolic operations such as masking, splicing, or template-based editing than raw audio frames.
EnCodec inherits the standard pathologies of GAN-trained vocoders. At its lowest bitrates the codec can introduce "fizzy" or "underwater" artifacts on speech, and the multi-scale STFT discriminator does not entirely eliminate amplitude-modulation noise on sustained tones. The paper's own MUSHRA tables show that at 1.5 kbps the 24 kHz model achieves a clean-speech score of 49.2, well below the 95.5 reference, so the codec is not transparent at the bottom of its operating range.[1]
The 48 kHz model's one-second initial latency rules out conversational use: it is designed for music streaming and archival applications where the delay is amortised.[1] The entropy-coded variant at 48 kHz runs slower than real time at 0.68x for encoding on a single CPU thread, meaning that pure-CPU live streaming requires either skipping the language model or using GPU acceleration.[1]
EnCodec's pretrained checkpoints in facebookresearch/encodec are MIT-licensed; the AudioCraft-bundled music model is non-commercially licensed.[2][7] This bifurcation has caused friction for downstream developers who want to ship commercial products built on top of MusicGen or AudioGen but discover that their codec backbone is non-commercially licensed.
Discrete-token codecs like EnCodec also lose phase information at the codebook step, and EnCodec's commitment-loss objective does not directly constrain the perceptual smoothness of long sequences. Subsequent work, including Descript Audio Codec (DAC) in 2023 and HiFi-Codec in the same period, refined the RVQ training objective and showed measurable improvements on speech and singing voice; the Descript codec, in particular, claims higher reconstruction quality at comparable bitrates and has displaced EnCodec in some newer generative pipelines.[9]
Finally, the small Transformer entropy model has a 3.5-second receptive field and predicts all codebooks at a single time step independently. This independence assumption costs some compression efficiency relative to a fully joint autoregressive model over codes, but the paper makes a deliberate tradeoff for inference speed.[1] Higher compression ratios are possible by using larger, slower language models at the expense of the codec's real-time guarantee.
| Codec | Year | Type | Sample rates | Min bitrate (perceptual) | Streaming | Notes |
|---|---|---|---|---|---|---|
| Opus | 2012 | Hand-engineered | 8 to 48 kHz | 6 kbps narrowband, 24 kbps stereo wideband for music | Yes | IETF standard, widely deployed |
| EVS | 2014 | Hand-engineered | 8 to 48 kHz | 5.9 kbps speech | Yes | 3GPP standard, VoLTE |
| MP3 | 1993 | Hand-engineered | 32 to 48 kHz | 96 kbps for transparent music | No | ISO/IEC 11172-3 |
| SoundStream / Lyra v2 | 2021 | Neural (RVQ) | 16 kHz, 32 kHz | 3 kbps speech | Yes | Google, MS-STFT not used |
| EnCodec | 2022 | Neural (RVQ + Transformer entropy) | 24 kHz mono, 48 kHz stereo | 1.5 kbps speech, 6 kbps stereo music | Yes (24 kHz) | Meta, MIT license |
| Descript Audio Codec | 2023 | Neural (RVQ + improved training) | 16, 24, 44.1 kHz | Comparable or better than EnCodec at same bitrate | Limited | Open codec, used in newer generative pipelines |
The paper's own comparison emphasises three points. First, EnCodec outscores Lyra v2 and Opus in every MUSHRA column of the published table when compared at matched bitrates from 3 to 12 kbps.[1] Second, EnCodec at 6 kbps for stereo music is MUSHRA-equivalent to MP3 at 64 kbps, roughly a 10x reduction in bitrate.[1] Third, the simplification of the discriminator stack from MSD + Mono-STFT (the SoundStream recipe) to a single MS-STFT family delivered both faster training and equal or better quality, suggesting that prior work over-specified the adversary stack.[1]