Wav2Vec 2.0
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 3,515 words
Wav2Vec 2.0 is a self-supervised learning framework for speech representation, developed by the Facebook AI Research (FAIR) group at Meta and introduced in 2020. It learns useful representations of spoken language directly from raw, untranscribed audio, and a model pretrained this way can then be adapted for automatic speech recognition (ASR) with a comparatively small amount of labeled data. The central result of the work is that powerful speech representations learned from audio alone, followed by light supervised fine-tuning, can outperform earlier semi-supervised systems while being conceptually simpler. The most widely used public checkpoint, distributed on Hugging Face as facebook/wav2vec2-base-960h, is a base-sized model pretrained and fine-tuned on 960 hours of the LibriSpeech corpus.
The approach was described in the paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, posted to arXiv in June 2020 (arXiv:2006.11477) and presented at NeurIPS 2020. Wav2Vec 2.0 became one of the most influential speech models of its era. It demonstrated that the recipe that had reshaped natural language processing, namely large-scale self-supervised pretraining followed by task-specific fine-tuning, could be carried over to audio, and it set off a wave of follow-on work including the multilingual XLSR and XLS-R models, HuBERT, WavLM, and data2vec.
Building an accurate speech recognizer has traditionally required large quantities of transcribed audio, meaning recordings paired with the exact words that were spoken. Transcription is expensive and slow, and for most of the world's roughly 7,000 languages little or no transcribed audio exists. Wav2Vec 2.0 attacks this bottleneck by separating learning into two stages. In the first stage, the model is pretrained on raw audio with no transcripts at all, learning the statistical structure of speech sounds purely from listening. In the second stage, a small labeled dataset is used to fine-tune the pretrained network into a working transcription system.
The pretraining objective is a self-supervised one: the model is never told what words are present, but is instead trained to solve a puzzle constructed from the audio itself. Wav2Vec 2.0 masks spans of the speech signal in a latent (internal, learned) space and then trains the network to identify the correct latent content of each masked span from among a set of distractors. This is a contrastive task, closely related in spirit to masked language modeling in text but adapted to the continuous, high-rate nature of audio. Because the targets are quantized, that is, drawn from a learned, finite inventory of discrete speech units rather than from raw continuous values, the problem becomes a discrimination task with a well-defined answer.
The architecture has three parts that work in sequence. A convolutional neural network feature encoder turns the raw waveform into a sequence of latent feature vectors at a much lower frame rate. A transformer context network then builds contextualized representations that incorporate information from across the whole utterance. In parallel, a quantization module maps the encoder outputs to discrete units that serve as the prediction targets during pretraining. After pretraining, the quantization module is discarded and a single linear projection is added on top of the transformer; the whole network is then fine-tuned with a Connectionist Temporal Classification (CTC) objective to emit characters or phonemes.
The headline empirical claims are striking. Using all 960 hours of labeled LibriSpeech data, Wav2Vec 2.0 reaches a word error rate (WER) of 1.8 on the test-clean set and 3.3 on test-other. More importantly for low-resource settings, a large model pretrained on roughly 53,000 hours of unlabeled audio and fine-tuned on just ten minutes of transcribed speech still achieves 4.8 WER on test-clean and 8.2 on test-other, a level that would have been considered strong using vastly more labeled data only a year or two earlier.
The feature encoder is a multi-layer one-dimensional convolutional network that consumes the raw, normalized 16 kHz audio waveform and produces a sequence of latent speech representations. In the standard configuration it has seven convolutional blocks, each with 512 channels. The kernel widths are (10, 3, 3, 3, 3, 2, 2) and the strides are (5, 2, 2, 2, 2, 2, 2), with a GELU activation after each block and a layer-normalization step early in the stack. The strides multiply to 320 samples, so the encoder emits one feature vector roughly every 20 milliseconds of audio, with each vector summarizing a receptive field of 400 samples, or about 25 milliseconds. This compression is essential, because it converts an unwieldy stream of 16,000 samples per second into a far shorter sequence (about 50 vectors per second) that a transformer can process efficiently.
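The frame rate and receptive field follow from the kernel widths and strides alone, which a few lines of arithmetic can confirm. The sketch below is not the model's code. It assumes unpadded convolutions and kernel widths of (10, 3, 3, 3, 3, 2, 2), the values that are consistent with a 25 ms receptive field, and all helper names are hypothetical.

```python
from math import prod

SAMPLE_RATE = 16_000                 # samples per second
KERNELS = (10, 3, 3, 3, 3, 2, 2)     # assumed kernel widths, one per block
STRIDES = (5, 2, 2, 2, 2, 2, 2)      # strides, one per block


def total_stride(strides):
    """Input samples between the starts of two consecutive output frames."""
    return prod(strides)


def receptive_field(kernels, strides):
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                 # spacing of this layer's outputs
    return field


def num_frames(num_samples, kernels, strides):
    """Output length of the stack for a given input length (no padding)."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


hop_ms = 1000 * total_stride(STRIDES) / SAMPLE_RATE
field_ms = 1000 * receptive_field(KERNELS, STRIDES) / SAMPLE_RATE
# prints: 20.0 25.0 49
print(hop_ms, field_ms, num_frames(SAMPLE_RATE, KERNELS, STRIDES))
```

One second of audio therefore yields 49 frames rather than exactly 50, because the unpadded convolutions cannot start a full window near the end of the signal.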
A subtle but important design choice is that the convolutions operate directly on the waveform rather than on hand-engineered spectral features such as log-mel filterbanks or MFCCs. This means the front end itself is learned end to end, and the model is free to discover whatever low-level acoustic cues best support the pretraining objective.
The quantization module is what gives the contrastive objective a concrete set of targets. Continuous encoder outputs are mapped to discrete units through product quantization, which selects entries from multiple small codebooks and concatenates them. In the published configuration there are two codebook groups, each containing 320 entries, so the model can in principle compose 320 times 320, or 102,400, distinct quantized speech units from a modest number of learned vectors. Each chosen pair of code entries is concatenated and passed through a linear layer to form the quantized representation.
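As a rough illustration of product quantization, the sketch below picks one entry per codebook group and concatenates the chosen vectors. It is a toy under stated assumptions: the codebooks are tiny and hand-written, a plain argmax stands in for the trained selection mechanism, the final linear layer is omitted, and the function names are hypothetical.

```python
NUM_GROUPS = 2
ENTRIES_PER_GROUP = 320


def num_composite_codes(groups, entries_per_group):
    """How many distinct units can be composed from the codebooks."""
    return entries_per_group ** groups


def product_quantize(group_logits, codebooks):
    """Pick the best-scoring entry in each group and concatenate them.

    group_logits[g][v] is the score of entry v in group g.
    codebooks[g][v] is the code vector for that entry.
    """
    indices, quantized = [], []
    for logits, codebook in zip(group_logits, codebooks):
        index = max(range(len(logits)), key=logits.__getitem__)
        indices.append(index)
        quantized.extend(codebook[index])   # concatenation across groups
    return indices, quantized


# Toy example: 2 groups, 3 entries each, 2-dimensional code vectors.
toy_codebooks = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
toy_logits = [[0.2, 1.5, -0.3], [2.0, 0.1, 0.4]]

print(num_composite_codes(NUM_GROUPS, ENTRIES_PER_GROUP))   # 102400
print(product_quantize(toy_logits, toy_codebooks))          # ([1, 0], [0.3, 0.4, 1.0, 1.1])
```

The point of the design is that 640 stored vectors (2 groups of 320) are enough to address 102,400 composite units.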
Selecting a discrete entry is inherently non-differentiable, which would normally block gradient-based training. Wav2Vec 2.0 sidesteps this with the Gumbel-softmax, a technique that makes the categorical choice differentiable during training by using a temperature-controlled soft approximation in the backward pass while still committing to a single hard selection in the forward pass. The temperature is annealed over the course of training, starting high (closer to a soft, uniform choice) and decreasing toward a near-deterministic selection. This lets gradients flow back into the codebooks so that the discrete inventory is learned jointly with the rest of the network rather than fixed in advance.
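The forward pass of a Gumbel-softmax selection can be sketched in plain Python. This is an illustration only: it has no automatic differentiation, so it shows the soft distribution and the hard one-hot choice but not the gradient path. The temperature schedule defaults (start 2.0, floor 0.5, multiplicative decay per update) are assumed illustrative values rather than a statement of the released configuration, and the function names are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return (soft, hard): a relaxed distribution and its one-hot argmax."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                      # subtract the max for stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # In an autograd framework the straight-through output is usually
    # written as hard + soft - stop_gradient(soft): the forward value is
    # the one-hot vector while gradients follow the soft distribution.
    return soft, hard


def annealed_temperature(step, start=2.0, floor=0.5, decay=0.999995):
    """Multiplicative decay toward a floor (assumed illustrative values)."""
    return max(floor, start * decay ** step)


rng = random.Random(0)
soft, hard = gumbel_softmax([1.0, 2.0, 0.5], annealed_temperature(0), rng)
print(soft, hard)
```

A high temperature flattens `soft` toward uniform, while a low temperature pushes it toward the one-hot `hard` vector, which is the behavior the annealing schedule exploits.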
Because a contrastive objective can collapse if the model simply uses a handful of codes for everything, the training loss includes a diversity term (described below) that pushes the model to make balanced use of all available code entries.
Pretraining begins by masking. After the feature encoder runs, a fraction of the resulting time steps are chosen as mask starting points, each with probability 0.065, and from every chosen point a span of ten consecutive time steps is masked. Because the spans can overlap, roughly half of all time steps (about 49 percent) end up masked in a typical utterance. The masked latent vectors are replaced with a shared, learned mask embedding before they enter the transformer. Crucially, the masking happens in latent space, on the encoder outputs, rather than on the raw waveform.
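The "about 49 percent" figure can be checked with a short calculation and a simulation. The sketch assumes each time step independently becomes a span start with probability 0.065, which is a simplification of the sampling procedure, and it ignores boundary effects in the closed-form estimate. The function names are hypothetical.

```python
import random


def expected_masked_fraction(start_prob, span):
    """A step is masked if any of the `span` preceding starts was chosen."""
    return 1.0 - (1.0 - start_prob) ** span


def sample_mask(num_steps, start_prob, span, rng):
    """Draw span starts independently and mask `span` steps from each."""
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True   # overlapping spans simply stay masked
    return mask


rng = random.Random(0)
mask = sample_mask(100_000, 0.065, 10, rng)
print(round(expected_masked_fraction(0.065, 10), 3))   # 0.489
print(round(sum(mask) / len(mask), 3))                 # empirical estimate
```

Without overlap the masked share would be 0.065 × 10 = 65 percent, so overlapping spans account for the gap down to roughly 49 percent.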
The transformer context network then processes the partly masked sequence and produces a contextualized output vector at every position, including the masked ones. Instead of the usual fixed sinusoidal or learned absolute position embeddings, Wav2Vec 2.0 uses a convolutional layer as a relative positional embedding, adding the convolution's output to the inputs before the transformer layers. The base model uses 12 transformer blocks with a model dimension of 768, and the large model uses 24 blocks with a dimension of 1,024.
The self-supervised objective asks, for each masked position, that the transformer's contextualized output be more similar to the true quantized representation of that span than to a set of distractors. The distractors, called negatives, are 100 quantized vectors sampled uniformly from other masked time steps within the same utterance. Similarity is measured with cosine similarity, scaled by a temperature of 0.1, and the contrastive loss is a cross-entropy over the true target plus the 100 negatives. This is denoted the contrastive loss, L_m.
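For a single masked position, the loss just described can be written out directly. The following is a minimal sketch with hand-made two-dimensional vectors and two negatives instead of 100; the function names are hypothetical and the inputs are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, negatives, temperature=0.1):
    """Cross-entropy of picking `target` among target plus negatives."""
    candidates = [target] + list(negatives)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    peak = max(logits)                      # log-sum-exp, computed stably
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[0]             # -log softmax of the true target


# Context aligned with the true target: loss close to zero.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Context aligned with a distractor instead: loss is large.
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good, bad)
```

Dividing by the temperature of 0.1 stretches cosine similarities from the range [-1, 1] to [-10, 10], which is what lets the softmax become sharply peaked.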
The total pretraining loss adds a diversity loss, L_d, weighted at 0.1. The diversity term encourages the model to use all of the codebook entries roughly equally by maximizing the entropy of the average code-selection distribution over each batch, which prevents the quantizer from collapsing onto a few dominant codes. A small L2 penalty on the encoder outputs is also applied for stability. The combined objective trains the encoder, the transformer, and the quantizer simultaneously, with no transcripts involved at any point.
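The diversity term can be sketched as the negative entropy of the batch-averaged code distribution, normalized by the number of groups and entries, which is how the paper writes it; released implementations may compute a related penalty differently. The helper names and the toy four-entry codebooks below are assumptions made for illustration.

```python
import math


def mean_distribution(frame_probs):
    """Average per-frame code distributions for one codebook group."""
    count = len(frame_probs)
    size = len(frame_probs[0])
    return [sum(frame[v] for frame in frame_probs) / count for v in range(size)]


def diversity_loss(avg_probs_per_group):
    """Negative entropy of the averaged distributions, scaled by 1/(G*V).

    Minimizing this value maximizes entropy, i.e. spreads usage
    evenly across the entries of every group.
    """
    groups = len(avg_probs_per_group)
    entries = len(avg_probs_per_group[0])
    neg_entropy = sum(
        p * math.log(p)
        for group in avg_probs_per_group
        for p in group
        if p > 0.0
    )
    return neg_entropy / (groups * entries)


def total_loss(contrastive, diversity, diversity_weight=0.1):
    """Combined objective: L = L_m + 0.1 * L_d."""
    return contrastive + diversity_weight * diversity


uniform = [[0.25] * 4, [0.25] * 4]                        # balanced usage
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # one code dominates
print(diversity_loss(uniform), diversity_loss(collapsed))
```

Balanced usage gives the lowest possible value, and a fully collapsed codebook gives zero, so gradient descent on the total loss is pushed away from collapse.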
Once pretraining is complete, the quantization module is no longer needed and is set aside. To turn the model into a recognizer, a single randomly initialized linear layer is placed on top of the transformer to project each output frame onto a vocabulary, for example the set of characters plus a word boundary token and a special blank symbol. The model is then fine-tuned end to end on labeled audio using the Connectionist Temporal Classification loss.
CTC is well suited to speech because it does not require a frame-by-frame alignment between the audio and the transcript. It instead sums over all possible alignments of the output frame sequence to the target label sequence, using a blank token to absorb repeated frames and gaps, so the network can be trained with only the unaligned target text. In the paper's setup the convolutional feature encoder is kept frozen throughout fine-tuning, only the new output layer is trained for the first updates before the transformer is unfrozen, and a SpecAugment-style masking of time and feature dimensions is applied to the latent representations as a regularizer. At inference the frame-level predictions can be greedily decoded into text, or combined with an external language model and beam search for lower error rates. The fine-tuning stage is data-efficient precisely because the heavy lifting of learning acoustic structure was already done during self-supervised pretraining, so the labeled data only has to teach the mapping from learned representations to written symbols.
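Greedy CTC decoding, the simplest of the inference options mentioned above, can be sketched in a few lines: take the best token per frame, collapse consecutive repeats, then drop blanks. The token names `<pad>` for the blank and `|` for the word boundary are assumptions chosen for illustration, as are the function names.

```python
BLANK = "<pad>"        # assumed name of the CTC blank token
WORD_BOUNDARY = "|"    # assumed name of the word boundary token


def best_tokens(frame_scores, vocab):
    """Argmax over the vocabulary for every output frame."""
    return [vocab[max(range(len(s)), key=s.__getitem__)] for s in frame_scores]


def ctc_greedy_decode(frame_tokens, blank=BLANK, word_boundary=WORD_BOUNDARY):
    """Collapse repeated tokens, remove blanks, and restore spaces."""
    collapsed = []
    previous = None
    for token in frame_tokens:
        if token != previous:          # keep only the first of each run
            collapsed.append(token)
        previous = token
    chars = [token for token in collapsed if token != blank]
    return "".join(chars).replace(word_boundary, " ").strip()


frames = ["H", "H", "<pad>", "E", "L", "L", "<pad>", "L", "O", "O",
          "|", "<pad>", "T", "O"]
print(ctc_greedy_decode(frames))   # HELLO TO
```

The blank between the two runs of `L` is what allows a genuine double letter to survive the collapse step.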
Wav2Vec 2.0 was released in two principal sizes, BASE and LARGE, and was later extended into multilingual variants. The original models were pretrained either on the 960 hours of LibriSpeech audio or on the much larger Libri-Light corpus (about 53,200 hours of unlabeled English audiobook speech, often abbreviated LV-60k), and then fine-tuned on labeled subsets ranging from ten minutes up to the full 960 hours.
| Model | Transformer layers | Model dimension | Approx. parameters | Pretraining data | Notes |
|---|---|---|---|---|---|
| Wav2Vec 2.0 BASE | 12 | 768 | about 95 million | LibriSpeech 960h | Reference base model; facebook/wav2vec2-base-960h is fine-tuned on all 960h |
| Wav2Vec 2.0 LARGE | 24 | 1,024 | about 317 million | LibriSpeech 960h or Libri-Light (LV-60k, ~53.2k h) | Best English results; wav2vec2-large-960h-lv60-self adds self-training |
| XLSR-53 | 24 | 1,024 | about 300 million | ~56k hours across 53 languages | Cross-lingual model; fine-tune per language for ASR |
| XLS-R | up to 48 | up to 1,920 | 0.3B, 1B, and 2B | ~436k hours across 128 languages | Scaled successor to XLSR-53 |
The public Hugging Face checkpoint facebook/wav2vec2-base-960h is the BASE model, with about 95 million parameters. The Transformers library exposes the architecture through classes such as Wav2Vec2Model for the bare encoder, Wav2Vec2ForCTC for recognition, Wav2Vec2ForPreTraining for self-supervised training, and Wav2Vec2ForSequenceClassification for audio classification. A Wav2Vec2Processor bundles the feature extractor and the CTC tokenizer.
XLSR, short for cross-lingual speech representations, applies the same self-supervised recipe to many languages at once. Instead of pretraining on a single language, a single model is pretrained on raw audio drawn from dozens of languages, and the quantization codebook is shared across all of them. The shared discrete inventory encourages the model to discover speech units that generalize across languages, which is especially valuable for languages with little data, because they can benefit from acoustic regularities learned from high-resource languages.
The original XLSR work, "Unsupervised Cross-lingual Representation Learning for Speech Recognition" by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli (arXiv:2006.13979), introduced this idea. Its largest release, XLSR-53, has roughly 300 million parameters and was pretrained on about 56,000 hours of speech spanning 53 languages, drawing on the Multilingual LibriSpeech, CommonVoice, and BABEL datasets. The model is distributed as facebook/wav2vec2-large-xlsr-53 and is intended to be fine-tuned per language for transcription. The paper reported large gains over training each language from scratch, including a 72 percent relative reduction in phoneme error rate on CommonVoice and a 16 percent relative reduction in word error rate on BABEL compared with prior comparable systems.
XLS-R, described in "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale" (arXiv:2111.09296), pushed the multilingual approach much further. It was trained on nearly half a million hours, about 436,000 hours, of publicly available speech in 128 languages, and was released in sizes of roughly 0.3 billion, 1 billion, and 2 billion parameters. XLS-R improved cross-lingual transcription, speech translation, and language identification, and it underpins later Meta speech efforts. These multilingual descendants, together with Meta's broader Massively Multilingual Speech (MMS) project, extended Wav2Vec 2.0 style pretraining to well over a thousand languages.
The results that made Wav2Vec 2.0 notable are its low-resource numbers on the LibriSpeech benchmark, where performance is reported as word error rate on the test-clean and test-other splits. Lower is better; test-other is the harder, noisier split. The table below summarizes representative figures from the original paper for the LARGE model pretrained on the 53,200-hour Libri-Light corpus and fine-tuned on different amounts of labeled data, with a language model used during decoding.
| Labeled fine-tuning data | WER test-clean | WER test-other |
|---|---|---|
| 10 minutes | 4.8 | 8.2 |
| 1 hour | 3.9 | 7.6 |
| 10 hours | 3.2 | 6.1 |
| 100 hours | 2.3 | 5.0 |
| 960 hours (full) | 1.8 | 3.3 |
The ten-minute result is the most cited: with only about ten minutes of transcribed audio, plus a great deal of unlabeled pretraining, the model reached 4.8 / 8.2 WER. With one hour of labeled data the model reached 3.9 / 7.6, which the authors noted outperformed the previous state of the art that had been trained on the full 100-hour clean subset, while using roughly 100 times less labeled data. Using all 960 hours of labels, the system set a then-state-of-the-art result of 1.8 / 3.3.
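For readers unfamiliar with the metric, word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words and expressed as a percentage. The sketch below scores a single utterance with a hypothetical helper; benchmark figures are normally computed over a whole test set by summing errors and reference words before dividing.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent for one utterance, via Levenshtein distance on words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the words seen so far in ref
    # and the first j words of hyp.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# One deleted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 1))  # 16.7
```

On this scale, a WER of 1.8 corresponds to fewer than two word errors per hundred reference words.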
It is worth distinguishing these research figures from the numbers reported for the specific public base checkpoint. The model card for facebook/wav2vec2-base-960h lists 3.4 WER on test-clean and 8.6 on test-other, which reflect the smaller base architecture and the evaluation setup used there, including decoding without the strongest external language model. Independent evaluations on out-of-domain corpora, such as conversational meeting audio, telephone speech, or earnings calls, report substantially higher error rates, which is expected because the model was trained on read audiobook speech and does not generalize perfectly to very different acoustic conditions.
The dominant application of Wav2Vec 2.0 is automatic speech recognition, and it is widely used as a starting point for building transcription systems in languages and domains where labeled data is scarce. Because the pretrained network already encodes rich acoustic structure, practitioners can fine-tune it on a few hours, or in extreme cases minutes, of in-domain transcribed audio and obtain a usable recognizer. This made it a practical foundation for low-resource and under-served languages, an area where the XLSR and XLS-R variants are especially relevant.
Beyond plain transcription, the representations learned by Wav2Vec 2.0 transfer to a broad set of downstream speech tasks. The contextualized features are commonly used, often with the backbone frozen and only a small head trained on top, for speaker identification and verification, spoken-language identification, emotion recognition, keyword spotting, intent classification in spoken-language understanding, phoneme recognition, and audio classification. The model also serves as a component in pipelines for speech translation and as a feature extractor in research benchmarks such as SUPERB that probe how well self-supervised speech models capture different kinds of information. In addition, the discrete units produced by the quantizer have been used as targets or tokens in generative and resynthesis systems, contributing to the broader move toward treating speech as a sequence of learned discrete tokens.
For developers, the model's presence in the Hugging Face Transformers library, combined with its permissive license, lowered the barrier to entry considerably. A handful of lines of code load the processor and the CTC model, run inference on a waveform, take the argmax over the output logits, and decode the result to text, which made Wav2Vec 2.0 a default teaching example and a common production baseline for ASR.
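The steps listed above might look roughly like the following. This is a sketch under assumptions: it requires the `torch` and `transformers` packages and network access to download the checkpoint, it expects the caller to supply a mono waveform already sampled at 16 kHz, and the wrapper function `transcribe` is a hypothetical name rather than part of the library.

```python
def transcribe(waveform, sampling_rate=16_000,
               checkpoint="facebook/wav2vec2-base-960h"):
    """Greedy CTC transcription of a mono waveform (1-D sequence of floats)."""
    # Imported lazily so that defining this function does not require the
    # heavy dependencies; calling it does.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
    model.eval()

    # Normalize the audio and wrap it in a batch of size one.
    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # (batch, frames, vocab)

    predicted_ids = torch.argmax(logits, dim=-1)     # best token per frame
    # batch_decode collapses repeats and removes CTC blanks.
    return processor.batch_decode(predicted_ids)[0]
```

Loading the processor and model on every call keeps the example short; a real service would load them once and reuse them across requests.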
Wav2Vec 2.0 has several practical and conceptual limitations. The pretraining stage is computationally demanding: learning good representations from tens of thousands of hours of audio requires large compute budgets and many GPU days, which puts pretraining from scratch out of reach for many groups, even though using the released checkpoints is cheap. The contrastive objective is also sensitive to hyperparameters such as the masking probability, the number of negatives, the Gumbel-softmax temperature schedule, and the diversity loss weight, and unstable training or codebook collapse can occur if these are mis-set.
The quality of a fine-tuned recognizer depends heavily on how well the fine-tuning and deployment audio matches the pretraining distribution. The most popular English checkpoints were pretrained on clean read audiobook speech, so accuracy degrades on spontaneous conversation, accented speech, far-field or noisy recordings, telephone-bandwidth audio, and specialized vocabularies, as the elevated error rates on out-of-domain benchmarks show. The fixed CTC character or phoneme vocabulary used during fine-tuning also assumes a particular target language and writing system, and adapting to a new script generally means re-fitting the output layer and fine-tuning again.
There are inherent properties of the CTC head worth noting as well. CTC assumes conditional independence between output frames given the audio, so a fine-tuned Wav2Vec 2.0 model has no built-in language model and benefits noticeably from an external one at decode time for best accuracy. The model expects single-channel audio sampled at 16 kHz and operates on fixed-length latent frames, so it is not by itself a complete solution for problems like speaker diarization, overlapping speech, or long-form streaming transcription, each of which needs additional components. Finally, like other large speech models trained on web-scale or read-speech corpora, it can reflect demographic and linguistic biases present in its data, performing less well for some accents, dialects, and speaker groups than for others.
Wav2Vec 2.0 is the culmination of a line of self-supervised speech research at Facebook AI Research. The first wav2vec, introduced in 2019 by Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, used a fully convolutional network and a contrastive predictive coding style objective to learn speech representations from raw audio, and it showed that such pretraining could reduce word error rates when labeled data was limited. It was followed by vq-wav2vec, which added a vector-quantization step to produce discrete units and connected speech representation learning to the discrete-token, BERT-style modeling that was dominant in text.
Wav2Vec 2.0 unified these threads into a single end-to-end model. Rather than learning discrete units in one model and then training a separate masked model on top, it masked latent representations, learned the quantized targets jointly, and solved one contrastive objective over the whole network. The paper was submitted to arXiv on 20 June 2020 as arXiv:2006.11477 and was published at the 34th Conference on Neural Information Processing Systems (NeurIPS) in December 2020. Code and pretrained models were released as part of Meta's fairseq toolkit, and the weights were later mirrored to the Hugging Face Hub. The English models and the public base checkpoint are distributed under the permissive Apache 2.0 license, which contributed to broad adoption in both research and industry.
The model's release was quickly followed by the multilingual XLSR work in mid-2020 and, in late 2021, by the much larger XLS-R models covering 128 languages. The architecture also influenced and was compared against a series of successor self-supervised speech models, including HuBERT, which replaced the contrastive objective with masked prediction of cluster targets; WavLM, which added denoising and speaker-aware pretraining; and data2vec, a unified self-supervised framework spanning speech, vision, and text. Today Wav2Vec 2.0 is regarded as a landmark that brought the pretraining-and-fine-tuning paradigm firmly into speech processing, and its base checkpoint remains one of the most downloaded speech models on the Hugging Face Hub.