Wav2Vec is a family of self-supervised learning models developed by Meta AI (formerly Facebook AI Research) for learning speech representations directly from raw audio. Introduced progressively between 2019 and 2023, the Wav2Vec model family demonstrated that powerful speech representations can be learned from unlabeled audio data, drastically reducing the need for expensive human-transcribed training corpora. The models have become foundational in automatic speech recognition (ASR), particularly for low-resource languages where transcribed data is scarce.
The Wav2Vec lineage spans several major releases: the original Wav2Vec (2019), vq-wav2vec (2019), Wav2Vec 2.0 (2020), XLSR-53 (2020), XLS-R (2021), and MMS (2023). Each iteration introduced architectural or methodological improvements that pushed the boundaries of what self-supervised speech models could achieve. Wav2Vec 2.0, the most influential release in the series, showed that a model pre-trained on 53,000 hours of unlabeled speech and fine-tuned on just 10 minutes of transcribed audio could achieve competitive word error rates (WER) on standard benchmarks.
Before the Wav2Vec series, state-of-the-art speech recognition systems relied heavily on large volumes of transcribed audio. Collecting and annotating speech data is both time-consuming and costly, and for the vast majority of the world's roughly 7,000 languages, sufficient labeled corpora simply do not exist. Traditional ASR pipelines used hand-crafted acoustic features such as mel-frequency cepstral coefficients (MFCCs) or log-mel spectrograms and required extensive supervised training.
The success of unsupervised and self-supervised pre-training in natural language processing, exemplified by models like BERT and GPT, inspired researchers to explore analogous strategies for speech. The central question driving the Wav2Vec project was straightforward: could models learn useful speech representations from raw waveforms alone, without any transcriptions, and then be fine-tuned with minimal labeled data to achieve strong recognition performance?
The original Wav2Vec model was introduced by Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli in the paper "wav2vec: Unsupervised Pre-training for Speech Recognition," presented at Interspeech 2019. It was the first model to apply unsupervised pre-training to speech recognition using a fully convolutional neural network architecture.
Wav2Vec 1.0 consisted of two cascaded convolutional networks:

- an encoder network of five convolutional layers that maps raw audio samples to latent feature representations, covering about 30 ms of audio per output vector at a 10 ms stride; and
- a context network of nine convolutional layers that combines multiple encoder time steps into contextualized representations spanning up to about 210 ms.
The base model used approximately 34 million parameters. A larger variant, "wav2vec large," included two additional linear transformations in the encoder and a 12-layer context network with increasing kernel sizes (2, 3, ..., 13).
Wav2Vec 1.0 was trained using a contrastive loss based on noise-contrastive estimation. The model learned to distinguish a true future audio frame from a set of negative (distractor) samples drawn from the same sequence. This approach encouraged the context network to capture information predictive of upcoming audio frames.
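The idea can be illustrated with a toy, framework-free sketch. This is not the paper's exact sigmoid-based noise-contrastive formulation; it uses the simpler softmax (InfoNCE-style) form of the same principle, with made-up three-dimensional vectors standing in for encoder and context outputs:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(context, true_future, distractors):
    """InfoNCE-style loss: the context vector should score the true
    future frame higher than the sampled distractor frames."""
    scores = [cosine(context, true_future)] + [cosine(context, d) for d in distractors]
    exp_scores = [math.exp(s) for s in scores]
    return -math.log(exp_scores[0] / sum(exp_scores))

random.seed(0)
ctx = [1.0, 0.0, 0.0]
pos = [0.9, 0.1, 0.0]   # similar to the context vector, so the loss is low
negs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(10)]
loss = contrastive_loss(ctx, pos, negs)
```

Minimizing this quantity pushes the context representation toward the true future frame and away from the distractors, which is what forces the network to encode predictive information.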
Pre-trained on the LibriSpeech dataset and evaluated on the Wall Street Journal (WSJ) corpus, Wav2Vec 1.0 achieved a WER of 2.43% on the nov92 test set. This outperformed Deep Speech 2, the best character-based system reported at the time, while using two orders of magnitude less labeled training data. When only a few hours of transcribed data were available, the pre-trained representations reduced WER by up to 36% over a strong character-based log-mel filterbank baseline.
The vq-wav2vec model, introduced by Alexei Baevski, Steffen Schneider, and Michael Auli and presented at ICLR 2020, extended the Wav2Vec approach by learning discrete (quantized) speech representations. This was a pivotal step because it bridged the gap between continuous audio signals and discrete-input NLP algorithms.
vq-wav2vec applied vector quantization to the dense representations produced by the Wav2Vec encoder. Two quantization strategies were explored:

- Gumbel-Softmax quantization, which makes the discrete codeword choice differentiable through a continuous relaxation of categorical sampling; and
- online k-means clustering, which selects the nearest codeword by Euclidean distance and propagates gradients through the discrete choice with a straight-through estimator.
Both methods converted the continuous encoder outputs into sequences of discrete tokens, each drawn from a learned codebook.
The key innovation of vq-wav2vec was that, once speech had been converted to discrete tokens, standard NLP pre-training methods could be applied directly. The authors demonstrated that running BERT-style masked language model pre-training on the discrete speech tokens yielded strong improvements on downstream tasks. This two-stage pipeline (vq-wav2vec quantization followed by BERT pre-training) achieved state-of-the-art results on TIMIT phoneme classification and WSJ speech recognition at the time of publication.
Wav2Vec 2.0, introduced by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli at NeurIPS 2020, unified and substantially improved upon the ideas in Wav2Vec 1.0 and vq-wav2vec. It is the most widely cited and adopted model in the family, combining contrastive learning with masked prediction in a single end-to-end framework.
Wav2Vec 2.0 consists of three main components:
1. Convolutional Feature Encoder
The feature encoder is a 7-layer temporal CNN that processes raw 16 kHz audio waveforms. All seven layers use 512 channels, and each layer is followed by layer normalization and a GELU activation function. The specific kernel widths and strides for the seven layers are:
| Layer | Kernel Width | Stride |
|---|---|---|
| 1 | 10 | 5 |
| 2 | 3 | 2 |
| 3 | 3 | 2 |
| 4 | 3 | 2 |
| 5 | 3 | 2 |
| 6 | 2 | 2 |
| 7 | 2 | 2 |
The total stride of the encoder is 320 samples, producing one output vector every 20 milliseconds. The receptive field is 400 input samples, corresponding to 25 milliseconds of audio at 16 kHz.
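Both figures follow directly from the table: the total stride is the product of the per-layer strides, and the receptive field can be computed by folding the standard formula over the layers in reverse. A short sanity check:

```python
# Kernel widths and strides of the wav2vec 2.0 feature encoder, as tabulated above.
kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]

# Total stride: the product of the per-layer strides.
total_stride = 1
for s in strides:
    total_stride *= s

# Receptive field: fold (rf - 1) * stride + kernel over layers in reverse.
receptive_field = 1
for k, s in zip(reversed(kernels), reversed(strides)):
    receptive_field = (receptive_field - 1) * s + k

sample_rate = 16_000
print(total_stride)                        # 320 samples: one output frame per 20 ms
print(receptive_field)                     # 400 samples: 25 ms of audio context
print(1000 * total_stride / sample_rate)   # 20.0 ms hop between frames
```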
2. Transformer Context Network
The output of the CNN encoder (512-dimensional vectors) passes through a linear feature projection layer that maps it to the Transformer's hidden size (768 for BASE, 1,024 for LARGE) before feeding it into a Transformer encoder. The Transformer uses a convolutional layer for relative positional embeddings instead of fixed sinusoidal positional encodings. Two model configurations were released:
| Configuration | Transformer Layers | Hidden Size | Attention Heads | FFN Size | Parameters |
|---|---|---|---|---|---|
| BASE | 12 | 768 | 8 | 3,072 | ~95M |
| LARGE | 24 | 1,024 | 16 | 4,096 | ~317M |
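The parameter counts in the table are dominated by the Transformer stack. A back-of-the-envelope estimate, ignoring biases, LayerNorms, the CNN encoder, and the projection layers (which make up the remainder), comes close to the published figures:

```python
def transformer_params(layers, d, ffn):
    """Rough per-layer count: 4*d*d for the attention projections (Q, K, V,
    output) plus 2*d*ffn for the feed-forward block; biases and LayerNorms
    are ignored."""
    return layers * (4 * d * d + 2 * d * ffn)

base = transformer_params(12, 768, 3072)
large = transformer_params(24, 1024, 4096)
print(round(base / 1e6))    # 85 -> ~95M once the CNN encoder etc. are added
print(round(large / 1e6))   # 302 -> ~317M with the remaining modules
```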
3. Quantization Module
The quantization module discretizes the CNN encoder outputs using product quantization with a Gumbel-Softmax distribution. The default configuration uses G = 2 codebooks, each containing V = 320 entries (codewords). This yields a theoretical maximum of 320 × 320 = 102,400 possible quantized speech units. The quantized representations serve as targets for the contrastive learning objective.
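The mechanics of product quantization can be sketched in a few lines. The codeword dimension here is illustrative (the real size depends on the model configuration), and argmax selection stands in for the Gumbel-Softmax sampling used during training:

```python
import random

G, V, dim = 2, 320, 128   # codebooks, entries per codebook, codeword size (illustrative)

random.seed(0)
codebooks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(V)]
             for _ in range(G)]

def quantize(logits_per_group):
    """Pick one entry per codebook (argmax here stands in for the differentiable
    Gumbel-Softmax sample used in training) and concatenate the chosen codewords."""
    chosen = []
    for g, logits in enumerate(logits_per_group):
        idx = max(range(V), key=lambda i: logits[i])
        chosen.append(codebooks[g][idx])
    return [x for word in chosen for x in word]

logits = [[random.gauss(0, 1) for _ in range(V)] for _ in range(G)]
q = quantize(logits)
print(len(q))    # 256: two 128-dim codewords concatenated
print(V ** G)    # 102400 distinct quantized units in total
```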
During pre-training, Wav2Vec 2.0 masks spans of the latent feature encoder output before it is fed to the Transformer. The masking strategy selects starting positions with probability p = 0.065 and masks M = 10 consecutive time steps from each selected position. Approximately 50% of all time steps end up masked.
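A minimal sketch of this masking scheme shows why roughly half the time steps end up masked even though only 6.5% of positions are chosen as span starts (overlapping spans simply merge):

```python
import random

def sample_mask(num_steps, p=0.065, span=10, seed=0):
    """Choose span starts independently with probability p and mask `span`
    consecutive steps from each start; overlapping spans merge."""
    rng = random.Random(seed)
    masked = [False] * num_steps
    for t in range(num_steps):
        if rng.random() < p:
            for i in range(t, min(t + span, num_steps)):
                masked[i] = True
    return masked

mask = sample_mask(10_000)
coverage = sum(mask) / len(mask)
# Expected fraction masked: 1 - (1 - p)**span = 1 - 0.935**10, about 0.49
print(round(coverage, 2))
```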
The training objective combines two loss functions:

- a contrastive loss, which requires identifying the true quantized representation of each masked time step among a set of distractors sampled from other masked positions in the same utterance; and
- a diversity loss, which encourages the model to use all codebook entries equally by maximizing the entropy of the average codeword distribution.
Two pre-training data configurations were evaluated:

- LS-960: the 960 hours of audio from the LibriSpeech corpus, used without its transcriptions; and
- LL-60k: approximately 53,000 hours of speech from the LibriVox-derived Libri-Light dataset.
After pre-training, the model is fine-tuned for ASR by adding a randomly initialized linear projection on top of the Transformer output and training with the Connectionist Temporal Classification (CTC) loss. During fine-tuning, the quantization module is not used. The model can be fine-tuned on very small amounts of labeled data, and results were reported for 10 minutes, 1 hour, 10 hours, 100 hours, and 960 hours of transcribed LibriSpeech audio.
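Greedy CTC decoding itself is simple enough to sketch: take the argmax token for each 20 ms frame, merge consecutive repeats, and drop the blank symbol. The toy vocabulary below is illustrative:

```python
def ctc_greedy_decode(frame_ids, blank=0, id_to_char=None):
    """Collapse the per-frame argmax predictions from a CTC head:
    merge repeated ids, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    if id_to_char:
        return "".join(id_to_char[i] for i in out)
    return out

# Hypothetical per-frame argmax ids for nine 20 ms frames.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, blank=0, id_to_char=vocab))  # "cat"
```

The blank symbol is what lets CTC sidestep frame-level alignment: the model may emit the same character over several frames, or blanks between characters, and the collapse step recovers the label sequence either way.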
At inference time, decoded output can optionally be combined with a language model (4-gram or Transformer-based) to improve accuracy.
Wav2Vec 2.0 achieved striking results across different amounts of labeled training data. The following table summarizes key WER results on the LibriSpeech benchmark:
| Model | Pre-Training Data | Labeled Data | Test-Clean WER (%) | Test-Other WER (%) | Language Model |
|---|---|---|---|---|---|
| LARGE | LL-60k | 10 min | 4.8 | 8.2 | Transformer LM |
| LARGE | LL-60k | 10 min | 5.2 | 8.6 | None |
| LARGE | LL-60k | 1 hour | 2.7 | 5.2 | Transformer LM |
| LARGE | LL-60k | 1 hour | 3.9 | 7.6 | None |
| LARGE | LL-60k | 10 hours | 2.2 | 4.5 | Transformer LM |
| LARGE | LL-60k | 100 hours | 2.0 | 4.0 | Transformer LM |
| LARGE | LL-60k | 960 hours | 1.8 | 3.3 | Transformer LM |
| BASE | LS-960 | 10 min | 7.5 | 12.4 | Transformer LM |
| BASE | LS-960 | 960 hours | 2.7 | 5.4 | Transformer LM |
The most remarkable result was that the LARGE model, pre-trained on 53,000 hours of unlabeled audio and fine-tuned with only 10 minutes of labeled data, achieved a WER of 4.8% on test-clean and 8.2% on test-other with a Transformer language model. This demonstrated that self-supervised pre-training could reduce the need for labeled speech data by orders of magnitude.
When using all 960 hours of labeled LibriSpeech data, Wav2Vec 2.0 LARGE achieved 1.8% WER on test-clean and 3.3% on test-other, which was competitive with or superior to the best supervised systems available at the time.
A major contribution of the Wav2Vec research program was extending self-supervised speech representations to multilingual and cross-lingual settings, addressing the vast gap in ASR coverage across the world's languages.
XLSR-53 (Cross-Lingual Speech Representations for 53 languages) was introduced by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. The model extended Wav2Vec 2.0 to the multilingual domain by pre-training a single model on 56,000 hours of speech data across 53 languages drawn from the MLS, CommonVoice, and BABEL datasets.
XLSR-53 used product quantization shared across all languages, which encouraged the model to learn language-universal phonetic representations. The shared discrete units captured phonetic and prosodic structures common to many languages, enabling effective cross-lingual transfer. On the CommonVoice benchmark, XLSR-53 reduced phoneme error rates by 72% relative to the best known prior results. On BABEL, word error rate improved by 16% relative to comparable systems.
XLS-R (Cross-Lingual Speech Representations at Scale) was introduced by Arun Babu, Changhan Wang, Andros Tjandra, and colleagues in 2021. It scaled up the XLSR approach dramatically:
| Model | Parameters | Training Data | Languages |
|---|---|---|---|
| XLS-R 300M | 300 million | ~436,000 hours | 128 |
| XLS-R 1B | 1 billion | ~436,000 hours | 128 |
| XLS-R 2B | 2 billion | ~436,000 hours | 128 |
The training data came from multiple public sources, including BABEL, Multilingual LibriSpeech (MLS), CommonVoice, VoxPopuli, and VoxLingua107, totaling nearly 436,000 hours of speech. This represented roughly 10 times more data and 2.5 times more languages than XLSR-53.
XLS-R outperformed prior work on most of the 37 languages tested across BABEL, CommonVoice, MLS, and VoxPopuli benchmarks, reducing error rates by 14 to 34% relative on average. On the CoVoST-2 speech translation benchmark, the model improved BLEU scores by an average of 7.4 points over 21 translation directions into English. The model also set a new state of the art on the VoxLingua107 language identification benchmark.
A key finding was that model capacity matters significantly for multilingual representations. The 300M parameter model showed capacity dilution issues when representing 128 languages, but scaling to 1B and 2B parameters allowed the model to match or exceed monolingual English-only pre-training performance.
The Massively Multilingual Speech (MMS) project, published by Vineel Pratap, Andros Tjandra, Bowen Shi, and colleagues, pushed language coverage far beyond XLS-R. MMS expanded speech technology from approximately 100 languages to over 1,100 for speech recognition and text-to-speech, and over 4,000 for language identification.
| Capability | Number of Languages |
|---|---|
| Pre-trained wav2vec 2.0 representations | 1,406 |
| Automatic speech recognition (ASR) | 1,107 |
| Text-to-speech (TTS) | 1,107 |
| Language identification | 4,017 |
The project leveraged a novel training dataset based on readings of publicly available religious texts (primarily the New Testament), providing on average 32 hours of audio per language. By incorporating unlabeled recordings of various other religious readings, the team increased the number of languages for unsupervised pre-training to over 4,000.
On the FLEURS benchmark, the MMS multilingual ASR model more than halved the word error rate of Whisper on 54 languages while being trained on a substantially smaller fraction of labeled data. This result highlighted the effectiveness of self-supervised pre-training combined with broad language coverage.
The following table summarizes the key models in the Wav2Vec family:
| Model | Year | Authors | Architecture | Parameters | Pre-Training Data | Languages | Key Innovation |
|---|---|---|---|---|---|---|---|
| Wav2Vec 1.0 | 2019 | Schneider et al. | CNN encoder + CNN context | ~34M | LibriSpeech (960h) | English | Contrastive pre-training for speech |
| vq-wav2vec | 2019 | Baevski et al. | CNN + vector quantization | ~34M | LibriSpeech (960h) | English | Discrete speech tokens for NLP methods |
| Wav2Vec 2.0 BASE | 2020 | Baevski et al. | CNN encoder + Transformer | ~95M | LibriSpeech (960h) | English | Contrastive learning + masked prediction |
| Wav2Vec 2.0 LARGE | 2020 | Baevski et al. | CNN encoder + Transformer | ~317M | Libri-Light (53kh) | English | Scaled pre-training on unlabeled audio |
| XLSR-53 | 2020 | Conneau et al. | Wav2Vec 2.0 LARGE | ~317M | 56,000 hours | 53 | Cross-lingual shared quantization |
| XLS-R 300M | 2021 | Babu et al. | Wav2Vec 2.0 | 300M | ~436,000 hours | 128 | Massive multilingual scaling |
| XLS-R 1B | 2021 | Babu et al. | Wav2Vec 2.0 | 1B | ~436,000 hours | 128 | Billion-parameter speech model |
| XLS-R 2B | 2021 | Babu et al. | Wav2Vec 2.0 | 2B | ~436,000 hours | 128 | Largest self-supervised speech model |
| MMS | 2023 | Pratap et al. | Wav2Vec 2.0 | 1B | ~500,000 hours | 1,406 | 1,100+ language ASR and TTS |
Wav2Vec 2.0 exists within a broader ecosystem of self-supervised speech representation models. Several related models have built upon or diverged from the Wav2Vec approach.
HuBERT (Hidden Unit BERT), introduced by Wei-Ning Hsu and colleagues at Meta AI in 2021, follows the same encoder architecture as Wav2Vec 2.0 (7-layer CNN encoder plus Transformer) but uses a different pre-training objective. Instead of contrastive learning with quantized targets, HuBERT generates pseudo-labels through offline k-means clustering and trains with a masked prediction loss similar to BERT. This approach avoids the need for the Gumbel-Softmax quantization module, the diversity loss, and the careful temperature annealing schedule required by Wav2Vec 2.0.
HuBERT BASE and LARGE match the architectures of Wav2Vec 2.0 BASE and LARGE, respectively. An X-LARGE variant with approximately 1 billion parameters was also introduced. In ultra-low-resource settings with 10 minutes of labeled data, HuBERT LARGE achieved 4.7% WER on LibriSpeech test-clean and 7.6% on test-other, slightly outperforming Wav2Vec 2.0 LARGE (4.8% and 8.2%, respectively).
WavLM, introduced by Sanyuan Chen and colleagues at Microsoft in 2022, built upon the HuBERT framework with two key extensions. First, it introduced a gated relative position bias in the Transformer self-attention mechanism, improving performance on recognition tasks. Second, it used an utterance mixing training strategy that created overlapping speech samples during pre-training, helping the model learn to handle speaker overlap and noise.
WavLM was pre-trained on up to 94,000 hours of audio data. It achieved state-of-the-art results on the SUPERB benchmark, surpassing HuBERT Large on 14 out of 15 tasks. WavLM was particularly strong on non-ASR tasks like speaker verification and speaker diarization, reflecting its design emphasis on full-stack speech processing.
Whisper, released by OpenAI in September 2022, took a fundamentally different approach from the Wav2Vec family. Rather than self-supervised pre-training followed by fine-tuning, Whisper used a supervised encoder-decoder Transformer architecture trained on 680,000 hours of weakly labeled (web-crawled) multilingual audio data. The largest Whisper models (large, and the later Large V2 and V3 checkpoints) have approximately 1.55 billion parameters and support 99 languages.
Whisper generally achieves lower WER than Wav2Vec 2.0 across multiple languages, particularly in noisy environments. However, Whisper's reliance on massive labeled datasets makes it less adaptable to truly low-resource languages without extensive annotation. In contrast, Wav2Vec 2.0 and its multilingual descendants (XLS-R, MMS) excel precisely in low-resource scenarios where labeled data is minimal. The MMS model more than halved Whisper's word error rate on 54 FLEURS languages while using far less labeled training data.
| Model | Organization | Year | Approach | Pre-Training Data | Parameters | Languages |
|---|---|---|---|---|---|---|
| Wav2Vec 2.0 | Meta AI | 2020 | Self-supervised (contrastive + masked) | 53,000h unlabeled | 95M / 317M | English |
| HuBERT | Meta AI | 2021 | Self-supervised (masked prediction) | 60,000h unlabeled | 95M / 317M / 1B | English |
| WavLM | Microsoft | 2022 | Self-supervised (masked prediction + mixing) | 94,000h unlabeled | 95M / 317M | English |
| Whisper | OpenAI | 2022 | Supervised (encoder-decoder) | 680,000h labeled | 39M to 1.55B | 99 |
| XLS-R | Meta AI | 2021 | Self-supervised (contrastive + masked) | 436,000h unlabeled | 300M / 1B / 2B | 128 |
| MMS | Meta AI | 2023 | Self-supervised + fine-tuned | ~500,000h unlabeled | 1B | 1,406 |
The Wav2Vec model family has been applied to a wide range of speech processing tasks beyond standard ASR.
ASR remains the primary application of Wav2Vec models. The self-supervised pre-trained representations can be fine-tuned with CTC loss on small amounts of transcribed audio in any target language, making Wav2Vec 2.0 and its multilingual variants practical tools for building ASR systems in low-resource settings. Researchers have successfully fine-tuned Wav2Vec 2.0 and XLS-R for languages including Yoruba, Swahili, Mizo, and many others with limited digital resources.
Wav2Vec 2.0 embeddings have been widely adopted for speech emotion recognition (SER). The pre-trained representations capture prosodic, tonal, and spectral features that are informative for detecting emotions such as happiness, sadness, anger, and fear. Fine-tuned Wav2Vec 2.0 models have achieved strong results on benchmarks like IEMOCAP and other emotion datasets, often outperforming traditional hand-crafted feature approaches.
The learned representations also encode speaker-specific characteristics, making Wav2Vec 2.0 useful for speaker verification and identification tasks. Models fine-tuned on speaker verification datasets can distinguish between speakers based on their voice characteristics. The SUPERB benchmark evaluates these capabilities, and Wav2Vec 2.0 has shown competitive results on speaker-related tasks.
The multilingual Wav2Vec variants (XLSR-53, XLS-R, MMS) are well-suited for language identification. The MMS model can identify over 4,000 spoken languages, far exceeding the coverage of any previous system. XLS-R set state-of-the-art results on the VoxLingua107 language identification benchmark.
XLS-R demonstrated strong performance on speech translation tasks, particularly on the CoVoST-2 benchmark. The model's cross-lingual representations enabled it to improve BLEU scores substantially, with especially large gains on low-resource language directions such as Indonesian-to-English translation, where accuracy roughly doubled compared to prior work.
vq-wav2vec and Wav2Vec 2.0 have been used for phoneme recognition tasks, where the goal is to identify the sequence of phonemes in a speech utterance. This is useful for linguistic research, pronunciation assessment, and as a component of larger speech processing pipelines.
The Wav2Vec family of models has had a transformative impact on speech technology for low-resource languages. Before self-supervised speech models, building a functional ASR system for a language required hundreds or thousands of hours of transcribed audio, a resource available for only a handful of the world's languages.
Wav2Vec 2.0 demonstrated that competitive ASR performance could be achieved with as little as 10 minutes of labeled data when combined with large-scale unsupervised pre-training. The multilingual extensions (XLSR-53, XLS-R, MMS) further amplified this impact by sharing learned representations across languages, enabling positive cross-lingual transfer.
Specific examples of low-resource language improvements include:

- XLSR-53 reducing phoneme error rates on CommonVoice by 72% relative to the best prior results, with especially large gains for low-resource languages;
- XLS-R roughly doubling accuracy on Indonesian-to-English speech translation in CoVoST-2 compared to prior work; and
- MMS more than halving Whisper's word error rate on 54 FLEURS languages while extending ASR coverage to over 1,100 languages.
For many communities speaking under-resourced languages, the Wav2Vec family opened the door to practical speech technology for the first time. The approach is especially valuable for languages where the Bible or other religious texts provide one of the few substantial sources of aligned text and audio data.
Several technical innovations in the Wav2Vec family have had lasting influence on the field of speech processing.
Wav2Vec 1.0 introduced the idea of contrastive predictive coding for speech, where the model learns to distinguish true future frames from negative samples. Wav2Vec 2.0 refined this by applying the contrastive loss to masked (rather than future) positions, making the task more similar to BERT-style masked language modeling.
Wav2Vec 2.0 jointly learns the quantization codebook and the Transformer representations during pre-training. This eliminates the need for a separate quantization step (as in vq-wav2vec) and allows the discrete targets to co-adapt with the contextual representations.
Rather than masking the raw audio waveform, Wav2Vec 2.0 masks the output of the CNN feature encoder in latent space. This design choice means the CNN encoder always sees the complete audio input, while only the Transformer must reconstruct masked positions from context. This proved more effective than masking the raw input.
The combination of self-supervised pre-training with CTC-based fine-tuning on minimal labeled data was a key practical innovation. CTC allows the model to be trained without precise alignment between audio frames and text characters, simplifying the fine-tuning pipeline.
XLSR-53 introduced shared quantization across languages, where the same set of discrete speech units represented phonetic content across all training languages. This encouraged the model to discover language-universal speech representations and enabled strong cross-lingual transfer.
All major Wav2Vec models are open source and available through Meta's fairseq library on GitHub. The models are also integrated into the Hugging Face Transformers library, which provides convenient APIs for loading pre-trained models, running inference, and fine-tuning on custom datasets.
Key Hugging Face model identifiers include:
| Model | Hugging Face Identifier |
|---|---|
| Wav2Vec 2.0 BASE (960h) | facebook/wav2vec2-base-960h |
| Wav2Vec 2.0 LARGE (960h) | facebook/wav2vec2-large-960h |
| Wav2Vec 2.0 LARGE (LV-60k + 960h) | facebook/wav2vec2-large-960h-lv60-self |
| XLSR-53 | facebook/wav2vec2-large-xlsr-53 |
| XLS-R 300M | facebook/wav2vec2-xls-r-300m |
| XLS-R 1B | facebook/wav2vec2-xls-r-1b |
| XLS-R 2B | facebook/wav2vec2-xls-r-2b |
All models expect 16 kHz sampled single-channel audio as input. The Hugging Face Wav2Vec2Processor handles audio preprocessing, and Wav2Vec2ForCTC provides the CTC fine-tuning head for ASR tasks.
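A hedged sketch of typical usage follows. The helper at the top only illustrates the preprocessing contract in pure Python; the `transcribe` function shows the usual Transformers calls but is defined without being executed here, since it downloads a sizeable checkpoint, and it assumes the `soundfile` package as one way to read audio:

```python
def to_mono_16k_hint(waveform_channels, source_rate):
    """Minimal illustration of the input contract: Wav2Vec 2.0 checkpoints expect
    single-channel float audio at 16 kHz. This helper only downmixes to mono and
    flags when resampling is still needed (use torchaudio or librosa for that)."""
    n = len(waveform_channels[0])
    mono = [sum(ch[i] for ch in waveform_channels) / len(waveform_channels)
            for i in range(n)]
    needs_resample = source_rate != 16_000
    return mono, needs_resample

def transcribe(path):
    """Typical Hugging Face inference (not run here: downloads a large model).
    The class names and model identifier are the real ones from the table above."""
    import torch
    import soundfile as sf
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    speech, rate = sf.read(path)            # must already be 16 kHz mono
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits     # shape: (batch, frames, vocab)
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```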
The Wav2Vec series has had a profound impact on the speech processing community. It demonstrated convincingly that self-supervised learning could achieve results comparable to or better than fully supervised systems, particularly in low-resource settings. The core ideas of masking latent speech features, applying contrastive objectives, and fine-tuning with CTC have been adopted and extended by numerous subsequent models.
HuBERT, WavLM, data2vec, and other self-supervised speech models all build upon foundations laid by the Wav2Vec line of research. The multilingual extensions (XLSR-53, XLS-R, MMS) have significantly advanced the goal of universal speech technology that works across the full diversity of human languages.
As of 2025, Wav2Vec 2.0 remains one of the most widely used self-supervised speech models in both research and production, and its architecture continues to serve as a baseline for new developments in speech representation learning.