# Wav2Vec

> Source: https://aiwiki.ai/wiki/wav2vec
> Updated: 2026-06-22
> Categories: Deep Learning, Meta AI, Natural Language Processing, Speech & Audio AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Wav2Vec is a family of [self-supervised learning](/wiki/self_supervised_learning) models from [Meta AI](/wiki/meta_ai) (formerly Facebook AI Research) that learn speech representations directly from raw audio waveforms, then fine-tune for [automatic speech recognition](/wiki/automatic_speech_recognition_models) (ASR) with very little transcribed data. The flagship release, Wav2Vec 2.0 (2020), showed that a model pre-trained on 53,000 hours of unlabeled speech and fine-tuned on just 10 minutes of transcribed audio could reach a 4.8% word error rate (WER) on the LibriSpeech test-clean benchmark, demonstrating that competitive [speech recognition](/wiki/speech_recognition) no longer required thousands of hours of human transcription.[3] The Wav2Vec 2.0 authors reported that the work showed "for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler."[3]

Introduced progressively between 2019 and 2023, the Wav2Vec lineage spans the original Wav2Vec (2019), vq-wav2vec (2019), Wav2Vec 2.0 (2020), XLSR-53 (2020), XLS-R (2021), and MMS (2023). Successive releases added quantized targets, a Transformer context network, and multilingual pre-training at growing scale, and the family became foundational in ASR, particularly for low-resource languages where transcribed data is scarce.[3]

## What problem was Wav2Vec built to solve?

Before the Wav2Vec series, state-of-the-art speech recognition systems relied heavily on large volumes of transcribed audio. Collecting and annotating speech data is both time-consuming and costly, and for the vast majority of the world's roughly 7,000 languages, sufficient labeled corpora simply do not exist. Traditional ASR pipelines used hand-crafted acoustic features such as mel-frequency cepstral coefficients (MFCCs) or log-mel spectrograms and required extensive supervised training.

The success of unsupervised and self-supervised pre-training in [natural language processing](/wiki/natural_language_processing), exemplified by models like [BERT](/wiki/bert) and [GPT](/wiki/gpt), inspired researchers to explore analogous strategies for speech. The central question driving the Wav2Vec project was straightforward: could models learn useful speech representations from raw waveforms alone, without any transcriptions, and then be fine-tuned with minimal labeled data to achieve strong recognition performance?

## Wav2Vec 1.0 (2019)

The original Wav2Vec model was introduced by Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli in the paper "wav2vec: Unsupervised [Pre-training](/wiki/pre-training) for Speech Recognition," presented at Interspeech 2019.[1] It was the first model to apply unsupervised pre-training to speech recognition using a fully [convolutional neural network](/wiki/convolutional_neural_network) architecture.[1]

### Architecture

Wav2Vec 1.0 consisted of two cascaded convolutional networks:

- **Feature Encoder (Extractor):** A 5-layer CNN that processed raw audio waveforms. The layers used kernel sizes of (10, 8, 4, 4, 4) with strides of (5, 4, 2, 2, 2), producing a total stride of 160 samples. Each layer had 512 channels, followed by group normalization and a [ReLU](/wiki/relu) nonlinearity. The encoder covered approximately 30 milliseconds of audio per output frame.[1]
- **Context Network (Aggregator):** A 9-layer CNN with kernel size 3 and stride 1, providing a total receptive field of roughly 210 milliseconds. The context network combined encoder outputs into higher-level latent representations that captured semantic relationships among neighboring audio frames.[1]

The base model used approximately 34 million parameters. A larger variant, "wav2vec large," included two additional linear transformations in the encoder and a 12-layer context network with increasing kernel sizes (2, 3, ..., 13).[1]

### Training Objective

Wav2Vec 1.0 was trained using a contrastive loss based on noise-contrastive estimation. The model learned to distinguish a true future audio frame from a set of negative (distractor) samples drawn from the same sequence. This approach encouraged the context network to capture information predictive of upcoming audio frames.[1]
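The idea behind this objective can be sketched in a few lines. The toy function below scores one context vector against a true future frame and a few distractors with a logistic loss, which is the general shape of a noise-contrastive objective. The name `step_contrastive_loss`, the 2-D vectors, and the omission of any learned projection are simplifications for illustration, not the paper's implementation.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def step_contrastive_loss(context, true_future, distractors):
    """Binary contrastive loss for one time step.

    The true future frame should score high against the context vector,
    and each distractor should score low.
    """
    positive = log_sigmoid(dot(context, true_future))
    negatives = sum(log_sigmoid(-dot(context, d)) for d in distractors)
    return -(positive + negatives)

# Hypothetical toy vectors: the same context paired with a matching
# and a mismatching "true" frame.
context = [1.0, 0.5]
aligned = step_contrastive_loss(context, [2.0, 1.0], [[-1.0, 0.0], [0.0, -2.0]])
confused = step_contrastive_loss(context, [-1.0, 0.0], [[2.0, 1.0], [0.0, -2.0]])
print(aligned < confused)  # True: the loss is lower when the true frame matches
```

Minimizing this loss pushes the context representation toward frames that actually follow it and away from randomly drawn ones.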

### Results

Pre-trained on the LibriSpeech dataset and evaluated on the Wall Street Journal (WSJ) corpus, Wav2Vec 1.0 achieved a WER of 2.43% on the nov92 test set.[1] This outperformed Deep Speech 2, the best character-based system reported at the time, while using two orders of magnitude less labeled training data. When only a few hours of transcribed data were available, the pre-trained representations reduced WER by up to 36% over a strong character-based log-mel filterbank baseline.[1]

## vq-wav2vec (2019)

The vq-wav2vec model, introduced by Alexei Baevski, Steffen Schneider, and Michael Auli and presented at ICLR 2020, extended the Wav2Vec approach by learning discrete (quantized) speech representations.[2] This was a pivotal step because it bridged the gap between continuous audio signals and discrete-input NLP algorithms.

### Vector Quantization Methods

vq-wav2vec applied [vector quantization](/wiki/vector_quantization) to the dense representations produced by the Wav2Vec encoder. Two quantization strategies were explored:

- **Gumbel-[Softmax](/wiki/softmax):** A differentiable relaxation of the discrete sampling process that allowed end-to-end training with backpropagation.[2]
- **Online k-Means [Clustering](/wiki/clustering):** A clustering approach that assigned each representation to its nearest codebook entry during the forward pass.[2]

Both methods converted the continuous encoder outputs into sequences of discrete tokens, each drawn from a learned codebook.
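As a rough illustration of the k-means variant, the sketch below assigns each dense frame to its nearest codebook entry and returns the resulting token ids. The codebook values and the helper names `nearest_code` and `quantize` are hypothetical. A real codebook is learned during training, and the Gumbel-Softmax variant selects entries differently.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def quantize(frames, codebook):
    """Map a sequence of dense frames to a sequence of discrete token ids."""
    return [nearest_code(frame, codebook) for frame in frames]

# Toy 2-D codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]
tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
```

The output is an ordinary sequence of integer ids, which is what allows token-based NLP models to consume speech.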

### Enabling NLP-Style Pre-Training on Speech

The key innovation of vq-wav2vec was that, once speech had been converted to discrete tokens, standard NLP pre-training methods could be applied directly. The authors demonstrated that running [BERT](/wiki/bert)-style masked language model pre-training on the discrete speech tokens yielded strong improvements on downstream tasks. This two-stage pipeline (vq-wav2vec quantization followed by BERT pre-training) achieved state-of-the-art results on TIMIT phoneme classification and WSJ speech recognition at the time of publication.[2]

## How does Wav2Vec 2.0 work?

Wav2Vec 2.0, introduced by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli at [NeurIPS](/wiki/neurips) 2020, unified and substantially improved upon the ideas in Wav2Vec 1.0 and vq-wav2vec.[3] It is the most widely cited and adopted model in the family, combining contrastive learning with masked prediction in a single end-to-end framework. As the paper describes it, "wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned."[3]

### Architecture

Wav2Vec 2.0 consists of three main components:

**1. Convolutional Feature Encoder**

The feature encoder is a 7-layer temporal CNN that processes raw 16 kHz audio waveforms. All seven layers use 512 channels, and each layer is followed by layer normalization and a GELU activation function. The specific kernel widths and strides for the seven layers are:

| Layer | Kernel Width | Stride |
|-------|-------------|--------|
| 1 | 10 | 5 |
| 2 | 3 | 2 |
| 3 | 3 | 2 |
| 4 | 3 | 2 |
| 5 | 3 | 2 |
| 6 | 2 | 2 |
| 7 | 2 | 2 |

The total stride of the encoder is 320 samples, producing one output vector every 20 milliseconds. The receptive field is 400 input samples, corresponding to 25 milliseconds of audio at 16 kHz.[3]
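Both figures follow from the kernel widths and strides in the table, and the arithmetic can be checked with a short script. The function names are illustrative, and the receptive-field formula assumes unpadded convolutions.

```python
def total_stride(strides):
    """Product of all layer strides: input samples between consecutive output frames."""
    product = 1
    for s in strides:
        product *= s
    return product

def receptive_field(kernels, strides):
    """Number of input samples that influence one output frame."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump  # each layer widens the field by (k - 1) steps at its input spacing
        jump *= s                # spacing between adjacent positions, in input samples
    return field

SAMPLE_RATE = 16_000
kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)

stride = total_stride(strides)
field = receptive_field(kernels, strides)
print(stride, 1000 * stride / SAMPLE_RATE)  # 320 20.0
print(field, 1000 * field / SAMPLE_RATE)    # 400 25.0
```

Applying the same functions to the Wav2Vec 1.0 encoder, with kernels (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2), gives a stride of 160 samples and a receptive field of 465 samples, or about 29 ms, consistent with the roughly 30 ms stated for that model.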

**2. [Transformer](/wiki/transformer) Context Network**

The output of the CNN encoder (512-dimensional vectors) passes through a linear feature projection layer that maps it to the Transformer hidden size (768 for BASE, 1,024 for LARGE) before it enters a [Transformer](/wiki/transformer) encoder. The Transformer uses a convolutional layer for relative positional embeddings instead of fixed sinusoidal positional encodings.[3] Two model configurations were released:

| Configuration | Transformer Layers | Hidden Size | Attention Heads | FFN Size | Parameters |
|--------------|-------------------|-------------|-----------------|----------|------------|
| BASE | 12 | 768 | 8 | 3,072 | ~95M |
| LARGE | 24 | 1,024 | 16 | 4,096 | ~317M |
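A back-of-the-envelope count shows where most of these parameters sit. The sketch below counts only the large weight matrices: the four attention projections and two feed-forward matrices per Transformer layer, plus the convolution kernels of the seven-layer feature encoder. It deliberately ignores biases, normalization layers, the positional convolution, the feature projection, and the quantizer, so it undercounts the published totals.

```python
def transformer_weights(layers, hidden, ffn):
    """Weight matrices only: Q, K, V, output projections plus the two FFN matrices."""
    attention = 4 * hidden * hidden
    feed_forward = 2 * hidden * ffn
    return layers * (attention + feed_forward)

def cnn_encoder_weights(kernels, channels=512, in_channels=1):
    """Convolution kernel weights of the feature encoder (no biases or norms)."""
    total = 0
    for k in kernels:
        total += in_channels * channels * k
        in_channels = channels
    return total

kernels = (10, 3, 3, 3, 3, 2, 2)
cnn = cnn_encoder_weights(kernels)
base = transformer_weights(12, 768, 3072) + cnn
large = transformer_weights(24, 1024, 4096) + cnn
print(f"BASE  ~{base / 1e6:.1f}M")   # BASE  ~89.1M
print(f"LARGE ~{large / 1e6:.1f}M")  # LARGE ~306.2M
```

The estimates land within a few percent of the ~95M and ~317M figures in the table, which suggests the Transformer stack accounts for the large majority of each model.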

**3. [Quantization](/wiki/quantization) Module**

The quantization module discretizes the CNN encoder outputs using product quantization with a Gumbel-Softmax distribution. The default configuration uses G = 2 codebooks, each containing V = 320 entries (codewords). This yields a theoretical maximum of 320 x 320 = 102,400 possible quantized speech units. The quantized representations serve as targets for the contrastive learning objective.[3]
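The count of possible units and the way product quantization combines codebooks can be illustrated as follows. The toy codebook values and the helper names are hypothetical, and the sketch leaves out both the Gumbel-Softmax selection and any projection applied after concatenation.

```python
G, V = 2, 320  # codebooks and entries per codebook, as in the default configuration

def num_quantized_units(groups, entries):
    """Every combination of one entry per codebook is a distinct unit."""
    return entries ** groups

def product_quantize(choices, codebooks):
    """Concatenate one chosen entry per codebook into a single quantized vector."""
    vector = []
    for group, index in enumerate(choices):
        vector.extend(codebooks[group][index])
    return vector

print(num_quantized_units(G, V))  # 102400

# Toy codebooks with 2-D entries (hypothetical values; real entries are learned).
toy_codebooks = [
    [[0.0, 0.1], [1.0, 1.1]],  # codebook 0
    [[5.0, 5.1], [6.0, 6.1]],  # codebook 1
]
print(product_quantize([1, 0], toy_codebooks))  # [1.0, 1.1, 5.0, 5.1]
```

Splitting the representation across two small codebooks keeps the number of learned entries at 640 while still allowing 102,400 distinct combinations.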

### Pre-Training Procedure

During pre-training, Wav2Vec 2.0 masks spans of the latent feature encoder output before it is fed to the Transformer. The masking strategy selects starting positions with probability p = 0.065 and masks M = 10 consecutive time steps from each selected position. Because spans may overlap, approximately 49% of all time steps end up masked.[3]
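The masked share can be reproduced with a small simulation. The sketch below treats every time step as an independent candidate start with probability p, which is a simplifying assumption of this example. The function name `sample_mask` and the seed are illustrative.

```python
import random

def sample_mask(num_steps, p=0.065, span=10, seed=0):
    """Return a boolean mask where each sampled start masks `span` consecutive steps."""
    rng = random.Random(seed)
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < p:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True  # spans may overlap
    return mask

mask = sample_mask(100_000)
simulated = sum(mask) / len(mask)
# A step is masked if at least one of the 10 starts that can cover it was selected.
expected = 1 - (1 - 0.065) ** 10
print(round(expected, 3))  # 0.489
print(abs(simulated - expected) < 0.03)
```

Under this assumption the analytical value is about 49%, so just under half of the time steps are hidden from the Transformer on average.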

The training objective combines two loss functions:

- **Contrastive Loss:** The model must identify the true quantized representation for a masked time step from a set of 100 negative distractors sampled uniformly from the same utterance. Similarity is measured using cosine similarity.[3]
- **Diversity Loss:** An entropy-based regularization term that encourages the model to use all entries in the codebooks equally, preventing codebook collapse where only a small subset of entries would be selected.[3]
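Both terms can be sketched with plain Python. The contrastive term below is a softmax cross-entropy over cosine similarities between a context vector and the candidate quantized vectors, and the entropy helper shows the quantity the diversity term tries to keep high. The temperature of 0.1, the toy vectors, and the function names are assumptions of this sketch rather than a reproduction of the training code.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Cross-entropy of picking `target` among target + distractors by cosine similarity."""
    logits = [cosine(context, q) / temperature for q in [target] + distractors]
    peak = max(logits)  # subtract the maximum for a stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]

def codebook_entropy(usage):
    """Entropy of average codebook usage; largest when all entries are used equally."""
    return -sum(p * math.log(p) for p in usage if p > 0)

context = [1.0, 0.0]
easy = contrastive_loss(context, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
hard = contrastive_loss(context, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(easy < hard)  # True: a target aligned with the context gives a lower loss
print(codebook_entropy([0.25] * 4) > codebook_entropy([0.97, 0.01, 0.01, 0.01]))  # True
```

A collapsed codebook, where one entry dominates, has low entropy, which is the situation the diversity term penalizes.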

Two pre-training data configurations were evaluated:

- **LS-960:** 960 hours of audio from LibriSpeech (clean audiobook recordings)
- **LV-60k:** Approximately 53,200 hours of LibriVox audiobook audio, prepared following the Libri-Light preprocessing

### Fine-Tuning

After pre-training, the model is fine-tuned for ASR by adding a randomly initialized linear projection on top of the Transformer output and training with the Connectionist Temporal Classification ([CTC](/wiki/connectionist_temporal_classification)) loss. During fine-tuning, the quantization module is not used. The model can be fine-tuned on very small amounts of labeled data, and results were reported for 10 minutes, 1 hour, 10 hours, 100 hours, and 960 hours of transcribed LibriSpeech audio.[3]

At inference time, decoded output can optionally be combined with a language model (4-gram or Transformer-based) to improve accuracy.
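Without a language model, the simplest way to turn CTC outputs into text is greedy decoding: take the most likely symbol per frame, collapse consecutive repeats, and drop the blank symbol. The sketch below assumes the per-frame argmax has already been computed, and the vocabulary and blank id are hypothetical.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then drop blanks, to turn per-frame predictions into text."""
    chars = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)

# Hypothetical vocabulary; id 0 is the CTC blank.
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 0]
print(ctc_greedy_decode(frames, vocab))  # hello
```

The blank between the two runs of `3` is what lets CTC emit a doubled letter. Without it, the repeats would collapse into a single "l".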

### What WER did Wav2Vec 2.0 achieve on LibriSpeech?

Wav2Vec 2.0 achieved striking results across different amounts of labeled training data. The following table summarizes key WER results on the LibriSpeech benchmark:

| Model | Pre-Training Data | Labeled Data | Test-Clean WER (%) | Test-Other WER (%) | Language Model |
|-------|------------------|-------------|--------------------|--------------------|----------------|
| LARGE | LV-60k | 10 min | 4.8 | 8.2 | Transformer LM |
| LARGE | LV-60k | 10 min | 5.2 | 8.6 | None |
| LARGE | LV-60k | 1 hour | 2.7 | 5.2 | Transformer LM |
| LARGE | LV-60k | 1 hour | 3.9 | 7.6 | None |
| LARGE | LV-60k | 10 hours | 2.2 | 4.5 | Transformer LM |
| LARGE | LV-60k | 100 hours | 2.0 | 4.0 | Transformer LM |
| LARGE | LV-60k | 960 hours | 1.8 | 3.3 | Transformer LM |
| BASE | LS-960 | 10 min | 7.5 | 12.4 | Transformer LM |
| BASE | LS-960 | 960 hours | 2.7 | 5.4 | Transformer LM |

The most remarkable result was that the LARGE model, pre-trained on 53,000 hours of unlabeled audio and fine-tuned with only 10 minutes of labeled data, achieved a WER of 4.8% on test-clean and 8.2% on test-other with a Transformer language model.[3] In the authors' words, "Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER," a result that "demonstrates the feasibility of speech recognition with limited amounts of labeled data."[3] When the labeled budget was lowered to one hour, Wav2Vec 2.0 outperformed the previous state of the art on the 100 hour subset while using 100 times less labeled data.[3]

When using all 960 hours of labeled LibriSpeech data, Wav2Vec 2.0 LARGE achieved 1.8% WER on test-clean and 3.3% on test-other, which was competitive with or superior to the best supervised systems available at the time.[3]
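For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. A minimal sketch, with whitespace tokenization and no text normalization (both simplifications of what benchmark scoring tools do):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.1f}%")  # 16.7%
```

A WER of 1.8% therefore corresponds to fewer than two word-level errors per hundred reference words.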

## Multilingual Extensions

A major contribution of the Wav2Vec research program was extending self-supervised speech representations to multilingual and cross-lingual settings, addressing the vast gap in ASR coverage across the world's languages.

### XLSR-53 (2020)

XLSR-53 (Cross-Lingual Speech Representations for 53 languages) was introduced by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli.[4] The model extended Wav2Vec 2.0 to the multilingual domain by pre-training a single model on 56,000 hours of speech data across 53 languages drawn from the MLS, CommonVoice, and BABEL datasets.[4]

XLSR-53 used product quantization shared across all languages, which encouraged the model to learn language-universal phonetic representations. The shared discrete units captured phonetic and prosodic structures common to many languages, enabling effective cross-lingual transfer. On the CommonVoice benchmark, XLSR-53 reduced phoneme error rates by 72% relative to the best known prior results. On BABEL, word error rate improved by 16% relative to comparable systems.[4]

### XLS-R (2021)

XLS-R (Cross-Lingual Speech Representations at Scale) was introduced by Arun Babu, Changhan Wang, Andros Tjandra, and colleagues in 2021.[5] It scaled up the XLSR approach dramatically:

| Model | Parameters | Training Data | Languages |
|-------|-----------|--------------|----------|
| XLS-R 300M | 300 million | ~436,000 hours | 128 |
| XLS-R 1B | 1 billion | ~436,000 hours | 128 |
| XLS-R 2B | 2 billion | ~436,000 hours | 128 |

The training data came from multiple public sources, including BABEL, Multilingual LibriSpeech (MLS), CommonVoice, VoxPopuli, and VoxLingua107, totaling nearly 436,000 hours of speech. This represented roughly 8 times more data and 2.4 times more languages than XLSR-53 (436,000 versus 56,000 hours, 128 versus 53 languages).[5]

XLS-R outperformed prior work on most of the 37 languages tested across BABEL, CommonVoice, MLS, and VoxPopuli benchmarks, reducing error rates by 14 to 34% relative on average. On the CoVoST-2 speech translation benchmark, the model improved [BLEU](/wiki/bleu_bilingual_evaluation_understudy) scores by an average of 7.4 points over 21 translation directions into English. The model also set a new state of the art on the VoxLingua107 language identification benchmark.[5]

A key finding was that model capacity matters significantly for multilingual representations. The 300M parameter model showed capacity dilution issues when representing 128 languages, but scaling to 1B and 2B parameters allowed the model to match or exceed monolingual English-only pre-training performance.[5]

### MMS: Massively Multilingual Speech (2023)

The Massively Multilingual Speech (MMS) project, published by Vineel Pratap, Andros Tjandra, Bowen Shi, and colleagues, pushed language coverage far beyond XLS-R.[6] MMS expanded speech technology from approximately 100 languages to over 1,100 for speech recognition and text-to-speech, and over 4,000 for language identification.[6]

| Capability | Number of Languages |
|-----------|--------------------|
| Pre-trained wav2vec 2.0 representations | 1,406 |
| Automatic speech recognition (ASR) | 1,107 |
| Text-to-speech (TTS) | 1,107 |
| Language identification | 4,017 |

The project relied on a new training dataset based on readings of publicly available religious texts (primarily the New Testament), providing on average 32 hours of audio per language. By also using unlabeled recordings of various other religious readings, the team increased the number of languages with available audio to over 4,000, which supports the language identification model.[6]

On the FLEURS benchmark, the MMS multilingual ASR model more than halved the word error rate of [Whisper](/wiki/whisper) on 54 languages, a relative reduction of roughly 58%, while supporting more than 11 times as many languages and being trained on a small fraction of the labeled data.[6] This result highlighted the effectiveness of self-supervised pre-training combined with broad language coverage.

## Wav2Vec Model Family Overview

The following table summarizes the key models in the Wav2Vec family:

| Model | Year | Authors | Architecture | Parameters | Pre-Training Data | Languages | Key Innovation |
|-------|------|---------|-------------|-----------|------------------|-----------|----------------|
| Wav2Vec 1.0 | 2019 | Schneider et al. | CNN encoder + CNN context | ~34M | LibriSpeech (960h) | English | Contrastive pre-training for speech |
| vq-wav2vec | 2019 | Baevski et al. | CNN + vector quantization | ~34M | LibriSpeech (960h) | English | Discrete speech tokens for NLP methods |
| Wav2Vec 2.0 BASE | 2020 | Baevski et al. | CNN encoder + Transformer | ~95M | LibriSpeech (960h) | English | Contrastive learning + masked prediction |
| Wav2Vec 2.0 LARGE | 2020 | Baevski et al. | CNN encoder + Transformer | ~317M | Libri-Light (53kh) | English | Scaled pre-training on unlabeled audio |
| XLSR-53 | 2020 | Conneau et al. | Wav2Vec 2.0 LARGE | ~317M | 56,000 hours | 53 | Cross-lingual shared quantization |
| XLS-R 300M | 2021 | Babu et al. | Wav2Vec 2.0 | 300M | ~436,000 hours | 128 | Massive multilingual scaling |
| XLS-R 1B | 2021 | Babu et al. | Wav2Vec 2.0 | 1B | ~436,000 hours | 128 | Billion-parameter speech model |
| XLS-R 2B | 2021 | Babu et al. | Wav2Vec 2.0 | 2B | ~436,000 hours | 128 | Largest self-supervised speech model |
| MMS | 2023 | Pratap et al. | Wav2Vec 2.0 | 1B | ~500,000 hours | 1,406 | 1,100+ language ASR and TTS |

## How does Wav2Vec compare to HuBERT, WavLM, and Whisper?

Wav2Vec 2.0 exists within a broader ecosystem of self-supervised speech representation models. Several related models have built upon or diverged from the Wav2Vec approach.

### HuBERT

[HuBERT](/wiki/hubert) (Hidden Unit BERT), introduced by Wei-Ning Hsu and colleagues at Meta AI in 2021, follows the same encoder architecture as Wav2Vec 2.0 (7-layer CNN encoder plus Transformer) but uses a different pre-training objective.[7] Instead of contrastive learning with quantized targets, HuBERT generates pseudo-labels through offline k-means clustering and trains with a masked prediction loss similar to BERT. This approach avoids the need for the Gumbel-Softmax quantization module, the diversity loss, and the careful temperature annealing schedule required by Wav2Vec 2.0.[7]

HuBERT BASE and LARGE match the architectures of Wav2Vec 2.0 BASE and LARGE, respectively. An X-LARGE variant with approximately 1 billion parameters was also introduced. In ultra-low-resource settings with 10 minutes of labeled data, HuBERT LARGE achieved 4.7% WER on LibriSpeech test-clean and 7.6% on test-other, slightly outperforming Wav2Vec 2.0 LARGE (4.8% and 8.2%, respectively), with both systems decoded using a Transformer language model.[7]

### WavLM

WavLM, introduced by Sanyuan Chen and colleagues at [Microsoft](/wiki/microsoft) in 2022, built upon the HuBERT framework with two key extensions.[8] First, it introduced a gated relative position bias in the Transformer self-attention mechanism, improving performance on recognition tasks. Second, it used an utterance mixing training strategy that created overlapping speech samples during pre-training, helping the model learn to handle speaker overlap and noise.[8]

WavLM was pre-trained on up to 94,000 hours of audio data. It achieved state-of-the-art results on the SUPERB benchmark, surpassing HuBERT Large on 14 out of 15 tasks. WavLM was particularly strong on non-ASR tasks like speaker verification and speaker diarization, reflecting its design emphasis on full-stack speech processing.[8]

### Whisper

[Whisper](/wiki/whisper), released by [OpenAI](/wiki/openai) in September 2022, took a fundamentally different approach from the Wav2Vec family. Rather than self-supervised pre-training followed by fine-tuning, Whisper used a supervised encoder-decoder [Transformer](/wiki/transformer) architecture trained on 680,000 hours of weakly labeled (web-crawled) multilingual audio data.[9] The largest Whisper model (Large V3) has approximately 1.55 billion parameters and supports over 99 languages.[9]

Whisper generally achieves lower WER than Wav2Vec 2.0 across multiple languages, particularly in noisy environments. However, Whisper's reliance on massive labeled datasets makes it less adaptable to truly low-resource languages without extensive annotation. In contrast, Wav2Vec 2.0 and its multilingual descendants (XLS-R, MMS) excel precisely in low-resource scenarios where labeled data is minimal. The MMS model more than halved Whisper's word error rate on 54 FLEURS languages while using far less labeled training data.[6]

### Comparison Table

| Model | Organization | Year | Approach | Pre-Training Data | Parameters | Languages |
|-------|-------------|------|----------|------------------|-----------|----------|
| Wav2Vec 2.0 | Meta AI | 2020 | Self-supervised (contrastive + masked) | 53,000h unlabeled | 95M / 317M | English |
| HuBERT | Meta AI | 2021 | Self-supervised (masked prediction) | 60,000h unlabeled | 95M / 317M / 1B | English |
| WavLM | Microsoft | 2022 | Self-supervised (masked prediction + mixing) | 94,000h unlabeled | 95M / 317M | English |
| Whisper | OpenAI | 2022 | Supervised (encoder-decoder) | 680,000h labeled | 39M to 1.55B | 99+ |
| XLS-R | Meta AI | 2021 | Self-supervised (contrastive + masked) | 436,000h unlabeled | 300M / 1B / 2B | 128 |
| MMS | Meta AI | 2023 | Self-supervised + fine-tuned | ~500,000h unlabeled | 1B | 1,406 |

## What is Wav2Vec used for?

The Wav2Vec model family has been applied to a wide range of speech processing tasks beyond standard ASR.

### Automatic Speech Recognition

ASR remains the primary application of Wav2Vec models. The self-supervised pre-trained representations can be fine-tuned with CTC loss on small amounts of transcribed audio in any target language, making Wav2Vec 2.0 and its multilingual variants practical tools for building ASR systems in low-resource settings. Researchers have successfully fine-tuned Wav2Vec 2.0 and XLS-R for languages including Yoruba, Swahili, Mizo, and many others with limited digital resources.

### Speech Emotion Recognition

Wav2Vec 2.0 embeddings have been widely adopted for [speech emotion recognition](/wiki/speech_recognition) (SER). The pre-trained representations capture prosodic, tonal, and spectral features that are informative for detecting emotions such as happiness, sadness, anger, and fear. Fine-tuned Wav2Vec 2.0 models have achieved strong results on benchmarks like IEMOCAP and other emotion datasets, often outperforming traditional hand-crafted feature approaches.

### Speaker Verification and Identification

The learned representations also encode speaker-specific characteristics, making Wav2Vec 2.0 useful for [speaker verification](/wiki/speaker_verification) and identification tasks. Models fine-tuned on speaker verification datasets can distinguish between speakers based on their voice characteristics. The SUPERB benchmark evaluates these capabilities, and Wav2Vec 2.0 has shown competitive results on speaker-related tasks.

### Language Identification

The multilingual Wav2Vec variants (XLSR-53, XLS-R, MMS) are well-suited for language identification. The MMS model can identify over 4,000 spoken languages, far exceeding the coverage of any previous system.[6] XLS-R set state-of-the-art results on the VoxLingua107 language identification benchmark.[5]

### Speech Translation

XLS-R demonstrated strong performance on speech translation tasks, particularly on the CoVoST-2 benchmark. The model's cross-lingual representations enabled it to improve BLEU scores substantially, with especially large gains on low-resource language directions such as Indonesian-to-English translation, where accuracy roughly doubled compared to prior work.[5]

### Phoneme Recognition

vq-wav2vec and Wav2Vec 2.0 have been used for phoneme recognition tasks, where the goal is to identify the sequence of phonemes in a speech utterance. This is useful for linguistic research, pronunciation assessment, and as a component of larger speech processing pipelines.

## Impact on Low-Resource Language ASR

The Wav2Vec family of models has had a transformative impact on speech technology for low-resource languages. Before self-supervised speech models, building a functional ASR system for a language required hundreds or thousands of hours of transcribed audio, a resource available for only a handful of the world's languages.

Wav2Vec 2.0 demonstrated that competitive ASR performance could be achieved with as little as 10 minutes of labeled data when combined with large-scale unsupervised pre-training.[3] The multilingual extensions (XLSR-53, XLS-R, MMS) further amplified this impact by sharing learned representations across languages, enabling positive cross-lingual transfer.

Specific examples of low-resource language improvements include:

- On endangered language datasets, Wav2Vec 2.0 representations (English or XLSR-53) offered relative improvements of 56 to 86% over prior state-of-the-art approaches.
- For Swahili, continued pre-training of Wav2Vec 2.0 with just 20,000 labeled samples achieved 3.24% WER on CommonVoice, an 82% relative improvement over the baseline.
- For Mizo, a low-resource language of Northeast India, researchers leveraged XLS-R to achieve strong ASR accuracy that would have been impractical without multilingual self-supervised pre-training.
- MMS expanded ASR coverage to over 1,100 languages, many of which had no prior ASR system.[6]

For many communities speaking under-resourced languages, the Wav2Vec family opened the door to practical speech technology for the first time. The approach is especially valuable for languages where the Bible or other religious texts provide one of the few substantial sources of aligned text and audio data.

## Technical Innovations

Several technical innovations in the Wav2Vec family have had lasting influence on the field of speech processing.

### Contrastive Learning on Latent Representations

Wav2Vec 1.0 applied a contrastive objective in the style of contrastive predictive coding to speech recognition pre-training, where the model learns to distinguish true future frames from negative samples.[1] Wav2Vec 2.0 refined this by applying the contrastive loss to masked (rather than future) positions, making the task more similar to BERT-style masked language modeling.[3]

### Joint Quantization and Representation Learning

Wav2Vec 2.0 jointly learns the quantization codebook and the Transformer representations during pre-training. This eliminates the need for a separate quantization step (as in vq-wav2vec) and allows the discrete targets to co-adapt with the contextual representations.[3]

### Masking in Latent Space

Rather than masking the raw audio waveform, Wav2Vec 2.0 masks the output of the CNN feature encoder in latent space. This design choice means the CNN encoder always sees the complete audio input, while only the Transformer must reconstruct masked positions from context. This proved more effective than masking the raw input.[3]

### CTC Fine-Tuning with Minimal Labels

The combination of self-supervised pre-training with CTC-based fine-tuning on minimal labeled data was a key practical innovation. CTC allows the model to be trained without precise alignment between audio frames and text characters, simplifying the fine-tuning pipeline.

### Cross-Lingual Shared Quantization

XLSR-53 introduced shared quantization across languages, where the same set of discrete speech units represented phonetic content across all training languages.[4] This encouraged the model to discover language-universal speech representations and enabled strong cross-lingual transfer.

## Is Wav2Vec open source?

All major Wav2Vec models are open source and available through Meta's fairseq library on GitHub. The models are also integrated into the [Hugging Face](/wiki/hugging_face) Transformers library, which provides convenient APIs for loading pre-trained models, running inference, and fine-tuning on custom datasets.

Key Hugging Face model identifiers include:

| Model | Hugging Face Identifier |
|-------|------------------------|
| Wav2Vec 2.0 BASE (960h) | facebook/wav2vec2-base-960h |
| Wav2Vec 2.0 LARGE (960h) | facebook/wav2vec2-large-960h |
| Wav2Vec 2.0 LARGE (LV-60k + 960h, self-training) | facebook/wav2vec2-large-960h-lv60-self |
| XLSR-53 | facebook/wav2vec2-large-xlsr-53 |
| XLS-R 300M | facebook/wav2vec2-xls-r-300m |
| XLS-R 1B | facebook/wav2vec2-xls-r-1b |
| XLS-R 2B | facebook/wav2vec2-xls-r-2b |

All models expect 16 kHz sampled single-channel audio as input. The Hugging Face `Wav2Vec2Processor` handles audio preprocessing, and `Wav2Vec2ForCTC` provides the CTC fine-tuning head for ASR tasks.
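A minimal inference sketch with these two classes might look like the following. The helper name `transcribe` is hypothetical. It assumes the `transformers` and `torch` packages are installed, that the checkpoint can be downloaded, and that the audio is already a mono waveform at 16 kHz.

```python
def transcribe(waveform, sampling_rate=16_000, model_id="facebook/wav2vec2-base-960h"):
    """Greedy CTC transcription of a mono waveform given as a 1-D sequence of floats."""
    if sampling_rate != 16_000:
        raise ValueError("resample the audio to 16 kHz before calling transcribe()")

    # Imported lazily so the check above works without the heavy dependencies.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # The processor normalizes the waveform and returns PyTorch tensors.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Most likely token per frame; batch_decode collapses repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

This sketch only makes sense for checkpoints that were fine-tuned for ASR, such as the `960h` models in the table. Pre-trained-only checkpoints such as XLSR-53 and XLS-R need CTC fine-tuning on transcribed audio before they can produce text.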

## Legacy and Influence

The Wav2Vec series has had a profound impact on the speech processing community. It demonstrated convincingly that self-supervised learning could achieve results comparable to or better than fully supervised systems, particularly in low-resource settings. The core ideas of masking latent speech features, applying contrastive objectives, and fine-tuning with CTC have been adopted and extended by numerous subsequent models.

HuBERT, WavLM, data2vec, and other self-supervised speech models all build upon foundations laid by the Wav2Vec line of research. The multilingual extensions (XLSR-53, XLS-R, MMS) have significantly advanced the goal of universal speech technology that works across the full diversity of human languages.

As of 2025, Wav2Vec 2.0 remains one of the most widely used self-supervised speech models in both research and production, and its architecture continues to serve as a baseline for new developments in speech representation learning.

## References

1. Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). "wav2vec: Unsupervised Pre-training for Speech Recognition." *Interspeech 2019*. [arXiv:1904.05862](https://arxiv.org/abs/1904.05862)

2. Baevski, A., Schneider, S., & Auli, M. (2020). "vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations." *ICLR 2020*. [arXiv:1910.05453](https://arxiv.org/abs/1910.05453)

3. Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." *NeurIPS 2020*. [arXiv:2006.11477](https://arxiv.org/abs/2006.11477)

4. Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2020). "Unsupervised Cross-lingual Representation Learning for Speech Recognition." [arXiv:2006.13979](https://arxiv.org/abs/2006.13979)

5. Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino, J., Baevski, A., Conneau, A., & Auli, M. (2021). "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale." [arXiv:2111.09296](https://arxiv.org/abs/2111.09296)

6. Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2023). "Scaling Speech Technology to 1,000+ Languages." [arXiv:2305.13516](https://arxiv.org/abs/2305.13516)

7. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." *IEEE/ACM Transactions on Audio, Speech and Language Processing*. [arXiv:2106.07447](https://arxiv.org/abs/2106.07447)

8. Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., & Wei, F. (2022). "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." *IEEE Journal of Selected Topics in Signal Processing*. [arXiv:2110.13900](https://arxiv.org/abs/2110.13900)

9. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." [arXiv:2212.04356](https://arxiv.org/abs/2212.04356)