# data2vec

> Source: https://aiwiki.ai/wiki/data2vec
> Updated: 2026-06-03
> Categories: Machine Learning, Meta AI, Multimodal AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**data2vec** is a [self-supervised learning](/wiki/self_supervised_learning) framework from [Meta AI](/wiki/meta_ai) (then Facebook AI Research) that applies the same training method to three modalities: speech, images, and text. Meta announced it on January 20, 2022, alongside the paper "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. The paper was posted to arXiv on February 7, 2022 and presented at the International Conference on Machine Learning (ICML) in 2022.[1][2][3] A more compute-efficient successor, data2vec 2.0, followed in December 2022.[4][5]

Most self-supervised methods of the period were built for one modality at a time, with objectives tailored to that data: predicting masked words for text, discrete speech units for audio, or pixels and visual tokens for images. data2vec keeps a single learning objective and a standard [transformer](/wiki/transformer) backbone for all three, changing only the input feature encoder and the masking scheme per modality. The headline result was that one recipe could reach state-of-the-art or competitive accuracy in each of the three domains.[1][6]

## Method

The core idea is a teacher-student setup that predicts the model's own internal representations rather than any externally defined target.[1] The same network is used in two roles. In teacher mode it encodes the full, unmasked input and produces representations that serve as training targets. In student mode it encodes a masked version of the same input and tries to predict the teacher's representations at the masked positions. Because the model predicts its own latent activations, the method is a form of self-distillation, closely related to [knowledge distillation](/wiki/knowledge_distillation).[1]

The teacher is not a separately trained network. Its weights are an exponential moving average (EMA) of the student weights, updated each step as Δ ← τΔ + (1 − τ)θ, where Δ is the teacher's parameters, θ is the student's, and τ is a decay coefficient. τ is increased linearly from a starting value to a target value over an initial set of updates and then held constant.[1] A smaller τ early on lets the teacher follow the student closely while the student is still poorly trained and changing quickly; a larger τ later makes the teacher evolve more slowly. The authors found this schedule helpful. The feature encoder and positional encoder are shared between the teacher and student.[1]
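
The update rule and decay schedule can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function names `ema_decay` and `ema_update`, the flat list of weights, and the numeric values of the schedule are all assumptions chosen to keep the example small.

```python
def ema_decay(step: int, tau_start: float, tau_end: float, ramp_steps: int) -> float:
    """Linearly ramp the decay from tau_start to tau_end, then hold it constant."""
    if step >= ramp_steps:
        return tau_end
    return tau_start + (tau_end - tau_start) * step / ramp_steps


def ema_update(teacher: list, student: list, tau: float) -> list:
    """Return new teacher weights: tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy run with illustrative (hypothetical) values; real schedules use
# decay values much closer to 1 and far more ramp steps.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for step in range(3):
    tau = ema_decay(step, tau_start=0.5, tau_end=0.9, ramp_steps=2)
    teacher = ema_update(teacher, student, tau)
```

With a small τ the teacher moves most of the way toward the student in a single step; as τ approaches 1 each update changes the teacher only slightly.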

Two properties of the targets set data2vec apart from earlier masked-prediction work such as BERT. First, the targets are *continuous* latent vectors, not entries from a fixed, predefined vocabulary, so the set of targets is not limited in advance and the model can adapt them to each example. Second, the targets are *contextualized*: because the teacher uses self-attention over the entire unmasked input, each target encodes information from the whole sequence, not just a local patch or token.[1][6] This distinguishes data2vec from approaches that reconstruct local content such as pixels ([masked autoencoders](/wiki/masked_autoencoder)) or predict discrete visual or speech units.

Rather than using only the teacher's top layer as the target (as in methods like BYOL and DINO), data2vec averages the output of the top *K* transformer blocks of the teacher at each masked time-step, after a normalization step, to form the target.[1] An ablation showed that using multiple layers improved accuracy over using only the top layer (K = 1) in all three modalities. Targets are normalized to prevent the model from collapsing to a constant representation. The student is trained with a Smooth L1 (Huber) loss applied only at the masked positions, governed by a parameter β that controls the transition between squared-error and absolute-error behavior.[1]
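
As a rough sketch of the target construction and loss described above, the following plain-Python code normalizes the outputs of the top K blocks, averages them, and applies a Smooth L1 loss at masked positions only. The helper names (`layer_norm`, `build_target`, `smooth_l1`, `masked_loss`) are hypothetical, and the use of a parameter-free layer norm over the feature dimension is an assumption; the exact normalization per modality is not specified here.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Parameter-free normalization over the feature dimension (an assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def build_target(block_outputs, k):
    """Average the normalized outputs of the top-k blocks for one time-step.

    block_outputs: one feature vector per transformer block, bottom to top.
    """
    top = [layer_norm(v) for v in block_outputs[-k:]]
    dim = len(top[0])
    return [sum(v[d] for v in top) / k for d in range(dim)]


def smooth_l1(pred, target, beta):
    """Squared error for small differences, absolute error for large ones."""
    total = 0.0
    for p, y in zip(pred, target):
        diff = abs(p - y)
        total += 0.5 * diff * diff / beta if diff <= beta else diff - 0.5 * beta
    return total / len(pred)


def masked_loss(preds, targets, mask, beta):
    """Average the loss over masked time-steps only (assumes at least one)."""
    losses = [smooth_l1(p, t, beta) for p, t, m in zip(preds, targets, mask) if m]
    return sum(losses) / len(losses)


# One time-step, two teacher blocks with three features each (toy numbers).
target = build_target([[1.0, 2.0, 3.0], [0.0, 2.0, 4.0]], k=2)
```

The two branches of `smooth_l1` meet at a difference of β, so β sets where the loss switches from quadratic to linear behavior.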

The same Transformer architecture comes in two sizes, data2vec Base and data2vec Large, with 12 and 24 transformer blocks and hidden dimensions of 768 and 1,024, respectively. Only the input handling and masking differ across modalities:[1]

- **Vision:** images of 224×224 pixels are split into 16×16 patches following the Vision Transformer (ViT) approach, and block-wise masking covers about 60% of patches.
- **Speech:** a 16 kHz waveform is processed by a multi-layer 1-D convolutional feature encoder into a roughly 50 Hz representation, and spans of the resulting units are masked (about 49% of time-steps).
- **Language:** text is tokenized into byte-pair-encoding sub-words, and the BERT masking scheme is applied to roughly 15% of tokens (with a variant masking spans of four tokens).
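
The masking figures in the list above can be checked with a small simulation. The sketch below is illustrative only: `span_mask` and `token_mask` are hypothetical helpers, and the speech parameters (6.5% of time-steps chosen as span starts, each span ten steps long, in the style of wav2vec 2.0) are assumptions that happen to yield roughly the 49% figure once overlapping spans are merged. Block-wise image masking and BERT's replace-or-keep rules for selected tokens are not modeled.

```python
import random


def span_mask(num_steps, start_prob, span_len, seed=0):
    """Pick span starts at random and mask span_len steps from each start."""
    rng = random.Random(seed)
    num_starts = round(start_prob * num_steps)
    mask = [False] * num_steps
    for start in rng.sample(range(num_steps), num_starts):
        for i in range(start, min(start + span_len, num_steps)):
            mask[i] = True  # overlapping spans simply merge
    return mask


def token_mask(num_tokens, mask_prob, seed=0):
    """Select a fixed fraction of token positions for masked prediction."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(num_tokens), round(mask_prob * num_tokens)))
    return [i in chosen for i in range(num_tokens)]


# Vision: a 224x224 image cut into 16x16 patches.
patches_per_side = 224 // 16
num_patches = patches_per_side ** 2  # 14 * 14 = 196

# Speech: fraction of time-steps covered by the merged spans.
speech_mask = span_mask(10_000, start_prob=0.065, span_len=10)
speech_fraction = sum(speech_mask) / len(speech_mask)

# Language: 15% of a 512-token sequence.
text_mask = token_mask(512, mask_prob=0.15)
```

Because spans overlap, the masked fraction for speech is lower than the naive 0.065 × 10 = 65%; the expected coverage is about 1 − (1 − 0.065)^10 ≈ 0.49.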

## Results

The paper evaluates data2vec on ImageNet-1K for vision, Librispeech and AudioSet for speech and audio, and GLUE for language. The numbers below are from the ICML 2022 paper.

### Computer vision

On ImageNet-1K image classification (top-1 validation accuracy after fine-tuning), data2vec outperformed prior single-model self-supervised methods for both ViT-B and ViT-L, and beat all listed prior work for ViT-L.[1]

| Method (single model) | ViT-B | ViT-L |
|---|---|---|
| MoCo v3 | 83.2 | 84.1 |
| DINO | 82.8 | N/A |
| MAE | 83.6 | 85.9 |
| SimMIM | 83.8 | N/A |
| iBOT | 83.8 | N/A |
| MaskFeat | 84.0 | 85.7 |
| **data2vec** | **84.2** | **86.6** |

### Speech

On Librispeech speech recognition, fine-tuning on different amounts of labeled data, data2vec lowered the word error rate (WER) compared with [wav2vec](/wiki/wav2vec) 2.0 and HuBERT for the Base model in every setting and for the Large model in the low-resource settings (10 minutes to 10 hours of labels). With 100 and 960 hours of labels, the three Large models are within 0.1 WER of one another. The largest relative gains are in the lowest-resource settings: the paper notes roughly a 20% relative WER improvement over wav2vec 2.0 for the Base model with 10 minutes of labeled data. The table shows WER (%) on the test-other set using a 4-gram language model and LS-960 unlabeled data.[1]

| Model | 10 min | 1h | 10h | 100h | 960h |
|---|---|---|---|---|---|
| wav2vec 2.0 (Base) | 15.6 | 11.3 | 9.5 | 8.0 | 6.1 |
| HuBERT (Base) | 15.3 | 11.3 | 9.4 | 8.1 | N/A |
| **data2vec (Base)** | **12.3** | **9.1** | **8.1** | **6.8** | **5.5** |
| wav2vec 2.0 (Large) | 10.3 | 7.1 | 5.8 | 4.6 | 3.6 |
| HuBERT (Large) | 10.1 | 6.8 | 5.5 | 4.5 | 3.7 |
| **data2vec (Large)** | **8.4** | **6.3** | **5.3** | **4.6** | **3.7** |
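
The relative improvement quoted for the Base model can be reproduced directly from the table. The snippet below is a small worked calculation; the dictionary layout and the `relative_improvement` helper are illustrative, while the WER values are copied from the Base rows above.

```python
# WER (%) on test-other for the Base models, copied from the table above.
wav2vec2_base = {"10 min": 15.6, "1h": 11.3, "10h": 9.5, "100h": 8.0, "960h": 6.1}
data2vec_base = {"10 min": 12.3, "1h": 9.1, "10h": 8.1, "100h": 6.8, "960h": 5.5}


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which the WER drops relative to the baseline."""
    return (baseline - new) / baseline


gains = {
    split: relative_improvement(wav2vec2_base[split], data2vec_base[split])
    for split in wav2vec2_base
}

for split, gain in gains.items():
    print(f"{split:>6}: {gain:.1%} relative WER reduction")
```

For the 10-minute setting this gives (15.6 − 12.3) / 15.6 ≈ 21%, consistent with the "roughly 20%" figure.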

data2vec also reached 34.5 mean average precision (mAP) on the AudioSet audio event classification benchmark using AudioSet-only pre-training, ahead of comparable methods such as SSAST, MaskSpec, and MAE-AST.[1]

### Natural language processing

On the GLUE natural language understanding benchmark (development set, single-task fine-tuning), data2vec scored slightly above a RoBERTa baseline that the authors retrained in the same setup, and above published BERT results. Replacing BERT-style token masking with wav2vec 2.0-style span masking raised the average further. The authors described this as, to their knowledge, the first successful pre-trained NLP model that does not use discrete units as its training target, predicting a contextualized latent representation instead.[1]

| Model | GLUE average |
|---|---|
| BERT | 80.7 |
| RoBERTa baseline | 82.5 |
| **data2vec** | **82.7** |
| **data2vec (+ wav2vec 2.0 masking)** | **82.9** |

## data2vec 2.0

data2vec 2.0 was introduced in the paper "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language" by Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli, posted to arXiv on December 14, 2022; the Meta AI blog announcement is dated December 13, 2022.[4][5] It keeps the contextualized latent-target objective of the original but cuts training cost substantially through three changes:[4][5]

1. **Not encoding masked tokens.** Following the masked-autoencoder idea, the student encoder processes only the unmasked portion of the input (for images, the roughly 20% of patches that remain visible), rather than running the full sequence through the encoder.
2. **A fast convolutional decoder.** The masked positions are filled in by a lightweight multi-layer convolutional decoder instead of a transformer decoder.
3. **Amortized (multi-mask) targets.** The teacher's representation of an example is computed once and then reused as the target for several different masked versions of that same example, spreading the cost of building the (relatively expensive) contextualized target over many student updates.
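
The interplay of the first and third changes can be sketched as a toy training step: the teacher target is computed once per example, and several masked views reuse it while the student sees only the visible positions. Everything here is a stand-in under stated assumptions: `teacher_encode`, `student_predict`, and `multi_mask_step` are hypothetical, the "models" are simple arithmetic rather than transformers, and a plain absolute error replaces the real loss.

```python
import random


def teacher_encode(example):
    """Hypothetical teacher: a contextualized target per position (sees everything)."""
    mean = sum(example) / len(example)
    return [x - mean for x in example]


def student_predict(visible, num_positions):
    """Hypothetical student: encodes only visible (position, value) pairs,
    then a stand-in 'decoder' emits a prediction for every position."""
    mean = sum(v for _, v in visible) / len(visible)
    preds = [0.0] * num_positions  # default guess at masked positions
    for pos, value in visible:
        preds[pos] = value - mean
    return preds


def multi_mask_step(example, num_masks, mask_ratio, seed=0):
    """Compute the teacher target once and reuse it for num_masks masked views."""
    rng = random.Random(seed)
    n = len(example)
    num_masked = round(mask_ratio * n)
    targets = teacher_encode(example)  # single teacher pass
    counts = {"teacher_passes": 1, "student_passes": 0}
    losses = []
    for _ in range(num_masks):
        masked = set(rng.sample(range(n), num_masked))
        visible = [(i, x) for i, x in enumerate(example) if i not in masked]
        preds = student_predict(visible, n)
        counts["student_passes"] += 1
        # Stand-in loss: mean absolute error at masked positions only.
        losses.append(sum(abs(preds[i] - targets[i]) for i in masked) / num_masked)
    return losses, counts


losses, counts = multi_mask_step([1.0, 2.0, 3.0, 4.0, 5.0], num_masks=4, mask_ratio=0.6)
```

In this sketch four student updates share one teacher pass, and each student pass handles only the unmasked 40% of positions, which is where the savings described above would come from.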

The paper reports efficiency gains of roughly 2x to 16x at similar accuracy across the three modalities.[4] On ImageNet-1K it matched the accuracy of Masked Autoencoders with 16.4x lower pre-training time; on Librispeech it matched wav2vec 2.0 in 10.6x less time; and on GLUE it matched a retrained RoBERTa model in about half the training time. Trading some of the speedup for accuracy, a ViT-L model trained for 150 epochs reached 86.8% top-1 accuracy on ImageNet-1K.[4][5]

| Aspect | data2vec (2022) | data2vec 2.0 (2022) |
|---|---|---|
| Learning target | Contextualized latent representations (EMA teacher) | Same |
| Encoder over masked positions | Full input encoded | Masked positions not encoded |
| Decoder for masked positions | No separate decoder (mask tokens pass through the Transformer encoder) | Multi-layer convolutional network |
| Teacher target reuse | One masked view per example | One target reused across multiple masked views |
| Main goal | General cross-modal accuracy | Same accuracy at much lower compute |
| Reported efficiency | Baseline | ~2x to 16x faster at similar accuracy |

## Significance and limitations

data2vec was notable for showing that a single self-supervised objective, predicting contextualized continuous representations from masked input, could match or beat specialized methods in speech, vision, and language without modality-specific target design.[1][6] It influenced later work on unified and efficient self-supervised pre-training, and the code and pretrained models for both versions were released in Meta's [fairseq](/wiki/fairseq) repository.[1][3][4]

An important nuance is that data2vec is a shared *method*, not a single shared *model*: it trains a separate model for each modality using the same objective and architecture style, rather than one network that ingests all modalities at once. The authors framed cross-modal and joint multimodal training as future work.[1][4] The targets also depend on a teacher whose representations must not collapse, which is why normalization of the targets and the layer-averaging strategy are part of the recipe.[1]

## References

1. Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language." Proceedings of the 39th International Conference on Machine Learning (ICML), PMLR 162. https://proceedings.mlr.press/v162/baevski22a/baevski22a.pdf
2. Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language." arXiv:2202.03555. https://arxiv.org/abs/2202.03555
3. Meta AI. (January 20, 2022). "data2vec: The first high-performance self-supervised algorithm that works for speech, vision, and text." https://ai.meta.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text/
4. Baevski, A., Babu, A., Hsu, W.-N., & Auli, M. (2022). "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language." arXiv:2212.07525. https://arxiv.org/abs/2212.07525
5. Meta AI. (December 13, 2022). "data2vec 2.0: Highly efficient self-supervised learning for vision, speech and text." https://ai.meta.com/blog/ai-self-supervised-learning-data2vec/
6. Meta / Facebook Newsroom. (January 20, 2022). "Introducing the First Self-Supervised Algorithm for Speech, Vision and Text." https://about.fb.com/news/2022/01/first-self-supervised-algorithm-for-speech-vision-text/

