data2vec
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,756 words
data2vec is a self-supervised learning framework from Meta AI (then Facebook AI Research) that applies the same training method to three different input types: speech, computer vision, and text. Meta announced it on January 20, 2022. The accompanying paper, "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli, was posted to arXiv on February 7, 2022 and presented at the International Conference on Machine Learning (ICML) in 2022.[1][2][3] A more compute-efficient successor, data2vec 2.0, followed in December 2022.[4][5]
Most self-supervised methods of the period were built for one modality at a time, with objectives tailored to that data: predicting masked words for text, discrete speech units for audio, or pixels and visual tokens for images. data2vec keeps a single learning objective and a standard transformer backbone for all three, changing only the input feature encoder and the masking scheme per modality. The headline result was that one recipe could reach state-of-the-art or competitive accuracy in each of the three domains.[1][6]
The core idea is a teacher-student setup that predicts the model's own internal representations rather than any externally defined target.[1] The same network is used in two roles. In teacher mode it encodes the full, unmasked input and produces representations that serve as training targets. In student mode it encodes a masked version of the same input and tries to predict the teacher's representations at the masked positions. Because the model predicts its own latent activations, the method is a form of self-distillation, closely related to knowledge distillation.[1]
The teacher is not a separately trained network. Its weights are an exponential moving average (EMA) of the student weights, updated each step as Δ ← τΔ + (1 − τ)θ, where Δ is the teacher's parameters, θ is the student's, and τ is a decay coefficient. τ is increased on a schedule from a starting value toward a target value over an initial number of updates and then held constant.[1] A smaller τ early in training lets the teacher follow the student closely while the student is still poorly trained, and a larger τ later makes the teacher change more slowly; this schedule was found to help. The feature encoder and positional encoder are shared between the teacher and student.[1]
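The update rule and decay schedule can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names, the linear shape of the ramp, and the toy values of τ are assumptions chosen for readability.

```python
import numpy as np

def ema_decay(step, tau_start, tau_end, ramp_steps):
    """Decay coefficient at a given step.

    Assumption: tau rises linearly from tau_start to tau_end over
    ramp_steps updates and is held at tau_end afterwards.
    """
    if step >= ramp_steps:
        return tau_end
    return tau_start + (tau_end - tau_start) * step / ramp_steps

def ema_update(teacher, student, tau):
    """Apply teacher <- tau * teacher + (1 - tau) * student to each parameter."""
    for name in teacher:
        teacher[name] = tau * teacher[name] + (1.0 - tau) * student[name]
    return teacher

# Toy parameters (hypothetical): one weight vector per network.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}

for step in range(6):
    tau = ema_decay(step, tau_start=0.5, tau_end=0.9, ramp_steps=4)
    ema_update(teacher, student, tau)
    print(step, round(tau, 3), teacher["w"])
```

With a small τ the teacher moves most of the way toward the student in a single step; as τ approaches its target value, each update changes the teacher less.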
Two properties of the targets set data2vec apart from earlier masked-prediction work such as BERT. First, the targets are continuous latent vectors, not entries from a fixed, predefined vocabulary, so the set of targets is not limited in advance and the model can adapt them to each example. Second, the targets are contextualized: because the teacher uses self-attention over the entire unmasked input, each target encodes information from the whole sequence, not just a local patch or token.[1][6] This distinguishes data2vec from approaches that reconstruct local content such as pixels (masked autoencoders) or predict discrete visual or speech units.
Rather than using only the teacher's top layer as the target (as in methods like BYOL and DINO), data2vec normalizes the output of each of the top K transformer blocks of the teacher and averages them at each masked time-step to form the target.[1] An ablation showed that using multiple layers improved accuracy over using only the top layer (K = 1) in all three modalities. The normalization helps prevent the model from collapsing to a constant representation. The student is trained with a Smooth L1 (Huber-style) loss applied only at the masked positions, governed by a parameter β that controls the transition between squared-error and absolute-error behavior.[1]
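The target construction and loss described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the helper names are hypothetical, the normalization is taken to be a parameter-free per-vector normalization (the exact choice may differ by modality), and random arrays stand in for real transformer outputs.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over the feature axis (assumed form)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def build_targets(teacher_layers, k):
    """Average the normalized outputs of the top k teacher blocks.

    teacher_layers: list of (time, dim) arrays ordered from bottom to top block.
    """
    return np.mean([normalize(h) for h in teacher_layers[-k:]], axis=0)

def smooth_l1(pred, target, beta):
    """Squared error when |diff| <= beta, shifted absolute error otherwise."""
    diff = np.abs(pred - target)
    return np.where(diff <= beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

def masked_loss(student_out, targets, mask, beta=1.0):
    """Mean Smooth L1 loss over the masked time-steps only."""
    return smooth_l1(student_out, targets, beta)[mask].mean()

# Toy data: 4 teacher blocks, 6 time-steps, 8 features (all hypothetical).
rng = np.random.default_rng(0)
teacher_layers = [rng.normal(size=(6, 8)) for _ in range(4)]
student_out = rng.normal(size=(6, 8))
mask = np.array([True, False, True, True, False, False])

targets = build_targets(teacher_layers, k=2)
print(masked_loss(student_out, targets, mask, beta=1.0))
```

Setting `k=1` reduces `build_targets` to the top-layer-only baseline that the ablation compares against.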
The same model is used in two sizes, called data2vec Base and data2vec Large, with 12 or 24 transformer blocks and a hidden dimension of 768 or 1,024 respectively. Only the input feature encoder and the masking scheme differ across modalities.[1]
data2vec reported strong results across all three benchmark families. The numbers below are from the ICML 2022 paper.
On ImageNet-1K image classification (top-1 validation accuracy after fine-tuning), data2vec outperformed prior single-model self-supervised methods for both ViT-B and ViT-L, and beat all listed prior work for ViT-L.[1]
| Method (single model) | ViT-B | ViT-L |
|---|---|---|
| MoCo v3 | 83.2 | 84.1 |
| DINO | 82.8 | N/A |
| MAE | 83.6 | 85.9 |
| SimMIM | 83.8 | N/A |
| iBOT | 83.8 | N/A |
| MaskFeat | 84.0 | 85.7 |
| data2vec | 84.2 | 86.6 |
On Librispeech speech recognition, with fine-tuning on different amounts of labeled data, data2vec lowered the word error rate (WER) relative to wav2vec 2.0 and HuBERT in every Base setting and in the Large settings with 10 hours of labels or less, with the largest relative gains where labeled data was scarcest. For the Large model with 100 or 960 hours of labels, results were within 0.1 WER of the baselines. The paper notes roughly a 20% relative WER improvement over wav2vec 2.0 for the Base model with 10 minutes of labeled data. The table shows WER on the test-other set using a 4-gram language model and LS-960 unlabeled data.[1]
| Model | 10 min | 1h | 10h | 100h | 960h |
|---|---|---|---|---|---|
| wav2vec 2.0 (Base) | 15.6 | 11.3 | 9.5 | 8.0 | 6.1 |
| HuBERT (Base) | 15.3 | 11.3 | 9.4 | 8.1 | N/A |
| data2vec (Base) | 12.3 | 9.1 | 8.1 | 6.8 | 5.5 |
| wav2vec 2.0 (Large) | 10.3 | 7.1 | 5.8 | 4.6 | 3.6 |
| HuBERT (Large) | 10.1 | 6.8 | 5.5 | 4.5 | 3.7 |
| data2vec (Large) | 8.4 | 6.3 | 5.3 | 4.6 | 3.7 |
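The relative improvements for the Base models can be recomputed directly from the table. The short script below is only a worked arithmetic check; the dictionary layout and function name are illustrative.

```python
# WER values copied from the Base rows of the table above.
wer = {
    "wav2vec 2.0 (Base)": {"10 min": 15.6, "1h": 11.3, "10h": 9.5, "100h": 8.0, "960h": 6.1},
    "data2vec (Base)": {"10 min": 12.3, "1h": 9.1, "10h": 8.1, "100h": 6.8, "960h": 5.5},
}

def relative_reduction(baseline, new):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline

for split, baseline in wer["wav2vec 2.0 (Base)"].items():
    new = wer["data2vec (Base)"][split]
    print(f"{split}: {relative_reduction(baseline, new):.1f}% relative WER reduction")
```

For the 10-minute split this gives (15.6 − 12.3) / 15.6 ≈ 21.2%, consistent with the roughly 20% figure quoted from the paper, and the reduction shrinks to about 9.8% at 960 hours.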
data2vec was also evaluated on the AudioSet audio event classification benchmark, where it reached 34.5 mean average precision (mAP) using AudioSet-only pre-training, ahead of comparable methods such as SSAST, MaskSpec, and MAE-AST.[1]
On the GLUE natural language understanding benchmark (development set, single-task fine-tuning), data2vec scored slightly above a RoBERTa baseline that the authors retrained in the same setup, and above published BERT results. With wav2vec 2.0-style span masking, the average improved further. This was described as the first successful pre-trained NLP model whose training target was a contextualized latent representation rather than discrete tokens.[1]
| Model | GLUE average |
|---|---|
| BERT | 80.7 |
| RoBERTa baseline | 82.5 |
| data2vec | 82.7 |
| data2vec (+ wav2vec 2.0 masking) | 82.9 |
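data2vec 2.0, described in the paper "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language" by Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli, was posted to arXiv on December 14, 2022; the Meta AI blog announcement is dated December 13, 2022.[4][5] It keeps the contextualized latent-target objective of the original but cuts training cost substantially through three changes: the student encoder skips masked positions instead of encoding them, a multi-layer convolutional decoder predicts the representations at the masked positions, and the teacher's target for an example is computed once and reused across multiple masked views of that example.[4][5]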
The paper reports efficiency gains of roughly 2x to 16x at similar accuracy across the three modalities.[4] On ImageNet-1K it matched the accuracy of Masked Autoencoders in 16.4x less pre-training time; on Librispeech it matched wav2vec 2.0 in 10.6x less time; and on GLUE it matched a retrained RoBERTa model in about half the training time. When some of the speedup was traded for accuracy, a ViT-L model trained for 150 epochs reached 86.8% top-1 accuracy on ImageNet-1K.[4][5]
| Aspect | data2vec (January 2022) | data2vec 2.0 (December 2022) |
|---|---|---|
| Learning target | Contextualized latent representations (EMA teacher) | Same |
| Encoder over masked positions | Full input encoded | Masked positions not encoded |
| Decoder for masked positions | Transformer | Multi-layer convolutional network |
| Teacher target reuse | One masked view per example | One target reused across multiple masked views |
| Main goal | General cross-modal accuracy | Same accuracy at much lower compute |
| Reported efficiency | Baseline | ~2x to 16x faster at similar accuracy |
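Two of the changes in the table, skipping masked positions in the student encoder and reusing one teacher target across several masked views, can be illustrated with a toy token count. This is a rough sketch, not the paper's cost model: the function names are hypothetical, masks are drawn uniformly at random rather than with a modality-specific scheme, and the sequence length, mask ratio, and number of views are arbitrary.

```python
import numpy as np

def sample_masked_views(num_tokens, mask_ratio, num_views, rng):
    """Draw several random masks for one example.

    Returns a list of (kept, masked) index arrays; only `kept` positions
    would be passed through the student encoder.
    """
    num_masked = int(round(mask_ratio * num_tokens))
    all_positions = np.arange(num_tokens)
    views = []
    for _ in range(num_views):
        masked = np.sort(rng.choice(num_tokens, size=num_masked, replace=False))
        kept = np.setdiff1d(all_positions, masked)
        views.append((kept, masked))
    return views

def encoder_tokens_per_example(num_tokens, mask_ratio, num_views):
    """Tokens encoded by the teacher (once) and by the student (all views)."""
    num_masked = int(round(mask_ratio * num_tokens))
    teacher_tokens = num_tokens  # full input, encoded a single time
    student_tokens = num_views * (num_tokens - num_masked)
    return teacher_tokens, student_tokens

rng = np.random.default_rng(0)
views = sample_masked_views(num_tokens=10, mask_ratio=0.6, num_views=4, rng=rng)
teacher_tokens, student_tokens = encoder_tokens_per_example(10, 0.6, 4)
print(teacher_tokens, student_tokens)
```

In this toy setting the teacher encodes 10 tokens once, while the student encodes 4 tokens in each of 4 views, so the single teacher pass is amortized over four training signals.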
data2vec was notable for showing that a single self-supervised objective, predicting contextualized continuous representations from masked input, could match or beat specialized methods in speech, vision, and language without modality-specific target design.[1][6] It influenced later work on unified and efficient self-supervised pre-training, and the code and pretrained models for both versions were released in Meta's fairseq repository.[1][3][4]
An important nuance is that data2vec is a shared method, not a single shared model: it trains a separate model for each modality using the same objective and architecture style, rather than one network that ingests all modalities at once. The authors framed cross-modal and joint multimodal training as future work.[1][4] The targets also depend on a teacher whose representations must not collapse, which is why normalization of the targets and the layer-averaging strategy are part of the recipe.[1]