Masked autoencoder (MAE)

Updated Jun 21, 2026

Masked autoencoder (MAE) is a self-supervised learning method for vision transformers that masks roughly 75% of an input image's patches and trains a network to reconstruct the missing pixels from the small visible remainder. It was introduced by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick at Facebook AI Research (FAIR) in the November 2021 paper "Masked Autoencoders Are Scalable Vision Learners" (arXiv:2111.06377), which was accepted to CVPR 2022 ^[1]. Its two core designs are an asymmetric encoder-decoder, in which the heavy Vision Transformer encoder sees only the visible 25% of patches and a lightweight decoder reconstructs the full image, and a deliberately high masking ratio that makes reconstruction a nontrivial self-supervised task ^[1]. As the abstract states, "masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task" ^[1].

MAE is inspired by BERT's masked language modelling for text, adapted so that pixel-level reconstruction is practical at scale. The very high masking ratio exploits the natural redundancy of pixels, and because the encoder never processes mask tokens, only the lightweight decoder handles the full sequence, including the learnable mask tokens ^[1]. Together these choices accelerate training by 3x or more over masked image modelling baselines that feed mask tokens through the encoder, while improving downstream accuracy ^[1]. A vanilla ViT-Huge pre-trained with MAE on ImageNet-1K reached 87.8% top-1 fine-tuning accuracy at 448 input resolution, the best result among methods that use only ImageNet-1K data at the time of publication ^[1].

Background and motivation

Before MAE, the two main approaches to large-scale self-supervised learning for vision were contrastive learning and earlier reconstruction-based pretraining. Contrastive and closely related joint-embedding methods such as SimCLR (Chen et al. 2020), MoCo and MoCo v3 (He et al. 2020, Chen et al. 2021), BYOL (Grill et al. 2020), and DINO (Caron et al. 2021) trained a network to map two augmented views of the same image to similar embeddings, in the strictly contrastive cases while also pushing different images apart ^[2]^[3]^[4]^[5]. These methods produced strong linear-probe features but relied on heavy data augmentation pipelines, large batch sizes, momentum encoders, or careful negative-sample mining. Earlier reconstruction-based methods like the denoising autoencoders of Vincent et al. (2008) and the iGPT pixel-prediction approach of Chen et al. (2020) showed that generative pretraining was possible but trailed contrastive methods on standard benchmarks.

In natural language, the BERT recipe of randomly masking 15% of input tokens and predicting them from context had become the dominant pretraining objective and reliably scaled to billion-parameter models ^[6]. Researchers had asked for years why the same idea did not transfer cleanly to vision. He and colleagues argued that three differences explain the gap: language has discrete tokens that act as natural prediction targets, while images are continuous; convolutional networks did not have an obvious way to incorporate mask tokens or positional information for missing patches; and image patches are far more redundant than words, so masking 15% leaves a task that the network can solve through trivial interpolation rather than learning semantics ^[1]. The arrival of the Vision Transformer (Dosovitskiy et al. 2020) removed the architectural barrier by treating an image as a sequence of patch embeddings, and MAE addresses the redundancy issue by raising the masking ratio to 75% so that reconstruction requires understanding the global content of the image rather than copying nearby pixels.

A wave of related papers appeared at almost the same time. BEiT (Bao et al., ICLR 2022) trained a ViT to predict discrete visual tokens produced by a frozen DALL-E dVAE tokenizer, applying a BERT-style cross-entropy loss over the predicted token codes ^[7]. SimMIM (Xie et al., CVPR 2022) from Microsoft Research demonstrated independently that simple raw-pixel regression with a one-layer linear head and random patch masking also worked, with similar fine-tuning accuracy ^[8]. data2vec (Baevski et al., ICML 2022) generalized the recipe to speech, vision, and language by predicting a teacher network's contextualized latent representations rather than pixels or tokens ^[9]. MAE's distinguishing claim was simplicity and efficiency: pixels as targets, vanilla ViT encoder, very high masking ratio, and a decoder small enough to be effectively free.

How does a masked autoencoder work?

The MAE pipeline operates in five stages. The image is first split into a regular grid of non-overlapping patches (16x16 pixels for a standard 224x224 ViT, giving 196 patches), each linearly projected into a token. A random subset of these tokens, typically 75% of them, is then sampled without replacement and dropped. The encoder, a vanilla Vision Transformer, processes only the remaining visible 25%. After encoding, the visible token embeddings are placed back at their original positions in the sequence and a single learned mask token (with positional embedding) fills every masked position. This full sequence is passed through a small decoder that predicts the original pixel values for every masked patch. After pretraining, the decoder is discarded and only the encoder is kept for downstream tasks ^[1].
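The masking step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the official implementation: the function name `random_masking` and its return values are hypothetical, and patches are represented only by their indices rather than by real embeddings.

```python
import random


def random_masking(num_patches, mask_ratio, seed=None):
    """Choose visible patches by shuffling patch indices (illustrative helper)."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))

    # A random permutation of patch indices; the first num_keep stay visible.
    ids_shuffle = list(range(num_patches))
    rng.shuffle(ids_shuffle)
    ids_keep = ids_shuffle[:num_keep]

    # mask[i] == 1 means patch i is hidden from the encoder.
    mask = [1] * num_patches
    for i in ids_keep:
        mask[i] = 0

    # ids_restore[i] is the rank of patch i in the shuffled order; it is what
    # lets encoded tokens be put back at their original grid positions later.
    ids_restore = [0] * num_patches
    for rank, patch_idx in enumerate(ids_shuffle):
        ids_restore[patch_idx] = rank
    return ids_keep, mask, ids_restore


# A 224x224 image with 16x16 patches gives a 14x14 grid of 196 patches.
num_patches = (224 // 16) ** 2
ids_keep, mask, ids_restore = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(ids_keep), sum(mask))  # visible and masked patch counts
```

Only the tokens listed in `ids_keep` would be passed to the encoder; the `mask` list is what later restricts the loss to masked patches.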

What is the asymmetric encoder-decoder?

The central innovation is the asymmetry between the encoder and decoder, summarised in the table below.

Component	Architecture	Sees mask tokens?	Tokens processed	Role
Encoder	Vanilla ViT-B / ViT-L / ViT-H	No	Only the 25% visible patches	Heavy feature extractor, the only part kept after pretraining
Decoder	Small ViT (default 8 blocks, 512-dim, 16 heads)	Yes	Full sequence (visible embeddings + mask tokens)	Lightweight reconstruction head, discarded after pretraining

Because the encoder sees only the visible quarter of the patches, its compute drops sharply. Self-attention cost grows quadratically with the number of input tokens, so cutting tokens from N to N/4 cuts the attention term by roughly 16x, while the linear layers, whose cost grows linearly with N, shrink by 4x; overall the paper reports pretraining speedups of 3x or more ^[1]. The decoder is small (about 9% of the FLOPs per token of a ViT-L encoder) and only needs to reconstruct pixels well enough to provide a useful learning signal, not to produce a representation for downstream tasks. This is what makes pretraining a 632 million parameter ViT-Huge on ImageNet-1K tractable.
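The scaling argument can be checked with a rough back-of-the-envelope count. The sketch below is an approximation under stated assumptions: it counts multiply-accumulates for one standard Transformer block (QKV and output projections plus a 4x-wide MLP), ignores the class token, layer norms and softmax, and uses a hypothetical helper `block_macs`. It is not the paper's own FLOP accounting.

```python
def block_macs(num_tokens, width):
    """Approximate multiply-accumulates for one Transformer block (rough model)."""
    linear = 12 * num_tokens * width ** 2     # QKV (3), output projection (1), MLP (8)
    attention = 2 * num_tokens ** 2 * width   # QK^T scores plus the weighted sum of values
    return linear, attention


# ViT-L width, full 196-token sequence versus the 49 visible tokens.
full_lin, full_att = block_macs(196, 1024)
vis_lin, vis_att = block_macs(49, 1024)

att_saving = full_att / vis_att               # quadratic term: 16x
lin_saving = full_lin / vis_lin               # linear term: 4x
total_saving = (full_lin + full_att) / (vis_lin + vis_att)

# Per-token cost of the default decoder (8 blocks, width 512) relative to a
# ViT-L encoder (24 blocks, width 1024), counting linear layers only.
decoder_ratio = (8 * 512 ** 2) / (24 * 1024 ** 2)

print(att_saving, lin_saving, round(total_saving, 2), round(decoder_ratio, 3))
```

Under these assumptions the linear layers dominate at 196 tokens, so the encoder saving lands close to 4x even though the attention term alone shrinks 16x, and the decoder's per-token cost comes out under 10% of the encoder's.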

The mask token and positional embeddings

The mask token is a single learned vector shared across all masked positions. Positional embeddings are added to every token (visible or masked) before the decoder, so the network knows where each masked patch belongs in the original grid. Without positional embeddings, every mask token would be identical and the decoder would have no way to tell masked positions apart, so reconstruction would fail.
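How the decoder input is assembled can be illustrated with a toy example. This is a sketch under assumptions rather than the reference code: tokens are plain Python lists, `build_decoder_input` is a hypothetical helper, and the "encoder outputs" are hand-written stand-ins.

```python
def build_decoder_input(visible_tokens, ids_shuffle, mask_token, pos_embed):
    """Rebuild the full-length token sequence that the decoder consumes.

    visible_tokens: encoder outputs, ordered like ids_shuffle[:len(visible_tokens)]
    ids_shuffle:    the random patch permutation used when masking
    mask_token:     the single shared vector used for every masked patch
    pos_embed:      one positional vector per patch, in original grid order
    """
    num_patches = len(ids_shuffle)
    num_masked = num_patches - len(visible_tokens)

    # Visible tokens first, then one copy of the shared mask token per masked patch.
    shuffled = list(visible_tokens) + [mask_token] * num_masked

    # Undo the shuffle so each token sits at its original grid position.
    full = [None] * num_patches
    for rank, patch_idx in enumerate(ids_shuffle):
        full[patch_idx] = shuffled[rank]

    # Add positional embeddings to every token, visible or masked.
    return [[t + p for t, p in zip(token, pos)] for token, pos in zip(full, pos_embed)]


# Toy setup: 8 patches, 2-dimensional tokens, 2 visible patches (75% masked).
ids_shuffle = [5, 2, 7, 0, 3, 6, 1, 4]
visible_tokens = [[10.0, 0.0], [20.0, 0.0]]   # stand-ins for encoded patches 5 and 2
mask_token = [-1.0, -1.0]
pos_embed = [[0.0, float(i)] for i in range(8)]

decoder_input = build_decoder_input(visible_tokens, ids_shuffle, mask_token, pos_embed)
```

In the toy output, the masked positions differ only through their positional component, which is the point made above: the positional embedding is the only thing that distinguishes one mask token from another.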

A crucial empirical observation in the paper is that putting mask tokens through the encoder (as BEiT and earlier masked-image-modelling papers did) hurts both speed and accuracy. The reason is partly distributional: at pretraining time the encoder sees a mix of real patches and mask tokens, but at downstream fine-tuning it only ever sees real patches, creating a distribution gap. By keeping mask tokens out of the encoder entirely, MAE also closes this gap ^[1].

Reconstruction target and loss

The decoder predicts the raw RGB pixel values for each masked patch and the loss is mean squared error (MSE) computed only over the masked positions. The paper shows that normalising the target patch by its own mean and standard deviation before computing MSE improves representation quality, presumably because the network is forced to predict relative structure rather than absolute brightness. Predicting visible patches in addition to masked ones gives no extra benefit and slightly hurts results, which suggests the masked positions are where the useful learning signal lives ^[1].
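The loss described above can be written out for toy data. The following is a minimal sketch, assuming patches are flat lists of pixel values, a small epsilon for numerical stability, and the sample (unbiased) variance for normalisation; the helper names `normalize_patch` and `mae_loss` are illustrative, not taken from the official code.

```python
from statistics import fmean, variance


def normalize_patch(pixels, eps=1e-6):
    """Normalise one patch by its own mean and standard deviation."""
    mu = fmean(pixels)
    var = variance(pixels, mu)   # sample variance; an assumption of this sketch
    return [(p - mu) / (var + eps) ** 0.5 for p in pixels]


def mae_loss(pred, target, mask, norm_pix=True):
    """Mean squared error averaged over masked patches only (mask[i] == 1)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue             # visible patches contribute nothing to the loss
        if norm_pix:
            t = normalize_patch(t)
        total += fmean([(a - b) ** 2 for a, b in zip(p, t)])
        count += 1
    return total / count


# Two toy patches of four pixels each; only the first one is masked.
target = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 6.0, 8.0]]
mask = [1, 0]

perfect_pred = [normalize_patch(target[0]), [99.0, 99.0, 99.0, 99.0]]
loss_masked_only = mae_loss(perfect_pred, target, mask)

# With both patches masked, the bad prediction for patch 1 is penalised.
loss_all_masked = mae_loss(perfect_pred, target, [1, 1])
```

The wildly wrong prediction for the visible patch leaves `loss_masked_only` at zero, which mirrors the design choice of computing the loss on masked positions only.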

The paper compares pixel targets to two alternatives: tokenized targets in the style of BEiT's dVAE codebook, and PCA coefficients of each patch. The simpler pixel target is competitive with or better than the tokenized one once normalisation is applied, which is one of the paper's main empirical messages ^[1].

Why is the masking ratio 75%?

A masking ratio of around 75% works well for natural images. Lower ratios make the task too easy because patches are highly correlated locally, so the network can copy nearby visible pixels; higher ratios destroy too much structure. Ablations in the paper show that 75% is a good choice for both fine-tuning and linear probing: linear-probe accuracy rises steadily with the masking ratio until it peaks at about 75%, while fine-tuning accuracy is less sensitive and stays high across a broad range of ratios (roughly 40% to 80%) ^[1]. This is much higher than BERT's 15% for text, which is consistent with the intuition that natural images contain more redundancy at the patch level than language does at the token level.

How is MAE pretrained?

The canonical MAE training recipe on ImageNet-1K is short and uses surprisingly little augmentation, in contrast to contrastive methods that depend on aggressive colour jittering, blurring, and multi-crop strategies. The default recipe from the official Facebook AI Research code release is:

Setting	Value
Dataset	ImageNet-1K (about 1.28 million images, no labels used during pretraining)
Masking ratio	75%
Encoder	ViT-B/16, ViT-L/16, or ViT-H/14
Decoder	8 Transformer blocks, 512 dim, 16 heads
Loss	Per-patch normalised MSE on masked patches only
Optimizer	AdamW, betas (0.9, 0.95), weight decay 0.05
Base learning rate	1.5e-4 (effective rate = base rate × batch size / 256)
Schedule	Cosine decay with 40-epoch linear warmup
Batch size	4096 images
Epochs	800 (default), 1600 (longer setting)
Augmentation	Random resized crop and horizontal flip only

No colour jitter, no blurring, no multi-crop. The simplicity of the augmentation pipeline is one of MAE's practical selling points: a contrastive recipe like MoCo v3 or DINO normally requires a carefully tuned chain of photometric distortions, while MAE works well with the kind of crops a supervised classifier would use.
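The learning-rate rows of the recipe table can be made concrete with a short sketch. It assumes the linear scaling rule with a reference batch size of 256 and a half-cycle cosine decay to a minimum rate of zero after warmup; the function names and the `min_lr` default are illustrative choices rather than settings quoted from the table.

```python
import math


def scaled_lr(base_lr, batch_size):
    """Linear scaling rule: effective rate = base rate x batch size / 256."""
    return base_lr * batch_size / 256


def lr_at_epoch(epoch, peak_lr, warmup_epochs=40, total_epochs=800, min_lr=0.0):
    """Linear warmup followed by half-cycle cosine decay (min_lr is an assumption)."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))


peak = scaled_lr(1.5e-4, 4096)   # about 2.4e-3 for the default batch size
for epoch in (0, 20, 40, 420, 800):
    print(epoch, lr_at_epoch(epoch, peak))
```

With the default values the rate climbs linearly to its peak at epoch 40, is at half the peak midway through the cosine phase (epoch 420), and decays to zero at epoch 800.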

How well does MAE perform?

The MAE paper reports results on three ViT scales pretrained on ImageNet-1K and then either linearly probed or fine-tuned on the same dataset and on downstream detection and segmentation benchmarks. Table values are taken directly from He et al. 2022 ^[1].

ImageNet-1K classification

Model	Parameters	Linear probe top-1	Fine-tune top-1	Fine-tune at 448
ViT-B/16	86 M	68.0%	83.6%	-
ViT-L/16	307 M	75.8%	85.9%	-
ViT-H/14	632 M	76.6%	86.9%	87.8%

The 87.8% number for ViT-H/14 at 448 input resolution was state of the art among methods that use only ImageNet-1K data at the time. For context, supervised training of the same ViT-H from scratch on ImageNet-1K reached only 83.1%, so MAE pretraining followed by fine-tuning at 224 resolution (86.9%) gave a roughly 4-point boost from the same data and architecture ^[1].

The gap between linear probing and fine-tuning is wider for MAE than for contrastive methods. DINO, MoCo v3, and SimCLR typically show a noticeably smaller gap between linear-probe and fine-tuning accuracy, while MAE's linear probe is 10 to 15 points below its fine-tuning result. A common interpretation is that pixel reconstruction encourages the encoder to learn strong but non-linear features that retain low-level information: these are useful once the backbone is fine-tuned, but are not as immediately linearly separable as features learned by augmentation-invariant contrastive objectives.

Object detection on COCO

Fine-tuning a ViT-L MAE backbone with a Mask R-CNN detection head on COCO produced 53.3 box AP and 47.2 mask AP, beating the supervised pre-training baseline by about 4 box AP and matching or surpassing MoCo v3 and BEiT under the same protocol ^[1]. The improvement over supervised pre-training was larger for ViT-L than for ViT-B, suggesting that MAE's gains compound with model scale.

Semantic segmentation on ADE20K

Using UperNet as the segmentation head on ADE20K, ViT-L MAE reached 53.6 mIoU, about 4 points above the supervised ViT-L baseline and ahead of BEiT under the same protocol ^[1].

How does MAE compare with prior self-supervised methods?

The table below summarises ImageNet-1K fine-tuning accuracy on ViT-B for several self-supervised methods reported around the time of the MAE paper.

Method	Year	Family	Target	ViT-B fine-tune top-1
Supervised ViT-B (DeiT)	2020	n/a	Class labels	81.8%
MoCo v3	2021	Contrastive	Augmented embeddings	83.2%
DINO	2021	Self-distillation	Teacher logits	82.8%
BEiT	2022	Masked image modelling	dVAE token IDs	83.2%
SimMIM	2022	Masked image modelling	Raw pixels	83.8%
MAE	2021	Masked image modelling	Normalised pixels	83.6%
data2vec	2022	Masked latent prediction	EMA latent features	84.2%

MAE, BEiT, and SimMIM cluster within a percentage point of each other on this benchmark, with MAE distinguishing itself on training efficiency rather than peak ViT-B accuracy. data2vec edges them out by predicting a target that is itself a learned latent representation, which foreshadows the move toward latent prediction in later work.

Variants and follow-ups

MAE quickly became a base recipe for adapting masked reconstruction to other modalities and tasks. The list below covers the most cited follow-ups.

Variant	Authors / venue	Idea
BEiT v2	Peng et al., 2022	Replace dVAE codebook with vector-quantised semantic tokenizer trained by knowledge distillation
SimMIM	Xie et al., CVPR 2022	Equivalent to MAE in spirit, with a one-layer linear prediction head and 32-pixel masked patches
data2vec	Baevski et al., ICML 2022	Predict EMA-teacher latent features instead of pixels; unified across vision, speech, language
data2vec 2.0	Baevski et al., 2023	Faster, contextualized targets, asymmetric encoder-decoder borrowed from MAE
VideoMAE	Tong et al., NeurIPS 2022	Extends MAE to video with tube masking and 90% to 95% masking ratio
Masked Autoencoders As Spatiotemporal Learners (MAE-ST)	Feichtenhofer, Fan, Li, He, NeurIPS 2022	Independent video MAE from FAIR, 90% masking ratio, vanilla spacetime ViT
VideoMAE V2	Wang et al., CVPR 2023	Dual masking and scaling to billion-parameter video ViTs
MultiMAE	Bachmann et al., ECCV 2022	Multi-modal pretraining with RGB, depth, and segmentation as targets
SatMAE	Cong et al., NeurIPS 2022	MAE for satellite imagery with temporal and spectral masking
Audio-MAE	Huang et al., NeurIPS 2022	MAE on log-mel spectrograms for audio classification
ConvMAE	Gao et al., NeurIPS 2022	Hybrid convolution-Transformer encoder with multi-scale masked reconstruction
Pixel-MAE / SiT / U-MAE	Multiple, 2022 to 2023	Theoretical and architectural refinements
I-JEPA	Assran et al., CVPR 2023	Yann LeCun's group, predicts in representation space rather than pixels
V-JEPA	Bardes et al., 2024	Video extension of I-JEPA, latent prediction over space-time

VideoMAE pushed the masking ratio to 90% to 95% because video is even more redundant than still images: the same patch barely changes from frame to frame, so a low masking ratio leaves the network solving a near-trivial copy task ^[10]. Masked Autoencoders As Spatiotemporal Learners, the FAIR companion paper from Feichtenhofer and colleagues, reached the same conclusion through independent experiments and noted a 4x or larger wall-clock speedup from dropping 90% of the spatiotemporal tokens ^[11]. MultiMAE took the opposite generalisation path by adding modalities at input and output, conditioning on any subset of RGB, depth, and segmentation maps and reconstructing the rest ^[12].
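Tube masking, as used for video, can be sketched as sampling one spatial mask and repeating it along the time axis, so that a masked patch stays hidden in every frame and cannot simply be copied from a neighbouring frame. The helper below is an illustrative assumption, not VideoMAE's code; the name `tube_mask` and the list-of-lists layout are hypothetical.

```python
import random


def tube_mask(num_frames, patches_per_frame, mask_ratio, seed=None):
    """Sample one spatial mask and reuse it for every frame (1 = masked)."""
    rng = random.Random(seed)
    num_masked = int(patches_per_frame * mask_ratio)
    masked = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [1 if i in masked else 0 for i in range(patches_per_frame)]
    # Every temporal slice shares the same spatial pattern, forming "tubes".
    return [list(frame_mask) for _ in range(num_frames)]


# 8 temporal slices of a 14x14 patch grid at a 90% masking ratio.
video_mask = tube_mask(num_frames=8, patches_per_frame=196, mask_ratio=0.9, seed=0)
visible_per_frame = 196 - sum(video_mask[0])
print(visible_per_frame)  # visible patches left in each frame
```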

I-JEPA represents a deliberate move away from pixel reconstruction. Yann LeCun and collaborators argued that asking a network to predict pixels forces it to model irrelevant low-level details (textures, lighting, exact colour) that hurt the abstractness of the learned features. I-JEPA instead masks part of the image and predicts the latent representations of the masked region produced by an EMA-teacher copy of the encoder, which is closer in spirit to data2vec ^[13]. V-JEPA extends the recipe to video ^[14].

What are MAE's strengths?

MAE has become the standard masked-pretraining baseline for several reasons. The asymmetric encoder-decoder is the most cited contribution: it gives genuine compute savings during pretraining, not just smaller models. The recipe is also notable for its simplicity. There are no negative samples, no reliance on large batches to supply those negatives, no momentum encoders, and no elaborate augmentation pipeline; a practitioner only needs an autoencoder-style training loop and a Vision Transformer. Scaling behaviour is strong: gains over supervised pretraining grow with model size, and ViT-H benefits more than ViT-B. Transfer to detection and segmentation is consistently better than the supervised baseline by several points of AP or mIoU ^[1].

Unlike contrastive methods, MAE does not require careful tuning of the augmentation distribution to avoid collapse. The training signal is grounded in the input itself rather than a relationship between two views, so there is no representational collapse to a constant function and no need for stop-gradient tricks or asymmetric architectures specifically to prevent collapse.

What are MAE's limitations?

The weakness most often cited is that MAE features need fine-tuning to reach their full accuracy. Linear-probe accuracy on ImageNet-1K is roughly 10 percentage points below fine-tuning accuracy for a ViT-L MAE, while contrastive methods like MoCo v3 and DINO have a smaller gap. For applications that need ready-to-use features without further training, contrastive embeddings often work better.

Pixel-space targets are not necessarily the right objective for capturing high-level semantics. A network that predicts pixels has to model irrelevant details such as exact colours, textures, and lighting, even though downstream tasks may not care about any of those. data2vec, I-JEPA, and V-JEPA replace the pixel target with a latent target precisely to sidestep this issue ^[9]^[13]^[14].

MAE also fits most naturally with a Transformer-based backbone. Convolutional networks do not handle mask tokens or random patch dropping cleanly, and convolutional MAE variants tend to involve hierarchical or hybrid architectures that complicate the recipe. The follow-up ConvMAE and MCMAE works show that this is solvable but at the cost of the simplicity that made the original MAE attractive.

Finally, the high masking ratio that makes MAE efficient on natural images is not universal. Domains with less spatial redundancy, such as medical imagery with sparse lesions or scientific images with localised structure, may need different ratios; the optimal ratio in the original paper is empirical and image-dependent rather than a universal constant.

Influence

MAE quickly became one of the most cited self-supervised vision papers of its era, accumulating thousands of Google Scholar citations within a few years of release. The asymmetric encoder-decoder pattern, where the heavy backbone sees a small fraction of inputs and a small head reconstructs the rest, has been borrowed by data2vec 2.0, VideoMAE, MAE-ST, and several latent-prediction methods. The very high masking ratio was pushed to even more extreme levels for video, raising questions about how to think about information density in different modalities.

More broadly, MAE established a generative pretraining objective for vision that finally rivalled contrastive methods at scale, and it did so with a simpler training recipe. The contrast with the iGPT pixel-prediction work of 2020 is particularly stark: MAE matched or exceeded iGPT's representation quality with orders of magnitude less compute, mostly because of the asymmetric encoder and the high masking ratio. After MAE, masked image modelling became one of two dominant pretraining paradigms for vision Transformers, alongside contrastive methods like DINO and DINOv2. The paper is paired closely with the Vision Transformer line as part of the basic toolkit for modern computer vision research, and the official Facebook AI Research code release at github.com/facebookresearch/mae is one of the most forked self-supervised learning repositories on GitHub.

MAE is part of the broader autoencoder family, which also includes the classical denoising autoencoder, sparse autoencoder, and variational autoencoder. What MAE shares with these is the encode-decode-reconstruct pattern; what it adds is the recognition that for high-dimensional natural images, masking 75% of the input and using a deliberately asymmetric architecture turns the bottleneck of reconstruction into a useful pretraining signal at very large scale.

References

1. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 16000-16009. arXiv:2111.06377. https://arxiv.org/abs/2111.06377
2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR). ICML 2020. arXiv:2002.05709. https://arxiv.org/abs/2002.05709
3. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo). CVPR 2020. arXiv:1911.05722. https://arxiv.org/abs/1911.05722
4. Grill, J.B., et al. (2020). "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" (BYOL). NeurIPS 2020. arXiv:2006.07733. https://arxiv.org/abs/2006.07733
5. Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., Joulin, A. (2021). "Emerging Properties in Self-Supervised Vision Transformers" (DINO). ICCV 2021. arXiv:2104.14294. https://arxiv.org/abs/2104.14294
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805. https://arxiv.org/abs/1810.04805
7. Bao, H., Dong, L., Piao, S., Wei, F. (2022). "BEiT: BERT Pre-Training of Image Transformers." ICLR 2022. arXiv:2106.08254. https://arxiv.org/abs/2106.08254
8. Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H. (2022). "SimMIM: A Simple Framework for Masked Image Modeling." CVPR 2022. arXiv:2111.09886. https://arxiv.org/abs/2111.09886
9. Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M. (2022). "data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language." ICML 2022. arXiv:2202.03555. https://arxiv.org/abs/2202.03555
10. Tong, Z., Song, Y., Wang, J., Wang, L. (2022). "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS 2022. arXiv:2203.12602. https://arxiv.org/abs/2203.12602
11. Feichtenhofer, C., Fan, H., Li, Y., He, K. (2022). "Masked Autoencoders As Spatiotemporal Learners." NeurIPS 2022. arXiv:2205.09113. https://arxiv.org/abs/2205.09113
12. Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A. (2022). "MultiMAE: Multi-modal Multi-task Masked Autoencoders." ECCV 2022. arXiv:2204.01678. https://arxiv.org/abs/2204.01678
13. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N. (2023). "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (I-JEPA). CVPR 2023. arXiv:2301.08243. https://arxiv.org/abs/2301.08243
14. Bardes, A., et al. (2024). "V-JEPA: Latent Video Prediction for Visual Representation Learning." Meta AI. https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
15. Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929. https://arxiv.org/abs/2010.11929
16. Facebook AI Research. "facebookresearch/mae: PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners." GitHub repository. https://github.com/facebookresearch/mae
17. Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y. (2023). "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking." CVPR 2023. arXiv:2303.16727. https://arxiv.org/abs/2303.16727
18. Baevski, A., Babu, A., Hsu, W.N., Auli, M. (2023). "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language" (data2vec 2.0). ICML 2023. arXiv:2212.07525. https://arxiv.org/abs/2212.07525
