Masked autoencoder (MAE)
Last reviewed
Apr 30, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,734 words
Masked autoencoder (MAE) is a self-supervised learning approach for vision transformers introduced by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick at Facebook AI Research (FAIR) in November 2021. The method, presented in the paper "Masked Autoencoders Are Scalable Vision Learners" (arXiv:2111.06377), randomly masks a high fraction of the patches in an input image (typically 75%) and trains the network to reconstruct the missing pixels from the small set of visible patches alone [1]. The paper was accepted to CVPR 2022, where it was nominated for a Best Paper Award, and the architecture has become one of the standard reference points for pretraining image models without labels.
MAE is directly inspired by BERT's masked language modelling for text, but adapted with two design choices that make pixel-level reconstruction practical at scale: a very high masking ratio that exploits the natural redundancy of pixels, and an asymmetric encoder-decoder in which the heavy Vision Transformer encoder sees only the visible 25% of patches while a lightweight decoder handles the full sequence including learnable mask tokens [1]. Together these choices give a roughly 3x to 4x speedup over masked image modelling baselines that feed mask tokens through the encoder, while improving downstream accuracy. A vanilla ViT-Huge pre-trained with MAE on ImageNet-1K reached 87.8% top-1 fine-tuning accuracy at 448 input resolution, the best result for ImageNet-1K-only methods at the time of publication [1].
Before MAE, the two main approaches to large-scale self-supervised learning for vision were contrastive learning and reconstruction-based (generative) pretraining. Contrastive methods such as SimCLR (Chen et al. 2020), MoCo and MoCo v3 (He et al. 2020, Chen et al. 2021), BYOL (Grill et al. 2020), and DINO (Caron et al. 2021) trained a network to map two augmented views of the same image to similar embeddings while pushing different images apart [2][3][4][5]. These methods produced strong linear-probe features but relied on heavy data augmentation pipelines, large batch sizes, momentum encoders, or careful negative-sample mining. Earlier reconstruction-based methods, such as the denoising autoencoders of Vincent et al. (2008) and the iGPT pixel-prediction approach of Chen et al. (2020), showed that generative pretraining was possible but trailed contrastive methods on standard benchmarks.
In natural language, the BERT recipe of randomly masking 15% of input tokens and predicting them from context had become the dominant pretraining objective and reliably scaled to billion-parameter models [6]. Researchers had asked for years why the same idea did not transfer cleanly to vision. He and colleagues argued that three differences explain the gap: language has discrete tokens that act as natural prediction targets, while images are continuous; convolutional networks did not have an obvious way to incorporate mask tokens or positional information for missing patches; and image patches are far more redundant than words, so masking 15% leaves a task that the network can solve through trivial interpolation rather than learning semantics [1]. The arrival of the Vision Transformer (Dosovitskiy et al. 2020) removed the architectural barrier by treating an image as a sequence of patch embeddings, and MAE addresses the redundancy issue by raising the masking ratio to 75% so that reconstruction requires understanding the global content of the image rather than copying nearby pixels.
A wave of related papers appeared at almost the same time. BEiT (Bao et al., ICLR 2022) trained a ViT to predict discrete visual tokens produced by a frozen DALL-E dVAE tokenizer, applying a BERT-style cross-entropy loss over the predicted token codes [7]. SimMIM (Xie et al., CVPR 2022) from Microsoft Research demonstrated independently that simple raw-pixel regression with a one-layer linear head and random patch masking also worked, with similar fine-tuning accuracy [8]. data2vec (Baevski et al., ICML 2022) generalized the recipe to speech, vision, and language by predicting a teacher network's contextualized latent representations rather than pixels or tokens [9]. MAE's distinguishing claim was simplicity and efficiency: pixels as targets, vanilla ViT encoder, very high masking ratio, and a decoder small enough to be effectively free.
The MAE pipeline operates in five stages. The image is first split into a regular grid of non-overlapping patches (16x16 pixels for a standard 224x224 ViT), each linearly projected into a token. A random subset of these tokens, typically 75% of them, is then sampled uniformly without replacement and dropped. The encoder, a vanilla Vision Transformer, processes only the remaining visible 25%. After encoding, the visible token embeddings are placed back at their original positions in the sequence and a single learned mask token (with positional embedding) fills every masked position. This full sequence is passed through a small decoder that predicts the original pixel values for every masked patch. After pretraining, the decoder is discarded and only the encoder is kept for downstream tasks [1].
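The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the official implementation: the function name `random_masking`, the seed, and the use of index lists in place of tensors are assumptions made for readability.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets (uniform, without replacement)."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                    # random permutation of patch indices
    ids_keep = sorted(order[:num_keep])   # visible patches, fed to the encoder
    ids_mask = sorted(order[num_keep:])   # masked patches, reconstructed by the decoder
    return ids_keep, ids_mask

# A 224x224 image with 16x16 patches gives a 14x14 grid of 196 patches.
grid = 224 // 16
num_patches = grid * grid
ids_keep, ids_mask = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(num_patches, len(ids_keep), len(ids_mask))  # 196 49 147
```

With the default 75% ratio the encoder therefore processes 49 of the 196 patch tokens per image.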
The central innovation is the asymmetry between the encoder and decoder, summarised in the table below.
| Component | Architecture | Sees mask tokens? | Tokens processed | Role |
|---|---|---|---|---|
| Encoder | Vanilla ViT-B / ViT-L / ViT-H | No | Only the 25% visible patches | Heavy feature extractor, the only part kept after pretraining |
| Decoder | Small ViT (default 8 blocks, 512-dim, 16 heads) | Yes | Full sequence (visible embeddings + mask tokens) | Lightweight reconstruction head, discarded after pretraining |
Because the encoder only sees the visible quarter of the patches, its compute drops dramatically. The self-attention term in a ViT scales quadratically with the number of input tokens, while the projection and MLP layers scale linearly, so cutting tokens from N to N/4 cuts attention compute by roughly 16x and total compute by 3x to 4x in practice [1]. The decoder is small (about 9% of the FLOPs per token of a ViT-L encoder) and only needs to reconstruct pixels well enough to provide a useful learning signal, not produce a representation for downstream tasks. This is what makes pretraining a 632 million parameter ViT-Huge on ImageNet-1K tractable on an ordinary GPU cluster.
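A back-of-the-envelope calculation makes the token-count argument concrete. The cost model below is an assumed textbook estimate of multiply-accumulates per Transformer block (projections and MLP linear in tokens, attention quadratic in tokens), not a figure from the paper; the width of 1024 corresponds to ViT-L, and the class token is ignored.

```python
def block_macs(num_tokens, dim, mlp_ratio=4):
    """Rough multiply-accumulate count for one Transformer block (assumed estimate)."""
    # QKV and output projections (4 * N * d^2) plus a two-layer MLP (2 * mlp_ratio * N * d^2)
    linear = (4 + 2 * mlp_ratio) * num_tokens * dim * dim
    # QK^T scores and the attention-weighted sum of values (2 * N^2 * d)
    attention = 2 * num_tokens * num_tokens * dim
    return linear, attention

full_linear, full_attn = block_macs(196, 1024)   # all patches of a 224x224 image
vis_linear, vis_attn = block_macs(49, 1024)      # only the visible 25%

attn_ratio = full_attn / vis_attn
total_ratio = (full_linear + full_attn) / (vis_linear + vis_attn)
print(attn_ratio)             # 16.0
print(round(total_ratio, 2))  # slightly above 4 under this cost model
```

Under these assumptions the quadratic term shrinks 16-fold, but at this sequence length it is a small share of the block, so the encoder saving is close to 4x. Adding the decoder's cost back brings the overall figure toward the 3x to 4x range quoted above.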
The mask token is a single learned vector that is shared across all masked positions. Positional embeddings are added to every token (visible or masked) before the decoder, so the network knows where each masked patch belongs in the original grid. Without positional embeddings, the decoder would have no way to tell the identical mask tokens apart, and reconstruction would fail.
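How the decoder input is assembled can be shown with a toy example. The helper name `build_decoder_input` and the tiny two-dimensional embeddings are hypothetical. The sketch also omits the linear projection from encoder width to decoder width that a real model would apply first.

```python
def build_decoder_input(encoded_visible, ids_keep, num_patches, mask_token, pos_embed):
    """Restore the full sequence: visible tokens at their original slots, mask token elsewhere."""
    full = [mask_token] * num_patches            # every slot starts as the shared mask token
    for token, idx in zip(encoded_visible, ids_keep):
        full[idx] = token                        # put each encoded patch back in place
    # Add a positional embedding to every token, visible or masked.
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(full, pos_embed)]

# Toy setting: 4 patches, 1 visible (25%), embedding width 2.
mask_token = [0.5, -0.5]
pos_embed = [[float(i), 0.0] for i in range(4)]
decoder_in = build_decoder_input([[1.0, 1.0]], [2], 4, mask_token, pos_embed)
print(decoder_in)  # [[0.5, -0.5], [1.5, -0.5], [3.0, 1.0], [3.5, -0.5]]
```

The three masked slots start from the same vector and differ only through their positional embeddings, which is what lets the decoder produce a different prediction for each location.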
A crucial empirical observation in the paper is that putting mask tokens through the encoder (as BEiT and earlier masked-image-modelling papers did) hurts both speed and accuracy. The reason is partly distributional: at pretraining time the encoder sees a mix of real patches and mask tokens, but at downstream fine-tuning it only ever sees real patches, creating a distribution gap. By keeping mask tokens out of the encoder entirely, MAE also closes this gap [1].
The decoder predicts the raw RGB pixel values for each masked patch and the loss is mean squared error (MSE) computed only over the masked positions. The paper shows that normalising the target patch by its own mean and standard deviation before computing MSE improves representation quality, presumably because the network is forced to predict relative structure rather than absolute brightness. Predicting visible patches in addition to masked ones gives no extra benefit and slightly hurts results, which suggests the masked positions are where the useful learning signal lives [1].
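The loss described above can be sketched as follows. This is an illustrative version under stated assumptions: patches are flat lists of pixel values, the population variance is used, and the `eps` value and function names are choices made here rather than details taken from the paper.

```python
import math

def normalize_patch(pixels, eps=1e-6):
    """Normalise one patch by its own mean and standard deviation."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / math.sqrt(var + eps) for p in pixels]

def masked_mse(pred, target, is_masked, norm_pix=True):
    """Mean squared error averaged over masked patches only."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, is_masked):
        if not m:
            continue                      # visible patches contribute nothing to the loss
        if norm_pix:
            t = normalize_patch(t)        # predict relative structure, not absolute brightness
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
        count += 1
    return total / max(count, 1)

target = [[0.0, 0.0], [1.0, 3.0]]         # two toy patches of two pixels each
pred = [[9.0, 9.0], [-1.0, 1.0]]          # first prediction is badly wrong but unmasked
is_masked = [False, True]

loss_norm = masked_mse(pred, target, is_masked)
loss_raw = masked_mse(pred, target, is_masked, norm_pix=False)
print(loss_raw)  # 4.0
```

In the toy example the prediction matches the normalised target almost exactly, so `loss_norm` is near zero even though the raw-pixel loss is 4.0, and the large error on the visible patch is ignored in both cases.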
The paper compares pixel targets to two alternatives: tokenized targets in the style of BEiT's dVAE codebook, and PCA coefficients of the patch. The simpler pixel target is competitive with or better than the tokenized one once normalisation is applied, which is one of the paper's main empirical messages.
The optimal masking ratio for natural images sits around 75%. Lower ratios make the task too easy because patches are highly correlated locally, so the network can copy nearby visible pixels; higher ratios destroy too much structure. Ablations in the paper show that 75% works well for both fine-tuning and linear probing, with fine-tuning accuracy being less sensitive to the exact ratio than linear-probe accuracy [1]. This is much higher than BERT's 15% for text, which is consistent with the intuition that natural images contain more redundancy at the patch level than language does at the token level.
The canonical MAE training recipe on ImageNet-1K is short and uses surprisingly little augmentation, in contrast to contrastive methods that depend on aggressive colour jittering, blurring, and multi-crop strategies. The default recipe from the official Facebook AI Research code release is:
| Setting | Value |
|---|---|
| Dataset | ImageNet-1K (about 1.28 million images, no labels used during pretraining) |
| Masking ratio | 75% |
| Encoder | ViT-B/16, ViT-L/16, or ViT-H/14 |
| Decoder | 8 Transformer blocks, 512 dim, 16 heads |
| Loss | Per-patch normalised MSE on masked patches only |
| Optimizer | AdamW, betas (0.9, 0.95), weight decay 0.05 |
| Base learning rate | 1.5e-4 (linearly scaled with batch size) |
| Schedule | Cosine decay with 40-epoch linear warmup |
| Batch size | 4096 images |
| Epochs | 800 (default), 1600 (longer setting) |
| Augmentation | Random resized crop and horizontal flip only |
No colour jitter, no blurring, no multi-crop. The simplicity of the augmentation pipeline is one of MAE's practical selling points: a contrastive recipe like MoCo v3 or DINO normally requires a carefully tuned chain of photometric distortions, while MAE works well with the kind of crops a supervised classifier would use.
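The learning-rate entries in the recipe table can be turned into a small schedule function. The sketch assumes the common linear scaling rule with a reference batch size of 256 and a per-epoch (rather than per-iteration) update; the function name `mae_lr` and the `min_lr` parameter are introduced here for illustration.

```python
import math

def mae_lr(epoch, base_lr=1.5e-4, batch_size=4096,
           warmup_epochs=40, total_epochs=800, min_lr=0.0):
    """Linear warmup followed by cosine decay, with the peak scaled by batch size."""
    peak_lr = base_lr * batch_size / 256          # assumed linear scaling rule
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs    # linear warmup from zero
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 20, 40, 420, 800):
    print(epoch, mae_lr(epoch))
```

With a batch size of 4096 the peak learning rate under this rule is 1.5e-4 x 16 = 2.4e-3, reached at epoch 40 and decayed to zero by epoch 800.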
The MAE paper reports results on three ViT scales pretrained on ImageNet-1K and then either linearly probed or fine-tuned on the same dataset and on downstream detection and segmentation benchmarks. Table values are taken directly from He et al. 2022 [1].
| Model | Parameters | Linear probe top-1 | Fine-tune top-1 | Fine-tune at 448 |
|---|---|---|---|---|
| ViT-B/16 | 86 M | 68.0% | 83.6% | - |
| ViT-L/16 | 307 M | 76.0% | 85.9% | - |
| ViT-H/14 | 632 M | 76.6% | 86.9% | 87.8% |
The 87.8% number for ViT-H/14 at 448 input resolution was state of the art for ImageNet-1K-only methods at the time. For context, supervised training of the same ViT-H from scratch on ImageNet-1K reached only 83.1%, so MAE pretraining followed by fine-tuning gave a roughly 4-point boost over the 224-resolution supervised result using the same data and architecture [1].
The gap between linear probing and fine-tuning is wider for MAE than for contrastive methods. DINO, MoCo v3, and SimCLR typically show a considerably smaller gap between linear-probe and fine-tuning accuracy, while MAE's linear probe is 10 to 15 points lower than its fine-tuning result. The interpretation in the paper is that pixel reconstruction encourages the encoder to retain low-level information that is useful when the head is trainable, but is not as immediately linearly separable as features learned by augmentation-invariant contrastive objectives.
Fine-tuning a ViT-L MAE backbone with the Mask R-CNN detection head on COCO produced 53.3 box AP and 47.2 mask AP, beating the supervised pre-training baseline by about 4 box AP and matching or surpassing BEiT and DINO under the same protocol [1]. The improvement was larger for ViT-L than for ViT-B, suggesting that MAE's gains compound with model scale.
Using UperNet as the segmentation head on ADE20K, ViT-L MAE reached 53.6 mIoU, about 4 points above the supervised ViT-L baseline and ahead of BEiT under the same protocol [1].
The table below summarises ImageNet-1K fine-tuning accuracy on ViT-B for several self-supervised methods reported around the time of the MAE paper.
| Method | Year | Family | Target | ViT-B fine-tune top-1 |
|---|---|---|---|---|
| Supervised ViT-B (DeiT) | 2020 | n/a | Class labels | 81.8% |
| MoCo v3 | 2021 | Contrastive | Augmented embeddings | 83.2% |
| DINO | 2021 | Self-distillation | Teacher logits | 82.8% |
| BEiT | 2022 | Masked image modelling | dVAE token IDs | 83.2% |
| SimMIM | 2022 | Masked image modelling | Raw pixels | 83.8% |
| MAE | 2021 | Masked image modelling | Normalised pixels | 83.6% |
| data2vec | 2022 | Masked latent prediction | EMA latent features | 84.2% |
MAE, BEiT, and SimMIM cluster within a percentage point of each other on this benchmark, with MAE distinguishing itself on training efficiency rather than peak ViT-B accuracy. data2vec edges them out by predicting a target that is itself a learned latent representation, which foreshadows the move toward latent prediction in later work.
MAE quickly became a base recipe for adapting masked reconstruction to other modalities and tasks. The table below covers the most cited concurrent and follow-up methods.
| Variant | Authors / venue | Idea |
|---|---|---|
| BEiT v2 | Peng et al., 2022 | Replace dVAE codebook with vector-quantised semantic tokenizer trained by knowledge distillation |
| SimMIM | Xie et al., CVPR 2022 | Equivalent to MAE in spirit, with a one-layer linear prediction head and 32-pixel masked patches |
| data2vec | Baevski et al., ICML 2022 | Predict EMA-teacher latent features instead of pixels; unified across vision, speech, language |
| data2vec 2.0 | Baevski et al., 2023 | Faster, contextualized targets, asymmetric encoder-decoder borrowed from MAE |
| VideoMAE | Tong et al., NeurIPS 2022 | Extends MAE to video with tube masking and 90% to 95% masking ratio |
| Masked Autoencoders As Spatiotemporal Learners (MAE-ST) | Feichtenhofer, Fan, Li, He, NeurIPS 2022 | Independent video MAE from FAIR, 90% masking ratio, vanilla spacetime ViT |
| VideoMAE V2 | Wang et al., CVPR 2023 | Dual masking and scaling to billion-parameter video ViTs |
| MultiMAE | Bachmann et al., ECCV 2022 | Multi-modal pretraining with RGB, depth, and segmentation as targets |
| SatMAE | Cong et al., NeurIPS 2022 | MAE for satellite imagery with temporal and spectral masking |
| Audio-MAE | Huang et al., NeurIPS 2022 | MAE on log-mel spectrograms for audio classification |
| ConvMAE | Gao et al., NeurIPS 2022 | Hybrid convolution-Transformer encoder with multi-scale masked reconstruction |
| Pixel-MAE / SiT / U-MAE | Multiple, 2022 to 2023 | Theoretical and architectural refinements |
| I-JEPA | Assran et al., CVPR 2023 | From Yann LeCun's group; predicts in representation space rather than pixel space |
| V-JEPA | Bardes et al., 2024 | Video extension of I-JEPA, latent prediction over space-time |
VideoMAE pushed the masking ratio to 90% to 95% because video is even more redundant than still images: the same patch barely changes from frame to frame, so a low masking ratio leaves the network solving a near-trivial copy task [10]. Masked Autoencoders As Spatiotemporal Learners, the FAIR companion paper from Feichtenhofer and colleagues, reached the same conclusion through independent experiments and noted a 4x or larger wall-clock speedup from dropping 90% of the spatiotemporal tokens [11]. MultiMAE took the opposite generalisation path by adding modalities at input and output, conditioning on any subset of RGB, depth, and segmentation maps and reconstructing the rest [12].
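Tube masking, the strategy named in the table above for VideoMAE, draws one spatial mask and reuses it for every temporal slice, so a masked patch cannot simply be copied from a neighbouring frame. The sketch below is a minimal illustration; the clip shape (8 temporal slices on a 14x14 grid), the function name, and the seed are assumptions.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a per-frame boolean mask (True = masked) shared across all frames."""
    rng = random.Random(seed)
    num_spatial = grid_h * grid_w
    num_masked = int(round(num_spatial * mask_ratio))
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    frame_mask = [pos in masked_positions for pos in range(num_spatial)]
    # The same spatial pattern is repeated along the time axis, forming "tubes".
    return [list(frame_mask) for _ in range(num_frames)]

mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible_total = sum(not m for frame in mask for m in frame)
print(sum(mask[0]), visible_total)  # 176 160
```

In this toy clip only 160 of the 1,568 space-time tokens reach the encoder, which illustrates where the large wall-clock savings for video come from.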
I-JEPA represents a deliberate move away from pixel reconstruction. Yann LeCun and collaborators argued that asking a network to predict pixels forces it to model irrelevant low-level details (textures, lighting, exact colour) that hurt the abstractness of the learned features. I-JEPA instead masks part of the image and predicts the latent representations of the masked region produced by an EMA-teacher copy of the encoder, which is closer in spirit to data2vec [13]. V-JEPA extends the recipe to video [14].
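The EMA teacher mentioned above is usually maintained as an exponential moving average of the student's weights. The sketch below shows only that update rule on a flat list of parameters. The momentum value of 0.996 and the function name are illustrative assumptions, not settings taken from the I-JEPA paper.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter a small step toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 2.0]
student = [2.0, 0.0]

# With momentum 0.5 the teacher lands halfway between the two sets of weights.
print(ema_update(teacher, student, momentum=0.5))  # [1.0, 1.0]

# With a momentum close to 1 the teacher changes slowly, giving stable prediction targets.
slow = ema_update(teacher, student)
```

Because the teacher receives no gradients and only drifts slowly, its latent outputs can serve as regression targets for the masked regions without the student trivially matching itself.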
MAE has become the standard masked-pretraining baseline for several reasons. The asymmetric encoder-decoder is the most cited contribution: it gives genuine compute savings during pretraining, not just smaller models. The recipe is also notable for its simplicity. There are no negative samples, no large batch sizes, no momentum encoders, and no painful augmentation pipeline; a practitioner only needs an autoencoder-style training loop and a Vision Transformer. Scaling behaviour is strong: gains over supervised pretraining grow with model size, and ViT-H benefits more than ViT-B. Transfer to detection and segmentation is consistently better than the supervised baseline by several points of AP or mIoU [1].
Unlike contrastive methods, MAE does not require careful tuning of the augmentation distribution to avoid collapse. The training signal is grounded in the input itself rather than a relationship between two views, so there is no representational collapse to a constant function and no need for stop-gradient tricks or asymmetric architectures specifically to prevent collapse.
The weakness most often cited is that MAE features need fine-tuning to fully shine. Linear-probe accuracy on ImageNet-1K is roughly 10 percentage points below fine-tuning accuracy for a ViT-L MAE, while contrastive methods like MoCo v3 and DINO have a smaller gap. For applications that need ready-to-use features without further training, contrastive embeddings often work better.
Pixel-space targets are not necessarily the right objective for capturing high-level semantics. A network that predicts pixels has to model irrelevant details such as exact colours, textures, and lighting, even though downstream tasks may not care about any of those. data2vec, I-JEPA, and V-JEPA replace the pixel target with a latent target precisely to sidestep this issue [9][13][14].
MAE also fits naturally only on Transformer-based backbones. Convolutional networks do not handle mask tokens or random patch dropping cleanly, and convolutional MAE variants tend to involve hierarchical or hybrid architectures that complicate the recipe. The follow-up ConvMAE and MCMAE works show that this is solvable but at the cost of the simplicity that made the original MAE attractive.
Finally, the high masking ratio that makes MAE efficient on natural images is not universal. Domains with less spatial redundancy, such as medical imagery with sparse lesions or scientific images with localised structure, may need different ratios; the optimal ratio in the original paper is empirical and image-dependent rather than a universal constant.
MAE quickly became one of the most cited self-supervised vision papers of its era, accumulating tens of thousands of Google Scholar citations within a few years of release. The asymmetric encoder-decoder pattern, where the heavy backbone sees a small fraction of inputs and a small head reconstructs the rest, has been borrowed by data2vec 2.0, VideoMAE, MAE-ST, and several latent-prediction methods. The very high masking ratio idea was confirmed at even more extreme levels for video, raising questions about how to think about information density in different modalities.
More broadly, MAE established a generative pretraining objective for vision that finally rivalled contrastive methods at scale, and it did so with a simpler training recipe. The contrast with the iGPT pixel-prediction work of 2020 is particularly stark: MAE matched or exceeded iGPT's representation quality with orders of magnitude less compute, mostly because of the asymmetric encoder and the high masking ratio. After MAE, masked image modelling became one of two dominant pretraining paradigms for vision Transformers, alongside contrastive methods like DINO and DINOv2. The paper is paired closely with the Vision Transformer line as part of the basic toolkit for modern computer vision research, and the official Facebook AI Research code release at github.com/facebookresearch/mae is one of the most forked self-supervised learning repositories on GitHub.
MAE is part of the broader autoencoder family, which also includes the classical denoising autoencoder, sparse autoencoder, and variational autoencoder. What MAE shares with these is the encode-decode-reconstruct pattern; what it adds is the recognition that for high-dimensional natural images, masking 75% of the input and using a deliberately asymmetric architecture turns the bottleneck of reconstruction into a useful pretraining signal at very large scale.