Hiera

Updated Jul 16, 2026

Hiera is a hierarchical vision transformer from Meta AI (FAIR), introduced in the paper "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles," presented as an oral at the International Conference on Machine Learning (ICML) 2023.^[1]^[2] Its central claim is that the specialized spatial machinery added to models such as Swin and MViT is not necessary: if a simpler hierarchical model is pretrained with a strong self-supervised pretext task, specifically masked autoencoding (MAE), it can learn the spatial structure on its own, so the extra components can be removed. The authors report that the resulting architecture is simpler, faster, and more accurate than prior work on both image and video recognition.^[1] Hiera later became the image-encoder backbone of Meta's SAM 2 (Segment Anything Model 2).^[3]

Background and motivation

Vision transformers (ViTs), introduced in 2020, split an image into fixed-size patches and process them as a sequence of tokens with self-attention. A plain ViT keeps the same spatial resolution and channel count throughout the network, which the Hiera authors note makes inefficient use of parameters relative to the multi-scale design of convolutional networks like ResNet, where early stages have high spatial resolution but few channels and later stages have low resolution but many channels.^[1]

To bring this hierarchical structure to transformers, models such as Swin (Liu et al., 2021) and MViT (Fan et al., 2021; Li et al., 2022) added vision-specific components: shifted or cross-shaped attention windows, convolutional layers, and decomposed relative position embeddings. These additions manually inject spatial inductive biases (locality, translation behavior) that a plain ViT lacks. The Hiera paper argues that while these modules produce good FLOP counts and accuracy under supervised training, the added complexity makes the models slower in wall-clock terms than a vanilla ViT, because the specialized operations map poorly onto hardware.^[1]

The authors' alternative is to let the model learn spatial biases rather than build them in. MAE (He et al., 2022) trains a network to reconstruct masked image patches, which has been shown to teach plain ViTs spatial reasoning useful for downstream detection and segmentation. MAE is also "sparse": it deletes masked tokens instead of replacing them with mask tokens, so the encoder processes only the visible tokens and pretraining can run roughly 4 to 10 times faster than ordinary supervised training. Hiera's thesis is that with MAE as the teacher, the spatial-bias modules can be stripped out without losing accuracy.^[1]

Architecture

Hiera is built by taking MViTv2 as the starting point and removing the non-essential components, then pretraining the simplified model with MAE.^[1] The authors chose MViTv2 because its small 3x3 convolution kernels are the least disrupted by the engineering trick needed to make MAE compatible with hierarchical models, though they state a different transformer would likely have given a similar end result.^[1]

A practical obstacle is that MAE deletes masked tokens, which breaks the rigid 2D grid that convolutions and pooling in hierarchical models depend on. Hiera resolves this by distinguishing tokens from "mask units." Tokens are the internal resolution of the model (4x4 pixel patches), while masking is applied at the coarser scale of a mask unit, 32x32 pixels, equal to 8x8 tokens at the start of the network. Treating each mask unit as a contiguous block lets the model run sparse MAE on a hierarchical backbone.^[1]
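The token and mask-unit bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the official implementation: the function name `sample_visible_tokens`, the seed, and the `mask_ratio` value of 0.6 are assumptions made for the example, while the 4-pixel token and 32-pixel mask unit come from the text.

```python
import random

PATCH = 4        # pixels per token side (from the text)
MASK_UNIT = 32   # pixels per mask-unit side (from the text)


def sample_visible_tokens(image_size=224, mask_ratio=0.6, seed=0):
    """Keep a random subset of mask units and list the tokens they cover.

    mask_ratio and seed are illustrative assumptions, not values from the article.
    """
    units_per_side = image_size // MASK_UNIT   # 7 for a 224-pixel image
    tokens_per_unit = MASK_UNIT // PATCH       # 8 tokens along each side of a unit
    all_units = [(r, c) for r in range(units_per_side) for c in range(units_per_side)]

    rng = random.Random(seed)
    num_keep = round(len(all_units) * (1 - mask_ratio))
    kept_units = sorted(rng.sample(all_units, num_keep))

    # Each kept unit contributes a complete 8x8 block of tokens, so the
    # surviving tokens still form small dense grids that pooling can act on.
    visible = []
    for unit_row, unit_col in kept_units:
        for dr in range(tokens_per_unit):
            for dc in range(tokens_per_unit):
                visible.append((unit_row * tokens_per_unit + dr,
                                unit_col * tokens_per_unit + dc))
    return kept_units, visible


kept_units, visible = sample_visible_tokens()
print(len(kept_units), "mask units kept;", len(visible), "tokens visible")
```

Because masking is decided per unit rather than per token, every visible token keeps all of its neighbours inside the same 8x8 block, which is what lets the later pooling steps operate on the sparse input.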

Starting from MViTv2-L, the paper removes components one at a time and confirms that accuracy holds (Table 1 of the paper):^[1]

Step	Change from MViTv2
1	Replace decomposed relative position embeddings with absolute position embeddings
2	Replace convolutions with maxpooling, then delete the extra stride-1 maxpools
3	Remove overlap by setting each maxpool kernel size equal to its stride
4	Remove the attention residual connection
5	Replace the pooling attention used in the first two stages with "Mask Unit Attention" (local attention within a mask unit)
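Step 3 can be made concrete with a one-dimensional sketch. The code below lists which input tokens each pooling window reads and checks whether any window straddles two mask units. It is a simplified illustration under stated assumptions: the helper names, the 16-token row, and the overlapping kernel-3, stride-2, padding-1 setting are hypothetical choices for the example, not settings quoted from the paper.

```python
def pooling_windows(length, kernel, stride, padding):
    """Return, for each pooling output, the in-range input indices it reads."""
    windows = []
    for start in range(-padding, length + padding - kernel + 1, stride):
        windows.append([i for i in range(start, start + kernel) if 0 <= i < length])
    return windows


def crosses_unit_boundary(window, unit=8):
    """True if the window touches tokens from more than one mask unit."""
    return len({i // unit for i in window}) > 1


row = 16  # two adjacent mask units of 8 tokens each (1D simplification)

overlapping = pooling_windows(row, kernel=3, stride=2, padding=1)
non_overlapping = pooling_windows(row, kernel=2, stride=2, padding=0)

print("kernel 3, stride 2 crosses a unit boundary:",
      any(crosses_unit_boundary(w) for w in overlapping))
print("kernel 2, stride 2 crosses a unit boundary:",
      any(crosses_unit_boundary(w) for w in non_overlapping))
```

When the kernel equals the stride, every window stays inside one mask unit, so pooling never needs values from a neighbouring unit that may have been deleted by masking.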

The result is described by the authors as a model "with no bells-and-whistles: no convolutions, no shifted or cross-shaped windows, no decomposed relative position embeddings. Just a pure, simple hierarchical ViT."^[1] Hiera consists entirely of standard ViT blocks. It keeps its hierarchy through query (Q) pooling at stage transitions, where a linear layer doubles the channel width and a 2x2 maxpool halves each spatial dimension. The first two stages use local Mask Unit Attention, and stages 3 and 4 use global attention. Mask Unit Attention differs from Swin-style window attention in that its window always matches the mask unit at the current resolution, so it covers fewer tokens after each downsample; a fixed-size window would instead leak into deleted tokens after a downsample.^[1]
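The stage layout can be tabulated with a short sketch. It assumes an unmasked 224x224 input and the Hiera-B channel widths from the model table, and it counts query-key pairs as a rough proxy for attention cost. The function `stage_schedule` and its field names are hypothetical, and the real model applies pooling inside the first block of each new stage, which this simplification ignores.

```python
def stage_schedule(image_size=224, patch=4, mask_unit_px=32,
                   base_channels=96, num_stages=4, local_stages=2):
    """Tabulate token grid, channel width, and attention scope for each stage."""
    side = image_size // patch      # token grid side at stage 1
    unit = mask_unit_px // patch    # mask-unit side in tokens at stage 1
    channels = base_channels
    rows = []
    for stage in range(1, num_stages + 1):
        tokens = side * side
        if stage <= local_stages:
            # Mask Unit Attention: a token attends only within its own unit.
            scope, keys_per_query = "mask unit", unit * unit
        else:
            # Global attention: a token attends to every token.
            scope, keys_per_query = "global", tokens
        rows.append({"stage": stage, "grid": side, "channels": channels,
                     "unit": unit, "scope": scope,
                     "pairs": tokens * keys_per_query})
        # Stage transition: 2x2 maxpool halves each side, a linear layer
        # doubles the channel width, and the mask unit shrinks with the grid.
        side //= 2
        unit = max(unit // 2, 1)
        channels *= 2
    return rows


for row in stage_schedule():
    print(row)
```

Under these assumptions, the first stage has 3,136 tokens but each one attends to only the 64 tokens of its own mask unit, which is why local attention is used where the token count is largest.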

With these changes, Hiera-L is reported to be 2.4 times faster on images and 5.1 times faster on video than the MViTv2-L it started from. It is also more accurate, which the authors attribute to MAE pretraining. Training is about 3 times faster than for a supervised MViTv2-L on images and 2.1 times faster on video.^[1]

Model sizes

The paper defines six configurations spanning roughly 28M to 673M parameters. FLOPs are measured for image classification at 224x224 resolution.^[1]

Model	Parameters	FLOPs (G)	Channels	Blocks per stage
Hiera-T (Tiny)	28M	5	96-192-384-768	1-2-7-2
Hiera-S (Small)	35M	6	96-192-384-768	1-2-11-2
Hiera-B (Base)	52M	9	96-192-384-768	2-3-16-3
Hiera-B+ (Base-Plus)	70M	13	112-224-448-896	2-3-16-3
Hiera-L (Large)	214M	40	144-288-576-1152	2-6-36-4
Hiera-H (Huge)	673M	125	256-512-1024-2048	2-6-36-4

The Base-Plus variant was introduced specifically to allow a direct comparison against prior work whose Base-size models were slower.^[1]
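The configurations in the table can be captured as plain data, which makes two regularities easy to check: channel width doubles at every stage, and most blocks sit in stage 3. The `HieraVariant` dataclass and the helper `channels_double` are hypothetical names for this sketch; the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HieraVariant:
    name: str
    params_m: int      # parameters, in millions
    gflops: int        # FLOPs (G) at 224x224
    channels: tuple    # channel width per stage
    blocks: tuple      # transformer blocks per stage

    @property
    def depth(self):
        """Total number of transformer blocks across all stages."""
        return sum(self.blocks)


VARIANTS = [
    HieraVariant("Hiera-T", 28, 5, (96, 192, 384, 768), (1, 2, 7, 2)),
    HieraVariant("Hiera-S", 35, 6, (96, 192, 384, 768), (1, 2, 11, 2)),
    HieraVariant("Hiera-B", 52, 9, (96, 192, 384, 768), (2, 3, 16, 3)),
    HieraVariant("Hiera-B+", 70, 13, (112, 224, 448, 896), (2, 3, 16, 3)),
    HieraVariant("Hiera-L", 214, 40, (144, 288, 576, 1152), (2, 6, 36, 4)),
    HieraVariant("Hiera-H", 673, 125, (256, 512, 1024, 2048), (2, 6, 36, 4)),
]


def channels_double(variant):
    """Check that each stage has twice the channel width of the previous one."""
    return all(b == 2 * a for a, b in zip(variant.channels, variant.channels[1:]))


for v in VARIANTS:
    stage3_share = v.blocks[2] / v.depth
    print(f"{v.name}: depth={v.depth}, stage-3 share={stage3_share:.0%}, "
          f"doubling={channels_double(v)}")
```

Hiera-B and Hiera-B+ share the same block layout and differ only in width, as do Hiera-L and Hiera-H.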

Results

Image classification

On ImageNet-1K, all Hiera image models take 224x224 inputs. The following top-1 accuracies and A100 fp16 throughput are reported by the official repository and match the paper's comparison table.^[1]^[4]

Model	ImageNet-1K top-1	A100 fp16 speed (im/s)
Hiera-T	82.8%	2758
Hiera-S	83.8%	2211
Hiera-B	84.5%	1556
Hiera-B+	85.2%	1247
Hiera-L	86.1%	531
Hiera-H	86.9%	274

The paper reports that on images Hiera is faster and more accurate than recent state-of-the-art MAE-based work at every scale, offering a 30 to 40 percent speed-up over the best comparable model.^[1] At the default Large scale, Hiera-L reaches 86.1 percent, which the authors describe as a 0.8-point gain over MViTv2-L and a 0.2-point gain over a ViT-L MAE that is 42 percent larger and has 1.6 times the FLOPs.^[1] Even Hiera-B (84.5 percent), which has no spatial-bias modules, slightly outperforms a supervised MViTv2-B that uses convolutions.^[1]

Video recognition

Hiera applies the same design to video, where a mask unit corresponds to 2 frames of 32x32 pixels. Video models process 16 frames at 224x224 with temporal stride 4. On Kinetics-400 (K400) the reported top-1 accuracies (3 spatial crops by 5 temporal clips) and A100 fp16 throughput are:^[1]^[4]
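The video mask-unit count follows directly from the clip and unit sizes given above. The short sketch below works through the arithmetic; the function names are hypothetical, and the mask ratio of 0.9 passed in the usage line is an illustrative assumption rather than a value stated in this article.

```python
def video_mask_units(frames=16, height=224, width=224, unit_frames=2, unit_px=32):
    """Count the spatiotemporal mask units in one clip."""
    return (frames // unit_frames) * (height // unit_px) * (width // unit_px)


def visible_units(total_units, mask_ratio):
    """Number of mask units left after deleting a fraction mask_ratio of them."""
    return round(total_units * (1 - mask_ratio))


total = video_mask_units()
print("mask units per clip:", total)
# mask_ratio=0.9 is an assumed value for illustration only.
print("visible units at an assumed 0.9 mask ratio:", visible_units(total, 0.9))
```

A 16-frame clip yields 8 temporal positions and a 7x7 spatial grid of units, so the sparse encoder sees only a small fraction of the clip during pretraining when the mask ratio is high.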

Model	K400 top-1	A100 fp16 speed (clip/s)
Hiera-B	84.0%	133.6
Hiera-B+	85.0%	84.1
Hiera-L	87.3%	40.8
Hiera-H	87.8%	20.9

On video the authors describe Hiera as establishing a new class of performance. Hiera-L brings a 2.1-point gain over the previous state of the art on K400 while using about 45 percent fewer FLOPs, being about 43 percent smaller, and running 2.3 times faster.^[1] The paper reports further gains on related video benchmarks: Hiera-L reaches 88.3 percent and Hiera-H 88.8 percent on Kinetics-600; Hiera-L reaches 80.3 percent and Hiera-H 81.1 percent on Kinetics-700; and on Something-Something-v2 (SSv2), the Hiera-L_32 variant reaches 76.5 percent.^[1] The authors also report improvements transferring to action detection on AVA v2.2.^[1]

Detection and transfer

For object detection and instance segmentation on COCO, the authors fine-tune Mask R-CNN with Hiera backbones and a Feature Pyramid Network. Hiera-L is reported at +1.8 box AP over MViTv2-L with a 24 percent reduction in inference time, illustrating the model's hierarchical, multi-scale features in downstream use.^[1] Transfer-learning experiments on iNaturalist and Places datasets show Hiera-L and Hiera-H consistently outperforming MAE-pretrained ViT.^[1]

Use in SAM 2

Hiera's most prominent downstream use is as the image-encoder backbone of SAM 2 (Segment Anything Model 2), the promptable image-and-video segmentation model released by Meta in 2024.^[3] The SAM 2 paper states directly: "We use an MAE pre-trained Hiera image encoder, which is hierarchical, allowing us to use multiscale features during decoding."^[3] Whereas the original Segment Anything Model used a plain ViT encoder, SAM 2 takes advantage of Hiera's multi-scale stages: stride-16 and stride-32 features from stages 3 and 4 are fused by a feature pyramid network into the per-frame image embeddings, while higher-resolution stride-4 and stride-8 features from stages 1 and 2 are fed into upsampling layers in the mask decoder to recover fine segmentation detail.^[3] SAM 2 ships in four sizes built on the corresponding Hiera variants (Tiny, Small, Base-Plus, and Large).^[3] This adoption is a notable validation of Hiera's argument that a simple, MAE-pretrained hierarchical backbone is well suited to dense prediction.
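The routing of Hiera's four feature scales in SAM 2 can be summarized in a small sketch. The strides and destinations follow the description above; the 1024-pixel input size, the function name `route_features`, and the destination labels are assumptions for illustration and are not SAM 2 API names.

```python
# Output stride of each Hiera stage, as described in the text.
STAGE_STRIDES = {1: 4, 2: 8, 3: 16, 4: 32}


def route_features(image_size=1024):
    """Map each stage to (stride, feature-map side, destination).

    image_size=1024 is an assumed input resolution for this example.
    """
    routes = {}
    for stage, stride in STAGE_STRIDES.items():
        side = image_size // stride
        if stage >= 3:
            # Coarse features are fused by the FPN into the image embedding.
            destination = "fpn_image_embedding"
        else:
            # Fine features feed the mask decoder's upsampling layers.
            destination = "mask_decoder_upsampling"
        routes[stage] = (stride, side, destination)
    return routes


for stage, (stride, side, destination) in route_features().items():
    print(f"stage {stage}: stride {stride}, {side}x{side} map -> {destination}")
```

A plain ViT encoder produces a single stride-16 map, so this split between coarse embedding features and fine decoder features is only available with a hierarchical backbone.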

Availability

The code and pretrained models are released by Meta under the facebookresearch/hiera repository on GitHub.^[4] Hiera is also integrated into the Hugging Face Transformers library, where it is exposed through HieraModel, HieraForPreTraining, and HieraForImageClassification classes, with public checkpoints such as facebook/hiera-base-224 and facebook/hiera-tiny-224-mae-hf. The Transformers integration was contributed by community members Eduardo Pacheco and Naman Garg.^[5]

Authors

The paper's authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Several authors, including Bolya, Malik, Li, and Feichtenhofer, are affiliated with Meta AI / FAIR, with academic collaborators at Georgia Tech and Johns Hopkins University. Ryali, Hu, Bolya, Wei, Li, and Feichtenhofer are listed as equal contributors.^[1]

References

1. Ryali, C., Hu, Y.-T., Bolya, D., Wei, C., Fan, H., Huang, P.-Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., Feichtenhofer, C. "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles." arXiv:2306.00989, June 1, 2023. https://arxiv.org/abs/2306.00989
2. ICML 2023 Oral listing, "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles." https://icml.cc/virtual/2023/oral/25563
3. Ravi, N., et al. "SAM 2: Segment Anything in Images and Videos." arXiv:2408.00714, August 2024. https://arxiv.org/abs/2408.00714
4. facebookresearch/hiera GitHub repository. https://github.com/facebookresearch/hiera
5. "Hiera" model documentation, Hugging Face Transformers. https://huggingface.co/docs/transformers/main/model_doc/hiera
