The Swin Transformer (Shifted Window Transformer) is a hierarchical vision transformer architecture that computes self-attention within local, non-overlapping windows and introduces a shifted window partitioning scheme to enable cross-window connections. It was developed by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo at Microsoft Research Asia. The paper, titled "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," was published at the IEEE/CVF International Conference on Computer Vision (ICCV) 2021, where it received the Marr Prize (Best Paper Award). The name "Swin" is a portmanteau of "Shifted Window."
The Swin Transformer addresses two core challenges that prevented standard transformers from serving as general-purpose vision backbones: the quadratic computational cost of global self-attention when processing high-resolution images, and the lack of multi-scale feature representations needed for dense prediction tasks such as object detection and semantic segmentation. By restricting attention to fixed-size local windows and shifting the window partition across consecutive layers, the architecture achieves linear computational complexity with respect to image size. Its hierarchical design, built through successive patch merging operations, produces feature maps at multiple resolutions (1/4, 1/8, 1/16, and 1/32 of the input), directly compatible with established dense prediction frameworks.
Upon release, the Swin Transformer set new state-of-the-art results on ImageNet classification (87.3% top-1 accuracy), COCO object detection (58.7 box AP and 51.1 mask AP on test-dev), and ADE20K semantic segmentation (53.5 mIoU). With over 14,800 citations as of early 2026, it is one of the most influential computer vision papers of the 2020s and established the shifted window mechanism as a foundational design pattern for efficient vision transformers.
The Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020, showed that a pure transformer architecture could match or exceed convolutional neural networks (CNNs) on image classification when pre-trained on large datasets such as JFT-300M. ViT treats an image as a flat sequence of fixed-size patches and applies global self-attention across all patches. While effective for classification, this design has two significant limitations for use as a general-purpose vision backbone.
First, global self-attention has quadratic computational complexity with respect to the number of image tokens. For an image with n patches, computing the attention matrix requires O(n^2) operations. A 224x224 image with 16x16 patches produces 196 tokens, which is manageable, but increasing the resolution to 800x1333 (typical for object detection) would produce over 4,000 tokens with 16x16 patches, making global attention prohibitively expensive.
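A quick sanity check of these token counts in Python (the helper name `num_patches` is illustrative, not from any library):

```python
# Token counts for ViT-style 16x16 patching at two input resolutions,
# illustrating why global attention becomes expensive at detection scale.
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch tokens (partial patches ignored)."""
    return (height // patch) * (width // patch)

classification_tokens = num_patches(224, 224)   # 14 * 14 = 196
detection_tokens = num_patches(800, 1333)       # 50 * 83 = 4150

print(classification_tokens)  # 196
print(detection_tokens)       # 4150
# Attention-matrix work scales with tokens^2:
print(round(detection_tokens**2 / classification_tokens**2))  # ~448x more work
```

The quadratic term means the roughly 21x increase in token count translates into roughly 450x more attention-matrix computation.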
Second, ViT produces feature maps at a single scale (typically 1/16 of the input resolution). Dense prediction tasks such as object detection and semantic segmentation benefit from multi-scale feature pyramids that capture both fine-grained details and high-level semantics. Adapting ViT for these tasks requires additional architectural modifications or feature interpolation.
DeiT (Data-efficient Image Transformers), published by Touvron et al. in 2021, improved ViT's data efficiency through better training strategies, knowledge distillation, and stronger data augmentation. DeiT demonstrated that ViT could be trained competitively on ImageNet-1K alone, without massive pre-training datasets. However, DeiT retained ViT's fundamental architectural properties: single-scale features and quadratic-complexity global attention.
The Pyramid Vision Transformer (PVT), introduced by Wang et al. in 2021, took a step toward addressing the multi-scale problem by proposing a hierarchical transformer with progressively reduced spatial resolution across stages. PVT used spatial reduction of keys and values to lower the cost of attention, but its complexity remained quadratic (though with a reduced constant factor) within each stage.
The Swin Transformer synthesized the lessons from these prior works. From ViT and DeiT, it inherited the use of self-attention as the core computational primitive. From PVT and classical CNNs such as ResNet, it adopted the hierarchical multi-stage design with progressive downsampling. Its original contribution was the shifted window attention mechanism, which achieved strictly linear complexity while preserving the ability to model cross-window interactions.
The Swin Transformer processes an input image through four hierarchical stages. Each stage consists of a patch merging layer (except Stage 1, which uses the initial patch partition and linear embedding) followed by a sequence of Swin Transformer blocks. The spatial resolution decreases by a factor of 2 at each stage transition, while the channel dimensionality doubles, producing a feature pyramid with resolutions at 1/4, 1/8, 1/16, and 1/32 of the input image.
The input RGB image of size H x W x 3 is first divided into non-overlapping patches of size 4x4. Each patch is treated as a token, producing a grid of H/4 x W/4 tokens. The raw pixel values within each patch are concatenated into a 48-dimensional vector (4 x 4 x 3 = 48). A linear embedding layer then projects these vectors to a C-dimensional space, where C is the base channel number that varies by model size (96 for Swin-T and Swin-S, 128 for Swin-B, 192 for Swin-L).
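The patch partition and linear embedding can be sketched in NumPy as follows (the projection matrix is random here; in the model it is learned):

```python
import numpy as np

# Sketch of patch partition + linear embedding for a 224x224 RGB input
# with 4x4 patches and C = 96 (the Swin-T base channel number).
H, W, C = 224, 224, 96
image = np.random.rand(H, W, 3)

# Split into non-overlapping 4x4 patches and flatten each to 4*4*3 = 48 dims.
patches = image.reshape(H // 4, 4, W // 4, 4, 3)           # (56, 4, 56, 4, 3)
tokens = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 48)  # (3136, 48)

# Linear embedding of each 48-dim patch vector to C dimensions.
W_embed = np.random.rand(48, C)
embedded = tokens @ W_embed                                # (3136, 96)

print(tokens.shape)    # (3136, 48)
print(embedded.shape)  # (3136, 96)
```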
The first stage applies a sequence of Swin Transformer blocks to the H/4 x W/4 tokens at dimension C. The spatial resolution remains H/4 x W/4 throughout Stage 1. The number of blocks depends on the model variant (2 blocks for all standard configurations).
Between consecutive stages, a patch merging layer reduces the spatial resolution by a factor of 2 in each dimension. The process works as follows:

- The features of each group of 2x2 neighboring tokens are concatenated, producing tokens of dimension 4C.
- A layer normalization is applied to the concatenated 4C-dimensional features.
- A linear layer projects the 4C-dimensional features down to 2C dimensions.
This halves both the height and width of the feature map while doubling the channel count. After merging between Stages 1 and 2, the resolution becomes H/8 x W/8 with 2C channels. After merging between Stages 2 and 3, it becomes H/16 x W/16 with 4C channels. After the final merging between Stages 3 and 4, the resolution is H/32 x W/32 with 8C channels.
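A minimal NumPy sketch of patch merging, assuming a random projection in place of the learned linear layer (and omitting the layer normalization for brevity):

```python
import numpy as np

# Patch merging between stages: each 2x2 group of neighboring tokens is
# concatenated (C -> 4C) and linearly projected to 2C. The concatenation
# order is simplified relative to the reference implementation.
def patch_merge(x, w):
    """x: (H, W, C) token grid; w: (4C, 2C) projection -> (H/2, W/2, 2C)."""
    H, W, C = x.shape
    grouped = x.reshape(H // 2, 2, W // 2, 2, C)
    merged = grouped.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    return merged @ w

C = 96
x = np.random.rand(56, 56, C)      # Stage 1 output for a 224x224 input
w = np.random.rand(4 * C, 2 * C)   # learned in the real model
y = patch_merge(x, w)
print(y.shape)  # (28, 28, 192): half the resolution, double the channels
```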
For Swin-T with C = 96, the feature dimensions across the four stages are 96, 192, 384, and 768. For Swin-B with C = 128, they are 128, 256, 512, and 1024.
This design mirrors the feature pyramid structure of classical CNNs and makes the Swin Transformer directly compatible with multi-scale frameworks such as Feature Pyramid Networks (FPN), U-Net, and UPerNet.
Each Swin Transformer block follows a standard transformer design with pre-normalization and residual connections. The blocks come in pairs: the first block uses window-based multi-head self-attention (W-MSA), and the second uses shifted window-based multi-head self-attention (SW-MSA). Each block consists of:

- A layer normalization (LN) followed by the attention module (W-MSA or SW-MSA), wrapped in a residual connection.
- A second layer normalization followed by a two-layer MLP with GELU activation and an expansion ratio of 4, also wrapped in a residual connection.
For two consecutive blocks l and l+1, the computation can be expressed as:

z_hat^l = W-MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(z_hat^l)) + z_hat^l
z_hat^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(z_hat^(l+1))) + z_hat^(l+1)

where z^l denotes the output of block l and z_hat^l the intermediate features after the attention sub-layer.
The shifted window mechanism is the defining innovation of the Swin Transformer. It replaces global self-attention with local window attention while using a shifting strategy to maintain cross-window information flow.
In W-MSA, the feature map of h x w tokens is evenly partitioned into non-overlapping windows of size M x M (with M = 7 by default). Self-attention is computed independently within each local window. This produces ceil(h/M) x ceil(w/M) windows, each containing M^2 = 49 tokens.
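The window partition is a pure reshaping operation, sketched here in NumPy (assuming h and w are divisible by M, which the implementation guarantees via padding):

```python
import numpy as np

# W-MSA window partition: an h x w token grid is split into
# non-overlapping M x M windows; attention then runs per window.
def window_partition(x, M=7):
    """x: (h, w, C) -> (num_windows, M*M, C)."""
    h, w, C = x.shape
    windows = x.reshape(h // M, M, w // M, M, C)
    return windows.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

x = np.random.rand(56, 56, 96)
wins = window_partition(x)
print(wins.shape)  # (64, 49, 96): an 8x8 grid of windows, 49 tokens each
```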
The key benefit is computational. For a feature map with h x w tokens and C channels, the complexity of global multi-head self-attention is:
Omega(MSA) = 4hwC^2 + 2(hw)^2 C
The first term (4hwC^2) comes from the linear projections for queries, keys, values, and output. The second term (2(hw)^2 C) comes from computing the attention matrix and multiplying it with the value matrix. This second term is quadratic in spatial resolution.
Window-based multi-head self-attention has complexity:
Omega(W-MSA) = 4hwC^2 + 2M^2 hwC
Since M is a fixed constant (7), the second term is now linear in hw. For a 56x56 feature map (from a 224x224 input) with C = 96, the attention-matrix cost drops from 2 x (56 x 56)^2 x 96 = roughly 1.9 billion operations to 2 x 49 x (56 x 56) x 96 = roughly 30 million operations, a reduction by a factor of (56 x 56) / 49 = 64.
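The two complexity formulas can be evaluated directly to verify the reduction factor:

```python
# Complexity formulas for global MSA vs. windowed W-MSA, evaluated for a
# 56x56 feature map with C = 96 and M = 7.
def omega_msa(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def omega_wmsa(h, w, C, M=7):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56
C = 96
attn_global = 2 * (h * w)**2 * C    # quadratic term only: ~1.9e9 ops
attn_window = 2 * 7**2 * h * w * C  # linear term only: ~3.0e7 ops
print(attn_global // attn_window)   # 64 = (56*56) / 49
```

The linear-projection term (4hwC^2) is identical in both, so the entire saving comes from shrinking the attention-matrix term.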
The limitation of W-MSA is that self-attention is confined to each local window, preventing direct interaction between tokens in different windows. To address this, consecutive Swin Transformer blocks alternate between two window configurations:

- The first block uses the regular partition, with windows aligned to the top-left corner of the feature map.
- The next block displaces the partition by (floor(M/2), floor(M/2)) = (3, 3) tokens, so each shifted window straddles the boundaries of up to four windows from the previous layer.
This alternation ensures that tokens at the boundaries of one window configuration are grouped together in the next, allowing information to flow across window boundaries in successive layers. After several layers, the effective receptive field expands well beyond any individual window.
A naive implementation of the shifted window partition would create windows of varying sizes at image boundaries, complicating batched computation. The Swin Transformer uses a cyclic shift strategy to avoid this problem:

- The feature map is cyclically shifted toward the top-left by (floor(M/2), floor(M/2)) tokens.
- The regular window partition is then applied; some windows now contain tokens that were not adjacent in the original feature map.
- An attention mask restricts computation to tokens belonging to the same original sub-window, so wrapped-around tokens do not attend to each other.
- After attention, the cyclic shift is reversed to restore the original layout.
This approach maintains a constant number of equally-sized windows in every layer, enabling efficient batched matrix multiplication. The authors reported that the cyclic shift implementation achieved 13%, 18%, and 18% speedups on Swin-T, Swin-S, and Swin-B, respectively, compared to a naive padding-based approach.
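The cyclic shift itself is a roll operation, sketched here in NumPy (the attention mask and per-window computation are omitted):

```python
import numpy as np

# Cyclic shift for SW-MSA: roll the token grid toward the top-left by
# floor(M/2) = 3 along both axes, run regular masked window attention,
# then roll back. The number and size of windows never change.
M = 7
x = np.arange(56 * 56).reshape(56, 56)

shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
# ... masked window attention would run here on the usual 8x8 windows ...
restored = np.roll(shifted, shift=(M // 2, M // 2), axis=(0, 1))

print(np.array_equal(restored, x))  # True: the shift is exactly invertible
```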
Instead of using absolute positional encodings as in ViT, the Swin Transformer incorporates a learnable relative position bias into the attention computation:
Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V
Here, B is a bias matrix of size M^2 x M^2, drawn from a learnable parameter table of size (2M - 1) x (2M - 1). The table is indexed by the relative displacement between each pair of tokens along both spatial axes. Since relative positions along each axis range from -(M-1) to +(M-1), the table has (2M - 1) entries per axis, for a total of (2M - 1)^2 = 169 unique bias values when M = 7.
Each attention head has its own relative position bias, and the biases are shared across all windows within the same layer. Ablation experiments showed that relative position bias improved ImageNet-1K top-1 accuracy by approximately 1.2% over no position encoding and by approximately 0.5% over absolute position embedding.
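The indexing from token pairs into the (2M - 1)^2 bias table can be sketched in NumPy:

```python
import numpy as np

# Relative position bias indexing for an M x M window: each ordered pair
# of tokens maps to one of (2M-1)^2 learnable bias values.
M = 7
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
coords = coords.reshape(2, -1)                 # (2, 49) token coordinates
rel = coords[:, :, None] - coords[:, None, :]  # (2, 49, 49), range [-(M-1), M-1]

# Shift displacements to be non-negative, then flatten the two axes into
# a single index into the (2M-1)^2 = 169-entry table.
idx = (rel[0] + M - 1) * (2 * M - 1) + (rel[1] + M - 1)  # (49, 49)

table = np.random.rand((2 * M - 1) ** 2)  # 169 learnable biases per head
B = table[idx]                            # (49, 49) bias matrix added to QK^T
print(B.shape, len(np.unique(idx)))
```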
The Swin Transformer family includes four standard model variants of increasing capacity. All use a window size of M = 7, a query dimension of d = 32 per attention head, a patch size of 4x4, and an MLP expansion ratio of 4.
| Variant | Embed Dim (C) | Layer Depths | Heads per Stage | Parameters | FLOPs (224x224) | Throughput (img/s) |
|---|---|---|---|---|---|---|
| Swin-T (Tiny) | 96 | {2, 2, 6, 2} | {3, 6, 12, 24} | 29M | 4.5G | 755.2 |
| Swin-S (Small) | 96 | {2, 2, 18, 2} | {3, 6, 12, 24} | 50M | 8.7G | 436.9 |
| Swin-B (Base) | 128 | {2, 2, 18, 2} | {4, 8, 16, 32} | 88M | 15.4G | 278.1 |
| Swin-L (Large) | 192 | {2, 2, 18, 2} | {6, 12, 24, 48} | 197M | 34.5G | 42.1 (384x384) |
The number of attention heads per stage is determined by dividing the channel dimension at that stage by the per-head query dimension of 32. For Swin-T (C = 96), Stage 1 has 96/32 = 3 heads, Stage 2 has 192/32 = 6 heads, Stage 3 has 384/32 = 12 heads, and Stage 4 has 768/32 = 24 heads.
Swin-T and Swin-S have computational costs comparable to ResNet-50 and ResNet-101, respectively. Swin-B has similar complexity to ViT-B and DeiT-B but achieves substantially higher accuracy. Swin-L is designed for ImageNet-22K pre-training and is approximately 2x the size of Swin-B.
The main architectural difference between Swin-T and Swin-S is the depth of Stage 3: Swin-T uses 6 blocks (3 pairs of W-MSA + SW-MSA), while Swin-S, Swin-B, and Swin-L all use 18 blocks (9 pairs). Stage 3 is the deepest stage because the feature map is at 1/16 resolution with 4C channels, offering a good balance between spatial detail and semantic abstraction.
For training from scratch on ImageNet-1K (1.28 million training images, 1,000 classes), the authors used the following setup:

- AdamW optimizer for 300 epochs, with a cosine-decay learning rate schedule and 20 epochs of linear warm-up.
- Batch size of 1,024, initial learning rate of 0.001, and weight decay of 0.05.
- Most of the augmentation and regularization strategies of DeiT, including RandAugment, Mixup, CutMix, random erasing, and stochastic depth.
For ImageNet-22K pre-training (14.2 million images across 21,841 classes), the models were pre-trained for 90 epochs at 224x224 resolution and then fine-tuned on ImageNet-1K at both 224x224 and 384x384 resolutions. Fine-tuning used a smaller learning rate (typically 1e-5) for 30 epochs.
Swin Transformer variants were evaluated on the ImageNet-1K benchmark against prior vision transformers and CNN baselines.
| Model | Pre-training | Resolution | Params | FLOPs | Top-1 Acc (%) |
|---|---|---|---|---|---|
| ResNet-50 | ImageNet-1K | 224x224 | 25M | 4.1G | 76.2 |
| DeiT-S | ImageNet-1K | 224x224 | 22M | 4.6G | 79.8 |
| Swin-T | ImageNet-1K | 224x224 | 29M | 4.5G | 81.3 |
| DeiT-B | ImageNet-1K | 224x224 | 86M | 17.5G | 81.8 |
| Swin-S | ImageNet-1K | 224x224 | 50M | 8.7G | 83.0 |
| Swin-B | ImageNet-1K | 224x224 | 88M | 15.4G | 83.5 |
| Swin-B | ImageNet-1K | 384x384 | 88M | 47.1G | 84.5 |
| ViT-B/16 | ImageNet-22K | 384x384 | 86M | 55.4G | 84.0 |
| ViT-L/16 | ImageNet-22K | 384x384 | 307M | 190.7G | 85.2 |
| Swin-B | ImageNet-22K | 224x224 | 88M | 15.4G | 85.2 |
| Swin-B | ImageNet-22K | 384x384 | 88M | 47.1G | 86.4 |
| Swin-L | ImageNet-22K | 224x224 | 197M | 34.5G | 86.3 |
| Swin-L | ImageNet-22K | 384x384 | 197M | 103.9G | 87.3 |
Key comparisons:

- With ImageNet-1K training alone, Swin-T (81.3%) outperformed DeiT-S (79.8%) by +1.5% at comparable size and FLOPs, and Swin-B (83.5%) outperformed DeiT-B (81.8%) by +1.7% with fewer FLOPs.
- With ImageNet-22K pre-training at 384x384, Swin-B (86.4%) surpassed ViT-B/16 (84.0%) by +2.4% at lower computational cost (47.1G vs. 55.4G FLOPs).
- Swin-L with ImageNet-22K pre-training at 384x384 achieved the best result, 87.3% top-1 accuracy, exceeding the much larger ViT-L/16 (85.2%).
For object detection and instance segmentation on the COCO benchmark, Swin Transformer was evaluated as a backbone with several detection frameworks.
Cascade Mask R-CNN (COCO val2017):
| Backbone | Params | FLOPs | FPS | Box AP | Mask AP |
|---|---|---|---|---|---|
| ResNet-50 | 82M | 739G | 18.0 | 46.3 | 40.1 |
| Swin-T | 86M | 745G | 15.3 | 50.5 | 43.7 |
| ResNeXt101-64x4d | 140M | 972G | 10.4 | 48.3 | 41.7 |
| Swin-S | 107M | 838G | 12.6 | 51.8 | 45.0 |
| Swin-B | 145M | 982G | 11.6 | 51.9 | 45.0 |
HTC++ with ImageNet-22K pre-training (COCO test-dev):
| Backbone | Box AP | Mask AP |
|---|---|---|
| Swin-L (single-scale) | 57.1 | 49.5 |
| Swin-L (multi-scale) | 58.7 | 51.1 |
Swin-T surpassed ResNet-50 by +4.2 box AP and +3.6 mask AP with similar model size and FLOPs. Swin-B achieved 51.9 box AP, a +3.6 gain over ResNeXt101-64x4d, which had considerably more parameters and FLOPs. The best configuration, Swin-L with HTC++ and multi-scale testing, achieved 58.7 box AP and 51.1 mask AP on COCO test-dev, surpassing the previous state-of-the-art by +2.7 box AP and +2.6 mask AP.
For semantic segmentation on the ADE20K benchmark, Swin Transformer was used as the backbone for the UPerNet framework.
| Backbone | Pre-training | Params | FLOPs | FPS | mIoU (ss) | mIoU (ms+flip) |
|---|---|---|---|---|---|---|
| DeiT-S | ImageNet-1K | 52M | 1099G | 16.2 | 44.0 | -- |
| Swin-T | ImageNet-1K | 60M | 945G | 18.5 | 44.5 | 45.8 |
| Swin-S | ImageNet-1K | 81M | 1038G | 15.2 | 47.6 | 49.5 |
| Swin-B | ImageNet-22K | 121M | 1841G | 8.7 | 49.7 | 51.6 |
| Swin-L | ImageNet-22K | 234M | 3230G | 6.2 | 52.1 | 53.5 |
Swin-L with multi-scale testing and horizontal flipping achieved 53.5 mIoU, surpassing the previous state-of-the-art SETR model (based on ViT-L) by +3.2 mIoU. Even Swin-T outperformed DeiT-S while using fewer FLOPs (945G vs. 1099G).
The authors conducted ablation experiments on Swin-T to quantify the contribution of each design choice:
| Configuration | ImageNet Top-1 | COCO Box AP | ADE20K mIoU |
|---|---|---|---|
| Regular windows only (no shift) | 80.2% | 47.7 | 43.3 |
| Shifted windows (full model) | 81.3% | 50.5 | 46.1 |
| Improvement from shift | +1.1% | +2.8 | +2.8 |
The shifted window mechanism provided consistent improvements across all three tasks, with particularly large gains on the dense prediction tasks (COCO and ADE20K), where cross-window information flow is critical.
Replacing relative position bias with absolute position embedding reduced ImageNet-1K accuracy by approximately 0.5% (consistent with the position-encoding ablation figures above), confirming the importance of relative position encoding for the Swin architecture.
Swin Transformer V2: Scaling Up Capacity and Resolution, published by Liu, Hu et al. at CVPR 2022, addressed three challenges that arise when scaling vision transformers to very large sizes: training instability, resolution gaps between pre-training and fine-tuning, and the need for vast labeled training data.
The original Swin Transformer uses pre-normalization, applying layer normalization before each attention and MLP sub-layer. The authors found that activation amplitudes at deeper layers grow uncontrollably when scaling model capacity, leading to training instability and divergence. Swin V2 moves to post-normalization, where layer normalization is applied after each residual block. This stabilizes activation magnitudes across layers and allows training of much larger models.
Standard dot-product attention can produce extremely large logit values when model capacity increases, dominating the softmax computation and causing training instability. Swin V2 replaces dot-product attention with cosine attention, where attention logits are computed as the cosine similarity between query and key vectors, scaled by a learnable temperature parameter tau:
Attention(Q, K, V) = SoftMax(cos(Q, K) / tau + B) V
The cosine similarity naturally bounds the attention logits to the range [-1, 1] before scaling, preventing magnitude explosion.
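A minimal single-head sketch of scaled cosine attention in NumPy (the relative position bias B and multi-head bookkeeping are omitted; tau is fixed here but learnable in the model):

```python
import numpy as np

# Swin V2 scaled cosine attention for one head: logits are cosine
# similarities (bounded in [-1, 1]) divided by a temperature tau.
def cosine_attention(Q, K, V, tau=0.1):
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = (Qn @ Kn.T) / tau                     # cos(Q, K) / tau
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# One 7x7 window (49 tokens) with a 32-dim head.
Q, K, V = (np.random.rand(49, 32) for _ in range(3))
out = cosine_attention(Q, K, V)
print(out.shape)  # (49, 32)
```

Because the pre-temperature logits cannot exceed 1 in magnitude, large activations in Q and K no longer translate into extreme softmax inputs.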
The original Swin Transformer's learnable relative position bias table is tied to the window size used during pre-training. When fine-tuning with larger windows (due to higher input resolution), position biases for previously unseen relative positions must be obtained through interpolation, which degrades performance.
Swin V2 replaces the discrete bias table with a small meta-network: a two-layer MLP that takes log-spaced relative coordinates as input and outputs the position bias value. The log transformation compresses the coordinate range, enabling smoother extrapolation from smaller to larger window sizes. Formally, for relative coordinates (delta_x, delta_y), the log-spaced transformation is:
delta_hat_x = sign(delta_x) * log(1 + |delta_x|)
delta_hat_y = sign(delta_y) * log(1 + |delta_y|)
This allows models pre-trained at low resolution to transfer effectively to high-resolution downstream tasks with minimal accuracy loss.
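The log-spaced transform is a one-liner in NumPy:

```python
import numpy as np

# Log-spaced coordinate transform from Swin V2's continuous position
# bias: large offsets are compressed, so extrapolating from a small
# window to a larger one covers a much smaller input range for the MLP.
def log_spaced(delta):
    return np.sign(delta) * np.log1p(np.abs(delta))

deltas = np.array([-15.0, -7.0, -1.0, 0.0, 1.0, 7.0, 15.0])
print(log_spaced(deltas).round(2))
# Doubling the window size roughly adds a constant to, rather than
# doubling, the coordinate range seen by the meta-network.
```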
To reduce dependence on large labeled datasets, Swin V2 adopts SimMIM (Simple Framework for Masked Image Modeling), a self-supervised learning method. Random patches of the input image are masked, and the model is trained to reconstruct the raw pixel values of the masked patches using a simple L1 loss. This approach requires no labeled data for pre-training and significantly reduces the data requirements compared to supervised pre-training.
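A toy sketch of the SimMIM objective (the "prediction" is a random stand-in for the model's output head, and the 60% mask ratio is illustrative of the large fractions SimMIM uses):

```python
import numpy as np

# SimMIM-style masked image modeling: mask a random subset of patch
# tokens, predict their raw pixels, and train with an L1 loss computed
# on the masked positions only.
rng = np.random.default_rng(0)
num_tokens, patch_dim = 196, 48            # 14x14 tokens of 4x4x3 pixels
pixels = rng.random((num_tokens, patch_dim))

mask = rng.random(num_tokens) < 0.6        # True = patch is masked out
prediction = rng.random((num_tokens, patch_dim))  # stand-in model output

# L1 reconstruction loss, averaged over masked patch pixels only;
# visible patches contribute nothing to the objective.
l1 = np.abs(prediction - pixels)[mask].mean()
print(float(l1) >= 0.0, int(mask.sum()))
```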
Swin V2 retained the original four sizes and introduced two additional large-scale variants:
| Variant | Embed Dim (C) | Layer Depths | Parameters | Resolution |
|---|---|---|---|---|
| SwinV2-T | 96 | {2, 2, 6, 2} | 28M | 256x256 |
| SwinV2-S | 96 | {2, 2, 18, 2} | 50M | 256x256 |
| SwinV2-B | 128 | {2, 2, 18, 2} | 88M | 256x256 |
| SwinV2-L | 192 | {2, 2, 18, 2} | 197M | 256x256 |
| SwinV2-H (Huge) | 352 | {2, 2, 18, 2} | ~658M | 512x512 |
| SwinV2-G (Giant) | 512 | {2, 2, 42, 4} | ~3B | 512x512 |
The SwinV2-G model, with approximately 3 billion parameters, was the largest dense vision model at the time of publication. It was capable of processing images at up to 1,536x1,536 resolution after fine-tuning.
Swin V2 set new records on four major vision benchmarks:
| Benchmark | Task | Best Model | Metric | Score |
|---|---|---|---|---|
| ImageNet-V2 | Image Classification | SwinV2-G | Top-1 Accuracy | 84.0% |
| COCO | Object Detection | SwinV2-G | Box AP / Mask AP | 63.1 / 54.4 |
| ADE20K | Semantic Segmentation | SwinV2-G | mIoU | 59.9 |
| Kinetics-400 | Video Action Classification | SwinV2-G | Top-1 Accuracy | 86.8% |
The SwinV2-G model achieved these results while consuming approximately 40 times less labeled data and 40 times less training time compared to Google's billion-parameter vision models. The COCO results (63.1 box AP) represented a +4.4 box AP improvement over the original Swin-L HTC++ result (58.7 box AP), and the ADE20K result (59.9 mIoU) was a +6.4 mIoU improvement over the original 53.5 mIoU.
| Aspect | ViT | Swin Transformer |
|---|---|---|
| Attention scope | Global (all tokens) | Local (within M x M windows) |
| Computational complexity | Quadratic O(n^2) | Linear O(n) |
| Feature scales | Single (1/16) | Multi-scale (1/4, 1/8, 1/16, 1/32) |
| Position encoding | Absolute | Relative bias |
| Dense prediction support | Requires modifications | Native (FPN-compatible) |
| ImageNet-22K (384x384) | 84.0% (ViT-B), 85.2% (ViT-L) | 86.4% (Swin-B), 87.3% (Swin-L) |
ViT's strength lies in its simplicity and the fact that global attention captures long-range dependencies from the first layer. However, the Swin Transformer's linear complexity and hierarchical outputs make it more practical for high-resolution and dense prediction tasks.
DeiT improved ViT's training efficiency but retained the same architecture. At comparable model sizes:

- Swin-T (29M parameters, 4.5G FLOPs) reached 81.3% top-1 on ImageNet-1K, versus 79.8% for DeiT-S (22M parameters, 4.6G FLOPs).
- Swin-B (88M parameters, 15.4G FLOPs) reached 83.5%, versus 81.8% for DeiT-B (86M parameters, 17.5G FLOPs).
The gains extended to downstream tasks: on ADE20K, Swin-T (44.5 mIoU) outperformed DeiT-S (44.0 mIoU) while using fewer FLOPs.
ConvNeXt, introduced by Zhuang Liu et al. at CVPR 2022, modernized the ResNet architecture with design choices borrowed from transformers. ConvNeXt demonstrated that a pure CNN could match or slightly exceed Swin Transformer's performance:
| Model Pair | ImageNet-1K Top-1 | Params | FLOPs |
|---|---|---|---|
| Swin-T / ConvNeXt-T | 81.3% / 82.1% | 29M / 29M | 4.5G / 4.5G |
| Swin-S / ConvNeXt-S | 83.0% / 83.1% | 50M / 50M | 8.7G / 8.7G |
| Swin-B / ConvNeXt-B | 83.5% / 83.8% | 88M / 89M | 15.4G / 15.4G |
With ImageNet-22K pre-training at 384x384, ConvNeXt-B reached 85.8% (vs. Swin-B at 86.4%) and ConvNeXt-XL achieved 87.8% (vs. Swin-L at 87.3%). ConvNeXt also showed higher inference throughput on GPUs.
The significance of ConvNeXt was not its marginal performance gains but its demonstration that many of Swin Transformer's advantages came from training recipes and hierarchical design rather than the attention mechanism itself. When CNNs adopted transformer-era training practices (stronger augmentation, layer normalization, larger kernel sizes), the performance gap narrowed substantially.
| Architecture | Attention Type | Feature Scales | Complexity | Key Advantage |
|---|---|---|---|---|
| ViT | Global | Single | Quadratic | Full receptive field from layer 1 |
| DeiT | Global | Single | Quadratic | Data-efficient ViT training |
| PVT | Spatial reduction | Multi-scale | Reduced quadratic | Hierarchical with attention |
| Swin | Shifted window | Multi-scale | Linear | Efficient, general-purpose backbone |
| ConvNeXt | Convolution | Multi-scale | Linear | CNN simplicity with modern design |
The Swin Transformer's combination of hierarchical features, linear complexity, and strong empirical performance has made it one of the most widely adopted vision backbones across diverse applications.
As a classification backbone, Swin Transformer is used for transfer learning across domains including medical imaging, remote sensing, autonomous driving, agriculture, and industrial quality inspection. Its ability to fine-tune at higher resolutions than the pre-training resolution makes it valuable for tasks requiring fine-grained spatial detail.
Swin Transformer has been integrated with all major detection frameworks: Mask R-CNN, Cascade Mask R-CNN, HTC++, and DETR-style detectors. Its multi-scale feature maps align with the expectations of FPN and similar multi-scale architectures. At the time of publication, it achieved the highest scores on the COCO benchmark. Subsequent detectors such as DINO and Co-DETR adopted Swin backbones for their best results.
For pixel-level prediction, Swin Transformer integrates with segmentation decoders such as UPerNet, DeepLab, and Mask2Former. The hierarchical features at 1/4, 1/8, 1/16, and 1/32 resolutions provide the multi-scale context needed for accurate segmentation. On ADE20K, Swin-based models held top positions on the leaderboard for an extended period.
SwinIR (Liang et al., ICCV 2021 Workshop) adapted the Swin Transformer architecture for image restoration tasks including super-resolution, denoising, and JPEG artifact removal. SwinIR demonstrated that the shifted window attention mechanism was well-suited for low-level vision tasks, outperforming previous CNN-based methods with only 11.8 million parameters for lightweight super-resolution.
Video Swin Transformer (Liu et al., CVPR 2022) extended the shifted window mechanism to 3D spatiotemporal volumes for video recognition. It computed local 3D attention within shifted spatiotemporal windows, achieving 84.9% top-1 accuracy on Kinetics-400 and 85.9% on Kinetics-600 with 20x less pre-training data and 3x smaller model size compared to competing methods at the time.
Swin UNETR (Hatamizadeh et al., 2022) combined a Swin Transformer encoder with a U-Net-style decoder for 3D medical image segmentation, achieving strong results on brain tumor segmentation from MRI scans. The hierarchical structure proved effective for capturing both local anatomical details and global structural context in volumetric medical data. Other medical applications include retinal image analysis, histopathology slide classification, and organ segmentation in CT scans.
Swin Transformer backbones have been widely adopted for satellite and aerial image analysis, including land cover classification, change detection, building footprint extraction, and scene understanding. The architecture's ability to capture multi-scale spatial relationships is particularly beneficial for remote sensing imagery, which often contains objects at vastly different scales.
Swin3D extended the shifted window mechanism to point cloud processing for 3D object detection and scene segmentation. By partitioning 3D space into voxel-based windows and applying shifted attention in three dimensions, it brought the efficiency benefits of Swin Transformer to 3D understanding tasks.
Several factors contributed to the Swin Transformer's rapid and widespread adoption in computer vision.
Hierarchical multi-scale features. The four-stage design with patch merging produces feature maps at resolutions that match the expectations of decades of dense prediction research. Researchers could adopt Swin Transformer as a drop-in replacement for CNN backbones like ResNet in existing detection and segmentation frameworks (FPN, U-Net, UPerNet) without modifying task-specific heads. This compatibility with the existing ecosystem was critical for rapid adoption.
Linear computational complexity. Windowed attention scales linearly with image resolution, making the Swin Transformer practical for high-resolution inputs. Object detection typically uses 800x1333 input images, and segmentation may use 512x512 or larger crops. Global attention at these resolutions would be prohibitively expensive, but Swin Transformer handles them efficiently.
Broad empirical superiority. Setting new state-of-the-art results simultaneously on ImageNet classification, COCO detection, and ADE20K segmentation demonstrated that Swin Transformer was not a niche solution for one task but a genuinely general-purpose backbone. This breadth of strong results convinced the community of its practical value.
Simple, well-documented implementation. The cyclic shift and masking strategy for shifted window attention is implementable using standard tensor operations without custom CUDA kernels. Microsoft released the full PyTorch implementation, pre-trained weights for all variants on ImageNet-1K and ImageNet-22K, and comprehensive documentation.
Framework integration. The model was quickly integrated into major computer vision libraries including MMDetection, MMSegmentation, Detectron2, and the Hugging Face Transformers library. This widespread framework support lowered the barrier to adoption for both researchers and practitioners.
Timing. The Swin Transformer arrived at a moment when the community was actively searching for a transformer-based alternative to CNN backbones that could handle dense prediction tasks. It filled this gap more completely than any prior work, capturing attention at a pivotal time in the transition from CNN-dominated to transformer-based vision architectures.
Despite its influence, the Swin Transformer has several recognized limitations.
Fixed window size. The window size M is a fixed hyperparameter (typically 7). This limits the attention range within each layer. While shifted windows enable cross-window information flow over successive layers, the effective receptive field grows slowly compared to architectures with global or deformable attention. Swin V2's Log-CPB partially addresses the transfer problem across window sizes but does not eliminate the fundamental constraint.
Implementation complexity. The alternating regular and shifted window partitions, cyclic shifting, and attention masking add engineering overhead compared to the simplicity of standard ViT (global attention) or CNNs (convolution). This complexity can complicate debugging, profiling, and integration with new frameworks.
Throughput vs. pure CNNs. On modern GPU hardware, convolution operations are more heavily optimized than the operations required for Swin attention (window partitioning, cyclic shifting, masking, attention computation, unpartitioning). ConvNeXt demonstrated that carefully designed CNNs achieve higher inference throughput than Swin Transformer at comparable accuracy levels.
Pre-training data sensitivity. The largest performance gains from Swin Transformer (especially Swin-L) require ImageNet-22K or larger-scale pre-training. When restricted to ImageNet-1K training, the advantage of Swin over modernized CNNs narrows.
Quadratic complexity within windows. While global complexity is linear in image size, the complexity within each window is O(M^4) per window. Increasing the window size for a larger receptive field rapidly increases the per-window computation. This constrains the practical range of window sizes.
The Swin Transformer's impact on computer vision has been substantial and enduring.
Derived architectures. Numerous subsequent architectures built directly on the shifted window concept. CSWin Transformer introduced cross-shaped window attention. Focal Transformer combined local and global attention within the same framework. MaxViT used block and grid attention patterns inspired by Swin's windowed approach. Twins explored alternating local and global attention.
Hybrid CNN-transformer designs. The competition between Swin Transformer and ConvNeXt catalyzed research into hybrid architectures that combine convolutional and attention-based layers, including CoAtNet, EfficientFormerV2, and FastViT.
Foundation models. Swin Transformer served as the vision backbone in several large-scale multimodal and foundation models, including applications in visual grounding, visual question answering, and image-text matching.
Benchmark impact. At publication, Swin Transformer topped the leaderboards on COCO, ADE20K, and ImageNet. Its results served as the baseline that subsequent architectures had to surpass, effectively resetting the bar for vision backbone performance.
Citation impact. With over 14,800 citations by early 2026, the Swin Transformer paper is among the most-cited computer vision papers of the decade. Its Marr Prize at ICCV 2021 recognized its role in establishing a new paradigm for vision backbone design.
The core principles introduced by Swin Transformer, that vision transformers should produce hierarchical multi-scale features and use efficient local attention mechanisms, have become standard assumptions in the field. Even architectures that do not use shifted windows specifically (such as those based on deformable attention or neighborhood attention) operate within the design framework that Swin Transformer helped establish.