U-Net is a convolutional neural network architecture designed for biomedical image segmentation. It was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox at the University of Freiburg, Germany, and presented at the MICCAI 2015 conference in a paper titled "U-Net: Convolutional Networks for Biomedical Image Segmentation" [1]. The architecture features a distinctive U-shaped design consisting of a contracting encoder path that captures context and a symmetric expanding decoder path that enables precise spatial localization. Its key innovation is the use of skip connections that concatenate feature maps from the encoder directly to the corresponding decoder layers, allowing the network to combine high-level semantic information with fine-grained spatial details.
U-Net was originally designed to work with very small training datasets, a common constraint in medical imaging where annotated data is expensive to obtain. Despite this modest beginning, U-Net's elegant design has made it one of the most widely adopted architectures in deep learning. It won the ISBI cell tracking challenge in 2015 by a large margin [1], became the standard approach for medical image segmentation across dozens of imaging modalities, and found an unexpected second life as the denoising backbone in diffusion models like Stable Diffusion. As of 2025, the original paper has been cited over 87,000 times on Google Scholar [2].
Before U-Net, applying deep learning to biomedical image segmentation faced two major challenges.
First, biomedical datasets are typically small. Annotating medical images requires domain expertise (often trained physicians or biologists), making large-scale annotation impractical. While ImageNet provided millions of labeled images for classification, a typical biomedical segmentation task might have only 30 to 50 annotated images.
Second, segmentation requires pixel-level predictions, not just image-level labels. The network must output a class label for every pixel in the input image, preserving spatial resolution while also understanding the broader context of the image. Standard classification networks like AlexNet and VGGNet progressively reduced spatial resolution through pooling operations, discarding the fine spatial information needed for precise segmentation.
The state of the art before U-Net was a sliding window approach by Ciresan et al. (2012), which classified each pixel by feeding it a local patch around that pixel through a network [3]. This approach had two major drawbacks: it was extremely slow (each pixel required a separate forward pass), and it could not capture large-scale context because the patch size was limited.
Long et al. (2015) introduced Fully Convolutional Networks (FCN), which replaced fully connected layers with convolutional layers and used upsampling to produce dense predictions [4]. FCN was a significant step forward, but it still struggled with fine spatial details because information lost during downsampling was not fully recovered during upsampling. Ronneberger et al. built directly on the FCN concept, adding the skip connections and symmetric architecture that define U-Net.
U-Net's architecture has a symmetric structure that resembles the letter U when drawn as a diagram. It consists of two main paths: a contracting path (encoder) on the left side and an expansive path (decoder) on the right side, connected by skip connections at each resolution level.
The contracting path follows the typical architecture of a convolutional network. It consists of four blocks, each containing:

- two 3x3 convolutions (unpadded), each followed by a ReLU activation, and
- a 2x2 max pooling operation with stride 2 for downsampling.
At each downsampling step, the number of feature channels is doubled. The input image (originally 572x572 pixels in the paper's implementation) is progressively reduced in spatial resolution: 572 to 568 to 284 to 280 to 140, and so on. The spatial reduction occurs both from the unpadded ("valid") convolutions (which lose one pixel on each border per convolution) and from the max pooling operations.
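The size arithmetic above is easy to reproduce. A toy helper (not from the paper's code) that tracks the spatial size through the contracting path, assuming unpadded 3x3 convolutions and 2x2 max pooling:

```python
def double_conv(size):
    # two unpadded 3x3 convolutions each shave 2 pixels per dimension
    return size - 4

def encoder_sizes(size, levels=4):
    sizes = [size]
    for _ in range(levels):
        size = double_conv(size)
        sizes.append(size)
        size //= 2  # 2x2 max pooling halves the resolution
        sizes.append(size)
    return sizes

print(encoder_sizes(572))  # → [572, 568, 284, 280, 140, 136, 68, 64, 32]
```

The final value, 32, is the spatial size entering the bottleneck.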
The contracting path serves to capture increasingly abstract and contextual features. Early layers detect edges and textures, while deeper layers recognize complex structures and object-level patterns.
At the bottom of the U, a bottleneck layer consists of two 3x3 convolutions with ReLU. This layer processes the most spatially compressed representation, operating at roughly 1/16 of the input resolution along each spatial axis (after four poolings, ignoring the border pixels lost to valid convolutions) with the highest number of feature channels (1024 in the original architecture). The bottleneck captures the broadest context, encoding high-level information about the entire input.
The expansive path mirrors the contracting path. Each block contains:

- a 2x2 up-convolution (transposed convolution) that halves the number of feature channels,
- concatenation with the correspondingly cropped feature map from the contracting path, and
- two 3x3 convolutions, each followed by a ReLU activation.
The concatenation step is the defining feature of U-Net. By combining upsampled features (which carry semantic information from deeper layers) with high-resolution features from the encoder (which preserve spatial detail), the network can produce precise segmentation masks that respect fine boundaries.
The cropping is necessary because the unpadded convolutions in the encoder reduce the spatial dimensions at each level. The encoder feature maps are larger than the corresponding decoder maps, so they are center-cropped before concatenation.
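A minimal NumPy sketch of this crop-and-concatenate step, using zero arrays in place of real feature maps (the `center_crop` helper is illustrative, not the paper's code); the sizes are those of level 4 in the original architecture:

```python
import numpy as np

def center_crop(feat, target_hw):
    """Center-crop a (C, H, W) feature map to a target spatial size."""
    _, h, w = feat.shape
    th, tw = target_hw
    y0, x0 = (h - th) // 2, (w - tw) // 2
    return feat[:, y0:y0 + th, x0:x0 + tw]

# Level-4 sizes from the paper: encoder map is 64x64 with 512 channels,
# the upsampled decoder map is 56x56 with 512 channels.
enc = np.zeros((512, 64, 64))
dec = np.zeros((512, 56, 56))
skip = np.concatenate([center_crop(enc, dec.shape[1:]), dec], axis=0)
print(skip.shape)  # → (1024, 56, 56)
```

The concatenated 1024-channel map is what the decoder's two 3x3 convolutions then process.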
A 1x1 convolution maps the final 64-channel feature map to the desired number of output classes. For a binary segmentation task (foreground vs. background), the output has two channels.
| Path | Level | Operation | Input Channels | Output Channels | Spatial Size (approx.) |
|---|---|---|---|---|---|
| Encoder | 1 | 2x Conv 3x3 + ReLU, MaxPool 2x2 | 1 | 64 | 572 -> 568 -> 284 |
| Encoder | 2 | 2x Conv 3x3 + ReLU, MaxPool 2x2 | 64 | 128 | 284 -> 280 -> 140 |
| Encoder | 3 | 2x Conv 3x3 + ReLU, MaxPool 2x2 | 128 | 256 | 140 -> 136 -> 68 |
| Encoder | 4 | 2x Conv 3x3 + ReLU, MaxPool 2x2 | 256 | 512 | 68 -> 64 -> 32 |
| Bottleneck | 5 | 2x Conv 3x3 + ReLU | 512 | 1024 | 32 -> 28 |
| Decoder | 4 | Up-conv 2x2, Concat, 2x Conv 3x3 + ReLU | 1024 | 512 | 28 -> 56 -> 52 |
| Decoder | 3 | Up-conv 2x2, Concat, 2x Conv 3x3 + ReLU | 512 | 256 | 52 -> 104 -> 100 |
| Decoder | 2 | Up-conv 2x2, Concat, 2x Conv 3x3 + ReLU | 256 | 128 | 100 -> 200 -> 196 |
| Decoder | 1 | Up-conv 2x2, Concat, 2x Conv 3x3 + ReLU | 128 | 64 | 196 -> 392 -> 388 |
| Output | - | Conv 1x1 | 64 | num_classes | 388 |
Note: Modern implementations commonly use zero-padded convolutions to preserve spatial dimensions, simplifying the architecture and eliminating the need for cropping.
For segmenting large images that do not fit into GPU memory, U-Net employs an overlap-tile strategy. The image is divided into overlapping tiles, and the network predicts the segmentation only for the central region of each tile. The overlap ensures that every pixel in the output has sufficient context from surrounding pixels. Pixels near the image borders are handled by mirroring the image content. This strategy allows U-Net to segment images of arbitrary size, a practical necessity in medical imaging where whole-slide pathology images can be tens of thousands of pixels in each dimension [1].
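The tiling itself can be sketched as follows; `overlap_tiles` is a hypothetical helper that mirrors the borders with `np.pad` and slices overlapping input windows. The demo uses toy tile sizes so it stays small; the defaults match the paper's 572-pixel input and 388-pixel output tiles:

```python
import numpy as np

def overlap_tiles(image, out_tile=388, in_tile=572):
    """Cut overlapping input tiles whose central out_tile regions cover
    the image; borders are extrapolated by mirroring, as in the paper."""
    margin = (in_tile - out_tile) // 2           # context border (92 px)
    padded = np.pad(image, margin, mode="reflect")
    h, w = image.shape
    return [padded[y:y + in_tile, x:x + in_tile]
            for y in range(0, h, out_tile)
            for x in range(0, w, out_tile)]

# Toy sizes for illustration: an 8x8 image with 4-pixel output tiles.
img = np.arange(64, dtype=float).reshape(8, 8)
tiles = overlap_tiles(img, out_tile=4, in_tile=8)
print(len(tiles), tiles[0].shape)  # → 4 (8, 8)
```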
U-Net's training approach was specifically designed to handle the challenges of small biomedical datasets.
The authors employed aggressive data augmentation, applying elastic deformations, rotations, shifts, and flips to training images. Elastic deformations were particularly important for biomedical applications because they simulate the natural variability in biological tissue. Random displacement vectors are generated on a coarse grid and then smoothed with a Gaussian filter to create smooth deformation fields, which are applied to both the input image and its segmentation mask. The authors reported that data augmentation was the most critical factor for achieving good performance with limited training data [1].
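A crude, dependency-free sketch of the idea (real implementations smooth the coarse displacement field with a Gaussian filter and resample with bilinear interpolation; this toy version uses nearest-neighbour everywhere, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def elastic_indices(shape, grid=4, alpha=8.0):
    """Random displacements on a coarse grid, upsampled to image size."""
    h, w = shape
    dy = rng.normal(0.0, 1.0, (grid, grid)) * alpha
    dx = rng.normal(0.0, 1.0, (grid, grid)) * alpha
    dy = np.kron(dy, np.ones((h // grid, w // grid)))  # nearest upsample
    dx = np.kron(dx, np.ones((h // grid, w // grid)))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ys = np.clip((ys + dy).round().astype(int), 0, h - 1)
    xs = np.clip((xs + dx).round().astype(int), 0, w - 1)
    return ys, xs

img = rng.random((16, 16))
mask = (img > 0.5).astype(int)
ys, xs = elastic_indices(img.shape)
img_warp, mask_warp = img[ys, xs], mask[ys, xs]  # same field for both
```

The essential point is the last line: the identical deformation field warps the image and its segmentation mask, so the annotation stays consistent with the augmented input.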
The paper introduced a weighted cross-entropy loss function that assigns higher weights to pixels near the boundaries between touching objects. This is important in cell segmentation, where individual cells must be separated even when they are in direct contact. The weight map is pre-computed for each training image based on the distance of each pixel to the nearest cell borders. A pixel that lies exactly between two cells receives a very high weight, encouraging the network to learn to create thin separation boundaries [1].
The weight function is defined as:
w(x) = w_c(x) + w_0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2))
where d1(x) and d2(x) are the distances to the border of the nearest and second-nearest cell, w_c(x) is a class-frequency weight map that balances foreground and background pixels, w_0 = 10, and sigma is approximately 5 pixels.
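Given the two distance maps (normally derived from the label image, but supplied directly in this sketch), the weight map is a one-liner in NumPy, here with the paper's reported constants:

```python
import numpy as np

def unet_weight_map(wc, d1, d2, w0=10.0, sigma=5.0):
    """Per-pixel loss weights from distances to the two nearest cells."""
    return wc + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

# A pixel squeezed between two touching cells (d1 = d2 = 1) is weighted
# about ten times more than one far from any border (d1 = d2 = 30):
# w[0] ≈ 10.23, w[1] ≈ 1.0
wc = np.ones(2)
w = unet_weight_map(wc, np.array([1.0, 30.0]), np.array([1.0, 30.0]))
```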
The network was trained using stochastic gradient descent with a high momentum of 0.99. Rather than a large batch, the authors favored large input tiles, reducing the batch to a single image; the high momentum lets many previously seen training samples influence each update. The use of unpadded convolutions meant that the output segmentation map was smaller than the input, which the overlap-tile strategy accounted for.
U-Net demonstrated its effectiveness on several competitive benchmarks.
| Challenge | Dataset | Metric | U-Net Score | 2nd Best Score |
|---|---|---|---|---|
| ISBI 2012 | Neuronal Structures (EM) | Warping Error | 0.000353 | 0.000420 |
| ISBI 2015 | PhC-U373 (Cell Tracking) | Mean IoU | 92.0% | 83.0% |
| ISBI 2015 | DIC-HeLa (Cell Tracking) | Mean IoU | 77.5% | 46.0% |
The results on the ISBI 2015 cell tracking challenge were particularly striking. On the PhC-U373 dataset, U-Net achieved an average intersection over union (IoU) of 92%, compared to 83% for the second-best method. On the more challenging DIC-HeLa dataset, U-Net's advantage was even more dramatic: 77.5% vs. 46.0% [1]. These results established U-Net as the clear state of the art for biomedical image segmentation.
The skip connections in U-Net serve a fundamentally different purpose than the residual (additive) connections in ResNet. In ResNet, skip connections add the input to the output of a block, preserving gradient flow and easing optimization. In U-Net, skip connections concatenate feature maps from the encoder to the decoder, preserving spatial information that would otherwise be lost during downsampling.
This distinction is important. The encoder's feature maps at a given resolution level contain detailed spatial information (edges, textures, fine boundaries) that the max pooling operations discard. By concatenating these features with the upsampled decoder features, U-Net gives the decoder access to both high-level semantic information (from the decoder's upsampled features) and low-level spatial detail (from the encoder's skip connections). The result is segmentation masks with sharp, accurate boundaries.
Without skip connections, the decoder would have to reconstruct fine spatial details solely from the heavily compressed bottleneck representation, a task that proves very difficult in practice. Ablation studies consistently show that removing skip connections significantly degrades segmentation quality, particularly along object boundaries.
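The distinction between the two skip styles is easiest to see with array shapes; a toy NumPy comparison (illustrative only):

```python
import numpy as np

enc = np.random.rand(64, 56, 56)   # encoder features at one level (C, H, W)
dec = np.random.rand(64, 56, 56)   # upsampled decoder features, same shape

res_skip = enc + dec                            # ResNet: add, shape preserved
unet_skip = np.concatenate([enc, dec], axis=0)  # U-Net: stack channels

print(res_skip.shape, unet_skip.shape)  # → (64, 56, 56) (128, 56, 56)
```

Addition fuses the two signals into one tensor of unchanged width; concatenation keeps both intact and lets the following convolutions learn how to combine them.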
U-Net's impact on medical image segmentation has been enormous. Within a few years of its publication, U-Net and its variants became the dominant approach across virtually every medical imaging modality and anatomical target.
| Modality | Application | Example |
|---|---|---|
| CT | Organ segmentation | Liver, kidney, lung lobes |
| MRI | Brain tumor segmentation | Glioma in BraTS challenge |
| X-ray | Chest pathology detection | COVID-19 lung lesions |
| Pathology | Cell and tissue segmentation | Cancer grading from H&E slides |
| Ultrasound | Fetal measurement | Biometric measurements |
| Fundoscopy | Retinal vessel segmentation | Diabetic retinopathy screening |
| Microscopy | Cell counting and tracking | Fluorescence microscopy |
| Dermoscopy | Skin lesion segmentation | Melanoma detection |
The architecture's success in medical imaging stems from several properties that align with the domain's constraints. It works well with small training sets. It produces sharp segmentation boundaries. It is relatively simple to implement and train. And its encoder-decoder structure naturally handles the multi-scale nature of anatomical structures.
Perhaps the most surprising chapter in U-Net's story is its adoption as the core component of diffusion models for image generation.
Diffusion models generate images by learning to reverse a gradual noising process. During training, noise is progressively added to an image over many timesteps until it becomes pure Gaussian noise. The model learns to predict and remove the noise at each step, gradually recovering a clean image from random noise. The network that performs this denoising is called the "denoising backbone," and it must accept a noisy image (plus a timestep embedding indicating the noise level) and output a prediction of the noise or the denoised image.
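The forward (noising) process has a closed form, which can be sketched in NumPy; `add_noise` is an illustrative helper using the linear beta schedule reported in the DDPM paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form (DDPM forward process)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps   # the U-Net learns to predict eps given (x_t, t)

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule from Ho et al.
x0 = rng.standard_normal((1, 32, 32))   # stand-in for a training image
x_t, eps = add_noise(x0, t=999, betas=betas)  # at t=999, x_t is ~pure noise
```

Because `x_t` and the predicted noise have identical spatial shape, a same-resolution-in, same-resolution-out network like U-Net is a natural fit.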
U-Net turned out to be an ideal architecture for this denoising task for several reasons. The encoder-decoder structure allows the model to process the image at multiple scales, capturing both global composition and local details. The skip connections preserve the fine spatial information needed to produce sharp, detailed images. And the architecture naturally accepts and produces images of the same spatial resolution, which is exactly what a denoising network needs.
Ho et al. (2020) used a modified U-Net as the denoising backbone in their landmark Denoising Diffusion Probabilistic Models (DDPM) paper, which established the modern framework for diffusion-based image generation [5]. The modifications included self-attention layers at certain resolution levels, group normalization, and sinusoidal timestep embeddings injected into each residual block.
Stable Diffusion, built on the Latent Diffusion Model (LDM) framework of Rombach et al. at LMU Munich and released with support from Stability AI, uses a U-Net as its denoising backbone [6]. In the LDM architecture, a variational autoencoder (VAE) first compresses images from pixel space into a lower-dimensional latent space. The U-Net then performs the diffusion process in this latent space rather than in pixel space, dramatically reducing computational cost.
The U-Net in Stable Diffusion is substantially larger and more complex than the original biomedical U-Net. It incorporates self-attention and cross-attention layers (for text conditioning), residual connections, and group normalization. The cross-attention layers allow the U-Net to be conditioned on text embeddings from a CLIP text encoder, enabling text-to-image generation. Despite these modifications, the fundamental U-shaped encoder-decoder structure with skip connections remains the same.
Stable Diffusion 1.x and 2.x both used the U-Net backbone. Stable Diffusion 3 (released in 2024) transitioned to a Diffusion Transformer (DiT) architecture, replacing U-Net with a transformer-based design. Similarly, DALL-E 3 uses a transformer backbone. However, U-Net-based diffusion models remain widely deployed and actively developed as of 2025.
| Feature | Original U-Net (2015) | Diffusion U-Net (e.g., Stable Diffusion) |
|---|---|---|
| Purpose | Biomedical image segmentation | Image denoising for generation |
| Input | Microscopy image | Noisy latent + timestep + text embedding |
| Output | Segmentation mask (per-pixel class) | Predicted noise or denoised latent |
| Attention | None | Self-attention and cross-attention |
| Normalization | None (or batch norm in variants) | Group normalization |
| Conditioning | None | Timestep embedding, text embedding |
| Skip connections | Concatenation | Concatenation (same principle) |
| Parameters | ~31M | ~860M (SD 1.5) |
| Training data | 30-50 annotated images | Billions of image-text pairs |
U-Net's modular design has inspired a large family of variants, each targeting specific limitations or application domains.
Cicek et al. (2016) extended U-Net to three dimensions for volumetric segmentation of medical images such as CT and MRI scans [7]. The 3D U-Net replaces all 2D operations (convolutions, pooling, upsampling) with their 3D counterparts, allowing the network to capture spatial context in all three dimensions. This is important for organs and structures that have complex 3D shapes. 3D U-Net can also be trained in a semi-supervised fashion, learning from sparsely annotated volumes where only a few slices have ground truth labels.
Milletari et al. (2016) proposed V-Net, a 3D variant that introduced residual connections within each encoder and decoder block and replaced the max pooling operations with convolutional downsampling [8]. V-Net also introduced the Dice loss function for training, which directly optimizes the overlap between the predicted and ground truth segmentation, addressing the class imbalance problem that is common in medical segmentation (where the foreground object often occupies a small fraction of the image).
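One common form of the soft Dice loss can be sketched in NumPy (V-Net's exact formulation squares the terms in the denominator; this simplified variant keeps the same intuition):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|), on probability maps."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Perfect overlap → loss 0; disjoint masks → loss ~1, regardless of how
# small the foreground is relative to the image.
a = np.zeros((64, 64))
a[:4, :4] = 1.0
print(round(dice_loss(a, a), 6))        # → 0.0
print(round(dice_loss(a, 1.0 - a), 6))  # → 1.0
```

Because the loss depends only on the overlap ratio, a tiny foreground object contributes as strongly as a large one, which is exactly the class-imbalance property described above.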
Oktay et al. (2018) introduced attention gates into the U-Net skip connections [9]. Instead of naively concatenating encoder features with decoder features, attention gates learn to selectively emphasize informative features and suppress irrelevant ones. The attention mechanism uses the decoder features as a gating signal to filter the encoder features before concatenation. This is particularly useful when the target structure is small relative to the image (e.g., a pancreas in an abdominal CT scan), as it helps the network focus on the relevant region.
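The gating computation can be sketched schematically; the weight matrices below are random stand-ins for learned parameters, features are flattened to (channels, pixels), and the per-pixel additive attention is a simplification of the published gate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, Wx, Wg, psi):
    """Additive attention gate: the decoder signal g produces a per-pixel
    coefficient in (0, 1) that rescales the encoder features x."""
    q = np.maximum(Wx @ x + Wg @ g, 0.0)  # ReLU(Wx·x + Wg·g)
    alpha = sigmoid(psi @ q)              # shape (1, N): one weight per pixel
    return alpha * x                      # attenuated encoder features

C, Ci, N = 8, 4, 16                       # channels, gate width, pixels
x, g = rng.random((C, N)), rng.random((C, N))
Wx, Wg = rng.random((Ci, C)), rng.random((Ci, C))
psi = rng.random((1, Ci))
gated = attention_gate(x, g, Wx, Wg, psi)  # same shape as x
```

The gated output, rather than the raw encoder map, is what gets concatenated into the decoder.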
Isensee et al. (2021) developed nnU-Net ("no new net"), a self-configuring framework that automatically adapts the U-Net architecture and training pipeline to any given segmentation dataset [10]. Rather than proposing a new architecture, nnU-Net systematically optimizes preprocessing, data augmentation, network topology, training schedule, and post-processing based on dataset-specific properties such as image size, spacing, and class distribution. nnU-Net has won or placed competitively in numerous medical image segmentation challenges and is widely considered the strongest out-of-the-box method for medical image segmentation as of 2025. Its success demonstrated that careful engineering of the training pipeline often matters more than architectural novelty.
Cao et al. (2021) proposed Swin-UNet, which replaces the convolutional blocks in U-Net with Swin Transformer blocks [11]. The architecture maintains the U-shaped encoder-decoder structure and skip connections but uses shifted window self-attention instead of convolutions for feature extraction. Swin-UNet demonstrated competitive performance on medical image segmentation benchmarks, showing that the U-Net design principle (symmetric encoder-decoder with skip connections) is effective regardless of whether the building blocks are convolutional or attention-based.
Chen et al. (2021) proposed TransUNet, a hybrid architecture that uses a transformer encoder combined with a CNN-based U-Net decoder [12]. The transformer encoder captures global context through self-attention, while the CNN decoder recovers fine spatial details. This hybrid design aims to combine the global modeling capability of transformers with the localization precision of the U-Net decoder.
Zhou et al. (2018) proposed UNet++, which adds nested dense skip connections between the encoder and decoder [13]. Instead of a single skip connection at each level, UNet++ includes a series of intermediate dense blocks that progressively fuse features from different levels. This design reduces the semantic gap between encoder and decoder features before concatenation. UNet 3+ (Huang et al., 2020) further extended this idea by incorporating full-scale skip connections that aggregate features from all encoder and decoder levels.
| Variant | Year | Key Innovation | Primary Application |
|---|---|---|---|
| U-Net | 2015 | Encoder-decoder with concatenation skip connections | 2D biomedical segmentation |
| 3D U-Net | 2016 | 3D convolutions for volumetric data | CT/MRI volume segmentation |
| V-Net | 2016 | 3D with residual connections and Dice loss | Volumetric segmentation |
| Attention U-Net | 2018 | Attention gates on skip connections | Small structure segmentation |
| UNet++ | 2018 | Nested dense skip connections | Multi-scale segmentation |
| nnU-Net | 2021 | Self-configuring pipeline; no new architecture | Any medical segmentation task |
| Swin-UNet | 2021 | Swin Transformer blocks replace convolutions | Medical segmentation with attention |
| TransUNet | 2021 | Hybrid transformer encoder + CNN decoder | Organ segmentation |
Several design principles from U-Net have proven broadly applicable beyond its original domain.
Encoder-decoder symmetry. The idea of a symmetric encoder and decoder, where the decoder mirrors the encoder's structure, has become a standard pattern for dense prediction tasks. This symmetry ensures that the decoder has the capacity to reconstruct spatial information at each resolution level.
Multi-scale feature fusion. Combining features from different scales through skip connections is now recognized as essential for tasks that require both semantic understanding and spatial precision. This principle appears in architectures like Feature Pyramid Networks (FPN) for object detection and in many semantic segmentation models.
Working with small datasets. U-Net showed that heavy data augmentation, appropriate architecture design, and carefully weighted loss functions can enable strong performance even with extremely limited training data. This lesson is especially relevant in specialized domains where annotation is expensive.
Architecture generality. The U-Net design has proven remarkably versatile. Its basic structure has been successfully applied to tasks far removed from its original biomedical context: image generation (diffusion models), image-to-image translation, super-resolution, denoising, inpainting, depth estimation, and even audio processing. This versatility suggests that the encoder-decoder structure with skip connections captures a fundamental pattern for learning spatial transformations.
U-Net remains one of the most practically important architectures in deep learning, though its role is evolving.
In medical imaging, U-Net and its variants (especially nnU-Net) continue to dominate segmentation benchmarks and clinical applications. The architecture's simplicity, reliability, and extensive validation make it the go-to choice for practitioners. New variants incorporating transformer components continue to be published regularly.
In generative AI, U-Net's role is shifting. While it served as the denoising backbone in the first wave of successful diffusion models (Stable Diffusion 1.x and 2.x, Imagen, DALL-E 2), newer systems are transitioning to transformer-based architectures (DiT) that scale more favorably with increased parameters and data. However, billions of U-Net-based diffusion model inferences still run daily across applications like image generation, video synthesis, and creative tools.
The broader principle that U-Net embodies, combining hierarchical feature extraction with skip connections for spatial precision, remains deeply embedded in the design vocabulary of modern deep learning. Whether implemented with convolutions, attention mechanisms, or hybrid approaches, the U-shaped encoder-decoder pattern continues to be one of the most successful architectural templates in the field.