ControlNet
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 3,596 words
ControlNet is a neural network architecture that adds spatial and structural control to large pretrained text-to-image diffusion models. It was introduced in February 2023 by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala at Stanford University in the paper "Adding Conditional Control to Text-to-Image Diffusion Models" [1]. The architecture allows users to constrain image generation with auxiliary inputs such as Canny edge maps, depth maps, human pose skeletons, semantic segmentation masks, or scribbles, in addition to the usual text prompt. ControlNet is most commonly paired with Stable Diffusion, which provided the first widely available open-weights text-to-image model.
ControlNet works by freezing a copy of the pretrained diffusion U-Net and training a parallel trainable copy of its encoder, connected back to the frozen network through "zero convolutions," 1x1 convolution layers initialized to zero weights and biases. The zero initialization keeps the auxiliary branch from disrupting the base model at the start of training, which lets ControlNet be fine-tuned reliably even on relatively small datasets. The paper received the Marr Prize / Best Paper Award at the International Conference on Computer Vision (ICCV) 2023 [2][3].
Diffusion models, originally proposed by Sohl-Dickstein and colleagues in 2015 and refined into Denoising Diffusion Probabilistic Models by Ho et al. in 2020, learn to generate samples by reversing a gradual noising process applied to training data [4]. By 2022 they had become the dominant approach to text-to-image generation, powering systems including DALL-E 2, Google's Imagen, and Stable Diffusion. Stable Diffusion, released in August 2022 by Rombach and colleagues at the CompVis group with collaborators at Runway and Stability AI, used a latent diffusion formulation that ran the denoising process in a compressed VAE latent space rather than at full pixel resolution. Crucially, its model weights were released openly, which let researchers and hobbyists experiment with fine-tuning, custom samplers, and downstream tools.
Text prompts alone gave impressive results, but they were a blunt instrument when an artist or designer needed precise spatial control. A prompt could request "a cat sitting on a striped rug," yet the cat's pose, the rug's perspective, and the camera angle remained at the model's discretion. Earlier customization techniques addressed identity and style rather than layout. Textual inversion, proposed by Gal et al. in 2022, learned new word embeddings from a handful of reference images. DreamBooth, from Ruiz et al. at Google, fine-tuned the diffusion weights to bind a unique token to a specific subject. LoRA (Low-Rank Adaptation) injected small trainable matrices into the cross-attention layers to teach a style or character with minimal parameters. None of these methods provided pixel-level spatial conditioning, so they could not, for example, force a generated person to match a specific pose or a generated room to match a specific architectural plan.
Classifier-free guidance and image-to-image initialization gave partial control. The img2img mode in Stable Diffusion accepts a reference image, encodes it to latents, adds noise, and then denoises with a prompt; this preserves rough composition but blurs and distorts geometry, especially at strong denoising. ControlNet was designed to fill this gap by accepting structured visual inputs alongside the prompt and respecting their geometry throughout the diffusion process [1].
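The trade-off in img2img between keeping composition and following the prompt can be sketched as a mapping from a denoising strength to the number of denoising steps that are run. This is a minimal sketch under an assumed convention (the strength is the fraction of the schedule that is re-run); the function name `img2img_schedule` is hypothetical and does not come from any particular library.

```python
def img2img_schedule(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (start_index, steps_to_run) for an img2img pass.

    Assumed convention: strength=0.0 keeps the reference latents untouched,
    strength=1.0 noises them fully and runs the whole schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Number of denoising steps actually executed.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Index in the schedule where denoising starts (earlier index = more noise).
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run


half = img2img_schedule(50, 0.5)   # re-run the second half of the schedule
full = img2img_schedule(50, 1.0)   # behaves like plain text-to-image
```

At high strength almost the entire schedule is re-run from heavy noise, which is why the geometry of the reference image is largely lost.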
The central design idea of ControlNet is to add a side branch to the denoising U-Net while leaving the original network's weights untouched. Given a pretrained diffusion U-Net with encoder, middle, and decoder blocks, ControlNet creates a trainable copy of the encoder and the middle block. The copy is initialized from the frozen weights, so it starts with the same representational capacity as the base model rather than from random noise [1].
The trainable copy receives the conditioning image (for example, a Canny edge map) as input. The conditioning image is first projected to the same spatial resolution as the model's noisy latent through a small four-layer convolutional preprocessor that ends in a stride-1 layer. The encoder copy then processes the combined signal. After each block, the output is passed through a zero convolution, a 1x1 convolution layer whose weights and biases are both initialized to zero. The zero-convolution outputs are added into the corresponding skip connections feeding the frozen U-Net's decoder.
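The effect of a zero convolution on a skip connection can be illustrated with a toy example in plain Python. This is a sketch rather than the authors' implementation: features are stored as `[channel][pixel]` lists, and the helper names (`conv1x1`, `zero_conv_params`, `add_residual`) are hypothetical.

```python
def conv1x1(feature, weight, bias):
    """Apply a 1x1 convolution to a feature map stored as [channel][pixel]."""
    in_channels = len(feature)
    num_pixels = len(feature[0])
    return [
        [
            bias[o] + sum(weight[o][i] * feature[i][p] for i in range(in_channels))
            for p in range(num_pixels)
        ]
        for o in range(len(weight))
    ]


def zero_conv_params(in_channels, out_channels):
    """Weights and biases both start at zero, as described above."""
    weight = [[0.0] * in_channels for _ in range(out_channels)]
    bias = [0.0] * out_channels
    return weight, bias


def add_residual(skip, residual):
    """Add the ControlNet residual into a frozen skip connection."""
    return [[s + r for s, r in zip(s_row, r_row)] for s_row, r_row in zip(skip, residual)]


# Toy features: 2 channels, 3 pixels each (values are arbitrary).
frozen_skip = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
control_features = [[3.0, 1.0, -2.0], [0.25, 4.0, 1.0]]

weight, bias = zero_conv_params(in_channels=2, out_channels=2)
residual = conv1x1(control_features, weight, bias)
combined = add_residual(frozen_skip, residual)
```

With freshly initialized parameters the residual is all zeros, so `combined` equals `frozen_skip` no matter what the trainable branch computes.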
Because the zero convolutions output exactly zero at initialization, the entire ControlNet branch contributes nothing on the very first training step. The frozen U-Net therefore produces the same output it would have produced without conditioning, and any gradient that flows back through ControlNet is well behaved rather than dominated by random noise. The authors argue that this property is what allows the architecture to be trained on small custom datasets without corrupting the base model. Mathematically, if the frozen feature is y = F(x) and the ControlNet branch produces y_c = Z(F'(x + Z(c))), then at step zero Z(.) = 0 and y_c = 0, so the combined output y + y_c equals y exactly [1].
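A natural question is how a layer initialized to zero can learn at all. The usual argument can be written out for a single scalar weight and bias. In this sketch the two zero convolutions are given subscripts, $Z_1$ on the condition and $Z_2$ on the branch output, and $\eta$ denotes a learning rate; these symbols are notational assumptions and do not appear in the text above.

```latex
\begin{aligned}
y_c &= Z_2\!\left(F'\!\left(x + Z_1(c)\right)\right), \qquad
Z_k(u) = w_k u + b_k, \qquad w_k = b_k = 0 \ \text{at step zero},\\[4pt]
\frac{\partial Z_k}{\partial w_k} &= u, \qquad
\frac{\partial Z_k}{\partial b_k} = 1, \qquad
\frac{\partial Z_k}{\partial u} = w_k = 0,\\[4pt]
w_2 &\leftarrow w_2 - \eta\,\frac{\partial \mathcal{L}}{\partial y_c}\,F'\!\left(x + Z_1(c)\right) \;\neq\; 0
\quad \text{whenever } \frac{\partial \mathcal{L}}{\partial y_c} \neq 0 \text{ and } F'(\cdot) \neq 0 .
\end{aligned}
```

The gradient with respect to the weight depends on the input $u$, not on the weight itself, so $w_2$ leaves zero after the first update. In this simplified model $Z_1$ only receives gradient through $w_2$, so it starts moving one step later.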
A characteristic behavior reported in the paper is the sudden convergence phenomenon: during training the model does not learn to follow the conditioning gradually. Instead, the network appears to ignore the control input for several thousand steps, then abruptly aligns its outputs to the condition, for example at step 6,133 in a run from the authors' Canny edge experiments. The phenomenon is generally attributed to the zero convolutions, which require many gradient updates before they grow large enough to influence the decoder.
During training, the authors randomly replace 50% of text prompts with empty strings. This forces the ControlNet to rely on the visual condition alone for guidance and prevents the text branch from carrying too much of the burden, which improves the strength of spatial alignment.
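The prompt-dropping step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' training code; the function name `drop_prompts` and the `seed` parameter are hypothetical.

```python
import random


def drop_prompts(prompts, drop_prob=0.5, seed=None):
    """Replace each prompt with an empty string with probability drop_prob."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    return ["" if rng.random() < drop_prob else prompt for prompt in prompts]


captions = ["a cat sitting on a striped rug"] * 1000
dropped = drop_prompts(captions, drop_prob=0.5, seed=0)
empty_fraction = dropped.count("") / len(dropped)  # close to 0.5 for a large batch
```

For the samples whose prompt is emptied, the conditioning image is the only remaining source of information about the content of the target image.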
The original ControlNet 1.0 release in February 2023 shipped eight conditioning modalities for Stable Diffusion 1.5, with additional modalities added in ControlNet 1.1 later that year [5]. Each modality is a separately trained ControlNet checkpoint paired with a preprocessor that converts an arbitrary input image into the expected condition. The table below lists the eight 1.0 modalities together with anime line drawing, which arrived with 1.1.
| Modality | Preprocessor | Typical use |
|---|---|---|
| Canny edges | OpenCV Canny detector | Reproducing line structure of a reference photo |
| Hough lines (M-LSD) | Mobile Line Segment Detector | Architecture, interiors, perspective scenes |
| HED soft edges | Holistically-Nested Edge Detection | Painterly recoloring and stylizing |
| Sketch / scribble | Thinning + simplification | User-drawn input from rough strokes |
| Human pose | OpenPose body / hand / face keypoints | Pose-specified character generation |
| Semantic segmentation | ADE20K-style class map | Scene layout with explicit class regions |
| Depth | MiDaS monocular depth network | 3D-aware composition, room layouts |
| Normal map | Computed from MiDaS depth | Surface-aware re-lighting and stylization |
| Anime line drawing | Manga-line preprocessor | Coloring of anime / cartoon line art |
The Canny model uses the classic Canny edge detector from 1986. The pose model uses OpenPose, the body and hand keypoint estimator developed at Carnegie Mellon. The depth model uses Intel's MiDaS monocular depth network. The semantic segmentation model is trained against the ADE20K protocol, which defines 150 scene-parsing classes. Each preprocessor is shipped with the ControlNet repository so that users can run an off-the-shelf computer vision model on their reference image to generate the input ControlNet expects [5].
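As an illustration of what a preprocessor does, the sketch below derives a normal map from a depth map using finite differences, which is the general idea behind the "computed from depth" entry in the table. It is a simplified stand-in, not the preprocessor shipped in the ControlNet repository. The function name `normals_from_depth` and the convention `n ∝ (-dz/dx, -dz/dy, 1)` are assumptions.

```python
import math


def normals_from_depth(depth):
    """depth: list of rows (H x W floats). Returns H x W unit normals (nx, ny, nz)."""
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            # Central differences, clamped to one-sided differences at the border.
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            # Normalize (-dz/dx, -dz/dy, 1) to unit length.
            length = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            row.append((-dzdx / length, -dzdy / length, 1.0 / length))
        normals.append(row)
    return normals


flat = [[2.0] * 4 for _ in range(3)]                     # constant depth
ramp = [[float(x) for x in range(4)] for _ in range(3)]  # depth grows along x
flat_normals = normals_from_depth(flat)
ramp_normals = normals_from_depth(ramp)
```

A flat surface yields normals pointing straight at the camera, while the ramp tilts every normal by 45 degrees along the x axis.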
The authors train each ControlNet on a separate dataset that pairs natural images with the corresponding condition. Edge models are trained on roughly 3 million image-edge pairs, semantic segmentation models on roughly 164,000 ADE20K samples extended with internet imagery, and pose models on roughly 80,000 image-keypoint pairs because labeled pose data is scarcer than auto-extracted edges. The paper demonstrates that the same architecture is robust across this range, training successfully on datasets smaller than 50,000 images and larger than 1 million images [1].
Training runs use the standard latent diffusion noise prediction loss, augmented only by the additional ControlNet branch. The authors report a single-condition training cost on the order of 600 NVIDIA A100 GPU hours per modality for the original 1.0 release, with finer-grained 1.1 retraining adding 200 to 2,160 GPU hours per checkpoint depending on dataset size and modality difficulty [5]. Compatibility was originally limited to Stable Diffusion 1.5, with SDXL checkpoints arriving later in 2023 from both the community and Stability AI.
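The noise prediction objective with the extra conditioning input can be written as follows. The symbols are introduced here for illustration: $z_t$ is the noisy latent at timestep $t$, $c_t$ the text prompt, $c_f$ the task-specific condition (for example the encoded edge map), and $\epsilon_\theta$ the combined frozen U-Net and ControlNet branch.

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \bigl\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c_t,\, c_f\right) \bigr\rVert_2^2 \right]
```

Only the parameters of the ControlNet branch receive gradient updates. The frozen U-Net inside $\epsilon_\theta$ is left unchanged.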
The sudden convergence phenomenon described in the architecture section is reproducible across modalities. Practitioners observed that running training too short can yield checkpoints that ignore the condition, while running it past convergence usually produces clean alignment with diminishing returns from further training.
In April 2023, Zhang released ControlNet 1.1 as a nightly version of the same GitHub repository, then promoted it to stable later that year. ControlNet 1.1 was not a single new model but a refresh of all 1.0 checkpoints plus several new ones, retrained on better data and renamed to a stricter convention. Files now followed the pattern control_v11<status>_sd15_<name>, where the status code is p for production or e for experimental, optionally preceded by f1 to mark a bug-fix release on top of an earlier checkpoint (as in control_v11f1p_sd15_depth) [5].
The production checkpoints for ControlNet 1.1 are control_v11p_sd15_canny, control_v11p_sd15_mlsd, control_v11f1p_sd15_depth, control_v11p_sd15_normalbae, control_v11p_sd15_seg, control_v11p_sd15_inpaint, control_v11p_sd15_lineart, control_v11p_sd15s2_lineart_anime, control_v11p_sd15_openpose, control_v11p_sd15_scribble, and control_v11p_sd15_softedge. The three experimental checkpoints in 1.1 are control_v11e_sd15_shuffle for content shuffling, control_v11e_sd15_ip2p for InstructPix2Pix-style instruction following, and control_v11f1e_sd15_tile for high-resolution tile-based upscaling.
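The naming convention is regular enough to parse mechanically. The sketch below is an illustrative parser inferred only from the checkpoint names listed above. The regular expression, the field names, and the function `parse_checkpoint` are assumptions, and the meaning of the `s2` suffix in `sd15s2` is not stated in the text, so it is simply kept as part of the base-model field.

```python
import re

# Pattern inferred from the names above: control_v11[f1](p|e)_sd15[s2]_<name>
CHECKPOINT_RE = re.compile(
    r"^control_v(?P<version>\d+)"   # e.g. "11" for ControlNet 1.1
    r"(?P<fix>f\d+)?"               # optional bug-fix marker such as "f1"
    r"(?P<status>[pe])"             # p = production, e = experimental
    r"_(?P<base>sd15(?:s\d+)?)"     # base model tag, optionally "sd15s2"
    r"_(?P<name>[a-z0-9_]+)$"       # modality name, e.g. "lineart_anime"
)
STATUS_LABELS = {"p": "production", "e": "experimental"}


def parse_checkpoint(filename):
    """Split a ControlNet 1.1 style checkpoint name into its parts."""
    match = CHECKPOINT_RE.match(filename)
    if match is None:
        raise ValueError(f"not a ControlNet 1.1 style name: {filename!r}")
    return {
        "version": match["version"],
        "bug_fix": match["fix"],  # None when no f<n> marker is present
        "status": STATUS_LABELS[match["status"]],
        "base": match["base"],
        "name": match["name"],
    }


depth_info = parse_checkpoint("control_v11f1p_sd15_depth")
anime_info = parse_checkpoint("control_v11p_sd15s2_lineart_anime")
```

A helper like this makes it easy to filter a model folder down to production checkpoints only, or to spot names that do not follow the 1.1 convention.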
Notable changes in 1.1 included swapping the older HED preprocessor for a soft-edge model, retraining several checkpoints after Zhang found that earlier datasets had contained quality issues such as a small group of grayscale human images duplicated thousands of times, and improving robustness to imperfect preprocessor outputs. The Inpaint and Tile models in particular became popular because they extended ControlNet beyond strict structural conditioning into more general image enhancement and editing.
Stability AI released SDXL in July 2023, replacing Stable Diffusion 1.5 with a larger U-Net (roughly 2.6 billion parameters) and a higher native output resolution of 1024x1024. ControlNet checkpoints had to be retrained to match the new backbone. Both the open-source community and Stability AI released SDXL ControlNet variants over the second half of 2023.
In August 2023, Stability AI released a set of four official Control-LoRA checkpoints for SDXL covering Canny, Depth, Recolor, and Sketch, formatted as LoRA modules rather than full ControlNet branches. Where the original SDXL ControlNet weights were roughly 5 GB on disk, Stability's Control-LoRA versions came in 800 MB rank-256 and 400 MB rank-128 sizes, trading a small amount of quality for substantial savings in disk and VRAM usage [6]. Independent groups including Diffusers (canny-sdxl-1.0, depth-sdxl-1.0-small) and InstantX produced additional SDXL ControlNets, including pose models that the community had been requesting.
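The size difference follows from how a low-rank factorization stores a weight matrix. The arithmetic below is a rough sketch: the layer width of 1280 is a hypothetical example, and a real Control-LoRA file may also contain tensors that are not factorized, so the numbers illustrate the scaling rather than reproduce the published file sizes.

```python
def full_params(d_out, d_in):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out, d_in, rank):
    """Parameters in two factors of shapes (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 1280  # hypothetical layer width, chosen only for illustration

dense = full_params(d_out, d_in)             # 1,638,400
rank_256 = low_rank_params(d_out, d_in, 256)  # 655,360
rank_128 = low_rank_params(d_out, d_in, 128)  # 327,680
```

The parameter count grows linearly with the rank, which is consistent with the rank-128 files being about half the size of the rank-256 files.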
ControlNet's release in February 2023 was rapidly followed by a wave of related conditioning approaches. The table below compares the main families.
| Method | Year | Approach | Strength |
|---|---|---|---|
| ControlNet | 2023 | Locked U-Net plus trainable encoder copy with zero convolutions | High-fidelity spatial control |
| T2I-Adapter | 2023 | Lightweight adapter (around 77M params) injected at U-Net features | Smaller and faster than ControlNet |
| Composer | 2023 | Train a single diffusion model jointly on many decomposed factors | Composable conditions in one model |
| Uni-ControlNet | 2023 | Two adapters covering all local and global conditions | One model handles many condition types |
| Control-LoRA | 2023 | LoRA-style decomposition of ControlNet weights | Smaller files, easier merging |
| IP-Adapter | 2023 | Image prompt adapter with decoupled cross-attention | Conditions on a reference image's content |
| InstantID | 2024 | IdentityNet combining IP-Adapter with face-landmark ControlNet | Face-preserving generation from one photo |
T2I-Adapter (arXiv:2302.08453), from Mou and colleagues at Tencent ARC, was posted to arXiv six days after ControlNet and reached similar conditioning quality with about 77 million parameters and 300 MB of storage [7]. Composer, from Huang and colleagues at Alibaba's DAMO Academy (arXiv:2302.09778), trained a single diffusion model jointly on a large set of decomposed image factors and remixed them at inference time [8]. Uni-ControlNet, accepted to NeurIPS 2023, used two lightweight adapters to handle all local conditions and all global conditions through a single backbone, regardless of how many condition types were combined at inference [9]. IP-Adapter, also from Tencent (Ye et al. 2023), generalized the idea further by accepting an arbitrary reference image as a visual prompt, and InstantID combined an IP-Adapter with a face-landmark ControlNet to preserve identity from a single face photo [10].
The most widely used integration for the original Stable Diffusion 1.5 ControlNet is sd-webui-controlnet, a third-party extension for AUTOMATIC1111's Stable Diffusion web UI maintained by GitHub user Mikubill. The extension exposes all ControlNet 1.0 and 1.1 checkpoints, runs the corresponding preprocessors automatically, and supports stacked ControlNets so that a user can combine, for example, an OpenPose constraint with a depth map [11]. By 2024 the extension had over 17,000 GitHub stars and was a standard feature of most Stable Diffusion installations.
ComfyUI, the node-based interface for diffusion workflows, includes native ControlNet nodes out of the box and supports T2I-Adapter, ControlLoRA, ControlLLLite, SparseCtrls, and SVD-ControlNets through built-in or community node packs. InvokeAI, a polished open-source GUI for Stable Diffusion, similarly ships with built-in ControlNet support. The Hugging Face Diffusers library exposes ControlNet through pipelines including StableDiffusionControlNetPipeline, StableDiffusionXLControlNetPipeline, StableDiffusionControlNetInpaintPipeline, and the multi-ControlNet variants, and provides a controlnet_conditioning_scale parameter for tuning how strongly the condition is enforced [12].
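One plausible way to picture the conditioning scale and stacked ControlNets is as arithmetic on the residuals added to the frozen skip connections: each ControlNet's residuals are multiplied by its scale, the scaled residuals are summed, and the sum is added to the skips. The sketch below uses plain Python lists as stand-ins for tensors. The function names are hypothetical and are not part of the Diffusers API.

```python
def scale_residuals(residuals, conditioning_scale):
    """Multiply every residual value of one ControlNet by its scale."""
    return [[v * conditioning_scale for v in block] for block in residuals]


def merge_controlnets(residual_sets, scales):
    """Sum the scaled residuals of several stacked ControlNets block by block."""
    merged = None
    for residuals, scale in zip(residual_sets, scales):
        scaled = scale_residuals(residuals, scale)
        if merged is None:
            merged = scaled
        else:
            merged = [
                [a + b for a, b in zip(m_block, s_block)]
                for m_block, s_block in zip(merged, scaled)
            ]
    return merged


def apply_to_skips(skips, residuals):
    """Add merged residuals to the frozen U-Net skip connections."""
    return [[s + r for s, r in zip(s_block, r_block)] for s_block, r_block in zip(skips, residuals)]


# Toy values: two skip blocks, two stacked ControlNets (pose and depth).
skips = [[1.0, 2.0], [3.0]]
pose_residuals = [[0.5, 0.5], [1.0]]
depth_residuals = [[2.0, 0.0], [-1.0]]

merged = merge_controlnets([pose_residuals, depth_residuals], scales=[1.0, 0.5])
conditioned = apply_to_skips(skips, merged)
unconditioned = apply_to_skips(
    skips, merge_controlnets([pose_residuals, depth_residuals], scales=[0.0, 0.0])
)
```

Setting a scale to zero removes that ControlNet's influence entirely, and intermediate values trade adherence to the condition against the freedom of the base model.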
Zhang himself authored two further tools that lean on ControlNet's design lessons. Fooocus, released in August 2023, is a simplified Stable Diffusion XL front end that hides most knobs and uses GPT-2 to expand prompts; it integrates ControlNet-style conditioning under the hood. Stable Diffusion WebUI Forge, released in early 2024, is a fork of AUTOMATIC1111's web UI optimized for memory efficiency and ControlNet performance, with a UNet patcher system that allows ControlNets, LoRAs, and other adapters to be applied without rebuilding the model graph each time [13].
Proprietary systems also adopted similar conditioning ideas after ControlNet's release. Adobe Firefly added structure reference and style reference features in 2023 and 2024, Midjourney v6 introduced character reference and style reference modes, and Runway's video models accepted pose and depth conditions, although none of these systems publicly disclosed whether they reused ControlNet code or simply borrowed the concept.
ControlNet expanded the practical range of image generation and image-to-image workflows. Architects and interior designers used the depth and M-LSD ControlNets to turn rough sketches and 3D mock-ups into photorealistic renderings while preserving floor plans and sight lines. Fashion designers fed pose skeletons into the OpenPose ControlNet to generate consistent figures across a clothing lookbook. Storyboard artists and animators used pose plus scribble conditioning to keep characters on-model across many frames. Visual effects studios and indie game developers adopted ControlNet to generate environment art, texture references, and concept variations from depth-rendered geometry.
In 2D art tools, plugins for Krita and Photoshop wrapped ControlNet pipelines so that an artist could paint a rough composition and have the model fill in details while respecting line work. Avatar generation services used ControlNet for face-consistent stylization, and InstantID later took this further by combining a face-landmark ControlNet with an IP-Adapter for one-photo identity preservation. In scientific visualization, researchers experimented with ControlNet as a way to render molecular structures, fluid simulations, and microscopy outputs in a controllable artistic style.
Video extensions also built on the same idea. Sparse-frame ControlNets and motion ControlNets were used with AnimateDiff and Stable Video Diffusion to keep character motion aligned with reference dance footage or pose sequences.
Because ControlNet runs both the frozen U-Net and the trainable copy of its encoder forward at inference, it roughly doubles the compute cost of one diffusion step compared with the base model. Memory overhead is typically smaller than a 2x increase because only the encoder half is duplicated, but practitioners running on consumer GPUs still report measurable VRAM pressure when stacking multiple ControlNets. The Forge web UI is one practical response: it implements aggressive UNet patching and offloading to keep multiple ControlNets resident at once on cards with 8 to 12 GB of VRAM [13].
LoRA-style ControlNets such as Stability's Control-LoRA and the community's ControlLLLite reduce the on-disk size of each conditioning model by an order of magnitude, although they often sacrifice some conditioning fidelity for the savings. Distillation methods including ControlNet-XS and ControlNeXt have been proposed to compress the architecture further by stripping away parts of the encoder copy that contribute little to the output.
ControlNet was met with immediate enthusiasm in the open-source generative AI community. Within weeks of release, demonstrations of pose-controlled and edge-controlled image generation were widely shared on the Stable Diffusion subreddit, on X (formerly Twitter), and in publications including The Verge and Ars Technica. The paper accumulated several thousand citations within its first year on Google Scholar and was widely treated as the reference architecture for adding any structured conditioning signal to a frozen diffusion model.
At ICCV 2023 in Paris, the paper received the Marr Prize / Best Paper Award, putting it alongside earlier Marr Prize winners such as the Mask R-CNN paper [2][3]. Lead author Lvmin Zhang, widely known online as lllyasviel on GitHub, became one of the most influential individual contributors in the open-source diffusion ecosystem; alongside ControlNet he authored the manga-line preprocessor used by the Anime Lineart model, the Fooocus interface, the Stable Diffusion WebUI Forge fork, and contributions to layered diffusion control. Co-authors Anyi Rao and Maneesh Agrawala had backgrounds in cinematic video understanding and human-computer interaction respectively, which the authors credited as influencing the focus on practical user-driven control.
ControlNet's broader influence is visible in the family of follow-up papers it inspired: T2I-Adapter, Composer, UniControl, Uni-ControlNet, IP-Adapter, ControlNet-XS, ControlNet++, ControlNeXt, and InstantID all build on the basic recipe of pairing a frozen large diffusion backbone with a smaller trainable conditioning network [7][8][9][10]. Major commercial systems including Adobe Firefly, Stable Diffusion 3, Flux, Hunyuan-DiT, and others now expose ControlNet-style structure and reference inputs as a standard feature.
The original ControlNet design has several practical limitations. Each conditioning modality requires a separately trained model, which inflates the total disk footprint when a user wants to combine many control types. The unified successors Uni-ControlNet and UniControl partly address this by sharing weights across modalities, but they have not displaced the original per-modality checkpoints in mainstream tools.
Quality of the generated image is bounded by the quality of the conditioning input. A noisy Canny edge map, an OpenPose skeleton with missing keypoints, or a depth map with halo artifacts will all propagate into the output. Sparse or ambiguous conditions, such as a very rough scribble or a low-resolution depth map, often fail to constrain the model strongly enough and produce images that drift away from the user's intent.
Residual artifacts can also arise from the locked-versus-trainable mismatch. Because only the encoder copy is fine-tuned, conditioning information has to be smuggled into the frozen decoder through the skip connections, which sometimes manifests as faint texture inconsistencies or color shifts at object boundaries. Practitioners often work around this by reducing the controlnet_conditioning_scale or by mixing ControlNet output with a plain text-only generation in latent space.
Finally, ControlNet's compute cost roughly doubles a base diffusion step, and using multiple ControlNets at once multiplies the overhead. Subsequent research, including ControlNet-XS, Control-LoRA, and ControlNeXt, has aimed at reducing this cost while preserving conditioning fidelity, but the basic two-branch architecture remains the baseline against which lighter conditioners are judged for expressiveness [1][6].