Wan 2.1-VACE
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 2,762 words
Wan 2.1-VACE (also written Wan2.1-VACE) is an open-weights video creation and editing model released by Alibaba's Tongyi Lab on May 14, 2025. The name VACE stands for Video All-in-One Creation and Editing, and the model is positioned as the first open-source system to combine multiple video generation and editing tasks within a single unified framework. It builds on the Wan 2.1 base video generation model, adding a multimodal conditioning interface that handles text, image, video, and mask inputs together.
The release ships in two parameter sizes, 1.3 billion and 14 billion, both available for free download on Hugging Face, GitHub, and Alibaba's ModelScope platform under the Apache 2.0 license. The 1.3B variant targets consumer hardware and runs at 480P, while the 14B variant supports both 480P and 720P. VACE supports reference-to-video generation, video-to-video editing, masked video-to-video editing, character animation, video inpainting, outpainting, pose transfer, depth control, and spatio-temporal extension, all from one model rather than a stack of single-task tools.
The accompanying research paper, VACE: All-in-One Video Creation and Editing, was first published to arXiv on March 11, 2025 and accepted to ICCV 2025 on June 26, 2025. The model's release sat between the original Wan 2.1 launch in February 2025 and the later Wan 2.5 preview that arrived in September 2025, with the Wan 2.6 series following in December 2025.
Alibaba's video model program runs out of the Tongyi Lab inside Alibaba Cloud, where it sits alongside the broader Tongyi family of foundation models. The team had been building toward open-source video generation for several years before VACE, with early Wanxiang text-to-image work feeding into the first Wan video releases. When Wan 2.1 launched on February 25, 2025, it topped the VBench leaderboard with an overall score of 84.7 percent and accumulated more than 2.2 million downloads across Hugging Face and ModelScope within days.
That first Wan 2.1 release already covered text-to-video, image-to-video, and first-and-last-frame video generation in separate model checkpoints. The gap it left, and the gap VACE was designed to fill, was video editing. Most open-source video pipelines in early 2025 still required separate expert models for tasks like inpainting, outpainting, or pose-driven animation. Users had to chain these together with their own glue code, often losing temporal consistency between stages. VACE was the team's answer to that fragmentation.
The broader context here is the open-source video race that picked up speed across 2024 and 2025. Tencent's HunyuanVideo, Genmo's Mochi, and Lightricks' LTX-Video had all shown that open weights could approach the visual quality of closed systems, but none of them had bundled creation and editing into one model. VACE was Alibaba's attempt to differentiate on breadth rather than chase a marginal quality improvement on text-to-video alone.
VACE is built on top of the Wan 2.1 Diffusion Transformer (DiT) backbone, which uses Flow Matching as its generative framework. The base architecture for the 14B variant has 40 layers, 40 attention heads, a hidden dimension of 5120, and a T5 text encoder for multilingual conditioning. The 1.3B variant uses 30 layers, 12 heads, and a hidden dimension of 1536. Both share the Wan-VAE, a 3D causal variational autoencoder that can encode and decode 1080P video of arbitrary length while preserving temporal information.
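For orientation, the published backbone dimensions of the two variants can be summarized in a short config sketch. The field names below are illustrative only and do not correspond to the actual Wan 2.1 configuration files.

```python
from dataclasses import dataclass

@dataclass
class WanDiTConfig:
    """Illustrative summary of the published DiT backbone dimensions."""
    layers: int            # transformer blocks
    attention_heads: int   # heads per attention layer
    hidden_dim: int        # model width
    text_encoder: str = "T5"  # multilingual text conditioning

# Values as reported for the Wan 2.1 base models that VACE builds on.
WAN_DIT_CONFIGS = {
    "Wan2.1-1.3B": WanDiTConfig(layers=30, attention_heads=12, hidden_dim=1536),
    "Wan2.1-14B":  WanDiTConfig(layers=40, attention_heads=40, hidden_dim=5120),
}
```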
The VACE-specific contribution sits on top of this backbone in two pieces.
The Video Condition Unit, or VCU, is a unified input interface. Rather than defining a separate input format for each task, the team treated text, reference images, source videos, masks, and control signals as different fields of a common multimodal record. A request to inpaint a region of a video and a request to generate a video from a single reference image both flatten into the same VCU representation, just with different fields populated.
This design lets a single trained model serve every supported task without architectural switching. It also makes task combinations possible at inference time. A user can supply a reference image, a source video, and a mask in the same call, and the model treats this as a combined reference-and-edit instruction rather than two separate operations.
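A rough way to picture the VCU is as a single record whose optional fields cover every task. The sketch below is a conceptual illustration under that reading; the field names are hypothetical and do not reflect the actual data structures in the VACE codebase.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class VideoConditionUnit:
    """Conceptual sketch of a VCU-style request record; field names are hypothetical."""
    prompt: str                                    # text instruction
    reference_images: List[Any] = field(default_factory=list)  # identity references (R2V)
    source_video: Optional[Any] = None             # video to edit or extend (V2V / MV2V)
    mask_video: Optional[Any] = None               # per-frame masks restricting the edit (MV2V)
    control_video: Optional[Any] = None            # pose, depth, or optical-flow control signal

# An inpainting request and a reference-to-video request flatten into the same
# record type, just with different fields populated.
inpaint = VideoConditionUnit(prompt="replace the car with a bicycle",
                             source_video="clip.mp4", mask_video="car_mask.mp4")
r2v = VideoConditionUnit(prompt="the same character walking on a beach",
                         reference_images=["character.png"])
```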
The second piece is the Context Adapter, a structure that injects task-specific concepts into the DiT backbone using formalized representations of temporal and spatial dimensions. During training, the team froze the base Wan 2.1 weights and only trained the adapter layers. According to coverage of the technical report, this approach converged faster than full fine-tuning and reduced the risk of degrading the base model's generation quality while adding editing capabilities.
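The freeze-the-backbone, train-the-adapter recipe is a standard fine-tuning pattern. The PyTorch sketch below shows the general idea under the assumption that adapter modules can be identified by a naming convention; it is not the actual VACE training code.

```python
import torch

def freeze_backbone_train_adapters(model: torch.nn.Module, adapter_keyword: str = "adapter"):
    """Freeze all base DiT weights and leave only adapter parameters trainable.

    `adapter_keyword` is a hypothetical naming convention; the real VACE code
    identifies its Context Adapter layers differently.
    """
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Only the adapter parameters reach the optimizer; the base model stays fixed.
    return torch.optim.AdamW(trainable, lr=1e-4)
```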
The paper's abstract describes the result as performance on par with task-specific models across various subtasks, achieved within a single unified system. The authors of the VACE paper are Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu, all from Alibaba's Tongyi Lab.
VACE consolidates a range of video generation and editing tasks behind one API. The HuggingFace model card and the GitHub repository group these into three primary categories: Reference-to-Video (R2V), Video-to-Video (V2V), and Masked Video-to-Video (MV2V). In community marketing materials, these get rebranded as the "Anything" family of operations.
| Capability | Category | What it does |
|---|---|---|
| Text-to-video | Base generation | Generates a video clip from a text prompt, inherited from Wan 2.1 |
| Image-to-video | Base generation | Animates a still image into a video clip following a text prompt |
| Reference-to-video | R2V | Generates a new video that preserves the identity of one or more reference images |
| Video-to-video editing | V2V | Edits an existing video globally with a text prompt, including style transfer and recolorization |
| Masked video-to-video | MV2V | Edits only the masked region of a video, leaving the rest untouched |
| Video inpainting | MV2V | Fills in or replaces selected areas inside a video using mask-based control |
| Video outpainting | MV2V | Extends a video beyond its original frame boundaries spatially |
| Spatio-temporal extension | V2V | Extends a video forward, backward, or outward in time and space |
| Character animation | R2V | Animates a reference character following a driving pose or motion signal |
| Pose transfer | V2V | Transfers human pose sequences from one video onto a target subject |
| Depth control | V2V | Conditions generation on depth maps for scene structure |
| Motion control | V2V | Conditions generation on optical flow or motion fields |
| Visual text rendering | Base generation | Renders English and Chinese text inside generated video frames |
The community-facing names for these operations include Move-Anything (motion transfer), Swap-Anything (subject replacement), Reference-Anything (R2V from arbitrary references), Expand-Anything (spatial and temporal outpainting), and Animate-Anything (driving a reference character with a control signal). These map to combinations of VCU fields rather than separate models.
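Seen through the VCU, the "Anything" operations differ only in which fields of the request are populated. The mapping below is an illustrative summary using the hypothetical field names from the earlier sketch, not an official specification.

```python
# Illustrative mapping from community operation names to the VCU fields they populate.
ANYTHING_OPERATIONS = {
    "Move-Anything":      {"prompt", "source_video", "control_video"},                   # motion transfer
    "Swap-Anything":      {"prompt", "source_video", "mask_video", "reference_images"},  # subject replacement
    "Reference-Anything": {"prompt", "reference_images"},                                # R2V from references
    "Expand-Anything":    {"prompt", "source_video", "mask_video"},                      # spatial/temporal outpainting
    "Animate-Anything":   {"prompt", "reference_images", "control_video"},               # driven character animation
}
```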
A notable detail in the model card is that Wan 2.1 was the first video foundation model capable of rendering both Chinese and English text inside generated frames, a capability VACE inherits. Earlier open-source video systems either could not produce legible in-frame text at all or were limited to Latin scripts.
VACE shipped in two parameter sizes at launch, with an earlier preview release of the smaller variant available from March 2025.
| Variant | Parameters | Layers | Heads | Hidden dim | 480P | 720P | Frame budget | License |
|---|---|---|---|---|---|---|---|---|
| Wan2.1-VACE-1.3B | 1.3 billion | 30 | 12 | 1536 | Yes | Not recommended | 81 frames | Apache 2.0 |
| Wan2.1-VACE-14B | 14 billion | 40 | 40 | 5120 | Yes | Yes | 81 frames | Apache 2.0 |
| VACE-Wan2.1-1.3B-Preview | 1.3 billion | 30 | 12 | 1536 | Yes (preview) | No | 81 frames | Apache 2.0 |
| VACE-LTX-Video-0.9 | 0.9 billion | n/a | n/a | n/a | 512x768 only | No | 97 frames | RAIL-M |
The 1.3B model is designed to fit on consumer GPUs. According to the model card, the underlying Wan 2.1 1.3B text-to-video model needs about 8.19 GB of VRAM and can generate a 5-second 480P clip in roughly 4 minutes on a single RTX 4090 without further optimization. The 14B model is heavier and is the recommended choice for 720P output, where the 1.3B model becomes unstable.
The LTX-Video variant is a separate community-contributed checkpoint that ports the VACE framework onto Lightricks' LTX-Video base, not an Alibaba release. It uses the RAIL-M license inherited from its base model rather than Apache 2.0.
Weights for both Alibaba checkpoints are mirrored across Hugging Face under the Wan-AI organization, on the official Wan-Video/Wan2.1 GitHub repository, and on ModelScope. Quantized variants in FP8 and lower precisions appeared within weeks of the initial release through community packagers, including the Comfy-Org repackaged build for ComfyUI users and QuantStack's GGUF conversions.
The VACE codebase landed on GitHub under the ali-vilab/VACE repository on March 31, 2025, ahead of the Wan-branded model weights. Native ComfyUI support arrived shortly after the May 14 weight release, with workflow templates contributed by community members including Datou, T8star-Aix, and Kijai. Kijai in particular published the VACE node system that became the basis for most third-party workflows.
Integrations followed quickly across the rest of the open-source video stack.
| Framework | Integration | Notes |
|---|---|---|
| Diffusers (Hugging Face) | First-class | `DiffusionPipeline.from_pretrained("Wan-AI/Wan2.1-VACE-1.3B")` |
| ComfyUI | Native | Repackaged checkpoints at Comfy-Org/Wan_2.1_ComfyUI_repackaged |
| ComfyUI-GGUF | Community | FP8 and lower-precision quants for VRAM-limited setups |
| DiffSynth-Studio | Community | Adds LoRA training and FP8 quantization on top of VACE |
| Gradio | Reference demos | Shipped in the official ali-vilab/VACE repo for all task types |
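The Diffusers row above loads the checkpoint through the library's generic pipeline entry point. A minimal sketch follows, assuming the checkpoint id from the table resolves to a Diffusers-format pipeline; exact argument names can vary between library versions.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Checkpoint id taken from the table above; call arguments are typical for
# Diffusers video pipelines and may differ by version.
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-VACE-1.3B", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps on 24 GB (or smaller) consumer cards

output = pipe(
    prompt="a red panda walking through snow, cinematic lighting",
    height=480, width=832,   # 480P output, the recommended mode for the 1.3B variant
    num_frames=81,           # the 81-frame budget, roughly 5 seconds at 16 fps
)
export_to_video(output.frames[0], "panda.mp4", fps=16)
```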
The permissive Apache 2.0 license on the official weights, the unified VCU interface, and the early ComfyUI port together made VACE the default open-source video editing model for many independent creators through mid-2025. Quantized FP8 and GGUF checkpoints brought the 14B model within reach of a single 24 GB consumer card at 480P, a quality level that had previously required cloud compute.
The original Wan 2.1 base model had already cleared 2.2 million downloads in its first 48 hours, and the broader Wan 2.1 series passed 3.3 million downloads by the time VACE was announced, according to Alibaba Cloud's announcement of the release.
VACE sits in a crowded field of mid-2025 video models. The comparison below covers what each system actually shipped at the time, drawn from public model cards and announcements.
| Model | Developer | Release | Weights | License | Editing in same model | Resolutions | Audio |
|---|---|---|---|---|---|---|---|
| Wan 2.1-VACE | Alibaba Tongyi Lab | May 2025 | Open | Apache 2.0 | Yes (R2V, V2V, MV2V) | 480P, 720P | No |
| Wan 2.1 (base) | Alibaba Tongyi Lab | Feb 2025 | Open | Apache 2.0 | No, separate checkpoints | 480P, 720P | No |
| HunyuanVideo | Tencent | Dec 2024 | Open | Tencent Community License | No, T2V only at launch | 720P (1280x720) | No |
| Sora 2 | OpenAI | Sep 2025 | Closed | Proprietary | Limited | Up to 1080P (Pro) | Yes |
| Veo 3 | Google DeepMind | May 2025 | Closed | Proprietary | Limited | Up to 4K | Yes |
| Seedance | ByteDance Seed | Jun 2025 | Closed | Proprietary | Limited | 1080P | No |
| LTX-Video 0.9 | Lightricks | Nov 2024 | Open | RAIL-M | Some | 512x768 | No |
| Mochi 1 | Genmo | Oct 2024 | Open | Apache 2.0 | No | 480P | No |
Against the closed commercial systems, VACE trades raw visual fidelity for openness and editing flexibility. Veo 3 and Sora 2 both ship native audio synthesis and reach higher resolutions, but neither exposes the model weights and neither offers VACE's combination of inpainting, outpainting, and reference-driven editing inside one prompt. Against the other open-weights releases of the period, the unified editing interface is the differentiating feature: HunyuanVideo at launch was text-to-video only, and Mochi 1 did not bundle editing.
The comparison shifts again with Wan 2.5, which Alibaba previewed on September 23, 2025. Wan 2.5 added native multimodal generation across text, image, video, and audio in a single architecture, and pushed output to 1080P at 24 frames per second with synchronized audio and lip-sync. Wan 2.5 closed the audio gap with Sora 2 and Veo 3 but launched as a preview through the Alibaba Cloud Model Studio API rather than as an immediate open-weights release. VACE remained the openly downloadable workhorse for users who needed weights they could run locally.
The Wan 2.6 series, unveiled on December 16, 2025, extended clip length to 15 seconds and introduced multi-shot storytelling and reference-to-video with both appearance and voice preservation. Wan 2.6 also runs through Model Studio and the Qwen App rather than as a direct weight release at launch. VACE's role through late 2025 was therefore the open-weights anchor of the series while later versions handled the proprietary-feature frontier.
The announcement on Alizila, Alibaba's English-language news site, framed VACE as the first open-source unified video editing model and emphasized the consolidation of tasks that had previously required separate experts. Independent coverage from outlets like AIBase, Artificial Intelligence News, and DeepNewz echoed the unified-model framing and highlighted the dual 1.3B and 14B sizing as evidence of a deliberate consumer-grade and prosumer-grade split.
In the practitioner community, the reception was driven less by leaderboard numbers and more by the ComfyUI workflows that landed within days of release. The Wan2.1-VACE Native Support announcement on the ComfyUI blog described the model as significantly improving the efficiency and quality of video creation, with workflow examples for the Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, and Animate-Anything operations. RunComfy and similar workflow distribution sites started shipping VACE templates almost immediately.
The ICCV 2025 acceptance of the underlying paper on June 26, 2025 gave the work an academic anchor beyond the model release. The paper's claim that a unified model can match task-specific models across multiple subtasks was the part most often cited in subsequent video-model papers through late 2025.
A practical limitation noted in community discussion was the 81-frame budget at the supported resolutions, which translates to roughly 5 seconds of video at 16 fps. Longer outputs required chaining generations with VACE's spatio-temporal extension capability, which works but is slower than a single forward pass. The arrival of Wan 2.5 and Wan 2.6 with their 10-second and 15-second clip lengths later in the year addressed this directly, although those releases did not initially ship open weights.
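In practice, working past the 81-frame budget means chaining calls, with the tail of one clip becoming the conditioning context for the next. The sketch below is a pseudocode-level Python illustration of that idea; `generate_extension` is a hypothetical callable standing in for whatever extension interface a given frontend exposes.

```python
def generate_long_clip(prompt, generate_extension, num_chunks=3, overlap=16):
    """Chain several 81-frame generations into one longer clip (conceptual sketch).

    `generate_extension` is hypothetical: given the prompt and the trailing
    `overlap` frames of the previous chunk, it returns a new chunk whose first
    `overlap` frames reproduce that context before continuing the motion.
    """
    clip = generate_extension(prompt, context_frames=None)    # first chunk from scratch
    for _ in range(num_chunks - 1):
        tail = clip[-overlap:]                                 # reuse the end as context
        next_chunk = generate_extension(prompt, context_frames=tail)
        clip = clip + next_chunk[overlap:]                     # drop the duplicated overlap
    return clip
```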
A second limitation flagged in the model card itself is text-to-video stability at 720P on the 1.3B variant, which the card explicitly does not recommend. Users running on consumer hardware were therefore funneled toward 480P unless they had GPU memory and patience for the 14B variant.