Wan 2.1-VACE
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 2,762 words
Wan 2.1-VACE (also written Wan2.1-VACE) is an open-weights video creation and editing model released by Alibaba's Tongyi Lab on May 14, 2025. The name VACE stands for Video All-in-One Creation and Editing, and the model is positioned as the first open-source system to combine multiple video generation and editing tasks within a single unified framework. It builds on the Wan 2.1 base video generation model, adding a multimodal conditioning interface that handles text, image, video, and mask inputs together.
The release ships in two parameter sizes, 1.3 billion and 14 billion, both available for free download on Hugging Face, GitHub, and Alibaba's ModelScope platform under the Apache 2.0 license. The 1.3B variant targets consumer hardware and runs at 480P, while the 14B variant supports both 480P and 720P. VACE supports reference-to-video generation, video-to-video editing, masked video-to-video editing, character animation, video inpainting, outpainting, pose transfer, depth control, and spatio-temporal extension, all from one model rather than a stack of single-task tools.
The accompanying research paper, VACE: All-in-One Video Creation and Editing, was first published to arXiv on March 11, 2025 and accepted to ICCV 2025 on June 26, 2025. The model's release sat between the original Wan 2.1 launch in February 2025 and the later Wan 2.5 preview that arrived in September 2025, with the Wan 2.6 series following in December 2025.
Alibaba's video model program runs out of the Tongyi Lab inside Alibaba Cloud, where it sits alongside the broader Tongyi family of foundation models. The team had been building toward open-source video generation for several years before VACE, with early Wanxiang text-to-image work feeding into the first Wan video releases. When Wan 2.1 launched on February 25, 2025, it topped the VBench leaderboard with an overall score of 84.7 percent and accumulated more than 2.2 million downloads across Hugging Face and ModelScope within days.
That first Wan 2.1 release already covered text-to-video, image-to-video, and first-and-last-frame video generation in separate model checkpoints. The gap it left, and the gap VACE was designed to fill, was video editing. Most open-source video pipelines in early 2025 still required separate expert models for tasks like inpainting, outpainting, or pose-driven animation. Users had to chain these together with their own glue code, often losing temporal consistency between stages. VACE was the team's answer to that fragmentation.
The broader context here is the open-source video race that picked up speed across 2024 and 2025. Tencent's HunyuanVideo, Genmo's Mochi, and Lightricks' LTX-Video had all shown that open weights could approach the visual quality of closed systems, but none of them had bundled creation and editing into one model. VACE was Alibaba's attempt to differentiate on breadth rather than chase a marginal quality improvement on text-to-video alone.
VACE is built on top of the Wan 2.1 Diffusion Transformer (DiT) backbone, which uses Flow Matching as its generative framework. The base architecture for the 14B variant has 40 layers, 40 attention heads, a hidden dimension of 5120, and a T5 text encoder for multilingual conditioning. The 1.3B variant uses 30 layers, 12 heads, and a hidden dimension of 1536. Both share the Wan-VAE, a 3D causal variational autoencoder that can encode and decode 1080P video of arbitrary length while preserving temporal information.
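For orientation, the published backbone dimensions of the two variants can be summarized in a short config sketch. The field names below are illustrative only and do not correspond to the actual Wan 2.1 configuration files.

```python
from dataclasses import dataclass

@dataclass
class WanDiTConfig:
    """Illustrative summary of the published DiT backbone dimensions."""
    layers: int            # transformer blocks
    attention_heads: int   # heads per attention layer
    hidden_dim: int        # model width
    text_encoder: str = "T5"  # multilingual text conditioning

# Values as reported for the Wan 2.1 base models that VACE builds on.
WAN_DIT_CONFIGS = {
    "Wan2.1-1.3B": WanDiTConfig(layers=30, attention_heads=12, hidden_dim=1536),
    "Wan2.1-14B":  WanDiTConfig(layers=40, attention_heads=40, hidden_dim=5120),
}
```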
The VACE-specific contribution sits on top of this backbone in two pieces.
The Video Condition Unit, or VCU, is a unified input interface. Rather than defining a separate input format for each task, the team treated text, reference images, source videos, masks, and control signals as different fields of a common multimodal record. A request to inpaint a region of a video and a request to generate a video from a single reference image both flatten into the same VCU representation, just with different fields populated.
This design lets a single trained model serve every supported task without architectural switching. It also makes task combinations possible at inference time. A user can supply a reference image, a source video, and a mask in the same call, and the model treats this as a combined reference-and-edit instruction rather than two separate operations.
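A rough way to picture the VCU is as a single record whose optional fields cover every task. The sketch below is a conceptual illustration under that reading; the field names are hypothetical and do not reflect the actual data structures in the VACE codebase.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class VideoConditionUnit:
    """Conceptual sketch of a VCU-style request record; field names are hypothetical."""
    prompt: str                                    # text instruction
    reference_images: List[Any] = field(default_factory=list)  # identity references (R2V)
    source_video: Optional[Any] = None             # video to edit or extend (V2V / MV2V)
    mask_video: Optional[Any] = None               # per-frame masks restricting the edit (MV2V)
    control_video: Optional[Any] = None            # pose, depth, or optical-flow control signal

# An inpainting request and a reference-to-video request flatten into the same
# record type, just with different fields populated.
inpaint = VideoConditionUnit(prompt="replace the car with a bicycle",
                             source_video="clip.mp4", mask_video="car_mask.mp4")
r2v = VideoConditionUnit(prompt="the same character walking on a beach",
                         reference_images=["character.png"])
```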
The second piece is the Context Adapter, a structure that injects task-specific concepts into the DiT backbone using formalized representations of temporal and spatial dimensions. During training, the team froze the base Wan 2.1 weights and only trained the adapter layers. According to coverage of the technical report, this approach converged faster than full fine-tuning and reduced the risk of degrading the base model's generation quality while adding editing capabilities.
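The freeze-the-backbone, train-the-adapter recipe is a standard fine-tuning pattern. The PyTorch sketch below shows the general idea under the assumption that adapter modules can be identified by a naming convention; it is not the actual VACE training code.

```python
import torch

def freeze_backbone_train_adapters(model: torch.nn.Module, adapter_keyword: str = "adapter"):
    """Freeze all base DiT weights and leave only adapter parameters trainable.

    `adapter_keyword` is a hypothetical naming convention; the real VACE code
    identifies its Context Adapter layers differently.
    """
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Only the adapter parameters reach the optimizer; the base model stays fixed.
    return torch.optim.AdamW(trainable, lr=1e-4)
```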
The paper's abstract describes the result as performance on par with task-specific models across various subtasks, achieved within a single unified system. The authors of the VACE paper are Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu, all from Alibaba's Tongyi Lab.
VACE consolidates a range of video generation and editing tasks behind one API. The HuggingFace model card and the GitHub repository group these into three primary categories: Reference-to-Video (R2V), Video-to-Video (V2V), and Masked Video-to-Video (MV2V). In community marketing materials, these get rebranded as the "Anything" family of operations.
| Capability | Category | What it does |
|---|---|---|
| Text-to-video | Base generation | Generates a video clip from a text prompt, inherited from Wan 2.1 |
| Image-to-video | Base generation | Animates a still image into a video clip following a text prompt |
| Reference-to-video | R2V | Generates a new video that preserves the identity of one or more reference images |
| Video-to-video editing | V2V | Edits an existing video globally with a text prompt, including style transfer and recolorization |
| Masked video-to-video | MV2V | Edits only the masked region of a video, leaving the rest untouched |
| Video inpainting | MV2V | Fills in or replaces selected areas inside a video using mask-based control |
| Video outpainting | MV2V | Extends a video beyond its original frame boundaries spatially |
| Spatio-temporal extension | V2V | Extends a video forward, backward, or outward in time and space |
| Character animation | R2V | Animates a reference character following a driving pose or motion signal |
| Pose transfer | V2V | Transfers human pose sequences from one video onto a target subject |
| Depth control | V2V | Conditions generation on depth maps for scene structure |
| Motion control | V2V | Conditions generation on optical flow or motion fields |
| Visual text rendering | Base generation | Renders English and Chinese text inside generated video frames |
The community-facing names for these operations include Move-Anything (motion transfer), Swap-Anything (subject replacement), Reference-Anything (R2V from arbitrary references), Expand-Anything (spatial and temporal outpainting), and Animate-Anything (driving a reference character with a control signal). These map to combinations of VCU fields rather than separate models.
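Seen through the VCU, the "Anything" operations differ only in which fields of the request are populated. The mapping below is an illustrative summary using the hypothetical field names from the earlier sketch, not an official specification.

```python
# Illustrative mapping from community operation names to the VCU fields they populate.
ANYTHING_OPERATIONS = {
    "Move-Anything":      {"prompt", "source_video", "control_video"},                   # motion transfer
    "Swap-Anything":      {"prompt", "source_video", "mask_video", "reference_images"},  # subject replacement
    "Reference-Anything": {"prompt", "reference_images"},                                # R2V from references
    "Expand-Anything":    {"prompt", "source_video", "mask_video"},                      # spatial/temporal outpainting
    "Animate-Anything":   {"prompt", "reference_images", "control_video"},               # driven character animation
}
```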
A notable detail in the model card is that Wan 2.1 was the first video foundation model capable of rendering both Chinese and English text inside generated frames, a capability VACE inherits. Earlier open-source video systems either could not produce legible in-frame text at all or were limited to Latin scripts.
VACE shipped in two parameter sizes at launch, with an earlier preview release of the smaller variant available from March 2025.
| Variant | Parameters | Layers | Heads | Hidden dim | 480P | 720P | Frame budget | License |
|---|---|---|---|---|---|---|---|---|
| Wan2.1-VACE-1.3B | 1.3 billion | 30 | 12 | 1536 | Yes | Not recommended | 81 frames | Apache 2.0 |
| Wan2.1-VACE-14B | 14 billion | 40 | 40 | 5120 | Yes | Yes | 81 frames | Apache 2.0 |
| VACE-Wan2.1-1.3B-Preview | 1.3 billion | 30 | 12 | 1536 | Yes (preview) | No | 81 frames | Apache 2.0 |
| VACE-LTX-Video-0.9 | 0.9 billion | n/a | n/a | n/a | 512x768 only | No | 97 frames | RAIL-M |
The 1.3B model is designed to fit on consumer GPUs. According to the model card, the underlying Wan 2.1 1.3B text-to-video model needs about 8.19 GB of VRAM and can generate a 5-second 480P clip in roughly 4 minutes on a single RTX 4090 without further optimization. The 14B model is heavier and is the recommended choice for 720P output, where the 1.3B model becomes unstable.
The LTX-Video variant is a separate community-contributed checkpoint that ports the VACE framework onto Lightricks' LTX-Video base, not an Alibaba release. It uses the RAIL-M license inherited from its base model rather than Apache 2.0.
Weights for both Alibaba checkpoints are mirrored across Hugging Face under the Wan-AI organization, on the official Wan-Video/Wan2.1 GitHub repository, and on ModelScope. Quantized variants in FP8 and lower precisions appeared within weeks of the initial release through community packagers, including the Comfy-Org repackaged build for ComfyUI users and QuantStack's GGUF conversions.
The VACE codebase landed on GitHub under the ali-vilab/VACE repository on March 31, 2025, ahead of the Wan-branded model weights. Native ComfyUI support arrived shortly after the May 14 weight release, with workflow templates contributed by community members including Datou, T8star-Aix, and Kijai. Kijai in particular published the VACE node system that became the basis for most third-party workflows.
Integrations followed quickly across the rest of the open-source video stack.
| Framework | Integration | Notes |
|---|---|---|
| Diffusers (Hugging Face) | First-class | `DiffusionPipeline.from_pretrained("Wan-AI/Wan2.1-VACE-1.3B")` |
| ComfyUI | Native | Repackaged checkpoints at Comfy-Org/Wan_2.1_ComfyUI_repackaged |
| ComfyUI-GGUF | Community | FP8 and lower-precision quants for VRAM-limited setups |
| DiffSynth-Studio | Community | Adds LoRA training and FP8 quantization on top of VACE |
| Gradio | Reference demos | Shipped in the official ali-vilab/VACE repo for all task types |
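The Diffusers row above loads the checkpoint through the library's generic pipeline entry point. A minimal sketch follows, assuming the checkpoint id from the table resolves to a Diffusers-format pipeline; exact argument names can vary between library versions.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Checkpoint id taken from the table above; call arguments are typical for
# Diffusers video pipelines and may differ by version.
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-VACE-1.3B", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps on 24 GB (or smaller) consumer cards

output = pipe(
    prompt="a red panda walking through snow, cinematic lighting",
    height=480, width=832,   # 480P output, the recommended mode for the 1.3B variant
    num_frames=81,           # the 81-frame budget, roughly 5 seconds at 16 fps
)
export_to_video(output.frames[0], "panda.mp4", fps=16)
```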
The permissive Apache 2.0 license on the official weights, the unified VCU interface, and the early ComfyUI port together made VACE the default open-source video editing model for many independent creators through mid-2025. Quantized FP8 and GGUF checkpoints brought the 14B model within reach of a single 24 GB consumer card at 480P, a quality level that had previously required cloud compute.
The original Wan 2.1 base model had already cleared 2.2 million downloads in its first 48 hours, and the broader Wan 2.1 series passed 3.3 million downloads by the time VACE was announced, according to Alibaba Cloud's announcement of the release.
VACE sits in a crowded field of mid-2025 video models. The comparison below covers what each system actually shipped at the time, drawn from public model cards and announcements.
| Model | Developer | Release | Weights | License | Editing in same model | Resolutions | Audio |
|---|---|---|---|---|---|---|---|
| Wan 2.1-VACE | Alibaba Tongyi Lab | May 2025 | Open | Apache 2.0 | Yes (R2V, V2V, MV2V) | 480P, 720P | No |
| Wan 2.1 (base) | Alibaba Tongyi Lab | Feb 2025 | Open | Apache 2.0 | No, separate checkpoints | 480P, 720P | No |
| HunyuanVideo | Tencent | Dec 2024 | Open | Tencent Community License | No, T2V only at launch | 720P (1280x720) | No |
| Sora 2 | OpenAI | Sep 2025 | Closed | Proprietary | Limited | Up to 1080P (Pro) | Yes |
| Veo 3 | Google DeepMind | May 2025 | Closed | Proprietary | Limited | Up to 4K | Yes |
| Seedance | ByteDance Seed | Jun 2025 | Closed | Proprietary | Limited | 1080P | No |
| LTX-Video 0.9 | Lightricks | Nov 2024 | Open | RAIL-M | Some | 512x768 | No |
| Mochi 1 | Genmo | Oct 2024 | Open | Apache 2.0 | No | 480P | No |
Against the closed commercial systems, VACE trades raw visual fidelity for openness and editing flexibility. Veo 3 and Sora 2 both ship native audio synthesis and reach higher resolutions, but neither exposes the model weights and neither offers VACE's combination of inpainting, outpainting, and reference-driven editing inside one prompt. Against the other open-weights releases of the period, the unified editing interface is the differentiating feature: HunyuanVideo at launch was text-to-video only, and Mochi 1 did not bundle editing.
The comparison shifts again with Wan 2.5, which Alibaba previewed on September 23, 2025. Wan 2.5 added native multimodal generation across text, image, video, and audio in a single architecture, and pushed output to 1080P at 24 frames per second with synchronized audio and lip-sync. Wan 2.5 closed the audio gap with Sora 2 and Veo 3 but launched as a preview through the Alibaba Cloud Model Studio API rather than as an immediate open-weights release. VACE remained the openly downloadable workhorse for users who needed weights they could run locally.
The Wan 2.6 series, unveiled on December 16, 2025, extended clip length to 15 seconds and introduced multi-shot storytelling and reference-to-video with both appearance and voice preservation. Wan 2.6 also runs through Model Studio and the Qwen App rather than as a direct weight release at launch. VACE's role through late 2025 was therefore the open-weights anchor of the series while later versions handled the proprietary-feature frontier.
The announcement on Alizila, Alibaba's English-language news site, framed VACE as the first open-source unified video editing model and emphasized the consolidation of tasks that had previously required separate experts. Independent coverage from outlets like AIBase, Artificial Intelligence News, and DeepNewz echoed the unified-model framing and highlighted the dual 1.3B and 14B sizing as evidence of a deliberate consumer-grade and prosumer-grade split.
In the practitioner community, the reception was driven less by leaderboard numbers and more by the ComfyUI workflows that landed within days of release. The Wan2.1-VACE Native Support announcement on the ComfyUI blog described the model as significantly improving the efficiency and quality of video creation, with workflow examples for the Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, and Animate-Anything operations. RunComfy and similar workflow distribution sites started shipping VACE templates almost immediately.
The ICCV 2025 acceptance of the underlying paper on June 26, 2025 gave the work an academic anchor beyond the model release. The paper's claim that a unified model can match task-specific models across multiple subtasks was the part most often cited in subsequent video-model papers through late 2025.
A practical limitation noted in community discussion was the 81-frame budget at the supported resolutions, which translates to roughly 5 seconds of video at 16 fps. Longer outputs required chaining generations with VACE's spatio-temporal extension capability, which works but is slower than a single forward pass. The arrival of Wan 2.5 and Wan 2.6 with their 10-second and 15-second clip lengths later in the year addressed this directly, although those releases did not initially ship open weights.
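In practice, working past the 81-frame budget means chaining calls, with the tail of one clip becoming the conditioning context for the next. The sketch below is a pseudocode-level Python illustration of that idea; `generate_extension` is a hypothetical callable standing in for whatever extension interface a given frontend exposes.

```python
def generate_long_clip(prompt, generate_extension, num_chunks=3, overlap=16):
    """Chain several 81-frame generations into one longer clip (conceptual sketch).

    `generate_extension` is hypothetical: given the prompt and the trailing
    `overlap` frames of the previous chunk, it returns a new chunk whose first
    `overlap` frames reproduce that context before continuing the motion.
    """
    clip = generate_extension(prompt, context_frames=None)    # first chunk from scratch
    for _ in range(num_chunks - 1):
        tail = clip[-overlap:]                                 # reuse the end as context
        next_chunk = generate_extension(prompt, context_frames=tail)
        clip = clip + next_chunk[overlap:]                     # drop the duplicated overlap
    return clip
```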
A second limitation flagged in the model card itself is text-to-video stability at 720P on the 1.3B variant, which the card explicitly does not recommend. Users running on consumer hardware were therefore funneled toward 480P unless they had GPU memory and patience for the 14B variant.