Wan 2.1 (also written Wan2.1, from the Chinese Tongyi Wanxiang or 通义万象) is a family of open-weights video generation models developed by Alibaba's Tongyi Wanxiang team and released on February 25, 2025. The family includes 1.3-billion and 14-billion parameter variants covering text-to-video, image-to-video, first-and-last-frame-to-video generation, and video creation and editing tasks. The models are released under the Apache 2.0 license, making the weights freely available for commercial and research use.
Upon release, Wan 2.1 topped the VBench leaderboard with an overall score of 84.7%, outperforming both open-source competitors and closed commercial systems, including OpenAI's Sora, at the time. Within two days of release, the GitHub repository accumulated over 4,000 stars, and total downloads across Hugging Face and ModelScope exceeded 2.2 million. Its successor, Wan 2.2, was released in July 2025 with a Mixture-of-Experts architecture.
Tongyi Wanxiang (通义万象) is Alibaba Cloud's multimodal foundation model, first introduced in July 2023 as part of Alibaba's broader Tongyi family of AI models. The name translates roughly as "all things in one" and reflects the platform's goal of handling diverse modalities including text, images, and video within a single model lineage.
The Tongyi Wanxiang team sits within Alibaba Cloud's AI research division and has focused on scaling video generation from early text-to-image experiments toward full video synthesis. The project went through multiple internal and commercial releases before Wan 2.1 became the version that was fully open-sourced for the global AI community.
Alibaba Cloud is one of the largest cloud providers in Asia and a major player in the global AI model race. Its decision to release Wan 2.1 under a permissive open-source license placed it in direct competition with models like HunyuanVideo from Tencent and international offerings such as OpenAI's Sora and Google's Veo 2, while also contributing to the broader open-source AI ecosystem.
The initial open-source release of Wan 2.1 occurred on February 25, 2025, when Alibaba published inference code and model weights for four models simultaneously: T2V-14B, T2V-1.3B, I2V-14B-720P, and I2V-14B-480P. The code was published to the GitHub repository at Wan-Video/Wan2.1, and weights were hosted on both Hugging Face (under the Wan-AI organization) and ModelScope.
A rapid sequence of integrations and model expansions followed, including the first-and-last-frame and VACE model releases and the community tooling support described in the sections below.
The open-source release was a deliberate strategic move. Alibaba Cloud positioned Wan 2.1 as a direct challenge to proprietary models, noting that the weights and code were available for free modification and distribution without usage-based pricing barriers that affect access to models like Sora and Veo 2.
The Wan 2.1 family consists of several distinct models targeting different hardware profiles and creative tasks.
The 1.3-billion parameter text-to-video model is the entry-level variant, designed to run on consumer-grade graphics cards. It requires 8.19 GB of VRAM, making it compatible with cards such as the RTX 4060 (with optimization) and RTX 4090. On an RTX 4090, a five-second 480P video generates in approximately four minutes without quantization or acceleration. The model operates at 480P resolution.
This model was widely praised for making capable AI video generation accessible outside of professional workstations. Earlier open-source video models either required 24GB or more of VRAM or produced noticeably lower quality output at this memory level.
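As an illustration of the single-GPU workflow, the 1.3B model can be driven through the Hugging Face Diffusers integration. The sketch below follows the published Diffusers checkpoints; treat the exact model id, resolution, and sampler settings as indicative rather than authoritative.

```python
# Minimal text-to-video sketch via the Diffusers integration (indicative, not official).
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# Load the VAE in float32 for decode stability; run the transformer in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walking through a snowy forest, cinematic lighting",
    height=480, width=832,  # native 480P output for the 1.3B model
    num_frames=81,          # roughly five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=16)
```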
The 14-billion parameter text-to-video model is the full-sized variant. It supports both 480P and 720P resolution output and requires high-memory GPUs. The recommended configuration is a single GPU with 24GB or more VRAM (such as an NVIDIA A100 or RTX 4090), though single-GPU inference at 720P typically requires model CPU offloading via the `--offload_model True` flag. Multi-GPU inference is supported through FSDP and xDiT with Ulysses or Ring parallelization strategies for up to eight GPUs.
The T2V-14B has 40 transformer layers, 40 attention heads, a model dimension of 5,120, and a feedforward dimension of 13,824.
The image-to-video models (I2V-14B-480P and I2V-14B-720P) take a single image as a conditioning input along with a text prompt and animate it into a video. The 720P variant targets higher-fidelity output. These models are useful for animating still illustrations, product photos, and portrait images, and they were among the four original models released on February 25, 2025.
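A comparable sketch for image-to-video uses the `WanImageToVideoPipeline` class from the same Diffusers integration; the model id, prompt, and parameters are again indicative rather than authoritative.

```python
# Image-to-video sketch: animate a single conditioning image (indicative, not official).
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # lets the 14B model fit on smaller GPUs at a latency cost

image = load_image("product_photo.png")  # the single conditioning frame
frames = pipe(
    image=image,
    prompt="The camera slowly orbits the product as studio lights glint off its surface",
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "animated.mp4", fps=16)
```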
Released on April 17, 2025, the First-Last-Frame-to-Video model takes two images as input: a starting frame and an ending frame. Rather than morphing between them, the model generates realistic intermediate motion informed by a text prompt describing the action or style. This enables creators to set a precise visual beginning and ending while letting the model determine how to bridge the gap with natural motion. Output is at 720P. Common use cases include cinematic scene transitions, character movement sequences, and animation workflows where keyframes are defined in advance.
The VACE (Video Creation and Editing) models were released on May 14, 2025 in both 1.3B and 14B sizes. VACE is an all-in-one framework for video creation and editing: the models accept multi-modal inputs including text, image, video, and structural control signals such as depth maps, optical flow, grayscale, line drawings, and pose estimation.
VACE supports a wide set of editing operations including replacing actors or objects in existing footage, animating still characters, expanding video frames spatially, repainting masked regions with new content, and generating video from human pose sequences. The framework was accepted to ICCV 2025, indicating peer-reviewed recognition of its technical contributions.
Wan 2.1 is built on the diffusion transformer (DiT) paradigm, using a Flow Matching framework for training and inference. Flow Matching is an alternative to standard score-matching diffusion that parameterizes a transport map between noise and data directly, often resulting in straighter sampling trajectories and faster convergence.
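A minimal PyTorch sketch of a flow-matching training step under the linear (rectified-flow) interpolation path clarifies the idea; all names here are hypothetical, not Wan's training code.

```python
# Flow-matching training step: regress the constant velocity of a straight
# noise-to-data path. Hypothetical sketch, not the Wan training code.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    target_velocity = x1 - x0                      # d(xt)/dt is constant along the path
    pred = model(xt, t, cond)                      # the DiT predicts the velocity
    return F.mse_loss(pred, target_velocity)
```

Because the target path is straight, samplers can take larger steps at inference time, which is the source of the faster-convergence claim.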
A key architectural component is the Wan-VAE, a 3D causal variational autoencoder designed specifically for video. Standard image VAEs compress 2D spatial data; the Wan-VAE operates causally along the temporal axis, meaning earlier frames can be processed without needing to see future frames. This design choice preserves temporal information and avoids artifacts that arise when applying frame-independent compression to video. The Wan-VAE can encode and decode videos of unlimited length at up to 1080P resolution.
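The causal design can be illustrated with a convolution that pads only on the past side of the time axis; the real Wan-VAE adds strided spatiotemporal compression and feature caching, so treat this as a sketch of the principle only.

```python
# Causal 3D convolution: output at frame t depends only on frames <= t.
# Illustrative sketch; not the actual Wan-VAE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1  # all temporal padding goes on the past side
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel,
                              padding=(0, kernel // 2, kernel // 2))  # symmetric in H, W only

    def forward(self, x):  # x: (batch, channels, time, height, width)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # pad W, H by 0; T in front only
        return self.conv(x)

# Frames can be processed incrementally, since frame t never looks ahead:
video = torch.randn(1, 3, 17, 64, 64)
out = CausalConv3d(3, 16)(video)  # shape: (1, 16, 17, 64, 64)
```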
The core generation network is a diffusion transformer whose configuration scales with model size. The 1.3B model has 30 layers, 12 attention heads, and a model dimension of 1,536; the 14B model has 40 layers, 40 attention heads, and a model dimension of 5,120.
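One detail the two configurations share: the per-head dimension is 128 in both cases, as a quick arithmetic check of the published numbers shows.

```python
# Both published DiT configurations use a per-head dimension of 128.
for name, dim, heads in [("1.3B", 1536, 12), ("14B", 5120, 40)]:
    print(f"{name}: {dim} / {heads} = {dim // heads}")  # -> 128 for both
```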
The training data pipeline applies a four-step cleaning procedure to a large proprietary corpus of images and videos, filtering on fundamental quality dimensions (sharpness, exposure, noise), visual quality, and motion quality. Data is stratified across training stages to balance coverage of different content types. The 14B model was trained on billions of images and videos.
One of Wan 2.1's most distinctive capabilities is generating readable text directly within video frames in both English and Chinese. The team described this as a first for the video generation model category at the time of release. Prior video generation models typically struggled to produce coherent written text in output frames, producing blurry or garbled characters even for short words.
Wan 2.1 can render dynamic text effects, stylized typography, and standard body text in both scripts. This makes the model useful for generating videos with readable signage, product labels, UI mockups, title cards, and multilingual promotional content without requiring post-production compositing.
The bilingual capability reflects the model's use of a multilingual T5-family text encoder (umT5) that processes Chinese and English at the language conditioning stage, along with training data that includes sufficient text-bearing images and videos in both scripts.
Wan 2.1 is released under the Apache 2.0 license. This is a permissive open-source license that allows free use, modification, and redistribution, including for commercial applications, provided that the license notice and attribution requirements are preserved. There are no usage-based restrictions on inference volume, no requirements to share derivative model weights, and no prohibitions on commercial deployment.
This licensing choice contrasts with several competing open-source video models that use more restrictive licenses, some of which prohibit commercial use or require specific attribution in outputs. Apache 2.0 is also more permissive than the custom community licenses attached to open-weight models such as Meta's Llama 3, which broadened Wan 2.1's appeal to commercial developers and enterprise users.
The following table summarizes the hardware requirements and generation specifications for each major Wan 2.1 variant.
| Model | Parameters | VRAM required | Target GPU | Output resolution | Notes |
|---|---|---|---|---|---|
| T2V-1.3B | 1.3B | 8.19 GB | RTX 4060 (with optimization), RTX 4090 | 480P | 5-sec clip in ~4 min on RTX 4090 |
| T2V-14B | 14B | 24 GB+ | RTX 4090, A100 | 480P, 720P | CPU offload flag for single-GPU 720P |
| I2V-14B-480P | 14B | 24 GB+ | RTX 4090, A100 | 480P | Image conditioning |
| I2V-14B-720P | 14B | 24 GB+ | A100 or multi-GPU | 720P | Image conditioning |
| FLF2V-14B | 14B | 24 GB+ | A100 or multi-GPU | 720P | Two-image conditioning |
| VACE-1.3B | 1.3B | ~8 GB | RTX 4060+ | 480P | Video editing tasks |
| VACE-14B | 14B | 24 GB+ | RTX 4090, A100 | 480P, 720P | Multi-modal editing |
For multi-GPU inference, Wan 2.1 supports FSDP combined with xDiT's Ulysses and Ring sequence parallelism strategies, scaling efficiently across up to eight GPUs. The xDiT library (version 0.4.1 or higher) must be installed separately for this mode.
Memory optimization options include CPU offloading for model weights, T5 encoder offloading to CPU (`--t5_cpu` flag), and quantized (INT8/FP8) checkpoints distributed by the community. These techniques allow running the 14B model on hardware with less than 24 GB VRAM at the cost of increased generation latency.
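In the Diffusers integration, comparable savings are available through the pipeline-level offload hooks (a sketch; the reference repository instead uses its own `--offload_model True` and `--t5_cpu` flags, and the model id below is indicative):

```python
# Offload options on a Diffusers pipeline, traded against generation latency.
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()        # move whole submodules to the GPU only while they run
# pipe.enable_sequential_cpu_offload() # stricter: stream weights layer by layer; slowest, smallest footprint
```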
The Wan team evaluated Wan 2.1 against other video generation models using the VBench benchmark suite and an internal benchmark called Wan-Bench.
VBench is a comprehensive benchmark for video generative models developed by the Vchitect team (Shanghai AI Laboratory and Nanyang Technological University's S-Lab), covering 16 fine-grained dimensions of video quality and semantic alignment. Wan 2.1 achieved an overall VBench score of 84.7%, ranking first on the public VBench leaderboard at the time of its release.
| Model | VBench score | Access |
|---|---|---|
| Wan 2.1 (14B) | 84.7% | Open-source |
| Google Veo 2 | 83.0% | Closed, limited access |
| OpenAI Sora | 82.0% | Closed, subscription |
| Wan 2.1 (1.3B) | Lower than 14B | Open-source |
Wan 2.1 performed particularly well in VBench dimensions related to dynamic degree (range of motion), spatial relationships, and multi-object interaction. These categories are demanding because they require the model to maintain consistent physics and object identity across frames rather than relying on static or slow-moving content to achieve visual quality scores.
The Wan team also developed an internal evaluation framework they called Wan-Bench, which evaluates video quality across 14 major dimensions and 26 sub-dimensions. The benchmark includes 1,035 evaluation prompts covering diverse content categories. Final scores are computed using a weighted sum of per-dimension scores, with weights derived from human preference studies indicating which dimensions matter most to viewers.
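The aggregation itself is a plain weighted sum over per-dimension scores. A minimal sketch of the scoring rule as described follows, where the dimension names and weights are placeholders rather than the published values.

```python
# Wan-Bench-style aggregation: weighted sum of per-dimension scores,
# with weights derived from human preference studies. Values are placeholders.
def weighted_benchmark_score(dim_scores: dict[str, float],
                             weights: dict[str, float]) -> float:
    return sum(weights[d] * score for d, score in dim_scores.items())

overall = weighted_benchmark_score(
    {"motion_quality": 0.81, "temporal_stability": 0.74, "text_adherence": 0.66},
    {"motion_quality": 0.5, "temporal_stability": 0.3, "text_adherence": 0.2},
)  # -> 0.759
```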
Across Wan-Bench evaluations, the 14B model achieved the highest overall weighted score (0.724) among all models tested, including commercial systems. Evaluated dimensions include motion generation quality, temporal stability, physical plausibility, multi-object handling, text adherence, and aesthetic quality.
The table below compares Wan 2.1 against major competing video generation models released during 2025.
| Feature | Wan 2.1 (14B) | Sora 2 | Veo 3 | Kling 2.1 |
|---|---|---|---|---|
| Release | February 2025 | September 2025 | May 2025 | June 2025 |
| Open weights | Yes (Apache 2.0) | No | No | No |
| VBench score | 84.7% | Not published (original Sora: 82.0%) | Not published | Not published |
| Max resolution | 720P native | Up to 1080P | Up to 4K | Up to 1080P |
| Max duration | ~10 seconds | Up to 20 seconds | Up to 60 seconds | Up to 15 seconds |
| Bilingual text | Yes (EN + ZH) | English only | English only | Yes (EN + ZH) |
| Image-to-video | Yes | Limited | Yes | Yes |
| Price | Free (self-hosted) | $20-$200/month | Usage-based | Subscription |
| Primary language | Chinese/English | English | English | Chinese/English |
| Hardware needed | 8.19 GB+ VRAM | Cloud only | Cloud only | Cloud only |
The key differentiator for Wan 2.1 is its open-weight, self-hostable nature. Sora 2, Veo 3, and Kling 2.1 are all closed commercial products requiring subscriptions or cloud API access. Wan 2.1 allows researchers, developers, and independent creators to run inference locally and fine-tune the weights on custom datasets without ongoing usage costs or API rate limits.
In terms of raw output quality, Wan 2.1 at 14B is competitive with these commercial systems in benchmark tests, particularly on motion quality and object consistency. It falls behind on maximum video length and maximum output resolution compared to newer closed models like Veo 3 and Kling 2.1, which added 4K output and longer durations in their respective updates after Wan 2.1's release.
Wan 2.2 was released on July 28, 2025 as the direct successor to Wan 2.1. Its most significant architectural change is the adoption of a Mixture-of-Experts (MoE) architecture, making it the first video diffusion model to apply MoE across the denoising timestep dimension.
The MoE design in Wan 2.2 uses two specialized expert models that handle different stages of the denoising process: a high-noise expert that focuses on overall layout and composition during the early diffusion steps, and a low-noise expert that refines details in later steps. Each expert has approximately 14B parameters, giving the full model 27B total parameters but only 14B active parameters at any given inference step. This design improves generation quality without proportionally increasing computational cost per step.
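A minimal sketch of the timestep-based routing this describes follows; the boundary value and call signatures are hypothetical, not Wan 2.2's actual code.

```python
# Route each denoising step to one of two experts by noise level.
# t runs from 1 (pure noise, early steps) down to 0 (clean latent);
# the 0.9 boundary is a hypothetical illustration, not the published value.
def moe_denoise_step(x, t, cond, high_noise_expert, low_noise_expert, boundary=0.9):
    expert = high_noise_expert if t >= boundary else low_noise_expert
    return expert(x, t, cond)  # only ~14B of the 27B total parameters run per step
```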
Additionally, Wan 2.2 was trained on substantially more data than Wan 2.1, with 65.6% more images and 83.2% more videos. The expanded training corpus included data with fine-grained aesthetic labels for lighting, color tone, composition, and contrast, enabling more controllable cinematic style generation.
Wan 2.2 retained the same Apache 2.0 licensing as Wan 2.1 and maintained backward compatibility with community tooling including ComfyUI and Diffusers integrations.
Wan 2.1 accumulated a large third-party ecosystem rapidly after release.
Within two weeks of the initial release, Wan 2.1 was integrated into community tooling, including ComfyUI and Hugging Face Diffusers, where the models load through `DiffusionPipeline.from_pretrained()`. The GitHub repository also lists several downstream research projects built on Wan 2.1's weights, as well as inference acceleration libraries adapted to support the model.
Wan 2.1's combination of open weights, consumer-compatible 1.3B variant, and multi-task model suite supports a broad range of applications.
Content creation: The text-to-video and image-to-video models are used for short-form social media content, thumbnail animation, and branded video production. The bilingual text rendering capability is particularly useful for content targeting both Chinese and English-speaking audiences.
Film and advertising post-production: The VACE editing model supports replacing background elements, actor substitution, and adding visual effects to existing footage without requiring manual masking in traditional compositing software.
Animation production: FLF2V enables animators to work in a keyframe-centric workflow where start and end poses or scenes are defined as still images, and the model generates the intervening motion.
Research and development: The Apache 2.0 license and open weights make Wan 2.1 a common foundation model for academic video generation research, including studies on motion control, personalized generation, and efficient inference.
Fine-tuning and customization: Developers have fine-tuned Wan 2.1 on proprietary datasets for specific domains including product visualization, virtual try-on, and autonomous driving simulation. The DiffSynth-Studio library provides LoRA-based fine-tuning for the 14B models.
Educational content: The V2A (video-to-audio) capability allows educators to create instructional videos with matched audio effects, while the text rendering supports generating diagrams and annotated video sequences.
Wan 2.1 received substantial attention from the AI developer community following its release. The GitHub repository reached 4,000 stars within two days, and the models accumulated over 2.2 million downloads on Hugging Face and ModelScope combined in the months following release.
Coverage in AI publications noted the benchmark performance relative to commercial models, particularly Sora, as significant. Several independent evaluators tested Wan 2.1 against closed systems and found it competitive or superior on motion smoothness and temporal stability, while acknowledging that closed models maintained advantages in maximum resolution and video duration.
The open-source AI community received it warmly as one of the strongest open-weight video generation models available at the time. Comparisons to HunyuanVideo (released by Tencent in December 2024) were frequent, with users describing Wan 2.1 as generally superior on prompt adherence and text rendering while being comparable on raw motion quality.
Chinese AI media emphasized the release as evidence of the competitiveness of Chinese AI research in the video generation space, noting that Wan 2.1 was among the first open-source video models to outperform a major Western commercial product (Sora) on a public benchmark.
Documented limitations of Wan 2.1 include:
Video duration: The model generates clips of approximately 5 to 10 seconds. Generating longer sequences requires stitching multiple clips together. Quality and consistency degrade over longer generation runs.
Resolution ceiling: The 1.3B model is limited to 480P output and shows increased instability at this resolution compared to the 14B model. The 14B model tops out at 720P natively; achieving 1080P requires upscaling post-processing rather than native generation.
Complex motion coherence: The model can struggle with scenes involving many objects in rapid simultaneous motion, intricate choreography requiring precise timing, or cause-and-effect sequences where physical interactions chain together over multiple seconds.
Prompt sensitivity: Generation quality degrades with very long or highly detailed prompts. Users have reported noticeably lower quality with complex prompts compared to shorter, clearer instructions.
Hardware threshold: While the 1.3B model is consumer-friendly, the 14B model requires 24GB of VRAM for smooth generation without aggressive offloading. This limits high-quality output to users with professional-grade GPUs or multi-GPU setups.
Pixelation at lower resolutions: The 1.3B model can produce visible pixelation artifacts, particularly at non-native aspect ratios or when generating fast-moving subjects.
Several of these limitations were addressed in Wan 2.2, which introduced the MoE architecture and additional training data to improve motion coherence and overall visual quality.