Wan 2.1 (also written Wan2.1, from the Chinese Tongyi Wanxiang or 通义万象) is a family of open-weights video generation models developed by Alibaba's Tongyi Wanxiang team and released on February 25, 2025. The family includes 1.3-billion and 14-billion parameter variants covering text-to-video, image-to-video, first-and-last-frame-to-video generation, and video creation and editing tasks. The models are released under the Apache 2.0 license, making the weights freely available for commercial and research use.
Upon release, Wan 2.1 topped the VBench leaderboard with an overall score of 84.7%, outperforming both open-source competitors and closed commercial systems, including OpenAI's Sora, at the time. Within two days of release, the GitHub repository accumulated over 4,000 stars, and total downloads across Hugging Face and ModelScope exceeded 2.2 million. Its successor, Wan 2.2, was released in July 2025 with a Mixture-of-Experts architecture.
Tongyi Wanxiang (通义万象) is Alibaba Cloud's multimodal foundation model, first introduced in July 2023 as part of Alibaba's broader Tongyi family of AI models. The name translates roughly as "all things in one" and reflects the platform's goal of handling diverse modalities including text, images, and video within a single model lineage.
The Tongyi Wanxiang team sits within Alibaba Cloud's AI research division and has focused on scaling video generation from early text-to-image experiments toward full video synthesis. The project went through multiple internal and commercial releases before Wan 2.1 became the version that was fully open-sourced for the global AI community.
Alibaba Cloud is one of the largest cloud providers in Asia and a major player in the global AI model race. Its decision to release Wan 2.1 under a permissive open-source license placed it in direct competition with models like HunyuanVideo from Tencent and international offerings such as OpenAI's Sora and Google's Veo 2, while also contributing to the broader open-source AI ecosystem.
The initial open-source release of Wan 2.1 occurred on February 25, 2025, when Alibaba published inference code and model weights for four models simultaneously: T2V-14B, T2V-1.3B, I2V-14B-720P, and I2V-14B-480P. The code was published to the GitHub repository at Wan-Video/Wan2.1, and weights were hosted on both Hugging Face (under the Wan-AI organization) and ModelScope.
A rapid sequence of integrations and model expansions followed, including the first-and-last-frame and VACE model releases and the community tooling support described in the sections below.
The open-source release was a deliberate strategic move. Alibaba Cloud positioned Wan 2.1 as a direct challenge to proprietary models, noting that the weights and code were available for free modification and distribution without usage-based pricing barriers that affect access to models like Sora and Veo 2.
The Wan 2.1 family consists of several distinct models targeting different hardware profiles and creative tasks.
The 1.3-billion parameter text-to-video model is the entry-level variant, designed to run on consumer-grade graphics cards. It requires 8.19 GB of VRAM, making it compatible with cards such as the RTX 4060 (with optimization) and RTX 4090. On an RTX 4090, a five-second 480P video generates in approximately four minutes without quantization or acceleration. The model operates at 480P resolution.
This model was widely praised for making capable AI video generation accessible outside of professional workstations. Earlier open-source video models either required 24GB or more of VRAM or produced noticeably lower quality output at this memory level.
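As an illustration of the single-GPU workflow, the 1.3B model can be driven through the Hugging Face Diffusers integration. The sketch below follows the published Diffusers checkpoints; treat the exact model id, resolution, and sampler settings as indicative rather than authoritative.

```python
# Minimal text-to-video sketch via the Diffusers integration (indicative, not official).
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# Load the VAE in float32 for decode stability; run the transformer in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walking through a snowy forest, cinematic lighting",
    height=480, width=832,  # native 480P output for the 1.3B model
    num_frames=81,          # roughly five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=16)
```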
The 14-billion parameter text-to-video model is the full-sized variant. It supports both 480P and 720P resolution output and requires high-memory GPUs. The recommended configuration is a single GPU with 24GB or more VRAM (such as an NVIDIA A100 or RTX 4090), though single-GPU inference at 720P typically requires model CPU offloading via the `--offload_model True` flag. Multi-GPU inference is supported through FSDP and xDiT with Ulysses or Ring parallelization strategies for up to eight GPUs.
The T2V-14B has 40 transformer layers, 40 attention heads, a model dimension of 5,120, and a feedforward dimension of 13,824.
The image-to-video models (I2V-14B-480P and I2V-14B-720P) take a single image as a conditioning input along with a text prompt and animate it into a video. The 720P variant targets higher-fidelity output. These models are useful for animating still illustrations, product photos, and portrait images, and they were among the four original models released on February 25, 2025.
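A comparable sketch for image-to-video uses the `WanImageToVideoPipeline` class from the same Diffusers integration; the model id, prompt, and parameters are again indicative rather than authoritative.

```python
# Image-to-video sketch: animate a single conditioning image (indicative, not official).
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # lets the 14B model fit on smaller GPUs at a latency cost

image = load_image("product_photo.png")  # the single conditioning frame
frames = pipe(
    image=image,
    prompt="The camera slowly orbits the product as studio lights glint off its surface",
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "animated.mp4", fps=16)
```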
Released on April 17, 2025, the First-Last-Frame-to-Video model takes two images as input: a starting frame and an ending frame. Rather than morphing between them, the model generates realistic intermediate motion informed by a text prompt describing the action or style. This enables creators to set a precise visual beginning and ending while letting the model determine how to bridge the gap with natural motion. Output is at 720P. Common use cases include cinematic scene transitions, character movement sequences, and animation workflows where keyframes are defined in advance.
The VACE (Video Creation and Editing) models were released on May 14, 2025 in both 1.3B and 14B sizes. VACE is an all-in-one framework for video creation and editing: the models accept multi-modal inputs including text, image, video, and structural control signals such as depth maps, optical flow, grayscale, line drawings, and pose estimation.
VACE supports a wide set of editing operations including replacing actors or objects in existing footage, animating still characters, expanding video frames spatially, repainting masked regions with new content, and generating video from human pose sequences. The framework was accepted to ICCV 2025, indicating peer-reviewed recognition of its technical contributions.
Wan 2.1 is built on the diffusion transformer (DiT) paradigm, using a Flow Matching framework for training and inference. Flow Matching is an alternative to standard score-matching diffusion that parameterizes a transport map between noise and data directly, often resulting in straighter sampling trajectories and faster convergence.
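A minimal PyTorch sketch of a flow-matching training step under the linear (rectified-flow) interpolation path clarifies the idea; all names here are hypothetical, not Wan's training code.

```python
# Flow-matching training step: regress the constant velocity of a straight
# noise-to-data path. Hypothetical sketch, not the Wan training code.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over latent dims
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    target_velocity = x1 - x0                      # d(xt)/dt is constant along the path
    pred = model(xt, t, cond)                      # the DiT predicts the velocity
    return F.mse_loss(pred, target_velocity)
```

Because the target path is straight, samplers can take larger steps at inference time, which is the source of the faster-convergence claim.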
A key architectural component is the Wan-VAE, a 3D causal variational autoencoder designed specifically for video. Standard image VAEs compress 2D spatial data; the Wan-VAE operates causally along the temporal axis, meaning earlier frames can be processed without needing to see future frames. This design choice preserves temporal information and avoids artifacts that arise when applying frame-independent compression to video. The Wan-VAE can encode and decode videos of unlimited length at up to 1080P resolution.
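The causal design can be illustrated with a convolution that pads only on the past side of the time axis; the real Wan-VAE adds strided spatiotemporal compression and feature caching, so treat this as a sketch of the principle only.

```python
# Causal 3D convolution: output at frame t depends only on frames <= t.
# Illustrative sketch; not the actual Wan-VAE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1  # all temporal padding goes on the past side
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel,
                              padding=(0, kernel // 2, kernel // 2))  # symmetric in H, W only

    def forward(self, x):  # x: (batch, channels, time, height, width)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # pad W, H by 0; T in front only
        return self.conv(x)

# Frames can be processed incrementally, since frame t never looks ahead:
video = torch.randn(1, 3, 17, 64, 64)
out = CausalConv3d(3, 16)(video)  # shape: (1, 16, 17, 64, 64)
```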
The core generation network is a diffusion transformer whose configuration scales with model size. The 1.3B model has 30 layers, 12 attention heads, and a model dimension of 1,536; the 14B model has 40 layers, 40 attention heads, and a model dimension of 5,120.
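One detail the two configurations share: the per-head dimension is 128 in both cases, as a quick arithmetic check of the published numbers shows.

```python
# Both published DiT configurations use a per-head dimension of 128.
for name, dim, heads in [("1.3B", 1536, 12), ("14B", 5120, 40)]:
    print(f"{name}: {dim} / {heads} = {dim // heads}")  # -> 128 for both
```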
The training data pipeline applies a four-step cleaning procedure to a large proprietary corpus of images and videos, filtering on fundamental quality dimensions (sharpness, exposure, noise), visual quality, and motion quality. Data is stratified across training stages to balance coverage of different content types. The 14B model was trained on billions of images and videos.
One of Wan 2.1's most distinctive capabilities is generating readable text directly within video frames in both English and Chinese. The team described this as a first for the video generation model category at the time of release. Prior video generation models typically struggled to produce coherent written text in output frames, producing blurry or garbled characters even for short words.
Wan 2.1 can render dynamic text effects, stylized typography, and standard body text in both scripts. This makes the model useful for generating videos with readable signage, product labels, UI mockups, title cards, and multilingual promotional content without requiring post-production compositing.
The bilingual capability reflects the model's use of a multilingual T5-family text encoder (umT5) that processes Chinese and English at the language conditioning stage, along with training data that includes sufficient text-bearing images and videos in both scripts.
Wan 2.1 is released under the Apache 2.0 license. This is a permissive open-source license that allows free use, modification, and redistribution, including for commercial applications, provided that the license notice and attribution requirements are preserved. There are no usage-based restrictions on inference volume, no requirements to share derivative model weights, and no prohibitions on commercial deployment.
This licensing choice contrasts with several competing open-source video models that use more restrictive licenses, some of which prohibit commercial use or require specific attribution in outputs. Apache 2.0 is also more permissive than the custom community licenses attached to open-weight models such as Meta's Llama 3, which broadened Wan 2.1's appeal to commercial developers and enterprise users.
The following table summarizes the hardware requirements and generation specifications for each major Wan 2.1 variant.
| Model | Parameters | VRAM required | Target GPU | Output resolution | Notes |
|---|---|---|---|---|---|
| T2V-1.3B | 1.3B | 8.19 GB | RTX 4060 (with optimization), RTX 4090 | 480P | 5-sec clip in ~4 min on RTX 4090 |
| T2V-14B | 14B | 24 GB+ | RTX 4090, A100 | 480P, 720P | CPU offload flag for single-GPU 720P |
| I2V-14B-480P | 14B | 24 GB+ | RTX 4090, A100 | 480P | Image conditioning |
| I2V-14B-720P | 14B | 24 GB+ | A100 or multi-GPU | 720P | Image conditioning |
| FLF2V-14B | 14B | 24 GB+ | A100 or multi-GPU | 720P | Two-image conditioning |
| VACE-1.3B | 1.3B | ~8 GB | RTX 4060+ | 480P | Video editing tasks |
| VACE-14B | 14B | 24 GB+ | RTX 4090, A100 | 480P, 720P | Multi-modal editing |
For multi-GPU inference, Wan 2.1 supports FSDP combined with xDiT's Ulysses and Ring sequence parallelism strategies, scaling efficiently across up to eight GPUs. The xDiT library (version 0.4.1 or higher) must be installed separately for this mode.
Memory optimization options include CPU offloading for model weights, T5 encoder offloading to CPU (`--t5_cpu` flag), and quantized (INT8/FP8) checkpoints distributed by the community. These techniques allow running the 14B model on hardware with less than 24 GB VRAM at the cost of increased generation latency.
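In the Diffusers integration, comparable savings are available through the pipeline-level offload hooks (a sketch; the reference repository instead uses its own `--offload_model True` and `--t5_cpu` flags, and the model id below is indicative):

```python
# Offload options on a Diffusers pipeline, traded against generation latency.
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()        # move whole submodules to the GPU only while they run
# pipe.enable_sequential_cpu_offload() # stricter: stream weights layer by layer; slowest, smallest footprint
```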
The Wan team evaluated Wan 2.1 against other video generation models using the VBench benchmark suite and an internal benchmark called Wan-Bench.
VBench is a comprehensive benchmark for video generative models developed by the Vchitect team (Shanghai AI Laboratory and Nanyang Technological University's S-Lab), covering 16 fine-grained dimensions of video quality and semantic alignment. Wan 2.1 achieved an overall VBench score of 84.7%, ranking first on the public VBench leaderboard at the time of its release.
| Model | VBench score | Access |
|---|---|---|
| Wan 2.1 (14B) | 84.7% | Open-source |
| Google Veo 2 | 83.0% | Closed, limited access |
| OpenAI Sora | 82.0% | Closed, subscription |
| Wan 2.1 (1.3B) | Lower than 14B | Open-source |
Wan 2.1 performed particularly well in VBench dimensions related to dynamic degree (range of motion), spatial relationships, and multi-object interaction. These categories are demanding because they require the model to maintain consistent physics and object identity across frames rather than relying on static or slow-moving content to achieve visual quality scores.
The Wan team also developed an internal evaluation framework they called Wan-Bench, which evaluates video quality across 14 major dimensions and 26 sub-dimensions. The benchmark includes 1,035 evaluation prompts covering diverse content categories. Final scores are computed using a weighted sum of per-dimension scores, with weights derived from human preference studies indicating which dimensions matter most to viewers.
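The aggregation itself is a plain weighted sum over per-dimension scores. A minimal sketch of the scoring rule as described follows, where the dimension names and weights are placeholders rather than the published values.

```python
# Wan-Bench-style aggregation: weighted sum of per-dimension scores,
# with weights derived from human preference studies. Values are placeholders.
def weighted_benchmark_score(dim_scores: dict[str, float],
                             weights: dict[str, float]) -> float:
    return sum(weights[d] * score for d, score in dim_scores.items())

overall = weighted_benchmark_score(
    {"motion_quality": 0.81, "temporal_stability": 0.74, "text_adherence": 0.66},
    {"motion_quality": 0.5, "temporal_stability": 0.3, "text_adherence": 0.2},
)  # -> 0.759
```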
Across Wan-Bench evaluations, the 14B model achieved the highest overall weighted score (0.724) among all models tested, including commercial systems. Evaluated dimensions include motion generation quality, temporal stability, physical plausibility, multi-object handling, text adherence, and aesthetic quality.
The table below compares Wan 2.1 against major competing video generation models released during 2025.
| Feature | Wan 2.1 (14B) | Sora 2 | Veo 3 | Kling 2.1 |
|---|---|---|---|---|
| Release | February 2025 | September 2025 | May 2025 | June 2025 |
| Open weights | Yes (Apache 2.0) | No | No | No |
| VBench score | 84.7% | Not published (original Sora: 82.0%) | Not published | Not published |
| Max resolution | 720P native | Up to 1080P | Up to 4K | Up to 1080P |
| Max duration | ~10 seconds | Up to 20 seconds | Up to 60 seconds | Up to 15 seconds |
| Bilingual text | Yes (EN + ZH) | English only | English only | Yes (EN + ZH) |
| Image-to-video | Yes | Limited | Yes | Yes |
| Price | Free (self-hosted) | $20-$200/month | Usage-based | Subscription |
| Primary language | Chinese/English | English | English | Chinese/English |
| Hardware needed | 8.19 GB+ VRAM | Cloud only | Cloud only | Cloud only |
The key differentiator for Wan 2.1 is its open-weight, self-hostable nature. Sora 2, Veo 3, and Kling 2.1 are all closed commercial products requiring subscriptions or cloud API access. Wan 2.1 allows researchers, developers, and independent creators to run inference locally and fine-tune the weights on custom datasets without ongoing usage costs or API rate limits.
In terms of raw output quality, Wan 2.1 at 14B is competitive with these commercial systems in benchmark tests, particularly on motion quality and object consistency. It falls behind on maximum video length and maximum output resolution compared to newer closed models like Veo 3 and Kling 2.1, which added 4K output and longer durations in their respective updates after Wan 2.1's release.
Wan 2.2 was released on July 28, 2025 as the direct successor to Wan 2.1. Its most significant architectural change is the adoption of a Mixture-of-Experts (MoE) architecture, making it the first video diffusion model to apply MoE across the denoising timestep dimension.
The MoE design in Wan 2.2 uses two specialized expert models that handle different stages of the denoising process: a high-noise expert that focuses on overall layout and composition during the early diffusion steps, and a low-noise expert that refines details in later steps. Each expert has approximately 14B parameters, giving the full model 27B total parameters but only 14B active parameters at any given inference step. This design improves generation quality without proportionally increasing computational cost per step.
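A minimal sketch of the timestep-based routing this describes follows; the boundary value and call signatures are hypothetical, not Wan 2.2's actual code.

```python
# Route each denoising step to one of two experts by noise level.
# t runs from 1 (pure noise, early steps) down to 0 (clean latent);
# the 0.9 boundary is a hypothetical illustration, not the published value.
def moe_denoise_step(x, t, cond, high_noise_expert, low_noise_expert, boundary=0.9):
    expert = high_noise_expert if t >= boundary else low_noise_expert
    return expert(x, t, cond)  # only ~14B of the 27B total parameters run per step
```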
Additionally, Wan 2.2 was trained on substantially more data than Wan 2.1, with 65.6% more images and 83.2% more videos. The expanded training corpus included data with fine-grained aesthetic labels for lighting, color tone, composition, and contrast, enabling more controllable cinematic style generation.
Wan 2.2 retained the same Apache 2.0 licensing as Wan 2.1 and maintained backward compatibility with community tooling including ComfyUI and Diffusers integrations.
Wan 2.1 accumulated a large third-party ecosystem rapidly after release.
Within two weeks of the initial release, Wan 2.1 was integrated into community tooling, including ComfyUI and Hugging Face Diffusers, where the models load through `DiffusionPipeline.from_pretrained()`. The GitHub repository also lists several downstream research projects built on Wan 2.1's weights, as well as inference acceleration libraries adapted to support the model.
Wan 2.1's combination of open weights, consumer-compatible 1.3B variant, and multi-task model suite supports a broad range of applications.
Content creation: The text-to-video and image-to-video models are used for short-form social media content, thumbnail animation, and branded video production. The bilingual text rendering capability is particularly useful for content targeting both Chinese and English-speaking audiences.
Film and advertising post-production: The VACE editing model supports replacing background elements, actor substitution, and adding visual effects to existing footage without requiring manual masking in traditional compositing software.
Animation production: FLF2V enables animators to work in a keyframe-centric workflow where start and end poses or scenes are defined as still images, and the model generates the intervening motion.
Research and development: The Apache 2.0 license and open weights make Wan 2.1 a common foundation model for academic video generation research, including studies on motion control, personalized generation, and efficient inference.
Fine-tuning and customization: Developers have fine-tuned Wan 2.1 on proprietary datasets for specific domains including product visualization, virtual try-on, and autonomous driving simulation. The DiffSynth-Studio library provides LoRA-based fine-tuning for the 14B models.
Educational content: The V2A (video-to-audio) capability allows educators to create instructional videos with matched audio effects, while the text rendering supports generating diagrams and annotated video sequences.
Wan 2.1 received substantial attention from the AI developer community following its release. The GitHub repository reached 4,000 stars within two days, and the models accumulated over 2.2 million downloads on Hugging Face and ModelScope combined in the months following release.
Coverage in AI publications noted the benchmark performance relative to commercial models, particularly Sora, as significant. Several independent evaluators tested Wan 2.1 against closed systems and found it competitive or superior on motion smoothness and temporal stability, while acknowledging that closed models maintained advantages in maximum resolution and video duration.
The open-source AI community received it warmly as one of the strongest open-weight video generation models available at the time. Comparisons to HunyuanVideo (released by Tencent in December 2024) were frequent, with users describing Wan 2.1 as generally superior on prompt adherence and text rendering while being comparable on raw motion quality.
Chinese AI media emphasized the release as evidence of the competitiveness of Chinese AI research in the video generation space, noting that Wan 2.1 was among the first open-source video models to outperform a major Western commercial product (Sora) on a public benchmark.
Documented limitations of Wan 2.1 include:
Video duration: The model generates clips of approximately 5 to 10 seconds. Generating longer sequences requires stitching multiple clips together. Quality and consistency degrade over longer generation runs.
Resolution ceiling: The 1.3B model is limited to 480P output and shows increased instability at this resolution compared to the 14B model. The 14B model tops out at 720P natively; achieving 1080P requires upscaling post-processing rather than native generation.
Complex motion coherence: The model can struggle with scenes involving many objects in rapid simultaneous motion, intricate choreography requiring precise timing, or cause-and-effect sequences where physical interactions chain together over multiple seconds.
Prompt sensitivity: Generation quality degrades with very long or highly detailed prompts. Users have reported noticeably lower quality with complex prompts compared to shorter, clearer instructions.
Hardware threshold: While the 1.3B model is consumer-friendly, the 14B model requires 24GB of VRAM for smooth generation without aggressive offloading. This limits high-quality output to users with professional-grade GPUs or multi-GPU setups.
Pixelation at lower resolutions: The 1.3B model can produce visible pixelation artifacts, particularly at non-native aspect ratios or when generating fast-moving subjects.
Several of these limitations were addressed in Wan 2.2, which introduced the MoE architecture and additional training data to improve motion coherence and overall visual quality.