Wan 2.5
Last reviewed
Jun 5, 2026
Sources
36 citations
Review status
Source-backed
Revision
v3 · 5,616 words
Wan 2.5 is a multimodal generative video model developed by Alibaba Cloud's Tongyi Lab and previewed at the company's Apsara 2025 conference in Hangzhou in late September 2025.[1] The model belongs to the Wan family of visual foundation models, succeeding Wan 2.2 (July 2025) and the speech-driven variant Wan2.2-S2V (August 2025).[4] Wan 2.5 is notable for being the first model in the family to natively generate synchronized audio together with video, for extending single-clip duration from 5 seconds to 10 seconds, and for delivering native 1080p output at 24 frames per second.[9]
Unlike its predecessors, Wan 2.5 was not released with open weights at launch. Alibaba moved the model into a closed commercial-preview phase delivered through Alibaba Cloud's Bailian platform (also marketed in English as Model Studio) and the public-facing Tongyi Wanxiang website. A formal commercial launch event for Wan2.5-Preview was held on November 11, 2025.[2] The next major version, Wan 2.6, followed on December 16, 2025, with multi-shot storytelling and reference-to-video features,[3] and Wan 2.7 arrived in April 2026 with a planning-style Thinking Mode and first-and-last-frame control.[13] Wan 2.5 sits in the same competitive bracket as OpenAI's Sora 2, Google's Veo 3, ByteDance's Seedance, and MiniMax's Hailuo AI, and is generally positioned by reviewers as a price-competitive challenger from China rather than a state-of-the-art leader on physics realism.[24]
Alibaba's video-generation work sits inside Tongyi Lab, the research division behind the Tongyi family of foundation models that also produces the Qwen large language models and the Tongyi Wanxiang image generators. The Wan series began with Wan 2.1 in February 2025, a diffusion-transformer model that Alibaba released under the Apache 2.0 license. Wan 2.2 followed in July 2025, again open-sourced, and added a mixture-of-experts variant alongside a 5-billion-parameter lightweight version that fit on a single consumer GPU.[26] The earlier Wan 2.1-VACE variant focused on video editing and creation tasks. Between Wan 2.2 and Wan 2.5, Alibaba also shipped two specialized open-weight variants of the Wan 2.2 base: Wan2.2-S2V for speech-driven avatar video in late August 2025,[4] and Wan2.2-Animate for character animation and replacement in mid-September 2025.[5]
By the time Wan 2.5 was announced, the broader Wan series had been downloaded more than 6.9 million times across Hugging Face and ModelScope, according to Alibaba Cloud, and had spawned more than 170,000 derivative models trained by community users.[1] That open footprint mattered for the way Wan 2.5 was received: when Alibaba kept Wan 2.5 weights closed, threads on the official GitHub repository (issue #184 on Wan-Video/Wan2.2, for example) filled up with users asking whether the model would ever ship with downloadable weights, and several community members complained that Alibaba had pivoted toward a more commercial posture for its frontier visual model.[23]
The move was part of a broader pattern. Alibaba had begun reserving its top-tier Qwen and Wan releases for paid API access while still open-sourcing the prior generation. For Wan 2.5, that meant Wan 2.1 and Wan 2.2 stayed downloadable from Hugging Face under permissive licenses, while Wan 2.5 itself was accessible only through Alibaba Cloud's Bailian gateway and through resellers that licensed the model. The same split has continued into Wan 2.6 and Wan 2.7, with Alibaba indicating that the planned Wan 3.0 release (60 billion parameters, 4K output, expected mid-2026) is intended to return the flagship line to open weights under Apache 2.0.[36]
Wan2.5-Preview was unveiled at the Apsara 2025 conference, Alibaba Cloud's annual flagship developer event held in Hangzhou.[1] Alibaba's own communications, the alizila.com newsroom, and conference coverage from Fintech News Hong Kong place the announcement around September 23 to 26, 2025.[7][8] The preview release came with four model variants: Wan2.5-T2V-Preview (text-to-video), Wan2.5-I2V-Preview (image-to-video), Wan2.5-T2I-Preview (text-to-image), and Wan2.5-I2I-Preview (image editing).[10] Together these four variants are pitched as a unified multimodal stack rather than four loosely linked products, since they share a backbone trained jointly across text, image, video, and audio.[9]
Following the September preview, Alibaba ran a formal commercial launch event for Wan2.5-Preview on November 11, 2025, advertising the model as ready for enterprise use in advertising, film pre-production, and short-form content.[2] Wider availability through additional platforms (Fal.ai, WaveSpeed AI, Higgsfield, Kie.ai, Pollo AI, and others) followed during October and November 2025.
Wan 2.6 launched on December 16, 2025, less than three months after Wan 2.5.[3] The 2.6 release introduced a reference-to-video model called Wan2.6-R2V, multi-shot storytelling from a single prompt, and clip lengths up to 15 seconds.[12] Because Wan 2.6 ships under the same API-only commercial model as Wan 2.5, Wan 2.5 has effectively functioned as an interim preview product rather than a long-lived flagship.
Wan 2.7 followed in early April 2026, with the text-to-video model going live on April 3 and the full four-variant suite (text-to-video, image-to-video, reference-to-video, instruction-based video editing) announced on April 6.[13] The 2.7 release introduced a Thinking Mode reasoning stage, first-and-last-frame interpolation, video-to-video instruction editing, and an expanded prompt window.[32] As with 2.5 and 2.6, the headline Wan 2.7 endpoints are sold through Alibaba Cloud and third-party hosts rather than published as open weights.[31]
| Release | Date | Format | License | Highlight |
|---|---|---|---|---|
| Wan 2.1 | February 2025 | Open weights | Apache 2.0 | Diffusion-transformer text-to-video and image-to-video |
| Wan 2.1-VACE | Spring 2025 | Open weights | Apache 2.0 | Video editing and reference-driven generation |
| Wan 2.1 updates | March to May 2025 | Open weights | Apache 2.0 | Iterative quality and resolution improvements |
| Wan 2.2 (A14B, TI2V-5B) | July 28, 2025 | Open weights | Apache 2.0 | First open-source MoE video model, 5B lightweight variant [26] |
| Wan2.2-S2V | August 26, 2025 | Open weights | Apache 2.0 | Speech-to-video digital human avatar generation [4] |
| Wan2.2-Animate | September 19, 2025 | Open weights | Apache 2.0 | Unified character animation and replacement [5] |
| Wan2.5-Preview | September 23 to 26, 2025 | Closed API | Commercial | Native audio, 10-second clips, 1080p, unified multimodal stack [1] |
| Wan2.5-Preview commercial launch | November 11, 2025 | Closed API | Commercial | Enterprise launch event in Hangzhou [2] |
| Wan 2.6 | December 16, 2025 | Closed API | Commercial | Reference-to-video, multi-shot stories, 15-second clips [3] |
| Wan 2.7 | April 3 to 6, 2026 | Closed API | Commercial | Thinking Mode, first-and-last-frame, video-to-video editing [13] |
| Wan 3.0 (pre-announced) | Expected mid-2026 | Open weights (pre-announced) | Apache 2.0 (pre-announced) | 60B parameters, 4K resolution, 30-second clips [36] |
The headline addition in Wan 2.5 is native audio-visual synchronization. Earlier Wan models produced only silent video, so any audio had to be added in a separate pass; Wan 2.5 generates voice, sound effects, and background music in lockstep with the picture.[19] The model can produce lip-synced dialogue in Chinese and English, ambient audio that matches the on-screen environment, and music beds, all from a single text or image prompt. Alibaba's marketing for the model leans on the phrase "native multimodality" to describe this property.[9]
Other capabilities are upgrades rather than fresh additions. Single-clip duration rises to 10 seconds, double the 5-second ceiling on Wan 2.2 and in the same range as the short-form clips produced by Sora 2 and Veo 3.[10] Native resolution is 1080p at 24 frames per second, with 4K output advertised as available in select preview endpoints.[9] Camera control instructions, multilingual text rendering, and complex structural content (charts, architectural diagrams, typographic compositions) are all listed by Alibaba Cloud and by third-party reviewers as improved over the prior generation.[17]
| Capability | Wan 2.5 specification |
|---|---|
| Text-to-video | Text-to-video at 1080p, up to 10 seconds, 24 fps |
| Image-to-video | Image-to-video from a single reference image, up to 10 seconds, 1080p |
| Native audio | Synchronized voice, sound effects, and music generated jointly with video |
| Lip-sync | Phoneme-level mouth movement for Chinese and English dialogue |
| Resolution | 480p, 720p, 1080p standard; 4K in select preview endpoints |
| Frame rate | 24 frames per second |
| Languages (prompt) | Chinese and English, with partial support for other languages |
| Text-to-image | Photorealistic and stylized outputs, with multilingual text rendering |
| Image editing | Instruction-based edits with multi-image reference and high consistency |
| Camera controls | Pan, zoom, focus pulls, and director-style cinematic instructions |
| Access model | Closed-weight API through Alibaba Cloud Bailian and Tongyi Wanxiang |
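The limits in the specification table can be read as constraints on a single generation request. The Python sketch below encodes them as a check on a hypothetical `ClipRequest` object. The class, its field names, the one-second minimum, and the decision to leave the preview-only 4K tier out are all illustrative assumptions; none of this is taken from an Alibaba SDK.

```python
from dataclasses import dataclass

# Limits copied from the specification table; the constant names are illustrative.
SUPPORTED_RESOLUTIONS = {"480p", "720p", "1080p"}  # 4K is preview-only, so it is omitted here
MAX_DURATION_SECONDS = 10
FRAMES_PER_SECOND = 24


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape for illustration, not an official API object."""

    prompt: str
    resolution: str = "1080p"
    duration_seconds: int = 5

    def validate(self) -> None:
        """Raise ValueError if the request falls outside the documented limits."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        # The 1-second lower bound is an assumption; the table only gives the ceiling.
        if not 1 <= self.duration_seconds <= MAX_DURATION_SECONDS:
            raise ValueError("duration must be between 1 and 10 seconds")

    def frame_count(self) -> int:
        """Number of frames implied by the duration at 24 frames per second."""
        return self.duration_seconds * FRAMES_PER_SECOND
```

Under these assumptions a full-length 10-second request corresponds to 240 frames, and a 15-second request (valid on Wan 2.6 and 2.7) is rejected.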
The audio side of Wan 2.5 has had the most attention from creators because it changes the production workflow. Before Wan 2.5, a typical pipeline involved generating silent video with one model, then routing the result through a text-to-speech engine and a sound-design pass. Wan 2.5 collapses those steps into one inference, which reduces latency and avoids the lip-sync mismatch that comes from gluing separately generated assets together. Reviewers from getimg.ai, Curious Refuge, and Pollo AI singled out audio as the most disruptive feature of the release.[19][18][20]
Before the Wan 2.5 multimodal flagship arrived, Alibaba published two specialized open-weight models that extended the Wan 2.2 backbone into audio-driven and character-driven workflows. Both shipped under Apache 2.0 with full code on GitHub and weights on Hugging Face and ModelScope. They remain the most-used open-weight alternatives for users who want self-hosted, locally controllable video generation.
Wan2.2-S2V is an open-source speech-to-video model released by Alibaba on August 26, 2025, which Alibaba positions as a digital-human and AI-avatar tool.[4] The model takes a single still portrait and an audio clip (spoken dialogue, singing, or any other vocal recording) and produces a film-quality animated video in which the character in the photo lip-syncs and gestures in time with the audio.[28] It supports portrait, bust, and full-body framings, vertical and horizontal aspect ratios, and multi-character scenes where more than one figure in the input image is animated against a coherent backdrop.[4]
Technically, Wan2.2-S2V combines text-guided global motion control with audio-driven fine-grained local movements. Alibaba's research team built a dedicated large-scale audio-visual dataset for film and television production scenarios and trained the model with multi-resolution sampling so that the same checkpoint can output everything from short-form vertical clips for TikTok and Douyin to 16:9 broadcast format.[4] A central engineering trick in Wan2.2-S2V is compression of arbitrarily long histories of past frames into a compact latent representation, which is what allows the model to remain stable across the kind of long sustained takes that talking-head video typically requires.[28]
At launch, Wan2.2-S2V weights and inference code were available through the official Wan-Video GitHub organization, Hugging Face, and Alibaba Cloud's ModelScope. The release was framed by Alibaba as a continuation of the open-source posture established by Wan 2.1 and Wan 2.2, and it has been adopted as the de facto open implementation of audio-driven video for users who do not want to depend on closed APIs such as HeyGen or D-ID for digital-avatar workflows.
Wan2.2-Animate is a unified character-animation and character-replacement model released on September 19, 2025, just a few days before the Wan 2.5 announcement at Apsara.[5] The model is built on a 14-billion-parameter backbone derived from Wan 2.2 and supports two operating modes from the same checkpoint.[27] In animation mode, Wan2.2-Animate takes a character image and a reference video and animates the character to match the expressions and movements in the reference at 720p and 24 frames per second.[6] In replacement mode, the model takes an existing video and swaps the on-screen character for an animated version of the supplied reference image, while preserving the reference video's lighting, color grading, and camera moves so that the replacement does not look pasted in.[27]
Alibaba published the Wan2.2-Animate weights and inference code under Apache 2.0, similarly available on GitHub, Hugging Face, and ModelScope. The release post on the official Wan blog framed the model as a one-stop solution for high-fidelity character animation;[5] reviewers on 302.AI's deep dive and Atlas Cloud's Wan-2.2-Animate-Mix benchmark both highlighted the precision of facial-expression transfer and the smooth handoff between scenes when the model replaces moving characters. Wan2.2-Animate has become a popular open-weight comparison point for closed alternatives such as Runway Act-One and the character-animation features inside Sora 2 and Hailuo, partly because it can be fine-tuned with LoRA adapters on top of the public weights.
A brief summary of how the two specialized 2.2-series variants compare:
| Variant | Released | Input | Output | Best use case |
|---|---|---|---|---|
| Wan2.2-S2V | August 26, 2025 | Portrait image plus audio clip | Talking, singing, or performing avatar video | Digital humans, dubbed avatars, vertical short-form |
| Wan2.2-Animate (animation mode) | September 19, 2025 | Character image plus reference video | 720p 24 fps animated character | Animation of static characters using a motion reference |
| Wan2.2-Animate (replacement mode) | September 19, 2025 | Reference video plus character image | Reference video with original character swapped out | Face and body replacement with preserved lighting and scene |
Both variants pre-date Wan 2.5 by a few weeks and are still actively maintained on the Wan-Video GitHub organization. They are the practical reason that, even after Alibaba pivoted to a closed-API distribution model for the 2.5 through 2.7 flagships, the open-weight Wan ecosystem has continued to grow rather than stall.
Wan 2.5, Wan 2.6, and Wan 2.7 share an architecture family and a commercial-API distribution model, but each release targets different production needs. Wan 2.5 introduced native audio-visual generation and 10-second clips for the first time.[9] Wan 2.6 holds onto those features and adds multi-shot narrative generation, longer 15-second outputs, and a reference-to-video pipeline (Wan2.6-R2V) that uses a user-supplied reference video to preserve a specific character's face and voice across newly generated scenes.[3] Wan 2.7 keeps the 15-second clip ceiling and the audio system, then layers on a Thinking Mode reasoning pass, first-and-last-frame interpolation that lets a user pin both endpoints of a clip, an instruction-based video editing endpoint, and an expanded prompt window for longer, more structured descriptions.[32]
| Dimension | Wan 2.5-Preview | Wan 2.6 | Wan 2.7 |
|---|---|---|---|
| Announcement date | September 23 to 26, 2025 (Apsara 2025) | December 16, 2025 | April 3 to 6, 2026 |
| Maximum clip length | 10 seconds | Up to 15 seconds | Up to 15 seconds |
| Audio generation | Native synchronized audio | Enhanced audio with improved lip-sync | Native synchronized audio with reference voice cloning |
| Reference-to-video | Not supported | Supported via Wan2.6-R2V | Supported with multi-reference consistency |
| First-and-last-frame | Not supported | Not supported | Supported (pin both endpoints) |
| Thinking Mode | Not supported | Not supported | Supported (planning stage before generation) |
| Instruction editing | Image editing only (I2I) | Image editing only (I2I) | Full instruction-based video editing |
| Multi-shot storytelling | Single-shot output | Multi-shot scenes from a single prompt | Multi-shot scenes with sharper continuity |
| Image and text models | Wan2.5-T2I-Preview and Wan2.5-I2I-Preview | Comprehensive upgrades to all four existing models | Refresh of text-to-image and image-editing variants |
| Distribution | Closed API on Alibaba Cloud Bailian | Closed API on Alibaba Cloud Model Studio and Wan website | Closed API on Alibaba Cloud, WaveSpeed, Fal.ai, OpenRouter, Together AI |
| Positioning | First multimodal flagship in the Wan family | Commercial follow-up emphasising consistency over multiple shots | First Wan release with an explicit reasoning stage |
For practical use, Wan 2.6 fixes most of the rough edges that creators reported in Wan 2.5. The video-to-video editing platform Videoweb.ai compared the two models directly and noted that Wan 2.6 reduced face drift in image-to-video pipelines, held visual coherence across the full clip rather than degrading after 5 to 7 seconds, and handled multi-step prompt instructions more faithfully.[14] The trade-off is that Wan 2.6 demands more from users on the prompt side; it expects explicit scene structuring to take advantage of the multi-shot system.
Wan 2.7 shifts the burden in the opposite direction. With Thinking Mode active, the model spends a small amount of additional inference time building an internal compositional plan, deciding how prompt elements should relate in space and time and what the narrative logic of the sequence should be before any pixels are generated.[32] Reviewers at Tellers, InVideo, and Cliprise have described this as the most significant architectural change in the 2.x line, on the grounds that it changes Wan from a single-pass diffusion model into a model that does something closer to chain-of-thought planning.[31][32][30] First-and-last-frame control is the second major Wan 2.7 addition: users supply two reference frames and let the model interpolate the action between them, a workflow that ByteDance's Seedance and Kuaishou's Kling have offered for several releases.[29] Video-to-video instruction editing closes the gap with Runway Gen-3's editing surface and brings the Wan family into rough functional parity with the Western flagships.
Wan 2.5's licensing has been a sore point with the Wan community. Wan 2.1 (February 2025) and Wan 2.2 (July 2025) both shipped under the Apache 2.0 license with full code and weights on GitHub and Hugging Face under the Wan-Video and Wan-AI organizations.[22] Wan 2.5 broke that pattern. As of late 2025 and into 2026, the official Wan 2.5 weights have not been published to Hugging Face, the Wan-Video GitHub organization, or ModelScope, and Alibaba has not publicly committed to an open-source release for the 2.5 generation.[23] Wan 2.6 and Wan 2.7 have continued that closed-API pattern: at launch, neither model shipped with downloadable weights, and Alibaba has framed both as commercial offerings sold through DashScope and partner platforms.[31]
A few third-party repositories labelled as "Wan 2.5" (including community uploads of FP16 inference shards) appeared on Hugging Face during late 2025, but those are not endorsed Alibaba releases. Reviewers and community blogs (including Flowith and AI Central's open-source coverage) describe Wan 2.5 as Alibaba's first true "commercial-only" entry in the Wan family.[36] Wan 2.6 and Wan 2.7 have continued the same pattern, with no Alibaba-published open weights and access available only through paid API surfaces.[35]
For users who want open weights from the Wan family, Wan 2.2 remains the most recent fully open flagship release. That includes the Wan2.2-T2V-A14B mixture-of-experts variant, the Wan2.2-I2V-A14B image-to-video model, and the lightweight Wan2.2-TI2V-5B model that can run on a single 24 GB GPU.[26] The two specialized 2.2-derived releases (Wan2.2-S2V for speech-to-video and Wan2.2-Animate for character animation and replacement) are also fully open under Apache 2.0.[4][5] Wan 2.5, Wan 2.6, and Wan 2.7 all sit behind paid APIs with no Alibaba-blessed offline option for self-hosted inference.
The community's expectation, based on Alibaba's communications around the Apsara 2026 keynote and the Wan 3.0 pre-announcement, is that the flagship line will revert to Apache 2.0 with the Wan 3.0 release expected in mid-2026. Wan 3.0 has been described in pre-launch material as a 60-billion-parameter model with 4K output and 30-second clip durations, and Alibaba has indicated that it intends to ship weights publicly.[36] Whether that pre-announcement holds remains a point of debate inside the community, but it is the strongest signal so far that the Wan 2.5 through 2.7 commercial-only stretch is not a permanent reorientation.
Alibaba has shared less about Wan 2.5's internals than it did for Wan 2.1 and Wan 2.2, both of which came with detailed technical reports. From the disclosed material, Wan 2.5 is built on a unified multimodal backbone that processes text, image, video, and audio inside the same network rather than chaining separate models. Alibaba's launch communications describe this design as "native multimodal architecture, deep alignment,"[9] and the model card discussions on partner platforms (WaveSpeed, Fal.ai, Higgsfield) emphasize a joint diffusion-transformer trunk that emits aligned audio and video tokens at inference time.[17]
A second disclosed component is reinforcement learning from human feedback (RLHF) applied on top of paired multimodal datasets, which Alibaba credits with the model's improved instruction adherence and reduced visual artifacts. Beyond those points, parameter counts, training compute, and dataset composition for Wan 2.5 have not been published at the level of detail that Alibaba provided for prior open-weight Wan releases. The technical report that historically accompanied a Wan release has not been issued for Wan 2.5 at the time of writing.
The Wan 2.6 and Wan 2.7 model cards on partner platforms describe both as continuations of the same unified multimodal backbone, with Wan 2.6 adding a longer temporal context window for multi-shot scene generation and Wan 2.7 adding an explicit Thinking Mode planning pass implemented as a separate reasoning head before the diffusion-transformer decoder.[13] None of the three closed releases has a published parameter count, but third-party teardowns and Alibaba's own framing suggest that Wan 2.5 through 2.7 sit between the open Wan 2.2 A14B (~27 billion total parameters, ~14 billion active per step) and the pre-announced Wan 3.0 (60 billion parameters).[36]
Wan 2.5 is sold through Alibaba Cloud's Bailian platform (Model Studio in English) and the Tongyi Wanxiang consumer website. Bailian exposes the four preview endpoints (wan2.5-t2v-preview, wan2.5-i2v-preview, wan2.5-t2i-preview, and wan2.5-i2i-preview) plus their commercial-launch counterparts.[10] The API follows Alibaba Cloud's DashScope SDK conventions, with API key authentication and asynchronous polling for video jobs. Enterprise customers also have access to a managed deployment path via Alibaba Cloud accounts. Wan 2.6 and Wan 2.7 endpoints follow the same conventions, with names such as wan2.6-t2v, wan2.6-r2v, wan2.7-t2v, and wan2.7-vace surfaced on DashScope alongside the 2.5 endpoints.[13]
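The asynchronous polling pattern mentioned above can be sketched in Python without committing to any particular SDK. In the sketch below, `submit` and `fetch_status` are hypothetical callables that the caller would wrap around whatever client they use, and the status strings and default timings are assumptions rather than documented DashScope values.

```python
import time
from typing import Callable, Dict

# Assumed status names; check the provider's documentation for the real set.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_video(
    submit: Callable[[], str],
    fetch_status: Callable[[str], Dict[str, str]],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Submit a generation job, then poll until it reaches a terminal state.

    `submit` returns a task identifier and `fetch_status` returns a dict
    containing at least a "status" key. Both are hypothetical hooks.
    """
    task_id = submit()
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status(task_id)
        if result.get("status") in TERMINAL_STATES:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout} seconds")
        sleep(poll_interval)
```

Passing the two hooks in as arguments keeps the loop testable with stubs and leaves authentication (the API key) inside the caller's own client code.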
Pricing varies by reseller, since Wan 2.5 is distributed through multiple partner platforms. The most widely cited official rate is from Alibaba Cloud DashScope. Other figures below are pulled from pricing pages of partner platforms that license Wan 2.5 directly from Alibaba.[16]
| Platform | Pricing (USD; Wan 2.5 unless noted) | Notes |
|---|---|---|
| Alibaba Cloud DashScope | About $0.105 per second of video [16] | Pay-for-success billing; failed generations not charged |
| Kie.ai pay-as-you-go | $0.06 per second at 720p, $0.10 per second at 1080p [16] | Credits convert to per-second pricing |
| Higgsfield AI | 12 credits per 720p 5-second clip; 20 credits per 1080p 5-second clip [17] | Bundled into $9 or $17.40 monthly plans |
| Wan-ai.co | $9.90 for 100 non-expiring credits, $99 for 1,250 credits | Resold consumer-grade access |
| Tongyi Wanxiang web platform | $1.50 for 30 credits up to $100 for 3,900 credits | Pro $5 per month (300 credits), Premium $20 per month (1,200 credits) |
| OpenRouter (Wan 2.7) | About $0.10 per second of 1080p video [34] | Wan 2.7 image-to-video and text-to-video |
| WaveSpeed AI (Wan 2.7) | About $0.10 to $0.15 per second [13] | Wan 2.7 full-suite reseller |
Reviewers in the Wan 2.5 cost guides consistently call the model the "budget king" of high-end video generation, citing pricing that runs about half of Sora 2 and Veo 3 for comparable resolution and duration.[15] As of late 2025, the Akool cost guide put Wan 2.5 at roughly $0.25 to $1.50 per generation versus $2 to $5 for Sora 2 or Veo 3 on the same workloads.[15] Whether that gap holds depends on which reseller a customer goes through; the official Alibaba Cloud rate is closer to parity with Western providers, while resold consumer platforms tend to be cheaper.
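To make the per-second and per-credit figures in the pricing table comparable, the arithmetic can be written out as a short Python sketch. The rates are copied from the table (the DashScope figure is approximate, as stated there); the function names and dictionary keys are illustrative, and the sketch does not cover how many credits a given clip consumes on the credit-based platforms, which the table does not state for every reseller.

```python
# Per-second rates in USD, copied from the pricing table. Keys are illustrative labels.
RATES_PER_SECOND = {
    "DashScope": 0.105,     # "about $0.105 per second", so treat as approximate
    "Kie.ai 720p": 0.06,
    "Kie.ai 1080p": 0.10,
}


def clip_cost(platform: str, seconds: float) -> float:
    """Estimated cost in USD of one successful clip of the given length."""
    return round(RATES_PER_SECOND[platform] * seconds, 4)


def price_per_credit(bundle_price: float, credits: int) -> float:
    """Effective USD price of a single credit in a prepaid bundle."""
    return round(bundle_price / credits, 4)
```

At these rates a 10-second clip costs about $1.05 on DashScope, $1.00 on Kie.ai at 1080p, and $0.60 on Kie.ai at 720p. The two Wan-ai.co bundles work out to roughly $0.099 and $0.079 per credit.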
No VolcEngine-hosted endpoint for Wan 2.5 has been confirmed by Alibaba. ByteDance's VolcEngine cloud carries ByteDance's own Seedance video model rather than Wan, so claims of VolcEngine pricing for Wan 2.5 should be treated with caution.
Wan 2.5 competes against the Western video flagships (Sora 2 and Veo 3) and against models from other Chinese providers (Seedance from ByteDance, Hailuo AI from MiniMax, and Kling from Kuaishou). Reviewers at goenhance.ai, Curious Refuge, and Management Works Media tend to score these on three axes: physics and world simulation, cinematic camera control, and audio.[25][18][24]
| Model | Provider | Native audio | Max clip length | Resolution | Open weights |
|---|---|---|---|---|---|
| Wan 2.5 | Alibaba | Yes, synchronized | 10 seconds | 1080p, 4K preview | No |
| Wan 2.6 | Alibaba | Yes, improved lip-sync | 15 seconds | 1080p | No |
| Wan 2.7 | Alibaba | Yes, with reference voice cloning | 15 seconds | 1080p | No |
| Sora 2 | OpenAI | Yes | 12 seconds (Sora 2 Pro) | 1080p | No |
| Veo 3 | Google DeepMind | Yes | 8 seconds | 1080p | No |
| Seedance | ByteDance | Limited | 10 seconds | 1080p | Partial (Seedance 1.0 weights) |
| Hailuo AI | MiniMax | Yes (separate pass) | 10 seconds | 1080p | No |
On physics, Sora 2 generally leads. The model is positioned by OpenAI as a "world model" and reviewers consistently flag its handling of gravity, fluid dynamics, and object collisions as ahead of the field. Wan 2.5 is competitive on stationary or stylized scenes but tends to lose coherence in crowd shots or in scenes with multiple moving subjects interacting; the Curious Refuge and Apatero reviews both flag crowd scenes and object collisions as Wan 2.5's weak points.[18][21] Wan 2.7's Thinking Mode is Alibaba's first explicit attempt to close that physics and continuity gap by adding a planning step that should, in theory, reduce the number of physically implausible compositions the diffusion stage has to clean up.[32]
On cinematic control, Veo 3 is widely considered the strongest, partly because Google's training corpus is heavy in film and broadcast material. Wan 2.5 includes director-style camera instructions and benefits from carrying over Wan 2.2's strong pan and tracking-shot support, but reviewers describe it as somewhere behind Veo 3 on lighting nuance and lens emulation.[25] Wan 2.7 picks up first-and-last-frame control, which is a direct response to the same feature in Seedance and Kling.[29]
On audio, Wan 2.5 and Veo 3 are roughly comparable. Both can do native dialogue, sound effects, and music in one inference pass. Sora 2 also does native audio. Hailuo and Seedance handle audio less elegantly, with separate model passes for sound. For multilingual dialogue, Wan 2.5 has an advantage in Chinese, which is the primary training language and the one Alibaba optimizes for.
Pricing is where Wan 2.5 most clearly separates from the competition. At roughly $0.10 per second of 1080p video on Alibaba Cloud DashScope, Wan 2.5 undercuts Sora 2 by about half, which has made it the default choice for high-volume social-media producers and small studios that need lots of iterations rather than a few hero shots.[15]
Reaction to Wan 2.5 has been generally positive on capability and pricing, mixed on stability, and frustrated on licensing. The getimg.ai review described Wan 2.5 as "not just another incremental update" and credited the native audio system with a genuinely disruptive workflow change.[19] Curious Refuge's review called it "one of the most capable open or semi-open video generators available," though that framing leans heavily on the assumption that Wan 2.5 would eventually be open-sourced like Wan 2.2, which has not happened.[18]
Multiple reviewers flagged the same set of weaknesses. Wan 2.5's physics handling for water, hands, and small object collisions is rated as "slightly off" by Curious Refuge.[18] Pollo AI's hands-on review noted that complex multi-character scenes can lose visual coherence past about 7 seconds, which is one of the issues that Wan 2.6 explicitly targeted.[20] The Apatero review of Wan 2.6 framed the 2.6 release as a polish pass on 2.5 rather than a leap forward, which implicitly says that Wan 2.5's failure modes were specific and addressable rather than fundamental.[21]
Reception of Wan 2.7 has been more mixed. Cliprise framed the 2.7 suite as "the most complete open-source video production stack in existence,"[30] but several reviewers including Tellers and Apatero pushed back on the open-source framing and noted that the headline Wan 2.7 endpoints are closed-weight.[31] InVideo's hands-on tests of Thinking Mode reported visibly improved scene composition for prompts that contain multiple subjects and temporal sequencing, but also a longer wait time per generation.[32] MindStudio's comparison of Wan 2.7 against Seedance and Kling concluded that Wan 2.7 had closed most of the feature gap with the leading Chinese competitors while remaining the cheapest of the three on a per-second basis.[29]
The open-source dispute has been the most visible point of community pushback. On the Wan-Video GitHub repository, issue #184 ("Is WAN 2.5 going to be open source?") and issue #291 ("Wan 2.5 weights? Will be open-sourced?") have collected dozens of comments from users questioning Alibaba's commercial pivot.[23] Issue #181 on the same repository, titled "Thank you Alibaba for deceiving and using the open source community," captured the most aggressive end of that reaction. The community has continued to download Wan 2.2 weights in large numbers (Wan 2.2 surpassed 1 million Hugging Face downloads within six days of its release, according to Alibaba's own retrospectives) while waiting to see whether the 2.5, 2.6, or 2.7 weights will eventually become available. The Wan 3.0 pre-announcement, with its explicit return to Apache 2.0, has cooled some of the worst friction but not removed it.[36]
Despite the licensing friction, Wan 2.5 has been adopted by partner platforms more quickly than any prior Wan model, in part because its commercial-API distribution model is friendlier to resellers than Apache 2.0 open weights. As of early 2026, Wan 2.5 is available through Fal.ai, WaveSpeed AI, Higgsfield, Kie.ai, Pollo AI, Atlas Cloud, Imagine.art, VideoMaker.me, Vadoo, FluxPro, and Monet Vision, in addition to the official Alibaba surfaces. Wan 2.7 has expanded that footprint further, with OpenRouter, Together AI, Eachlabs, and Somake also listing 2.7 endpoints.[13] That breadth of distribution is one reason Wan 2.5 became the default Chinese competitor cited in Western coverage of the Sora 2 release.[24]