Seedance 2.0
Last reviewed
Jun 2, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,721 words
Seedance 2.0 is a multimodal video generation model developed by ByteDance, released in February 2026 as the second major version of the company's Seedance line. Its defining feature is native audio-video joint generation: rather than producing silent footage and adding sound afterward, the model outputs synchronized visuals and audio together, including lip-synced speech, ambient sound, and music. It accepts four input modalities (text, image, audio, and video) and reaches consumers through ByteDance's Dreamina and Jimeng creative apps and the CapCut editor, while developers access it through an API priced on the company's Volcengine cloud platform.[1][2][3]
Seedance 2.0 generates short video clips of 4 to 15 seconds at native resolutions of 480p and 720p (measured on the shorter edge), across six aspect ratios. A single generation can span multiple camera shots with cuts inside one clip, and the accompanying audio track is produced in the same pass as the picture rather than dubbed on afterward.[1][3] ByteDance positions the model for narrative and short-form use cases such as short dramas, advertising, and social video, and the company describes its target output as "cinematic," with controls over performance, lighting, shadow, and camera movement.[2]
The release sits within an intense period of competition among Chinese technology firms building AI video tools. When the model surfaced in China, shares of several Chinese media and content companies rallied, with Huace Media rising roughly 7 percent and Perfect World about 10 percent, as investors weighed the implications of cheaper AI-assisted production for film and television.[4]
The model was built by ByteDance Seed, the company's foundation-model research group, which also produces the Seedream family of image models. Seedance 2.0 follows earlier Seedance releases, including Seedance 1.0 and the intermediate Seedance 1.5 Pro, and a technical report frames 2.0 as a substantial step up across the main dimensions of video and audio quality rather than an incremental update.[3]
The Seedance branding is shared across ByteDance's product surfaces. On the company's Chinese consumer app, Jimeng, and the international app, Dreamina (both tied to the CapCut and Jianying editors), the model powers AI video features; on the developer side it is referenced as Doubao-Seedance-2.0, aligning it with ByteDance's Doubao consumer assistant and model brand.[5][6]
ByteDance Seed announced Seedance 2.0 on February 12, 2026, describing it on the company's research site as adopting "a unified multimodal audio-video joint generation architecture that supports text, image, audio, and video inputs."[1][2] The model had already begun circulating in China over the preceding weekend, drawing attention on social platforms before the formal announcement.[4]
The launch was not entirely smooth. On February 10, 2026, two days before the formal announcement, ByteDance suspended a feature that could synthesize a person's voice from a facial photograph alone. The capability drew scrutiny after a technology commentator uploaded his own photo and reported that the model produced audio nearly identical to his real voice without any voice sample, raising concerns about deepfakes, fraud, and impersonation. In response, ByteDance barred the use of realistic human photos or videos as reference subjects and added a live-verification step, requiring users to record their own image and voice before creating a digital avatar.[7]
Distribution expanded over the following weeks. Through CapCut's paid tier, Dreamina Seedance 2.0 reached users across Southeast Asia, Latin America, Africa, the Middle East, parts of Europe, Japan, and the United States, while a developer-facing API was published with pricing on Volcengine.[5][6]
Seedance 2.0's central capability is generating video and its soundtrack together in one model pass. The audio includes dialogue with lip synchronization in multiple languages, along with ambient effects and music, and is produced as a dual-channel track.[1][3] Because picture and sound are generated jointly, the model can keep speech, on-screen action, and scene cuts aligned within a clip.[6]
The model takes mixed-modality input. A single request can combine natural-language instructions with reference media: ByteDance's open platform allows up to 9 images, 3 video clips, and 3 audio clips per generation.[1][3] These references support what the company calls director-level control, letting users steer character performance, lighting, camera motion, and other elements using example media rather than text alone.[2]
Beyond generation from scratch, Seedance 2.0 supports video extension and editing. ByteDance describes stable and controllable extension of existing footage and targeted modification of specified clips, characters, actions, and storylines, positioning editing as a first-class function alongside fresh generation.[1][3]
The following table summarizes the model's disclosed specifications.
| Attribute | Specification |
|---|---|
| Output type | Joint audio plus video, single pass |
| Duration | 4 to 15 seconds |
| Native resolution | 480p and 720p (shorter edge) |
| Aspect ratios | Six, including landscape and portrait |
| Audio | Dual-channel; dialogue, ambient sound, music |
| Input modalities | Text, image, audio, video |
| Reference limits | Up to 9 images, 3 video clips, 3 audio clips |
| Multi-shot | Multiple shots with cuts within one clip |
| Editing | Video extension and targeted clip and character edits |
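The disclosed limits can be checked locally before a request is assembled. The following is a minimal sketch, assuming a hypothetical `GenerationRequest` structure: the class, its field names, and the `validate` helper are illustrative only and are not part of any ByteDance or Volcengine API. Only the numeric limits come from the table above.

```python
from dataclasses import dataclass, field
from typing import List

# Limits copied from the specification table; everything else is hypothetical.
MAX_IMAGES, MAX_VIDEOS, MAX_AUDIO = 9, 3, 3
MIN_SECONDS, MAX_SECONDS = 4, 15
RESOLUTIONS = {"480p", "720p"}


@dataclass
class GenerationRequest:
    """Illustrative container for one generation request (not a real API type)."""
    prompt: str
    duration_seconds: int
    resolution: str = "720p"
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)
    audio: List[str] = field(default_factory=list)


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not MIN_SECONDS <= req.duration_seconds <= MAX_SECONDS:
        problems.append(f"duration must be {MIN_SECONDS} to {MAX_SECONDS} seconds")
    if req.resolution not in RESOLUTIONS:
        problems.append("resolution must be 480p or 720p")
    if len(req.images) > MAX_IMAGES:
        problems.append(f"at most {MAX_IMAGES} reference images")
    if len(req.videos) > MAX_VIDEOS:
        problems.append(f"at most {MAX_VIDEOS} reference video clips")
    if len(req.audio) > MAX_AUDIO:
        problems.append(f"at most {MAX_AUDIO} reference audio clips")
    return problems


ok = GenerationRequest(prompt="a street at dusk", duration_seconds=10)
too_big = GenerationRequest(
    prompt="montage", duration_seconds=20, images=["ref.png"] * 10
)
print(validate(ok))       # no problems for a request inside the limits
print(validate(too_big))  # flags the duration and the image count
```

Returning a list of problems rather than raising on the first one lets a caller report every violated limit at once.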
ByteDance has released limited architectural detail. The company and an associated technical report describe a "unified, highly efficient, and large-scale architecture" for multimodal audio-video joint generation supporting the four input modalities, and note a lower-latency "Fast" variant intended for scenarios that prioritize speed.[1][3] A research paper titled "Seedance 2.0: Advancing Video Generation for World Complexity," credited to ByteDance Seed and posted in April 2026, accompanies the release; observers have characterized it as closer to a product and benchmark showcase than a detailed account of training data, infrastructure, or model internals.[3]
ByteDance evaluates the model with its own benchmark suite, SeedVideoBench-2.0, which it reports covers multiple task types and quality dimensions and on which it places Seedance 2.0 in a leading position.[2] Because that benchmark is the developer's own, the independent leaderboard results discussed below are a more useful external reference.
Seedance 2.0 is offered to consumers through ByteDance's creative apps and to developers through its cloud. Consumer access runs through Jimeng in China and Dreamina internationally, both connected to the CapCut and Jianying editors, generally on paid tiers. For developers, Volcengine published per-token pricing for the Doubao-Seedance-2.0 model, billed by token consumption rather than by clip. ByteDance reported that generating a 15-second clip consumes roughly 308,880 tokens; at the pure-generation rate of 46 yuan per million tokens, that comes to about 14.2 yuan per clip, or about 1 yuan (roughly 0.14 US dollars) per second. When the pricing was disclosed in early March 2026, the API was described as in limited or internal release rather than openly available to all third-party developers.[6][8]
| Channel | Platform | Audience | Access and pricing notes |
|---|---|---|---|
| Consumer app (China) | Jimeng (with Jianying / CapCut) | General users in China | Paid creative tiers |
| Consumer app (international) | Dreamina (with CapCut) | Users across Southeast Asia, Latin America, Africa, the Middle East, parts of Europe, Japan, and the United States | Rolled out on CapCut paid tier |
| Developer API | Volcengine (Doubao-Seedance-2.0) | Developers in China | 28 yuan / million tokens with video input (editing); 46 yuan / million tokens for pure generation; about 1 yuan (~$0.14) per second; limited release at launch |
Pricing reported by third-party resellers differs from the first-party figures above and is not used here. The yuan-to-dollar conversions are approximate and reflect the rates cited at the time of the announcement.[8]
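The per-second figure follows directly from the reported token count and the per-million-token rate. The sketch below reproduces that arithmetic using only the first-party numbers quoted above. The function and constant names are illustrative. The `estimate_tokens` helper additionally assumes that token use scales linearly with clip length, which the sources do not state.

```python
# First-party figures quoted in the article.
TOKENS_PER_15S_CLIP = 308_880
YUAN_PER_MILLION_GENERATION = 46.0        # pure generation
YUAN_PER_MILLION_WITH_VIDEO_INPUT = 28.0  # editing with video input
USD_PER_YUAN = 0.14                       # approximate rate implied by the article


def cost_yuan(tokens: int, yuan_per_million: float) -> float:
    """Cost in yuan for a given token count at a per-million-token rate."""
    return tokens * yuan_per_million / 1_000_000


def estimate_tokens(seconds: float) -> float:
    """ASSUMPTION: tokens scale linearly with duration (not confirmed by the sources)."""
    return TOKENS_PER_15S_CLIP / 15 * seconds


clip_cost = cost_yuan(TOKENS_PER_15S_CLIP, YUAN_PER_MILLION_GENERATION)
per_second = clip_cost / 15

print(f"15 s clip: {clip_cost:.2f} yuan")            # 14.21 yuan
print(f"per second: {per_second:.2f} yuan")          # 0.95 yuan, i.e. about 1 yuan
print(f"per second: {per_second * USD_PER_YUAN:.2f} USD (approximate)")
```

The same `cost_yuan` helper can be applied to the 28-yuan rate, but the article gives no token count for editing requests, so no per-clip editing cost is derived here.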
Seedance 2.0 was received as a strong entrant in AI video. On the Artificial Analysis Video Arena, which ranks models by Elo rating derived from blind human preference votes, the entry listed as "Dreamina Seedance 2.0 720p" placed at or very near the top of the categories restricted to models that produce audio output. Shortly after launch it was reported as narrowly leading both the text-to-video and image-to-video arenas; in later snapshots of the audio-output text-to-video category it sat around second place with an Elo near 1,213, in a close race with other newly added systems. The arena is continuously re-rated as votes accumulate and new models are added, so its standings shift over time.[9][10]
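To make the Elo figures concrete, the conventional Elo expected-score formula converts a rating gap into an expected head-to-head preference rate. This is a minimal sketch, assuming the standard logistic form with a 400-point scale. The sources do not describe the arena's exact rating procedure, and the opponent rating of 1,200 is a made-up value used only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of votes for A over B under the conventional Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Equal ratings imply an even split of preference votes.
print(expected_score(1213, 1213))  # 0.5

# A 13-point gap (1,213 against a hypothetical 1,200) is only a slight edge.
print(round(expected_score(1213, 1200), 3))  # about 0.519
```

Under this formula a gap of a few points corresponds to a preference rate barely above one half, which is consistent with describing closely ranked models as being in a close race.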
Press coverage emphasized the realism of the output. The South China Morning Post quoted an early tester who said the reality enhancements made it "very hard to tell whether a video is generated by AI" and praised the storytelling and visual quality.[4] An industry newsletter described the model as state of the art and noted that its arrival coincided with OpenAI's decision, announced in March 2026, to wind down its Sora app and API, framing ByteDance's expansion as a contrast to that retreat.[5]
Several constraints follow from the model's design and rollout. Output is capped at 15 seconds per clip, and native resolution tops out at 720p on the shorter edge, lower than some competing systems advertise.[3] Open-platform inputs are bounded to 9 images, 3 video clips, and 3 audio clips per request.[1]
The voice-cloning incident illustrates a broader risk surface for a model that synthesizes realistic faces and voices: ByteDance withdrew the photo-to-voice feature and added identity verification after it was shown to reproduce a real person's voice without consent.[7] Access is also uneven. Consumer availability has expanded across many regions through CapCut's paid tiers, but the developer API was characterized as a limited or internal release at the time its pricing was published, rather than a generally open service.[6][8] Finally, ByteDance has disclosed little about the model's training data, scale, or internal architecture, so independent assessment rests largely on output quality and third-party preference rankings rather than published technical detail.[3]