Doubao Seedance
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,061 words
Doubao-Seedance is a family of video generation foundation models developed by ByteDance Seed, the AI research division of Chinese technology conglomerate ByteDance. The flagship model, Seedance 1.0 Pro, was announced by ByteDance's cloud subsidiary Volcano Engine on June 11, 2025 at the company's FORCE Original Power Conference in Beijing, alongside the Doubao 1.6 large language model.[^1] Seedance supports both text-to-video and image-to-video generation at up to 1080p resolution, with native support for coherent multi-shot narratives, and the accompanying technical report claims a roughly 10x inference speedup obtained through distillation and system-level optimization.[^2] At launch, ByteDance positioned the model as a direct competitor to Google's Veo 3, OpenAI's Sora, Kuaishou's Kling and Runway's video models, citing first-place rankings on the Artificial Analysis video arena leaderboards.[^3] The model is exposed commercially through Volcano Engine's "Ark" API platform and is integrated into ByteDance's consumer-facing Doubao chatbot and the Jimeng (Dreamina) creative tool.[^1]
| Field | Value |
|---|---|
| Developer | ByteDance Seed (Seed Vision Team) |
| Type | Text-to-video and image-to-video foundation model |
| First public release | June 11, 2025 (Seedance 1.0 Pro)[^1] |
| Lite variant release | May 13, 2025 (Seedance 1.0 Lite)[^4] |
| Technical report | arXiv 2506.09113, June 10, 2025[^2] |
| Maximum resolution | 1080p (Pro), 720p (Lite)[^5] |
| Maximum clip duration | 10 seconds[^5] |
| Distribution | Volcano Engine API, Doubao app, Jimeng/Dreamina[^1] |
| Pricing (launch) | 3.67 yuan per 5-second 1080p clip via Volcano Engine[^1] |
ByteDance's Seed research group, established in 2023 as the company's foundation-model arm, organizes its generative-media work under product-family names beginning with "Seed". Within that umbrella the Seedream series covers text-to-image and image editing, while Seedance is the parallel line for video.[^6] The first publicly named variant was Seedance 1.0 Lite, a "small-parameter" model that Volcano Engine unveiled on May 13, 2025 at the Shanghai stop of its FORCE LINK AI Innovation Tour. The Lite model supported 5-second and 10-second clips at 480p or 720p, was made available via the Volcano Ark enterprise platform, and was offered to consumers through the Doubao app and the Jimeng (Dreamina) creative platform.[^4]
The headline release came roughly four weeks later. On June 11, 2025 at the FORCE Original Power Conference in Beijing, Volcano Engine president Tan Dai introduced Seedance 1.0 Pro together with Doubao 1.6 and Doubao 1.6-vision. The Pro variant raised the maximum resolution to 1080p and emphasized multi-shot narrative generation, character consistency across shots, and cinematic camera control.[^1] The accompanying 44-author technical report, titled "Seedance 1.0: Exploring the Boundaries of Video Generation Models", was posted to arXiv as preprint 2506.09113, dated June 10, 2025, one day before the conference; it was authored by ByteDance's Seed Vision Team and led by Yu Gao and Haoyuan Guo.[^2]
The June 2025 launch came during a period of intense competition in the Chinese video-generation market. Kuaishou's Kling, MiniMax's Hailuo, Alibaba's Wan, and Tencent's Hunyuan had all shipped competitive systems earlier in 2025, while Google had released Veo 3 in May 2025 with synchronized audio generation. Press coverage emphasized that Seedance was explicitly framed as a benchmark-topping response: ByteDance announced that the model ranked first on both the text-to-video and image-to-video boards of Artificial Analysis's video arena on June 10, 2025, ahead of Veo 3, Kling 2.1, and OpenAI's Sora.[^2][^3]
Seedance development continued after the 1.0 launch. Volcano Engine announced Seedance 1.0 Pro updates including first-and-last-frame conditioning, and at the December 18, 2025 FORCE conference released Doubao 1.8 alongside Seedance 1.5 Pro, an audio-video creation model.[^7] Seedance 2.0, a unified multimodal audio-video model, was released by the Doubao team on February 9, 2026.[^8] Coverage in TechCrunch in early 2026 described a phased global rollout of the Seedance 2.0 model through the CapCut video editing platform, with initial markets including Brazil, Indonesia, Malaysia, Mexico, the Philippines, Thailand and Vietnam.[^9]
The Seedance 1.0 report frames the system as a video foundation model that "simultaneously balances prompt following, motion plausibility, and visual quality," and bundles four contributions: a curated multi-source data pipeline, an efficient architecture with a tailored training paradigm, a post-training regime that combines supervised fine-tuning with video-specific reinforcement learning from human feedback, and inference acceleration through distillation and systems-level engineering.[^2]
Seedance 1.0 adopts a two-stage pipeline: a base diffusion model produces 480p video latents, and a learned diffusion refiner upscales them to 720p or 1080p, an approach the authors describe as a "cascaded framework". The video tokenizer is a temporally causal variational autoencoder with compression ratios of (4, 16, 16) along the temporal, height, and width dimensions, projecting video into a 48-channel latent space.[^2] The generator itself is a diffusion transformer using the MMDiT (Multi-Modality Diffusion Transformer) design, with decoupled spatial and temporal layers: spatial layers attend within frames, while temporal layers attend across frames. Multi-modal rotary position embeddings (MM-RoPE, an extension of rotary positional encoding) provide positional structure across text, time and image axes, and the authors argue this interleaved encoding is what allows the model to natively generate coherent multi-shot videos rather than concatenating independent clips.[^2]
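The tokenizer arithmetic can be made concrete with a short calculation. The following is a minimal sketch, not the report's implementation: it assumes the convention common to temporally causal VAEs, in which the first frame is encoded on its own and every further group of 4 frames maps to one latent frame, and it assumes an illustrative 480p frame size of 864 x 480 that divides evenly by 16. The function name `latent_shape` and the 121-frame clip length are hypothetical.

```python
def latent_shape(frames, height, width, ratios=(4, 16, 16), channels=48):
    """Return (channels, T, H, W) of the VAE latent for a pixel-space clip."""
    rt, rh, rw = ratios
    if height % rh or width % rw:
        raise ValueError("height and width must be divisible by the spatial ratios")
    if (frames - 1) % rt:
        raise ValueError("frames must equal 1 + k * temporal ratio")
    # Assumed causal convention: the first frame stands alone,
    # then each block of `rt` frames becomes one latent frame.
    t = 1 + (frames - 1) // rt
    return (channels, t, height // rh, width // rw)


# Illustrative clip: 121 frames at 864 x 480 (assumed values, not from the report).
print(latent_shape(frames=121, height=480, width=864))  # (48, 31, 30, 54)
```

Under these assumptions the base model would denoise a 31 x 30 x 54 grid of 48-channel latents, which the refiner stage then upscales.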
The data pipeline applies shot-aware temporal segmentation (with a 12-second cap per clip), removes blurry or low-aesthetic content via quality and safety filtering, rectifies visual overlays such as logos and watermarks, and performs semantic deduplication using video embeddings. Captions integrate dynamic features (actions, camera movements) with static descriptors (appearance, aesthetics, style).[^2]
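The 12-second cap in the segmentation step can be illustrated with a small helper. This is a sketch under stated assumptions: shot boundaries are taken as already detected and given as `(start, end)` pairs in seconds, and over-long shots are split into consecutive chunks. The report may instead truncate or discard such shots, and the name `cap_shots` is hypothetical.

```python
def cap_shots(boundaries, max_len=12.0):
    """Split (start, end) shots, in seconds, so that no clip exceeds max_len."""
    clips = []
    for start, end in boundaries:
        t = start
        # Emit full-length chunks while the remainder is still too long.
        while end - t > max_len:
            clips.append((t, t + max_len))
            t += max_len
        if end > t:
            clips.append((t, end))
    return clips


shots = [(0.0, 5.0), (5.0, 30.0)]
print(cap_shots(shots))
# [(0.0, 5.0), (5.0, 17.0), (17.0, 29.0), (29.0, 30.0)]
```

A production pipeline would likely also drop very short remainders such as the final one-second clip; that policy is omitted here.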
Training proceeds in four phases:
The report attributes the headline "~10x speedup" to a stack of techniques rather than a single optimization: Trajectory Segmented Consistency Distillation (TSCD), score distillation derived from a "RayFlow" objective, a thinner VAE decoder that doubles decode speed, mixed-precision quantization, adaptive hybrid parallelism, and asynchronous offloading.[^2] On NVIDIA L20 hardware, Seedance 1.0 generates a 5-second 1080p clip in 41.4 seconds, a figure the authors highlight as roughly two to four times faster than comparable commercial systems at similar resolution.[^2]
A distinguishing claim of the Seedance 1.0 paper is that the model handles multi-shot generation natively rather than through post hoc stitching. The MM-RoPE positional scheme is the principal mechanism: by encoding shot boundaries into the position embedding alongside spatial and temporal coordinates, the model can plan view transitions while preserving subject and style identity across cuts.[^2] The architecture also unifies text-to-image, text-to-video and image-to-video tasks in a single backbone, which the authors argue improves prompt following and stylistic transfer.[^2]
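For readers unfamiliar with rotary encodings, the sketch below shows the standard one-dimensional rotary position embedding that schemes such as MM-RoPE build on. It is not the Seedance implementation: the multi-axis and shot-aware extensions are not reproduced, and the helper names `rope` and `dot` are hypothetical. The point it illustrates is that the query-key dot product depends only on the offset between positions, which is what lets a model reason about relative placement.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard 1D RoPE)."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (2) at different absolute positions gives the same score.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 12), rope(k, 10))
print(abs(near - far) < 1e-9)  # True
```

A multi-axis variant would typically assign separate slices of the embedding to time, height and width coordinates; how Seedance allocates those slices and encodes shot boundaries is described only at a high level in the text above.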
Seedance 1.0 is publicly documented in two named editions.
| Variant | Release date | Max resolution | Durations | Notes |
|---|---|---|---|---|
| Seedance 1.0 Lite | May 13, 2025[^4] | 480p, 720p[^4] | 5 s, 10 s[^4] | Lower-parameter model optimized for speed and cost; available via Volcano Ark, Doubao app, and Jimeng/Dreamina[^4] |
| Seedance 1.0 Pro | June 11, 2025[^1] | 1080p[^1] | up to 10 s[^1] | Flagship model; first-place rankings on Artificial Analysis arenas at launch[^2][^3] |
ByteDance has subsequently released Seedance 1.5 Pro (December 18, 2025), which extends Seedance 1.0 Pro with synchronized audio output, and Seedance 2.0 (February 9, 2026), a unified multimodal model.[^7][^8] Those successors are documented in their own articles and are not the subject of this entry.
Seedance is distributed through three primary surfaces, all controlled by ByteDance:
doubao-seedance-1-0-pro and a corresponding Lite endpoint.[^1][^2]
At the June 2025 launch, Volcano Engine quoted Seedance 1.0 Pro pricing at 0.015 yuan per 1,000 tokens, which the company illustrated as roughly 3.67 yuan (approximately fifty US cents) for a 5-second 1080p clip.[^1] ByteDance presented the pricing as substantially undercutting Western competitors, although direct per-second comparisons depend on resolution, length, and feature flags. The accompanying Doubao 1.6 large language model dropped input pricing to 0.8 yuan per million tokens in the 0-32K input range, framing the announcement as a coordinated price reduction across the Seed product family.[^1]
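The two quoted launch figures for Seedance 1.0 Pro imply a token count per clip, which a short calculation makes explicit. This sketch uses only the numbers stated above (0.015 yuan per 1,000 tokens and 3.67 yuan per 5-second 1080p clip). The helper name `implied_tokens` is hypothetical, and how Volcano Engine actually counts video tokens is not specified here.

```python
PRICE_PER_1K_TOKENS_YUAN = 0.015  # launch price quoted above
CLIP_PRICE_YUAN = 3.67            # quoted cost of a 5-second 1080p clip
CLIP_SECONDS = 5


def implied_tokens(clip_price, price_per_1k):
    """Tokens a clip must consume for the two quoted prices to agree."""
    return clip_price / price_per_1k * 1000


tokens = implied_tokens(CLIP_PRICE_YUAN, PRICE_PER_1K_TOKENS_YUAN)
print(round(tokens))                             # 244667 tokens per clip
print(round(tokens / CLIP_SECONDS))              # 48933 tokens per second of video
print(round(CLIP_PRICE_YUAN / CLIP_SECONDS, 3))  # 0.734 yuan per second
```

On these figures, one second of 1080p output corresponds to roughly 49,000 billed tokens, or about 0.73 yuan.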
In the same announcement Volcano Engine also released the Doubao 1.6-vision model (a tool-calling multimodal LLM) and an upgrade to its AI cloud-native services, including a Model Context Protocol service and the PromptPilot prompt tool, presenting Seedance as one node in a broader generative-AI stack rather than a standalone product.[^1]
The Seedance 1.0 paper reports evaluation on two benchmark suites: the publicly visible Artificial Analysis Video Arena, and ByteDance's internal SeedVideoBench-1.0.[^2]
On the Artificial Analysis arena snapshot of June 10, 2025, Seedance 1.0 ranked first on both the text-to-video and image-to-video leaderboards. The authors specifically note an Elo margin exceeding 100 points over Veo 3 and Kling 2.0 on the image-to-video board.[^2] Independent coverage by Time magazine confirmed the leaderboard placement, characterizing Seedance 1.0 as outperforming OpenAI Sora, Veo 3, Runway's Gen-4, Kling 2.1 and Alibaba's Wan family on the arena's public-voting metric.[^3]
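An Elo margin can be translated into an expected head-to-head preference rate. The sketch below assumes the standard logistic Elo formula with a 400-point scale; the arena's exact rating procedure may differ, and the function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected win probability for the higher-rated side under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


print(round(elo_expected_score(100), 3))  # 0.64
```

Under that assumption, a 100-point lead corresponds to being preferred in about 64% of pairwise votes, so a margin "exceeding 100 points" implies a preference rate somewhat above that.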
The internal SeedVideoBench-1.0 evaluation uses 300 prompts each for text-to-video and image-to-video, scored on four axes: motion quality (structural accuracy, plausibility, stability, vividness), prompt following (action responsiveness, fidelity, style conformity), aesthetic quality (texture, material detail, artistic expression), and preservation (subject and style consistency in the image-to-video setting).[^2] The authors situate Seedance 1.0 as offering balanced performance: they note that Kling 2.1 excels at motion quality and visual fidelity but is weaker on prompt following, that Veo 3 is stronger on prompt following and photorealism but weaker on motion, and that Sora underperforms on both motion quality and prompt adherence in their evaluation.[^2]
These claims are vendor-disclosed and use a vendor-curated prompt set. Independent technical reviewers have generally corroborated the high Artificial Analysis ranking through 2025 and into 2026, although successor releases (Veo 3.1, Sora 2, Kling 3.0, Wan 2.5) have since shifted the relative ordering.[^3]
Seedance is one of several "Doubao" models hosted on Volcano Engine. The Doubao name belongs to ByteDance's consumer chatbot, but the company applies the brand to its full enterprise model line, including the Doubao Seed 1.6 large language model, the Seedance video models, the Seedream image models, and a number of specialized variants (Seed Code, Seed-VL, Seed-Embedding).[^6] Within the Doubao chatbot, users can invoke Seedance for video generation directly in the chat interface.[^4]
Seedance also integrates with two of ByteDance's largest consumer media properties:
The Seedance technical report and Volcano Engine pages do not describe the relationship between consumer Doubao prompts and the underlying Ark API beyond noting that both surfaces invoke the same model family.[^2]
Seedance and Seedream are sister model families within ByteDance Seed's generative-media line, designed to share data curation infrastructure and a common architectural lineage rooted in diffusion transformers, but specialized for different modalities. Seedream 3.0, released April 17, 2025 on Volcano Engine, is a bilingual text-to-image model that competes with Midjourney, Imagen, Ideogram and the imaging modes of OpenAI's GPT-4o.[^6] Seedance 1.0 then extends the family to video. Both lines emphasize Chinese-English bilingual prompting, multi-resolution output, and a unified architecture for related sub-tasks (image generation plus editing for Seedream; text-to-video plus image-to-video for Seedance).[^2][^6]
| Family | First named release | Modality | Bilingual | Distribution |
|---|---|---|---|---|
| Seedream (3.0) | April 17, 2025[^6] | Text-to-image, image editing | Yes (zh, en)[^6] | Volcano Engine API, Doubao, Jimeng[^6] |
| Seedance (1.0 Lite) | May 13, 2025[^4] | Text-to-video, image-to-video | Yes (zh, en)[^2] | Volcano Engine API, Doubao, Jimeng[^4] |
| Seedance (1.0 Pro) | June 11, 2025[^1] | Text-to-video, image-to-video, multi-shot | Yes (zh, en)[^2] | Volcano Engine API, Doubao, Jimeng[^1] |
The Seedance 1.0 technical report explicitly benchmarks against contemporary closed and open systems. The comparisons below are drawn from the paper's evaluation and from contemporaneous third-party coverage.
| System | Vendor | Assessment in the Seedance evaluation or third-party coverage |
|---|---|---|
| Veo 3 | Google DeepMind | Strong prompt following and photorealism; weaker motion quality per SeedVideoBench[^2] |
| Sora | OpenAI | Underperforms on motion quality and prompt adherence in SeedVideoBench[^2] |
| Kling 2.1 (Master) | Kuaishou | Strong motion quality and visual fidelity; limited prompt following per SeedVideoBench[^2] |
| Hailuo AI | MiniMax | Competing Chinese text-to-video model; not directly tabulated in Seedance paper[^3] |
| Runway Gen-3 Alpha / Gen-4 | Runway | Listed in Artificial Analysis arena comparisons below Seedance 1.0[^3] |
| Wan (2.1) | Alibaba | Substantially outperformed across SeedVideoBench dimensions per the paper[^2] |
| Pika | Pika Labs | Not benchmarked in the Seedance paper |
| Stable Video Diffusion | Stability AI | Earlier open-weights baseline; not benchmarked in the Seedance paper |
| Open-Sora | HPC-AI Tech and contributors | Open-weights research baseline; not benchmarked in the Seedance paper |
The Time magazine coverage of ByteDance's visual-model strategy summarized Seedance's positioning as that of a Chinese system "challenging the entire Western tier" of video models, citing the Artificial Analysis arena results as evidence and noting that ByteDance also leads on text-to-image with the Seedream line.[^3]
Public assessments of Seedance 1.0 identify several caveats:
Seedance 1.0 was widely reported as a watershed for Chinese video generation. ByteDance combined an aggressively low Volcano Engine price (3.67 yuan per 5-second 1080p clip), a top-of-leaderboard claim on a third-party arena, and immediate distribution through both an enterprise API and the country's most-used consumer chatbot. Together these positioned the model as a credible Sora and Veo competitor roughly three weeks after Google's Veo 3 launch.[^1][^3] Independent commentary frequently framed the release as evidence that the Chinese AI ecosystem had closed, and in some surveys exceeded, the gap with Western leaders on video generation.[^3]
The model also served as a reference point for the rest of ByteDance Seed's generative-media stack. Subsequent releases in the same product line, particularly the audio-augmented Seedance 1.5 Pro and the multimodal Seedance 2.0, retained the architectural skeleton (a diffusion transformer with MM-RoPE positional encoding, cascaded refinement, and RLHF-style post-training) and extended it to joint audio-video generation and richer reference inputs.[^7][^8]