Doubao Seedream
Last reviewed: May 21, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v1 · 2,981 words
Doubao-Seedream is the family of text-to-image generation foundation models developed by the ByteDance Seed team and shipped through ByteDance's Doubao product line and the company's Volcano Engine cloud platform. The first widely covered international release was Seedream 2.0, whose technical report was posted to arXiv on March 10, 2025, after the model had already been deployed in ByteDance's Doubao chatbot and the Jimeng (Dreamina) creative app in early December 2024.[^1][^2] Seedream is best known for native Chinese-English bilingual prompt understanding and for legible rendering of Chinese characters inside generated images, two areas where Western text-to-image systems such as Midjourney, Imagen 3, gpt-image-1, and FLUX have historically been weak.[^1][^3] Successive releases (Seedream 3.0 in April 2025 and Seedream 4.0 in September 2025) added native 2K and then 4K output and unified image editing with generation; both reached first place on the Artificial Analysis text-to-image leaderboard around the time of their release.[^4][^5][^6]
| Attribute | Detail |
|---|---|
| Developer | ByteDance Seed (Doubao Team) |
| Type | Text-to-image diffusion transformer |
| Languages | Native Chinese and English prompts |
| First public deployment | December 2024 (Doubao, Jimeng) |
| Seedream 2.0 technical report | arXiv:2503.07703, March 10, 2025 |
| Seedream 3.0 technical report | arXiv:2504.11346, April 15, 2025 |
| Seedream 4.0 technical report | arXiv:2509.20427, September 24, 2025 |
| Distribution channels | Doubao, Jimeng / Dreamina, CapCut, Volcano Engine API |
| Reported native resolutions | 1K and 2K (3.0); up to 4K (4.0) |
Doubao-Seedream sits alongside the Doubao Seed language and multimodal models and the Doubao-Seedance video generation family inside ByteDance's broader Seed foundation-model program, which was established in 2023 as the company's fundamental AI research division.[^7]
ByteDance reorganized its AI research in 2023 and established the Seed team as a dedicated unit for fundamental large-model work, with a research scope spanning language, speech, vision, world models, and AI infrastructure.[^7] In February 2025, former Google DeepMind vice president Wu Yonghui joined ByteDance as head of foundational research for Seed, taking a role described as similar to a chief scientist.[^8] The team's image and video stack is led by Jianchao Yang, head of the Multimodal Foundation Model group, who has been publicly identified as a driving force behind Seedream and Seedance.[^9]
The first Seedream variant that gained Western press coverage was Seedream 2.0, although ByteDance had been iterating internally and shipping earlier versions through its Chinese consumer products. The Seedream 2.0 technical report explicitly states that "as of early December 2024, Seedream 2.0 has been incorporated into various platforms exemplified by Doubao (豆包)" and the Jimeng / Dreamina creative tool, serving a large Chinese user base before the international announcement.[^1]
The Seedream 2.0 technical report was uploaded to arXiv on March 10, 2025 as paper 2503.07703, titled "Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model."[^1] Volcano Engine and the Doubao team published a corresponding technical disclosure on March 12, 2025, marking the model's formal international unveiling.[^2] The paper lists 28 named contributors, with Lixue Gong as lead author and Jianchao Yang and Weilin Huang as senior contributors.[^1]
Seedream 2.0's headline pitch was that prior Western and Chinese open systems, including FLUX, Stable Diffusion 3 / 3.5, and Midjourney, "still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances."[^1] The model was designed from the start to ingest both Chinese and English prompts at a native level, rather than relying on English-only encoders with a translation layer in front of them.
Seedream 3.0 was released on the Doubao chat platform and the Jimeng tool in early April 2025, with a technical report posted to arXiv on April 15, 2025 (paper 2504.11346) and a public blog post on the ByteDance Seed site.[^4][^5] The release framed itself as addressing concrete weaknesses of Seedream 2.0: "alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions."[^4]
Headline upgrades in 3.0 included native 2K (2048 by 2048) output without a separate refiner pass and approximately three-second generation times for 1K images.[^5] ByteDance reported internal text-availability rates of roughly 94 percent for both Chinese and English characters, up sharply from the 78 percent Chinese rate cited in the 2.0 paper.[^1][^5]
Shortly after the 3.0 launch, Artificial Analysis listed Seedream 3.0 at the top of its blind-vote text-to-image arena with an Elo of approximately 1158, narrowly ahead of GPT-4o's image mode at 1157 and well ahead of Midjourney v6.1 at around 1047.[^5] This was the first time a Chinese closed model held the number-one spot on the Artificial Analysis leaderboard.[^5][^6]
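For a sense of scale, the reported rating gaps can be converted into expected head-to-head preference rates. The worked example below assumes the conventional logistic Elo formula with a 400-point scale; the exact rating method used by Artificial Analysis is not described here, so the figures are illustrative only.

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
$$

$$
E_{1158\ \text{vs}\ 1157} = \frac{1}{1 + 10^{-1/400}} \approx 0.501,
\qquad
E_{1158\ \text{vs}\ 1047} = \frac{1}{1 + 10^{-111/400}} \approx 0.65
$$

Under that assumption, a one-point lead is effectively a tie, while a 111-point lead corresponds to being preferred in about 65 percent of blind comparisons.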
In September 2025 ByteDance announced Seedream 4.0 with a technical report on arXiv (paper 2509.20427, submitted September 24, 2025) and integrations into the Doubao app and Jimeng platform.[^6] Seedream 4.0 unified text-to-image generation, image editing, and multi-image composition inside a single diffusion transformer architecture paired with a new variational autoencoder. It supports native generation up to 4K resolution and was pretrained on "billions of text-image pairs."[^6] On the Artificial Analysis arena, Seedream 4.0 ranked first across both the text-to-image and image-editing leaderboards as of September 18, 2025.[^6]
Seedream 4.0 also introduced an acceleration framework combining adversarial distillation, distribution matching, hardware-aware quantization, and speculative decoding to bring generation latency low enough for production workflows.[^6]
ByteDance later shipped Seedream 4.5 in December 2025 as a refinement focused on character consistency across multiple reference images and professional-grade typography.[^10] A 5.0 generation followed in early 2026, with Seedream 5.0 Lite released alongside Seedance 2.0 on the Jimeng / Dreamina platform and the broader Seed 2.0 launch on Volcano Engine on February 14, 2026.[^11][^12] The 5.0 generation added real-time web search and multi-turn image-and-text editing to the model line.[^11]
Doubao-Seedream is implemented as a diffusion transformer (DiT) that operates in the latent space of a variational autoencoder, with conditioning provided by a self-developed bilingual large language model that acts as the text encoder.[^1] Rather than reusing a CLIP or T5 encoder, ByteDance fine-tunes its own LLM on image-text pairs so that representations of Chinese cultural concepts and idiomatic English are kept in a shared embedding space.[^1]
The Seedream 2.0 paper describes a two-encoder design in which features from the bilingual LLM are concatenated with features from a Glyph-Aligned ByT5 model that operates at the byte / character level.[^1] ByT5 is used specifically to provide accurate, character-level supervision for in-image text, which is necessary for handling the large number of distinct Chinese glyphs and for keeping small English captions legible at 1K and 2K output sizes.[^1] This is one of the architectural choices ByteDance highlights as the source of Seedream's relative strength at text rendering compared with Midjourney, Stable Diffusion 3, and FLUX base models.[^1]
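The two-encoder conditioning described above can be pictured with a toy sketch. This is not ByteDance's implementation: the function names, the linear projections, the feature widths, and the choice to concatenate along the token axis are all assumptions made for illustration, since the text only states that the two feature streams are concatenated.

```python
def project(tokens, weight):
    """Apply a linear map to each token vector; weight is indexed [in][out]."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns]
            for tok in tokens]


def build_condition(llm_tokens, glyph_tokens, w_llm, w_glyph):
    """Project both streams to one width, then join along the token axis.

    Hypothetical: the projection step and the concatenation axis are
    assumptions, not details taken from the technical report.
    """
    return project(llm_tokens, w_llm) + project(glyph_tokens, w_glyph)


# Toy inputs: 2 "LLM" tokens of width 3 and 3 "glyph" tokens of width 2.
llm_tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
glyph_tokens = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
w_llm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps width 3 -> 2
w_glyph = [[1.0, 0.0], [0.0, 1.0]]            # identity, width 2 -> 2

condition = build_condition(llm_tokens, glyph_tokens, w_llm, w_glyph)
print(len(condition), "conditioning tokens of width", len(condition[0]))
```

In this sketch the diffusion transformer would attend over one sequence that holds both semantic tokens and character-level tokens, which is one plausible way the glyph features could inform in-image text.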
Both Seedream 2.0 and Seedream 3.0 use a "Scaled" rotary position embedding (RoPE) scheme designed so that patches near the image center share similar position identifiers across different resolutions, allowing the network to generalize to aspect ratios and pixel counts it has not seen during training.[^1] Seedream 3.0 generalizes this further with cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling that conditions the noise schedule on the target output size.[^4]
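The center-alignment idea behind Scaled RoPE can be sketched in a few lines. The scaling rule below (`base_patches / n_patches`) and the function names are hypothetical, chosen only to show how centered, rescaled position identifiers keep central patches at similar values across resolutions; the technical reports should be consulted for the actual formulation.

```python
def scaled_position_ids(n_patches, base_patches=64):
    """Center-aligned, rescaled 1D position ids (assumed formula)."""
    scale = base_patches / n_patches      # hypothetical scaling rule
    center = (n_patches - 1) / 2.0        # midpoint of the patch grid
    return [(i - center) * scale for i in range(n_patches)]


def rope_angles(position, dim=8, theta=10000.0):
    """Standard RoPE angles: position * theta^(-i/dim) for even i < dim."""
    return [position * theta ** (-i / dim) for i in range(0, dim, 2)]


ids_small = scaled_position_ids(64)    # e.g. a 64-patch-wide grid
ids_large = scaled_position_ids(128)   # a grid twice as wide

# Patches next to the center get ids close to zero at both sizes,
# and the outermost ids span almost the same range.
print(ids_small[32], ids_large[64])
print(ids_small[0], ids_large[0])
```

Because the rotary angles depend only on these identifiers, a larger grid reuses roughly the same angle range the network saw at the base size, which is the intuition behind generalizing to unseen resolutions.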
The Seedream 2.0 training recipe runs in five stages:[^1]
Seedream 3.0 doubles the effective dataset by combining a defect-aware training paradigm with a dual-axis collaborative sampling framework, and replaces the human-only RLHF reward model with a vision-language-model-based reward that can scale to larger output sizes.[^4] Seedream 4.0 pushes this further by jointly training text-to-image generation and image editing on billions of pairs and by reducing the number of latent tokens per image through a more aggressive VAE compression scheme.[^6]
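To see why VAE compression matters at 4K, it helps to count latent tokens. The sketch below uses hypothetical compression factors (8 and 16) and a hypothetical patch size of 2; the actual values used by the Seedream models are not stated in this article.

```python
def latent_tokens(height, width, vae_factor, patch_size):
    """Number of transformer tokens for an image of the given pixel size.

    vae_factor is the spatial downsampling of the autoencoder and
    patch_size is the latent patch edge; both are illustrative values.
    """
    rows = height // (vae_factor * patch_size)
    cols = width // (vae_factor * patch_size)
    return rows * cols


for side in (1024, 2048, 4096):
    modest = latent_tokens(side, side, vae_factor=8, patch_size=2)
    aggressive = latent_tokens(side, side, vae_factor=16, patch_size=2)
    print(f"{side}x{side}: {modest} tokens at 8x, {aggressive} tokens at 16x")
```

Token count grows with the square of the image side, and doubling the spatial compression factor cuts it by four, so under these assumed settings a 4K image at 16x costs as many tokens as a 2K image at 8x.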
| Model | Resolution | Reported text availability | Artificial Analysis rank at release |
|---|---|---|---|
| Seedream 2.0 | up to 1K (refiner) | 78% Chinese, higher English | Not yet on leaderboard at paper time[^1] |
| Seedream 3.0 | Native 2K | 94% Chinese and English | #1 with Elo ~1158 (April 2025)[^5] |
| Seedream 4.0 | Up to 4K | Not separately reported | #1 in T2I and editing arenas (September 2025)[^6] |
ByteDance evaluations also report that Seedream 2.0 collected roughly 500,000 pairwise human comparisons and obtained the highest total Elo score in those evaluations for both Chinese and English prompts, though the paper does not publish the absolute Elo values for competitors.[^1]
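How pairwise votes become a ranking can be illustrated with a generic online Elo update. This is a minimal sketch, not the protocol used in the paper: the K-factor of 32, the starting rating of 1000, and the model names are all assumptions.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes, k=32.0, start=1000.0):
    """Fold a sequence of (winner, loser) votes into per-model ratings."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * surprise   # winner gains what the loser drops
        ratings[loser] -= k * surprise
    return dict(ratings)


# Hypothetical votes between three placeholder models.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = elo_from_votes(votes)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(name, round(ratings[name], 1))
```

Each update is zero-sum, so only rating differences carry meaning. That is why a "highest total Elo" claim can be read as a ranking even when competitors' absolute values are not published.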
Seedream models are exposed to end users and developers through several distinct surfaces:
The international consumer chatbot for the Doubao family is branded Dola (formerly Cici); users searching for an "English Doubao" are typically routed to Dola, which exposes Seedream image generation through that interface.[^13]
Doubao-Seedance is the video-generation sibling of Seedream and ships out of the same Seed Multimodal Foundation Model group at ByteDance.[^9] Seedance 1.0 was the first version to gain widespread coverage in mid-2025 with text-to-video and image-to-video support inside Doubao and Jimeng, and Seedance 2.0 was launched on February 10, 2026 as a limited beta on Jimeng before being included in the broader Seed 2.0 / Volcano Engine release on February 14, 2026.[^12][^14] Seedance 2.0 uses a unified multimodal architecture that ingests text, image, audio, and video inputs and produces joint audio-video output.[^15]
In ByteDance's product hierarchy Seedream and Seedance are paired together: the same Doubao or Jimeng workflow can produce a still image with Seedream, refine it with SeedEdit (an image editor released alongside Seedream 3.0), and then animate it into video using Seedance, all under a single account on Volcano Engine.[^4][^15] CapCut's Dreamina-branded video features are powered by Seedance, and its image generation features are powered by Seedream.[^13]
| Aspect | Doubao-Seedream | Doubao-Seedance |
|---|---|---|
| Modality output | Still images | Video (with audio in 2.0) |
| Latest major version | Seedream 5.0 / 5.0 Lite (Feb 2026) | Seedance 2.0 (Feb 2026) |
| Native max resolution | Up to 4K (4.0) | Variable; clip length and frame rate vary by tier |
| First broad arXiv report | 2503.07703 (Seedream 2.0) | Reported in Seedance technical posts on the ByteDance Seed site |
| Primary integration apps | Doubao, Jimeng / Dreamina, CapCut, TikTok | Doubao, Jimeng / Dreamina, CapCut |
Seedream's design choices map directly to several concrete use cases:
Seedream models inherit several limitations common to diffusion-based text-to-image systems and have a few that are specific to ByteDance's design choices:
| Model | Developer | Native max resolution | Bilingual Chinese/English | Strengths cited by independent reviewers |
|---|---|---|---|---|
| Midjourney v7 | Midjourney | Internal upscaling | English-primary | Aesthetic quality, stylization |
| Imagen 3 | Google DeepMind | Up to 2K | English-primary | Photorealism, instruction following |
| gpt-image-1 | OpenAI | Up to ~4K | Multilingual | Text rendering, integration with ChatGPT |
| FLUX.1 / FLUX.2 | Black Forest Labs | High resolution | English-primary | Open-weight options, sharp detail |
| Ideogram 3.0 | Ideogram | High resolution | English-primary | In-image text rendering |
| Stable Diffusion 3.5 | Stability AI | Variable | English-primary | Open weights, customization |
| Hunyuan (image) | Tencent | High resolution | Chinese / English | Chinese cultural concepts |
| Seedream 3.0 / 4.0 | ByteDance | 2K (3.0) / 4K (4.0) | Native Chinese / English | Bilingual prompts, Chinese typography, Artificial Analysis top ranks at launch |
Seedream's most distinctive position in this landscape is the combination of native bilingual handling, in-image Chinese typography, and deep integration into ByteDance's consumer surfaces (Doubao, Jimeng, TikTok, CapCut), rather than any single benchmark number.[^1][^5][^13]