ByteDance Seed3D 2.0
Last reviewed: Jun 3, 2026
Sources: 6 citations
Review status: Source-backed
Revision: v1 · 1,390 words
Seed3D 2.0 is a 3D-asset generation model released by ByteDance's Seed research team on April 23, 2026. It produces a complete, textured 3D model from a single input image, and ByteDance positions it as a step toward "simulation-ready" 3D content, meaning assets that can drop directly into physics engines and robotics simulators rather than just sitting in a viewer. The accompanying technical report, "Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation," was posted to arXiv on April 22, 2026, and the model went live the next day through the Volcano Engine cloud platform under the API name Doubao-Seed3D-2.0.[1][2][3]
The release sits at the intersection of two areas where ByteDance has been investing heavily: a generative-media stack that already spans images and video, and embodied AI, where the shortage of cheap, physically plausible 3D environments is a bottleneck for training robots. Seed3D 2.0 builds on Seed3D 1.0, the team's earlier single-image-to-3D system, and the headline pitch is sharper geometry, more physically accurate materials, and outputs that are ready for simulation out of the box.[1][2]
Seed3D 2.0 takes one reference image and generates a 3D mesh with physically based rendering (PBR) materials. PBR is the standard used by game engines and renderers, where surfaces carry separate maps describing color, metalness, and roughness so they respond correctly to different lighting. The model outputs the geometry plus a unified set of PBR texture maps, which is what lets a generated chair or vase look convincing under a moving light instead of flat and painted-on.[2][3]
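To make the separate-maps idea concrete, here is a minimal Python sketch of a single texel in the metallic-roughness workflow. It assumes the glTF 2.0 packing convention (roughness in the green channel, metalness in the blue channel) and the conventional 4% reflectance for non-metals. The source does not say which texture layout Seed3D 2.0 emits, and the names `PBRTexel`, `unpack_metallic_roughness`, and `specular_f0` are illustrative only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PBRTexel:
    albedo: Tuple[float, float, float]  # base color, linear RGB in [0, 1]
    metallic: float                     # 0 = dielectric, 1 = metal
    roughness: float                    # 0 = mirror-like, 1 = fully rough


def unpack_metallic_roughness(rgb8: Tuple[int, int, int]) -> Tuple[float, float]:
    """Return (metallic, roughness) from an 8-bit texel.

    Assumes glTF 2.0 packing: G holds roughness, B holds metalness.
    """
    _, g, b = rgb8
    return b / 255.0, g / 255.0


def specular_f0(texel: PBRTexel, dielectric_f0: float = 0.04) -> Tuple[float, ...]:
    """Reflectance at normal incidence: blend from 0.04 toward albedo as metallic rises."""
    m = texel.metallic
    return tuple((1.0 - m) * dielectric_f0 + m * c for c in texel.albedo)


if __name__ == "__main__":
    metallic, roughness = unpack_metallic_roughness((0, 64, 255))
    brushed_metal = PBRTexel(albedo=(0.9, 0.9, 0.9), metallic=metallic, roughness=roughness)
    print(specular_f0(brushed_metal))
```

Because the renderer derives reflectance from these values at shading time, the same asset can be relit without repainting its textures, which is the practical difference from a flat, baked-in look.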
Beyond single objects, the system extends to scene-level generation and part-aware decomposition. It can plan a layout and compose multiple objects into a scene, split an asset into its constituent parts, and, notably, generate articulation without additional training, so a cabinet can come out with doors that are modeled as separate movable pieces. The team highlights export paths into formats used by robotics and simulation tooling, including URDF (the Unified Robot Description Format) and workflows compatible with NVIDIA Isaac Sim. That focus on articulated, part-level assets is the clearest signal that the model is aimed at embodied-AI training data, not only at artists and game studios.[1][2][3]
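As an illustration of what an articulated export target looks like, the following Python sketch builds a minimal URDF description of a cabinet with one hinged door, using only the standard library. It is a hand-written example of the URDF format, not output from Seed3D 2.0; the link names, hinge position, and joint limits are made-up values, and a real asset would also carry visual, collision, and inertial elements.

```python
import xml.etree.ElementTree as ET


def cabinet_urdf(name: str = "cabinet") -> str:
    """Build a two-link URDF: a fixed body and a door on a revolute hinge."""
    robot = ET.Element("robot", name=name)

    # Each rigid part of the asset becomes its own link.
    ET.SubElement(robot, "link", name="body")
    ET.SubElement(robot, "link", name="door")

    # The joint records how the door moves relative to the body.
    joint = ET.SubElement(robot, "joint", name="door_hinge", type="revolute")
    ET.SubElement(joint, "parent", link="body")
    ET.SubElement(joint, "child", link="door")
    ET.SubElement(joint, "origin", xyz="0.3 0.2 0", rpy="0 0 0")  # hypothetical hinge location
    ET.SubElement(joint, "axis", xyz="0 0 1")                     # swing about the vertical axis
    # Revolute joints need limits; these numbers are placeholders.
    ET.SubElement(joint, "limit", lower="0", upper="1.57", effort="10", velocity="1")

    return ET.tostring(robot, encoding="unicode")


if __name__ == "__main__":
    print(cabinet_urdf())
```

The point of the part-level split is visible here: a simulator can only swing the door if the door exists as a separate link with an explicit joint, which a single fused mesh cannot provide.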
Seed3D 2.0 is the 3D entry in a broader family of generative models from ByteDance Seed, most of which are commercialized through Volcano Engine and the consumer-facing Doubao brand. The image side is covered by Seedream, which reached Seedream 5.0 in 2026 and added multimodal understanding, reasoning, and editing in a single model. Video is handled by Seedance, with Seedance 2.0 launching in February 2026 as a unified audio-video generator. Seed3D fills out the third axis of that lineup by handling spatial, three-dimensional content.[4][5]
| Seed model | Modality | Notable 2026 version |
|---|---|---|
| Seedream | Image generation | Seedream 5.0 |
| Seedance | Video generation | Seedance 2.0 |
| Seed3D | 3D-asset generation | Seed3D 2.0 |
Grouping these together matters for two reasons: ByteDance appears to be building toward content pipelines that move between modalities, and the company controls the distribution, since TikTok, CapCut, Jianying, and the Doubao app all sit downstream of the same Seed models.
The technical report describes two main architectural changes over the previous version, one for shape and one for materials.[1][2]
Geometry uses a coarse-to-fine, two-stage pipeline built on a diffusion transformer the team calls Seed3D-DiT. The first stage generates the overall structure of the object, and the second stage recovers high-frequency detail such as sharp edges and thin-walled surfaces, using locality-aware priors and voxelized positional encoding. Decoupling global shape from fine detail is meant to fix a recurring failure mode in single-image 3D generation, where edges get rounded off and thin features collapse. The pipeline also relies on a locality-aware variational autoencoder (VAE) that achieves a higher spatial compression ratio and more efficient decoding, which is what makes generating crisp detail at scale tractable.[1][2]
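The summary above does not spell out how the voxelized positional encoding is computed, so the following Python sketch shows only one generic way such an encoding could work: snap a 3D point to a voxel grid, then apply sinusoidal features to the voxel center. The grid resolution, frequency count, and function names are assumptions for illustration and should not be read as the Seed3D-DiT formulation.

```python
import math
from typing import List, Sequence, Tuple


def voxel_index(point: Sequence[float], resolution: int) -> Tuple[int, ...]:
    """Map a point in [-1, 1]^3 to integer voxel coordinates in [0, resolution)."""
    return tuple(
        min(resolution - 1, int((c + 1.0) / 2.0 * resolution)) for c in point
    )


def sinusoidal(value: float, num_freqs: int) -> List[float]:
    """Sine and cosine features at doubling frequencies."""
    features = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def voxel_positional_encoding(
    point: Sequence[float], resolution: int = 64, num_freqs: int = 4
) -> List[float]:
    """Encode the center of the voxel containing `point`, one axis at a time."""
    encoding = []
    for i in voxel_index(point, resolution):
        center = (i + 0.5) / resolution * 2.0 - 1.0  # voxel center back in [-1, 1]
        encoding.extend(sinusoidal(math.pi * center, num_freqs))
    return encoding


if __name__ == "__main__":
    print(len(voxel_positional_encoding((0.1, -0.4, 0.9))))
```

In this sketch, every point inside the same voxel receives the same encoding, so the grid resolution sets the finest spatial scale the encoding can distinguish.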
Materials moved to a single unified PBR model rather than a cascade of separate networks. It jointly produces the multi-view albedo and metallic-roughness maps conditioned on the reference image and the generated geometry. Two ideas reinforce it: a Mixture-of-Experts (MoE) design, which expands the parameter count and working resolution while keeping inference cost in check through sparse expert routing, and conditioning from a vision-language model (VLM). The VLM reads the input image, describes the likely material types and physical properties in words, and injects those descriptions into the generator as control signals, so a surface that looks like brushed metal is treated as metal rather than guessed from pixels alone.[1][2]
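The claim that sparse routing keeps inference cost in check can be illustrated with a generic top-k gating sketch in Python. This is the textbook Mixture-of-Experts pattern, not the report's architecture: the gate scores, the choice of `k=2`, and the toy experts are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]


def moe_forward(
    x: float,
    gate_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Evaluate only the routed experts and blend their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


if __name__ == "__main__":
    toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    print(moe_forward(1.0, [0.0, 5.0, 1.0, 5.0], toy_experts))
```

With four experts and `k=2`, only half of the expert functions run for a given input, even though all four contribute to the total parameter count. That is the trade the paragraph above describes.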
ByteDance evaluated Seed3D 2.0 through a blind human study: 60 raters with 3D-modeling experience ran paired comparisons over more than 200 image prompts, judging the model against five recent commercial systems plus its own predecessor. The named baselines were Hunyuan3D-2.5, Hunyuan3D-3.1, Tripo 3.0, Rodin Gen2 v1.9, and HiTem v2.0, alongside Seed3D 1.0.[1][2]
On shape-only generation, the reported win rates were 98.3% over Seed3D 1.0, 92.8% over Tripo 3.0, 89.6% over Rodin Gen2 v1.9, 79.2% over HiTem v2.0, and 55.2% over Hunyuan3D-3.1. The margin over the strongest competitor is much thinner than over the weaker ones, which is worth keeping in mind: Seed3D 2.0 wins, but against Hunyuan3D-3.1 it is close to a coin flip on geometry alone. For textured assets, the report cites consistent win rates ranging from 69.0% to 89.9% against the same group, including 69.0% over Hunyuan3D-3.1 and 89.9% over Rodin Gen2 v1.9.[1][2]
| Comparison (Seed3D 2.0 vs.) | Shape-only win rate | Textured-asset win rate |
|---|---|---|
| Seed3D 1.0 | 98.3% | reported highest |
| Tripo 3.0 | 92.8% | within 69.0%–89.9% |
| Rodin Gen2 v1.9 | 89.6% | 89.9% |
| HiTem v2.0 | 79.2% | within 69.0%–89.9% |
| Hunyuan3D-3.1 | 55.2% | 69.0% |
These are preference scores from a vendor-run study, so they describe which output humans liked better in side-by-side viewing, not an objective geometric error metric. The takeaway ByteDance draws is that Seed3D 2.0 leads on geometry and texture at the same time, whereas no single competitor topped both axes in the same evaluation.[1][2]
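How much weight a 55.2% win rate can bear depends on how many comparisons stand behind it, which this article does not state per pairing. The Python sketch below computes a normal-approximation 95% interval for a win rate under hypothetical comparison counts. The values of `n` are assumptions, and the formula treats comparisons as independent, which paired judgments from a shared rater pool are not, so the intervals are illustrative rather than a reanalysis of the study.

```python
import math
from typing import Tuple


def win_rate_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a win rate p over n comparisons."""
    se = math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)


if __name__ == "__main__":
    # n values are hypothetical; the per-pairing comparison count is not given here.
    for n in (200, 1000):
        low, high = win_rate_interval(0.552, n)
        verdict = "includes" if low <= 0.5 <= high else "excludes"
        print(f"n={n}: [{low:.3f}, {high:.3f}] {verdict} 0.5")
```

Under these assumptions, the interval around 55.2% still contains 50% at a few hundred comparisons and clears it only with substantially more, which is why the Hunyuan3D-3.1 geometry result reads as a narrow lead rather than a decisive one.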
The broader 2026 3D-generation field is crowded. Tencent's Hunyuan3D line, Tripo, Rodin, Meshy, and several open systems all target single-image and text-to-3D generation, mostly for games, e-commerce, and design. What separates the Seed3D 2.0 pitch is less the raw mesh quality and more the simulation framing: articulated, part-level, physics-ready assets. That overlaps with the goals of interactive world models such as Google DeepMind's Genie 3, though the approaches differ. World models tend to generate explorable environments frame by frame, while Seed3D 2.0 produces explicit, reusable 3D assets that an engine can simulate. The two lines of work are converging on the same need, which is large volumes of cheap, controllable 3D worlds to train and test agents in.[1][6]
Seed3D 2.0 is offered through Volcano Engine, ByteDance's cloud arm, as the Doubao-Seed3D-2.0 API. ByteDance's documentation routes developers through the Volcano Ark experience center, where they select the vision-model category and then the 3D-generation option to reach the model. The technical report is public on arXiv, but the model itself is available only as a hosted API and the weights have not been released openly, which fits ByteDance's general pattern of monetizing Seed models as cloud services while publishing the research.[1][3]
The company frames the launch around production use in industrial manufacturing, simulation training, game development, and embodied AI, the recurring theme being that generating one usable asset at a time is no longer the constraint. The harder problems, and the ones the report spends most of its effort on, are physical consistency and getting assets into a state where a simulator can actually act on them.[1][2]