HunyuanWorld 1.0
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,098 words
HunyuanWorld 1.0 is an open model from Tencent that generates explorable 3D worlds from a text prompt or a single image. The Hunyuan team released it on July 26, 2025, and it belongs to the family of generative AI systems that try to build whole scenes rather than single objects or flat pictures. The pitch is easy to state and hard to deliver. You describe a place, or you hand the model one photo of a place, and you get back a 3D environment you can walk through, render, and edit with ordinary graphics tools.
The project frames itself as "the first open-source, simulation-capable, immersive 3D world generation model" that is also compatible with existing computer graphics pipelines. [1][2] That last part matters most in practice. Plenty of research systems can produce something that looks like a 3D scene, but the output is often locked inside a custom renderer or a point cloud that game engines cannot use directly. HunyuanWorld 1.0 exports meshes, so its worlds drop into the same pipelines artists already use for games, film, and virtual reality.
The model works in two modes. In text-to-world mode you write a prompt, something like a sunlit medieval courtyard or a neon street at night, and the system invents a matching scene. In image-to-world mode you supply a single photo or rendered frame, and the model extends it into a full surrounding environment. [1][3] In both cases the result is a 360 degree world. You are not stuck looking at the scene from one fixed camera. You can turn around and the world is there behind you.
The output is mesh based, which is the choice that separates this work from video-style world generators. A mesh is the standard way computer graphics describes a surface, a set of vertices and faces, and almost every engine and modeling tool speaks that language. Because the scene comes out as meshes, you can light it, apply materials, run collision against it, and import it into tools like Blender, Unity, or Unreal Engine without converting from some exotic format first. [1][3]
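To make "a set of vertices and faces" concrete, here is a minimal sketch of a triangle mesh and a writer for Wavefront OBJ, a plain-text format that Blender and most engines can import. OBJ is chosen only for illustration. The `TriangleMesh` class and `to_obj` function are hypothetical, and the release's own export format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TriangleMesh:
    """Minimal mesh: 3D vertex positions plus triangles that index into them."""
    vertices: list[tuple[float, float, float]] = field(default_factory=list)
    faces: list[tuple[int, int, int]] = field(default_factory=list)  # 0-based indices


def to_obj(mesh: TriangleMesh) -> str:
    """Serialize the mesh as Wavefront OBJ text (hypothetical helper)."""
    lines = [f"v {x:.6f} {y:.6f} {z:.6f}" for x, y, z in mesh.vertices]
    # OBJ face indices are 1-based, so shift every index by one.
    lines += [f"f {a + 1} {b + 1} {c + 1}" for a, b, c in mesh.faces]
    return "\n".join(lines) + "\n"


# A unit square on the ground plane, split into two triangles.
quad = TriangleMesh(
    vertices=[(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)],
    faces=[(0, 1, 2), (0, 2, 3)],
)
print(to_obj(quad))
```

Because the output is just positions and index triples, any tool that reads the format can light, texture, or collide against it, which is the practical advantage the paragraph above describes.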
The technical report describes the system as a framework that combines the strengths of two earlier approaches. [3] Video-based world generators give you rich variety but tend to drift out of 3D consistency, and they are expensive to render. Pure 3D methods stay geometrically consistent but are starved for training data and lean on memory-hungry representations. HunyuanWorld 1.0 tries to take the useful parts of each.
The pipeline runs in stages. First the model generates a panorama, a single wraparound image that captures the whole scene in 360 degrees. This panorama acts as what the authors call a world proxy, a 2D stand-in for the 3D world that is cheap to produce and easy to keep coherent. [3] The panorama generator is a diffusion transformer, referred to as Panorama-DiT, and the open release is built on the FLUX text-to-image diffusion model. The authors note the method can be adapted to other image generators such as Stable Diffusion. [4]
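A 360 degree panorama is typically stored as an equirectangular image, where each pixel corresponds to one viewing direction. The sketch below assumes that layout (the report's exact convention is not stated here) and maps a pixel to a unit direction. The function name and axis convention (y up, image centre looking down +z) are hypothetical choices for illustration.

```python
import math


def equirect_pixel_to_ray(u: float, v: float, width: int, height: int) -> tuple[float, float, float]:
    """Map an equirectangular pixel (u, v) to a unit view direction (x, y, z).

    Assumed convention: u spans 360 degrees of longitude left to right,
    v spans 180 degrees of latitude top to bottom, y points up, and the
    horizontal centre of the panorama looks down +z. Pixel centres sit at +0.5.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi   # -pi .. pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi  # +pi/2 (top) .. -pi/2
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


# The centre pixel of a 2048 x 1024 panorama looks straight ahead.
print(equirect_pixel_to_ray(1023.5, 511.5, 2048, 1024))
```

One consequence of this layout is that the left and right image edges describe the same directions, so a panorama generator has to keep that seam continuous for the proxy to read as one coherent world.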
Second comes the part the authors lean on hardest, a semantically layered 3D mesh representation. Instead of treating the panorama as one flat shell, the system separates it into meaningful layers. Sky goes in one layer, the distant background in another, and individual foreground objects in their own layers. [3] The decomposition is handled by a vision-language model in an agentic loop, with object detection from Grounding DINO and segmentation from ZIM. Each layer then gets a depth estimate from a geometry model such as MoGe, and the layers are turned into separate meshes by sheet warping, a grid-mesh technique drawn from earlier work called WorldSheet. The meshes are composited into one navigable scene. [4] This layered reconstruction is what gives the world real depth instead of looking like a photo pasted on the inside of a sphere.
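Sheet warping can be pictured as laying a regular grid over a layer and pushing each grid vertex out along its viewing ray by the estimated depth. The sketch below does that for one masked layer of an equirectangular panorama. It is a simplified stand-in, not the project's implementation: the equirectangular convention is assumed, the function name is hypothetical, and the real pipeline also has to handle the panorama seam, depth discontinuities at layer boundaries, and compositing, none of which is modelled here.

```python
import numpy as np


def layer_to_grid_mesh(depth: np.ndarray, mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Lift one panorama layer into a triangle mesh over its pixel grid.

    depth: (H, W) distance from the camera along each pixel's ray.
    mask:  (H, W) bool, True where the pixel belongs to this layer.
    Returns vertices (H*W, 3) and faces (F, 3) with 0-based vertex indices.
    """
    h, w = depth.shape
    v_idx, u_idx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lon = ((u_idx + 0.5) / w) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - ((v_idx + 0.5) / h) * np.pi
    rays = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)], axis=-1
    )
    # Push every grid vertex out along its unit ray by the layer's depth.
    vertices = (rays * depth[..., None]).reshape(-1, 3)

    # Corner indices for every grid cell.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    # Keep a cell only if all four corners belong to this layer.
    keep = mask[:-1, :-1] & mask[:-1, 1:] & mask[1:, :-1] & mask[1:, 1:]
    # Split each kept cell into two triangles (no seam wrap-around here).
    tris_a = np.stack([tl[keep], bl[keep], tr[keep]], axis=-1)
    tris_b = np.stack([tr[keep], bl[keep], br[keep]], axis=-1)
    faces = np.concatenate([tris_a, tris_b], axis=0)
    return vertices, faces


# A constant-depth layer covering the whole panorama becomes a sphere of radius 5.
verts, faces = layer_to_grid_mesh(np.full((4, 8), 5.0), np.ones((4, 8), dtype=bool))
print(verts.shape, faces.shape)
```

Running this once per layer (sky, background, each foreground object) with its own mask and depth yields separate meshes, which is why a layered scene has real parallax instead of behaving like a single textured sphere.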
Third, the layering pays off as interactivity. Because foreground objects are disentangled into their own meshes rather than baked into the background, you can select a tree or a crate and move it, replace it, or animate it. [3] For the foreground objects the pipeline can call an image-to-3D generator such as Tencent's own Hunyuan 3D to produce clean standalone assets. [4] The public release ships four model checkpoints, two for panorama generation from text or image and two for inpainting the scene and the sky. [5]
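Disentanglement is what makes an edit like "move the crate" a local operation. The sketch below assumes, hypothetically, that a scene is stored as a dictionary from layer name to vertex array. Moving one object then touches only that layer's vertices and leaves sky and background untouched. The layer names and helpers are illustrative, not part of the release.

```python
import numpy as np


def rotate_y(vertices: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate row-vector vertices about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    r = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return vertices @ r.T


def move_layer(scene: dict[str, np.ndarray], name: str, offset, yaw_rad: float = 0.0) -> dict[str, np.ndarray]:
    """Return a new scene with one layer turned about its centroid and translated."""
    moved = dict(scene)  # other layers are shared, not copied or modified
    verts = scene[name]
    centroid = verts.mean(axis=0)
    moved[name] = rotate_y(verts - centroid, yaw_rad) + centroid + np.asarray(offset, dtype=float)
    return moved


scene = {
    "sky": np.zeros((3, 3)),
    "background": np.ones((3, 3)),
    "crate": np.array([[0, 0, 5], [1, 0, 5], [1, 1, 5], [0, 1, 5]], dtype=float),
}
edited = move_layer(scene, "crate", (2.0, 0.0, 0.0))
print(edited["crate"])
```

In a single baked mesh the same edit would mean cutting the object out of the background and filling the hole, which is the problem the layered representation avoids.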
The headline advantages from the report line up with these stages. [3]
| Capability | What it provides |
|---|---|
| 360 degree panoramic world proxy | Immersive scenes you can look around in any direction |
| Mesh export | Direct compatibility with standard graphics pipelines |
| Semantic layering | Sky, background, and foreground reconstructed as separate meshes |
| Disentangled objects | Individual items can be moved, swapped, or animated |
| Text and image input | World generation from a prompt or from a single picture |
The team evaluated HunyuanWorld 1.0 against earlier systems on two tasks. For panorama generation it compared against methods such as Diffusion360, MVDiffusion, PanFusion, and LayerPano3D. For full 3D world generation it compared against WonderJourney, DimensionX, and LayerPano3D. [3] The metrics are the usual mix for this area. There are no-reference image quality scores such as BRISQUE, NIQE, and Q-Align, where lower is better for BRISQUE and NIQE and higher is better for Q-Align, plus CLIP-based scores, where higher is better, for how well a result matches the text prompt and the input image. [3] The paper reports state-of-the-art results across these comparisons. For text-to-world generation, for instance, it lists a BRISQUE of 34.6 and a CLIP text score of 24.0 against LayerPano3D's 35.3 and 22.0, and for image-to-world it reports a BRISQUE of 36.2 against WonderJourney's 51.8. [6]
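The two metric families point in different directions, which is easy to misread when scanning numbers. The sketch below shows one common definition of a CLIP-style score (100 times the cosine similarity between an image embedding and a text or image embedding, clipped at zero) together with a small helper that checks which of two reported scores is better. The embeddings are hypothetical stand-ins for real CLIP encoder outputs, the metric labels are illustrative, and the report's exact scoring code may differ.

```python
import numpy as np


def clip_style_score(image_emb: np.ndarray, other_emb: np.ndarray) -> float:
    """100 * cosine similarity between two embeddings, clipped at zero."""
    cos = float(image_emb @ other_emb / (np.linalg.norm(image_emb) * np.linalg.norm(other_emb)))
    return 100.0 * max(cos, 0.0)


# Direction of improvement for the metrics named above (labels are illustrative).
LOWER_IS_BETTER = {"BRISQUE": True, "NIQE": True, "Q-Align": False, "CLIP-text": False}


def better(metric: str, a: float, b: float) -> bool:
    """True if score a beats score b on the given metric."""
    return a < b if LOWER_IS_BETTER[metric] else a > b


# Reported text-to-world numbers: HunyuanWorld 1.0 vs LayerPano3D.
print(better("BRISQUE", 34.6, 35.3))   # True
print(better("CLIP-text", 24.0, 22.0)) # True
```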
These are the authors' own numbers from the technical report, so they should be read as the developers' evaluation rather than an independent benchmark. The broader point is that HunyuanWorld 1.0 holds up on image quality while also delivering a usable 3D mesh, which most of the baselines do not.
The model is aimed at any workflow that needs a 3D environment in a hurry. The paper names virtual reality, physical simulation, game development, and interactive content creation. [3] For VR the 360 degree panoramic worlds are a natural fit, since the whole point is to look around freely. For games and film the mesh output and disentangled objects let a small team rough in a level or a set and then refine it by hand. For robotics and embodied AI the meshes can serve as training environments, places where an agent can be dropped to learn navigation or manipulation, which is where the simulation-capable label comes from.
Tencent published the code and weights openly, with the repository on GitHub and the model on Hugging Face. [1][2] The release covers both generation modes and the supporting pipeline. It is not a hosted demo behind an API; it is a model you download and run yourself.
Running the full model at high resolution takes a serious GPU, and the original release leaned toward data-center hardware. To widen access, on August 15, 2025 the team published a quantized build, HunyuanWorld-1.0-lite, that runs on consumer cards such as an NVIDIA RTX 4090. [1] That makes local use practical for people without a workstation-class accelerator, at some cost in fidelity.
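The source does not say which quantization scheme the lite build uses, so the sketch below is a generic example rather than a description of HunyuanWorld-1.0-lite. It applies symmetric per-tensor int8 quantization to a random weight matrix to show the basic trade: storing 8-bit integers instead of 32-bit floats cuts weight memory by four, and every weight picks up a rounding error of up to half a quantization step, which is the "cost in fidelity".

```python
import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 codes."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)
err = float(np.abs(dequantize(q, scale) - w).max())
print(w.nbytes // q.nbytes)  # 4: int8 takes a quarter of float32's memory
```

Real low-memory builds often combine finer-grained schemes (per-channel scales, 4-bit formats) with other tricks such as offloading, so the actual savings and quality loss of the lite build may differ from this toy case.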
The license is the Tencent HunyuanWorld-1.0 Community License Agreement, identified as tencent-hunyuanworld-1.0-community. [7] It allows commercial use without a fee, with two limits worth knowing. The grant applies to a defined Territory that excludes the European Union, the United Kingdom, and South Korea, so the standard terms do not cover users in those places. [7] And if a product built on the model passes 1 million monthly active users, the user has to request a separate license from Tencent, which Tencent can grant or refuse at its discretion. [7] This is the same broad shape as the licenses on other Hunyuan releases, permissive for most people in practice while carving out the largest deployments and a few jurisdictions.
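The two limits combine into a simple decision rule. The sketch below encodes the simplified reading given above: a deployment falls outside the default terms if it is in an excluded territory or exceeds 1 million monthly active users. The region codes, the `Deployment` class, and the function name are hypothetical, and the agreement's own text, not this sketch, decides any real case.

```python
from dataclasses import dataclass

# Simplified reading of the two carve-outs described above, not the agreement's text.
EXCLUDED_TERRITORIES = {"EU", "UK", "KR"}  # European Union, United Kingdom, South Korea
MAU_THRESHOLD = 1_000_000


@dataclass
class Deployment:
    region: str                # hypothetical region label, e.g. "US" or "EU"
    monthly_active_users: int


def default_terms_cover(d: Deployment) -> bool:
    """True if neither carve-out applies; False means the default grant is not enough."""
    if d.region in EXCLUDED_TERRITORIES:
        return False
    if d.monthly_active_users > MAU_THRESHOLD:
        return False  # a separate license must be requested from Tencent
    return True


print(default_terms_cover(Deployment("US", 50_000)))  # True
```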
It is easy to confuse HunyuanWorld with Hunyuan 3D, and the two are related but do different jobs. Hunyuan 3D generates a single 3D object from an image or a prompt, a chair, a character, a prop. HunyuanWorld generates a whole scene, the room that holds the chair and everything around it. [1] In fact HunyuanWorld can call Hunyuan 3D to make the foreground objects inside its scenes, so the object model is a component of the world model rather than a competitor to it. [4] The two share the Hunyuan brand with other Tencent generators such as HunyuanVideo for video, but each targets a separate output type.
In the wider research landscape, HunyuanWorld 1.0 belongs to the push toward a generative world model, a system that produces navigable environments rather than fixed media. It contrasts with video-first world generators that synthesize frames of a moving camera but do not hand you reusable geometry, and with object-level 3D generation that makes single assets. Its distinguishing trait is the pairing of an explorable 360 degree scene with an exported mesh that ordinary tools can read.
Tencent kept building on the line after the 1.0 release, and the follow-ups branch in two directions. HunyuanWorld-Voyager, released on September 2, 2025, takes the video route. It is an interactive RGBD video generation model conditioned on camera input that generates world-consistent 3D point-cloud sequences from a single image along a user-defined camera path, producing aligned color and depth frames for direct 3D reconstruction. [8] Tencent described it as the industry's first ultra-long-range world model with native 3D reconstruction, and it ranked first on Stanford's WorldScore benchmark. [8] Where 1.0 builds a static mesh world you explore freely, Voyager focuses on long camera traversals with depth, which puts it closer to the video-based branch the original paper had set out to complement.
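"Aligned color and depth frames" are useful because each frame can be lifted straight into 3D. The sketch below back-projects one RGB-D frame into a colored point cloud using a standard pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the treatment of depth as z-depth, and the function name are assumptions for illustration, not Voyager's documented interface. Fusing a whole camera path would additionally need each frame's pose, which is not shown.

```python
import numpy as np


def backproject_rgbd(rgb: np.ndarray, depth: np.ndarray, fx: float, fy: float, cx: float, cy: float):
    """Lift an aligned RGB-D frame into a colored point cloud in camera coordinates.

    Pinhole model: pixel (u, v) with z-depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with depth <= 0 are treated as invalid and dropped.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)  # (N, 3)
    colors = rgb[valid]                    # (N, 3), same order as points
    return points, colors


depth = np.full((2, 2), 2.0)
depth[0, 0] = 0.0  # one missing depth sample
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
points, colors = backproject_rgbd(rgb, depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(points)
```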
The numbered releases continued on a separate track. HunyuanWorld 1.1, also called WorldMirror, arrived on October 22, 2025 and reconstructs 3D scenes from videos or multi-view images. [1] Later updates followed, including a real-time interactive system and a 2.0 release, which shows how quickly the area moved over the months after the first model.
The table below sorts out the family members that are easy to mix up.
| Release | Date | What it does |
|---|---|---|
| HunyuanWorld 1.0 | July 2025 | Text or image to a layered, mesh-based 3D world |
| HunyuanWorld-1.0-lite | August 2025 | Quantized build for consumer GPUs |
| HunyuanWorld-Voyager | September 2025 | Camera-controllable RGBD video with native 3D reconstruction |
| HunyuanWorld 1.1 (WorldMirror) | October 2025 | 3D reconstruction from video or multi-view images |
The honest limits sit alongside the strengths. The full model wants a large GPU, so high-resolution generation is out of reach for most individual users without the quantized lite build or rented cloud hardware. [1] The output is a reconstruction from a single panorama, so geometry far from the original viewpoint is inferred rather than observed, and surfaces that were never visible in the proxy can look thin or incomplete when you move deep into the scene. Disentangled objects help interactivity, but the separation is only as good as the segmentation, so cluttered scenes can leave objects fused or mislabeled. And the licensing carve-outs mean teams in the European Union, the United Kingdom, and South Korea, or anyone planning a product at million-user scale, cannot simply assume the default terms cover them. [7] None of this undercuts the core idea. For a freely available model, getting an editable, engine-ready 3D world out of a sentence is a real step, and the open release lets the rest of the field build on it.