Hunyuan Image 3.0
Last reviewed
Jun 2, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,762 words
Hunyuan Image 3.0 (styled HunyuanImage 3.0) is a text-to-image generation model released by Tencent as part of its Hunyuan family of foundation models. Open-sourced on 28 September 2025, it is a mixture-of-experts (MoE) model with roughly 80 billion total parameters, of which about 13 billion are activated per token during inference. Tencent describes it as the largest and most powerful open-source image generation model released to date.[1][2][3] Unlike most contemporary image generators built on a standalone diffusion transformer, HunyuanImage 3.0 is a native multimodal model that performs image generation inside an autoregressive large language model backbone.[3]
HunyuanImage 3.0 generates images from natural-language prompts and is distinguished by three properties Tencent emphasizes: very large scale for an open model, a unified autoregressive architecture rather than a separate diffusion network, and strong rendering of legible text inside images.[1][3] The model and its inference code were published openly on GitHub and Hugging Face, and a technical report was posted to arXiv on the same day as the release.[2][3][4]
The model occupies a different niche from Tencent's other Hunyuan visual models. It is the dedicated image generator, separate from HunyuanVideo (video generation) and Hunyuan 3D (3D asset generation). It supersedes the earlier HunyuanImage 2.1 as Tencent's flagship for still images.[3]
Hunyuan is Tencent's brand for its in-house foundation models, spanning large language models, image, video, and 3D generation. HunyuanImage 3.0 is built directly on top of one of those language models: its base is Hunyuan-A13B, a pre-trained MoE LLM with more than 80 billion total parameters and 13 billion activated per token.[3] Reusing a language-model backbone, rather than training an image model from scratch, is central to the design. It lets the system inherit the LLM's world knowledge and reasoning, which Tencent argues improves prompt understanding and the model's ability to reason about what a scene should contain.[1][3]
Within the broader 2025 landscape, the technical report positions HunyuanImage 3.0 against leading image-generation systems, including ByteDance's Seedream 4.0, Google's Nano Banana (Gemini 2.5 Flash Image), OpenAI's GPT Image 1, and Alibaba's Qwen-Image, as well as Tencent's own HunyuanImage 2.1.[3]
Tencent open-sourced the base model, HunyuanImage-3.0, on 28 September 2025, releasing inference code and model weights together with the technical report.[2][3] Coverage in the technology press described it as the industry's largest open-source image-generation model at the time of release.[5][6]
On 26 January 2026, Tencent released two follow-up checkpoints. HunyuanImage-3.0-Instruct adds chain-of-thought reasoning for automatic prompt enhancement and supports image-to-image editing and the fusion of up to three input images. HunyuanImage-3.0-Instruct-Distil is a distilled variant tuned for faster generation, with around eight inference steps recommended.[2][7]
| Variant | Release date | Notable additions |
|---|---|---|
| HunyuanImage-3.0 | 28 September 2025 | Base text-to-image model, open weights[2][3] |
| HunyuanImage-3.0-Instruct | 26 January 2026 | Reasoning, prompt self-rewriting, image-to-image and multi-image fusion[2][7] |
| HunyuanImage-3.0-Instruct-Distil | 26 January 2026 | Distilled for fewer inference steps (~8)[2] |
HunyuanImage 3.0 is a native multimodal model that unifies image understanding and generation within a single autoregressive framework, departing from the standalone diffusion model designs that dominate text-to-image generation.[3] Tencent and the ComfyUI documentation describe the design as an "MoE + Transfusion" architecture.[3][6]
The starting point is the Hunyuan-A13B language model. To let the LLM handle images, Tencent augments it with a pre-trained vision encoder and a variational autoencoder (VAE), each fitted with a projection layer that maps image features into the same embedding space as the model's word embeddings. For image understanding, the LLM conditions its next-token prediction on those joint image features. For image generation, diffusion-based modeling over the VAE's image features is incorporated directly into the LLM, following the approach of Transfusion and JanusFlow rather than running a separate diffusion network.[3] Because the backbone is an LLM, Tencent applies chain-of-thought training and inference to both understanding and generation, which underpins the model's prompt self-rewriting and reasoning behavior.[3]
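The projection step described above can be illustrated with a minimal sketch. It is not Tencent's implementation: the dimensions, the random weights, the helper names (`make_linear`, `project`, `build_sequence`), and the order in which features are concatenated are all hypothetical, chosen only to show how features of different widths can be mapped into one embedding space and placed in a single sequence.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, feature):
    """Apply a bias-free linear projection to one feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

# Hypothetical sizes, chosen only to keep the example small.
MODEL_DIM = 8    # width of the LLM's token embeddings
VISION_DIM = 6   # width of vision-encoder features
VAE_DIM = 4      # width of VAE latent features

vision_proj = make_linear(VISION_DIM, MODEL_DIM, seed=0)
vae_proj = make_linear(VAE_DIM, MODEL_DIM, seed=1)

def build_sequence(text_embeddings, vision_features, vae_latents):
    """Concatenate text embeddings with projected image features.

    The ordering used here is an assumption made for illustration.
    """
    sequence = list(text_embeddings)
    sequence += [project(vision_proj, f) for f in vision_features]
    sequence += [project(vae_proj, z) for z in vae_latents]
    return sequence

text = [[0.0] * MODEL_DIM for _ in range(3)]
vision = [[1.0] * VISION_DIM for _ in range(2)]
latents = [[1.0] * VAE_DIM for _ in range(5)]
seq = build_sequence(text, vision, latents)
```

After projection every element of `seq` has the same width, which is what allows a single transformer backbone to attend over text and image positions together.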
The MoE design uses 64 experts, with only a subset active per token, so that the model retains the capacity implied by its 80 billion total parameters while activating roughly 13 billion parameters at inference.[1][2] Tencent's analysis of expert activation found that experts become increasingly specialized by modality, with distinct experts tending to handle image versus text tokens, which the authors suggest is one reason the MoE approach helps multimodal modeling.[3]
| Property | Value | Source |
|---|---|---|
| Total parameters | ~80 billion | [3] |
| Activated parameters per token | ~13 billion | [3] |
| Number of experts | 64 | [1][2] |
| Base model | Hunyuan-A13B (MoE LLM) | [3] |
| Generation framework | Autoregressive LLM with diffusion image modeling on VAE features ("MoE + Transfusion") | [3][6] |
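
The relationship between the expert count and the activated-parameter figure can be made concrete with a short sketch of top-k expert routing, a common way to implement sparse MoE layers. The value of `TOP_K` and the toy router logits are hypothetical; the source states only that a subset of the 64 experts is active per token and does not describe the router.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

NUM_EXPERTS = 64   # from the article
TOP_K = 8          # hypothetical: the article only says "a subset"

# Toy router scores for one token; a real router computes these from the token.
logits = [math.sin(i) for i in range(NUM_EXPERTS)]
routes = top_k_route(logits, TOP_K)

# Share of parameters touched per token, using the article's rounded figures.
TOTAL_PARAMS = 80e9
ACTIVE_PARAMS = 13e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
```

With the article's rounded figures, about 16% of the parameters are used per token. That share need not equal `TOP_K / NUM_EXPERTS`, because layers outside the experts (attention and embeddings, for example) run for every token.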
The model generates images from text prompts across multiple aspect ratios and supports configurable output sizes such as 1024x1024 and 1280x768, with an automatic mode that selects dimensions based on the prompt.[2] Tencent highlights several strengths:
Tencent evaluated the model with two methods. SSAE (Structured Semantic Alignment Evaluation) is an automatic metric that uses multimodal LLMs to score image-text alignment across 3,500 key points in 12 categories. GSB (Good/Same/Bad) is a human study in which more than 100 professional evaluators compared images generated for 1,000 prompts, with each model run once per prompt and no cherry-picking.[3] On GSB, the report gives relative win rates of HunyuanImage 3.0 over each baseline; positive values indicate HunyuanImage 3.0 was preferred more often than the comparison model.
| Comparison (GSB human evaluation) | HunyuanImage 3.0 relative win rate | Notes | Source |
|---|---|---|---|
| vs. HunyuanImage 2.1 | +14.10% | Previous best open-source model | [3] |
| vs. Seedream 4.0 | +1.17% | Closed-source baseline | [3] |
| vs. Nano Banana (Gemini 2.5 Flash Image) | +2.64% | Closed-source baseline | [3] |
| vs. GPT-Image | +5.00% | Closed-source baseline | [3] |
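
The article does not give the formula behind the relative win rate. A common convention for GSB studies is the share of "good" verdicts minus the share of "bad" verdicts, and the sketch below assumes that convention. The function name and the vote counts are made up for illustration and are not taken from the report.

```python
def gsb_relative_win_rate(good, same, bad):
    """Return (good - bad) / total, an assumed GSB convention.

    good: comparisons where the evaluated model was preferred
    same: comparisons judged a tie
    bad:  comparisons where the baseline was preferred
    """
    total = good + same + bad
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (good - bad) / total

# Made-up verdict counts for 1,000 prompts.
rate = gsb_relative_win_rate(good=420, same=200, bad=380)
print(f"{rate:+.2%}")
```

Under this convention a positive value means the model won more comparisons than it lost, and ties pull the figure toward zero without changing its sign.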
On the automatic SSAE metric, the report states that HunyuanImage 3.0 performs on par with the leading models across the fine-grained categories measured.[3] Separately, on the public crowd-voting platform LMArena, the model reached the top position in the text-to-image rankings in early October 2025, ahead of Google's Nano Banana, according to the South China Morning Post.[5]
The model weights and inference code are distributed under the Tencent Hunyuan Community License Agreement (identifier tencent-hunyuan-community), which permits commercial use, modification, and redistribution subject to several conditions.[2][8] Two clauses are commonly noted: the license does not grant rights for use in the European Union, the United Kingdom, or South Korea, and any product or service with more than 100 million monthly active users must obtain a separate license from Tencent.[8] Running the model is hardware-intensive: the base model requires at least three 80 GB GPUs (four recommended) and roughly 170 GB of storage for the weights, while the Instruct variants are documented as needing at least eight 80 GB GPUs.[2][6]
| Aspect | Detail | Source |
|---|---|---|
| Weights | Open, published on Hugging Face (tencent/HunyuanImage-3.0) | [2] |
| Code | Open, published on GitHub (Tencent-Hunyuan/HunyuanImage-3.0) | [4] |
| License | Tencent Hunyuan Community License Agreement | [2][8] |
| Commercial use | Permitted, with conditions | [8] |
| Geographic exclusions | EU, UK, South Korea | [8] |
| Large-deployment clause | Separate license required above 100M monthly active users | [8] |
| Minimum hardware (base) | 3x 80 GB GPU (4x recommended), ~170 GB storage | [2][6] |
| Minimum hardware (Instruct) | 8x 80 GB GPU | [2] |
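
A back-of-the-envelope calculation shows why the hardware figures are plausible. The sketch assumes 16-bit weights (2 bytes per parameter), which the article does not state, and counts weights only, ignoring activations, caches, and framework overhead. The function names are illustrative.

```python
import math

def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(footprint_gb, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(footprint_gb / gpu_memory_gb)

TOTAL_PARAMS = 80e9    # article's rounded total parameter count
BYTES_PER_PARAM = 2    # assumption: 16-bit weights
GPU_MEMORY_GB = 80     # per-GPU memory from the article

footprint = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
gpus = min_gpus_for_weights(footprint, GPU_MEMORY_GB)
```

Under these assumptions the weights alone come to about 160 GB, the same order as the roughly 170 GB storage figure, and would fill two 80 GB GPUs with no headroom. Every expert has to be resident in memory even though only about 13 billion parameters are active per token, so the documented minimum of three GPUs is consistent with leaving room for activations and other runtime state.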
The release drew attention chiefly for its scale and for being openly licensed at a time when the strongest image generators were closed. Technology outlets characterized it as the largest open-source image-generation model then available.[5][6] In October 2025 the South China Morning Post reported that the model briefly held the overall top spot for text-to-image generation on the LMArena leaderboard, surpassing Google DeepMind's Nano Banana; Tencent said the model was "completely comparable to the industry's flagship closed-source models."[5] Within the open-source tooling community, the model was integrated into image-generation frontends following its release.[6]
The most practical constraint is hardware. With a documented minimum of three 80 GB accelerators for the base model and eight for the Instruct variants, the model is out of reach for typical consumer machines and is heavier to run than dense open models a fraction of its size, a direct consequence of the 80-billion-parameter scale.[2][6] The Tencent Hunyuan Community License, while permissive for most users, is not a standard open-source license: the geographic carve-outs for the EU, UK, and South Korea and the 100-million-user threshold mean it does not meet the Open Source Initiative's definition of open source.[8] On quality, the margins reported in the GSB study over the closed-source baselines are small (roughly one to five percentage points), so the model's edge in head-to-head human preference is narrow rather than decisive.[3]