Hunyuan Image 3.0

Updated Jun 2, 2026

Hunyuan Image 3.0 (styled HunyuanImage 3.0) is a text-to-image generation model released by Tencent as part of its Hunyuan family of foundation models. Open-sourced on 28 September 2025, it is a mixture-of-experts (MoE) model with roughly 80 billion total parameters, of which about 13 billion are activated per token during inference. Tencent describes it as the largest and most powerful open-source image generation model released to date.^[1]^[2]^[3] Unlike most contemporary image generators built on a standalone diffusion transformer, HunyuanImage 3.0 is a native multimodal model that performs image generation inside an autoregressive large language model backbone.^[3]

Overview

HunyuanImage 3.0 generates images from natural-language prompts and is distinguished by three properties Tencent emphasizes: very large scale for an open model, a unified autoregressive architecture rather than a separate diffusion network, and strong rendering of legible text inside images.^[1]^[3] The model and its inference code were published openly on GitHub and Hugging Face, and a technical report was posted to arXiv on the same day as the release.^[2]^[3]^[4]

Within Tencent's Hunyuan visual models, HunyuanImage 3.0 is the dedicated still-image generator, separate from HunyuanVideo (video generation) and Hunyuan 3D (3D asset generation). It supersedes HunyuanImage 2.1, which the technical report treats as the previous best open-source model, as Tencent's flagship for still images.^[3]

Tencent Hunyuan and the model family

Hunyuan is Tencent's brand for its in-house foundation models, spanning large language models, image, video, and 3D generation. HunyuanImage 3.0 is built directly on top of one of those language models: its base is Hunyuan-A13B, a pre-trained MoE LLM with more than 80 billion total parameters and 13 billion activated per token.^[3] Reusing a language-model backbone, rather than training an image model from scratch, is central to the design. It lets the system inherit the LLM's world knowledge and reasoning, which Tencent argues improves prompt understanding and the model's ability to reason about what a scene should contain.^[1]^[3]

Within the broader 2025 landscape, the technical report positions HunyuanImage 3.0 against leading image generators: the proprietary Seedream 4.0 (ByteDance), Nano Banana (Google's Gemini 2.5 Flash Image), and GPT Image 1 (OpenAI), the open-weight Qwen-Image from Alibaba's Qwen team, and Tencent's own HunyuanImage 2.1.^[3]

Release

Tencent open-sourced the base model, HunyuanImage-3.0, on 28 September 2025, releasing inference code and model weights together with the technical report.^[2]^[3] Coverage in the technology press described it as the industry's largest open-source image-generation model at the time of release.^[5]^[6]

On 26 January 2026, Tencent released two follow-up checkpoints. HunyuanImage-3.0-Instruct adds chain-of-thought reasoning for automatic prompt enhancement and supports image-to-image editing and the fusion of up to three input images. HunyuanImage-3.0-Instruct-Distil is a distilled variant tuned for faster generation, with around eight inference steps recommended.^[2]^[7]

Variant	Release date	Notable additions
HunyuanImage-3.0	28 September 2025	Base text-to-image model, open weights^[2]^[3]
HunyuanImage-3.0-Instruct	26 January 2026	Reasoning, prompt self-rewriting, image-to-image and multi-image fusion^[2]^[7]
HunyuanImage-3.0-Instruct-Distil	26 January 2026	Distilled for fewer inference steps (~8)^[2]

Architecture

HunyuanImage 3.0 is a native multimodal model that unifies image understanding and generation within a single autoregressive framework, departing from the standalone diffusion model designs that dominate text-to-image generation.^[3] Tencent and the ComfyUI documentation describe the design as an "MoE + Transfusion" architecture.^[3]^[6]

The starting point is the Hunyuan-A13B language model. To let the LLM handle images, Tencent augments it with a pre-trained vision encoder and a variational autoencoder (VAE), each fitted with a projection layer that maps image features into the same embedding space as the model's word embeddings. For image understanding, the LLM conditions its next-token prediction on those joint image features. For image generation, diffusion-based modeling over the VAE's image features is incorporated directly into the LLM, following the approach of Transfusion and JanusFlow rather than running a separate diffusion network.^[3] Because the backbone is an LLM, Tencent applies chain-of-thought training and inference to both understanding and generation, which underpins the model's prompt self-rewriting and reasoning behavior.^[3]
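The projection step described above can be illustrated with a toy sketch. It assumes hypothetical feature widths (`D_MODEL`, `D_VAE`, `D_VIT`), random matrices standing in for the learned projection layers, and an arbitrary text, VAE, vision-encoder ordering. None of these come from the technical report, and this is not Tencent's implementation. It only shows how two differently sized image feature streams can be mapped into the same embedding space as text tokens.

```python
import random

# Hypothetical widths, chosen only to keep the example small.
D_MODEL = 8   # LLM embedding width
D_VAE = 4     # VAE latent features per image patch
D_VIT = 6     # vision-encoder features per image patch


def make_projection(d_in, d_out, seed):
    """Random d_in x d_out matrix standing in for a learned projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def project(features, weight):
    """Map each feature vector of length d_in to a vector of length d_out."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(vec)) for j in range(d_out)]
        for vec in features
    ]


def build_sequence(text_embeds, vae_feats, vit_feats, w_vae, w_vit):
    """Build one (kind, vector) sequence in which every vector has width D_MODEL."""
    seq = [("text", e) for e in text_embeds]
    seq += [("vae", e) for e in project(vae_feats, w_vae)]
    seq += [("vit", e) for e in project(vit_feats, w_vit)]
    return seq
```

Calling `build_sequence` with three text embeddings and two patches from each image stream yields seven entries of equal width, which is the property that lets a single transformer backbone consume them together.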

The MoE design uses 64 experts, with only a subset active per token, so that the model retains the capacity implied by its 80 billion total parameters while activating roughly 13 billion parameters at inference.^[1]^[2] Tencent's analysis of expert activation found that experts become increasingly specialized by modality, with distinct experts tending to handle image versus text tokens, which the authors suggest is one reason the MoE approach helps multimodal modeling.^[3]
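As a rough illustration of how "only a subset active per token" works, the sketch below implements generic top-k expert routing. The expert count of 64 is taken from the text; `TOP_K = 8`, the softmax router, and the weight renormalization are assumptions made for the example, since the article does not state how many experts fire per token or how Tencent's router is built.

```python
import math

NUM_EXPERTS = 64  # stated in the article
TOP_K = 8         # assumption: the number of experts active per token is not given here


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts; weights sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts, top_k=TOP_K):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route(router_logits, top_k):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out
```

Only the experts returned by `route` are evaluated for a given token, so per-token compute scales with the active experts rather than with all 64, even though every expert's weights must still be held in memory.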

Parameter count as disclosed

Property	Value	Source
Total parameters	~80 billion	^[3]
Activated parameters per token	~13 billion	^[3]
Number of experts	64	^[1]^[2]
Base model	Hunyuan-A13B (MoE LLM)	^[3]
Generation framework	Autoregressive LLM with diffusion image modeling on VAE features ("MoE + Transfusion")	^[3]^[6]

Capabilities

The model generates images from text prompts across multiple aspect ratios and supports configurable output sizes such as 1024x1024 and 1280x768, with an automatic mode that selects dimensions based on the prompt.^[2] Tencent highlights several strengths:

Complex semantic understanding. Because generation runs through a large language model, the system is reported to follow long, detailed prompts and to reason about scene content, rather than treating the prompt as a simple description.^[1]^[5]
Text rendering inside images. The model is designed to produce legible, well-placed text, which Tencent illustrates with poster titles, infographic annotations, and brand logos in both Chinese and English.^[1]^[6]
Prompt self-rewriting and reasoning. The base model can optionally rewrite prompts through an external service, while the Instruct variant performs this natively using chain-of-thought reasoning ("think and recaption").^[2]^[7]
Image editing and fusion. The Instruct variant adds image-to-image editing and can combine up to three input images into a single output.^[2]^[7]

Benchmarks

Tencent evaluated the model with two methods. SSAE (Structured Semantic Alignment Evaluation) is an automatic metric that uses multimodal LLMs to score image-text alignment across 3,500 key points in 12 categories. GSB (Good/Same/Bad) is a human study in which more than 100 professional evaluators compared images generated for 1,000 prompts, with each model run once per prompt and no cherry-picking.^[3] On GSB, the report gives relative win rates of HunyuanImage 3.0 over each baseline; positive values indicate HunyuanImage 3.0 was preferred more often than the comparison model.

Comparison (GSB human evaluation)	HunyuanImage 3.0 relative win rate	Notes	Source
vs. HunyuanImage 2.1	+14.10%	Previous best open-source model	^[3]
vs. Seedream 4.0	+1.17%	Closed-source baseline	^[3]
vs. Nano Banana (Gemini 2.5 Flash Image)	+2.64%	Closed-source baseline	^[3]
vs. GPT-Image	+5.00%	Closed-source baseline	^[3]
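The article does not spell out how the relative win rate is computed. A common convention for Good/Same/Bad studies, assumed here rather than taken from the report, is the share of "Good" verdicts minus the share of "Bad" verdicts. The helper name and the counts in the usage note are hypothetical and are not data from the evaluation.

```python
def gsb_relative_win_rate(good, same, bad):
    """Return (good - bad) / (good + same + bad) as a percentage.

    good: comparisons in which the evaluated model was preferred
    same: comparisons judged a tie
    bad:  comparisons in which the baseline was preferred
    """
    total = good + same + bad
    if total == 0:
        raise ValueError("at least one comparison is required")
    return 100.0 * (good - bad) / total
```

Under this convention, 450 "Good", 200 "Same", and 350 "Bad" verdicts give a relative win rate of +10%, and a negative value would mean the baseline was preferred more often.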

On the automatic SSAE metric, the report states that HunyuanImage 3.0 performs on par with the leading models across the fine-grained categories measured.^[3] Separately, on the public crowd-voting platform LMArena, the model reached the top position in the text-to-image rankings in early October 2025, ahead of Google's Nano Banana, according to the South China Morning Post.^[5]

License and availability

The model weights and inference code are distributed under the Tencent Hunyuan Community License Agreement (identifier tencent-hunyuan-community), which permits commercial use, modification, and redistribution subject to several conditions.^[2]^[8] Two clauses are commonly noted: the license does not grant rights for use in the European Union, the United Kingdom, or South Korea, and any product or service with more than 100 million monthly active users must obtain a separate license from Tencent.^[8] Running the model is hardware-intensive: the base model requires at least three 80 GB GPUs (four recommended), while the Instruct variants are documented as needing at least eight 80 GB GPUs, with roughly 170 GB of storage for the weights.^[2]^[6]

Aspect	Detail	Source
Weights	Open, published on Hugging Face (`tencent/HunyuanImage-3.0`)	^[2]
Code	Open, published on GitHub (`Tencent-Hunyuan/HunyuanImage-3.0`)	^[4]
License	Tencent Hunyuan Community License Agreement	^[2]^[8]
Commercial use	Permitted, with conditions	^[8]
Geographic exclusions	EU, UK, South Korea	^[8]
Large-deployment clause	Separate license required above 100M monthly active users	^[8]
Minimum hardware (base)	3x 80 GB GPU (4x recommended), ~170 GB storage	^[2]^[6]
Minimum hardware (Instruct)	8x 80 GB GPU	^[2]
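The two commonly noted clauses can be expressed as a simple pre-deployment check. This is a sketch of the summary given above, not of the license text itself, which remains authoritative. The territory codes, the `Deployment` type, and the flag names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical short codes for the three excluded territories named above.
EXCLUDED_TERRITORIES = {"EU", "UK", "KR"}
MAU_THRESHOLD = 100_000_000  # "more than 100 million monthly active users"


@dataclass
class Deployment:
    territory: str
    monthly_active_users: int


def license_flags(deployment):
    """Return which of the two summarized clauses a planned deployment triggers."""
    flags = []
    if deployment.territory.upper() in EXCLUDED_TERRITORIES:
        flags.append("territory-not-covered")
    if deployment.monthly_active_users > MAU_THRESHOLD:
        flags.append("separate-license-required")
    return flags
```

An empty list means neither summarized clause applies. It does not establish that the remaining license conditions are met.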

Reception

The release drew attention chiefly for its scale and for being openly licensed at a time when the strongest image generators were closed. Technology outlets characterized it as the largest open-source image-generation model then available.^[5]^[6] In October 2025 the South China Morning Post reported that the model briefly held the overall top spot for text-to-image generation on the LMArena leaderboard, surpassing Google DeepMind's Nano Banana, and quoted Tencent as saying the model was "completely comparable to the industry's flagship closed-source models."^[5] Within the open-source tooling community, the model was integrated into image-generation frontends following its release.^[6]

Limitations

The most practical constraint is hardware. With a multi-GPU requirement of three to eight 80 GB accelerators, the model is out of reach for typical consumer machines and is heavier to run than dense open models a fraction of its size, a direct consequence of the 80-billion-parameter scale.^[2]^[6] The Tencent Hunyuan Community License, while permissive for most users, is not a standard open-source license: the geographic carve-outs for the EU, UK, and South Korea and the 100-million-user threshold mean it does not meet the Open Source Initiative's definition of open source.^[8] On quality, the margins reported in the GSB study over the strongest closed-source baselines are small (roughly one to five percentage points), so the model's edge in head-to-head human preference is narrow rather than decisive.^[3]
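A back-of-the-envelope calculation shows why the parameter count translates into a multi-GPU requirement. It assumes 16-bit weights (2 bytes per parameter), which the article does not state, and it counts weight storage only, ignoring activations and other runtime memory.

```python
import math

TOTAL_PARAMS = 80e9    # approximate total parameters, from the article
ACTIVE_PARAMS = 13e9   # approximate activated parameters per token, from the article
BYTES_PER_PARAM = 2    # assumption: 16-bit weights
GPU_MEMORY_GB = 80     # per-accelerator memory cited in the article


def weights_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Size of the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, gpu_gb=GPU_MEMORY_GB, bytes_per_param=BYTES_PER_PARAM):
    """Fewest GPUs whose combined memory holds the weights, with no headroom at all."""
    return math.ceil(weights_gb(params, bytes_per_param) / gpu_gb)


def active_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of parameters used for any single token."""
    return active / total
```

Under these assumptions the weights come to about 160 GB, in the same range as the roughly 170 GB of storage cited, and they would fill two 80 GB GPUs with nothing left for activations, which is consistent with the documented three-GPU minimum. Although only about 16% of the parameters are active for any one token, all experts must stay resident in memory, so the MoE design reduces compute per token but not the memory footprint.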

References

[1] Tencent Open Sources Hunyuan Image 3.0 - World's Largest Open-Source Text-to-Image Model, ComfyUI Wiki, 27 September 2025.
[2] tencent/HunyuanImage-3.0, Hugging Face model card (model details, variants, license, hardware).
[3] HunyuanImage 3.0 Technical Report, arXiv:2509.23951, Tencent Hunyuan, September 2025.
[4] Tencent-Hunyuan/HunyuanImage-3.0, official GitHub repository.
[5] Tencent's AI model Hunyuan Image 3.0 tops leaderboard, beating Google's Nano Banana, South China Morning Post, 6 October 2025.
[6] Hunyuan Image 3.0 release coverage, ComfyUI Wiki (architecture, text rendering, hardware, license).
[7] HunyuanImage-3.0-Instruct model card, Hugging Face (reasoning, image-to-image, January 2026 release).
[8] HunyuanImage-3.0 LICENSE (Tencent Hunyuan Community License Agreement), GitHub.
