# Hunyuan Image 3.0

> Source: https://aiwiki.ai/wiki/hunyuan_image_3
> Updated: 2026-06-02
> Categories: Chinese AI, Generative AI, Image Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Hunyuan Image 3.0** (styled **HunyuanImage 3.0**) is a [text-to-image](/wiki/text-to-image_models) generation model released by [Tencent](/wiki/tencent) as part of its Hunyuan family of foundation models. Open-sourced on 28 September 2025, it is a [mixture-of-experts](/wiki/mixture_of_experts) (MoE) model with roughly 80 billion total parameters, of which about 13 billion are activated per token during inference. Tencent describes it as the largest and most powerful open-source image generation model released to date.[1][2][3] Unlike most contemporary image generators built on a standalone diffusion transformer, HunyuanImage 3.0 is a native multimodal model that performs image generation inside an autoregressive [large language model](/wiki/large_language_model) backbone.[3]

## Overview

HunyuanImage 3.0 generates images from natural-language prompts and is distinguished by three properties Tencent emphasizes: very large scale for an open model, a unified autoregressive architecture rather than a separate diffusion network, and strong rendering of legible text inside images.[1][3] The model and its inference code were published openly on [GitHub](/wiki/github) and Hugging Face, and a technical report was posted to arXiv on the same day as the release.[2][3][4]

The model occupies a different niche from Tencent's other Hunyuan visual models. It is the dedicated image generator, separate from [HunyuanVideo](/wiki/hunyuan_video) (video generation) and [Hunyuan 3D](/wiki/hunyuan_3d) (3D asset generation). It supersedes the earlier HunyuanImage 2.1 as Tencent's flagship for still images.[3]

## Tencent Hunyuan and the model family

Hunyuan is Tencent's brand for its in-house foundation models, spanning large language models, image, video, and 3D generation. HunyuanImage 3.0 is built directly on top of one of those language models: its base is [Hunyuan-A13B](/wiki/hunyuan_a13b), a pre-trained MoE LLM with more than 80 billion total parameters and 13 billion activated per token.[3] Reusing a language-model backbone, rather than training an image model from scratch, is central to the design. It lets the system inherit the LLM's world knowledge and reasoning, which Tencent argues improves prompt understanding and the model's ability to reason about what a scene should contain.[1][3]

Within the broader 2025 landscape, the technical report positions HunyuanImage 3.0 against leading proprietary systems including ByteDance's [Seedream 4.0](/wiki/seedream_4), Google's Nano Banana ([Gemini 2.5 Flash Image](/wiki/gemini_2_5_flash)), OpenAI's [GPT Image 1](/wiki/gpt_image_1), and Alibaba's Qwen-Image (from the [Qwen](/wiki/qwen) team), as well as Tencent's own HunyuanImage 2.1.[3]

## Release

Tencent open-sourced the base model, HunyuanImage-3.0, on 28 September 2025, releasing inference code and model weights together with the technical report.[2][3] Coverage in the technology press described it as the industry's largest open-source image-generation model at the time of release.[5][6]

On 26 January 2026, Tencent released two follow-up checkpoints. HunyuanImage-3.0-Instruct adds chain-of-thought reasoning for automatic prompt enhancement and supports image-to-image editing and the fusion of up to three input images. HunyuanImage-3.0-Instruct-Distil is a distilled variant tuned for faster generation, with around eight inference steps recommended.[2][7]

| Variant | Release date | Notable additions |
| --- | --- | --- |
| HunyuanImage-3.0 | 28 September 2025 | Base text-to-image model, open weights[2][3] |
| HunyuanImage-3.0-Instruct | 26 January 2026 | Reasoning, prompt self-rewriting, image-to-image and multi-image fusion[2][7] |
| HunyuanImage-3.0-Instruct-Distil | 26 January 2026 | Distilled for fewer inference steps (~8)[2] |

## Architecture

HunyuanImage 3.0 is a native multimodal model that unifies image understanding and generation within a single autoregressive framework, departing from the standalone [diffusion model](/wiki/diffusion_model) designs that dominate text-to-image generation.[3] Tencent and the ComfyUI documentation describe the design as an "MoE + Transfusion" architecture.[3][6]

The starting point is the Hunyuan-A13B language model. To let the LLM handle images, Tencent augments it with a pre-trained vision encoder and a [variational autoencoder](/wiki/variational_autoencoder) (VAE), each fitted with a projection layer that maps image features into the same embedding space as the model's word embeddings. For image understanding, the LLM conditions its next-token prediction on those joint image features. For image generation, diffusion-based modeling over the VAE's image features is incorporated directly into the LLM, following the approach of Transfusion and JanusFlow rather than running a separate diffusion network.[3] Because the backbone is an LLM, Tencent applies [chain-of-thought](/wiki/chain_of_thought) training and inference to both understanding and generation, which underpins the model's prompt self-rewriting and reasoning behavior.[3]
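
The projection step can be pictured with a toy example. The sketch below is illustrative only: the dimensions, the random weights, and the helper names `make_linear`, `project`, and `build_sequence` are assumptions made for this example and are not taken from Tencent's code. It shows image latent vectors being mapped by a linear layer to the width of the text embeddings so that both kinds of input can sit in one sequence.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Random (out_dim x in_dim) matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vector):
    """Apply the linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def build_sequence(text_embeddings, image_latents, weights):
    """Concatenate text embeddings with projected image latents."""
    projected = [project(weights, latent) for latent in image_latents]
    return text_embeddings + projected

EMBED_DIM = 8   # hypothetical embedding width (a real LLM's is far larger)
LATENT_DIM = 4  # hypothetical size of one VAE latent vector

text = [[0.0] * EMBED_DIM for _ in range(3)]      # three text-token embeddings
latents = [[1.0] * LATENT_DIM for _ in range(2)]  # two image latent patches
weights = make_linear(LATENT_DIM, EMBED_DIM)

sequence = build_sequence(text, latents, weights)
print(len(sequence), len(sequence[-1]))  # sequence length and vector width
```

After projection, every element of the sequence has the same width, which is what allows a single transformer backbone to attend over text and image positions together.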

The MoE design uses 64 experts, with only a subset active per token, so that the model retains the capacity implied by its 80 billion total parameters while activating roughly 13 billion parameters at inference.[1][2] Tencent's analysis of expert activation found that experts become increasingly specialized by modality, with distinct experts tending to handle image versus text tokens, which the authors suggest is one reason the MoE approach helps multimodal modeling.[3]
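
A minimal sketch of sparse expert routing may help make "only a subset active per token" concrete. The expert count comes from the figures above; the value of `TOP_K`, the softmax gating, and the function names are assumptions for illustration, since the sources cited here do not specify how many experts fire per token or how the router is implemented.

```python
import math
import random

NUM_EXPERTS = 64  # from the article
TOP_K = 8         # assumption: the number of active experts is not stated here

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalise their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
gates = route(logits)
print(sorted(gates))        # indices of the experts that process this token
print(sum(gates.values()))  # gate weights sum to 1, up to rounding
```

Only the selected experts run for a given token, which is how total capacity can far exceed per-token compute. The ratio of roughly 13 billion activated to 80 billion total parameters (about 16%) counts all parameters, including any that every token uses, so it need not equal `TOP_K / NUM_EXPERTS`.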

### Parameter count as disclosed

| Property | Value | Source |
| --- | --- | --- |
| Total parameters | ~80 billion | [3] |
| Activated parameters per token | ~13 billion | [3] |
| Number of experts | 64 | [1][2] |
| Base model | Hunyuan-A13B (MoE LLM) | [3] |
| Generation framework | Autoregressive LLM with diffusion image modeling on VAE features ("MoE + Transfusion") | [3][6] |

## Capabilities

The model generates images from text prompts across multiple aspect ratios and supports configurable output sizes such as 1024x1024 and 1280x768, with an automatic mode that selects dimensions based on the prompt.[2] Tencent highlights several strengths:

- **Complex semantic understanding.** Because generation runs through a large language model, the system is reported to follow long, detailed prompts and to reason about scene content, rather than treating the prompt as a simple description.[1][5]
- **Text rendering inside images.** The model is designed to produce legible, well-placed text, which Tencent illustrates with poster titles, infographic annotations, and brand logos in both Chinese and English.[1][6]
- **Prompt self-rewriting and reasoning.** The base model can optionally rewrite prompts through an external service, while the Instruct variant performs this natively using chain-of-thought reasoning ("think and recaption").[2][7]
- **Image editing and fusion.** The Instruct variant adds image-to-image editing and can combine up to three input images into a single output.[2][7]
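
The output-size options mentioned at the start of this section can be handled with a small helper. The sketch below is hypothetical: `parse_size` and `aspect_ratio` are names invented for this example, and the `"WIDTHxHEIGHT"` and `"auto"` string forms are assumptions about how a caller might express the sizes, not a documented interface.

```python
from math import gcd

def parse_size(spec):
    """Parse 'WIDTHxHEIGHT' into (width, height); return None for 'auto'."""
    if spec.strip().lower() == "auto":
        return None  # dimensions are left to the model's automatic mode
    width_text, height_text = spec.lower().split("x")
    width, height = int(width_text), int(height_text)
    if width <= 0 or height <= 0:
        raise ValueError(f"size must be positive: {spec!r}")
    return width, height

def aspect_ratio(width, height):
    """Reduce a size to its simplest width:height ratio."""
    divisor = gcd(width, height)
    return width // divisor, height // divisor

for spec in ("1024x1024", "1280x768", "auto"):
    size = parse_size(spec)
    print(spec, size, aspect_ratio(*size) if size else "model decides")
```

Under this reading, 1024x1024 is a 1:1 square and 1280x768 is a 5:3 landscape frame.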

## Benchmarks

Tencent evaluated the model with two methods. SSAE (Structured Semantic Alignment Evaluation) is an automatic metric that uses multimodal LLMs to score image-text alignment across 3,500 key points in 12 categories. GSB (Good/Same/Bad) is a human study in which more than 100 professional evaluators compared images generated for 1,000 prompts, with each model run once per prompt and no cherry-picking.[3] On GSB, the report gives relative win rates of HunyuanImage 3.0 over each baseline; positive values indicate HunyuanImage 3.0 was preferred more often than the comparison model.

| Comparison (GSB human evaluation) | HunyuanImage 3.0 relative win rate | Notes | Source |
| --- | --- | --- | --- |
| vs. HunyuanImage 2.1 | +14.10% | Previous best open-source model | [3] |
| vs. Seedream 4.0 | +1.17% | Closed-source baseline | [3] |
| vs. Nano Banana (Gemini 2.5 Flash Image) | +2.64% | Closed-source baseline | [3] |
| vs. GPT-Image | +5.00% | Closed-source baseline | [3] |
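
The sources cited here do not spell out how the relative win rate is computed. One common reading of a Good/Same/Bad study, assumed in the sketch below, is the share of "good" judgements minus the share of "bad" judgements. The function name and the tallies are made up for illustration and are not figures from the report.

```python
def relative_win_rate(good, same, bad):
    """Return (good - bad) / total as a percentage (assumed GSB formula)."""
    total = good + same + bad
    if total == 0:
        raise ValueError("no judgements recorded")
    return 100.0 * (good - bad) / total

# Hypothetical tallies over 1,000 prompts; NOT data from the technical report.
rate = relative_win_rate(good=400, same=350, bad=250)
print(f"{rate:+.2f}%")  # prints +15.00%
```

With this definition, a value near zero means the two models were preferred about equally often, which is why the small positive margins against the closed-source baselines indicate a narrow lead.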

On the automatic SSAE metric, the report states that HunyuanImage 3.0 performs on par with the leading models across the fine-grained categories measured.[3] Separately, on the public crowd-voting platform [LMArena](/wiki/lmarena_org), the model reached the top position in the text-to-image rankings in early October 2025, ahead of Google's Nano Banana, according to the South China Morning Post.[5]
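
For intuition about a key-point metric such as SSAE, the sketch below aggregates pass/fail judgements into per-category and overall accuracy. This is an assumed aggregation scheme, not the report's exact definition; the category names and judgements are invented, and in the real evaluation the judgements would come from multimodal LLMs rather than hand-written booleans.

```python
from collections import defaultdict

def category_accuracy(judgements):
    """judgements: iterable of (category, passed) pairs, one per key point."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, passed in judgements:
        totals[category] += 1
        hits[category] += 1 if passed else 0
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical judgements; the category names are invented for illustration.
sample = [("objects", True), ("objects", False), ("text", True), ("text", True)]
scores = category_accuracy(sample)
overall = sum(1 for _, passed in sample if passed) / len(sample)
print(scores, overall)
```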

## License and availability

The model weights and inference code are distributed under the Tencent Hunyuan Community License Agreement (identifier `tencent-hunyuan-community`), which permits commercial use, modification, and redistribution subject to several conditions.[2][8] Two clauses are commonly noted: the license does not grant rights for use in the European Union, the United Kingdom, or South Korea, and any product or service with more than 100 million monthly active users must obtain a separate license from Tencent.[8] Running the model is hardware-intensive: the base model requires at least three 80 GB GPUs (four recommended) and roughly 170 GB of storage for the weights, while the Instruct variants are documented as needing at least eight 80 GB GPUs.[2][6]

| Aspect | Detail | Source |
| --- | --- | --- |
| Weights | Open, published on Hugging Face (`tencent/HunyuanImage-3.0`) | [2] |
| Code | Open, published on GitHub (`Tencent-Hunyuan/HunyuanImage-3.0`) | [4] |
| License | Tencent Hunyuan Community License Agreement | [2][8] |
| Commercial use | Permitted, with conditions | [8] |
| Geographic exclusions | EU, UK, South Korea | [8] |
| Large-deployment clause | Separate license required above 100M monthly active users | [8] |
| Minimum hardware (base) | 3x 80 GB GPU (4x recommended), ~170 GB storage | [2][6] |
| Minimum hardware (Instruct) | 8x 80 GB GPU | [2] |
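
A back-of-the-envelope estimate shows why the hardware figures land where they do. The sketch below assumes 16-bit weights (two bytes per parameter), which the sources cited here do not state explicitly, and it ignores activations, the vision encoder, the VAE, and framework overhead, so it is a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 80e9   # ~80 billion parameters, from the article
BYTES_PER_PARAM = 2   # assumption: 16-bit weights (e.g. bfloat16)
GPU_MEMORY_GB = 80    # per-GPU memory, from the article

def weight_size_gb(params, bytes_per_param):
    """Size of the raw weights in gigabytes (10**9 bytes)."""
    return params * bytes_per_param / 1e9

weights_gb = weight_size_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
for gpus in (2, 3, 4, 8):
    pooled = gpus * GPU_MEMORY_GB
    headroom = pooled - weights_gb
    print(f"{gpus} GPUs: {pooled} GB pooled, {headroom:.0f} GB left after weights")
```

Under these assumptions the weights alone occupy about 160 GB, close to the documented ~170 GB of storage. Two 80 GB GPUs would be filled by the weights with nothing left for activations, which is consistent with three being the stated minimum for the base model.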

## Reception

The release drew attention chiefly for its scale and for being openly licensed at a time when the strongest image generators were closed. Technology outlets characterized it as the largest open-source image-generation model then available.[5][6] On the LMArena leaderboard the model briefly held the overall top spot for text-to-image generation, surpassing Google DeepMind's Nano Banana, as the South China Morning Post reported in October 2025; Tencent said the model was "completely comparable to the industry's flagship closed-source models."[5] Within the open-source tooling community, the model was integrated into image-generation frontends following its release.[6]

## Limitations

The most practical constraint is hardware. With a multi-GPU requirement of three to eight 80 GB accelerators, the model is out of reach for typical consumer machines and is heavier to run than dense open models a fraction of its size, a direct consequence of the 80-billion-parameter scale.[2][6] The Tencent Hunyuan Community License, while permissive for most users, is not a standard open-source license: the geographic carve-outs for the EU, UK, and South Korea and the 100-million-user threshold mean it does not meet the Open Source Initiative's definition of open source.[8] On quality, the margins reported in the GSB study over the strongest closed-source baselines are small (roughly one to five percentage points), so the model's edge in head-to-head human preference is narrow rather than decisive.[3]

## See also

- [HunyuanVideo](/wiki/hunyuan_video)
- [Hunyuan 3D](/wiki/hunyuan_3d)
- [Hunyuan-A13B](/wiki/hunyuan_a13b)
- [Seedream 4.0](/wiki/seedream_4)
- [Nano Banana](/wiki/nano_banana)
- [GPT Image 1](/wiki/gpt_image_1)

## References

1. [Tencent Open Sources Hunyuan Image 3.0 - World's Largest Open-Source Text-to-Image Model](https://comfyui-wiki.com/en/news/2025-09-27-tencent-open-source-hunyuan-image-3-0), ComfyUI Wiki, 27 September 2025.
2. [tencent/HunyuanImage-3.0](https://huggingface.co/tencent/HunyuanImage-3.0), Hugging Face model card (model details, variants, license, hardware).
3. [HunyuanImage 3.0 Technical Report](https://arxiv.org/abs/2509.23951), arXiv:2509.23951, Tencent Hunyuan, September 2025.
4. [Tencent-Hunyuan/HunyuanImage-3.0](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0), official GitHub repository.
5. [Tencent's AI model Hunyuan Image 3.0 tops leaderboard, beating Google's Nano Banana](https://www.scmp.com/tech/big-tech/article/3328003/tencents-ai-model-hunyuan-image-30-tops-leaderboard-beating-googles-nano-banana), South China Morning Post, 6 October 2025.
6. [Hunyuan Image 3.0 release coverage](https://comfyui-wiki.com/en/news/2025-09-27-tencent-open-source-hunyuan-image-3-0), ComfyUI Wiki (architecture, text rendering, hardware, license).
7. [HunyuanImage-3.0-Instruct model card](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct), Hugging Face (reasoning, image-to-image, January 2026 release).
8. [HunyuanImage-3.0 LICENSE (Tencent Hunyuan Community License Agreement)](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE), GitHub.

