Qwen-Image
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,668 words
Qwen-Image is an open-weight image-generation foundation model released by Alibaba's Qwen team in August 2025. Built as a roughly 20-billion-parameter multimodal diffusion transformer (MMDiT), it is notable above all for state-of-the-art complex text rendering inside generated images, including both fine-grained alphabetic text (such as English) and logographic text (such as Chinese), alongside strong general text-to-image generation and high-fidelity image editing [1][2]. The model was open-sourced under the Apache 2.0 license on Hugging Face and ModelScope, and a companion editing model, Qwen-Image-Edit, was released two weeks later [1][3][4]. At launch the Qwen team positioned Qwen-Image as a leading open image model, claiming it matched or exceeded both open competitors and several closed systems on public benchmarks [1][5].
Qwen-Image is part of the broader Qwen family of models developed by Alibaba's Tongyi Lab. Whereas most prior Qwen releases were large language models or vision-language models, Qwen-Image extends the series into native image generation. The model's defining capability is "native" text rendering: the ability to draw legible, correctly spelled, and well-laid-out text directly inside an image, rather than treating text as an afterthought or relying on a separate optical-character pipeline [1][2]. The Qwen team highlighted use cases such as graphic posters, slides, storefront signage, UI mockups, and bilingual layouts where accurate in-image typography is essential [1].
Beyond text, Qwen-Image is a general-purpose text-to-image generator capable of photorealistic scenes, illustration, and a range of artistic styles, and it supports a variety of aspect ratios and high output resolutions [2][3]. The accompanying Qwen-Image-Edit model adds instruction-based image editing, covering both semantic edits (changing what a scene depicts) and appearance edits (modifying local details), as well as precise editing of text already present in an image [4][6].
The model was developed and released by the Qwen team at Alibaba, the same group responsible for the Qwen series of open language and multimodal models. Qwen-Image was published on August 4, 2025, accompanied by an official blog post, "Qwen-Image: Crafting with Native Text Rendering," and a technical report posted to arXiv (2508.02324) [1][5]. The base text-to-image model was open-sourced under the Apache 2.0 license, which makes the weights freely available and permits commercial use, and was distributed through GitHub, Hugging Face, and ModelScope, with hosted access via Qwen Chat [1][3].
On August 18, 2025, the Qwen team released Qwen-Image-Edit, an image-editing variant built on the same 20B foundation, also under Apache 2.0 [4][6]. The line was subsequently extended with an updated Qwen-Image-Edit-2509 in September 2025, further refreshes through late 2025, and a Qwen-Image-2.0 announced in February 2026; this article focuses on the original August 2025 Qwen-Image and Qwen-Image-Edit releases [3]. The open release was explicitly framed as an effort to lower the technical barriers to high-quality visual content creation and to advance open research in image generation [1].
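Qwen-Image uses a multimodal diffusion transformer (MMDiT) architecture, the same general design family popularized by models such as Stable Diffusion 3 and FLUX, in which text and image tokens are processed jointly within transformer blocks during the diffusion denoising process [1][2]. The system combines three principal components [1][5]: a Qwen2.5-VL multimodal language model that encodes the text condition, a variational autoencoder (VAE) that maps images to and from the latent space, and the roughly 20-billion-parameter MMDiT backbone that performs the denoising.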
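The joint processing of text and image tokens can be illustrated with a minimal, single-head sketch in plain Python. It is not the Qwen-Image implementation: the function and parameter names (`joint_attention`, `wq`, `wk`, `wv`) are hypothetical, the per-modality projection weights follow the Stable Diffusion 3 style of MMDiT as an assumption, and details such as multiple heads, positional encoding, and timestep conditioning are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Helper for the demo: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, image_tokens, params):
    """Single-head attention over the concatenated text + image sequence.

    Assumption: each modality has its own Q/K/V projection weights, but
    attention is computed over one combined sequence, so every text token
    can attend to every image token and vice versa.
    """
    q, k, v = [], [], []
    for tokens, name in ((text_tokens, "text"), (image_tokens, "image")):
        weights = params[name]
        q += matmul(tokens, weights["wq"])
        k += matmul(tokens, weights["wk"])
        v += matmul(tokens, weights["wv"])

    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    attn = [softmax([s * scale for s in row]) for row in scores]
    out = matmul(attn, v)

    # Split the combined output back into its text and image parts.
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:], attn


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    params = {
        name: {key: random_matrix(rng, dim, dim) for key in ("wq", "wk", "wv")}
        for name in ("text", "image")
    }
    text_out, image_out, attn = joint_attention(
        random_matrix(rng, 3, dim), random_matrix(rng, 5, dim), params
    )
    print(len(text_out), len(image_out), len(attn))
```

Each row of the returned attention matrix spans both modalities, which is the property that lets the prompt influence where and how glyphs appear in the image latents.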
For text rendering, the Qwen team employed a progressive training curriculum. Training begins with non-text image generation, then introduces text from simple words to increasingly complex inputs, and finally scales up to paragraph-level descriptions, a curriculum-learning approach the authors credit with substantially improving the model's native text-rendering ability, particularly for dense and logographic text [1][5].
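The ordering of that curriculum can be sketched as a simple step-to-stage schedule. Only the sequence of stages comes from the description above. The stage names and the boundary fractions are hypothetical placeholders, since the cited sources as summarized here do not give them.

```python
from bisect import bisect_right

# (start_fraction, stage_name). The fractions are illustrative assumptions;
# only the order of the stages reflects the described curriculum.
STAGES = [
    (0.0, "non_text_images"),       # images without rendered text
    (0.4, "simple_words"),          # short, simple text
    (0.7, "complex_text"),          # longer and denser text inputs
    (0.9, "paragraph_level"),       # paragraph-level descriptions
]


def stage_for_step(step, total_steps):
    """Return the curriculum stage that is active at a given training step."""
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    progress = step / total_steps
    starts = [start for start, _ in STAGES]
    # bisect_right finds the last stage whose start is <= progress.
    return STAGES[bisect_right(starts, progress) - 1][1]


if __name__ == "__main__":
    for step in (0, 500, 800, 999):
        print(step, stage_for_step(step, 1000))
```

A data loader could use such a function to decide which data mix to sample from at each step. A real schedule might instead blend stages gradually rather than switching at hard boundaries.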
Qwen-Image-Edit extends this design with a dual-encoding mechanism for the input image. The image is fed simultaneously into Qwen2.5-VL, which extracts high-level semantic features used for visual semantic control, and into the VAE encoder, which captures low-level reconstructive (appearance) features; these two streams are combined in the MMDiT image pathway so that edits remain semantically coherent while preserving visual fidelity [4][6]. The editing model was trained with an enhanced multi-task paradigm spanning text-to-image, text-and-image-to-image, and image-to-image reconstruction tasks, which the authors describe as improving editing consistency by better aligning the latent representations of Qwen2.5-VL and the MMDiT [4][5].
The headline capability is text rendering. Qwen-Image is designed to handle multi-line layouts, paragraph-level semantics, and fine-grained typographic detail, and it supports both alphabetic languages such as English and logographic languages such as Chinese with high fidelity [1][2]. The Qwen team reported that the model rivals strong closed systems such as GPT-4o image generation for English text and is best-in-class for Chinese text rendering among the systems they compared [1].
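For editing, Qwen-Image-Edit supports two broad categories [4][6]: semantic editing, which changes what a scene depicts, and appearance editing, which modifies local details.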
A distinctive editing feature is precise text editing inside images: the model can modify text in both Chinese and English while attempting to preserve the original font, size, and style [4][6]. The base model card also lists broader image-understanding and conditional-generation abilities, including object detection, semantic segmentation, depth and edge estimation, novel-view synthesis, and super-resolution, reflecting the breadth of tasks the unified architecture was trained on [3].
The following benchmark results are claims reported by the Qwen team in the official blog post and technical report, and should be read as the developers' own evaluations [1][5]. Independent reproductions may differ.
For general text-to-image generation, Qwen-Image was evaluated on GenEval, DPG, OneIG-Bench, and TIIF. The Qwen team reported a GenEval overall score of 0.91 (after a reinforcement-learning refinement stage) and a DPG score of 88.32, figures that, in their comparisons, exceeded systems such as FLUX.1 and GPT Image 1; for reference, the team's reported DPG figures placed GPT Image 1 at 85.15 and FLUX.1 at 83.84 [5][7]. On OneIG-Bench the model was reported as best overall across both English and Chinese tracks, while on TIIF it ranked second, behind GPT Image 1 [5][7]. For text rendering specifically, the Qwen team reported state-of-the-art results on benchmarks including LongText-Bench, ChineseWord, and TextCraft (with additional evaluation on CVTG-2K), citing especially large margins for Chinese text [1][5].
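To make the size of the reported DPG gaps explicit, the short calculation below subtracts the comparison scores from Qwen-Image's score. It uses only the three developer-reported figures quoted above; the helper name `margins_over` is illustrative.

```python
# DPG scores as reported by the Qwen team (developer-reported figures).
reported_dpg = {
    "Qwen-Image": 88.32,
    "GPT Image 1": 85.15,
    "FLUX.1": 83.84,
}


def margins_over(scores, model):
    """Return how far `model` scores above every other entry, in points."""
    reference = scores[model]
    return {
        name: round(reference - score, 2)
        for name, score in scores.items()
        if name != model
    }


margins = margins_over(reported_dpg, "Qwen-Image")
print(margins)  # {'GPT Image 1': 3.17, 'FLUX.1': 4.48}
```

On these figures the reported lead is about 3.2 points over GPT Image 1 and about 4.5 points over FLUX.1.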
For image editing, Qwen-Image-Edit was evaluated on GEdit, ImgEdit, and GSO, where the Qwen team again reported state-of-the-art performance among the systems compared [4][5]. Secondary coverage also noted strong placement for Qwen-Image on Alibaba's AI Arena human-preference leaderboard, where it was described as the top-ranked open-source image model at the time [7]; such leaderboard standings change frequently and are best treated as time-bound.
The table below summarizes the model's key specifications and selected reported figures.
| Specification | Detail |
|---|---|
| Developer | Qwen team, Alibaba (Tongyi Lab) |
| Base model release | August 4, 2025 |
| Editing model release | August 18, 2025 (Qwen-Image-Edit) |
| Model type | Text-to-image and image-editing foundation model |
| Architecture | Multimodal diffusion transformer (MMDiT) |
| Parameters | ~20 billion (MMDiT backbone) |
| Text / condition encoder | Qwen2.5-VL (7B used in the editing path) |
| Image latent | Variational autoencoder (VAE) |
| Languages (text rendering) | English and Chinese (alphabetic and logographic) |
| License | Apache 2.0 |
| Distribution | Hugging Face, ModelScope, GitHub, Qwen Chat |
| GenEval (reported) | 0.91 overall (after RL) |
| DPG (reported) | 88.32 |
| Editing benchmarks (reported) | State of the art on GEdit, ImgEdit, GSO |
Qwen-Image was released as a fully open-weight model under the permissive Apache 2.0 license, which allows redistribution and commercial use, distinguishing it from closed image generators such as GPT-4o image generation, Google's Gemini image models (informally nicknamed "Nano Banana"), and ByteDance's Seedream [1][3]. The open release drove rapid ecosystem adoption: the model and its editing variant were integrated into the Hugging Face Diffusers library via dedicated pipelines, supported natively in ComfyUI, and quickly accumulated community fine-tunes and LoRA adapters [3][4][6]. Inference backends including vLLM and SGLang variants added support as well [3].
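As an illustration of the Diffusers integration, the sketch below loads the model through the generic `DiffusionPipeline` entry point and generates one image. It makes several assumptions that should be checked against the current model card: the repository identifier `Qwen/Qwen-Image`, a CUDA GPU with enough memory for a roughly 20B-parameter model in bfloat16, and default values (resolution, step count) chosen here purely for illustration. The helper names `build_call_kwargs` and `generate` are hypothetical.

```python
def build_call_kwargs(prompt, width=1024, height=1024, steps=50):
    """Collect the keyword arguments passed to the pipeline call.

    The defaults are illustrative assumptions, not recommended settings.
    """
    if width <= 0 or height <= 0 or steps <= 0:
        raise ValueError("width, height, and steps must be positive")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": steps,
    }


def generate(prompt, model_id="Qwen/Qwen-Image", device="cuda", seed=0, **overrides):
    """Generate one image; requires `torch` and `diffusers` to be installed."""
    # Imported lazily so the helper above can be used without a GPU stack.
    import torch
    from diffusers import DiffusionPipeline

    # Assumption: the repository id resolves to the Qwen-Image pipeline.
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe = pipe.to(device)

    # A seeded generator makes the sampled image reproducible.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(generator=generator, **build_call_kwargs(prompt, **overrides))
    return result.images[0]


if __name__ == "__main__":
    image = generate('A coffee shop storefront with a sign reading "Qwen Coffee"')
    image.save("qwen_image_example.png")
```

Keeping the argument construction separate from the model call makes it possible to validate settings without downloading the weights.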
Within Alibaba's own portfolio, Qwen-Image complements the company's earlier generative-media efforts, including the Tongyi Wanxiang image line and the Wan video-generation models, and it builds directly on the Qwen2.5-VL vision-language model for its multimodal conditioning [4][5]. Its significance lies in two areas. First, it set a new bar for open image models on text rendering, an area where prior open systems such as FLUX and Stable Diffusion 3.5 had been comparatively weak, and where Qwen-Image's Chinese-text capability in particular had few open peers [1][7]. Second, by pairing strong general generation with a tightly integrated, instruction-based editing model under a permissive license, Qwen-Image became a widely used foundation for downstream creative tooling, contributing to the broader momentum of open Chinese AI models in the generative-image space during 2025 [3][7].