Qwen-Image

Updated Jun 8, 2026

Qwen-Image is an open-weight image-generation foundation model released by Alibaba's Qwen team in August 2025. Built as a roughly 20-billion-parameter multimodal diffusion transformer (MMDiT), it is notable above all for state-of-the-art complex text rendering inside generated images, including both fine-grained alphabetic text (such as English) and logographic text (such as Chinese), alongside strong general text-to-image generation and high-fidelity image editing ^[1]^[2]. The model was open-sourced under the Apache 2.0 license on Hugging Face and ModelScope, and a companion editing model, Qwen-Image-Edit, was released two weeks later ^[1]^[3]^[4]. At launch the Qwen team positioned Qwen-Image as a leading open image model, claiming it matched or exceeded both open competitors and several closed systems on public benchmarks ^[1]^[5].

Overview

Qwen-Image is part of the broader Qwen family of models developed by Alibaba's Tongyi Lab. Whereas most prior Qwen releases were large language models or vision-language models, Qwen-Image extends the series into native image generation. The model's defining capability is "native" text rendering: the ability to draw legible, correctly spelled, and well-laid-out text directly inside an image, rather than treating text as an afterthought or relying on a separate optical-character pipeline ^[1]^[2]. The Qwen team highlighted use cases such as graphic posters, slides, storefront signage, UI mockups, and bilingual layouts where accurate in-image typography is essential ^[1].

Beyond text, Qwen-Image is a general-purpose text-to-image generator capable of photorealistic scenes, illustration, and a range of artistic styles, and it supports a variety of aspect ratios and high output resolutions ^[2]^[3]. The accompanying Qwen-Image-Edit model adds instruction-based image editing, covering both semantic edits (changing what a scene depicts) and appearance edits (modifying local details), as well as precise editing of text already present in an image ^[4]^[6].

Qwen team and release

The model was developed and released by the Qwen team at Alibaba, the same group responsible for the Qwen series of open language and multimodal models. Qwen-Image was published on August 4, 2025, accompanied by an official blog post, "Qwen-Image: Crafting with Native Text Rendering," and a technical report posted to arXiv (2508.02324) ^[1]^[5]. The base text-to-image model was open-sourced under the Apache 2.0 license, which makes the weights freely available and permits commercial use. It was distributed through GitHub, Hugging Face, and ModelScope, with hosted access via Qwen Chat ^[1]^[3].

On August 18, 2025, the Qwen team released Qwen-Image-Edit, an image-editing variant built on the same 20B foundation, also under Apache 2.0 ^[4]^[6]. The line was subsequently extended with further iterations: an updated Qwen-Image-Edit-2509 in September 2025, later refreshes through late 2025, and Qwen-Image-2.0, announced in February 2026. This article focuses on the original August 2025 Qwen-Image and Qwen-Image-Edit releases ^[3]. The open release was explicitly framed as an effort to lower the technical barriers to high-quality visual content creation and to advance open research in image generation ^[1].

Architecture (MMDiT)

Qwen-Image uses a multimodal diffusion transformer (MMDiT) architecture, the same general design family popularized by models such as Stable Diffusion 3 and FLUX, in which text and image tokens are processed jointly within transformer blocks during the diffusion denoising process ^[1]^[2]. The system combines three principal components ^[1]^[5]:

A multimodal text/condition encoder based on Qwen2.5-VL, the Qwen team's vision-language model, which encodes the text prompt (and, for editing, the input image) into rich semantic conditioning. The Qwen2.5-VL encoder used in the editing path is the 7-billion-parameter variant ^[4]^[6].
A variational autoencoder (VAE) that compresses images into a latent space and reconstructs them, providing the low-level visual representation the diffusion transformer operates over ^[1]^[5].
The roughly 20-billion-parameter MMDiT backbone, which performs the iterative denoising that turns noise into an image conditioned on the encoded text ^[1]^[2].
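The joint processing of text and image tokens described above can be illustrated with a toy example. The following is a minimal sketch, not the Qwen-Image implementation: the function names are hypothetical, the tokens are tiny hand-written vectors, and the learned query, key, and value projections of a real transformer block are omitted. It only shows the core idea that both modalities are concatenated into one sequence, so every token can attend to every other token.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(
    text_tokens: List[Vector], image_tokens: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Attend over text and image tokens as one concatenated sequence.

    Simplification (assumption): queries, keys, and values are the raw
    tokens, with no learned projections as a real model would have.
    """
    tokens = text_tokens + image_tokens  # single joint sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, text and image alike.
        mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        outputs.append(mixed)
    n_text = len(text_tokens)
    return outputs[:n_text], outputs[n_text:]  # split back per modality


text_out, image_out = joint_attention([[1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
```

In this toy run the single text token ends up with a nonzero second component, which it could only have picked up from the image tokens. That cross-modal mixing is what distinguishes joint attention from conditioning through a separate cross-attention path.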

For text rendering, the Qwen team employed a progressive training curriculum. Training begins with non-text image generation, then introduces text from simple words to increasingly complex inputs, and finally scales up to paragraph-level descriptions. The authors credit this curriculum-learning approach with substantially improving the model's native text-rendering ability, particularly for dense and logographic text ^[1]^[5].
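The staged ordering of that curriculum can be sketched as a simple schedule lookup. This is an illustrative sketch only: the stage labels paraphrase the description above, while the boundary fractions and the function name `stage_for_progress` are hypothetical, since the sources cited here do not give the actual schedule.

```python
from bisect import bisect_right

# Stage labels paraphrase the curriculum described in the text.
STAGES = [
    "non-text images",
    "simple words",
    "complex text",
    "paragraph-level text",
]

# Hypothetical boundaries, as fractions of total training.
# The real schedule is not stated in the sources cited here.
BOUNDARIES = [0.25, 0.50, 0.75]


def stage_for_progress(progress: float) -> str:
    """Return the curriculum stage active at a given training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    # bisect_right counts how many boundaries have been reached or passed.
    return STAGES[bisect_right(BOUNDARIES, progress)]


for p in (0.0, 0.3, 0.6, 1.0):
    print(f"{p:.0%} of training -> {stage_for_progress(p)}")
```

A data loader built around such a lookup would draw samples from progressively harder text-rendering data as training advances, rather than mixing all difficulty levels from the start.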

Qwen-Image-Edit extends this design with a dual-encoding mechanism for the input image. The image is fed simultaneously into Qwen2.5-VL, which extracts high-level semantic features used for visual semantic control, and into the VAE encoder, which captures low-level reconstructive (appearance) features; these two streams are combined in the MMDiT image pathway so that edits remain semantically coherent while preserving visual fidelity ^[4]^[6]. The editing model was trained with an enhanced multi-task paradigm spanning text-to-image, text-and-image-to-image, and image-to-image reconstruction tasks, which the authors describe as improving editing consistency by better aligning the latent representations of Qwen2.5-VL and the MMDiT ^[4]^[5].
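The dual-encoding idea can be sketched as one input image passing through two independent encoders whose outputs are kept as separate conditioning streams. The sketch below is a toy under stated assumptions: `EditConditioning`, `encode_for_edit`, and both toy encoders are hypothetical stand-ins, and how the real model merges the two streams inside the MMDiT is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

Features = List[float]
Image = List[int]  # toy image: a flat list of 0-255 pixel values


@dataclass
class EditConditioning:
    semantic: Features    # stand-in for vision-language encoder features
    appearance: Features  # stand-in for VAE encoder latents


def encode_for_edit(
    image: Image,
    instruction: str,
    semantic_encoder: Callable[[Image, str], Features],
    appearance_encoder: Callable[[Image], Features],
) -> EditConditioning:
    """Feed the same image to both encoders and keep both results."""
    return EditConditioning(
        semantic=semantic_encoder(image, instruction),
        appearance=appearance_encoder(image),
    )


def toy_semantic_encoder(image: Image, instruction: str) -> Features:
    # Hypothetical: a coarse summary of the image plus the instruction length.
    return [sum(image) / len(image), float(len(instruction.split()))]


def toy_appearance_encoder(image: Image) -> Features:
    # Hypothetical: a per-pixel representation that keeps low-level detail.
    return [pixel / 255 for pixel in image]


cond = encode_for_edit([0, 255], "make it red", toy_semantic_encoder, toy_appearance_encoder)
```

Keeping the two streams separate reflects the stated design goal: the semantic stream tells the model what the edit should mean, while the appearance stream preserves the detail needed to leave untouched regions unchanged.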

Capabilities (text rendering and editing)

The headline capability is text rendering. Qwen-Image is designed to handle multi-line layouts, paragraph-level semantics, and fine-grained typographic detail, and it supports both alphabetic languages such as English and logographic languages such as Chinese with high fidelity ^[1]^[2]. The Qwen team reported that the model rivals strong closed systems such as GPT-4o image generation for English text and is best-in-class for Chinese text rendering among the systems they compared ^[1].

For editing, Qwen-Image-Edit supports two broad categories ^[4]^[6]:

Semantic editing, such as style transfer, object rotation, novel-view synthesis (reported to support rotations up to roughly 180 degrees), and intellectual-property or character-consistent creation, where pixels can change substantially while the subject's identity is preserved.
Appearance editing, such as adding or removing elements and refining local details, where most of the image is kept untouched and only the targeted region changes.

A distinctive editing feature is precise text editing inside images: the model can modify text in both Chinese and English while attempting to preserve the original font, size, and style ^[4]^[6]. The base model card also lists broader image-understanding and conditional-generation abilities, including object detection, semantic segmentation, depth and edge estimation, novel-view synthesis, and super-resolution, reflecting the breadth of tasks the unified architecture was trained on ^[3].

Benchmarks

The following benchmark results are claims reported by the Qwen team in the official blog post and technical report, and should be read as the developers' own evaluations ^[1]^[5]. Independent reproductions may differ.

For general text-to-image generation, Qwen-Image was evaluated on GenEval, DPG, OneIG-Bench, and TIIF. The Qwen team reported a GenEval overall score of 0.91 (after a reinforcement-learning refinement stage) and a DPG score of 88.32. In the team's comparisons, both figures exceeded those of systems such as FLUX.1 and GPT Image 1; on DPG, the reported scores were 85.15 for GPT Image 1 and 83.84 for FLUX.1 ^[5]^[7]. On OneIG-Bench the model was reported as best overall across both English and Chinese tracks, while on TIIF it ranked second, behind GPT Image 1 ^[5]^[7]. For text rendering specifically, the Qwen team reported state-of-the-art results on benchmarks including LongText-Bench, ChineseWord, and TextCraft (with additional evaluation on CVTG-2K), citing especially large margins for Chinese text ^[1]^[5].
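To make the size of the reported DPG gaps concrete, the short calculation below computes Qwen-Image's margin over each compared system. It uses only the three developer-reported DPG figures quoted above; the function name `margins_over_others` is illustrative.

```python
from typing import Dict

# Developer-reported DPG scores quoted in the text above.
DPG_REPORTED = {
    "Qwen-Image": 88.32,
    "GPT Image 1": 85.15,
    "FLUX.1": 83.84,
}


def margins_over_others(model: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Return how many points `model` leads each other system by."""
    own = scores[model]
    return {
        name: round(own - score, 2)
        for name, score in scores.items()
        if name != model
    }


print(margins_over_others("Qwen-Image", DPG_REPORTED))
# → {'GPT Image 1': 3.17, 'FLUX.1': 4.48}
```

The margins are a few points on a 100-point scale, which is one reason to treat them as developer-reported results pending independent reproduction.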

For image editing, Qwen-Image-Edit was evaluated on GEdit, ImgEdit, and GSO, where the Qwen team again reported state-of-the-art performance among the systems compared ^[4]^[5]. Secondary coverage also noted strong placement for Qwen-Image on Alibaba's AI Arena human-preference leaderboard, where it was described as the top-ranked open-source image model at the time ^[7]; such leaderboard standings change frequently and are best treated as time-bound.

The table below summarizes the model's key specifications and selected reported figures.

Specification	Detail
Developer	Qwen team, Alibaba (Tongyi Lab)
Base model release	August 4, 2025
Editing model release	August 18, 2025 (Qwen-Image-Edit)
Model type	Text-to-image and image-editing foundation model
Architecture	Multimodal diffusion transformer (MMDiT)
Parameters	~20 billion (MMDiT backbone)
Text / condition encoder	Qwen2.5-VL (7B used in the editing path)
Image latent	Variational autoencoder (VAE)
Languages (text rendering)	English and Chinese (alphabetic and logographic)
License	Apache 2.0
Distribution	Hugging Face, ModelScope, GitHub, Qwen Chat
GenEval (reported)	0.91 overall (after RL)
DPG (reported)	88.32
Editing benchmarks (reported)	State of the art on GEdit, ImgEdit, GSO

Availability and significance

Qwen-Image was released as a fully open-weight model under the permissive Apache 2.0 license, which allows redistribution and commercial use, distinguishing it from closed image generators such as GPT-4o image generation, Google's Gemini image models (informally nicknamed "Nano Banana"), and ByteDance's Seedream ^[1]^[3]. The open release drove rapid ecosystem adoption: the model and its editing variant were integrated into the Hugging Face Diffusers library via dedicated pipelines, supported natively in ComfyUI, and quickly accumulated community fine-tunes and LoRA adapters ^[3]^[4]^[6]. Inference backends including vLLM and SGLang variants added support as well ^[3].
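As an illustration of the Diffusers integration, the sketch below loads the base model through the generic `DiffusionPipeline` entry point and generates one image. It is a hedged example rather than official usage: the helper names `build_request` and `generate` are hypothetical, the two output sizes and the step count are assumptions chosen for illustration, and the code presumes a CUDA GPU with enough memory for a model of this size.

```python
from typing import Any, Dict

# Illustrative output sizes (assumptions, not values stated in this article).
ASPECT_RATIOS = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
}


def build_request(prompt: str, aspect: str = "16:9", steps: int = 50) -> Dict[str, Any]:
    """Assemble keyword arguments for the pipeline call."""
    width, height = ASPECT_RATIOS[aspect]
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": steps,
    }


def generate(prompt: str, aspect: str = "16:9", seed: int = 42):
    """Run text-to-image generation; needs torch, diffusers, and a CUDA GPU."""
    # Heavy imports stay local so build_request works without a GPU stack.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
    pipe = pipe.to("cuda")
    generator = torch.Generator(device="cuda").manual_seed(seed)  # reproducible noise
    return pipe(generator=generator, **build_request(prompt, aspect)).images[0]


if __name__ == "__main__":
    image = generate('A storefront sign that reads "Open Daily"')
    image.save("storefront.png")
```

Running the script downloads the full model weights on first use, so it is best tried on a machine set up for large-model inference. The prompt deliberately includes quoted text, since in-image typography is the capability the model is best known for.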

Within Alibaba's own portfolio, Qwen-Image complements the company's earlier generative-media efforts, including the Tongyi Wanxiang image line and the Wan video-generation models, and it builds directly on the Qwen2.5-VL vision-language model for its multimodal conditioning ^[4]^[5]. Its significance lies in two areas. First, it set a new bar for open image models on text rendering, an area where prior open systems such as FLUX and Stable Diffusion 3.5 had been comparatively weak, and where Qwen-Image's Chinese-text capability in particular had few open peers ^[1]^[7]. Second, by pairing strong general generation with a tightly integrated, instruction-based editing model under a permissive license, Qwen-Image became a widely used foundation for downstream creative tooling, contributing to the broader momentum of open Chinese AI models in the generative-image space during 2025 ^[3]^[7].

References

1. Qwen Team. "Qwen-Image: Crafting with Native Text Rendering." Qwen blog, August 4, 2025. https://qwenlm.github.io/blog/qwen-image/
2. Qwen Team. "Qwen-Image Technical Report." arXiv:2508.02324, August 2025. https://arxiv.org/abs/2508.02324
3. "Qwen/Qwen-Image." Hugging Face model card. https://huggingface.co/Qwen/Qwen-Image
4. "Qwen/Qwen-Image-Edit." Hugging Face model card. https://huggingface.co/Qwen/Qwen-Image-Edit
5. Qwen Team. "Qwen-Image Technical Report" (PDF), August 4, 2025. https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf
6. "Qwen Team Introduces Qwen-Image-Edit: Advanced Capabilities for Semantic and Appearance Editing." The AI Sector, August 19, 2025. https://theaisector.com/2025/08/19/qwen-team-introduces-qwen-image-edit-the-image-editing-version-of-qwen-image-with-advanced-capabilities-for-semantic-and-appearance-editing/
7. Gupta, Mehul. "Qwen-Image: Best open-sourced AI image generation is here." Data Science in Your Pocket (Medium), August 2025. https://medium.com/data-science-in-your-pocket/qwen-image-best-open-sourced-ai-image-generation-is-here-d09b6e7f6c71
