Qwen3 is the third-generation family of large language models developed by the Qwen Team at Alibaba Cloud (also known as Tongyi Qianwen lab). Released on April 29, 2025, Qwen3 comprises eight open-weight models spanning dense and Mixture of Experts (MoE) architectures, with parameter counts ranging from 0.6 billion to 235 billion. The family introduces a hybrid thinking capability that lets a single model switch between deliberate, chain-of-thought reasoning and fast, non-thinking response modes within the same deployment. All eight base models carry an Apache 2.0 license, making them freely available for commercial and research use.
At launch, the flagship Qwen3-235B-A22B outperformed OpenAI's o3-mini on AIME math benchmarks and Codeforces programming contests, and performed comparably to Google's Gemini 2.5 Pro on several reasoning tasks. The family was trained on 36 trillion tokens covering 119 languages and dialects, double the dataset used for the previous Qwen 2.5 generation. By the time Qwen3 launched, the broader Qwen family had accumulated over 300 million downloads and more than 100,000 derivative models on Hugging Face. Throughout 2025 the series expanded into a sprawling ecosystem of subfamilies covering coding (Qwen3-Coder), vision-language (Qwen3-VL), audio and video (Qwen3-Omni), retrieval (Qwen3-Embedding and Qwen3-Reranker), the hybrid-attention Qwen3-Next, and the trillion-parameter proprietary Qwen3-Max. By September 2025 the broader Qwen lineage had overtaken Meta's LLaMA as the most-downloaded open-weight model family on Hugging Face according to Stanford's tracking, with several hundred thousand community derivatives on the Hub.
Alibaba Cloud's Qwen Team began releasing language models in 2023 under the Tongyi Qianwen brand. The original Qwen 1.0 models, released mid-2023, included dense models at 1.8B, 7B, 14B, and 72B parameter sizes, pretrained primarily on Chinese and English text totalling approximately 3 trillion tokens. Alibaba had introduced Tongyi Qianwen as a beta service in April 2023 before opening it publicly in September 2023 following regulatory clearance in China.
Qwen 1.5, released in early 2024, expanded the size range to 0.5B through 72B and introduced an early MoE variant (14B total, 2.7B active). It was treated internally as a beta release leading into Qwen 2, and the two generations share architectural lineage.
Qwen 2, released in June 2024, adopted Grouped Query Attention (GQA) across all model sizes. GQA reduces key-value cache memory demands and improves inference throughput compared to standard multi-head attention. The series scaled up to a 72B open-weight model, with Alibaba keeping its largest commercial variants proprietary through the Model Studio API.
Qwen 2.5, released in September 2024, was the largest Qwen open-weight release up to that point. Pretraining data scaled from the 7 trillion tokens used for Qwen 2 to 18 trillion tokens, and the release introduced specialized branches: Qwen2.5-Coder for software development tasks and Qwen2.5-Math for mathematical reasoning. The 72B Qwen2.5-Instruct model gained wide adoption for deployment on single-GPU server configurations, and the Qwen series established itself as one of the dominant open-weight lineages in the global open-source AI community. The Qwen 2.5 generation also supported 29 languages, a significant multilingual improvement over earlier versions.
Qwen3 was previewed in a series of teasers from the Qwen Twitter account in late April 2025. The flagship Qwen3-235B-A22B was first announced on April 28, 2025, and the full open-weight release including dense models from 0.6B to 32B and the 30B-A3B MoE went live on April 29, 2025. The release doubled pretraining data again to 36 trillion tokens, added genuine large-scale MoE models (30B-A3B and 235B-A22B), extended multilingual support to 119 languages, and introduced the hybrid thinking architecture that distinguishes the generation from its predecessors.
Qwen3 in its initial April 2025 release consists of six dense models and two MoE models. Dense models activate all their parameters during inference; MoE models activate only a fraction of their total parameters per forward pass, reducing compute requirements while maintaining or exceeding the performance of larger dense models.
| Model | Type | Total parameters | Active parameters | Native context | Extended context |
|---|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K | 32K |
| Qwen3-4B | Dense | 4B | 4B | 32K | 128K (YaRN) |
| Qwen3-8B | Dense | 8B | 8B | 32K | 128K (YaRN) |
| Qwen3-14B | Dense | 14B | 14B | 32K | 128K (YaRN) |
| Qwen3-32B | Dense | 32B | 32B | 32K | 128K (YaRN) |
| Qwen3-30B-A3B | MoE | 30B | 3B | 32K | 128K (YaRN) |
| Qwen3-235B-A22B | MoE | 235B | 22B | 32K | 128K (YaRN) |
The two smallest models (0.6B and 1.7B) target edge deployment and on-device inference scenarios where memory is constrained. They support a hard 32K token context ceiling. The 4B through 32B dense models cover consumer GPU and server GPU configurations; the 4B model runs on hardware with as little as 8GB VRAM when quantized to 4-bit precision, and the 32B fits on a single 80GB A100 or equivalent.
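As a rough illustration of these hardware claims, weight memory scales linearly with parameter count and bits per parameter. The sketch below (Python; illustrative only, ignoring the KV cache and activation overhead, which add several gigabytes depending on batch size and context length) reproduces the figures cited above.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Illustrative only: KV cache, activations, and runtime overhead are not counted.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1024**3

for name, params, bits in [
    ("Qwen3-4B   @ 4-bit", 4, 4),     # ~1.9 GB  -> fits in 8 GB VRAM with headroom
    ("Qwen3-32B  @ BF16 ", 32, 16),   # ~60 GB   -> fits on a single 80 GB A100
    ("Qwen3-235B @ BF16 ", 235, 16),  # ~438 GB  -> requires a multi-GPU node
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```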
The 30B-A3B MoE model activates only 3 billion parameters per forward pass despite having 30 billion total parameters stored in memory. In compute cost during inference, it is roughly equivalent to a 3B dense model while retaining substantially more capacity due to the conditional routing mechanism. The Qwen Team reports that Qwen3-30B-A3B outperforms QwQ-32B on several reasoning benchmarks despite having only one-tenth as many active parameters.
The flagship 235B-A22B MoE activates 22 billion parameters per token from a total of 235 billion stored parameters. All models are available for download under Apache 2.0 on Hugging Face, ModelScope, GitHub, and Kaggle.
The Qwen Team also reports that the 4B Qwen3 dense model, with knowledge distillation from the larger checkpoints, can rival Qwen2.5-72B-Instruct on many tasks despite using roughly one-eighteenth as many parameters. This compression result is one of the headline efficiency claims tied to the release.
The official Qwen3 technical report (arXiv:2505.09388) lists the following layer and attention-head configurations:
| Model | Layers | Q heads | KV heads | Experts (total / active) |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16 | 8 | n/a |
| Qwen3-1.7B | 28 | 16 | 8 | n/a |
| Qwen3-4B | 36 | 32 | 8 | n/a |
| Qwen3-8B | 36 | 32 | 8 | n/a |
| Qwen3-14B | 40 | 40 | 8 | n/a |
| Qwen3-32B | 64 | 64 | 8 | n/a |
| Qwen3-30B-A3B | 48 | 32 | 4 | 128 / 8 |
| Qwen3-235B-A22B | 94 | 64 | 4 | 128 / 8 |
All variants use Grouped Query Attention, SwiGLU feed-forward blocks, and Rotary Positional Embeddings. The MoE layers in the 30B and 235B models contain 128 experts each, with 8 routed experts per token, plus fine-grained expert segmentation that splits the feed-forward dimension into smaller specialised units.
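The following minimal sketch (PyTorch; hidden sizes are illustrative, and projection layers and causal masking are omitted) shows the core of Grouped Query Attention: a small set of key/value heads is shared across a larger set of query heads, which is what shrinks the KV cache relative to standard multi-head attention.

```python
# Minimal GQA sketch: n_q_heads query heads attend over n_kv_heads shared KV heads.
# The KV cache shrinks by a factor of n_q_heads / n_kv_heads versus multi-head attention.
import torch

def gqa(x_q, x_kv, n_q_heads=32, n_kv_heads=8, head_dim=128):
    b, t, _ = x_q.shape
    q = x_q.view(b, t, n_q_heads, head_dim).transpose(1, 2)    # (b, Hq,  t, d)
    k = x_kv.view(b, t, n_kv_heads, head_dim).transpose(1, 2)  # (b, Hkv, t, d)
    v = x_kv.view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(b, t, n_q_heads * head_dim)

q_in = torch.randn(1, 16, 32 * 128)   # output of the query projection
kv_in = torch.randn(1, 16, 8 * 128)   # output of the shared key/value projection
print(gqa(q_in, kv_in).shape)         # torch.Size([1, 16, 4096])
```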
One of the defining design choices in Qwen3 is integrating thinking and non-thinking capabilities within a single model rather than shipping separate reasoning and chat variants. Earlier systems typically required users to pick between a fast general-purpose chat model and a slower but more deliberate reasoning model. In Qwen3, both behaviors exist in the same weights and can be selected at inference time.
In thinking mode, the model generates an internal chain-of-thought (CoT) trace before producing its final answer. This reasoning trace appears between special <think> and </think> delimiters in the raw output. The thinking trace can extend to approximately 38,000 tokens for complex problems, allowing the model to explore multiple reasoning paths, check its work, and revise conclusions before committing to an answer. Users can set a budget cap to limit how long the model thinks, enabling a direct tradeoff between latency and answer quality.
In non-thinking mode, the model responds directly without visible reasoning steps, prioritizing speed and conciseness. This is suited for conversational exchanges, simple lookups, and applications where response time matters more than deliberative accuracy.
Users switch between modes in two ways. At the API level, the enable_thinking parameter controls the mode for a given call. Within a conversation, /think and /no_think tokens can be placed inline in messages to enable or disable thinking for subsequent turns. This per-turn control is called soft switching and allows mixed conversations where some questions receive full reasoning and others receive quick answers.
Sampling recommendations differ by mode. For thinking mode, the Qwen Team recommends temperature 0.6, top-p 0.95, and top-k 20, with greedy decoding explicitly discouraged (it degrades output quality in the thinking regime). For non-thinking mode, temperature 0.7, top-p 0.8, and top-k 20 are recommended.
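A minimal self-hosting sketch of both controls, following the usage pattern shown on the Qwen3 Hugging Face model cards; the model ID, prompt, and token budget here are illustrative, and the enable_thinking flag is the chat-template argument described above.

```python
# Sketch of thinking-mode control with Hugging Face transformers, following the
# usage pattern on the Qwen3 model cards. Model ID and limits are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # any Qwen3 checkpoint with the hybrid chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False (or append /no_think to the message) for fast mode
)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Recommended thinking-mode sampling: temperature 0.6, top-p 0.95, top-k 20.
output = model.generate(**inputs, max_new_tokens=4096,
                        do_sample=True, temperature=0.6, top_p=0.95, top_k=20)
text = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False)

# The reasoning trace arrives between <think> and </think>; split it off before
# showing the final answer, and strip any trailing chat-template markers.
thinking, _, answer = text.partition("</think>")
print(answer.replace("<|im_end|>", "").strip())
```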
This dual-mode design was achieved through a four-stage post-training pipeline described in the Training section. By July 2025 the Qwen Team revised this design philosophy: the updated 235B-A22B-2507 checkpoints split the unified model back into two specialised variants (Instruct-2507 and Thinking-2507), with each variant optimized solely for one mode. The team explained that decoupling produced higher quality on each individual axis than the original mixed approach, even though it sacrificed the soft-switching convenience.
The Qwen3 dense models share a standard transformer architecture with several modifications introduced progressively across the Qwen series:
The two MoE variants (30B-A3B and 235B-A22B) share the same base architecture as the dense models but replace standard feed-forward layers with mixture-of-experts layers. Each MoE layer contains 128 total expert networks. During a forward pass, a learned router selects 8 of these 128 experts for each token, keeping computation constant regardless of how many experts exist in total.
Fine-grained expert segmentation divides the feed-forward dimensions into smaller units per expert, enabling more targeted specialization. Global-batch load balancing distributes token routing across experts to prevent any single expert from becoming a bottleneck during training or inference.
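A minimal sketch of this routing scheme (PyTorch; hidden and expert dimensions are illustrative, and the load-balancing objective is omitted):

```python
# Top-k expert routing as used in the Qwen3 MoE layers: a learned router scores
# 128 experts per token, the top 8 are evaluated, and their outputs are combined
# with the normalized router weights. Sizes here are illustrative, not the real config.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_expert=256, n_experts=128, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)  # small per-expert FFN: fine-grained segmentation
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)          # 8 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```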
The flagship 235B-A22B MoE model has 94 transformer layers, 64 attention heads for queries, 4 heads for key-value, and uses BF16 tensor precision. It has 235 billion total parameters, of which 234 billion are non-embedding parameters.
The Qwen Team reports that the 30B-A3B MoE achieves comparable performance to the Qwen2.5-72B dense model on most benchmarks while activating only about 3 billion parameters per token, a small fraction of the dense model's 72 billion. This translates directly to lower inference compute cost when measured in floating-point operations per token.
Context lengths range from 32K tokens (0.6B and 1.7B models) to 128K tokens (8B through 235B). Extended context is enabled via YaRN (Yet another RoPE extensioN) scaling with a scaling factor of 4.0, and Dual Chunk Attention, which improves performance on inputs that exceed the training sequence length.
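The Qwen3 model cards describe enabling the extension by adding a rope_scaling block to the model configuration; the sketch below shows the equivalent transformers-side setup. The model ID and kwarg plumbing are assumptions to verify against the card, which also advises enabling static YaRN only when long inputs are expected because it can slightly degrade short-context quality.

```python
# Sketch of enabling the YaRN long-context extension (factor 4.0) when loading a
# Qwen3 checkpoint with transformers, mirroring the rope_scaling block described
# on the model cards. Illustrative; verify field names against the official card.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-8B"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 32K native window x 4 = 128K
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(model_id, config=config,
                                             torch_dtype="auto", device_map="auto")
```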
Community testing has found that effective performance is higher for inputs under 64K to 96K tokens even on models rated for 128K, and that YaRN-scaled inputs (which stress the extension mechanism) perform somewhat worse than inputs that fit within the native window. Users running contexts longer than 64K should expect some degradation relative to shorter inputs.
The later 2507 refresh of the 235B-A22B model lifts the native window to 262,144 tokens (256K) and supports extension up to roughly 1,010,000 tokens through Dual Chunk Attention combined with MInference sparse attention. Alibaba reports a roughly 3x speedup at 1M-token sequence lengths from MInference relative to dense attention, although VRAM requirements scale to the order of 1 TB across multiple GPUs at the longest setting.
Qwen3 was pretrained on approximately 36 trillion tokens, roughly double the 18 trillion tokens used for Qwen 2.5. The dataset spans 119 languages and dialects organized across Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian, and other language families. This extended multilingual coverage compares to Qwen 2.5's 29 languages.
The training data composition includes web text, books, academic papers, code repositories, mathematical content, and synthetic data. Rather than relying solely on crawled web data, the Qwen Team developed a multilingual annotation system that classified over 30 trillion tokens by educational value, domain, and safety dimensions. This labeling informed how tokens were weighted and sampled during training.
Synthetic data was generated using earlier Qwen models as a "data factory." Qwen2.5-Math generated mathematical textbooks and question-answer pairs. Qwen2.5-Coder generated synthetic code snippets and programming exercises. Qwen2.5-VL was fine-tuned to extract clean text from PDF documents, with Qwen2.5 then used to refine the OCR output before ingestion. This pipeline allowed Alibaba to convert large volumes of structured documents (scientific papers, technical manuals, textbooks) into clean training text.
Pretraining proceeded across three stages: a general stage covering more than 30 trillion tokens at a 4K-token context length to build broad language ability and world knowledge; a knowledge-intensive stage of roughly 5 trillion additional tokens with an increased share of STEM, coding, and reasoning data; and a final long-context stage using high-quality long documents to extend the context window to 32K tokens.
To produce the hybrid thinking behavior, Qwen3 uses a four-stage post-training pipeline: (1) a long chain-of-thought cold start, with supervised fine-tuning on curated long reasoning traces spanning mathematics, coding, logic, and STEM; (2) reasoning-focused reinforcement learning with rule-based, verifiable rewards and scaled-up rollout compute; (3) thinking mode fusion, which blends non-thinking instruction data into the reasoning model so that a single checkpoint supports both modes; and (4) general reinforcement learning across a broad set of tasks to strengthen instruction following, agent and tool use, and safety while correcting undesired behaviors.
Post-training data curation used a two-phase filtering approach for reasoning data. The query filter removed prompts that were unverifiable or solvable without reasoning. The response filter removed examples with incorrect answers, internal inconsistencies, repetitive traces, or indicators of guesswork. Human annotators assessed cases that automated verifiers could not resolve.
Knowledge distillation from the flagship 235B-A22B teacher into the smaller dense and MoE checkpoints is a key efficiency lever in the Qwen3 pipeline. The technical report describes a strong-to-weak distillation regime in which smaller students learn from teacher logits during post-training, which is what allows the 4B and 8B dense models to reach quality levels that previously required substantially larger checkpoints in earlier Qwen generations.
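The report does not reproduce the exact objective, but the standard formulation of logit distillation conveys the idea. The sketch below (PyTorch) is a textbook KL-matching loss, not the Qwen Team's recipe; the vocabulary size and temperature are chosen for illustration.

```python
# Generic logit-distillation sketch: the student is trained to match the teacher's
# temperature-softened output distribution via a KL term. Textbook formulation only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # scale by t^2 so gradient magnitude stays comparable across temperatures
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

# vocabulary-sized logits for a batch of 4 next-token predictions (~152K vocab)
student = torch.randn(4, 151_936, requires_grad=True)
teacher = torch.randn(4, 151_936)
loss = distillation_loss(student, teacher)
loss.backward()
```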
All eight models in the Qwen3 base family are released under the Apache 2.0 license. This allows commercial use, modification, and redistribution without requiring licensees to open-source their own derived work, subject to preservation of the Apache 2.0 copyright notice and attribution statement.
The Apache 2.0 license distinguishes Qwen3 from several competing open-weight releases. Meta's LLaMA series uses a custom community license that restricts services with more than 700 million monthly active users. Moonshot AI's Kimi K2, released in 2025, uses a modified MIT license that requires prominent "Kimi K2" attribution in the product interface for commercial deployments exceeding 100 million monthly active users or $20 million in monthly revenue. The DeepSeek series uses the MIT license, which is also permissive. Qwen3's Apache 2.0 is the standard choice in the enterprise open-source software world and carries no graduated commercial restrictions.
The Apache 2.0 designation extends to the later Qwen3 subfamilies that ship public weights: Qwen3-Coder (480B-A35B and 30B-A3B), the Qwen3-VL dense and MoE checkpoints, the Qwen3-Omni 30B-A3B variants (Instruct, Thinking, and Captioner), Qwen3-Next-80B-A3B (Instruct and Thinking), and the Qwen3-Embedding and Qwen3-Reranker series at 0.6B, 4B, and 8B sizes. The proprietary Qwen3-Max is the only headline branch that ships through API access only without weights.
The following table shows performance of Qwen3-235B-A22B and Qwen3-32B against comparable models at the time of release. Scores are taken from the Qwen3 technical report (arXiv:2505.09388) and the official Alibaba blog post.
| Benchmark | Qwen3-235B-A22B | Qwen3-32B | Qwen3-30B-A3B | DeepSeek-R1 | GPT-4o | o3-mini | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | 68.18 | 65.54 | 61.49 | -- | -- | -- | -- |
| GPQA (5-shot CoT) | 47.47 | 49.49 | 43.94 | -- | -- | -- | -- |
| AIME 2024 | 85.7 | 79.0 | -- | 79.8 | 9.3 | 63.6 | ~92 |
| AIME 2025 | 81.5 | -- | -- | -- | -- | -- | -- |
| LiveCodeBench v5 | 70.7 | 65.6 | -- | 65.9 | 32.3 | 53.8 | -- |
| Codeforces rating | 2,056 | 1,977 | -- | 1,870 | 759 | 1,258 | -- |
| BFCL v3 (tool use) | 70.8 | 70.0 | -- | 37.0 | 50.1 | 48.4 | -- |
| SWE-Bench Pro | 21.4 | -- | -- | -- | -- | -- | -- |
The BFCL v3 (Berkeley Function-Calling Leaderboard) scores are particularly notable: both the 235B and 32B models score roughly double what DeepSeek-R1 achieves, reflecting Alibaba's explicit training emphasis on function-calling and agentic capabilities. The gap relative to GPT-4o on this benchmark is also large.
On AIME 2024, the 235B model at 85.7 substantially exceeds o3-mini (63.6) and DeepSeek-R1 (79.8), placing it in competitive range with Gemini 2.5 Pro. For coding, Qwen3-32B at 1,977 Codeforces rating surpasses what OpenAI's o1 model achieved at the same point in time.
For math reasoning that requires extended thinking, both the 235B and 32B models perform well above GPT-4o's baseline of 9.3 on AIME 2024, which reflects that GPT-4o was not a reasoning-specialized model at the time of comparison.
The Qwen3 technical report also shows that the 235B-A22B base model outperforms DeepSeek-V3 Base across 14 of 15 benchmarks while having roughly one-third the total parameters and two-thirds the activated parameters of DeepSeek-V3.
The July 2025 Thinking-2507 refresh raised these numbers further. According to the Hugging Face model card for Qwen3-235B-A22B-Thinking-2507, scores on AIME 2025, GPQA Diamond, and MMLU-Pro climb substantially over the April release:
| Benchmark | 235B-A22B (April 2025) | 235B-A22B-Thinking-2507 |
|---|---|---|
| AIME 2025 | 81.5 | 92.3 |
| GPQA Diamond | -- | 81.1 |
| MMLU-Pro | 68.18 | 84.4 |
| HMMT 2025 | -- | 83.9 |
| LiveCodeBench v6 | -- | 74.1 |
| LiveBench (2024-11-25) | -- | 78.4 |
| BFCL v3 | 70.8 | 71.9 |
| TAU1-Retail | -- | 67.8 |
| MultiIF | -- | 80.6 |
| PolyMATH | -- | 60.1 |
VentureBeat reported that Qwen3-235B-A22B-Thinking-2507 "tops OpenAI, Gemini reasoning models on key benchmarks," with the AIME 2025 score of 92.3 leading all reported models on that competition at the time.
The non-thinking Qwen3-235B-A22B-Instruct-2507 reaches 70.3 on AIME 2025, 77.5 on GPQA, 83.0 on MMLU-Pro, 79.2 on Arena-Hard v2, 88.7 on IFEval, 70.9 on BFCL v3, and 51.8 on LiveCodeBench v6. Those are non-thinking-mode numbers, so they sit below the Thinking-2507 results on reasoning-heavy benchmarks but ahead on instruction-following and arena-style preference scoring.
Qwen3 expanded steadily through 2025 into a broad set of subfamilies, each branded with a suffix that signals modality, deployment tier, or training focus.
Qwen3-Max is the commercial API branding used by Alibaba Cloud's Model Studio for the most capable model in the family. The original April 2025 launch tied this label to Qwen3-235B-A22B. Beginning with Qwen3-Max-Preview on September 5, 2025, the Max line moved to a separate trillion-plus-parameter MoE model that is closed-weight and accessible only through the API. Alibaba describes Qwen3-Max-Preview as the company's first model with more than one trillion parameters, trained on roughly 36 trillion tokens. Internal Alibaba benchmarks at announcement showed Qwen3-Max-Preview ahead of Qwen3-235B-A22B-2507 across the company's reasoning suite.
Qwen3-Max powers Alibaba's Quark AI super-assistant application in China, which serves over 200 million users. Quark integrates deep search, photo-based problem solving, AI writing, multimodal interactions (photo editing, AI camera), and task execution built on Qwen's reasoning capabilities. Model Studio also offers Qwen3-Plus and Qwen3-Turbo tiers, which correspond to smaller models with lower per-token pricing. The tiered API structure follows the pattern established by major cloud AI providers, allowing developers to select cost-performance tradeoffs appropriate for their workloads.
Qwen3-Coder is a code-specialized branch of the Qwen3 family. The flagship variant, Qwen3-Coder-480B-A35B-Instruct, was released on July 22, 2025, roughly three months after the base family. It has 480 billion total parameters and 35 billion active parameters, with a native context of 256K tokens and support for up to 1 million tokens via YaRN extrapolation. A smaller 30B-A3B coder, sometimes referred to as Qwen3-Coder-Flash in Alibaba's chat interface, followed on July 31, 2025 with 30.5 billion total parameters and 3.3 billion active. Both models are Apache 2.0.
Qwen3-Coder was designed specifically for agentic coding scenarios involving multi-turn tool use, code execution, and long-horizon repository-level problem solving. Pretraining used 7.5 trillion tokens with a 70 percent code-data ratio. During post-training, Alibaba ran Code RL (execution-driven reinforcement learning) and Agent RL (long-horizon reinforcement learning against real code execution environments), scaling training across up to 20,000 parallel environments. Verifiable coding tasks (where correctness can be checked automatically) were preferred, following the "hard to solve, easy to verify" principle from RLVR (Reinforcement Learning with Verifiable Rewards).
At launch, Alibaba reported that Qwen3-Coder achieved state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling, with the Qwen Team describing its performance as comparable to Claude Sonnet 4 on agentic coding tasks. An open-source CLI tool called Qwen Code was released alongside the model, adapted from Google's open-source Gemini CLI with customised prompts and function-calling protocols. Qwen Code is compatible with established agent interfaces including Claude Code and Cline. The model is also exposed through OpenAI SDK compatibility on Alibaba Cloud Model Studio, which lets existing OpenAI-API code swap base URL and key without further changes.
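A hedged sketch of that compatibility path using the official OpenAI Python client; the base URL follows Alibaba's documented international endpoint pattern, and the model name is a placeholder to check against the Model Studio console.

```python
# Calling Qwen3-Coder through Model Studio's OpenAI-compatible mode.
# Base URL and model ID are assumptions; substitute the values your console lists.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # international endpoint
)

response = client.chat.completions.create(
    model="qwen3-coder-plus",  # hosted Qwen3-Coder tier; check the console for exact IDs
    messages=[{"role": "user", "content": "Write a Python function that parses RFC 3339 timestamps."}],
)
print(response.choices[0].message.content)
```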
Qwen3-Coder is hosted on Together AI, Fireworks AI, NVIDIA NIM, Amazon Bedrock, OpenRouter, and Hyperbolic, among other inference providers. The 30B-A3B variant runs on a 32GB or 64GB Mac via MLX or LM Studio at 4-bit quantization, which made it widely accessible to individual developers without server GPUs.
Qwen3-VL is the vision-language branch of the Qwen3 series. The flagship Qwen3-VL-235B-A22B-Instruct and Qwen3-VL-235B-A22B-Thinking checkpoints were released on September 23, 2025. The 30B-A3B variants followed on October 4, 2025, and dense versions at 2B, 4B, 8B, and 32B were rolled out alongside the MoE branch. The Qwen3-VL technical report (arXiv:2511.21631) was published in November 2025.
Qwen3-VL accepts interleaved text, image, and video inputs and provides a native context window of 256K tokens for multimodal content. The family mirrors the base Qwen3 architecture, offering both dense and MoE variants. Qwen3-VL retains the hybrid thinking capability from the base LLMs and delivers strong performance on single-image, multi-image, and video understanding benchmarks. The model's long native context allows it to handle lengthy document inputs (mixed text and images), multi-frame video clips, and multi-image reasoning tasks that would overflow shorter-context vision models. Qwen3-VL-2B-Instruct surpassed 18 million Hugging Face downloads in its first weeks, reflecting widespread interest in a small multimodal model with permissive licensing.
Qwen3-Omni is the audio and video branch of the family, released on September 22, 2025. The published checkpoints are Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner, all under Apache 2.0. Qwen3-Omni accepts text, image, audio, and video inputs and produces text and natively streamed speech output. Speech recognition is supported across 113 languages and dialects, and speech generation across 36 languages. The Qwen3-Omni technical report (arXiv:2509.17765) describes a single end-to-end multimodal model that maintains state-of-the-art performance across modalities without degradation relative to single-modal counterparts.
Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source state-of-the-art on 32 benchmarks and overall state-of-the-art on 22, outperforming closed-source competitors including Gemini 2.5 Pro, Seed-ASR, and GPT-4o-Transcribe. The model exposes a voice cloning interface where users can upload a sample voice and have responses synthesised in that voice through the API. A later Qwen3-Omni-Flash-2025-12-01 variant added quality and latency improvements aimed at near real-time deployment.
The Qwen3-Embedding and Qwen3-Reranker series were released on June 5, 2025, both under Apache 2.0. They cover three sizes (0.6B, 4B, and 8B) and target retrieval, classification, semantic search, and reranking workloads. The 8B Qwen3-Embedding ranked first on the MTEB multilingual leaderboard at release with a score of 70.58, and the Qwen3-Reranker-4B reached 69.76 on MTEB-R, with the 8B at 69.02. The series supports more than 100 natural and programming languages, and models are designed using dual-encoder (for embedding) and cross-encoder (for reranking) architectures, with LoRA-based fine-tuning over the Qwen3 base. A multimodal companion family, Qwen3-VL-Embedding and Qwen3-VL-Reranker, was released later at 2B and 8B sizes built on the Qwen3-VL backbone.
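A brief retrieval sketch using the sentence-transformers integration described on the Qwen3-Embedding model cards; the model ID and the query-prompt handling are assumptions to verify against the card.

```python
# Retrieval step with Qwen3-Embedding via sentence-transformers (>= 3.0).
# Model ID and prompt_name usage follow the model card but should be verified.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
documents = [
    "Beijing is the capital of the People's Republic of China.",
    "Gravity is a fundamental interaction between masses.",
]

query_emb = model.encode(queries, prompt_name="query")  # queries use an instruction prompt
doc_emb = model.encode(documents)
scores = model.similarity(query_emb, doc_emb)           # cosine similarity matrix
print(scores)  # the Beijing sentence should score highest for the query
```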
Qwen3-Next-80B-A3B was released on September 11, 2025 in two variants: Qwen3-Next-80B-A3B-Instruct (non-thinking) and Qwen3-Next-80B-A3B-Thinking. Both ship with Apache 2.0 weights. The Next series is the first installment in a redesigned architecture line that pairs gated DeltaNet linear-time attention blocks with gated softmax attention blocks (a hybrid attention design), high-sparsity MoE layers, zero-centred and weight-decayed layernorm for stability, and Multi-Token Prediction for accelerated decoding. Native context length is 262,144 tokens, extendable past 1 million via YaRN. Alibaba reports that Qwen3-Next-80B-A3B-Instruct performs on par with Qwen3-235B-A22B-Instruct-2507 on many benchmarks while activating only 3 billion parameters per token, with particularly strong results on ultra-long-context tasks at and beyond the 256K mark.
In late July 2025, Alibaba split the unified-mode 235B-A22B, 30B-A3B, and 4B checkpoints into dedicated Instruct-2507 and Thinking-2507 variants. Qwen3-235B-A22B-Instruct-2507 launched first on July 21, 2025, with Qwen3-235B-A22B-Thinking-2507 following a few days later. Both raised native context to 262,144 tokens (256K) with 1M-token extension via Dual Chunk Attention plus MInference. The Thinking-2507 variant is reasoning-only (its chat template inserts <think> tags automatically) and the Instruct-2507 variant suppresses think blocks entirely. Smaller Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, Qwen3-4B-Instruct-2507, and Qwen3-4B-Thinking-2507 checkpoints followed in the same window.
Qwen3 is one of the most broadly deployed open-weight LLM families on the market. The official ecosystem covers cloud APIs, self-hosting frameworks, edge runtimes, and fine-tuning toolkits.
| Surface | Tool / platform | Notes |
|---|---|---|
| Cloud API | Alibaba Cloud Model Studio | OpenAI-compatible endpoints for Qwen3-Max, Plus, Turbo, Coder, VL, Omni |
| Cloud API | OpenRouter, Together AI, Fireworks AI, Hyperbolic, DeepInfra, Cerebras | Third-party hosted endpoints with competitive per-token rates |
| Cloud API | Amazon Bedrock | Qwen3-Coder-30B-A3B-Instruct first-class model card |
| Cloud API | Google Vertex AI Model Garden | Qwen3-VL hosted by Google |
| Cloud API | NVIDIA NIM (build.nvidia.com) | Qwen3-Coder-480B-A35B-Instruct served on NVIDIA infrastructure |
| Model hub | Hugging Face | All open-weight checkpoints, FP8 versions, GGUF community ports |
| Model hub | ModelScope | Alibaba's Chinese-market hub, mirrors HF releases |
| Model hub | Kaggle | Listed at launch alongside HF for ML-competition access |
| Inference server | vLLM (>= 0.8.4) | Production-grade tensor-parallel serving |
| Inference server | SGLang (>= 0.4.6.post1) | Structured-decoding inference framework |
| Inference server | TensorRT-LLM | NVIDIA's optimized server stack |
| Local runtime | Ollama | One-line `ollama run qwen3` for dense and MoE checkpoints |
| Local runtime | LM Studio | GUI loader with GGUF model browsing |
| Local runtime | llama.cpp | GGUF inference, including 4-bit and 8-bit quantization |
| Local runtime | KTransformers | CPU+GPU hybrid inference for 235B-class MoE on workstation hardware |
| Apple Silicon | MLX, MLX-LM | Native Metal-backed inference on Mac |
| Edge / mobile | ExecuTorch, MNN, OpenVINO | Embedded and mobile runtimes for the 0.6B and 1.7B dense models |
| Fine-tuning | Axolotl, Unsloth, ms-Swift, LLaMA-Factory | Standard open-source fine-tuning frameworks with Qwen3 templates |
FP8 checkpoints for the larger MoE models reduce memory and bandwidth requirements relative to BF16, and quantized INT4 builds are published for local use.
Qwen3 models are available through Alibaba Cloud Model Studio, with pricing varying by region, model size, and whether the thinking mode is in use. Thinking-mode outputs are priced higher because they produce substantially more output tokens (the reasoning trace counts toward the output token bill). A 50 percent discount applies to asynchronous batch jobs. The following prices are for the international Singapore region.
| Model | Input (per M tokens) | Output, non-thinking (per M tokens) | Output, thinking (per M tokens) |
|---|---|---|---|
| Qwen3-235B-A22B | $0.70 | $2.80 | $8.40 |
| Qwen3-32B | $0.16 | $0.64 | $0.64 |
| Qwen3-30B-A3B | $0.20 | $0.80 | $2.40 |
For the Chinese mainland (Beijing region), prices are considerably lower due to local infrastructure and currency factors. Qwen3-235B input tokens cost approximately $0.29/M and non-thinking output costs approximately $1.15/M in the Beijing region.
Qwen3-Max, the branding for the trillion-parameter API model in Model Studio, uses tiered pricing that scales with prompt length across three ranges (0-32K, 32K-128K, and 128K-252K tokens), with the highest tier reaching approximately $3.00/M for input and up to $15.00/M for output in thinking mode.
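As a worked example of how thinking-mode billing adds up under the Singapore rates listed above (token counts are hypothetical; the reasoning trace is billed as output, which is why thinking calls cost more):

```python
# Back-of-the-envelope cost of a thinking-mode call to Qwen3-235B-A22B at the
# Singapore-region rates in the table above. Token counts are hypothetical.
def call_cost_usd(input_tokens, output_tokens, in_rate=0.70, out_rate=8.40):
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# 2K-token prompt, 6K tokens of reasoning trace plus a 1K-token final answer
print(f"${call_cost_usd(2_000, 7_000):.4f}")  # ~= $0.0602 per call
```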
Third-party inference providers including Fireworks AI, Together AI, Cerebras, and Hyperbolic also host Qwen3 models, with pricing that varies by provider and may differ from Alibaba's direct rates. The open-weight Apache 2.0 release means any provider can self-host and offer the models without licensing fees, which has driven competitive pricing across the ecosystem. Cerebras began serving Qwen3-235B-A22B-Instruct-2507 on its wafer-scale inference hardware in August 2025, advertising sub-second time-to-first-token at 1,000+ tokens per second, which is unusually high for a 235B-class MoE model.
The table below compares Qwen3-235B-A22B with other notable models available at or near the April 2025 release date on selected benchmarks:
| Model | Developer | Open weight | AIME 2024 | LiveCodeBench v5 | BFCL v3 | License |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Alibaba | Yes | 85.7 | 70.7 | 70.8 | Apache 2.0 |
| DeepSeek-R1 | DeepSeek | Yes | 79.8 | 65.9 | 37.0 | MIT |
| Qwen3-32B | Alibaba | Yes | 79.0 | 65.6 | 70.0 | Apache 2.0 |
| LLaMA 4 Maverick | Meta | Yes | -- | -- | -- | Llama 4 Community |
| GPT-4o | OpenAI | No | 9.3 | 32.3 | 50.1 | Proprietary |
| o3-mini | OpenAI | No | 63.6 | 53.8 | 48.4 | Proprietary |
| Gemini 2.5 Pro | Google | No | ~92 | -- | -- | Proprietary |
For math and coding tasks requiring extended reasoning, Qwen3-235B-A22B performs competitively with proprietary models available at the same time and leads the open-weight category. The BFCL tool-use advantage is particularly pronounced: Qwen3 scores roughly double DeepSeek-R1 and substantially above GPT-4o on this benchmark, consistent with Qwen3's training emphasis on agentic function-calling.
For general instruction following and subjective quality tasks, proprietary models such as GPT-4o and Claude maintained advantages according to human preference evaluations at launch, though the gap narrowed with Qwen3 relative to earlier Qwen generations. By the July 2025 Thinking-2507 update, the 235B model overtook several proprietary models on AIME 2025 and on specific GPQA Diamond and LiveCodeBench v6 metrics, with multiple commentators describing it as the strongest open-weight reasoning model in the public ecosystem at that point.
On a compute-adjusted basis, Qwen3-30B-A3B (3B active parameters) performs comparably to models requiring 7-10x more active parameters per forward pass. For operators who self-host models and pay for GPU time by the operation, this active-parameter efficiency translates directly to cost savings.
Given the range of sizes and the hybrid thinking mode, Qwen3 targets several distinct deployment scenarios.
For edge and on-device inference, the 0.6B and 1.7B models fit into the memory constraints of mobile devices and embedded systems. The 4B model runs on consumer GPUs with 8GB VRAM when quantized, and on Apple Silicon via the MLX framework. This makes Qwen3 practically accessible to developers without access to cloud GPU infrastructure.
For self-hosted research and enterprise deployments, the 8B through 32B models are sized for single-GPU and multi-GPU servers. The 32B model has seen particularly wide adoption as a high-quality open-weight model that runs on a single 80GB A100 GPU without quantization. Organizations that handle sensitive data and cannot send it to external APIs tend to prefer these models.
For agentic workflows, Qwen3's strong BFCL and tool-use scores, combined with native Model Context Protocol (MCP) support and the Qwen-Agent framework, make it a practical foundation for building agents that call external APIs, execute code, or coordinate multi-step tasks. The Qwen3-Coder variant extends this to repository-level software engineering tasks, and the Qwen Code CLI gives developers an out-of-the-box agent harness analogous to Claude Code.
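A hedged sketch of the function-calling pattern involved, using the standard OpenAI-style tools schema against any OpenAI-compatible Qwen3 endpoint; the endpoint, model ID, and get_weather tool are placeholders for illustration.

```python
# OpenAI-style function calling against a Qwen3 endpoint (Model Studio or any
# OpenAI-compatible host). Endpoint, model name, and the tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://example-qwen3-host/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-235b-a22b",  # placeholder model ID
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou today?"}],
    tools=tools,
)

# When the model decides to call a function it returns a structured tool_calls
# entry; the caller executes the tool and sends the result back as a "tool" message.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```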
For multilingual applications, the 119-language training coverage makes Qwen3 suitable for translation pipelines, customer service applications in non-English markets (including Arabic, French, Spanish, Japanese, Korean, and dozens of others), and localization workflows. The Chinese-English bilingual capability remains strong, consistent with the series' origins in Alibaba's Chinese-market products.
For reasoning-intensive tasks (competitive mathematics, scientific analysis, multi-step logic problems), the thinking mode makes Qwen3 useful in contexts where accuracy on extended reasoning outweighs response latency. The adjustable thinking budget gives operators control over the latency-accuracy tradeoff without requiring a model swap.
For retrieval and RAG pipelines, Qwen3-Embedding and Qwen3-Reranker provide a matched embedding-and-reranking stack covering 100+ languages with permissive licensing, which makes them a natural choice for builders who already use Qwen3 as the generation model.
For multimodal applications, Qwen3-VL handles document-with-image workflows, video question answering, and chart-and-figure extraction; Qwen3-Omni covers speech-in / speech-out conversational use cases including voice cloning. Both extend the Qwen3 hybrid-thinking design across modalities.
Qwen3 received substantial positive coverage from AI media and the developer community at launch. TechCrunch described the release as underscoring "competitive pressure between Chinese and American AI labs amid U.S. chip export restrictions," placing the models in the context of Alibaba's ability to develop frontier-class models despite restricted access to the most recent Nvidia hardware. CNBC reported on the announcement under the headline "Alibaba Qwen3 AI series: China's latest open-source AI breakthrough."
Within the developer community on Hugging Face and Reddit, Qwen3-30B-A3B attracted particular attention for its cost-quality profile. Commentators noted that a model requiring only 3 billion active parameters per forward pass while delivering reasoning quality competitive with 32B dense models was a practically significant result for operators who self-host inference. The Qwen3-32B dense model gained traction as the go-to choice for users who needed a manageable single-GPU model without the complexity of running the MoE variants.
Hugging Face's head of product highlighted several deployment conveniences: an FP8 checkpoint for faster inference, one-click deployment on Azure ML, and support for local use via MLX on Mac or INT4 quantized builds.
The reception of the hybrid thinking mode was broadly positive, with developers noting that per-turn /think and /no_think controls gave them granular control over latency without requiring separate model deployments for reasoning and non-reasoning tasks. The Qwen Team's later decision to split the unified mode back into Instruct-2507 and Thinking-2507 variants generated mixed reactions: some users preferred the cleaner specialisation, while others lamented the loss of in-conversation soft switching.
Later in 2025, after Moonshot AI released Kimi K2 as a competing open-weight model with strong agentic performance, updated checkpoints of Qwen3-235B-A22B (the -2507 version) outperformed Kimi K2 on GPQA, AIME 2025, and Arena-Hard v2 according to community benchmarks. Some community members observed that the updated Qwen3 checkpoint had "made Kimi K2 irrelevant after only one week, despite being one quarter the size," reflecting the rapid update cadence of the Chinese open-weight model ecosystem.
The Qwen3-Coder release in July 2025 was widely framed by the developer press as a direct competitor to Anthropic's Claude Sonnet 4 on agentic coding workflows. Simon Willison's coverage of Qwen3-Coder-Flash on the day of its release highlighted that the 30B-A3B variant could run usefully on a 64 GB M-series Mac, opening a category of frontier-class agentic coding to laptop users. The Qwen Code CLI's compatibility with Claude Code's plugin protocol made it a near-drop-in replacement for teams that wanted to swap underlying models without reworking developer workflows.
The Thinking-2507 release at the end of July 2025 was covered by VentureBeat under the headline "It's Qwen's summer," reflecting the cumulative effect of the 235B-A22B and Coder and Embedding releases coming within a few-week window. Standard third-party intelligence trackers such as Artificial Analysis and llm-stats began consistently placing Qwen3 in the top tier of open-weight reasoning models from August 2025 onward.
At the time of Qwen3's release, Alibaba reported that the broader Qwen model family had accumulated over 300 million total downloads across Hugging Face, ModelScope, and other distribution channels. The family also had over 100,000 derivative models created by community members on Hugging Face, representing fine-tunes, merges, quantized versions, and task-specific adaptations built on top of Qwen base weights.
Qwen3 models accumulated large download numbers quickly following their April 2025 release. Several variants (particularly Qwen3-8B and Qwen3-32B) appeared among the most-downloaded models on the Hugging Face Hub within weeks, consistent with the pattern of earlier Qwen 2.5 models. The 0.6B and 1.7B models also saw high download volumes from the embedded and mobile development community.
Through mid-2025 the broader Qwen ecosystem continued to expand. Industry trackers reported that Qwen had overtaken Meta's LLaMA as the most-downloaded open-weight model family on Hugging Face at some point in the third quarter of 2025, and that Chinese open-weight models accounted for roughly 63 percent of new uploads on Hugging Face in September 2025, with Qwen3 holding a 40 percent or higher share of derivative work in the broader Hub. Reports of more than 700 million cumulative Qwen downloads on Hugging Face surfaced in late 2025.
Qwen3 powers Alibaba's own Quark AI assistant application in China, where Quark serves over 200 million users with a conversational interface combining search, reasoning, multimodal interactions, and task execution. Alibaba's CEO Wu Yongming stated the company's goal to build "a world-leading full-stack AI provider serving businesses across all industries" while developing AI-native consumer applications.
Qwen3-Coder, released in July 2025, was deployed on Together AI's inference platform for external developers and positioned as a direct competitor to Anthropic's Claude for agentic coding workflows. Amazon Bedrock added Qwen3-Coder-30B-A3B-Instruct as a first-class model card, signalling adoption inside large enterprise cloud accounts. Google Vertex AI's Model Garden hosted Qwen3-VL, which is unusual for a Chinese-developed model and reflects Apache 2.0's broad enterprise acceptability.
Fine-tuning communities have produced specialised Qwen3 derivatives across domains. Notable examples include Qwen3 Swallow, a Japanese-tuned dense series from the Tokyo Institute of Technology's Swallow group, and several finance-, biomedical-, and law-specialised variants from Chinese universities and startups. The Unsloth team published optimised training recipes that let practitioners fine-tune the 30B-A3B MoE on a single 24GB consumer GPU.
Several practical limitations have been identified through community testing and user reports.
Long-context reliability degrades at inputs approaching the stated 128K limit. Community testing found that effective performance is more stable under 64K to 96K tokens. Inputs processed using the YaRN extension mechanism (which rescales position encodings to handle lengths beyond the native training window) perform somewhat worse than inputs within the native window, particularly for the smaller models.
Hallucinations in reasoning traces have been reported, especially on claims about software APIs and programming language behavior. One analysis estimated a roughly 26 percent fabrication rate in reasoning-heavy responses about technical topics, which is higher than typical rates for leading proprietary models. This is consistent with a broader phenomenon in chain-of-thought models, where the model can generate plausible-sounding but incorrect intermediate reasoning steps.
Model identity and knowledge cutoff confusion has been filed as a bug by multiple users. When asked directly about its own name or training cutoff date, Qwen3-32B produces inconsistent answers across sessions, including some that are incorrect. The Qwen Team has acknowledged this in their issue tracker. The model does not have a clearly specified public knowledge cutoff date; community reports suggest answers in the range of October 2023 to December 2024 depending on the session.
Numerical formatting issues have been noted in some Qwen3-32B responses, specifically inconsistent rendering of large numbers with comma separators in structured outputs.
Deployment complexity for the MoE variants is a barrier for self-hosters. While the 235B-A22B model activates only 22 billion parameters per token, loading all 235 billion parameters into memory requires multiple high-memory GPUs (typically 8 x 80GB A100s or equivalent). Individual developers without access to multi-GPU infrastructure must rely on cloud APIs to access the flagship model. The 1M-token extended context, similarly, requires roughly 1 TB of total GPU memory across an inference cluster, making it accessible only to operators with high-end multi-node hardware.
Unified-mode behaviour in the original April 2025 checkpoints sometimes produced think-mode reasoning when the user asked a casual question, or non-thinking direct answers on hard problems where reasoning would have helped. The July 2025 split into Instruct-2507 and Thinking-2507 was Alibaba's response to this issue: by training the two modes as separate models, both became more reliable in their respective regimes, at the cost of losing the soft-switching /think and /no_think toggle within a conversation.
The Qwen3-Max trillion-parameter checkpoint is closed-weight and accessible only through Alibaba Cloud's Model Studio API. Researchers who depend on weight access for interpretability work, fine-tuning, or independent evaluation are restricted to the open-weight 235B-A22B and Next 80B-A3B tiers for the largest fully-open Qwen3 models.
The Qwen Team continued to iterate after the April 2025 release. Beyond the 2507 refresh of the 235B-A22B, 30B-A3B, and 4B models in July 2025, Alibaba introduced Qwen3-Next-80B-A3B in September 2025 with a hybrid attention design that points toward the architecture of future Qwen generations. Qwen3-Max-Preview followed on September 5, 2025 as the first Qwen model to cross one trillion parameters, sitting alongside Qwen3-Max-Thinking variants that target reasoning tasks at the same scale. Qwen3-Omni-Flash-2025-12-01 in December 2025 added latency improvements to the multimodal branch.
In February 2026 Alibaba announced Qwen3.5, the next major generation of the series. Qwen3.5 expands multilingual support to 201 languages and dialects, introduces native multimodal capabilities (text, image, and video understanding within a single architecture), and includes a 397B-A17B MoE flagship together with smaller agentic-task-tuned models including Qwen3.5-122B-A10B and Qwen3.5-27B. Qwen3.5-Omni and Qwen3.6-Plus followed in April 2026, primarily as proprietary API releases. Qwen3 nevertheless remains the de facto baseline open-weight family for most downstream practitioners through 2026, with the 235B-A22B-2507, 30B-A3B-2507, Coder-480B-A35B, and Next-80B-A3B checkpoints continuing to see heavy use.