Qwen3 is the third-generation family of large language models developed by the Qwen Team at Alibaba Cloud (also known as Tongyi Qianwen lab). Released on April 29, 2025, Qwen3 comprises eight open-weight models spanning dense and Mixture of Experts (MoE) architectures, with parameter counts ranging from 0.6 billion to 235 billion. The family introduces a hybrid thinking capability that lets a single model switch between deliberate, chain-of-thought reasoning and fast, non-thinking response modes within the same deployment. All eight base models carry an Apache 2.0 license, making them freely available for commercial and research use.
At launch, the flagship Qwen3-235B-A22B outperformed OpenAI's o3-mini on AIME math benchmarks and Codeforces programming contests, and performed comparably to Google's Gemini 2.5 Pro on several reasoning tasks. The family was trained on 36 trillion tokens covering 119 languages and dialects, double the dataset used for the previous Qwen 2.5 generation. By the time Qwen3 launched, the broader Qwen family had accumulated over 300 million downloads and more than 100,000 derivative models on Hugging Face. Throughout 2025 the series expanded into a sprawling ecosystem of subfamilies covering coding (Qwen3-Coder), vision-language (Qwen3-VL), audio and video (Qwen3-Omni), retrieval (Qwen3-Embedding and Qwen3-Reranker), the hybrid-attention Qwen3-Next, and the trillion-parameter proprietary Qwen3-Max. By September 2025 the broader Qwen lineage had overtaken Meta's LLaMA as the most-downloaded open-weight model family on Hugging Face according to Stanford's tracking, with several hundred thousand community derivatives on the Hub.
Alibaba Cloud's Qwen Team began releasing language models in 2023 under the Tongyi Qianwen brand. The original Qwen 1.0 models, released mid-2023, included dense models at 1.8B, 7B, 14B, and 72B parameter sizes, pretrained primarily on Chinese and English text totalling approximately 3 trillion tokens. Alibaba had introduced Tongyi Qianwen as a beta service in April 2023 before opening it publicly in September 2023 following regulatory clearance in China.
Qwen 1.5, released in early 2024, expanded the size range to 0.5B through 72B and introduced an early MoE variant (14B total, 2.7B active). It was treated internally as a beta release leading into Qwen 2, and the two generations share architectural lineage.
Qwen 2, released in June 2024, adopted Grouped Query Attention (GQA) across all model sizes. GQA reduces key-value cache memory demands and improves inference throughput compared to standard multi-head attention. The series scaled up to a 72B open-weight model, with Alibaba keeping its largest commercial variants proprietary through the Model Studio API.
Qwen 2.5, released in September 2024, was the largest Qwen open-weight release up to that point. Pretraining data scaled from the 7 trillion tokens used for Qwen 2 to 18 trillion tokens, and the release introduced specialized branches: Qwen2.5-Coder for software development tasks and Qwen2.5-Math for mathematical reasoning. The 72B Qwen2.5-Instruct model gained wide adoption for deployment on single-GPU server configurations, and the Qwen series established itself as one of the dominant open-weight lineages in the global open-source AI community. The Qwen 2.5 generation also supported 29 languages, a significant multilingual improvement over earlier versions.
Qwen3 was previewed in a series of teasers from the Qwen Twitter account in late April 2025. The flagship Qwen3-235B-A22B was first announced on April 28, 2025, and the full open-weight release including dense models from 0.6B to 32B and the 30B-A3B MoE went live on April 29, 2025. The release doubled pretraining data again to 36 trillion tokens, added genuine large-scale MoE models (30B-A3B and 235B-A22B), extended multilingual support to 119 languages, and introduced the hybrid thinking architecture that distinguishes the generation from its predecessors.
Qwen3 in its initial April 2025 release consists of six dense models and two MoE models. Dense models activate all their parameters during inference; MoE models activate only a fraction of their total parameters per forward pass, reducing compute requirements while maintaining or exceeding the performance of larger dense models.
| Model | Type | Total parameters | Active parameters | Native context | Extended context |
|---|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K | 32K |
| Qwen3-4B | Dense | 4B | 4B | 32K | 128K (YaRN) |
| Qwen3-8B | Dense | 8B | 8B | 32K | 128K (YaRN) |
| Qwen3-14B | Dense | 14B | 14B | 32K | 128K (YaRN) |
| Qwen3-32B | Dense | 32B | 32B | 32K | 128K (YaRN) |
| Qwen3-30B-A3B | MoE | 30B | 3B | 32K | 128K (YaRN) |
| Qwen3-235B-A22B | MoE | 235B | 22B | 32K | 128K (YaRN) |
The two smallest models (0.6B and 1.7B) target edge deployment and on-device inference scenarios where memory is constrained. They support a hard 32K token context ceiling. The 4B through 32B dense models cover consumer GPU and server GPU configurations; the 4B model runs on hardware with as little as 8GB VRAM when quantized to 4-bit precision, and the 32B fits on a single 80GB A100 or equivalent.
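As a rough illustration of these hardware claims, weight memory scales linearly with parameter count and bits per parameter. The sketch below (Python; illustrative only, ignoring the KV cache and activation overhead, which add several gigabytes depending on batch size and context length) reproduces the figures cited above.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Illustrative only: KV cache, activations, and runtime overhead are not counted.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1024**3

for name, params, bits in [
    ("Qwen3-4B   @ 4-bit", 4, 4),     # ~1.9 GB  -> fits in 8 GB VRAM with headroom
    ("Qwen3-32B  @ BF16 ", 32, 16),   # ~60 GB   -> fits on a single 80 GB A100
    ("Qwen3-235B @ BF16 ", 235, 16),  # ~438 GB  -> requires a multi-GPU node
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```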
The 30B-A3B MoE model activates only 3 billion parameters per forward pass despite having 30 billion total parameters stored in memory. In compute cost during inference, it is roughly equivalent to a 3B dense model while retaining substantially more capacity due to the conditional routing mechanism. The Qwen Team reports that Qwen3-30B-A3B outperforms QwQ-32B on several reasoning benchmarks despite having only one-tenth as many active parameters.
The flagship 235B-A22B MoE activates 22 billion parameters per token from a total of 235 billion stored parameters. All models are available for download under Apache 2.0 on Hugging Face, ModelScope, GitHub, and Kaggle.
The Qwen Team also reports that the 4B Qwen3 dense model, with knowledge distillation from the larger checkpoints, can rival Qwen2.5-72B-Instruct on many tasks despite using roughly one-eighteenth as many parameters. This compression result is one of the headline efficiency claims tied to the release.
The official Qwen3 technical report (arXiv:2505.09388) lists the following layer and attention-head configurations:
| Model | Layers | Q heads | KV heads | Experts (total / active) |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16 | 8 | n/a |
| Qwen3-1.7B | 28 | 16 | 8 | n/a |
| Qwen3-4B | 36 | 32 | 8 | n/a |
| Qwen3-8B | 36 | 32 | 8 | n/a |
| Qwen3-14B | 40 | 40 | 8 | n/a |
| Qwen3-32B | 64 | 64 | 8 | n/a |
| Qwen3-30B-A3B | 48 | 32 | 4 | 128 / 8 |
| Qwen3-235B-A22B | 94 | 64 | 4 | 128 / 8 |
All variants use Grouped Query Attention, SwiGLU feed-forward blocks, and Rotary Positional Embeddings. The MoE layers in the 30B and 235B models contain 128 experts each, with 8 routed experts per token, plus fine-grained expert segmentation that splits the feed-forward dimension into smaller specialised units.
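The following minimal sketch (PyTorch; hidden sizes are illustrative, and projection layers and causal masking are omitted) shows the core of Grouped Query Attention: a small set of key/value heads is shared across a larger set of query heads, which is what shrinks the KV cache relative to standard multi-head attention.

```python
# Minimal GQA sketch: n_q_heads query heads attend over n_kv_heads shared KV heads.
# The KV cache shrinks by a factor of n_q_heads / n_kv_heads versus multi-head attention.
import torch

def gqa(x_q, x_kv, n_q_heads=32, n_kv_heads=8, head_dim=128):
    b, t, _ = x_q.shape
    q = x_q.view(b, t, n_q_heads, head_dim).transpose(1, 2)    # (b, Hq,  t, d)
    k = x_kv.view(b, t, n_kv_heads, head_dim).transpose(1, 2)  # (b, Hkv, t, d)
    v = x_kv.view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(b, t, n_q_heads * head_dim)

q_in = torch.randn(1, 16, 32 * 128)   # output of the query projection
kv_in = torch.randn(1, 16, 8 * 128)   # output of the shared key/value projection
print(gqa(q_in, kv_in).shape)         # torch.Size([1, 16, 4096])
```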
One of the defining design choices in Qwen3 is integrating thinking and non-thinking capabilities within a single model rather than shipping separate reasoning and chat variants. Earlier systems typically required users to pick between a fast general-purpose chat model and a slower but more deliberate reasoning model. In Qwen3, both behaviors exist in the same weights and can be selected at inference time.
In thinking mode, the model generates an internal chain-of-thought (CoT) trace before producing its final answer. This reasoning trace appears between special <think> and </think> delimiters in the raw output. The thinking trace can extend to approximately 38,000 tokens for complex problems, allowing the model to explore multiple reasoning paths, check its work, and revise conclusions before committing to an answer. Users can set a budget cap to limit how long the model thinks, enabling a direct tradeoff between latency and answer quality.
In non-thinking mode, the model responds directly without visible reasoning steps, prioritizing speed and conciseness. This is suited for conversational exchanges, simple lookups, and applications where response time matters more than deliberative accuracy.
Users switch between modes in two ways. At the API level, the enable_thinking parameter controls the mode for a given call. Within a conversation, /think and /no_think tokens can be placed inline in messages to enable or disable thinking for subsequent turns. This per-turn control is called soft switching and allows mixed conversations where some questions receive full reasoning and others receive quick answers.
Sampling recommendations differ by mode. For thinking mode, the Qwen Team recommends temperature 0.6, top-p 0.95, and top-k 20, with greedy decoding explicitly discouraged (it degrades output quality in the thinking regime). For non-thinking mode, temperature 0.7, top-p 0.8, and top-k 20 are recommended.
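A minimal self-hosting sketch of both controls, following the usage pattern shown on the Qwen3 Hugging Face model cards; the model ID, prompt, and token budget here are illustrative, and the enable_thinking flag is the chat-template argument described above.

```python
# Sketch of thinking-mode control with Hugging Face transformers, following the
# usage pattern on the Qwen3 model cards. Model ID and limits are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # any Qwen3 checkpoint with the hybrid chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False (or append /no_think to the message) for fast mode
)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Recommended thinking-mode sampling: temperature 0.6, top-p 0.95, top-k 20.
output = model.generate(**inputs, max_new_tokens=4096,
                        do_sample=True, temperature=0.6, top_p=0.95, top_k=20)
text = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False)

# The reasoning trace arrives between <think> and </think>; split it off before
# showing the final answer, and strip any trailing chat-template markers.
thinking, _, answer = text.partition("</think>")
print(answer.replace("<|im_end|>", "").strip())
```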
This dual-mode design was achieved through a four-stage post-training pipeline described in the Training section. By July 2025 the Qwen Team revised this design philosophy: the updated 235B-A22B-2507 checkpoints split the unified model back into two specialised variants (Instruct-2507 and Thinking-2507), with each variant optimized solely for one mode. The team explained that decoupling produced higher quality on each individual axis than the original mixed approach, even though it sacrificed the soft-switching convenience.
The Qwen3 dense models share a standard transformer architecture with several modifications introduced progressively across the Qwen series:
The two MoE variants (30B-A3B and 235B-A22B) share the same base architecture as the dense models but replace standard feed-forward layers with mixture-of-experts layers. Each MoE layer contains 128 total expert networks. During a forward pass, a learned router selects 8 of these 128 experts for each token, keeping computation constant regardless of how many experts exist in total.
Fine-grained expert segmentation divides the feed-forward dimensions into smaller units per expert, enabling more targeted specialization. Global-batch load balancing distributes token routing across experts to prevent any single expert from becoming a bottleneck during training or inference.
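A minimal sketch of this routing scheme (PyTorch; hidden and expert dimensions are illustrative, and the load-balancing objective is omitted):

```python
# Top-k expert routing as used in the Qwen3 MoE layers: a learned router scores
# 128 experts per token, the top 8 are evaluated, and their outputs are combined
# with the normalized router weights. Sizes here are illustrative, not the real config.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_expert=256, n_experts=128, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)  # small per-expert FFN: fine-grained segmentation
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)          # 8 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```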
The flagship 235B-A22B MoE model has 94 transformer layers, 64 attention heads for queries, 4 heads for key-value, and uses BF16 tensor precision. It has 235 billion total parameters, of which 234 billion are non-embedding parameters.
The Qwen Team reports that the 30B-A3B MoE achieves comparable performance to the Qwen2.5-72B dense model on most benchmarks while activating only about 3 billion parameters per token, a small fraction of the dense model's 72 billion. This translates directly to lower inference compute cost when measured in floating-point operations per token.
Context lengths range from 32K tokens (0.6B and 1.7B models) to 128K tokens (8B through 235B). Extended context is enabled via YaRN (Yet another RoPE extensioN) scaling with a scaling factor of 4.0, and Dual Chunk Attention, which improves performance on inputs that exceed the training sequence length.
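The Qwen3 model cards describe enabling the extension by adding a rope_scaling block to the model configuration; the sketch below shows the equivalent transformers-side setup. The model ID and kwarg plumbing are assumptions to verify against the card, which also advises enabling static YaRN only when long inputs are expected because it can slightly degrade short-context quality.

```python
# Sketch of enabling the YaRN long-context extension (factor 4.0) when loading a
# Qwen3 checkpoint with transformers, mirroring the rope_scaling block described
# on the model cards. Illustrative; verify field names against the official card.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-8B"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 32K native window x 4 = 128K
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(model_id, config=config,
                                             torch_dtype="auto", device_map="auto")
```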
Community testing has found that effective performance is higher for inputs under 64K to 96K tokens even on models rated for 128K, and that YaRN-scaled inputs (which stress the extension mechanism) perform somewhat worse than inputs that fit within the native window. Users running contexts longer than 64K should expect some degradation relative to shorter inputs.
The later 2507 refresh of the 235B-A22B model lifts the native window to 262,144 tokens (256K) and supports extension up to roughly 1,010,000 tokens through Dual Chunk Attention combined with MInference sparse attention. Alibaba reports a roughly 3x speedup at 1M-token sequence lengths from MInference relative to dense attention, although VRAM requirements scale to the order of 1 TB across multiple GPUs at the longest setting.
Qwen3 was pretrained on approximately 36 trillion tokens, roughly double the 18 trillion tokens used for Qwen 2.5. The dataset spans 119 languages and dialects organized across Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian, and other language families. This extended multilingual coverage compares to Qwen 2.5's 29 languages.
The training data composition includes web text, books, academic papers, code repositories, mathematical content, and synthetic data. Rather than relying solely on crawled web data, the Qwen Team developed a multilingual annotation system that classified over 30 trillion tokens by educational value, domain, and safety dimensions. This labeling informed how tokens were weighted and sampled during training.
Synthetic data was generated using earlier Qwen models as a "data factory." Qwen2.5-Math generated mathematical textbooks and question-answer pairs. Qwen2.5-Coder generated synthetic code snippets and programming exercises. Qwen2.5-VL was fine-tuned to extract clean text from PDF documents, with Qwen2.5 then used to refine the OCR output before ingestion. This pipeline allowed Alibaba to convert large volumes of structured documents (scientific papers, technical manuals, textbooks) into clean training text.
Pretraining proceeded across three stages: a general stage covering more than 30 trillion tokens at a 4K-token context length to build broad language ability and world knowledge; a knowledge-intensive stage of roughly 5 trillion additional tokens with an increased share of STEM, coding, and reasoning data; and a final long-context stage using high-quality long documents to extend the context window to 32K tokens.
To produce the hybrid thinking behavior, Qwen3 uses a four-stage post-training pipeline: (1) a long chain-of-thought cold start, with supervised fine-tuning on curated long reasoning traces spanning mathematics, coding, logic, and STEM; (2) reasoning-focused reinforcement learning with rule-based, verifiable rewards and scaled-up rollout compute; (3) thinking mode fusion, which blends non-thinking instruction data into the reasoning model so that a single checkpoint supports both modes; and (4) general reinforcement learning across a broad set of tasks to strengthen instruction following, agent and tool use, and safety while correcting undesired behaviors.
Post-training data curation used a two-phase filtering approach for reasoning data. The query filter removed prompts that were unverifiable or solvable without reasoning. The response filter removed examples with incorrect answers, internal inconsistencies, repetitive traces, or indicators of guesswork. Human annotators assessed cases that automated verifiers could not resolve.
Knowledge distillation from the flagship 235B-A22B teacher into the smaller dense and MoE checkpoints is a key efficiency lever in the Qwen3 pipeline. The technical report describes a strong-to-weak distillation regime in which smaller students learn from teacher logits during post-training, which is what allows the 4B and 8B dense models to reach quality levels that previously required substantially larger checkpoints in earlier Qwen generations.
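The report does not reproduce the exact objective, but the standard formulation of logit distillation conveys the idea. The sketch below (PyTorch) is a textbook KL-matching loss, not the Qwen Team's recipe; the vocabulary size and temperature are chosen for illustration.

```python
# Generic logit-distillation sketch: the student is trained to match the teacher's
# temperature-softened output distribution via a KL term. Textbook formulation only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # scale by t^2 so gradient magnitude stays comparable across temperatures
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

# vocabulary-sized logits for a batch of 4 next-token predictions (~152K vocab)
student = torch.randn(4, 151_936, requires_grad=True)
teacher = torch.randn(4, 151_936)
loss = distillation_loss(student, teacher)
loss.backward()
```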
All eight models in the Qwen3 base family are released under the Apache 2.0 license. This allows commercial use, modification, and redistribution without requiring licensees to open-source their own derived work, subject to preservation of the Apache 2.0 copyright notice and attribution statement.
The Apache 2.0 license distinguishes Qwen3 from several competing open-weight releases. Meta's LLaMA series uses a custom community license that restricts services with more than 700 million monthly active users. Moonshot AI's Kimi K2, released in 2025, uses a modified MIT license that requires prominent "Kimi K2" attribution in the product interface for commercial deployments exceeding 100 million monthly active users or $20 million in monthly revenue. The DeepSeek series uses the MIT license, which is also permissive. Qwen3's Apache 2.0 is the standard choice in the enterprise open-source software world and carries no graduated commercial restrictions.
The Apache 2.0 designation extends to the later Qwen3 subfamilies that ship public weights: Qwen3-Coder (480B-A35B and 30B-A3B), the Qwen3-VL dense and MoE checkpoints, the Qwen3-Omni 30B-A3B variants (Instruct, Thinking, and Captioner), Qwen3-Next-80B-A3B (Instruct and Thinking), and the Qwen3-Embedding and Qwen3-Reranker series at 0.6B, 4B, and 8B sizes. The proprietary Qwen3-Max is the only headline branch that ships through API access only without weights.
The following table shows performance of Qwen3-235B-A22B and Qwen3-32B against comparable models at the time of release. Scores are taken from the Qwen3 technical report (arXiv:2505.09388) and the official Alibaba blog post.
| Benchmark | Qwen3-235B-A22B | Qwen3-32B | Qwen3-30B-A3B | DeepSeek-R1 | GPT-4o | o3-mini | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | 68.18 | 65.54 | 61.49 | -- | -- | -- | -- |
| GPQA (5-shot CoT) | 47.47 | 49.49 | 43.94 | -- | -- | -- | -- |
| AIME 2024 | 85.7 | 79.0 | -- | 79.8 | 9.3 | 63.6 | ~92 |
| AIME 2025 | 81.5 | -- | -- | -- | -- | -- | -- |
| LiveCodeBench v5 | 70.7 | 65.6 | -- | 65.9 | 32.3 | 53.8 | -- |
| Codeforces rating | 2,056 | 1,977 | -- | 1,870 | 759 | 1,258 | -- |
| BFCL v3 (tool use) | 70.8 | 70.0 | -- | 37.0 | 50.1 | 48.4 | -- |
| SWE-Bench Pro | 21.4 | -- | -- | -- | -- | -- | -- |
The BFCL v3 (Berkeley Function-Calling Leaderboard) scores are particularly notable: both the 235B and 32B models score roughly double what DeepSeek-R1 achieves, reflecting Alibaba's explicit training emphasis on function-calling and agentic capabilities. The gap relative to GPT-4o on this benchmark is also large.
On AIME 2024, the 235B model at 85.7 substantially exceeds o3-mini (63.6) and DeepSeek-R1 (79.8), placing it in competitive range with Gemini 2.5 Pro. For coding, Qwen3-32B at 1,977 Codeforces rating surpasses what OpenAI's o1 model achieved at the same point in time.
For math reasoning that requires extended thinking, both the 235B and 32B models perform well above GPT-4o's baseline of 9.3 on AIME 2024, which reflects that GPT-4o was not a reasoning-specialized model at the time of comparison.
The Qwen3 technical report also shows that the 235B-A22B base model outperforms DeepSeek-V3 Base across 14 of 15 benchmarks while having roughly one-third the total parameters and two-thirds the activated parameters of DeepSeek-V3.
The July 2025 Thinking-2507 refresh raised these numbers further. According to the Hugging Face model card for Qwen3-235B-A22B-Thinking-2507, scores on AIME 2025, GPQA Diamond, and MMLU-Pro climb substantially over the April release:
| Benchmark | 235B-A22B (April 2025) | 235B-A22B-Thinking-2507 |
|---|---|---|
| AIME 2025 | 81.5 | 92.3 |
| GPQA Diamond | -- | 81.1 |
| MMLU-Pro | 68.18 | 84.4 |
| HMMT 2025 | -- | 83.9 |
| LiveCodeBench v6 | -- | 74.1 |
| LiveBench (2024-11-25) | -- | 78.4 |
| BFCL v3 | 70.8 | 71.9 |
| TAU1-Retail | -- | 67.8 |
| MultiIF | -- | 80.6 |
| PolyMATH | -- | 60.1 |
VentureBeat reported that Qwen3-235B-A22B-Thinking-2507 "tops OpenAI, Gemini reasoning models on key benchmarks," with the AIME 2025 score of 92.3 leading all reported models on that competition at the time.
The non-thinking Qwen3-235B-A22B-Instruct-2507 reaches 70.3 on AIME 2025, 77.5 on GPQA, 83.0 on MMLU-Pro, 79.2 on Arena-Hard v2, 88.7 on IFEval, 70.9 on BFCL v3, and 51.8 on LiveCodeBench v6. Those are non-thinking-mode numbers, so they sit below the Thinking-2507 results on reasoning-heavy benchmarks but ahead on instruction-following and arena-style preference scoring.
Qwen3 expanded steadily through 2025 into a broad set of subfamilies, each branded with a suffix that signals modality, deployment tier, or training focus.
Qwen3-Max is the commercial API branding used by Alibaba Cloud's Model Studio for the most capable model in the family. The original April 2025 launch tied this label to Qwen3-235B-A22B. Beginning with Qwen3-Max-Preview on September 5, 2025, the Max line moved to a separate trillion-plus-parameter MoE model that is closed-weight and accessible only through the API. Alibaba describes Qwen3-Max-Preview as the company's first model with more than one trillion parameters, trained on roughly 36 trillion tokens. Internal Alibaba benchmarks at announcement showed Qwen3-Max-Preview ahead of Qwen3-235B-A22B-2507 across the company's reasoning suite.
Qwen3-Max powers Alibaba's Quark AI super-assistant application in China, which serves over 200 million users. Quark integrates deep search, photo-based problem solving, AI writing, multimodal interactions (photo editing, AI camera), and task execution built on Qwen's reasoning capabilities. Model Studio also offers Qwen3-Plus and Qwen3-Turbo tiers, which correspond to smaller models with lower per-token pricing. The tiered API structure follows the pattern established by major cloud AI providers, allowing developers to select cost-performance tradeoffs appropriate for their workloads.
Qwen3-Coder is a code-specialized branch of the Qwen3 family. The flagship variant, Qwen3-Coder-480B-A35B-Instruct, was released on July 22, 2025, roughly three months after the base family. It has 480 billion total parameters and 35 billion active parameters, with a native context of 256K tokens and support for up to 1 million tokens via YaRN extrapolation. A smaller 30B-A3B coder, sometimes referred to as Qwen3-Coder-Flash in Alibaba's chat interface, followed on July 31, 2025 with 30.5 billion total parameters and 3.3 billion active. Both models are Apache 2.0.
Qwen3-Coder was designed specifically for agentic coding scenarios involving multi-turn tool use, code execution, and long-horizon repository-level problem solving. Pretraining used 7.5 trillion tokens with a 70 percent code-data ratio. During post-training, Alibaba ran Code RL (execution-driven reinforcement learning) and Agent RL (long-horizon reinforcement learning against real code execution environments), scaling training across up to 20,000 parallel environments. Verifiable coding tasks (where correctness can be checked automatically) were preferred, following the "hard to solve, easy to verify" principle from RLVR (Reinforcement Learning with Verifiable Rewards).
At launch, Alibaba reported that Qwen3-Coder achieved state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling, with the Qwen Team describing its performance as comparable to Claude Sonnet 4 on agentic coding tasks. An open-source CLI tool called Qwen Code was released alongside the model, adapted from Google's open-source Gemini CLI with customised prompts and function-calling protocols. Qwen Code is compatible with established agent interfaces including Claude Code and Cline. The model is also exposed through OpenAI SDK compatibility on Alibaba Cloud Model Studio, which lets existing OpenAI-API code swap base URL and key without further changes.
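A hedged sketch of that compatibility path using the official OpenAI Python client; the base URL follows Alibaba's documented international endpoint pattern, and the model name is a placeholder to check against the Model Studio console.

```python
# Calling Qwen3-Coder through Model Studio's OpenAI-compatible mode.
# Base URL and model ID are assumptions; substitute the values your console lists.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # international endpoint
)

response = client.chat.completions.create(
    model="qwen3-coder-plus",  # hosted Qwen3-Coder tier; check the console for exact IDs
    messages=[{"role": "user", "content": "Write a Python function that parses RFC 3339 timestamps."}],
)
print(response.choices[0].message.content)
```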
Qwen3-Coder is hosted on Together AI, Fireworks AI, NVIDIA NIM, Amazon Bedrock, OpenRouter, and Hyperbolic, among other inference providers. The 30B-A3B variant runs on a 32GB or 64GB Mac via MLX or LM Studio at 4-bit quantization, which made it widely accessible to individual developers without server GPUs.
Qwen3-VL is the vision-language branch of the Qwen3 series. The flagship Qwen3-VL-235B-A22B-Instruct and Qwen3-VL-235B-A22B-Thinking checkpoints were released on September 23, 2025. The 30B-A3B variants followed on October 4, 2025, and dense versions at 2B, 4B, 8B, and 32B were rolled out alongside the MoE branch. The Qwen3-VL technical report (arXiv:2511.21631) was published in November 2025.
Qwen3-VL accepts interleaved text, image, and video inputs and provides a native context window of 256K tokens for multimodal content. The family mirrors the base Qwen3 architecture, offering both dense and MoE variants. Qwen3-VL retains the hybrid thinking capability from the base LLMs and delivers strong performance on single-image, multi-image, and video understanding benchmarks. The model's long native context allows it to handle lengthy document inputs (mixed text and images), multi-frame video clips, and multi-image reasoning tasks that would overflow shorter-context vision models. Qwen3-VL-2B-Instruct surpassed 18 million Hugging Face downloads in its first weeks, reflecting widespread interest in a small multimodal model with permissive licensing.
Qwen3-Omni is the audio and video branch of the family, released on September 22, 2025. The published checkpoints are Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner, all under Apache 2.0. Qwen3-Omni accepts text, image, audio, and video inputs and produces text and natively streamed speech output. Speech recognition is supported across 113 languages and dialects, and speech generation across 36 languages. The Qwen3-Omni technical report (arXiv:2509.17765) describes a single end-to-end multimodal model that maintains state-of-the-art performance across modalities without degradation relative to single-modal counterparts.
Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source state-of-the-art on 32 benchmarks and overall state-of-the-art on 22, outperforming closed-source competitors including Gemini 2.5 Pro, Seed-ASR, and GPT-4o-Transcribe. The model exposes a voice cloning interface where users can upload a sample voice and have responses synthesised in that voice through the API. A later Qwen3-Omni-Flash-2025-12-01 variant added quality and latency improvements aimed at near real-time deployment.
The Qwen3-Embedding and Qwen3-Reranker series were released on June 5, 2025, both under Apache 2.0. They cover three sizes (0.6B, 4B, and 8B) and target retrieval, classification, semantic search, and reranking workloads. The 8B Qwen3-Embedding ranked first on the MTEB multilingual leaderboard at release with a score of 70.58, and the Qwen3-Reranker-4B reached 69.76 on MTEB-R, with the 8B at 69.02. The series supports more than 100 natural and programming languages, and models are designed using dual-encoder (for embedding) and cross-encoder (for reranking) architectures, with LoRA-based fine-tuning over the Qwen3 base. A multimodal companion family, Qwen3-VL-Embedding and Qwen3-VL-Reranker, was released later at 2B and 8B sizes built on the Qwen3-VL backbone.
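A brief retrieval sketch using the sentence-transformers integration described on the Qwen3-Embedding model cards; the model ID and the query-prompt handling are assumptions to verify against the card.

```python
# Retrieval step with Qwen3-Embedding via sentence-transformers (>= 3.0).
# Model ID and prompt_name usage follow the model card but should be verified.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
documents = [
    "Beijing is the capital of the People's Republic of China.",
    "Gravity is a fundamental interaction between masses.",
]

query_emb = model.encode(queries, prompt_name="query")  # queries use an instruction prompt
doc_emb = model.encode(documents)
scores = model.similarity(query_emb, doc_emb)           # cosine similarity matrix
print(scores)  # the Beijing sentence should score highest for the query
```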
Qwen3-Next-80B-A3B was released on September 11, 2025 in two variants: Qwen3-Next-80B-A3B-Instruct (non-thinking) and Qwen3-Next-80B-A3B-Thinking. Both ship with Apache 2.0 weights. The Next series is the first installment in a redesigned architecture line that pairs gated DeltaNet linear-time attention blocks with gated softmax attention blocks (a hybrid attention design), high-sparsity MoE layers, zero-centred and weight-decayed layernorm for stability, and Multi-Token Prediction for accelerated decoding. Native context length is 262,144 tokens, extendable past 1 million via YaRN. Alibaba reports that Qwen3-Next-80B-A3B-Instruct performs on par with Qwen3-235B-A22B-Instruct-2507 on many benchmarks while activating only 3 billion parameters per token, with particularly strong results on ultra-long-context tasks at and beyond the 256K mark.
In late July 2025, Alibaba split the unified-mode 235B-A22B, 30B-A3B, and 4B checkpoints into dedicated Instruct-2507 and Thinking-2507 variants. Qwen3-235B-A22B-Instruct-2507 launched first on July 21, 2025, with Qwen3-235B-A22B-Thinking-2507 following a few days later. Both raised native context to 262,144 tokens (256K) with 1M-token extension via Dual Chunk Attention plus MInference. The Thinking-2507 variant is reasoning-only (its chat template inserts <think> tags automatically) and the Instruct-2507 variant suppresses think blocks entirely. Smaller Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, Qwen3-4B-Instruct-2507, and Qwen3-4B-Thinking-2507 checkpoints followed in the same window.
Qwen3 is one of the most broadly deployed open-weight LLM families on the market. The official ecosystem covers cloud APIs, self-hosting frameworks, edge runtimes, and fine-tuning toolkits.
| Surface | Tool / platform | Notes |
|---|---|---|
| Cloud API | Alibaba Cloud Model Studio | OpenAI-compatible endpoints for Qwen3-Max, Plus, Turbo, Coder, VL, Omni |
| Cloud API | OpenRouter, Together AI, Fireworks AI, Hyperbolic, DeepInfra, Cerebras | Third-party hosted endpoints with competitive per-token rates |
| Cloud API | Amazon Bedrock | Qwen3-Coder-30B-A3B-Instruct first-class model card |
| Cloud API | Google Vertex AI Model Garden | Qwen3-VL hosted by Google |
| Cloud API | NVIDIA NIM (build.nvidia.com) | Qwen3-Coder-480B-A35B-Instruct served on NVIDIA infrastructure |
| Model hub | Hugging Face | All open-weight checkpoints, FP8 versions, GGUF community ports |
| Model hub | ModelScope | Alibaba's Chinese-market hub, mirrors HF releases |
| Model hub | Kaggle | Listed at launch alongside HF for ML-competition access |
| Inference server | vLLM (>= 0.8.4) | Production-grade tensor-parallel serving |
| Inference server | SGLang (>= 0.4.6.post1) | Structured-decoding inference framework |
| Inference server | TensorRT-LLM | NVIDIA's optimized server stack |
| Local runtime | Ollama | One-line `ollama run qwen3` for dense and MoE checkpoints |
| Local runtime | LM Studio | GUI loader with GGUF model browsing |
| Local runtime | llama.cpp | GGUF inference, including 4-bit and 8-bit quantization |
| Local runtime | KTransformers | CPU+GPU hybrid inference for 235B-class MoE on workstation hardware |
| Apple Silicon | MLX, MLX-LM | Native Metal-backed inference on Mac |
| Edge / mobile | ExecuTorch, MNN, OpenVINO | Embedded and mobile runtimes for the 0.6B and 1.7B dense models |
| Fine-tuning | Axolotl, Unsloth, ms-Swift, LLaMA-Factory | Standard open-source fine-tuning frameworks with Qwen3 templates |
FP8 checkpoints for the larger MoE models reduce memory and bandwidth requirements relative to BF16, and quantized INT4 builds are published for local use.
Qwen3 models are available through Alibaba Cloud Model Studio, with pricing varying by region, model size, and whether the thinking mode is in use. Thinking-mode outputs are priced higher because they produce substantially more output tokens (the reasoning trace counts toward the output token bill). A 50 percent discount applies to asynchronous batch jobs. The following prices are for the international Singapore region.
| Model | Input (per M tokens) | Output, non-thinking (per M tokens) | Output, thinking (per M tokens) |
|---|---|---|---|
| Qwen3-235B-A22B | $0.70 | $2.80 | $8.40 |
| Qwen3-32B | $0.16 | $0.64 | $0.64 |
| Qwen3-30B-A3B | $0.20 | $0.80 | $2.40 |
For the Chinese mainland (Beijing region), prices are considerably lower due to local infrastructure and currency factors. Qwen3-235B input tokens cost approximately $0.29/M and non-thinking output costs approximately $1.15/M in the Beijing region.
Qwen3-Max, the branding for the trillion-parameter API model in Model Studio, uses tiered pricing that scales with prompt length across three ranges (0-32K, 32K-128K, and 128K-252K tokens), with the highest tier reaching approximately $3.00/M for input and up to $15.00/M for output in thinking mode.
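As a worked example of how thinking-mode billing adds up under the Singapore rates listed above (token counts are hypothetical; the reasoning trace is billed as output, which is why thinking calls cost more):

```python
# Back-of-the-envelope cost of a thinking-mode call to Qwen3-235B-A22B at the
# Singapore-region rates in the table above. Token counts are hypothetical.
def call_cost_usd(input_tokens, output_tokens, in_rate=0.70, out_rate=8.40):
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# 2K-token prompt, 6K tokens of reasoning trace plus a 1K-token final answer
print(f"${call_cost_usd(2_000, 7_000):.4f}")  # ~= $0.0602 per call
```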
Third-party inference providers including Fireworks AI, Together AI, Cerebras, and Hyperbolic also host Qwen3 models, with pricing that varies by provider and may differ from Alibaba's direct rates. The open-weight Apache 2.0 release means any provider can self-host and offer the models without licensing fees, which has driven competitive pricing across the ecosystem. Cerebras began serving Qwen3-235B-A22B-Instruct-2507 on its wafer-scale inference hardware in August 2025, advertising sub-second time-to-first-token at 1,000+ tokens per second, which is unusually high for a 235B-class MoE model.
The table below compares Qwen3-235B-A22B with other notable models available at or near the April 2025 release date on selected benchmarks:
| Model | Developer | Open weight | AIME 2024 | LiveCodeBench v5 | BFCL v3 | License |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | Alibaba | Yes | 85.7 | 70.7 | 70.8 | Apache 2.0 |
| DeepSeek-R1 | DeepSeek | Yes | 79.8 | 65.9 | 37.0 | MIT |
| Qwen3-32B | Alibaba | Yes | 79.0 | 65.6 | 70.0 | Apache 2.0 |
| LLaMA 4 Maverick | Meta | Yes | -- | -- | -- | Llama 4 Community |
| GPT-4o | OpenAI | No | 9.3 | 32.3 | 50.1 | Proprietary |
| o3-mini | OpenAI | No | 63.6 | 53.8 | 48.4 | Proprietary |
| Gemini 2.5 Pro | Google | No | ~92 | -- | -- | Proprietary |
For math and coding tasks requiring extended reasoning, Qwen3-235B-A22B performs competitively with proprietary models available at the same time and leads the open-weight category. The BFCL tool-use advantage is particularly pronounced: Qwen3 scores roughly double DeepSeek-R1 and substantially above GPT-4o on this benchmark, consistent with Qwen3's training emphasis on agentic function-calling.
For general instruction following and subjective quality tasks, proprietary models such as GPT-4o and Claude maintained advantages according to human preference evaluations at launch, though the gap narrowed with Qwen3 relative to earlier Qwen generations. By the July 2025 Thinking-2507 update, the 235B model overtook several proprietary models on AIME 2025 and on specific GPQA Diamond and LiveCodeBench v6 metrics, with multiple commentators describing it as the strongest open-weight reasoning model in the public ecosystem at that point.
On a compute-adjusted basis, Qwen3-30B-A3B (3B active parameters) performs comparably to models requiring 7-10x more active parameters per forward pass. For operators who self-host models and pay for GPU time by the operation, this active-parameter efficiency translates directly to cost savings.
Given the range of sizes and the hybrid thinking mode, Qwen3 targets several distinct deployment scenarios.
For edge and on-device inference, the 0.6B and 1.7B models fit into the memory constraints of mobile devices and embedded systems. The 4B model runs on consumer GPUs with 8GB VRAM when quantized, and on Apple Silicon via the MLX framework. This makes Qwen3 practically accessible to developers without access to cloud GPU infrastructure.
For self-hosted research and enterprise deployments, the 8B through 32B models are sized for single-GPU and multi-GPU servers. The 32B model has seen particularly wide adoption as a high-quality open-weight model that runs on a single 80GB A100 GPU without quantization. Organizations that handle sensitive data and cannot send it to external APIs tend to prefer these models.
For agentic workflows, Qwen3's strong BFCL and tool-use scores, combined with native Model Context Protocol (MCP) support and the Qwen-Agent framework, make it a practical foundation for building agents that call external APIs, execute code, or coordinate multi-step tasks. The Qwen3-Coder variant extends this to repository-level software engineering tasks, and the Qwen Code CLI gives developers an out-of-the-box agent harness analogous to Claude Code.
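A hedged sketch of the function-calling pattern involved, using the standard OpenAI-style tools schema against any OpenAI-compatible Qwen3 endpoint; the endpoint, model ID, and get_weather tool are placeholders for illustration.

```python
# OpenAI-style function calling against a Qwen3 endpoint (Model Studio or any
# OpenAI-compatible host). Endpoint, model name, and the tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://example-qwen3-host/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-235b-a22b",  # placeholder model ID
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou today?"}],
    tools=tools,
)

# When the model decides to call a function it returns a structured tool_calls
# entry; the caller executes the tool and sends the result back as a "tool" message.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```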
For multilingual applications, the 119-language training coverage makes Qwen3 suitable for translation pipelines, customer service applications in non-English markets (including Arabic, French, Spanish, Japanese, Korean, and dozens of others), and localization workflows. The Chinese-English bilingual capability remains strong, consistent with the series' origins in Alibaba's Chinese-market products.
For reasoning-intensive tasks (competitive mathematics, scientific analysis, multi-step logic problems), the thinking mode makes Qwen3 useful in contexts where accuracy on extended reasoning outweighs response latency. The adjustable thinking budget gives operators control over the latency-accuracy tradeoff without requiring a model swap.
For retrieval and RAG pipelines, Qwen3-Embedding and Qwen3-Reranker provide a matched embedding-and-reranking stack covering 100+ languages with permissive licensing, which makes them a natural choice for builders who already use Qwen3 as the generation model.
For multimodal applications, Qwen3-VL handles document-with-image workflows, video question answering, and chart-and-figure extraction; Qwen3-Omni covers speech-in / speech-out conversational use cases including voice cloning. Both extend the Qwen3 hybrid-thinking design across modalities.
Qwen3 received substantial positive coverage from AI media and the developer community at launch. TechCrunch described the release as underscoring "competitive pressure between Chinese and American AI labs amid U.S. chip export restrictions," placing the models in the context of Alibaba's ability to develop frontier-class models despite restricted access to the most recent Nvidia hardware. CNBC reported on the announcement under the headline "Alibaba Qwen3 AI series: China's latest open-source AI breakthrough."
Within the developer community on Hugging Face and Reddit, Qwen3-30B-A3B attracted particular attention for its cost-quality profile. Commentators noted that a model requiring only 3 billion active parameters per forward pass while delivering reasoning quality competitive with 32B dense models was a practically significant result for operators who self-host inference. The Qwen3-32B dense model gained traction as the go-to choice for users who needed a manageable single-GPU model without the complexity of running the MoE variants.
Hugging Face's head of product highlighted several deployment conveniences: an FP8 checkpoint for faster inference, one-click deployment on Azure ML, and support for local use via MLX on Mac or INT4 quantized builds.
The reception of the hybrid thinking mode was broadly positive, with developers noting that per-turn /think and /no_think controls gave them granular control over latency without requiring separate model deployments for reasoning and non-reasoning tasks. The Qwen Team's later decision to split the unified mode back into Instruct-2507 and Thinking-2507 variants generated mixed reactions: some users preferred the cleaner specialisation, while others lamented the loss of in-conversation soft switching.
Later in 2025, after Moonshot AI released Kimi K2 as a competing open-weight model with strong agentic performance, updated checkpoints of Qwen3-235B-A22B (the -2507 version) outperformed Kimi K2 on GPQA, AIME 2025, and Arena-Hard v2 according to community benchmarks. Some community members observed that the updated Qwen3 checkpoint had "made Kimi K2 irrelevant after only one week, despite being one quarter the size," reflecting the rapid update cadence of the Chinese open-weight model ecosystem.
The Qwen3-Coder release in July 2025 was widely framed by the developer press as a direct competitor to Anthropic's Claude Sonnet 4 on agentic coding workflows. Simon Willison's coverage of Qwen3-Coder-Flash on the day of its release highlighted that the 30B-A3B variant could run usefully on a 64 GB M-series Mac, opening a category of frontier-class agentic coding to laptop users. The Qwen Code CLI's compatibility with Claude Code's plugin protocol made it a near-drop-in replacement for teams that wanted to swap underlying models without reworking developer workflows.
The Thinking-2507 release at the end of July 2025 was covered by VentureBeat under the headline "It's Qwen's summer," reflecting the cumulative effect of the 235B-A22B and Coder and Embedding releases coming within a few-week window. Standard third-party intelligence trackers such as Artificial Analysis and llm-stats began consistently placing Qwen3 in the top tier of open-weight reasoning models from August 2025 onward.
At the time of Qwen3's release, Alibaba reported that the broader Qwen model family had accumulated over 300 million total downloads across Hugging Face, ModelScope, and other distribution channels. The family also had over 100,000 derivative models created by community members on Hugging Face, representing fine-tunes, merges, quantized versions, and task-specific adaptations built on top of Qwen base weights.
Qwen3 models accumulated large download numbers quickly following their April 2025 release. Several variants (particularly Qwen3-8B and Qwen3-32B) appeared among the most-downloaded models on the Hugging Face Hub within weeks, consistent with the pattern of earlier Qwen 2.5 models. The 0.6B and 1.7B models also saw high download volumes from the embedded and mobile development community.
Through mid-2025 the broader Qwen ecosystem continued to expand. Industry trackers reported that Qwen had overtaken Meta's LLaMA as the most-downloaded open-weight model family on Hugging Face at some point in the third quarter of 2025, and that Chinese open-weight models accounted for roughly 63 percent of new uploads on Hugging Face in September 2025, with Qwen3 holding a 40 percent or higher share of derivative work in the broader Hub. Reports of more than 700 million cumulative Qwen downloads on Hugging Face surfaced in late 2025.
Qwen3 powers Alibaba's own Quark AI assistant application in China, where Quark serves over 200 million users with a conversational interface combining search, reasoning, multimodal interactions, and task execution. Alibaba's CEO Wu Yongming stated the company's goal to build "a world-leading full-stack AI provider serving businesses across all industries" while developing AI-native consumer applications.
Qwen3-Coder, released in July 2025, was deployed on Together AI's inference platform for external developers and positioned as a direct competitor to Anthropic's Claude for agentic coding workflows. Amazon Bedrock added Qwen3-Coder-30B-A3B-Instruct as a first-class model card, signalling adoption inside large enterprise cloud accounts. Google Vertex AI's Model Garden hosted Qwen3-VL, which is unusual for a Chinese-developed model and reflects Apache 2.0's broad enterprise acceptability.
Fine-tuning communities have produced specialised Qwen3 derivatives across domains. Notable examples include Qwen3 Swallow, a Japanese-tuned dense series from the Tokyo Institute of Technology's Swallow group, and several finance-, biomedical-, and law-specialised variants from Chinese universities and startups. The Unsloth team published optimised training recipes that let practitioners fine-tune the 30B-A3B MoE on a single 24GB consumer GPU.
Several practical limitations have been identified through community testing and user reports.
Long-context reliability degrades at inputs approaching the stated 128K limit. Community testing found that effective performance is more stable under 64K to 96K tokens. Inputs processed using the YaRN extension mechanism (which rescales position encodings to handle lengths beyond the native training window) perform somewhat worse than inputs within the native window, particularly for the smaller models.
Hallucinations in reasoning traces have been reported, especially on claims about software APIs and programming language behavior. One analysis estimated a roughly 26 percent fabrication rate in reasoning-heavy responses about technical topics, which is higher than typical rates for leading proprietary models. This is consistent with a broader phenomenon in chain-of-thought models, where the model can generate plausible-sounding but incorrect intermediate reasoning steps.
Model identity and knowledge cutoff confusion has been filed as a bug by multiple users. When asked directly about its own name or training cutoff date, Qwen3-32B produces inconsistent answers across sessions, including some that are incorrect. The Qwen Team has acknowledged this in their issue tracker. The model does not have a clearly specified public knowledge cutoff date; community reports suggest answers in the range of October 2023 to December 2024 depending on the session.
Numerical formatting issues have been noted in some Qwen3-32B responses, specifically inconsistent rendering of large numbers with comma separators in structured outputs.
Deployment complexity for the MoE variants is a barrier for self-hosters. While the 235B-A22B model activates only 22 billion parameters per token, loading all 235 billion parameters into memory requires multiple high-memory GPUs (typically 8 x 80GB A100s or equivalent). Individual developers without access to multi-GPU infrastructure must rely on cloud APIs to access the flagship model. The 1M-token extended context, similarly, requires roughly 1 TB of total GPU memory across an inference cluster, making it accessible only to operators with high-end multi-node hardware.
Unified-mode behaviour in the original April 2025 checkpoints sometimes produced think-mode reasoning when the user asked a casual question, or non-thinking direct answers on hard problems where reasoning would have helped. The July 2025 split into Instruct-2507 and Thinking-2507 was Alibaba's response to this issue: by training the two modes as separate models, both became more reliable in their respective regimes, at the cost of losing the soft-switching /think and /no_think toggle within a conversation.
The Qwen3-Max trillion-parameter checkpoint is closed-weight and accessible only through Alibaba Cloud's Model Studio API. Researchers who depend on weight access for interpretability work, fine-tuning, or independent evaluation are restricted to the open-weight 235B-A22B and Next 80B-A3B tiers for the largest fully-open Qwen3 models.
The Qwen Team continued to iterate after the April 2025 release. Beyond the 2507 refresh of the 235B-A22B, 30B-A3B, and 4B models in July 2025, Alibaba introduced Qwen3-Next-80B-A3B in September 2025 with a hybrid attention design that points toward the architecture of future Qwen generations. Qwen3-Max-Preview followed on September 5, 2025 as the first Qwen model to cross one trillion parameters, sitting alongside Qwen3-Max-Thinking variants that target reasoning tasks at the same scale. Qwen3-Omni-Flash-2025-12-01 in December 2025 added latency improvements to the multimodal branch.
In February 2026 Alibaba announced Qwen3.5, the next major generation of the series. Qwen3.5 expands multilingual support to 201 languages and dialects, introduces native multimodal capabilities (text, image, and video understanding within a single architecture), and includes a 397B-A17B MoE flagship together with smaller agentic-task-tuned models including Qwen3.5-122B-A10B and Qwen3.5-27B. Qwen3.5-Omni and Qwen3.6-Plus followed in April 2026, primarily as proprietary API releases. Qwen3 nevertheless remains the de facto baseline open-weight family for most downstream practitioners through 2026, with the 235B-A22B-2507, 30B-A3B-2507, Coder-480B-A35B, and Next-80B-A3B checkpoints continuing to see heavy use.