Qwen3-Max
Last reviewed
May 16, 2026
Sources
19 citations
Review status
Source-backed
Revision
v1 ยท 2,931 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
19 citations
Review status
Source-backed
Revision
v1 ยท 2,931 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen3-Max is the flagship large language model in Alibaba's Qwen series, first released in preview on September 5, 2025 and formally launched at the Apsara Conference on September 24, 2025. It is the first model in the family to cross the one trillion parameter threshold, and Alibaba positions it as its most capable proprietary LLM to date. The model uses a Mixture of Experts (MoE) architecture and supports a 262,144 token context window. It ships in two variants: Qwen3-Max-Instruct, a non-thinking model optimized for general chat, coding, and tool use; and Qwen3-Max-Thinking, a reasoning variant that emits explicit chain-of-thought and integrates code execution and search [1][2][3].
Qwen3-Max represents a notable strategic shift for the Qwen team. While earlier releases in the Qwen3 family were published under permissive open weights on Hugging Face, the Max tier is closed-weight and available only through Alibaba Cloud's Model Studio API, the Qwen Chat web interface, and a few partner platforms such as OpenRouter [4][5]. The preview version reached the top of several public leaderboards within days of release, and the production version unveiled at Apsara emphasized day-one performance on coding and agentic benchmarks, with Alibaba reporting 69.6 on SWE-bench Verified and 74.8 on Tau2-Bench at launch [2][6].
The Qwen series is developed by the Tongyi Lab at Alibaba Cloud and has grown rapidly since the first Qwen-7B release in August 2023. By 2025 the family already included dense models from 0.6 billion to 32 billion parameters, MoE models like Qwen3-235B-A22B, and specialized variants for coding (Qwen2.5-Coder) and vision-language tasks. The "Max" tier had existed since Qwen-Max in early 2024 and Qwen2.5-Max in January 2025, but those earlier flagships were always closed-source and intended as the commercial counterpart to the open Qwen line [4][7].
Qwen3-Max-Preview was announced on X by the official @Alibaba_Qwen account on September 5, 2025 with the line: "Introducing Qwen3-Max-Preview (Instruct), our biggest model yet, with over 1 trillion parameters" [8]. The post claimed the model beat Qwen3-235B-A22B-2507 across internal evaluations. Within roughly three weeks, on September 24, 2025, Alibaba used its annual Apsara Conference in Hangzhou to formally graduate the model from preview status and to announce the existence of a separate Thinking variant. The launch was framed as part of a broader "artificial superintelligence" roadmap that Alibaba Cloud sketched out at the event, alongside related releases including Qwen3-VL-235B-A22B for vision-language and a trio of Qwen3Guard safety models [2][9].
The Thinking variant was originally released as an intermediate checkpoint while training continued. Alibaba Cloud's social channels described it as "an early preview of Qwen3-Max-Thinking, an intermediate checkpoint still in training," and the team said tool use and large-scale test-time compute were essential ingredients in the early demonstrations [10]. A more polished release of Qwen3-Max-Thinking followed in late January 2026, with Alibaba claiming performance comparable to GPT-5.2-Thinking, Claude Opus 4.7, and Gemini 3 Pro across a suite of 19 benchmarks [11][12].
Qwen3-Max is a sparse Mixture of Experts model with more than one trillion total parameters. Alibaba has not published a full technical report for Qwen3-Max specifically, so several architectural details that are standard for smaller Qwen3 models, in particular the exact number of experts, the number of experts activated per token, and the precise active parameter count, have not been disclosed for the Max variant [1][13]. The Qwen3 technical report on arXiv covers the open models in the family (such as the 235B-A22B MoE and the dense 0.6B through 32B models) but does not include Max-specific architectural numbers [13].
The model studio listing for qwen3-max reports a total context window of 262,144 tokens, with an input limit of 258,048 tokens and a maximum output of 32,768 tokens. The Thinking variant adds a separate budget of up to 81,920 tokens for chain-of-thought reasoning [3][14]. Qwen3-Max is text-only at the Max tier; multimodal capabilities are handled by sibling models such as Qwen3-VL [9].
Unlike the smaller Qwen3 open models, which expose a single checkpoint that toggles between thinking and non-thinking behavior through a enable_thinking flag, Qwen3-Max ships the two modes as separate models: qwen3-max for the Instruct variant and qwen3-max-preview / qwen3-max-thinking for the reasoning variant. The Thinking endpoint only operates with streaming incremental output enabled, which means clients must set incremental_output=true on the API call to receive the reasoning trace [2]. In Thinking mode the model can be combined with tools such as a code interpreter and web search, and Alibaba uses parallel test-time computation to push scores on the hardest math benchmarks toward saturation [6][11].
The split between Instruct and Thinking is partly a product decision and partly a latency decision. Instruct returns a final answer with no visible chain-of-thought, so it is faster and cheaper per query. Thinking emits a long internal reasoning trace before the answer, often consuming tens of thousands of tokens for a single hard problem; Artificial Analysis reported that the Thinking checkpoint consumed roughly 86 million output tokens during its full Intelligence Index run, of which about 79 million were reasoning rather than final-answer tokens [16]. That makes Thinking dramatically more expensive to use at scale, even though the per-token rate is the same as Instruct.
Alibaba advertises 262,144 tokens as the native context window, but the model card and Model Studio documentation note that a larger 1 million token configuration is technically possible with YaRN-style extension and dramatically more GPU memory. The 1M configuration is not exposed through the standard API and is referenced more as a research capability of the underlying base than as a product offering [3][13]. For most practical use Qwen3-Max is treated as a 256K-class long-context model, comparable to GPT-5 and Claude Opus 4.7 in the same range.
Alibaba says Qwen3-Max was pre-trained on approximately 36 trillion tokens, roughly double the corpus used for Qwen2.5, with an emphasis on multilingual data, code, and STEM reasoning [1][6]. The pre-training run reportedly used a "global-batch load balancing loss" for the MoE routing and achieved about a 30% improvement in model FLOPs utilization compared with the earlier Qwen2.5-Max-Base run, although Alibaba has not published independently verifiable numbers for either claim outside of blog posts and conference talks [1]. Post-training combined supervised fine-tuning with reinforcement learning, and the Thinking variant was further refined with techniques described by Alibaba as "adaptive tool use" and "test-time scaling" [11].
The corpus emphasizes Chinese, English, and a long tail of Southeast and East Asian languages, with the Qwen team claiming support for over 100 languages and dialects in total. Code and mathematical reasoning data were upsampled during the later stages of pre-training, which is consistent with the model's relatively strong performance on coding-focused benchmarks compared with general knowledge benchmarks [1][6]. Beyond those high-level claims, however, Alibaba has not published a Qwen3-Max-specific technical report. Researchers looking for architectural and training detail have to triangulate from the broader Qwen3 technical report on arXiv, which covers the open weights up to 235B-A22B but stops short of Max [13].
For the Thinking variant, the Qwen team describes a multi-stage process: a base Instruct model trained as above, then a separate reasoning fine-tune that learns to emit chain-of-thought, then a tool-use stage where the model is trained against environments containing code interpreters, search tools, and other agentic actions. Alibaba's blog summarizes this as combining "expanded capacity, large-scale computing resources, and reinforcement learning" but does not disclose the specific RL algorithm, the reward models used, or the size of the post-training datasets [11].
The table below collects officially reported scores from Alibaba's launch materials and the most widely cited third-party evaluations. Where two scores are given for the same benchmark, the first is for Qwen3-Max-Instruct and the second is for Qwen3-Max-Thinking. Benchmarks that Alibaba has not published numbers for are omitted rather than inferred.
| Benchmark | Qwen3-Max-Instruct | Qwen3-Max-Thinking | Notes |
|---|---|---|---|
| AIME 2025 (math) | 80.6 / 81.6 (vendor-reported) | 100 (with tools, parallel test-time compute) | [1][6][11] |
| HMMT February 2025 | not reported | 98.0 (no tools); 100 (with tools) | [11][12] |
| GPQA Diamond | not reported | 85.4 to 87.4 across reports | [11][12][15] |
| SWE-bench Verified | 69.6 | 75.3 | [2][6][12] |
| Tau2-Bench (agent tool use) | 74.8 | 82.1 | [2][12] |
| LiveCodeBench v6 | 57.5 to 74.8 across reports | 85.9 | [1][6][12] |
| Arena-Hard v2 | 78.9 to 86.1 across reports | 90.2 | [1][12][15] |
| SuperGPQA | 64.6 to 65.1 | not reported | [1][6] |
| LiveBench (2024-11-25) | 79.3 | not reported | [1] |
| MMLU-Pro | not reported | 85.7 | [12] |
| MMLU-Redux | not reported | 92.8 | [12] |
| Humanity's Last Exam (with tools) | not reported | 49.8 | [12][16] |
| Artificial Analysis Intelligence Index | 26 (Preview) / 31 (final Instruct) | 40 | [15][17] |
The spread on LiveCodeBench v6, Arena-Hard v2, and AIME 2025 reflects the fact that different secondary reports captured the model at different points in its post-training cycle. The preview version that landed on LMArena in early September 2025 reached the third position on that public leaderboard and briefly led several open-source competitors and Claude Opus 4 across coding and agentic categories [1][8]. The Thinking variant later reported triple-digit accuracy on AIME 2025 and HMMT, although those scores depend on tool use (notably a Python interpreter) and large-scale parallel sampling, both of which Alibaba has been explicit about [10][11].
On Humanity's Last Exam, Alibaba claimed in January 2026 that Qwen3-Max-Thinking with search enabled scored 49.8, ahead of GPT-5.2 (45.5), Claude Opus 4.5 (43.2), and Gemini 3 Pro (45.8) on that particular configuration. Independent analysts at Artificial Analysis verified an Intelligence Index of 40 for the same Thinking checkpoint, placing it slightly behind DeepSeek V3.2 (42) but ahead of MiniMax-M2.1 [12][15].
Qwen3-Max is closed-weight. There is no open release of the model on Hugging Face or anywhere else; only the smaller Qwen3 models (up to 235B-A22B) are available for download. Access is gated behind Alibaba's commercial channels.
| Variant | Model ID | Primary use |
|---|---|---|
| Qwen3-Max-Instruct | qwen3-max | General chat, coding, tool use, structured outputs |
| Qwen3-Max-Preview | qwen3-max-preview | Early Instruct preview, deprecated after Apsara launch |
| Qwen3-Max-Thinking | qwen3-max-thinking | Reasoning, agentic workflows, math and science problems |
Developers can reach the model through three official routes. The Qwen Chat web interface at chat.qwen.ai exposes Qwen3-Max for free use by individual users. The Alibaba Cloud Model Studio API (also branded as Bailian in mainland China) is the primary route for commercial integration and supports an OpenAI-compatible request schema. Third-party gateways including OpenRouter and Hugging Face's AnyCoder also surface the model under Alibaba's branding [4][5][2].
Pricing varies by deployment region. The international Singapore deployment lists $1.20 per million input tokens and $6.00 per million output tokens for the first 32K of context, rising to $3.00 input and $15.00 output once the context exceeds 128K. The Beijing and Frankfurt deployments are cheaper, at roughly $0.36 input and $1.43 output per million tokens in the lowest tier, reflecting Alibaba's regional pricing strategy [3][14].
| Context range | Input ($/M tokens, Singapore) | Output ($/M tokens, Singapore) |
|---|---|---|
| 0 to 32K | 1.20 | 6.00 |
| 32K to 128K | 2.40 | 12.00 |
| 128K to 252K | 3.00 | 15.00 |
The reception of Qwen3-Max has split along familiar lines. Inside the open-source community, the move to closed weights for the Max tier drew complaints, since one of the things that made Qwen3 popular in the first place was the willingness to publish frontier weights under an Apache-style license. Several commentators pointed out that this puts Qwen3-Max closer in spirit to GPT-5 and Claude Opus 4 than to DeepSeek V3, which remains downloadable [4][5].
On benchmarks, the picture is genuinely competitive but not dominant. Qwen3-Max-Instruct sits at or near the frontier for non-reasoning models on coding and agentic tasks, and its 69.6 on SWE-bench Verified at launch placed it within a few points of the best closed models at the time [2][6]. The Thinking variant is where Alibaba has been most aggressive in framing comparisons, with the January 2026 announcement claiming parity with GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro across 19 benchmarks [11][12]. Independent measurements by Artificial Analysis put the Thinking Intelligence Index at 40, which is close to but not ahead of comparable Western reasoning models, suggesting the picture is closer to "competitive with the frontier" than "new state of the art" [15][16].
Where Qwen3-Max has a clear edge is price and Chinese-language performance. The input pricing is roughly half of GPT-5's at comparable context lengths, and on Chinese-focused evaluations such as QwenChineseBench the Qwen models consistently lead the field, which matters for enterprises in mainland China and Southeast Asia that need to operate primarily in Mandarin, Japanese, Korean, or Malay [3][6][9]. Reporters at VentureBeat highlighted output speed (around 50 tokens per second for the Preview Instruct version) as another differentiator, although the Thinking variant is meaningfully slower per output token because of its long reasoning traces [15][17].
Alibaba's larger framing at Apsara 2025 was that Qwen3-Max is the first concrete step on what its Cloud unit called a multi-year roadmap toward artificial general intelligence and ultimately artificial superintelligence. That framing has been read both as ambition and as marketing. The model itself, with its closed weights, paid API, and incremental jumps over Qwen3-235B-A22B, is more conservative than the rhetoric around it [9][2].
Reporters also noted that Qwen3-Max sits inside a unique geopolitical position. It is a Chinese model with a credible claim to being competitive at the global frontier, which is rare; previous Chinese frontier candidates such as DeepSeek V3 either traded weights for the spotlight or accepted a slight benchmark deficit in exchange for openness. Qwen3-Max takes the opposite route and looks more like an American frontier lab in its product strategy: keep the weights, charge for the API, and run a large research lab on the revenue. Whether enterprises outside China are willing to standardize on a model hosted on Alibaba Cloud is a separate question that several InfoWorld analysts said depends on data isolation, system logs, model update policy, and cross-border data movement rules in each customer's home jurisdiction [11].
Alibaba positions Qwen3-Max-Instruct as a general-purpose model for chat, coding, structured data extraction, and tool use, and the Thinking variant for agentic workflows that need explicit reasoning. Early adopters reported on it most often for three things. The first is enterprise coding assistance, where the SWE-bench Verified score and the long context window make it a credible alternative to closed Western models for repository-scale tasks. The second is Chinese-language and multilingual workloads, where the model's Mandarin and Asian-language capabilities outperform models built primarily on English data. The third is mathematical and scientific problem solving in research settings where Thinking mode plus a code interpreter can be allowed to spend tens of minutes on a single problem [6][11][18].
The main limitations are familiar for a closed proprietary model: there are no open weights, so private deployment is impossible; the model is not multimodal, so vision and audio tasks have to be routed to sibling Qwen models; and outside of Alibaba's own infrastructure, latency and reliability depend on what each gateway provider has provisioned. The Thinking variant in particular is meaningfully slower than Instruct, which limits how easily it can be dropped into latency-sensitive applications [16][17].