Qwen3.5
Last reviewed
Jun 9, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 ยท 2,247 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 9, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 ยท 2,247 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen3.5 is a family of open-weight large language models developed by the Qwen team at Alibaba, the successor to the Qwen3 series. [1][2] The line debuted on February 16, 2026 with a flagship model, Qwen3.5-397B-A17B, that pairs a sparse mixture of experts (397 billion total parameters, about 17 billion active per token) with linear attention via Gated DeltaNet, and ships as a natively multimodal system covering 201 languages and dialects. [1][3][4] Like earlier Qwen flagships, the open-weight checkpoints are released under the permissive Apache 2.0 license. [4][5]
Qwen3.5 is the fourth major generation of the Qwen model family. The Qwen team frames it under the tagline "Towards Native Multimodal Agents," signaling two priorities: folding vision into the model from the start rather than bolting it on afterward, and pushing agentic, tool-using behavior. [2][6] Every model in the family, from the 397-billion-parameter flagship down to a sub-billion-parameter checkpoint, is built on the same hybrid architecture and the same early-fusion vision-language foundation, so even the smallest variants accept image and video input alongside text. [2][7]
The generation's defining engineering choice is a hybrid attention stack. Instead of using full softmax attention in every layer, Qwen3.5 interleaves three blocks of Gated DeltaNet, a form of linear attention with constant memory cost per token, for every one block of conventional gated attention. [3][8] Combined with a sparse MoE feed-forward design, this lets the larger models hold a native 262,144-token context and decode long sequences far faster than the previous Qwen3-Max flagship while activating only a small fraction of their weights on any given token. [3][4]
Alibaba's Qwen team rolled out the Qwen3.5 family in three waves over roughly two weeks, leading with the largest model and filling in smaller sizes afterward. [1][2][6]
The flagship Qwen3.5-397B-A17B arrived first, on February 16, 2026, accompanied by the hosted Qwen3.5-Plus endpoint on Alibaba Cloud's Model Studio. [1][9] Mid-size checkpoints (Qwen3.5-122B-A10B, Qwen3.5-35B-A3B, and Qwen3.5-27B) followed on February 24, 2026, and a set of smaller models (Qwen3.5-9B, Qwen3.5-4B, Qwen3.5-2B, and Qwen3.5-0.8B) shipped on March 2, 2026 to Hugging Face and ModelScope. [2][6] The release landed in a crowded stretch of Chinese open-weight launches, and commentators read it as evidence that the open-weight race was not slowing down. [5]
Qwen3.5 was a relatively short-lived flagship. The Qwen team moved on to a Qwen3.6 generation in April 2026, but Qwen3.5 remains the reference point for the architecture, since Qwen3.6 inherited the same hybrid Gated DeltaNet plus MoE design. [6]
The Qwen3.5 open-weight release spans eight checkpoints. The four largest use a sparse MoE feed-forward network and carry an "A" suffix denoting active parameters (for example, "A17B" means 17 billion active per token); the smaller models replace the MoE blocks with dense feed-forward layers but keep the same 3:1 Gated DeltaNet to gated attention pattern. [3][4][7] All variants are natively multimodal and share a native 262,144-token context that can be extended toward roughly one million tokens using RoPE scaling. [4][7]
| Model | Total params | Active params | Experts (total / routed + shared) | Layers | FFN type | Release |
|---|---|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | 512 / 10 + 1 | 60 | MoE | Feb 16, 2026 |
| Qwen3.5-122B-A10B | 122B | 10B | 256 / 8 + 1 | 48 | MoE | Feb 24, 2026 |
| Qwen3.5-35B-A3B | 35B | 3B | 256 / 8 + 1 | 40 | MoE | Feb 24, 2026 |
| Qwen3.5-27B | 27B | 27B (dense) | n/a | 64 | Dense | Feb 24, 2026 |
| Qwen3.5-9B | 9B | 9B (dense) | n/a | 32 | Dense | Mar 2, 2026 |
| Qwen3.5-4B | 4B | 4B (dense) | n/a | n/a | Dense | Mar 2, 2026 |
| Qwen3.5-2B | 2B | 2B (dense) | n/a | n/a | Dense | Mar 2, 2026 |
| Qwen3.5-0.8B | 0.8B | 0.8B (dense) | n/a | n/a | Dense | Mar 2, 2026 |
Sources: official Hugging Face model cards and the Qwen GitHub release notes. [4][6][7][10][11]
Qwen3.5 is built around a hybrid attention design that the model cards describe as an interleave of Gated DeltaNet and gated attention. [3][4] In the flagship, the 60 layers are organized as 15 repeated blocks, each laid out as three Gated DeltaNet sublayers followed by one gated-attention sublayer, with a MoE feed-forward network after every sublayer. [4][8] Gated DeltaNet is a linear-attention mechanism whose memory footprint does not grow with sequence length, which is what makes very long contexts cheap to serve; the periodic full-attention layers preserve the precise long-range recall that pure linear attention tends to lose. [3][8]
In the 397B flagship the Gated DeltaNet sublayers use 64 linear-attention heads for values and 16 for queries and keys (head dimension 128), while the gated-attention sublayers use 32 query heads and 2 key/value heads (head dimension 256) with a 64-dimensional rotary position embedding. [4] The hidden dimension is 4,096. The MoE layers hold 512 experts, of which a router activates 10 plus 1 always-on shared expert per token, so 11 of 512 experts fire on any forward pass, which is how a 397-billion-parameter network runs with only about 17 billion active parameters. [3][4] The mid-size MoE models reuse the same template at smaller scale: the 122B-A10B and 35B-A3B each use 256 experts with 8 routed plus 1 shared, differing in layer count, hidden size, and active-parameter budget. [10][11]
A second pillar is native multimodality. Rather than attaching a vision adapter to a finished text model, Qwen3.5 was pretrained with early fusion on trillions of multimodal tokens, so a vision encoder feeds the language model from the outset; the model cards expose the checkpoints as image-text-to-text systems and note that a flag can run them in text-only mode. [2][7][10] The Qwen team reports that this approach reached near-parity in training efficiency with text-only pretraining, and that the resulting models match or beat the dedicated Qwen3-VL vision models on visual reasoning while still improving on text. [2][7] The flagship is trained with multi-token prediction, which supports speculative decoding for faster generation. [4]
Alibaba has not published a full breakdown of the Qwen3.5 pretraining corpus, but the team states that the models were trained on trillions of multimodal tokens spanning text, images, and video, with a post-training stage that scales reinforcement learning for reasoning and agentic behavior. [2][7] The tokenizer vocabulary was enlarged from roughly 150,000 entries in Qwen3 to about 250,000, part of the work that widened language coverage from 119 languages to 201. [3][12]
Every Qwen3.5 model has a native context window of 262,144 tokens (256K). [4][7] Using RoPE-based length extrapolation such as YaRN, the open weights can be pushed toward roughly 1,010,000 tokens, and the hosted Qwen3.5-Plus endpoint exposes a one-million-token context window by default. [4][7][9] The hybrid attention stack is what makes contexts this large economical: because most layers use linear attention, per-token cost stays roughly flat as the sequence grows instead of scaling quadratically.
Qwen3.5 is positioned as a generalist agentic model. The Qwen team highlights long-horizon reasoning, tool use, and multimodal understanding spanning images, documents, charts, and video, alongside strong coding and software-engineering performance. [2][7] The larger models support a "thinking" mode, enabled by default, in which the model produces an internal reasoning trace before its final answer; this can be switched off for latency-sensitive use. [10]
The headline change for breadth is language coverage. Qwen3.5 supports 201 languages and dialects, up from 119 in Qwen3, which the team presents as one of the generation's main selling points. [2][3] This shows up in benchmark results, where the flagship's multilingual scores are among its strongest relative to Western frontier models. [3][4]
The efficiency story is equally central. Because of the hybrid attention design and sparse activation, the Qwen team reports decoding-throughput gains of roughly 8.6x at 32K context and up to 19x at 256K context relative to Qwen3-Max, which independent coverage echoed alongside claims of materially lower serving cost. [3][9]
Qwen reports that the flagship Qwen3.5-397B-A17B is competitive with leading proprietary frontier models of the period, including GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5, leading on multilingual and several agentic and vision tasks while trailing on some math-competition and long-context evaluations. [3][5] The figures below are drawn from the official model cards. As with any vendor-reported numbers, scores reflect the publisher's own evaluation setup and are best treated as indicative.
| Benchmark | Qwen3.5-397B-A17B | Qwen3.5-27B |
|---|---|---|
| MMLU-Pro | 87.8 | 86.1 |
| GPQA | 88.4 | n/a |
| AIME 2026 | 91.3 | n/a |
| LiveCodeBench v6 | 83.6 | n/a |
| SWE-bench Verified | 76.4 | 72.4 |
| Tau2-bench | 86.7 | n/a |
| IFEval | 92.6 | 95.0 |
| LongBench v2 | 63.2 | 60.6 |
| MMMLU (multilingual) | 88.5 | n/a |
Reported vision-language results for the flagship include 85.0 on MMMU, 79.0 on MMMU-Pro, 88.6 on MathVision, 90.8 on OmniDocBench v1.5, and 87.5 on Video-MME with subtitles, which the Qwen team cites as evidence that the unified model surpasses the earlier dedicated Qwen3-VL line on visual reasoning. [4][7] Independent hands-on testing by analysts at the time described a model with balanced strength across reasoning, agentic execution, coding, and multimodal understanding rather than a single standout axis. [13]
| Vision-language benchmark | Qwen3.5-397B-A17B |
|---|---|
| MMMU | 85.0 |
| MMMU-Pro | 79.0 |
| MathVision | 88.6 |
| OmniDocBench v1.5 | 90.8 |
| Video-MME (with subtitles) | 87.5 |
| VideoMMMU | 84.7 |
Sources: official Qwen3.5-397B-A17B and Qwen3.5-27B model cards. [4][7]
The Qwen3.5 open-weight checkpoints are released under Apache 2.0, which permits commercial use, modification, and redistribution. [4][5] Weights are distributed through Hugging Face and ModelScope, and the family is also reachable as a hosted service. The Qwen3.5-Plus API on Alibaba Cloud's Model Studio serves the flagship-class model with a one-million-token default context and accepts text, image, and video input. [7][9]
| Channel | What is offered | License / terms |
|---|---|---|
| Hugging Face | All eight open-weight checkpoints | Apache 2.0 |
| ModelScope | All eight open-weight checkpoints | Apache 2.0 |
| Qwen3.5-Plus (Alibaba Cloud Model Studio) | Hosted flagship-class API, 1M context, multimodal | Commercial API (usage-based) |
| Qwen Chat | Consumer chat interface | Service terms |
Published API rates for the hosted endpoint vary by source and over time, and Alibaba's documentation describes tiered billing based on prompt size, so exact per-token prices are best taken from the current Model Studio pricing page rather than secondary write-ups. [9][14]
Coverage of Qwen3.5 centered on three points: the unusually high ratio of total to active parameters, the hybrid linear-attention architecture, and the decision to ship a strong multimodal flagship as free open weights. The Decoder framed the release as a sign that China's open-weight push was "far from slowing down," noting the jump to 201 languages, the larger 250,000-token vocabulary, and benchmark results that put the model close to GPT-5.2 on agentic tasks while leading on instruction following and multilingual work. [5] DeepLearning.AI's The Batch described the model as a hybrid of linear attention via Gated DeltaNet with sparse MoE that is "competitive with GPT-4, Claude Opus 4.5, and Gemini Pro 3" on knowledge, coding, reasoning, and multimodal tasks, while trailing the leaders on math competitions and long-context evaluations. [3] Technical outlets emphasized the throughput gains over Qwen3-Max and the fact that even sub-billion-parameter members of the family are natively multimodal. [9][13]
The published results and commentary point to a few consistent caveats. On the capability side, the flagship trails the strongest proprietary systems on some math-competition benchmarks and on certain long-context evaluations, even as it leads on multilingual and instruction-following tasks. [3][5] On the engineering side, the hybrid linear-attention design is comparatively new, and aggressive sparsity (only about 11 of 512 experts active per token in the flagship) and low-precision serving involve quality-versus-efficiency tradeoffs that are still being characterized in practice. [13] As with other open-weight releases, the reported benchmark figures are vendor-published and reflect Alibaba's own evaluation setup, so independent reproduction is needed to gauge real-world performance. Finally, Alibaba has not disclosed the full composition of the training data, which limits outside scrutiny of the corpus. [7]