Qwen3-Next
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,618 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,618 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen3-Next is an efficiency-focused large language model and model architecture released in September 2025 by the Qwen team at Alibaba Cloud. The flagship model in the series, Qwen3-Next-80B-A3B, is an 80-billion-parameter mixture of experts (MoE) model that activates only about 3 billion parameters per token, a configuration the team denotes "80B-A3B." It combines a hybrid attention design, mixing a linear attention component based on Gated DeltaNet with periodic standard attention layers, together with an ultra-sparse mixture of experts, multi-token prediction, and several training-stability optimizations. Alibaba positions the model as approaching the quality of its much larger Qwen3-235B-A22B model while using a small fraction of the active compute, and it describes the design as the basis for later, more capable Qwen models such as Qwen3.5 [1][2].
The model was open-weighted under the Apache 2.0 license in two variants, an instruction-tuned variant (Qwen3-Next-80B-A3B-Instruct) and a reasoning variant (Qwen3-Next-80B-A3B-Thinking), and distributed through Hugging Face, ModelScope, Kaggle, and Alibaba Cloud's Model Studio [3][4].
Qwen3-Next was introduced as a "new generation of ultra-efficient model architecture" rather than a single point release. Its central goal is to break the conventional trade-off in which higher quality requires proportionally more active parameters and compute. By combining a high-sparsity MoE (many experts, very few active per token) with a hybrid attention stack that replaces most full-attention layers with a cheaper linear-attention mechanism, the team aimed to raise the ratio of total to active parameters far beyond previous Qwen models while keeping long-context inference fast [1][2].
Alibaba reports that Qwen3-Next-80B-A3B-Base matches or slightly exceeds the dense Qwen3-32B model on standard benchmarks despite using less than 10 percent of its training cost, and that it delivers more than ten times the inference throughput at context lengths beyond 32,000 tokens. The instruction-tuned model is reported to perform close to the much larger Qwen3-235B-A22B-Instruct-2507, and the Thinking variant is reported to outperform Google's Gemini-2.5-Flash-Thinking on several reasoning benchmarks [1][2][3][4].
Qwen3-Next was developed by the Qwen team at Alibaba Cloud and released in September 2025, with the public model cards and blog post appearing on September 11 and 12, 2025 [1][3]. It builds on the broader Qwen3 family, which spans dense models and earlier MoE models, and shares lineage with Alibaba's flagship Qwen3-235B-A22B and the proprietary Qwen3-Max line. The Qwen3-Next architecture was presented as the efficiency frontier of that lineup and as a testbed for techniques the team intended to carry into future systems.
The team explicitly framed the release as a step toward its next major model, writing that it would "further refine this architecture to develop Qwen3.5." In this sense Qwen3-Next functioned both as a usable open-weight model and as an architectural prototype for subsequent Qwen efficiency models [1][2].
Qwen3-Next-80B-A3B has 80 billion total parameters, of which roughly 3 billion are activated per token, and about 79 billion non-embedding parameters. The model has 48 layers and a hidden dimension of 2,048 [3][4]. Its design rests on two main ideas: a hybrid attention stack and an ultra-sparse mixture of experts.
Instead of using full softmax attention in every layer, Qwen3-Next interleaves two mechanisms in a 3 to 1 ratio: 75 percent of the attention blocks use Gated DeltaNet, a gated variant of the DeltaNet linear-attention family, and 25 percent retain standard (gated) attention. The model card describes the repeating layout as twelve copies of the pattern "(3 x Gated DeltaNet -> MoE) -> (1 x Gated Attention -> MoE)" [3][4]. The team reports that Gated DeltaNet offers stronger in-context learning than sliding-window or simple linear attention, while the retained full-attention layers preserve precise long-range recall, and that the mixture outperforms either mechanism used alone [1][2].
In the implementation, the Gated DeltaNet blocks use 32 value heads and 16 query/key heads with a head dimension of 128, while the Gated Attention blocks use 16 query heads and 2 key/value heads with a head dimension of 256 [3]. This hybrid is what enables the model's reported throughput advantage at long context, since the linear-attention layers avoid the quadratic cost of full attention.
Qwen3-Next uses a high-sparsity MoE feed-forward design. Each MoE layer contains 512 experts, of which 10 routed experts plus 1 shared expert are activated per token, with global load balancing to keep expert utilization even during training [1][3]. This drives the very low activation ratio, roughly 3 billion of 80 billion parameters per token, which is the source of the model's compute efficiency. The approach is part of a broader industry trend toward extreme MoE sparsity also seen in models such as DeepSeek-V3 and other large-scale MoE systems.
The model is trained with multi-token prediction, which the team reports both improves pretraining quality and accelerates inference by enabling speculative decoding of multiple tokens at once. To stabilize training at this scale and sparsity, Qwen3-Next adds normalization and regularization changes, including zero-centered and weight-decayed layer normalization and other adjustments for robust pre-training and post-training [1][2]. Natively the model supports a context length of 262,144 tokens and can be extended to roughly 1,010,000 tokens using YaRN scaling with a factor of about 4.0, which the team reports validating at context lengths up to one million tokens [3][4].
Alibaba reports large efficiency gains for the base model. Qwen3-Next-80B-A3B-Base was trained on about 15 trillion tokens and is said to use only 9.3 percent of the GPU-hour compute cost of Qwen3-32B, and less than 80 percent of the GPU hours of the earlier Qwen3-30B-A3B, while matching or slightly beating Qwen3-32B on downstream benchmarks [1][2]. For inference, the team reports prefill throughput nearly 7 times higher than Qwen3-32B at a 4,000-token context and over 10 times higher beyond 32,000 tokens, and decode throughput nearly 4 times higher at 4,000 tokens, still maintaining more than a 10 times advantage at long context [1][2].
On quality benchmarks, the company reports that the Instruct variant approaches the much larger Qwen3-235B-A22B-Instruct-2507. The numbers below are reported by Alibaba and should be read as vendor-reported results [3].
| Benchmark (Instruct) | Qwen3-Next-80B-A3B-Instruct | Qwen3-235B-A22B-Instruct-2507 |
|---|---|---|
| MMLU-Pro | 80.6 | 83.0 |
| AIME25 | 69.5 | 70.3 |
| Arena-Hard v2 | 82.7 | 79.2 |
| LiveCodeBench | 56.6 | 51.8 |
For the Thinking variant, Alibaba reports that it trails the larger Qwen3-235B-A22B-Thinking-2507 on most reasoning tasks but exceeds Google's Gemini-2.5-Flash-Thinking on several of them. The following vendor-reported figures illustrate the comparison [4].
| Benchmark (Thinking) | Qwen3-Next-80B-A3B-Thinking | Qwen3-235B-A22B-Thinking-2507 | Gemini-2.5-Flash-Thinking |
|---|---|---|---|
| AIME25 | 87.8 | 92.3 | 72.0 |
| HMMT25 | 73.9 | 83.9 | 64.2 |
| LiveCodeBench | 68.7 | 74.1 | 61.2 |
| GPQA | 77.2 | 81.1 | 82.8 |
Because these results are reported by the developer rather than independently audited at release, the headline claim is best stated as the model reaching Qwen3-235B-class quality on a subset of benchmarks while activating far fewer parameters, not as uniform parity.
| Attribute | Qwen3-Next-80B-A3B |
|---|---|
| Developer | Qwen team, Alibaba Cloud |
| Release | September 2025 |
| Total parameters | 80 billion (about 79 billion non-embedding) |
| Active parameters per token | about 3 billion |
| Layers | 48 |
| Hidden dimension | 2,048 |
| Attention | Hybrid: Gated DeltaNet and gated attention, 3:1 ratio |
| MoE | 512 experts, 10 routed plus 1 shared activated per token |
| Other techniques | Multi-token prediction; stability-focused layer normalization |
| Native context | 262,144 tokens (extensible to about 1,010,000 via YaRN) |
| Pretraining data | about 15 trillion tokens |
| Variants | Instruct, Thinking |
| License | Apache 2.0 |
Both Qwen3-Next-80B-A3B variants were released as open-weight models under the Apache 2.0 license, which permits commercial use [3][4]. The weights are available on Hugging Face and ModelScope, with the model also offered through Kaggle and Alibaba Cloud's Model Studio API. The Instruct variant produces direct responses without explicit reasoning traces, while the Thinking variant emits chain-of-thought style "thinking" content before its final answer and is recommended to be run with a large generation budget for hard problems [3][4]. Following the initial release, quantized community builds (for example GGUF formats) and support in local-inference tools appeared, easing deployment of the model on consumer and workstation hardware.
Qwen3-Next is significant as an open demonstration that aggressive architectural sparsity, combining hybrid linear-and-full attention with an ultra-sparse MoE, can deliver near-frontier quality at a small fraction of the active compute and training cost. Its 80B-A3B configuration, with only about 3.7 percent of parameters active per token, pushed the total-to-active ratio well beyond earlier Qwen MoE models and exemplified a broader 2025 efficiency trend that also drew on linear-attention research such as the DeltaNet and Gated DeltaNet lines and on highly sparse MoE designs popularized by DeepSeek [1][2].
Within Alibaba's lineup, Qwen3-Next served as the efficiency anchor beneath the dense Qwen3 models and the flagship Qwen3-235B and Qwen3-Max systems, and the team identified it as the architectural foundation for its next major release, Qwen3.5 [1][2]. In that role it functioned as both a practical open-weight model for long-context and cost-sensitive workloads and as a prototype that informed later Qwen efficiency models.