Qwen3-Next

Updated Jun 8, 2026

Qwen3-Next is an efficiency-focused large language model and model architecture released in September 2025 by the Qwen team at Alibaba Cloud. The flagship model in the series, Qwen3-Next-80B-A3B, is an 80-billion-parameter mixture of experts (MoE) model that activates only about 3 billion parameters per token, a configuration the team denotes "80B-A3B." It pairs a hybrid attention design, in which a linear attention component based on Gated DeltaNet is interleaved with periodic standard attention layers, with an ultra-sparse mixture of experts, multi-token prediction, and several training-stability optimizations. Alibaba positions the model as approaching the quality of its much larger Qwen3-235B-A22B model while using a small fraction of the active compute, and it describes the design as the basis for later, more capable Qwen models such as Qwen3.5 ^[1]^[2].

The model was open-weighted under the Apache 2.0 license in two variants, an instruction-tuned variant (Qwen3-Next-80B-A3B-Instruct) and a reasoning variant (Qwen3-Next-80B-A3B-Thinking), and distributed through Hugging Face, ModelScope, Kaggle, and Alibaba Cloud's Model Studio ^[3]^[4].

Overview

Qwen3-Next was introduced as a "new generation of ultra-efficient model architecture" rather than a single point release. Its central goal is to break the conventional trade-off in which higher quality requires proportionally more active parameters and compute. By combining a high-sparsity MoE (many experts, very few active per token) with a hybrid attention stack that replaces most full-attention layers with a cheaper linear-attention mechanism, the team aimed to raise the ratio of total to active parameters far beyond previous Qwen models while keeping long-context inference fast ^[1]^[2].

Alibaba reports that Qwen3-Next-80B-A3B-Base matches or slightly exceeds the dense Qwen3-32B model on standard benchmarks despite using less than 10 percent of its training cost, and that it delivers more than ten times the inference throughput at context lengths beyond 32,000 tokens. The instruction-tuned model is reported to perform close to the much larger Qwen3-235B-A22B-Instruct-2507, and the Thinking variant is reported to outperform Google's Gemini-2.5-Flash-Thinking on several reasoning benchmarks ^[1]^[2]^[3]^[4].

Development (Qwen team)

Qwen3-Next was developed by the Qwen team at Alibaba Cloud and released in September 2025, with the public model cards and blog post appearing on September 11 and 12, 2025 ^[1]^[3]. It builds on the broader Qwen3 family, which spans dense models and earlier MoE models, and shares lineage with Alibaba's flagship Qwen3-235B-A22B and the proprietary Qwen3-Max line. The Qwen3-Next architecture was presented as the efficiency frontier of that lineup and as a testbed for techniques the team intended to carry into future systems.

The team explicitly framed the release as a step toward its next major model, writing that it would "further refine this architecture to develop Qwen3.5." In this sense Qwen3-Next functioned both as a usable open-weight model and as an architectural prototype for subsequent Qwen efficiency models ^[1]^[2].

Architecture

Qwen3-Next-80B-A3B has 80 billion total parameters (about 79 billion excluding embeddings), of which roughly 3 billion are activated per token. The model has 48 layers and a hidden dimension of 2,048 ^[3]^[4]. Its design rests on two main ideas: a hybrid attention stack and an ultra-sparse mixture of experts.

Hybrid attention

Instead of using full softmax attention in every layer, Qwen3-Next interleaves two mechanisms in a 3:1 ratio: 75 percent of the layers use Gated DeltaNet, a gated variant of the DeltaNet linear-attention family, and 25 percent retain standard (gated) attention. The model card describes the layout as twelve repetitions of a four-layer group, in which three Gated DeltaNet layers are followed by one Gated Attention layer and each layer feeds its own MoE block, which accounts for all 48 layers ^[3]^[4]. The team reports that Gated DeltaNet offers stronger in-context learning than sliding-window or simple linear attention, while the retained full-attention layers preserve precise long-range recall, and that the mixture outperforms either mechanism used alone ^[1]^[2].
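The repeating layout can be checked with a few lines of Python. This is a minimal sketch, assuming each repetition of the pattern expands to four layers (three Gated DeltaNet, then one Gated Attention); the function and label names are illustrative and do not come from the Qwen codebase.

```python
from collections import Counter

NUM_GROUPS = 12          # repetitions of the four-layer pattern
DELTANET_PER_GROUP = 3   # Gated DeltaNet layers per group
ATTENTION_PER_GROUP = 1  # Gated Attention layers per group


def build_layout() -> list[str]:
    """Return the token-mixer type of every layer, in order (hypothetical helper)."""
    group = (["gated_deltanet"] * DELTANET_PER_GROUP
             + ["gated_attention"] * ATTENTION_PER_GROUP)
    return group * NUM_GROUPS


layout = build_layout()
counts = Counter(layout)
# Zero-based indices of the layers that keep full attention.
full_attention_layers = [i for i, kind in enumerate(layout) if kind == "gated_attention"]

print(len(layout))                                          # 48
print(counts["gated_deltanet"], counts["gated_attention"])  # 36 12
print(full_attention_layers[:3])                            # [3, 7, 11]
```

The totals agree with the 48 layers and the 75/25 split quoted above.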

In the implementation, the Gated DeltaNet blocks use 32 value heads and 16 query/key heads with a head dimension of 128, while the Gated Attention blocks use 16 query heads and 2 key/value heads with a head dimension of 256 ^[3]. This hybrid is what enables the model's reported throughput advantage at long context, since the linear-attention layers avoid the quadratic cost of full attention.
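A rough key/value cache estimate shows why keeping full attention in only a quarter of the layers matters at long context. The sketch below uses the head counts quoted above and assumes 16-bit cache entries; it ignores the recurrent state of the Gated DeltaNet layers, which does not grow with context length. The "all layers full attention" figure is a hypothetical comparison, not a real Qwen configuration.

```python
BYTES_PER_VALUE = 2                    # assumption: 16-bit cache entries (e.g. bfloat16)
KV_HEADS = 2                           # key/value heads per Gated Attention block
HEAD_DIM = 256                         # head dimension of the Gated Attention blocks
TOTAL_LAYERS = 48
ATTENTION_LAYERS = TOTAL_LAYERS // 4   # one layer in four keeps full attention


def kv_cache_bytes(num_layers: int, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The factor of 2 covers the separate key and value tensors.
    return num_layers * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * num_tokens


context = 262_144  # native context length
hybrid = kv_cache_bytes(ATTENTION_LAYERS, context)
all_full = kv_cache_bytes(TOTAL_LAYERS, context)  # hypothetical: every layer caches K/V

print(kv_cache_bytes(ATTENTION_LAYERS, 1))  # 24576 bytes per token
print(hybrid / 2**30)                       # 6.0 (GiB)
print(all_full / 2**30)                     # 24.0 (GiB)
```

Under these assumptions the hybrid stack caches a quarter of what an all-attention stack with the same head configuration would need.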

Sparse mixture of experts

Qwen3-Next uses a high-sparsity MoE feed-forward design. Each MoE layer contains 512 experts, of which 10 routed experts plus 1 shared expert are activated per token, with global load balancing to keep expert utilization even during training ^[1]^[3]. This drives the very low activation ratio, roughly 3 billion of 80 billion parameters per token, which is the source of the model's compute efficiency. The approach is part of a broader industry trend toward extreme MoE sparsity also seen in models such as DeepSeek-V3 and other large-scale MoE systems.
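To make the routing step concrete, the following is a minimal sketch of generic top-k expert selection with the counts quoted above (512 experts, 10 routed per token). It is not the Qwen router: the softmax-then-renormalize rule, the random logits, and all names are assumptions for illustration, and the always-on shared expert and load-balancing loss are left out.

```python
import math
import random

NUM_EXPERTS = 512  # experts per MoE layer
TOP_K = 10         # routed experts activated per token


def route(logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k experts by logit and softmax-normalize their weights."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exp_scores = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a real router
weights = route(router_logits)

# Parameter-level activation ratio from the headline figures (3B of 80B).
ACTIVE_RATIO = 3 / 80

print(len(weights))                     # 10
print(round(sum(weights.values()), 6))  # 1.0
print(ACTIVE_RATIO)                     # 0.0375
```

In a full layer the token's output would be the weighted sum of the selected experts' outputs plus the shared expert's output.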

Additional techniques

The model is trained with multi-token prediction, which the team reports both improves pretraining quality and accelerates inference by enabling speculative decoding of multiple tokens at once. To stabilize training at this scale and sparsity, Qwen3-Next adds normalization and regularization changes, including zero-centered and weight-decayed layer normalization and other adjustments for robust pre-training and post-training ^[1]^[2]. The model natively supports a context length of 262,144 tokens, which can be extended to roughly 1,010,000 tokens using YaRN scaling with a factor of about 4.0; the team reports validating this at context lengths up to one million tokens ^[3]^[4].
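The context-extension arithmetic can be sketched as follows. The dictionary keys follow the `rope_scaling` convention used in Hugging Face Transformers configuration files, which is an assumption here rather than something the cited sources spell out, and `factor_for_target` is a hypothetical helper that simply picks the smallest factor covering a target length.

```python
NATIVE_CONTEXT = 262_144     # native context length in tokens
YARN_FACTOR = 4.0            # scaling factor quoted above
DOCUMENTED_LIMIT = 1_010_000 # extended limit quoted above

# Assumption: key names mirror the Hugging Face Transformers `rope_scaling` convention.
rope_scaling = {
    "rope_type": "yarn",
    "factor": YARN_FACTOR,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

scaled_window = int(rope_scaling["original_max_position_embeddings"] * rope_scaling["factor"])


def factor_for_target(target_tokens: int) -> float:
    """Smallest scaling factor that covers target_tokens (hypothetical heuristic)."""
    return max(1.0, target_tokens / NATIVE_CONTEXT)


print(scaled_window)                      # 1048576
print(scaled_window >= DOCUMENTED_LIMIT)  # True
print(factor_for_target(524_288))         # 2.0
```

A factor of 4.0 yields a window slightly larger than the roughly 1,010,000 tokens quoted above, which is consistent with describing the factor as "about 4.0".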

Performance and efficiency

Alibaba reports large efficiency gains for the base model. Qwen3-Next-80B-A3B-Base was trained on about 15 trillion tokens and is said to use only 9.3 percent of the GPU-hour compute cost of Qwen3-32B, and less than 80 percent of the GPU hours of the earlier Qwen3-30B-A3B, while matching or slightly beating Qwen3-32B on downstream benchmarks ^[1]^[2]. For inference, the team reports prefill throughput nearly 7 times that of Qwen3-32B at a 4,000-token context and more than 10 times beyond 32,000 tokens. Decode throughput is reported as nearly 4 times higher at 4,000 tokens and likewise more than 10 times higher at long context ^[1]^[2].

On quality benchmarks, the company reports that the Instruct variant approaches the much larger Qwen3-235B-A22B-Instruct-2507. The numbers below are reported by Alibaba and should be read as vendor-reported results ^[3].

Benchmark (Instruct)	Qwen3-Next-80B-A3B-Instruct	Qwen3-235B-A22B-Instruct-2507
MMLU-Pro	80.6	83.0
AIME25	69.5	70.3
Arena-Hard v2	82.7	79.2
LiveCodeBench	56.6	51.8

For the Thinking variant, Alibaba reports that it trails the larger Qwen3-235B-A22B-Thinking-2507 on most reasoning tasks but exceeds Google's Gemini-2.5-Flash-Thinking on several of them. The following vendor-reported figures illustrate the comparison ^[4].

Benchmark (Thinking)	Qwen3-Next-80B-A3B-Thinking	Qwen3-235B-A22B-Thinking-2507	Gemini-2.5-Flash-Thinking
AIME25	87.8	92.3	72.0
HMMT25	73.9	83.9	64.2
LiveCodeBench	68.7	74.1	61.2
GPQA	77.2	81.1	82.8

Because these results are reported by the developer rather than independently audited at release, the headline claim is best stated as the model reaching Qwen3-235B-class quality on a subset of benchmarks while activating far fewer parameters, not as uniform parity.
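The "subset of benchmarks" framing can be checked directly against the Instruct table. The sketch below copies the vendor-reported scores from that table and computes the per-benchmark gap; the variable names are illustrative.

```python
# benchmark: (Qwen3-Next-80B-A3B-Instruct, Qwen3-235B-A22B-Instruct-2507)
instruct = {
    "MMLU-Pro": (80.6, 83.0),
    "AIME25": (69.5, 70.3),
    "Arena-Hard v2": (82.7, 79.2),
    "LiveCodeBench": (56.6, 51.8),
}

# Positive delta means the smaller Qwen3-Next model scores higher.
deltas = {name: round(small - large, 1) for name, (small, large) in instruct.items()}
leads = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print(leads)  # ['Arena-Hard v2', 'LiveCodeBench']
```

On these four rows the smaller model leads on two benchmarks and trails on two, which is why the comparison is better described as partial rather than uniform parity.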

Specifications

Attribute	Qwen3-Next-80B-A3B
Developer	Qwen team, Alibaba Cloud
Release	September 2025
Total parameters	80 billion (about 79 billion non-embedding)
Active parameters per token	about 3 billion
Layers	48
Hidden dimension	2,048
Attention	Hybrid: Gated DeltaNet and gated attention, 3:1 ratio
MoE	512 experts, 10 routed plus 1 shared activated per token
Other techniques	Multi-token prediction; stability-focused layer normalization
Native context	262,144 tokens (extensible to about 1,010,000 via YaRN)
Pretraining data	about 15 trillion tokens
Variants	Instruct, Thinking
License	Apache 2.0

Availability and licensing

Both Qwen3-Next-80B-A3B variants were released as open-weight models under the Apache 2.0 license, which permits commercial use ^[3]^[4]. The weights are available on Hugging Face and ModelScope, and the model is also offered through Kaggle and Alibaba Cloud's Model Studio API. The Instruct variant produces direct responses without explicit reasoning traces. The Thinking variant emits chain-of-thought style "thinking" content before its final answer, and a large generation budget is recommended for hard problems ^[3]^[4]. Following the initial release, quantized community builds (for example GGUF formats) and support in local-inference tools appeared, easing deployment of the model on consumer and workstation hardware.
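Applications that consume the Thinking variant usually need to separate the reasoning trace from the final answer. The sketch below assumes the trace ends with a `</think>` delimiter and that the opening `<think>` tag may be absent from the generated text; neither detail is stated in this article, so treat both as assumptions and check the model card before relying on them. `split_thinking` is a hypothetical helper.

```python
THINK_OPEN = "<think>"    # assumption: optional opening tag
THINK_CLOSE = "</think>"  # assumption: delimiter that ends the reasoning trace


def split_thinking(output: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no delimiter is found."""
    head, sep, tail = output.rpartition(THINK_CLOSE)
    if not sep:
        # No delimiter: treat the whole output as the answer.
        return "", output.strip()
    thinking = head.replace(THINK_OPEN, "", 1).strip()
    return thinking, tail.strip()


raw = "The user wants 2 + 2. That is 4.</think>\n\n4"
thinking, answer = split_thinking(raw)
print(thinking)  # The user wants 2 + 2. That is 4.
print(answer)    # 4
```

Splitting on the last occurrence of the delimiter keeps the answer intact even if the reasoning text itself mentions the tag.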

Significance

Qwen3-Next is significant as an open demonstration that aggressive architectural sparsity, combining hybrid linear-and-full attention with an ultra-sparse MoE, can deliver near-frontier quality at a small fraction of the active compute and training cost. Its 80B-A3B configuration, with only about 3.7 percent of parameters active per token, pushed the total-to-active ratio well beyond earlier Qwen MoE models. It also exemplified a broader 2025 efficiency trend that drew on linear-attention research, such as the DeltaNet and Gated DeltaNet lines, and on highly sparse MoE designs popularized by DeepSeek ^[1]^[2].

Within Alibaba's lineup, Qwen3-Next served as the efficiency anchor beneath the dense Qwen3 models and the flagship Qwen3-235B and Qwen3-Max systems, and the team identified it as the architectural foundation for its next major release, Qwen3.5 ^[1]^[2]. In that role it functioned both as a practical open-weight model for long-context and cost-sensitive workloads and as a prototype that informed later Qwen efficiency models.

References

[1] Qwen Team. "Qwen3-Next: Towards Ultimate Training and Inference Efficiency." Qwen / Alibaba Cloud blog, September 2025. https://www.alibabacloud.com/blog/602580
[2] Alibaba Cloud Community. "Qwen3-Next: A New Generation of Ultra-Efficient Model Architecture Unveiled." September 2025. https://www.alibabacloud.com/blog/602536
[3] "Qwen/Qwen3-Next-80B-A3B-Instruct." Hugging Face model card. https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
[4] "Qwen/Qwen3-Next-80B-A3B-Thinking." Hugging Face model card. https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
