Qwen3-Next Hybrid MoE Models Preview: Improved Accuracy and Faster Inference on NVIDIA Platform
Sources: https://developer.nvidia.com/blog/new-open-source-qwen3-next-models-preview-hybrid-moe-architecture-delivering-improved-accuracy-and-accelerated-parallel-processing-across-nvidia-platform/ (NVIDIA Developer Blog)
TL;DR
- Alibaba released two open source Qwen3-Next models, 80B-A3B-Thinking and 80B-A3B-Instruct, previewing a hybrid Mixture of Experts (MoE) architecture that aims to improve accuracy while accelerating parallel processing across NVIDIA platforms.
- Each model has 80 billion parameters, but only about 3 billion are activated per token due to the sparse MoE design, delivering the power of a massive model with the efficiency of a smaller one.
- The MoE module routes tokens across 512 routed experts plus 1 shared expert, with 10 experts activated per token, enabling scalable inference and flexible expert utilization.
- The architecture supports long context lengths beyond 260K tokens and leverages Gated Delta Networks to process super-long text efficiently, with memory and compute scaling roughly linearly with sequence length.
- When run on NVIDIA Hopper and Blackwell hardware, the setup benefits from 5th-generation NVLink delivering 1.8 TB/s of direct GPU-to-GPU bandwidth, reducing latency in expert routing and boosting token throughput.
- NVIDIA collaborates with SGLang and vLLM to enable model deployment as NVIDIA NIM microservices; developers can test on build.nvidia.com, access prepackaged NIM services, and explore a Jupyter notebook guide for building AI agents using Qwen3-Next NIM endpoints.
Context and background
As AI models grow, efficiency becomes as important as scale. Long input sequences are increasingly common, and there is a need for architectures that deliver high accuracy without prohibitive compute costs. In this context, Alibaba released two open source Qwen3-Next models, 80B-A3B-Thinking and 80B-A3B-Instruct, to preview a new hybrid Mixture of Experts (MoE) approach for research and development. The Qwen3-Next family is designed to provide the capabilities of very large models while keeping resource use practical through sparsity and optimized inter-GPU communication.

This initiative sits within NVIDIA's broader open-source ecosystem, which includes projects like NeMo for AI lifecycle management, Nemotron LLMs, and Cosmos world foundation models (WFMs). The aim is to accelerate innovation by making cutting-edge models more accessible, transparent, and collaborative for researchers and developers alike. NVIDIA emphasizes open deployment pathways by partnering with open-source frameworks such as SGLang and vLLM, and by packaging models as NVIDIA NIM microservices. This approach enables enterprise developers and researchers to test and deploy Qwen3-Next models via both hosted endpoints and self-hosted containers, and to experiment with practical agent-building workflows through hands-on notebooks.

A core motivation behind the hybrid MoE design is to push the boundaries of efficiency and reasoning. The combination of a sparse MoE routing strategy, optimized attention variants, and new memory-efficient primitives demonstrates how large-scale models can be accessed and evaluated by a broader community without sacrificing performance.
What’s new
The Qwen3-Next 80B-A3B-Thinking and 80B-A3B-Instruct previews introduce several architectural and deployment innovations:
- Model scale and sparsity: Each model has 80 billion parameters, but only 3 billion are activated per token due to the sparse MoE structure. This enables the “scale without the full cost” dynamic—massive capacity with targeted activation.
- MoE routing and capacity: The MoE module comprises 512 routed experts and 1 shared expert, with 10 experts activated per token. This routing enables dynamic utilization of model submodules depending on the input, improving efficiency for diverse tasks (a minimal routing sketch follows this list).
- Long-context capability: The architecture is optimized for long context lengths, supporting input sequences well over 260,000 tokens, aided by memory- and compute-efficient primitives.
- Attention design: The model uses 48 layers, with every 4th layer implementing GQA attention while the remaining layers deploy a new linear attention variant. This hybrid attention approach is designed to balance expressivity and efficiency for long sequences.
- Gated Delta Networks: NVIDIA and MIT contributions around Gated Delta Networks enhance focus in long-sequence processing, enabling more reliable retention of relevant information over very long text passages and scaling almost linearly with sequence length.
- Hardware and interconnects: The models are designed to run on NVIDIA Hopper and Blackwell GPUs, leveraging 5th-generation NVLink that delivers about 1.8 TB/s of direct GPU-to-GPU bandwidth. This high-speed fabric minimizes latency in the expert routing stage and supports higher token throughput for AI inference workloads.
- Software and deployment: The hybrid MoE approach is supported by CUDA-based experimentation pathways, allowing both full attention layers (as in traditional Transformers) and linear attention layers to coexist in the Qwen3-Next stack. NVIDIA also partnered with SGLang and vLLM to facilitate deployment and packaging as NIM microservices, with options for open testing via NVIDIA-hosted endpoints and secure self-hosted deployments.
- Open access and tooling: Developers can access Qwen3-Next-80B-A3B-Thinking through build.nvidia.com for immediate testing in the UI or via the NVIDIA NIM API. Prepackaged NIM microservices extend deployment options, and a Jupyter notebook guide demonstrates practical agent-building workflows powered by Qwen3-Next NIM endpoints (an illustrative API request example follows this list).
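To make the sparsity numbers concrete, here is a minimal top-k routing sketch in Python. Only the counts come from the article: 512 routed experts, 1 shared expert, and 10 active experts per token. The softmax-over-selected-logits gating, the hidden dimension, and the router weights are generic illustrative assumptions, not Qwen3-Next's actual gating implementation.

```python
import numpy as np

# Counts from the article: 512 routed experts, 1 shared expert,
# 10 routed experts active per token. HIDDEN_DIM and the router
# weights are illustrative placeholders.
NUM_ROUTED_EXPERTS = 512
TOP_K = 10
HIDDEN_DIM = 2048  # assumption for illustration only


def route_token(hidden_state: np.ndarray, router_weights: np.ndarray):
    """Select the top-k routed experts for one token and return their
    normalized gate weights; the shared expert is always applied."""
    logits = router_weights @ hidden_state                 # (512,)
    top_idx = np.argpartition(logits, -TOP_K)[-TOP_K:]     # indices of the 10 largest logits
    gates = np.exp(logits[top_idx] - logits[top_idx].max())
    gates /= gates.sum()                                   # softmax over the selected experts only
    return list(zip(top_idx.tolist(), gates.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.standard_normal(HIDDEN_DIM)
    W = rng.standard_normal((NUM_ROUTED_EXPERTS, HIDDEN_DIM))
    routed = route_token(h, W)
    print(f"{len(routed)} routed experts active for this token, plus the shared expert")
```

The property this illustrates is that only 10 of the 512 routed expert FFNs (plus the shared expert) run for any given token, which is what keeps per-token compute near the roughly 3B-active-parameter mark described above.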
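For the hosted-endpoint path, the sketch below assumes the NVIDIA API catalog's OpenAI-compatible chat completions endpoint and uses a placeholder model identifier (qwen/qwen3-next-80b-a3b-thinking). Verify the exact model ID, URL, and request parameters on build.nvidia.com before relying on this.

```python
import os
import requests

# Minimal sketch of calling a hosted Qwen3-Next NIM endpoint.
# Assumptions: the API catalog exposes an OpenAI-compatible
# chat/completions route, and the model ID below is illustrative.
API_KEY = os.environ["NVIDIA_API_KEY"]  # obtained from build.nvidia.com
URL = "https://integrate.api.nvidia.com/v1/chat/completions"

payload = {
    "model": "qwen/qwen3-next-80b-a3b-thinking",  # placeholder model ID
    "messages": [
        {"role": "user", "content": "Summarize the benefits of sparse MoE models."}
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

resp = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```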
Why it matters (impact for developers/enterprises)
The hybrid MoE design in Qwen3-Next aims to deliver practical gains in both reasoning capability and throughput. By activating only a small fraction of the model per token, the architecture combines the deep representational power of a large model with the efficiency typically associated with smaller ones.

For developers and enterprises, the open-source nature of Qwen3-Next, together with NVIDIA's deployment tooling, lowers the barrier to experimentation and production integration. NIM microservices provide ready-to-use endpoints, while self-hosted deployment options offer secure environments for in-house AI workflows. The collaboration with open-source frameworks such as SGLang and vLLM further reduces integration friction for serving and experimentation. On the hardware side, Hopper and Blackwell GPUs with high-bandwidth NVLink reduce latency in routing decisions across hundreds of experts, improving usable throughput for real-time or near-real-time AI tasks. In practice, this can translate into faster responses for long-context reasoning, higher token throughput, and more scalable AI agents in production environments.

The broader NVIDIA ecosystem emphasizes openness and collaboration, helping researchers, developers, and enterprises contribute to and benefit from state-of-the-art AI infrastructure. By providing reference deployments, Jupyter notebooks, and containers, NVIDIA aims to accelerate innovation and broaden access to advanced model architectures.
Technical details and implementation
Key specifications at a glance
| Attribute | Value |
|---|---|
| Model size | 80 billion parameters per model |
| Active parameters per token | ~3 billion (sparse MoE activation) |
| MoE routing | 512 routed experts + 1 shared expert |
| Experts activated per token | 10 |
| Number of layers | 48 |
| Attention scheme | GQA on every 4th layer; linear attention variant on the remaining layers |
| Context length | More than 260K tokens |
| Interconnect bandwidth | ~1.8 TB/s direct GPU-to-GPU (5th-generation NVLink) |
| Hardware targets | NVIDIA Hopper and Blackwell GPUs |
| Deployment options | NVIDIA NIM microservices (hosted or self-hosted), SGLang, vLLM |
| Availability | Open source; 80B-A3B-Thinking and 80B-A3B-Instruct via build.nvidia.com and the NVIDIA API catalog |
Implementation notes
The Qwen3-Next hybrid MoE approach relies on dynamically routing computation across hundreds of experts. With 10 experts active per token, the model can distribute processing, enabling both scalable throughput and improved reasoning capabilities on very long input sequences. The combination of GQA and linear attention variants allows the network to adapt its attention pattern for different layers, balancing the need for long-range dependency modeling with practical compute constraints.

Gated Delta Networks are cited as enabling better focus during long-context processing, reducing drift and maintaining salient information across extremely long passages. This, in turn, helps memory and computation scale almost linearly with sequence length, addressing one of the central challenges of ultra-long-context modeling.

From a deployment perspective, the integration with CUDA and NVIDIA platform tooling, plus collaboration with SGLang and vLLM, creates practical paths from research to production. Enterprises can explore Qwen3-Next via hosted endpoints in the NVIDIA API catalog or opt for secure self-hosted microservices, allowing for experimentation with AI agents and long-context tasks in real-world settings.
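The 1-in-4 interleaving of full (GQA) and linear attention layers can be pictured with a toy schedule like the one below. Only the 48-layer count and the every-4th-layer pattern come from the article; the exact layer indexing in Qwen3-Next is an assumption here.

```python
# Toy layer schedule for the hybrid attention stack: 48 layers, with
# every 4th layer using full (GQA) attention and the rest using the
# linear-attention variant. The offset/indexing is an assumption made
# only to illustrate the 1-in-4 interleaving pattern.
NUM_LAYERS = 48

def layer_kind(layer_idx: int) -> str:
    """Return which attention type a given layer uses (0-based index)."""
    return "gqa_full_attention" if (layer_idx + 1) % 4 == 0 else "linear_attention"

schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
assert schedule.count("gqa_full_attention") == NUM_LAYERS // 4  # 12 full-attention layers
```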
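As an illustration of why the linear-attention layers scale almost linearly with sequence length, here is a small delta-rule-style recurrence with a decay gate, written in NumPy. It follows the published gated delta rule in spirit but is not Qwen3-Next's actual kernel or parameterization; the point is that the per-layer state stays a fixed (d_k x d_v) matrix no matter how long the input grows, so memory and compute grow only with the number of tokens.

```python
import numpy as np

# Sketch of a gated, delta-rule-style linear-attention update: a fixed-size
# state matrix S is decayed and corrected once per token, instead of
# accumulating a growing key/value cache. Illustrative only.
def gated_delta_scan(K, V, alpha, beta):
    """K: (T, d_k), V: (T, d_v), alpha/beta: (T,) gates in (0, 1)."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outputs = []
    for k_t, v_t, a_t, b_t in zip(K, V, alpha, beta):
        # Decay old memory, then write the "delta" between the new value
        # and what the state currently predicts for this key.
        prediction = S.T @ k_t                              # (d_v,)
        S = a_t * S + b_t * np.outer(k_t, v_t - prediction)
        outputs.append(S.T @ k_t)                           # read-out with the same key
    return np.stack(outputs)                                # (T, d_v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 1024, 64, 64
    out = gated_delta_scan(
        rng.standard_normal((T, d_k)),
        rng.standard_normal((T, d_v)),
        alpha=np.full(T, 0.95),
        beta=np.full(T, 0.1),
    )
    print(out.shape)  # the state stays (64, 64) no matter how large T grows
```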
Key takeaways
- Qwen3-Next introduces a scalable, sparse MoE approach with 80B parameters but only 3B active per token, paired with 10-active-expert routing per token.
- Long-context capability (>260K tokens) is paired with near-linear scaling of memory and compute, enabled by Gated Delta Networks and a mix of attention types.
- High-speed interconnects (NVLink 1.8 TB/s) and GPU platforms (Hopper/Blackwell) are leveraged to minimize routing latency and maximize throughput.
- Open access through NVIDIA NIM, SGLang, and vLLM enables researchers and enterprises to deploy, test, and extend the technology; builds and tutorials are available via build.nvidia.com and associated resources.
- The work exemplifies NVIDIA’s broader commitment to open source, aiming to extend accessibility, transparency, and collaboration in AI model development.
FAQ
- What is the Qwen3-Next 80B-A3B-Thinking model?
  It is an open-source 80B-parameter model built on a hybrid Mixture of Experts (MoE) architecture, enabling a sparse activation pattern (about 3B active parameters per token) and long-context processing.
- How does the MoE architecture work in these models?
  The MoE module features 512 routed experts plus 1 shared expert, with 10 experts activated per token. This routing distributes computation across experts to balance capacity and efficiency.
- What enables long-context processing and efficiency in Qwen3-Next?
  The models are designed for context lengths over 260K tokens, use Gated Delta Networks to maintain focus over long sequences, and apply a mix of GQA and linear attention across 48 layers to optimize performance.
- How can developers access and deploy these models?
  Developers can test the models on build.nvidia.com, use NVIDIA NIM microservices for deployment, and explore self-hosted options with support from SGLang and vLLM tooling.
- What is the broader significance of this work?
  It demonstrates a path to combining large-scale model capability with practical efficiency, fosters open-source collaboration, and provides concrete deployment pathways for researchers and enterprises on NVIDIA hardware.