NVLink Fusion: Scale-up AI Inference with NVLink for Custom CPUs/XPUs
Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/
Overview
The rapid growth of AI model complexity, from millions to trillions of parameters, drives unprecedented compute needs that typically require GPU clusters. Inference workloads increasingly rely on mixture-of-experts (MoE) architectures and test-time scaling, which amplify both compute and memory demands. To meet this demand, the industry has moved toward large-scale parallelism and memory-semantic fabrics that let many GPUs operate as a unified compute and memory pool. NVIDIA NVLink Fusion extends the proven NVLink scale-up fabric into programmable, rack-scale deployments, giving hyperscalers and enterprises a path to large-domain AI inference with hardware and software co-design.

NVLink originated in 2016 to overcome PCIe limits and enable faster GPU-to-GPU communication with a unified memory space. By 2018, NVLink Switch delivered 300 GB/s of all-to-all bandwidth across an 8-GPU topology, establishing the scale-up compute fabric. Third-generation NVLink Switch added SHARP in-network reductions, improving effective bandwidth and lowering collective latency, while fifth-generation NVLink (2024) supports up to 72 GPUs with 1,800 GB/s of all-to-all bandwidth per GPU and about 130 TB/s of aggregate bandwidth, roughly 800x the first generation. NVIDIA continues to ship new NVLink generations annually to keep pace with AI model growth. Realizing this performance depends on software as much as hardware: NCCL, the NVIDIA Collective Communication Library, accelerates GPU communication, is open source, and integrates with major deep-learning frameworks via CUDA-X libraries.

NVLink Fusion broadens access to this scale-up fabric by letting custom silicon (CPUs and XPUs) integrate with the NVLink fabric and rack-scale architecture for semi-custom AI infrastructure deployments. It supports open standards and a modular Open Compute Project (OCP) MGX rack approach, allowing integration with NICs, DPUs, or scale-out switches, and it enables custom CPU or XPU configurations via UCIe IP or NVLink-C2C IP. The result is a flexible, production-ready ecosystem designed to scale AI inference across large-domain deployments while preserving memory coherence and high-bandwidth communication.

For the rack offering, NVIDIA points to production-grade systems (e.g., GB200 NVL72 and GB300 NVL72) and an ecosystem designed to reduce time-to-market. The NVLink Fusion platform relies on a robust silicon ecosystem of partners for custom silicon, CPUs, and IP, plus a data-center-ready rack solution with high-density spine networks, copper cabling, advanced cooling, and supply-chain readiness. In short, NVLink Fusion packages the core NVLink scale-up technology with a broad ecosystem to enable customized, large-scale AI inference.
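The article itself contains no code, but NCCL's role is easiest to see from a framework. The following is a minimal sketch, not taken from the source: it assumes a single node with several NVLink-connected GPUs and a PyTorch build with NCCL support, and it all-reduces a tensor across ranks while NCCL selects the fastest transport it detects (NVLink where available).

```python
# Minimal sketch: NCCL-backed all-reduce across the GPUs of one node.
# Assumes PyTorch with NCCL support; launch with:
#   torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL chooses the transport
    # (NVLink, PCIe, or network) based on the topology it detects.
    x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Every rank now holds the same summed result.
    print(f"rank {dist.get_rank()}: x[0] = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The script name and tensor size are arbitrary; the point is that the same collective call scales from one NVLink-connected node to a larger NVLink domain without code changes.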
Key features
- NVLink Fusion extends scale-up NVLink capabilities to custom CPU and XPU paths via UCIe IP and NVLink chiplets, bridging conventional CPUs/GPUs with NVLink scale-up fabrics.
- Open Compute Project (OCP) MGX rack compatibility provides a modular, production-grade rack solution that can integrate with NICs, DPUs, or scale-out switches.
- UCIe-based integration for custom XPUs and NVLink-C2C IP for custom CPU connectivity ensure high-performance, coherent memory access across heterogeneous compute elements.
- Hardware foundation includes NVLink SERDES, NVLink chiplets, and NVLink Switches, along with a rack-scale spine and copper cabling, advanced power, and liquid cooling for high-density deployments.
- The 72-GPU scale-up topology provides up to 1,800 GB/s of all-to-all bandwidth per GPU and about 130 TB/s of aggregate bandwidth, a substantial gain over earlier generations.
- NCCL (NVIDIA Collective Communication Library) remains central to achieving near-theoretical GPU-to-GPU bandwidth, with automatic topology awareness and optimization integrated into CUDA-X libraries (a simple topology check follows this list).
- The platform supports a unified compute-memory domain enabling tensor, pipeline, and expert parallelism across large GPU domains, aligning with the needs of MoE and test-time scaling.
- A broad silicon ecosystem with partners for custom silicon, CPUs, and IP provides design-in flexibility and faster time-to-market.
- The approach emphasizes production-grade, rack-scale, scale-up fabric integration with a focus on inference workloads across AI reasoning and enterprise deployments.
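Because the features above lean on NCCL's topology awareness, it helps to confirm what the topology of a given node actually looks like. The sketch below is not drawn from the source; it uses PyTorch's torch.cuda.can_device_access_peer to report which GPU pairs support direct peer access, and the comments point to nvidia-smi topo -m and NCCL_DEBUG=INFO for more detailed views of link types and chosen transports.

```python
# Sketch: report which GPU pairs support direct peer-to-peer access.
# On NVLink-connected parts, peer access typically runs over NVLink;
# `nvidia-smi topo -m` shows link types and NCCL_DEBUG=INFO shows the
# transports NCCL actually selects at runtime.
import torch

def print_peer_matrix():
    n = torch.cuda.device_count()
    print(f"visible GPUs: {n}")
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    print_peer_matrix()
```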
Common use cases
- Large-scale AI inference for models with massive parameter counts, including MoE-style architectures and test-time scaling scenarios (see the expert-parallel sketch after this list).
- Hyperscaler deployments that require multi-hundred- or multi-thousand-GPU scale-up setups where memory coherence and high-bandwidth interconnects are critical.
- LLM and other transformer-based inference workloads, where NVLink-scale fabrics help balance throughput-per-watt against latency targets.
- Custom AI pipelines that need tightly coupled CPU/XPU configurations to achieve low-latency inference across large model families.
- Scenarios where a unified pool of compute and memory is preferred to simplify orchestration across thousands of compute elements.
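Several of these use cases, MoE inference in particular, stress all-to-all bandwidth: with expert parallelism, every rank exchanges routed tokens with every other rank at each MoE layer, exactly the traffic pattern a switched NVLink fabric is built for. The sketch below is a simplified illustration, not from the source, using torch.distributed.all_to_all_single with equal splits; a real router would compute uneven split sizes per rank.

```python
# Sketch: the token shuffle at the heart of expert parallelism.
# Each rank hosts one "expert"; tokens routed to other experts are
# exchanged all-to-all, so per-layer latency tracks fabric bandwidth.
# Launch with: torchrun --nproc_per_node=<num_gpus> moe_alltoall_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()

    tokens_per_peer, hidden = 1024, 4096
    # Tokens this rank wants to send, grouped by destination rank/expert.
    send = torch.randn(world * tokens_per_peer, hidden, device="cuda")
    recv = torch.empty_like(send)

    # Equal splits for simplicity; real routers pass explicit
    # input/output split sizes because token counts are uneven.
    dist.all_to_all_single(recv, send)

    # `recv` now holds the tokens assigned to this rank's expert.
    print(f"rank {dist.get_rank()}: received {recv.shape[0]} tokens")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```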
Setup & installation
Setup and installation details are not specified in the provided source. The article describes the architectural approach and ecosystem rather than a step-by-step deployment guide. See the cited NVIDIA blog for broader context and model-driven guidance: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/.
Quick start
The source frames the capabilities and architectural patterns rather than a runnable quick-start guide. A minimal practical path would involve engaging with NVIDIA’s NVLink Fusion ecosystem to align a rack-scale solution with a given CPU/XPU strategy, but no runnable steps or sample code are provided in the article. For developers, the immediate takeaway is to consider how a semi-custom CPU/XPU integration, connected via NVLink Fusion, could fit into a scalable inference fabric. See the original article for context on capabilities, performance targets, and ecosystem components: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/.
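In the absence of an official quick start, the nearest hands-on exercise is to run a small tensor-parallel operation on an existing multi-GPU node, since that is the pattern a larger NVLink domain scales up. The sketch below is hypothetical and assumes PyTorch with the NCCL backend: it shards one weight matrix column-wise across ranks and all-gathers the partial outputs.

```python
# Sketch: column-parallel linear layer, the basic move of tensor
# parallelism. Each rank holds a shard of the weight matrix and the
# partial outputs are all-gathered over the interconnect.
# Launch with: torchrun --nproc_per_node=<num_gpus> tp_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world, rank = dist.get_world_size(), dist.get_rank()

    batch, d_in, d_out = 8, 1024, 4096
    assert d_out % world == 0, "output dim must divide evenly across ranks"
    shard = d_out // world

    torch.manual_seed(0)                        # identical input on every rank
    x = torch.randn(batch, d_in, device="cuda")
    torch.manual_seed(rank)                     # a different weight shard per rank
    w = torch.randn(d_in, shard, device="cuda")

    local_out = x @ w                           # [batch, shard]
    gathered = [torch.empty_like(local_out) for _ in range(world)]
    dist.all_gather(gathered, local_out)
    full_out = torch.cat(gathered, dim=1)       # [batch, d_out]

    print(f"rank {rank}: full output shape {tuple(full_out.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```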
Pros and cons
- Pros
- Industry-scale interconnect with very high all-to-all bandwidth (up to 1,800 GB/s per GPU) and aggregate throughput (roughly 130 TB/s across a 72-GPU domain).
- Ability to integrate custom CPUs and XPUs into the NVLink scale-up fabric, enabling semi-custom AI infrastructure deployments.
- Open standards and MGX rack alignment support rapid design-in with a broad ecosystem of silicon partners and IP providers.
- Strong software acceleration via NCCL, with automatic topology awareness and framework integration (CUDA-X libraries).
- Unified memory space and high-bandwidth communication across large GPU domains to support tensor, pipeline, and expert parallelism.
- Cons
- The source does not enumerate downsides or trade-offs; formal pros/cons require evaluation in a given deployment context.
- Deployment involves a specialized rack-scale solution and ecosystem, which may imply higher upfront design effort and integration work.
- Not all workloads may require such scale-up fabric, so the value proposition depends on model size, parallelism strategy, and latency/throughput targets.
Alternatives (brief comparisons)
| Alternative interconnect path | How it differs from NVLink Fusion | Notes from the source |
|---|---|---|
| PCIe-based interconnects | NVLink originally arose to overcome PCIe limitations, delivering higher bandwidth and a unified memory space | PCIe was the baseline technology in earlier systems; NVLink provides higher bandwidth and memory coherence for GPU interconnects |
| NVLink scale-up without Fusion | Traditional NVLink scale-up fabrics across GPUs and NVLink Switches | Fusion expands access to scale-up technology via modular rack integration and CPU/XPU interfaces |
| CPU/GPU direct integration with NVLink-C2C | Connectivity between NVIDIA GPUs and custom CPUs via NVLink-C2C IP | Useful for optimized CPU-to-GPU paths in semi-custom configurations |
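A rough way to feel the difference between these paths is to time a device-to-device copy between two GPUs: NVLink-connected pairs typically sustain far higher bandwidth than PCIe-only pairs. The microbenchmark below is a sketch under local assumptions (two visible GPUs, PyTorch installed), not a methodology from the source, and the number it prints is only indicative.

```python
# Rough sketch: measure GPU 0 -> GPU 1 copy bandwidth. NVLink-connected
# pairs should report much higher numbers than PCIe-only pairs.
import time
import torch

def d2d_bandwidth_gbps(src_dev=0, dst_dev=1, size_mb=256, iters=20):
    """Time repeated GPU-to-GPU copies and return approximate GB/s."""
    n_floats = size_mb * 1024 * 1024 // 4       # fp32 elements
    src = torch.randn(n_floats, device=f"cuda:{src_dev}")
    dst = torch.empty(n_floats, device=f"cuda:{dst_dev}")

    dst.copy_(src)                              # warm-up copy
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)

    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)
    elapsed = time.perf_counter() - t0

    return (size_mb / 1024) * iters / elapsed   # GB copied / seconds

if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        print(f"GPU0 -> GPU1: ~{d2d_bandwidth_gbps():.1f} GB/s")
    else:
        print("This sketch needs at least two visible GPUs.")
```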
Pricing or License
Not specified in the source. The article discusses technology capabilities, ecosystem, and rack-scale deployment concepts rather than licensing terms or pricing.
References
- NVIDIA blog: Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion. https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/