NVLink Fusion: Scale-up AI Inference with NVLink for Custom CPUs/XPUs
Source: NVIDIA Developer Blog, https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/
Overview
The rapid growth of AI model complexity, from millions to trillions of parameters, drives unprecedented compute needs that typically require GPU clusters. Inference workloads increasingly rely on mixture-of-experts (MoE) architectures and test-time scaling, which amplify both compute and memory demands. To meet this demand, the industry has moved toward large-scale parallelism and memory-semantic fabrics that let many GPUs operate as a unified compute and memory pool. NVIDIA NVLink Fusion extends the proven NVLink scale-up fabric into programmable, rack-scale deployments, giving hyperscalers and enterprises a path to large-domain AI inference with hardware and software co-design.

NVLink originated in 2016 to overcome PCIe limits and enable faster GPU-to-GPU communication with a unified memory space. By 2018, NVLink Switch delivered 300 GB/s of all-to-all bandwidth across an 8-GPU topology, establishing the scale-up compute fabric. Third-generation NVLink Switch added SHARP in-network reductions, improving effective bandwidth and lowering collective latency, while fifth-generation NVLink (2024) supports up to 72 GPUs with 1,800 GB/s of all-to-all bandwidth per GPU and about 130 TB/s of aggregate bandwidth, roughly 800x the first generation. NVIDIA continues to ship new NVLink generations annually to keep pace with AI model growth. Realizing this performance depends on software as much as hardware: NCCL, the NVIDIA Collective Communication Library, accelerates GPU communication, is open source, and integrates with major deep-learning frameworks via CUDA-X libraries.

NVLink Fusion broadens access to this scale-up fabric by letting custom silicon (CPUs and XPUs) integrate with the NVLink fabric and rack-scale architecture for semi-custom AI infrastructure deployments. It supports open standards and a modular Open Compute Project (OCP) MGX rack approach, allowing integration with NICs, DPUs, or scale-out switches, and it enables custom CPU or XPU configurations via UCIe IP or NVLink-C2C IP. The result is a flexible, production-ready ecosystem designed to scale AI inference across large-domain deployments while preserving memory coherence and high-bandwidth communication.

For the rack offering, NVIDIA points to production-grade systems (e.g., GB200 NVL72 and GB300 NVL72) and an ecosystem designed to reduce time-to-market. The NVLink Fusion platform relies on a robust silicon ecosystem of partners for custom silicon, CPUs, and IP, plus a data-center-ready rack solution with high-density spine networks, copper cabling, advanced cooling, and supply-chain readiness. In short, NVLink Fusion packages the core NVLink scale-up technology with a broad ecosystem to enable customized, large-scale AI inference.
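The article itself contains no code, but NCCL's role is easiest to see from a framework. The following is a minimal sketch, not taken from the source: it assumes a single node with several NVLink-connected GPUs and a PyTorch build with NCCL support, and it all-reduces a tensor across ranks while NCCL selects the fastest transport it detects (NVLink where available).

```python
# Minimal sketch: NCCL-backed all-reduce across the GPUs of one node.
# Assumes PyTorch with NCCL support; launch with:
#   torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL chooses the transport
    # (NVLink, PCIe, or network) based on the topology it detects.
    x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Every rank now holds the same summed result.
    print(f"rank {dist.get_rank()}: x[0] = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The script name and tensor size are arbitrary; the point is that the same collective call scales from one NVLink-connected node to a larger NVLink domain without code changes.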
Key features
- NVLink Fusion extends scale-up NVLink capabilities to custom CPU and XPU paths via UCIe IP and NVLink chiplets, bridging conventional CPUs/GPUs with NVLink scale-up fabrics.
- Open Compute Project (OCP) MGX rack compatibility provides a modular, production-grade rack solution that can integrate with NICs, DPUs, or scale-out switches.
- UCIe-based integration for custom XPUs and NVLink-C2C IP for custom CPU connectivity ensure high-performance, coherent memory access across heterogeneous compute elements.
- Hardware foundation includes NVLink SERDES, NVLink chiplets, and NVLink Switches, along with a rack-scale spine and copper cabling, advanced power, and liquid cooling for high-density deployments.
- The 72-GPU scale-up topology provides up to 1,800 GB/s of all-to-all bandwidth per GPU and about 130 TB/s of aggregate bandwidth, a substantial gain over earlier generations.
- NCCL (NVIDIA Collective Communication Library) remains central to achieving near-theoretical GPU-to-GPU bandwidth, with automatic topology awareness and optimization integrated into CUDA-X libraries (a simple topology check follows this list).
- The platform supports a unified compute-memory domain enabling tensor, pipeline, and expert parallelism across large GPU domains, aligning with the needs of MoE and test-time scaling.
- A broad silicon ecosystem with partners for custom silicon, CPUs, and IP provides design-in flexibility and faster time-to-market.
- The approach emphasizes production-grade, rack-scale, scale-up fabric integration with a focus on inference workloads across AI reasoning and enterprise deployments.
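Because the features above lean on NCCL's topology awareness, it helps to confirm what the topology of a given node actually looks like. The sketch below is not drawn from the source; it uses PyTorch's torch.cuda.can_device_access_peer to report which GPU pairs support direct peer access, and the comments point to nvidia-smi topo -m and NCCL_DEBUG=INFO for more detailed views of link types and chosen transports.

```python
# Sketch: report which GPU pairs support direct peer-to-peer access.
# On NVLink-connected parts, peer access typically runs over NVLink;
# `nvidia-smi topo -m` shows link types and NCCL_DEBUG=INFO shows the
# transports NCCL actually selects at runtime.
import torch

def print_peer_matrix():
    n = torch.cuda.device_count()
    print(f"visible GPUs: {n}")
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    print_peer_matrix()
```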
Common use cases
- Large-scale AI inference for models with massive parameter counts, including MoE-style architectures and test-time scaling scenarios (see the expert-parallel sketch after this list).
- Hyperscaler deployments that require multi-hundred- or multi-thousand-GPU scale-up setups where memory coherence and high-bandwidth interconnects are critical.
- LLM and other transformer-based inference workloads, where NVLink-scale fabrics help balance throughput-per-watt against latency targets.
- Custom AI pipelines that need tightly coupled CPU/XPU configurations to achieve low-latency inference across large model families.
- Scenarios where a unified pool of compute and memory is preferred to simplify orchestration across thousands of compute elements.
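Several of these use cases, MoE inference in particular, stress all-to-all bandwidth: with expert parallelism, every rank exchanges routed tokens with every other rank at each MoE layer, exactly the traffic pattern a switched NVLink fabric is built for. The sketch below is a simplified illustration, not from the source, using torch.distributed.all_to_all_single with equal splits; a real router would compute uneven split sizes per rank.

```python
# Sketch: the token shuffle at the heart of expert parallelism.
# Each rank hosts one "expert"; tokens routed to other experts are
# exchanged all-to-all, so per-layer latency tracks fabric bandwidth.
# Launch with: torchrun --nproc_per_node=<num_gpus> moe_alltoall_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()

    tokens_per_peer, hidden = 1024, 4096
    # Tokens this rank wants to send, grouped by destination rank/expert.
    send = torch.randn(world * tokens_per_peer, hidden, device="cuda")
    recv = torch.empty_like(send)

    # Equal splits for simplicity; real routers pass explicit
    # input/output split sizes because token counts are uneven.
    dist.all_to_all_single(recv, send)

    # `recv` now holds the tokens assigned to this rank's expert.
    print(f"rank {dist.get_rank()}: received {recv.shape[0]} tokens")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```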
Setup & installation
Setup and installation details are not specified in the provided source. The article describes the architectural approach and ecosystem rather than a step-by-step deployment guide. See the cited NVIDIA blog for broader context and model-driven guidance: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/.
Quick start
The source frames the capabilities and architectural patterns rather than a runnable quick-start guide. A minimal practical path would involve engaging with NVIDIA’s NVLink Fusion ecosystem to align a rack-scale solution with a given CPU/XPU strategy, but no runnable steps or sample code are provided in the article. For developers, the immediate takeaway is to consider how a semi-custom CPU/XPU integration, connected via NVLink Fusion, could fit into a scalable inference fabric. See the original article for context on capabilities, performance targets, and ecosystem components: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/.
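In the absence of an official quick start, the nearest hands-on exercise is to run a small tensor-parallel operation on an existing multi-GPU node, since that is the pattern a larger NVLink domain scales up. The sketch below is hypothetical and assumes PyTorch with the NCCL backend: it shards one weight matrix column-wise across ranks and all-gathers the partial outputs.

```python
# Sketch: column-parallel linear layer, the basic move of tensor
# parallelism. Each rank holds a shard of the weight matrix and the
# partial outputs are all-gathered over the interconnect.
# Launch with: torchrun --nproc_per_node=<num_gpus> tp_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world, rank = dist.get_world_size(), dist.get_rank()

    batch, d_in, d_out = 8, 1024, 4096
    assert d_out % world == 0, "output dim must divide evenly across ranks"
    shard = d_out // world

    torch.manual_seed(0)                        # identical input on every rank
    x = torch.randn(batch, d_in, device="cuda")
    torch.manual_seed(rank)                     # a different weight shard per rank
    w = torch.randn(d_in, shard, device="cuda")

    local_out = x @ w                           # [batch, shard]
    gathered = [torch.empty_like(local_out) for _ in range(world)]
    dist.all_gather(gathered, local_out)
    full_out = torch.cat(gathered, dim=1)       # [batch, d_out]

    print(f"rank {rank}: full output shape {tuple(full_out.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```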
Pros and cons
- Pros
- Industry-scale interconnect with very high all-to-all bandwidth (up to 1,800 GB/s per GPU) and aggregate throughput (roughly 130 TB/s across a 72-GPU domain).
- Ability to integrate custom CPUs and XPUs into the NVLink scale-up fabric, enabling semi-custom AI infrastructure deployments.
- Open standards and MGX rack alignment support rapid design-in with a broad ecosystem of silicon partners and IP providers.
- Strong software acceleration via NCCL, with automatic topology awareness and framework integration (CUDA-X libraries).
- Unified memory space and high-bandwidth communication across large GPU domains to support tensor, pipeline, and expert parallelism.
- Cons
- The source does not enumerate downsides or trade-offs; formal pros/cons require evaluation in a given deployment context.
- Deployment involves a specialized rack-scale solution and ecosystem, which may imply higher upfront design effort and integration work.
- Not all workloads may require such scale-up fabric, so the value proposition depends on model size, parallelism strategy, and latency/throughput targets.
Alternatives (brief comparisons)
| Alternative interconnect path | How it differs from NVLink Fusion | Notes from the source |
|---|---|---|
| PCIe-based interconnects | NVLink originally arose to overcome PCIe limitations, delivering higher bandwidth and a unified memory space | PCIe was the baseline technology in earlier systems; NVLink provides higher bandwidth and memory coherence for GPU interconnects |
| NVLink scale-up without Fusion | Traditional NVLink scale-up fabrics across GPUs and NVLink Switches | Fusion expands access to scale-up technology via modular rack integration and CPU/XPU interfaces |
| CPU/GPU direct integration with NVLink-C2C | Connectivity between NVIDIA GPUs and custom CPUs via NVLink-C2C IP | Useful for optimized CPU-to-GPU paths in semi-custom configurations |
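A rough way to feel the difference between these paths is to time a device-to-device copy between two GPUs: NVLink-connected pairs typically sustain far higher bandwidth than PCIe-only pairs. The microbenchmark below is a sketch under local assumptions (two visible GPUs, PyTorch installed), not a methodology from the source, and the number it prints is only indicative.

```python
# Rough sketch: measure GPU 0 -> GPU 1 copy bandwidth. NVLink-connected
# pairs should report much higher numbers than PCIe-only pairs.
import time
import torch

def d2d_bandwidth_gbps(src_dev=0, dst_dev=1, size_mb=256, iters=20):
    """Time repeated GPU-to-GPU copies and return approximate GB/s."""
    n_floats = size_mb * 1024 * 1024 // 4       # fp32 elements
    src = torch.randn(n_floats, device=f"cuda:{src_dev}")
    dst = torch.empty(n_floats, device=f"cuda:{dst_dev}")

    dst.copy_(src)                              # warm-up copy
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)

    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)
    elapsed = time.perf_counter() - t0

    return (size_mb / 1024) * iters / elapsed   # GB copied / seconds

if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        print(f"GPU0 -> GPU1: ~{d2d_bandwidth_gbps():.1f} GB/s")
    else:
        print("This sketch needs at least two visible GPUs.")
```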
Pricing or License
Not specified in the source. The article discusses technology capabilities, ecosystem, and rack-scale deployment concepts rather than licensing terms or pricing.
References
- NVIDIA blog: Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion. https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/