Scaling AI Inference with NVIDIA NVLink Fusion: Scale-Up Fabric for Custom CPUs and XPUs
Source: developer.nvidia.com

Sources: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion

TL;DR

  • The growth of AI model complexity has driven parameter counts from millions to trillions, necessitating GPU clusters and large-scale parallelization for efficient inference.
  • NVIDIA NVLink Fusion provides access to production-proven NVLink scale-up technologies for hyperscalers, enabling custom CPUs and XPUs to join the NVLink scale-up fabric via open standards and modular rack solutions.
  • The fifth-generation NVLink (2024) connects 72 GPUs with all-to-all communication at 1,800 GB/s per GPU, delivering about 130 TB/s of aggregate bandwidth, roughly 800x the first NVLink generation.
  • NCCL, NVIDIA’s open-source collectives library, remains central to high-performance GPU-to-GPU communication across scale-up and scale-out topologies and is integrated into major frameworks.
  • The NVLink Fusion ecosystem includes a robust rack-scale Open Compute Project (OCP) MGX solution, bridges to UCIe for XPUs, and CUDA-X software to accelerate AI workloads on customized hardware stacks.

Context and background

The rapid rise in AI model complexity has driven parameter counts from millions to trillions, demanding unprecedented computational resources. Inference workloads now rely on large-scale parallelization strategies, including tensor, pipeline, and expert (Mixture-of-Experts) parallelism, to deliver the performance required by contemporary models. This evolution pushes AI systems toward memory-semantic scale-up compute fabrics that present a unified pool of compute and memory across a growing domain of GPUs.

NVIDIA introduced NVLink in 2016 to overcome PCIe limitations in high-performance computing and AI workloads, enabling faster GPU-to-GPU communication and a unified memory space. In 2018, NVLink Switch technology achieved 300 GB/s all-to-all bandwidth in an 8-GPU topology, setting the stage for scale-up fabrics in multi-GPU systems. The third-generation NVLink Switch added SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), which offloads reductions into the switch to improve effective bandwidth and cut collective-operation latency. By 2024, the fifth-generation NVLink delivered 72-GPU all-to-all communication at 1,800 GB/s per GPU, yielding about 130 TB/s of aggregate bandwidth, an 800-fold improvement over the first generation. NVIDIA continues to advance these capabilities on an annual cadence to keep pace with growing AI compute needs.

A critical enabling factor is the NVIDIA Collective Communication Library (NCCL), an open-source library designed to accelerate GPU-to-GPU communication in single-node and multi-node topologies. NCCL supports both scale-up and scale-out configurations, includes topology awareness and optimizations, and is integrated into major deep learning frameworks. Together, NVLink hardware and NCCL software provide the foundation for high-performance inference across diverse AI workloads.
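
As a back-of-envelope check on the figures above, the short C++ sketch below reproduces the roughly 130 TB/s aggregate number from the per-GPU rate and estimates per-GPU traffic for one tensor-parallel all-reduce. The model dimensions (token count, hidden size, fp16 elements) and the ring all-reduce cost formula are illustrative assumptions, not figures from the article.

```cpp
// Back-of-envelope arithmetic only; the model dimensions below are
// illustrative assumptions, not values taken from the article.
#include <cstdio>

int main() {
    // Fifth-generation NVLink figures cited in the article.
    const double gpus = 72.0;
    const double per_gpu_gbps = 1800.0;                // GB/s per GPU, all-to-all
    const double aggregate_tbps = gpus * per_gpu_gbps / 1000.0;
    std::printf("aggregate bandwidth ~= %.1f TB/s\n", aggregate_tbps);  // ~129.6

    // Assumed tensor-parallel layer: bytes each GPU moves in one ring
    // all-reduce of an activation tensor, i.e. 2*(N-1)/N * payload.
    const double tokens = 8192.0;                      // assumed batch * sequence length
    const double hidden = 12288.0;                     // assumed hidden dimension
    const double bytes_per_elem = 2.0;                 // fp16
    const double payload = tokens * hidden * bytes_per_elem;
    const double per_gpu_bytes = 2.0 * (gpus - 1.0) / gpus * payload;
    std::printf("per-GPU traffic per all-reduce ~= %.2f GB\n", per_gpu_bytes / 1e9);
    return 0;
}
```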

What’s new

NVLink Fusion expands access to NVIDIA’s production-proven scale-up fabric technologies by giving hyperscalers modular, open, and customizable paths to integration. It enables custom silicon (CPUs and XPUs) to be wired into the NVLink scale-up fabric and rack-scale architectures, enabling semi-custom AI infrastructures that can be deployed at scale. Key features include:

  • A modular, Open Compute Project (OCP) MGX rack solution that can integrate with NICs, DPUs, or scale-out switches, providing broad compatibility and rapid deployment.
  • Interfaces for XPUs via Universal Chiplet Interconnect Express (UCIe) IP, with an NVIDIA bridge chiplet to NVLink for high performance and simplified integration. UCIe’s openness gives adopters flexibility in choosing XPU options.
  • For custom CPUs, NVLink-C2C IP enables efficient connectivity to NVIDIA GPUs, allowing customers to leverage CUDA-X libraries and the broader CUDA ecosystem.
  • A robust silicon ecosystem with partners for custom silicon, CPUs, and IP technologies, supporting rapid design-in and continuous advancement.
  • Production-ready rack-scale systems, including the NVIDIA GB200 NVL72 and GB300 NVL72 platforms, that demonstrate the maturity of NVLink Fusion in real deployments.

NVLink Fusion thus gives hyperscalers access to the mature NVLink scale-up family while preserving the flexibility to tailor systems around CPUs, XPUs, or mixed configurations suited to specific inference workloads.

Why it matters (impact for developers/enterprises)

AI inference today scales not only through more GPUs but through smarter, higher-bandwidth interconnects and software ecosystems. The NVLink scale-up fabric, complemented by NCCL and CUDA-X, helps maximize throughput per watt and minimize latency across large GPU pools. By enabling custom CPUs and XPUs to participate in the NVLink fabric, enterprises can tailor compute and memory resources to the needs of modern models, including Mixture-of-Experts and test-time scaling strategies, while preserving software compatibility with established CUDA-based workflows. The 72-GPU configuration with all-to-all communication and roughly 130 TB/s of aggregate bandwidth, coupled with high-density rack-scale architectures and advanced cooling, supports a broad range of inference scenarios.

For developers, this means more efficient model serving, better latency profiles, and the ability to scale inference workloads with fewer bottlenecks at the interconnect level. For operators and data-center planners, NVLink Fusion provides a path to rack-scale, scale-up fabrics that align with production-grade Open Compute standards and established supply chains, potentially reducing bring-up time and accelerating time-to-market for bespoke AI inference stacks.

Technical details or Implementation

Hardware and interconnects

NVLink Fusion exposes core scale-up technologies including NVLink SERDES, NVLink chiplets, NVLink Switches, and all aspects of the rack-scale architecture (spine, copper cabling, power, and advanced cooling) as part of the solution. This hardware stack is designed to operate as a unified compute and memory domain, enabling high-bandwidth, low-latency GPU communication across large device counts.
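
To make the idea of a unified memory domain across GPUs concrete, here is a minimal CUDA sketch (a generic illustration, not part of NVLink Fusion itself) that probes whether each pair of visible GPUs can address one another's memory directly and enables peer access where available; on NVLink-connected devices this is the mechanism CUDA applications use for direct load/store and copy traffic between GPUs.

```cpp
// Minimal peer-access probe; illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int a = 0; a < n; ++a) {
        cudaSetDevice(a);                              // 'a' is the accessing GPU
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);        // can GPU a map GPU b's memory?
            std::printf("GPU %d -> GPU %d : peer access %s\n",
                        a, b, ok ? "available" : "not available");
            if (ok) cudaDeviceEnablePeerAccess(b, 0);  // map b into a's address space
        }
    }
    return 0;
}
```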

Interface options for CPUs and XPUs

For custom XPU configurations, NVLink Fusion uses UCIe IP to bridge XPUs to NVLink. NVIDIA provides a UCIe-to-NVLink bridge chiplet to preserve performance and simplify integration, while maintaining access to the NVLink capabilities that underpin performance gains in AI workloads. UCIe's open standard gives adopters flexibility in selecting XPU options across current and future platforms. For custom CPU configurations, NVLink-C2C IP connects NVIDIA GPUs to non-NVIDIA CPUs, enabling high-performance data movement within mixed CPU/GPU environments and giving those designs access to CUDA-X libraries as part of the CUDA platform.
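
NVLink-C2C itself is silicon IP rather than something application code calls directly, but the kind of memory-semantic CPU/GPU sharing it enables can be illustrated with CUDA managed memory. The sketch below is a generic CUDA example, not NVLink-C2C-specific code: a single allocation is written by the host, updated in place by a GPU kernel, and read back by the host without explicit copies.

```cpp
// Generic CUDA managed-memory example; illustrates shared CPU/GPU access,
// not an NVLink-C2C API (the link itself is transparent to this code).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // host writes

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // device updates in place
    cudaDeviceSynchronize();

    std::printf("data[0] = %f (expected 2.0)\n", data[0]);  // host reads the result
    cudaFree(data);
    return 0;
}
```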

Software and libraries

NVIDIA’s NCCL remains the core library for GPU communication, delivering near-theoretical bandwidth for GPU-to-GPU transfers and supporting both scale-up and scale-out topologies. NCCL has been integrated into every major deep learning framework, benefiting from a decade of development and production deployment. The software stack underpins the performance of NVLink-based configurations and is a critical component of the AI inference performance story.
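
As a concrete illustration of the NCCL API referenced above, here is a minimal single-process, multi-GPU all-reduce patterned on standard NCCL usage; the buffer size and the eight-device cap are arbitrary choices for the sketch, and error checking is omitted for brevity.

```cpp
// Minimal single-process NCCL all-reduce across all visible GPUs.
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > 8) nDev = 8;                          // arbitrary cap for this sketch

    int devs[8];
    ncclComm_t comms[8];
    float* sendbuff[8];
    float* recvbuff[8];
    cudaStream_t streams[8];
    const size_t count = 1 << 20;                    // 1M floats per GPU

    for (int i = 0; i < nDev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&sendbuff[i], count * sizeof(float));
        cudaMalloc(&recvbuff[i], count * sizeof(float));
        cudaMemset(sendbuff[i], 0, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One communicator per GPU, created in a single call for this process.
    ncclCommInitAll(comms, nDev, devs);

    // Group the per-device calls so NCCL schedules them as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < nDev; ++i) {
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
        ncclCommDestroy(comms[i]);
    }
    std::printf("all-reduce complete across %d GPU(s)\n", nDev);
    return 0;
}
```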

Rack-scale architecture and ecosystem

NVLink Fusion is designed as an MGX rack solution that can interface with NICs, DPUs, or scale-out switches, providing the ecosystem and supply-chain readiness needed for production deployment. The rack-scale architecture, including spine components and high-density interconnects, is built to support the large-scale inference workloads typical of LLMs and other contemporary AI models.

Production deployments and roadmap

NVIDIA has deployed NVLink scale-up technologies for nearly a decade, advancing through five generations of NVLink. The 2024 fifth generation achieves 1,800 GB/s of all-to-all bandwidth per GPU across 72 GPUs and about 130 TB/s of aggregate bandwidth, a substantial leap over early generations. The combination of continued hardware improvements and NCCL software optimizations aims to keep pace with the exponential growth in AI model complexity.

Key takeaways

  • NVLink Fusion extends NVLink scale-up capabilities to hyperscalers, enabling semi-custom CPU and XPU integration into a unified NVLink fabric.
  • The solution leverages modular MGX rack architectures and UCIe IP to connect XPUs, with NVLink-C2C enabling CPU-GPU connectivity for CUDA-X software ecosystems.
  • The 72-GPU NVLink topology delivers 1,800 GB/s of all-to-all bandwidth per GPU and about 130 TB/s aggregate, providing significant gains for AI inference workloads.
  • NCCL remains a central software pillar for fast GPU communication across scale-up and scale-out deployments, integrated across major frameworks.
  • The NVLink Fusion ecosystem includes production-grade rack systems (GB200 NVL72 and GB300 NVL72) and a broad partner network to accelerate time-to-market.

FAQ

  • What is NVLink Fusion?

    NVLink Fusion is NVIDIA’s approach to giving hyperscalers access to production-proven NVLink scale-up technologies, enabling custom CPUs and XPUs to interface with the NVLink fabric through modular MGX rack solutions and open interfaces like UCIe.

  • How does XPU integration work under NVLink Fusion?

    XPUs connect to the NVLink scale-up fabric via UCIe IP with a bridge chiplet provided by NVIDIA, allowing high-performance, memory-semantic interconnects while preserving access to CUDA-X libraries.

  • What role does NCCL play in these systems?

    NCCL accelerates GPU-to-GPU communication in both scale-up and scale-out configurations, supports automatic topology awareness and optimizations, and is integrated into major deep learning frameworks.

  • Why does this matter for AI inference workloads?

    The combination of high-bandwidth interconnects, flexible CPU/XPU integration, and mature software stacks enables larger, faster, and more efficient inference across complex AI models, including Mixture-of-Experts and test-time scaling scenarios.

  • Are there production-ready rack systems available?

    Yes, NVIDIA references production-volume rack systems such as GB200 NVL72 and GB300 NVL72 as part of the NVLink Fusion ecosystem.
