Connecting Distributed Data Centers into Large AI Factories with Scale-Across Networking
Source: developer.nvidia.com


Source: https://developer.nvidia.com/blog/how-to-connect-distributed-data-centers-into-large-ai-factories-with-scale-across-networking/ (NVIDIA Developer Blog)

TL;DR

  • Spectrum-XGS Ethernet enables scale-across networking to connect distributed data centers into a single AI factory over long distances (beyond 500 meters).
  • It uses NVIDIA Spectrum-X Ethernet platform hardware (Spectrum-X switches and ConnectX-8 SuperNICs) with telemetry-based congestion control and distance-aware adaptive routing to minimize latency.
  • In NCCL tests at 10 km, Spectrum-XGS delivers up to 1.9x higher all-reduce bandwidth than off-the-shelf Ethernet, especially for large messages.
  • The technology unifies data centers regardless of proximity, enhancing fungibility of AI infrastructure and enabling large-scale single-job training and disaggregated inference.
  • It addresses latency and jitter issues associated with deep-buffer long-haul Ethernet, providing predictable performance for synchronous AI workloads.

Context and background

AI scaling is incredibly complex, and new techniques in training and inference continually demand more of the data center. While data center capabilities are scaling quickly, data center infrastructure is subject to fundamental physical limitations that do not constrain algorithms and models. Power availability, cooling capacity, and space constraints place limits on the physical footprint of an AI factory. To continue growing, new data centers are built, and connectivity over distance becomes a factor in pooling these resources together so they can work in tandem on a single training or disaggregated inference workload.

Traditionally, when connecting data centers with long-haul Ethernet built from off-the-shelf merchant silicon, the principal objective was to ensure that data successfully reached its destination. Because distances can be long and latencies high, the potential for congestion is also high, and its impact can be extreme. To mitigate this challenge and prevent dropped packets, off-the-shelf Ethernet vendors employ deep packet buffers capable of absorbing large bursts of network traffic. While these deep-buffer switches work well for long-haul service providers and telecoms, they introduce problems for AI. Switches with deep buffers inherently suffer from higher latencies, and when a buffer starts to fill, it must “drain.” For AI workloads, this occurrence is unpredictable, causing a large amount of jitter, or variance in data delivery. The high latency and unpredictability of this shock-absorber technique become problematic for training and disaggregated inference, which are synchronous in nature and require predictable performance from the network.

This post explains how NVIDIA Spectrum-XGS Ethernet for scale-across networking enables inter-data center connectivity with the high performance needed for AI.
Scale-across networking is a new category of AI compute fabric connectivity that can be thought of as a new dimension, orthogonal to the existing options of scale-up and scale-out. With Spectrum-XGS Ethernet for scale-across networking, multiple data centers of varying sizes and distances can be unified as one large AI factory. For the first time, the network can deliver the performance needed for large-scale, single-job AI training and inference across geographically separated data centers.

Spectrum-XGS Ethernet is a new technology addition to the NVIDIA Spectrum-X Ethernet platform. It is based on the same hardware combination of Spectrum-X Ethernet switches and ConnectX-8 SuperNICs, and leverages the same stack of software and libraries used for scale-out connectivity within the data center. With Spectrum-XGS Ethernet, the connectivity is between AI factories over long distances, that is, beyond 500 meters. This could mean connectivity between buildings in a campus, or over tens or hundreds of miles, across cities or even states and countries. To make scale-across connectivity feasible, the algorithms responsible for ensuring high effective bandwidth and performance isolation had to evolve.

One of the challenges of moving data across long distances is increased latency, even for data traversing an optical fiber in the form of light. Light propagates through the glass strands with a delay of about 5 nanoseconds per meter, so traveling 1 kilometer takes 5 microseconds. These numbers may seem small in absolute terms, but for GPU-to-GPU communication, every microsecond counts.

Spectrum-XGS Ethernet features modified telemetry-based congestion control and adaptive routing algorithms that are optimized around the distance between communicating devices. Whenever a connection is initiated, the network notes whether the two devices are together inside the data center or not.
This helps the switch choose the best load-balancing approach for adaptive routing, and informs the SuperNIC how to pace its injection rate for congestion control. At the network level, this enables Spectrum-XGS Ethernet to handle communications holistically without incurring additional latency.

Some of the key benefits Spectrum-XGS Ethernet brings to scale-across networking include:
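To put the latency figures above in concrete terms, here is a minimal sketch (not NVIDIA's implementation) of the arithmetic that makes distance matter for congestion control: the post's ~5 ns/m fiber propagation delay, and the resulting bandwidth-delay product, i.e. how much data a sender must keep in flight to fill the pipe. The 400 Gb/s link rate is an assumption for illustration, not a figure from the post.

```python
# Illustrative sketch: why distance changes congestion-control behavior.
# The post states light in fiber propagates at ~5 ns per meter, so 1 km
# adds ~5 us of one-way latency.

NS_PER_METER = 5  # approximate propagation delay in optical fiber

def one_way_latency_us(distance_m: float) -> float:
    """One-way fiber propagation delay in microseconds."""
    return distance_m * NS_PER_METER / 1000.0

def bandwidth_delay_product_bytes(link_gbps: float, rtt_us: float) -> float:
    """Bytes a sender must keep in flight to saturate the link (BDP).
    link_gbps is an assumed port speed, e.g. 400 Gb/s."""
    return link_gbps * 1e9 / 8 * rtt_us * 1e-6

# A 10 km inter-site link, matching the NCCL test distance in this post:
lat = one_way_latency_us(10_000)                   # 50.0 us one way
bdp = bandwidth_delay_product_bytes(400, 2 * lat)  # ~5 MB in flight
print(f"{lat:.1f} us one-way, {bdp / 1e6:.1f} MB in flight at 400 Gb/s")
```

At 10 km the round trip alone is 100 µs, so a NIC pacing its injection rate as if its peer were a few racks away would either stall the pipe or overrun intermediate buffers, which is why the distance-aware tuning described above matters.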

  • To show the impact of NVIDIA Spectrum-XGS Ethernet on scale-across performance, NVIDIA engineers ran NCCL primitives across multiple sites at a distance of 10 km and compared the results to off-the-shelf Ethernet. The results were significant: NVIDIA Spectrum-XGS Ethernet delivers up to 1.9x higher NCCL all-reduce bandwidth than off-the-shelf Ethernet. The greatest speedup occurs with larger message sizes, which are the most common in AI training workloads. These improvements to NCCL performance translate into faster job completion times for AI applications.
  • Spectrum-XGS Ethernet enhances the fungibility of AI infrastructure. By introducing a technology that enables data centers to communicate over any distance without performance degradation, Spectrum-XGS Ethernet creates a common architecture shared between scale-out and scale-across networking. Ethernet data centers built on Spectrum-XGS Ethernet can readily be combined to act as one, regardless of proximity. This enables mission-critical AI infrastructure to pool resources and consistently deliver value for advanced AI workloads.

To learn more about the technical innovations underpinning NVIDIA Spectrum-X Ethernet, see NVIDIA Spectrum-X Network Platform Architecture.
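The observation that larger messages see the biggest speedup can be illustrated with the standard ring all-reduce cost model; this is a generic textbook sketch, not NVIDIA's benchmark, and the rank count, per-step latency, and bandwidth figures below are assumptions for illustration. With n ranks, ring all-reduce takes 2(n-1) steps, each paying a fixed latency term plus the time to move a chunk of size/n bytes.

```python
# Hedged cost-model sketch (standard ring all-reduce analysis): why the
# fixed per-step latency of a long-haul link is amortized by large messages.

def ring_allreduce_time_s(msg_bytes: float, n_ranks: int,
                          alpha_s: float, bw_bytes_per_s: float) -> float:
    """Time for a ring all-reduce: 2*(n-1) steps, each paying latency
    alpha_s plus the transfer time of a msg_bytes/n_ranks chunk."""
    steps = 2 * (n_ranks - 1)
    return steps * (alpha_s + (msg_bytes / n_ranks) / bw_bytes_per_s)

# Assumed figures: 8 ranks, 100 us per-step latency (a 10 km round trip),
# 50 GB/s effective link bandwidth.
for size in (1e6, 1e9):  # 1 MB vs 1 GB message
    t = ring_allreduce_time_s(size, 8, 100e-6, 50e9)
    latency_share = (2 * 7 * 100e-6) / t
    print(f"{size / 1e6:>7.0f} MB: {t * 1e3:8.3f} ms, "
          f"latency share {latency_share:5.1%}")
```

Under these assumed numbers the fixed latency dominates the 1 MB transfer but is a small fraction of the 1 GB one, which is consistent with the post's observation that large messages, common in AI training, benefit most from a lower-latency scale-across fabric.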
