Schedule topology-aware workloads using Amazon SageMaker HyperPod task governance
Source: AWS Machine Learning Blog, https://aws.amazon.com/blogs/machine-learning/schedule-topology-aware-workloads-using-amazon-sagemaker-hyperpod-task-governance/

TL;DR

  • AWS announced topology-aware scheduling as a new capability of Amazon SageMaker HyperPod task governance to optimize training efficiency and network latency on Amazon EKS clusters.
  • The approach leverages EC2 topology information to place interdependent training tasks within the same network nodes and layers, reducing network hops and latency.
  • There are two primary submission methods for topology-aware workloads: annotating Kubernetes manifests with a topology annotation and using the SageMaker HyperPod CLI with topology-related flags.
  • The workflow includes confirming topology information, identifying nodes sharing network layers, and submitting topology-aware training tasks to achieve greater visibility and control over placement.
  • The post walks through practical steps, provides an example topology overview, and invites feedback from users.

Context and background

Generative AI workloads often require extensive cross-node communication across EC2 instances. In these environments, network latency depends on how instances are physically and logically placed within a data center’s hierarchical topology. AWS describes data centers as nested organizational units such as network nodes and node sets, with multiple instances per node and multiple nodes per set. Instances that share closer network nodes communicate faster because traffic traverses fewer network hops.

An EC2 instance’s topology is described by a set of network nodes, one node in each layer of the network. Data centers organize topology into layers, and shared layers indicate proximity and communication efficiency. Network topology labels enable topology-aware scheduling that prioritizes placements with optimized network communication, improving task efficiency and resource utilization. To optimize the placement of AI workloads in SageMaker HyperPod clusters, practitioners can incorporate this EC2 topology information when submitting jobs.

In this context, SageMaker HyperPod task governance extends governance capabilities to accelerated compute allocation and priority policies across teams and projects on EKS clusters. This helps administrators align resource allocation with organizational priorities, enabling faster AI innovation and reduced time-to-market by minimizing the coordination overhead of resource provisioning and replanning. For deeper guidance, AWS points to Best practices for Amazon SageMaker HyperPod task governance.
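You can inspect this hierarchy directly: EC2 exposes a DescribeInstanceTopology API for supported instance types. A minimal AWS CLI sketch, assuming you have the ec2:DescribeInstanceTopology permission (the instance ID is a placeholder):

    # Query the network topology of an instance; the response's NetworkNodes
    # list contains one network-node ID per layer, from the top of the
    # hierarchy down to the node closest to the instance.
    aws ec2 describe-instance-topology \
      --instance-ids i-1111111111example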

What’s new

This post introduces topology-aware scheduling as a capability within SageMaker HyperPod task governance. The goal is to improve training efficiency and reduce network latency by considering the physical and logical arrangement of resources in the cluster. Key points:

  • Topology-aware scheduling uses EC2 topology information to inform job placement, with instances in the same network node layers sharing faster interconnects.
  • Administrators can govern accelerated compute allocation and enforce task priority policies to improve resource utilization.
  • Data scientists interact with SageMaker HyperPod clusters to ensure capacity and permission alignment when working with GPU-backed resources.
  • The approach supports two primary submission methods, giving different teams and workflows flexibility in how they adopt topology-aware scheduling.

Two primary submission methods

  1. Kubernetes manifest annotation: modify your existing Kubernetes manifest to include a topology annotation such as kueue.x-k8s.io/podset-required-topology so that the pods in a pod set are scheduled under the same layer 3 network node (a hedged manifest sketch follows this list). To verify placement after the pods launch, inspect node allocations with kubectl get pods -n hyperpod-ns-team-a -o wide.
  2. SageMaker HyperPod CLI: submit topology-aware jobs via the HyperPod CLI using either --preferred-topology or --required-topology in your create job command. The post illustrates this with a topology-aware MNIST training job, using a placeholder for the AWS account ID (XXXXXXXXXXXX). It also notes that if you deployed new resources while following along, you should consult the Clean Up section of the SageMaker HyperPod EKS workshop to avoid unintended charges, and it emphasizes that LLM training involves significant pod-to-pod communication, for which topology-aware scheduling improves throughput and latency.
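To ground the manifest-based method, here is a minimal hedged sketch of a Kueue-managed Job carrying the annotation on its pod template. The namespace matches the verification command above; the queue name, image, replica counts, and resource limits are illustrative assumptions, not values from the original post:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: topology-aware-training
      namespace: hyperpod-ns-team-a
      labels:
        # Assumed Kueue local queue name; replace with your team's queue
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    spec:
      parallelism: 4
      completions: 4
      template:
        metadata:
          annotations:
            # Require every pod in this pod set to land on nodes that
            # share the same layer 3 network node
            kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              # Placeholder image; substitute your training container
              image: public.ecr.aws/docker/library/python:3.11-slim
              command: ["python", "-c", "print('topology-aware pod')"]
              resources:
                limits:
                  nvidia.com/gpu: 1

Once the pods are running, kubectl get pods -n hyperpod-ns-team-a -o wide should show them packed onto nodes that share a layer 3 network node.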

How to get started

To begin, you should:

  • Confirm topology information for all nodes in your cluster.
  • Run a script to identify which instances share network nodes across layers 1–3 (a shell sketch follows this list).
  • Schedule topology-aware training tasks on your cluster using either submission method.

This workflow aims to provide greater visibility and control over the placement of training instances, which can translate into more predictable performance for distributed AI workloads. The article notes that network topology labels enable these benefits and points readers to inline examples and visualizations (via Mermaid.js.org) to aid topology understanding.
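A minimal shell sketch covering the first two steps, assuming your nodes carry the topology.k8s.aws/network-node-layer-* labels discussed in this post (the label keys come from the article; the rest is illustrative):

    # Show each node's layer 1-3 network-node labels as extra columns
    kubectl get nodes \
      -L topology.k8s.aws/network-node-layer-1 \
      -L topology.k8s.aws/network-node-layer-2 \
      -L topology.k8s.aws/network-node-layer-3

    # Group nodes by their layer 3 network node to spot instances that
    # share the closest network layer
    kubectl get nodes \
      -L topology.k8s.aws/network-node-layer-3 --no-headers \
      | awk '{print $6, $1}' | sort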

Why it matters (impact for developers/enterprises)

Topology-aware scheduling addresses a core performance factor for distributed AI training: inter-node communication. For generative AI and other large-scale training tasks, reducing network hops between GPUs across nodes can meaningfully decrease training time and inference latency. By leveraging EC2 topology data within SageMaker HyperPod task governance, organizations can:

  • Improve resource utilization by aligning compute placement with network proximity.
  • Simplify governance of accelerated compute allocations across teams and projects.
  • Accelerate time-to-market for AI innovations by reducing the coordination overhead associated with resource planning and task replanning.
  • Provide data scientists with clearer visibility into where their training jobs will run, enabling more informed experimentation and optimization.

These capabilities are particularly valuable for teams deploying large distributed models that require frequent cross-node synchronization and data exchange. The governance model aims to balance performance (through topology-aware placement) with administrative control (through task priority policies and resource governance).

Technical details or Implementation

Prerequisites and setup:

  • Start by listing the node topology labels in your cluster. The label you’ll commonly encounter is topology.k8s.aws/network-node-layer-3, with sample values such as topology.k8s.aws/network-node-layer-3: nn-33333example. This reveals how instances are organized across network layers.
  • Use a script to identify which nodes share layers 1, 2, and 3 in your cluster. The output can be used to construct a flow diagram (for example, with Mermaid.js.org) to visualize cluster topology.
  • AWS provides two practical submission paths for topology-aware workloads:
    • Kubernetes manifest annotation: add the kueue.x-k8s.io/podset-required-topology annotation to your Pod or Job manifests so that pods sharing the same layer 3 network node are scheduled together.
    • SageMaker HyperPod CLI: use the HyperPod CLI with either --preferred-topology or --required-topology when creating a job. This enables programmatic, topology-informed scheduling within SageMaker HyperPod governance (see the hedged CLI sketch after this list).
  • Verifying placement: after launching pods, confirm node assignments with a command such as kubectl get pods -n hyperpod-ns-team-a -o wide, which shows the node each pod landed on.
  • The article includes an example command to start a topology-aware MNIST training job via the HyperPod CLI, in which you replace XXXXXXXXXXXX with your AWS account ID. The exact command is not reproduced here, but the example demonstrates the CLI workflow.

Ongoing considerations:

  • If you deployed new resources as part of adopting topology-aware scheduling, follow the Clean Up guidance in the SageMaker HyperPod EKS workshop to avoid unintended charges.
  • The approach is particularly relevant for large language model (LLM) training, where pod-to-pod data exchange is frequent and sensitive to network topology.
  • AWS highlights that topology-aware scheduling offers greater visibility and control over training placements, which can translate into improved efficiency and performance for AI workloads.
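For orientation only, a rough sketch of what such a CLI submission might look like. Everything here except the --preferred-topology/--required-topology flags named by the post is an assumption about the HyperPod CLI and your environment; the image URI reuses the post’s XXXXXXXXXXXX account ID placeholder, and you should confirm the exact syntax against your installed CLI version’s help output:

    # Hypothetical sketch, not the post's exact command. The subcommand and
    # all flags other than --preferred-topology are assumptions; verify
    # against your HyperPod CLI version before use.
    hyp create hyp-pytorch-job \
      --job-name mnist-topology-aware \
      --image XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/mnist-train:latest \
      --preferred-topology topology.k8s.aws/network-node-layer-3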

Practical example and visualization

The workflow includes generating a flow chart that shows how nodes relate across layers 1–3. You can visualize this topology in tools such as Mermaid.js.org to plan pod placement before submitting topology-aware tasks. The example cluster discussed in the article demonstrates how a seven-instance cluster maps to the hierarchical topology and informs decisions about shared network layers.
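As an illustration, a Mermaid flow chart for a hypothetical seven-instance cluster might look like the following; all network-node and instance IDs are invented placeholders following the nn-/i- naming shown above:

    flowchart TD
        %% Hypothetical topology; every ID below is an illustrative placeholder
        L1["nn-11111example (layer 1)"] --> L2A["nn-22222example (layer 2)"]
        L1 --> L2B["nn-22223example (layer 2)"]
        L2A --> L3A["nn-33333example (layer 3)"]
        L2A --> L3B["nn-33334example (layer 3)"]
        L2B --> L3C["nn-33335example (layer 3)"]
        L3A --> I1["i-0001"] & I2["i-0002"] & I3["i-0003"]
        L3B --> I4["i-0004"] & I5["i-0005"]
        L3C --> I6["i-0006"] & I7["i-0007"]

Instances under the same layer 3 node (for example, i-0001 through i-0003 above) are the natural candidates for a podset-required-topology placement.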

Key considerations when implementing

  • Decide between the manifest-based approach and the CLI-based approach based on your workflow and automation preferences.
  • Ensure your team has the necessary permissions to interact with the SageMaker HyperPod cluster and to annotate Kubernetes resources.
  • Plan for ongoing monitoring of topology-aware placements and performance metrics to validate gains in throughput and latency.

Key takeaways

  • SageMaker HyperPod task governance now supports topology-aware scheduling to improve AI training efficiency and network latency.
  • You can implement topology-aware scheduling via Kubernetes manifest annotations or the SageMaker HyperPod CLI, offering flexibility for different teams.
  • EC2 topology information is leveraged to place related pods within the same network layers, reducing inter-node communication costs.
  • A structured workflow includes verifying topology, identifying shared network layers, and submitting topology-aware tasks to achieve better resource utilization.
  • The approach is particularly beneficial for distributed training workloads such as LLMs, where pod-to-pod communication dominates runtime.

FAQ

  • What is SageMaker HyperPod task governance topology-aware scheduling?

    It is a capability to optimize training efficiency and network latency by scheduling workloads with awareness of the underlying EC2 topology and network layers within SageMaker HyperPod governance.

  • How can I submit topology-aware tasks?

    You can either annotate your Kubernetes manifests with a topology-aware annotation (kueue.x-k8s.io/podset-required-topology) or use the SageMaker HyperPod CLI with --preferred-topology or --required-topology when creating a job.

  • What prerequisites do I need to start?

    You should confirm topology information for all cluster nodes, run a script to identify shared network layers, and verify pod placements with kubectl commands. You may also review the example MNIST/topology scenario provided in the article.

  • Where can I find more information?

    See the AWS blog post linked in References, and explore the Mermaid visualization reference and the Kubernetes pod topology annotation (kueue.x-k8s.io/podset-required-topology) for implementation details.

References

  • Schedule topology-aware workloads using Amazon SageMaker HyperPod task governance, AWS Machine Learning Blog: https://aws.amazon.com/blogs/machine-learning/schedule-topology-aware-workloads-using-amazon-sagemaker-hyperpod-task-governance/