Announcing the new cluster creation experience for Amazon SageMaker HyperPod
Source: aws.amazon.com

Sources: https://aws.amazon.com/blogs/machine-learning/announcing-the-new-cluster-creation-experience-for-amazon-sagemaker-hyperpod/, AWS ML Blog

TL;DR

  • SageMaker HyperPod now offers a one-click, validated cluster creation experience that provisions prerequisite AWS resources and applies prescriptive defaults automatically.
  • Two deployment options are available on the AWS Management Console for clusters orchestrated by Slurm or Amazon EKS: quick setup and custom setup.
  • The deployment creates a CloudFormation stack to deploy the cluster and supporting resources, enabling IaC and consistent deployments across environments.
  • Quick setup emphasizes automatic defaults, infrastructure provisioning, and automatic instance recovery, while custom setup provides granular control and flexibility for advanced configurations.
  • The solution supports large-scale AI workloads with high-performance networking and storage, including EFA and FSx for Lustre, and offers health checks and continuous provisioning features.

For more details, see the official AWS blog post, "Announcing the new cluster creation experience for Amazon SageMaker HyperPod."

Context and background

Amazon SageMaker HyperPod enables distributed training and inference across clusters with hundreds or thousands of AI accelerators, using orchestration via Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). Previously, setting up a HyperPod cluster required configuring multiple prerequisite AWS resources—such as a VPC, an S3 bucket, IAM roles, and other components—in a multi-step process prone to misconfigurations. The new cluster creation experience changes this by enabling one-click cluster creation with prescriptive defaults that are automatically applied, reducing manual touchpoints and the potential for mistakes. HyperPod’s deployment options appear in the AWS Management Console alongside SageMaker AI controls, offering two paths: quick setup and custom setup. Each path ultimately creates a CloudFormation stack to provision the cluster and its supporting resources, empowering declarative infrastructure as code (IaC) that can be reused and versioned across environments. The approach aligns with best practices for repeatable, auditable cloud deployments.

What’s new

The primary enhancements center on a validated, one-click experience for building HyperPod clusters, including the necessary prerequisite resources. The two deployment options are designed to cover common use cases:

  • Quick setup: Uses prescriptive defaults for instance groups, networking, orchestration, lifecycle configuration, permissions, and storage. It also enables automatic instance recovery for unhealthy or unresponsive nodes.
  • Custom setup: Offers granular configuration across the same dimensions as quick setup, with the ability to tailor networking, orchestration, storage, and authorization to fit specialized requirements.

Key infrastructure elements created or configured during the process include:

  • A new VPC with subnets spread across Availability Zones, including a public /24 subnet for internet access via NAT, a private /24 subnet for EKS control plane communications, and a /16 private subnet to accommodate large capacity for accelerated instances.
  • A new security group configured for Elastic Fabric Adapter (EFA) and FSx for Lustre traffic.
  • An Amazon EKS cluster with the latest supported Kubernetes version, with operators and plugins enabled (EFA, Neuron, NVIDIA device plugins), health monitoring agent (HMA), Kubeflow training operators, and the SageMaker HyperPod inference operator.
  • A new S3 bucket to store default lifecycle scripts and a new IAM role with permissions required by the SageMaker HyperPod cluster.
  • A new FSx for Lustre file system for high-performance data storage and retrieval.

For those who prefer to reuse existing resources, the custom setup allows you to reference an existing VPC, security group, or EKS cluster, and to connect to an existing FSx for Lustre file system. You can also specify a custom CIDR for the VPC and target specific Availability Zones for subnet creation.
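
The subnet sizes listed above can be sanity-checked with a little address arithmetic. The following is a minimal sketch using Python's standard `ipaddress` module. The CIDR blocks and the figure of 32 IPs per node are hypothetical values chosen for illustration, not the defaults the console actually assigns. The only AWS-specific fact it relies on is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses AWS can assign inside a subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def max_nodes(cidr: str, ips_per_node: int) -> int:
    """Return how many nodes fit if each node consumes ips_per_node addresses."""
    return usable_ips(cidr) // ips_per_node


# Hypothetical CIDR blocks that only mirror the /24, /24, /16 layout in the text.
subnets = {
    "public (NAT)": "10.192.10.0/24",
    "private (EKS control plane)": "10.192.7.0/24",
    "private (accelerated instances)": "10.1.0.0/16",
}

for name, cidr in subnets.items():
    print(f"{name}: {cidr} -> {usable_ips(cidr)} usable IPs")

# Hypothetical assumption: each node consumes 32 private IPs.
print("nodes at 32 IPs each:", max_nodes("10.1.0.0/16", 32))
```

When you supply a custom CIDR in custom setup, the same calculation gives a quick estimate of whether the private subnet leaves enough headroom for the instance count you plan to run.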

Why it matters (impact for developers/enterprises)

By removing manual provisioning steps and providing prescriptive defaults, the new cluster creation experience reduces the risk of misconfigurations during HyperPod setup. This accelerates time-to-value for teams performing large-scale generative AI training, fine-tuning, or inference across clusters with substantial accelerator counts. The solution supports robust, scalable workloads and aligns with IaC practices, enabling engineers to express desired states declaratively via CloudFormation templates and reuse configurations across environments. The ability to export a preconfigured CloudFormation template enables integration with CI/CD pipelines (e.g., CodePipeline) for automated validation and promotion of changes from development to production environments, further improving consistency and governance across deployments.

Technical details or Implementation

The cluster creation experience leverages AWS CloudFormation to provision a HyperPod cluster and its prerequisite resources in a single declarative operation. When users initiate cluster creation, the system deploys a CloudFormation stack that orchestrates the setup of networking, storage, identity, and compute resources required for HyperPod, ensuring a consistent state across environments. This IaC approach accommodates complex compositions including multiple managed services in a single request. Two deployment modes are offered:

  • Quick setup: Applies sensible defaults for instance groups, networking, orchestration, lifecycle scripts, permissions, and storage. It also provides visibility into which configurations can be edited after deployment and which would require recreating AWS resources. Automatic instance recovery is enabled by default to address unhealthy or unresponsive instances.
  • Custom setup: Provides granular control over configurations and allows you to selectively disable automatic node recovery if needed for troubleshooting or testing. It also supports continuous provisioning mode, enabling concurrent initiation of multiple operations such as scaling, AMI updates, and cluster creation, even when not all requested instances are available yet.

Networking and capacity planning details include:

  • Quick setup creates a new VPC with subnets across AZs, including a public /24 subnet for NAT, a private /24 subnet for EKS control plane communications, and a /16 private subnet to sustain large-scale accelerator capacity.
  • The default /16 private subnet supports more than 65,000 private IPs, helping accommodate clusters with many hosts that require multiple IPs per node.
  • For EKS orchestration, the quick setup provisions an EKS cluster with the latest Kubernetes version and enables a set of operators and plugins (EFA, Neuron, NVIDIA device plugins), the health monitoring agent, Kubeflow operators, and the SageMaker HyperPod inference operator.
  • Storage provisioning includes a new FSx for Lustre file system alongside a new S3 bucket to hold lifecycle scripts.

The custom setup option enables broad flexibility:

  • Create a new VPC with a custom CIDR or reuse an existing VPC and security group.
  • Point to an existing EKS cluster or provision a new one with configurable Kubernetes versions and subnets for robust connectivity between the Kubernetes API server and the VPC.
  • Attach an existing FSx for Lustre file system or provision a new one with multiple throughput and storage capacity options.
  • Add or customize instance groups, including standard and restricted instance groups, with capacity models aligned to on-demand workloads or flexible training plans for large-scale jobs.
  • Fine-grained control over optional operators installed in the EKS cluster via Helm charts.
  • Advanced lifecycle scripts can be supplied from an existing S3 bucket for customized ML frameworks or dependency configurations.

For developers seeking observability and resilience, the platform supports deep health checks (stress and connectivity) in addition to the basic health checks applied by the orchestrator. These checks validate hardware components (e.g., GPUs, memory) and network connectivity across nodes to maintain reliable distributed training. You can also adjust the number of threads per CPU core to influence performance characteristics (one thread per core vs. two threads per core).

A copy of the CloudFormation template used to deploy the selected configuration is downloadable from the SageMaker AI console, enabling reuse and integration with continuous delivery tools like AWS CodePipeline. Parameter overrides can be defined in a template configuration file to support multi-environment promotions from dev to test to prod.
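
To make the multi-environment promotion idea concrete, here is a minimal sketch that generates one template configuration file per environment. It assumes the JSON layout that CodePipeline's CloudFormation action accepts for template configuration files (top-level `Parameters` and `Tags` keys). The parameter names (`ResourceNamePrefix`, `InstanceType`, `InstanceCount`) and all values are hypothetical placeholders. The real names must be taken from the `Parameters` section of the template you download from the console.

```python
import json

# Hypothetical parameter names and values: replace them with the names declared
# in the Parameters section of the template downloaded from the console.
BASE_PARAMETERS = {
    "ResourceNamePrefix": "hyperpod",
    "InstanceType": "ml.g5.8xlarge",
    "InstanceCount": 2,
}

# Per-environment overrides layered on top of the shared base values.
ENVIRONMENT_OVERRIDES = {
    "dev": {"InstanceCount": 2},
    "test": {"InstanceCount": 4},
    "prod": {"InstanceCount": 16, "InstanceType": "ml.p5.48xlarge"},
}


def build_template_configuration(environment: str) -> dict:
    """Merge base parameters with one environment's overrides."""
    if environment not in ENVIRONMENT_OVERRIDES:
        raise ValueError(f"unknown environment: {environment!r}")
    merged = {**BASE_PARAMETERS, **ENVIRONMENT_OVERRIDES[environment]}
    return {
        # Parameter values are written as strings in the configuration file.
        "Parameters": {name: str(value) for name, value in merged.items()},
        "Tags": {"Environment": environment},
    }


def render_template_configuration(environment: str) -> str:
    """Serialize the configuration as JSON with a stable key order."""
    return json.dumps(build_template_configuration(environment), indent=2, sort_keys=True)


if __name__ == "__main__":
    for env in ("dev", "test", "prod"):
        print(f"--- {env} ---")
        print(render_template_configuration(env))
```

Keeping the shared values in one place and expressing each environment as a small set of overrides means the same downloaded template can be promoted unchanged, with only the configuration file differing between stages.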

Key takeaways

  • The new cluster creation experience for SageMaker HyperPod streamlines deployment with one-click provisioning and prescriptive defaults.
  • Quick setup emphasizes speed and safety with automatic recovery and new networking, storage, and cluster components.
  • Custom setup provides granular control for advanced users and environments, including the option to reuse existing resources.
  • CloudFormation-based IaC enables declarative deployments, templated reuse, and integration with CI/CD pipelines.
  • Continuous provisioning mode and health checks help deliver faster, more reliable large-scale AI workloads.
  • The option to export and reuse CloudFormation templates supports consistent multi-environment deployments.

FAQ

  • What is the purpose of the new cluster creation experience for SageMaker HyperPod?

    It provides a one-click, validated path to create HyperPod clusters with the required AWS resources and prescriptive defaults, reducing misconfigurations and setup time.

  • What resources are created automatically in quick setup?

    A new VPC with multi-AZ subnets, a security group for EFA and FSx for Lustre traffic, an EKS cluster with the required operators and plugins, an S3 bucket for lifecycle scripts, an IAM role, and an FSx for Lustre file system.

  • Can I reuse existing AWS resources?

    Yes, the custom setup lets you reference existing VPCs, security groups, EKS clusters, and FSx for Lustre file systems.

  • What is continuous provisioning mode?

    It enables concurrent initiation of multiple operations, such as scaling and AMI updates, within a single instance group, allowing faster deployments even when not all requested instances are immediately available.

  • How can I reuse the CloudFormation template?

    You can download the template from the SageMaker AI console and use parameter overrides with CodePipeline to automate builds, tests, and promotions.
