Train and deploy models on Amazon SageMaker HyperPod with the new HyperPod CLI and SDK

Source: https://aws.amazon.com/blogs/machine-learning/train-and-deploy-models-on-amazon-sagemaker-hyperpod-using-the-new-hyperpod-cli-and-sdk/ (AWS Machine Learning Blog)

TL;DR

  • The SageMaker HyperPod CLI and SDK simplify training and deploying large AI models on SageMaker HyperPod.
  • The CLI provides an intuitive command-line experience for launching training, fine-tuning, and inference endpoints, while the SDK offers programmatic access for advanced workflows.
  • The article demonstrates distributed training using Fully Sharded Data Parallel (FSDP) and model deployment for inference, including JumpStart foundation models.
  • Prerequisites include installing the HyperPod CLI/SDK (version 3.1.0 or newer) and the Kubernetes operators; interactions rely on the Kubernetes Python client.

Context and background

Managing distributed training and inference at scale has historically required substantial engineering and operations work. The newly released SageMaker HyperPod CLI and SDK aim to lower that barrier by wrapping distributed capabilities behind familiar interfaces. The CLI builds on the SageMaker HyperPod SDK to offer straightforward commands for common tasks such as launching and monitoring training jobs, fine-tuning, and deploying inference endpoints. This makes experimentation faster and more approachable for data scientists, while the SDK provides the flexibility needed for production-grade, customized pipelines.

The tools are shown in action for training and deploying large language models (LLMs) on SageMaker HyperPod, illustrating distributed training with Fully Sharded Data Parallel (FSDP) and the deployment of models for inference. The release highlights how developers can streamline the path from research to production with controlled, scalable distributed workloads.

The HyperPod CLI and SDK interact with the cluster via the Kubernetes API, which means the Kubernetes Python client must be configured to reach the cluster context. In practice, users list available clusters and set the active context before submitting jobs. The CLI is designed to be approachable for standard workflows and to hide much of the underlying distributed-systems complexity, while the SDK enables deeper customization when needed. The post demonstrates both paths for training and deploying models on the HyperPod platform.

For a real-world workflow, you configure a HyperPodPyTorchJob Kubernetes custom resource, installed as part of the HyperPod training operator, and orchestrate worker lifecycles through the elastic agent embedded in the training container. The example focuses on a Meta Llama 3.1 8B model configuration using FSDP. The prerequisites emphasize setting up the HyperPod tooling and the Kubernetes operators in the cluster.
The example workflow also shows how to build and push a Docker image, log in to Amazon ECR, and submit a training job via the CLI or SDK. Observability and debugging are addressed through commands to inspect job status, view pod logs, and examine training artifacts. The same primitives apply to deploying JumpStart foundation models as inference endpoints and to deploying custom models stored in Amazon S3 or FSx for Lustre, with optional load balancer support for secure HTTPS access.

What’s new

The post announces a new end-to-end tooling story for SageMaker HyperPod, centered on the CLI and SDK release (version 3.1.0). The key elements include:

  • A CLI that abstracts distributed system details and enables quick submission of PyTorch jobs to a SageMaker HyperPod cluster via the HyperPodPyTorchJob Kubernetes custom resource.
  • A Python SDK that provides programmatic access to configure training and deployment parameters while preserving the simplicity of Python objects.
  • Support for distributed training with Fully Sharded Data Parallel (FSDP) and for deploying models for inference directly on HyperPod clusters.
  • The ability to deploy foundation models from SageMaker JumpStart or to deploy custom models with artifacts stored on Amazon S3 or FSx for Lustre, with an optional Application Load Balancer (ALB) and TLS for secure access.
  • Observability and debugging capabilities for training and inference pods, including logs and pod-level status.
  • Examples showing how to build Docker images, push them to Amazon ECR, and submit a training job that writes checkpoints to /fsx/checkpoints on an FSx for Lustre PVC.
  • A family of CLI commands for training (hyp create hyp-pytorch-job), monitoring (hyp list hyp-pytorch-job), and retrieving logs for individual pods, plus deployment workflows for JumpStart endpoints (hyp-jumpstart-endpoint) and custom endpoints (hyp-custom-endpoint).

To illustrate the breadth of functionality, the article walks through a Meta Llama 3.1 8B training job and shows how to adapt the configuration by adjusting --args in the Kubernetes manifests. The deployment path includes creating an inference endpoint and an ALB, enabling HTTPS access with TLS, and observing the endpoint status via the CLI. A second deployment path covers custom models stored in S3 or FSx for Lustre, letting you deploy fine-tuned artifacts alongside a compatible inference container image (e.g., a DJL Large Model Inference container). A concise table summarizes the main capabilities:
| Capability | Description |
| --- | --- |
| CLI-based workflows | Submit training and fine-tuning jobs, deploy inference endpoints, monitor cluster performance |
| SDK-based workflows | Programmatic configuration of training and deployment parameters via Python objects |
| Inference endpoints | JumpStart foundation models or custom models, with optional ALB and TLS |
| Debug and observability | Logs, pod statuses, and advanced debugging workflows |
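The SDK's configuration-as-Python-objects style can be sketched with a plain dataclass. Note that the class and field names below are illustrative stand-ins for the pattern, not the actual HyperPod SDK API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PyTorchJobConfig:
    # Illustrative stand-in for the SDK's Python-object configuration style.
    # These field names are assumptions, not real HyperPod SDK classes.
    job_name: str
    image_uri: str
    node_count: int = 2
    script_args: List[str] = field(default_factory=list)

cfg = PyTorchJobConfig(
    job_name="llama31-8b-fsdp",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/fsdp-train:latest",
    node_count=4,
    script_args=["--max_steps", "1000"],
)
print(cfg.job_name, cfg.node_count)
```

The appeal of this pattern is that job parameters become inspectable, type-checked Python values rather than hand-edited YAML.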

Why it matters (impact for developers/enterprises)

The HyperPod CLI and SDK together provide a unified, scalable path to train and deploy large models with reduced operational burden. For developers, this means faster experimentation and iteration, as common tasks—such as launching distributed training, tuning FSDP configurations, and deploying endpoints for production-grade inference—are streamlined behind familiar interfaces. For enterprises, the approach translates into more predictable workflows, improved visibility into cluster performance, and the ability to leverage JumpStart foundation models or custom models in a production-ready manner. The combination of CLI ease of use with SDK flexibility supports both rapid pilots and robust, production-grade pipelines on SageMaker HyperPod. The integration with Kubernetes and the HyperPod operators reinforces a standards-based, containerized approach that aligns with modern MLOps practices.

Technical details or Implementation

The workflow relies on several explicit components and steps.

Prerequisites

  • Install the SageMaker HyperPod CLI and SDK, version 3.1.0 or newer, to access the relevant features.
  • Install the HyperPod training and inference operators in your Kubernetes cluster. The HyperPod CLI interacts with the cluster via the Kubernetes API, using the Kubernetes Python client, so you must configure the cluster context to enable API calls against your cluster.
  • From the local environment, verify the CLI installation by running the hyp command and inspecting its output, which lists the available commands and parameters.
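Before submitting jobs, it helps to confirm that the tooling is actually on PATH and a default kubeconfig exists. The helper below is our own small sketch, not part of the CLI or SDK:

```python
import os
import shutil

def check_prerequisites(tools=("hyp", "kubectl")):
    # Map each required CLI tool to whether it can be found on PATH,
    # and note whether a default kubeconfig file is present.
    report = {tool: shutil.which(tool) is not None for tool in tools}
    report["kubeconfig"] = os.path.exists(os.path.expanduser("~/.kube/config"))
    return report

print(check_prerequisites())
```

A non-default kubeconfig path or context would still need to be configured through the Kubernetes Python client as described above.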
Training workflow with the CLI

  • The CLI submits a PyTorch training job to a SageMaker HyperPod cluster via the HyperPodPyTorchJob custom resource, implemented by the HyperPod training operator.
  • The example builds a Docker image from the awsome-distributed-training repository, authenticates to Amazon ECR, and pushes the image to a registry. The HyperPod elastic agent inside the image coordinates the lifecycles of training workers across containers and communicates with the training operator.
  • You can adapt the configuration by changing the --args in the Kubernetes manifest to fit different LLMs or FSDP settings. Checkpoints are written to /fsx/checkpoints on the FSx for Lustre PVC.
  • The CLI supports a set of arguments for creating a PyTorch job, discoverable by running hyp create hyp-pytorch-job and exploring the help output. After submission, track status with hyp list hyp-pytorch-job and retrieve pod logs to diagnose issues.
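Only the subcommand names (hyp create hyp-pytorch-job, hyp list hyp-pytorch-job) come from the post; the flags in the sketch below are hypothetical placeholders showing how an invocation might be assembled programmatically, not documented CLI options — consult the CLI help output for the real ones.

```python
def build_create_job_command(job_name, image_uri, extra_args=None):
    # Assemble an argv list for the documented `hyp create hyp-pytorch-job`
    # subcommand. The --job-name/--image flags below are hypothetical
    # placeholders; run `hyp create hyp-pytorch-job --help` for real options.
    cmd = ["hyp", "create", "hyp-pytorch-job",
           "--job-name", job_name,   # hypothetical flag
           "--image", image_uri]     # hypothetical flag
    if extra_args:
        cmd += list(extra_args)
    return cmd

print(" ".join(build_create_job_command("fsdp-demo", "fsdp-train:latest")))
```

Once the flags are confirmed against the CLI help, a list like this can be passed directly to subprocess.run for scripted submissions.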
Inference deployment workflow

  • The HyperPod CLI can deploy models to a SageMaker HyperPod cluster for inference, including JumpStart foundation models or custom models whose artifacts are stored on S3 or FSx for Lustre. Deployment uses the HyperPod inference operator in the cluster and can optionally create a SageMaker inference endpoint and an Application Load Balancer (ALB) with TLS for secure access.
  • For JumpStart models, the deployment command returns a DeploymentInProgress status that transitions to DeploymentComplete once the endpoint is ready. You can observe the deployment pod logs for debugging and verify the endpoint by invoking it through the CLI.
  • For custom models, you provide the S3 location of the artifacts and an inference container image compatible with SageMaker endpoints, then deploy with similar CLI commands, watch the endpoint state, and invoke the endpoint when ready. The Python SDK provides an equivalent programmatic path for both JumpStart and custom deployments.
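The DeploymentInProgress → DeploymentComplete transition described above lends itself to a simple polling loop. The helper below is a generic sketch — the status-fetching callable stands in for whatever your workflow uses (for example, parsing hyp list output) and is not part of the SDK:

```python
import time

def wait_for_deployment(get_status, max_polls=30, interval_seconds=0):
    # Poll a caller-supplied status function until the endpoint reports
    # DeploymentComplete, or give up after max_polls attempts.
    for _ in range(max_polls):
        if get_status() == "DeploymentComplete":
            return True
        time.sleep(interval_seconds)
    return False

# Simulated status sequence standing in for real CLI/SDK status queries.
statuses = iter(["DeploymentInProgress", "DeploymentInProgress", "DeploymentComplete"])
print(wait_for_deployment(lambda: next(statuses)))
```

In practice you would set interval_seconds to a sensible delay and bound max_polls so a stuck deployment eventually surfaces as a failure.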
Observability and debugging

  • The post emphasizes multiple debugging angles: viewing training pod logs, inspecting pod status, and examining the Kubernetes resources backing HyperPod jobs. The CLI makes it straightforward to fetch logs, list active resources, and monitor progress across both training and inference workloads.
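Because HyperPod jobs are backed by ordinary Kubernetes resources, pod logs can also be fetched directly with kubectl when deeper debugging is needed. This sketch only builds the argv list (the pod name in the example is hypothetical) rather than invoking a live cluster:

```python
def kubectl_logs_command(pod_name, namespace="default", follow=False, tail=None):
    # Build a standard `kubectl logs` invocation for a training or
    # inference pod; pass the result to subprocess.run to execute it.
    cmd = ["kubectl", "logs", pod_name, "-n", namespace]
    if follow:
        cmd.append("-f")          # stream logs continuously
    if tail is not None:
        cmd += ["--tail", str(tail)]  # limit to the most recent lines
    return cmd

print(" ".join(kubectl_logs_command("fsdp-demo-worker-0", tail=100)))
```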
Practical notes

  • The training example demonstrates a Meta Llama 3.1 8B model with FSDP, instantiated via the HyperPodPyTorchJob custom resource. You can tailor the configuration by adjusting arguments and model settings in the provided manifests. The deployment path shows how to create endpoints and, when desired, attach an ALB with TLS for secure access.
  • The article also covers deploying a TinyLlama 1.1B model from S3 using a DJL Large Model Inference container image, illustrating how to combine remotely stored artifacts with a compatible inference container.

In summary, the HyperPod CLI and SDK deliver a practical, scalable approach to training and deploying large AI models on SageMaker HyperPod, with clear pathways for both JumpStart-based and custom deployments, underpinned by Kubernetes-based orchestration.

Key takeaways

  • HyperPod CLI provides an accessible entry point for common workflows, including training, fine-tuning, and inference deployment.
  • HyperPod SDK enables more granular, programmatic control of training and deployment settings.
  • The platform supports large language models with FSDP and provides end-to-end deployment options, including JumpStart and custom models.
  • Inference deployments can leverage ALB with TLS, offering secure access to endpoints.
  • Observability and debugging features help maintain production-grade reliability across training and inference workloads.

FAQ

  • What problem do the HyperPod CLI and SDK solve?

    They simplify distributed training and inference on SageMaker HyperPod by hiding complexity behind an easy CLI and providing a programmable Python API for advanced use cases.

  • What kinds of models and workloads can be trained or deployed with HyperPod’s tools?

    The examples focus on large language models using Fully Sharded Data Parallel (FSDP) and deploying both JumpStart foundation models and custom models with artifacts stored in S3 or FSx for Lustre.

  • How can I deploy an inference endpoint securely?

    The CLI supports automatic creation of a SageMaker inference endpoint and an Application Load Balancer (ALB) with TLS for HTTPS access.

  • What prerequisites are required before using the CLI/SDK?

    Install the HyperPod CLI and SDK (version 3.1.0 or later), install the HyperPod training and inference operators in the cluster, and configure the Kubernetes Python client with the cluster context.

  • How do I monitor training and deployment progress?

    Use hyp list commands (e.g., hyp list hyp-pytorch-job and hyp list hyp-jumpstart-endpoint) and inspect pod logs to observe status and diagnose issues.
