Fine-tune OpenAI GPT-OSS models using Amazon SageMaker HyperPod recipes
TL;DR
- Fine-tune OpenAI GPT-OSS models using SageMaker HyperPod recipes or SageMaker training jobs.
- HyperPod recipes provide pre-built, validated configurations for foundation models such as Meta’s Llama, Mistral, and DeepSeek, enabling rapid, on-demand experimentation at scale.
- The process supports multilingual reasoning with the HuggingFaceH4/Multilingual-Thinking dataset and a 4,000-token sequence length for GPT-OSS 120B.
- Data storage options include FSx for Lustre for HyperPod and S3 for training jobs; final model artifacts are stored in S3 and deployed to SageMaker endpoints using vLLM containers.
- Prerequisites cover cluster readiness and environment setup; the HyperPod launcher automates orchestration on architectures like EKS or Slurm, while training jobs automate resource provisioning.
Context and background
This post is the second part of the GPT-OSS series focusing on model customization with Amazon SageMaker AI. In Part 1, we demonstrated fine-tuning GPT-OSS models using open source Hugging Face libraries with SageMaker training jobs, which support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand. In this post, we show how you can fine-tune GPT-OSS models using recipes on SageMaker HyperPod and training jobs.
HyperPod recipes help you get started with training and fine-tuning popular publicly available foundation models, such as Meta’s Llama, Mistral, and DeepSeek, in just minutes, using either SageMaker HyperPod or training jobs. The recipes provide pre-built, validated configurations that alleviate the complexity of setting up distributed training environments while maintaining enterprise-grade performance and scalability. We outline the steps to fine-tune GPT-OSS on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, so GPT-OSS can handle structured, chain-of-thought (CoT) reasoning across multiple languages.
This solution uses SageMaker HyperPod recipes to run a fine-tuning job on HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, or on training jobs. Recipes are processed through the SageMaker HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture, such as SageMaker HyperPod (Slurm or Amazon EKS) or training jobs. For details on HyperPod recipes, see SageMaker HyperPod recipes. For details on fine-tuning the GPT-OSS model, see Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging Face libraries.
In the following sections, we discuss the prerequisites for both options, and then move on to data preparation. The prepared data is saved to Amazon FSx for Lustre, which is used as the persistent file system for SageMaker HyperPod, or to Amazon Simple Storage Service (Amazon S3) for training jobs. We then use recipes to submit the fine-tuning job, and finally deploy the trained model to a SageMaker endpoint for testing and evaluation.
What’s new
HyperPod recipes extend the GPT-OSS fine-tuning workflow to a managed, persistent cluster with pre-built configurations that are validated for enterprise-scale training. You can run the fine-tuning on SageMaker HyperPod with EKS orchestration or use SageMaker training jobs for on-demand compute. The multilingual dataset Multilingual-Thinking provides CoT examples translated into languages such as French, Spanish, and German, enabling GPT-OSS to perform structured reasoning across languages. The recipe supports a sequence length of 4,000 tokens for the GPT-OSS 120B model. The orchestration layer behind the scenes is the SageMaker HyperPod recipe launcher, which handles launching the job on the selected architecture. For broader guidance, refer to the AWS blog post linked in the References.
Context and background (continued)
Data preparation follows a straightforward path: tokenize the Multilingual-Thinking dataset in Hugging Face format (Arrow) and save the processed data to disk. The approach aligns with the two available execution paths, HyperPod and training jobs, and emphasizes enterprise-grade performance, resilience, and scalability. Data storage choices differ by path: FSx for Lustre is used with HyperPod, while training jobs rely on Amazon S3 as the persistent data store. The GPT-OSS 120B model is the target for the described scenario, with support for a 4,000-token sequence length.
Why it matters (impact for developers/enterprises)
For developers and enterprises, this workflow enables rapid experimentation with large open models while preserving control over distributed training resources. HyperPod recipes reduce the setup complexity of distributed environments and offer persistent clusters that support ongoing development and iterative experimentation. Training jobs provide a fully managed, on-demand experience suitable for one-off or periodic training workloads. The combination of HyperPod and training jobs gives teams flexibility in when and how they train, while the multilingual dataset expands capabilities across languages. The final artifacts merge the base model with customized PEFT adapters, are stored in S3, and can be deployed to SageMaker endpoints. To deploy, you need the latest vLLM containers and a compatible SageMaker runtime environment.
Technical details or Implementation
This section covers the practical steps and components involved in the process.
Prerequisites
- Ensure the required service limits and quotas are approved (approval can take up to 24 hours); see the quota-check sketch after this list.
- You can also reserve training plans for specific time windows (cluster or training jobs) as described in the AWS guidance.
- Prepare a development environment with Python 3.9 or greater. Have access to SageMaker HyperPod, the HyperPod recipe launcher, and the respective configuration files.
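As a starting point for the quota check above, here is a minimal sketch using the AWS Service Quotas API via boto3. The region and the "ml.p5.48xlarge" filter string are assumptions; match them to the region and GPU instance type your recipe and cluster will use.
```python
import boto3

# Minimal sketch: list SageMaker quotas and filter for a GPU instance type.
# Assumption: the region and "ml.p5.48xlarge" are placeholders for your setup.
client = boto3.client("service-quotas", region_name="us-east-1")

paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.p5.48xlarge" in quota["QuotaName"]:
            print(f"{quota['QuotaName']}: {quota['Value']}")
```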
Data preparation
- Use the HuggingFaceH4/Multilingual-Thinking dataset, a multilingual reasoning dataset with CoT examples translated into languages such as French, Spanish, and German.
- Tokenize the dataset for the GPT-OSS 120B model. The recipe accepts data in Hugging Face format (Arrow) and can save the processed dataset to disk for subsequent fine-tuning, as shown in the sketch below.
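The following is a minimal sketch of that preparation step, assuming the openai/gpt-oss-120b tokenizer is available from the Hugging Face Hub, the dataset exposes a messages column, and /fsx/multilingual_4k is your output location (use a local path plus an S3 upload for the training-jobs path).
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: model ID, output path, and the "messages" column name are illustrative;
# adjust them to your environment and the dataset schema you actually see.
MODEL_ID = "openai/gpt-oss-120b"
OUTPUT_PATH = "/fsx/multilingual_4k"  # FSx for Lustre mount on HyperPod; use a local dir + S3 upload for training jobs
MAX_LEN = 4000                        # sequence length used by the GPT-OSS 120B recipe

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

def tokenize(example):
    # Render the chat-style conversation with the model's chat template, then tokenize
    # to the fixed sequence length the recipe expects.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=MAX_LEN, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
tokenized.save_to_disk(OUTPUT_PATH)   # writes Arrow files the recipe can load from disk
```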
Fine-tuning with SageMaker HyperPod
- Set up the virtual environment and dependencies required to run the training job on the EKS cluster. Confirm that the cluster is InService (see the status-check sketch after this list) and that Python 3.9+ is active in your environment.
- Download and set up the SageMaker HyperPod recipes repository. Use the HyperPod recipe launcher to submit your training job.
- In recipes_collection/cluster/k8s.yaml, update the persistent_volume_claims section to mount the FSx claim to the /fsx directory of each computing pod.
- The recipe launcher provides a launch script for each recipe under launcher_scripts. To fine-tune the GPT-OSS-120B model, modify the script at launcher_scripts/gpt_oss/run_hf_gpt_oss_120b_seq4k_gpu_lora.sh and set cluster_type to k8s (if using EKS).
- After preparing the script, submit the fine-tuning job. Use kubectl to verify that pods are running and to inspect logs (for example kubectl logs -f hf-gpt-oss-120b-lora-h2cwd-worker-0).
- The training process writes checkpoints to /fsx/experiment/checkpoints. The final merged model is located under /fsx/experiment/checkpoints/peft_full/steps_50/final-model, within your defined experiment directory.
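As a quick sanity check before submitting the job (referenced in the first item of this list), the boto3 SageMaker client can confirm the cluster status; the cluster name below is a placeholder.
```python
import boto3

# Assumption: "my-hyperpod-cluster" is a placeholder for your HyperPod cluster name or ARN.
sm = boto3.client("sagemaker")

resp = sm.describe_cluster(ClusterName="my-hyperpod-cluster")
print("Cluster status:", resp["ClusterStatus"])  # expect "InService" before submitting the fine-tuning job

# Optionally confirm GPU capacity by listing the cluster's instance groups.
for group in resp.get("InstanceGroups", []):
    print(group["InstanceGroupName"], group["InstanceType"], group["CurrentCount"])
```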
Fine-tuning with SageMaker training jobs
- Alternatively, you can use SageMaker training jobs with the PyTorch estimator and the training_recipe parameter to specify the fine-tuning recipe (see the sketch after this list). Set the input, output, and results directories to locations under /opt/ml, as required by SageMaker training jobs.
- After submitting the job, monitor it in the SageMaker console under Training, and review logs to confirm successful completion. Outputs are saved to an S3 location; the exact artifact location is shown in the job details.
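Following the training-jobs path described in this list, here is a minimal sketch of the estimator call. The recipe identifier, instance settings, and override keys are assumptions inferred from the launcher script name; check the recipes repository for the exact values, and note that some SDK versions also require an explicit image_uri for the recipe's training container.
```python
import sagemaker
from sagemaker.pytorch import PyTorch

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Assumed overrides that point the recipe at the /opt/ml directories SageMaker
# training jobs mount; verify the exact keys in the recipe YAML you are using.
recipe_overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "exp_manager": {"exp_dir": "/opt/ml/output"},
    "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
}

estimator = PyTorch(
    base_job_name="gpt-oss-120b-lora",
    role=role,
    instance_count=1,                      # adjust to the node count the recipe specifies
    instance_type="ml.p5.48xlarge",        # adjust to the GPU instance type your quota allows
    training_recipe="fine-tuning/gpt_oss/hf_gpt_oss_120b_seq4k_gpu_lora",  # assumed recipe ID
    recipe_overrides=recipe_overrides,
    sagemaker_session=sess,
)

# The train channel points at the tokenized dataset uploaded to S3.
estimator.fit(inputs={"train": "s3://your-bucket/datasets/multilingual_4k/"})
```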
Deployment and serving
- After fine-tuning, the output is a customized model artifact that merges the base GPT-OSS model with the tailored PEFT adapters. This final model is stored in Amazon S3 and can be deployed directly from S3 to SageMaker endpoints.
- To serve GPT-OSS models, ensure you use the latest vLLM containers (version 0.10.1 or later). A full list of vllm-openai Docker image versions is available via Docker Hub.
- Deploy a SageMaker endpoint by building a deployment container that extends the base vLLM image with a SageMaker-compatible serve entrypoint script. The build and deployment workflow creates a container image that is pushed to Amazon ECR. SageMaker endpoints pull the image at runtime to spin up the inference container.
- The example deployment workflow includes building the container with a script similar to build.sh, which generates a vllm-openai container optimized for SageMaker hosting, pushing it to ECR, and wiring the endpoint to download model artifacts from S3 at startup.
- The endpoint can be invoked via the SageMaker Python SDK, using the OpenAI-style messages input format for chat-style requests, as shown in the sketch below.
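A minimal sketch of the deploy-and-invoke flow follows, assuming the extended vllm-openai image has already been pushed to Amazon ECR and that its entrypoint reads the model location from an environment variable; the image URI, S3 path, environment variable name, endpoint name, and instance type are all placeholders.
```python
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Assumptions: the ECR image URI, S3 artifact path, and MODEL_S3_PATH variable are
# placeholders for the container produced by your build.sh workflow.
model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/vllm-openai-sagemaker:latest",
    role=role,
    env={"MODEL_S3_PATH": "s3://your-bucket/experiment/checkpoints/peft_full/steps_50/final-model/"},
    predictor_cls=Predictor,
    sagemaker_session=sess,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p5.48xlarge",        # choose a GPU instance with enough memory for GPT-OSS 120B
    endpoint_name="gpt-oss-120b-multilingual",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Invoke with OpenAI-style chat messages.
response = predictor.predict({
    "messages": [
        {"role": "user", "content": "Résous ce problème étape par étape : 12 × 7 = ?"}
    ],
    "max_tokens": 512,
})
print(response)
```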
Key notes
- The approach preserves the familiar SageMaker workflow while integrating the vLLM runtime for low latency inference.
- The architecture supports both HyperPod clusters and standard SageMaker training jobs, giving teams a choice between persistent experimentation environments and on-demand compute.
- The multilingual data and 4K sequence length extend GPT-OSS capabilities to multi-language reasoning in practical settings.
Key takeaways
- HyperPod recipes enable rapid fine-tuning of GPT-OSS models on enterprise-grade, scalable clusters.
- Data preparation and tokenization are aligned with Hugging Face format and the specified multilingual dataset.
- Final model artifacts merge base models with adapters and are deployed via SageMaker endpoints using vLLM containers for optimized inference.
- The process supports two execution paths: persistent HyperPod clusters and on-demand SageMaker training jobs, providing flexibility for development and production workloads.
- Deployment relies on a compatible vLLM container and a deployment workflow that remains consistent with SageMaker hosting practices.
FAQ
- What models are supported by HyperPod recipes for fine-tuning GPT-OSS?
The recipes cover popular publicly available foundation models such as Meta's Llama, Mistral, and DeepSeek. See the referenced workflow for GPT-OSS fine-tuning.
- What data format and dataset are used for multilingual reasoning?
The HuggingFaceH4/Multilingual-Thinking dataset is used, with CoT examples translated into multiple languages such as French, Spanish, and German. The dataset is tokenized to the 4,000-token sequence length required for GPT-OSS 120B.
- Where are data and model artifacts stored during training?
For HyperPod, data is stored on FSx for Lustre; for training jobs, data goes to S3. Final model artifacts are merged artifacts stored in S3 and can be deployed from S3 to SageMaker endpoints.
- How do I deploy the fine-tuned model?
Build a SageMaker hosting container based on a vllm-openai image, push it to Amazon ECR, and deploy an endpoint. The endpoint uses the vLLM runtime and OpenAI-style input formats for chat-style requests.
- Where can I find the official guidance and code references?
The workflow is documented in the AWS ML Blog post on fine-tuning GPT-OSS with HyperPod recipes, which also references the HyperPod recipe launcher and related tooling. See the provided link in References.
References