Amazon SageMaker HyperPod

Reduce the time to train foundation models by up to 40% with purpose-built infrastructure for distributed training at scale

What is SageMaker HyperPod?

Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs), reducing training time by up to 40%. SageMaker HyperPod is preconfigured with SageMaker's distributed training libraries, which enable customers to automatically split training workloads across thousands of accelerators so that workloads can be processed in parallel for improved model performance. SageMaker HyperPod also ensures customers can continue FM training uninterrupted by periodically saving checkpoints. When a hardware failure occurs during training, SageMaker HyperPod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint. This removes the need for customers to manage the process manually and helps them train for weeks or months in a distributed setting without disruption.
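
The checkpoint-and-resume behavior described above can be illustrated with a minimal sketch. It is not SageMaker HyperPod code: the function names, the JSON checkpoint file, and the `fail_at` parameter are hypothetical and exist only to show the general pattern of saving state periodically and restarting from the last saved step after a failure.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop; `fail_at` (hypothetical) simulates a hardware failure."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated hardware failure at step {step}")
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real optimizer update
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return step, state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(ckpt, total_steps=100, checkpoint_every=10, fail_at=37)
    except RuntimeError as err:
        print("interrupted:", err)
    print("resuming from step", load_checkpoint(ckpt)[0])
    print("finished at step", train(ckpt, total_steps=100, checkpoint_every=10)[0])
```

In this sketch the simulated failure at step 37 loses only the work done since the checkpoint at step 30, and the second call to `train` picks up from there instead of starting over.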

Benefits of SageMaker HyperPod

Streamlined distributed training for large training clusters

Amazon SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, allowing you to automatically split your models and training datasets across AWS cluster instances to help you efficiently scale training workloads.

Optimized utilization of the cluster's compute, memory, and network resources

Amazon SageMaker distributed training libraries optimize your training job for AWS network infrastructure and cluster topology through two techniques: data parallelism and model parallelism. Model parallelism splits models that are too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training. Data parallelism splits large datasets across multiple GPUs so that the parts are trained on concurrently, improving training speed.
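
To make the two techniques concrete, here is a small conceptual sketch in plain Python. It does not use the SageMaker distributed training libraries; every function name is hypothetical, and the "model" is a single weight fitted to toy data. It only shows the idea of sharding data across workers and averaging their gradients, and of partitioning a model's layers across devices.

```python
def shard_dataset(samples, num_workers):
    """Data parallelism: give each worker an interleaved slice of the dataset."""
    return [samples[rank::num_workers] for rank in range(num_workers)]


def partition_layers(layers, num_devices):
    """Model parallelism: assign a contiguous group of layers to each device."""
    base, extra = divmod(len(layers), num_devices)
    parts, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        parts.append(layers[start:start + size])
        start += size
    return parts


def local_gradient(weight, shard):
    """Gradient of mean squared error for the 1-D model y = weight * x.

    Assumes the shard is non-empty.
    """
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(weight, shards, lr):
    """Each worker computes a gradient on its own shard; the gradients are
    then averaged (the role an all-reduce plays in a real cluster)."""
    grads = [local_gradient(weight, shard) for shard in shards]
    return weight - lr * sum(grads) / len(grads)


if __name__ == "__main__":
    samples = [(x, 3.0 * x) for x in range(1, 9)]  # toy data with true weight 3
    shards = shard_dataset(samples, num_workers=4)
    weight = 0.0
    for _ in range(50):
        weight = data_parallel_step(weight, shards, lr=0.01)
    print("learned weight:", weight)
    print(partition_layers(["embed", "block1", "block2", "block3", "head"], 2))
```

With equal-sized shards, averaging the per-worker gradients gives the same update as one pass over the full batch, which is why the work can be spread across workers without changing the result.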

Resilient training environment that removes interruptions

SageMaker HyperPod provides a more resilient training environment by automatically detecting, diagnosing, and recovering from faults, allowing you to train FMs continuously for months without disruption.

Optimized distributed training libraries

SageMaker HyperPod is preconfigured with SageMaker distributed training libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes distributed training faster by automatically splitting your models and training datasets across AWS GPU instances.

Automatic cluster health check and repair

If any instance becomes defective during a training workload, SageMaker HyperPod automatically detects the faulty node and swaps it for a healthy one. To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity.
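
The detect-and-swap idea can be sketched as follows. This is an illustration only, not how SageMaker HyperPod is implemented: the `Node` class, its `gpu_ok` and `network_ok` flags, and the `repair_cluster` function are hypothetical stand-ins for real GPU and network health checks and for the service's replacement logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpu_ok: bool = True      # stand-in for the result of a GPU health check
    network_ok: bool = True  # stand-in for the result of a network health check

    def is_healthy(self):
        return self.gpu_ok and self.network_ok


def repair_cluster(active, spares):
    """Replace each unhealthy active node with a healthy spare, if one is left.

    Returns the new active list and the names of the nodes that were swapped out.
    """
    healthy_spares = [node for node in spares if node.is_healthy()]
    repaired, replaced = [], []
    for node in active:
        if node.is_healthy():
            repaired.append(node)
        elif healthy_spares:
            repaired.append(healthy_spares.pop(0))
            replaced.append(node.name)
        else:
            repaired.append(node)  # no spare left; the faulty node stays in place
    return repaired, replaced


if __name__ == "__main__":
    active = [Node("node-a"), Node("node-b", gpu_ok=False), Node("node-c")]
    spares = [Node("spare-1")]
    repaired, replaced = repair_cluster(active, spares)
    print([node.name for node in repaired], "replaced:", replaced)
```

Running a check like this at a regular interval, then resuming the job from its last checkpoint, is the general shape of the recovery flow the paragraph above describes.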

Debug and improve model performance

You can use purpose-built ML tools in SageMaker HyperPod to improve training performance. Amazon SageMaker with TensorBoard helps you save development time by visualizing the model architecture so you can identify and remediate convergence issues, such as validation loss not converging or vanishing gradients.

Workload scheduling and orchestration

The SageMaker HyperPod user interface is highly customizable using Slurm. You can select and install any needed frameworks or tools. All clusters are provisioned with the instance type and count you choose, and they are retained for your use across workloads.