What is SageMaker HyperPod?
Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs), reducing training time by up to 40%. SageMaker HyperPod is preconfigured with SageMaker’s distributed training libraries, which enable customers to automatically split training workloads across thousands of accelerators so that workloads can be processed in parallel for improved model performance. SageMaker HyperPod also helps customers continue FM training uninterrupted by periodically saving checkpoints. When a hardware failure occurs during training, SageMaker HyperPod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint. This removes the need for customers to manage the process manually and helps them train for weeks or months in a distributed setting without disruption.
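The checkpoint-and-resume behavior described above can be illustrated with a minimal sketch. It is not HyperPod code: the file layout, the `CHECKPOINT_EVERY` interval, and the `train`, `save_checkpoint`, and `load_checkpoint` helpers are all hypothetical, and the "model" is a single number. It only shows the general pattern of saving state periodically and restarting from the last saved step after a failure.

```python
import json
import os

CHECKPOINT_EVERY = 5  # hypothetical interval, in training steps


def save_checkpoint(path, step, weight):
    """Write the state to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weight": weight}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, weight) from the last checkpoint, or a fresh start if none exists."""
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weight"]


def train(path, total_steps, fail_at=None):
    """Run a toy training loop, resuming from the last checkpoint if one exists."""
    step, weight = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        weight += 0.5  # stand-in for one optimizer update
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step, weight)
    return step, weight
```

In this sketch, a run that fails at step 7 loses only the two steps taken since the checkpoint at step 5; calling `train` again with the same path picks up from step 5 rather than from zero.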
Benefits of SageMaker HyperPod
Optimized distributed training libraries
SageMaker HyperPod is preconfigured with SageMaker distributed libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
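To make the idea of data parallelism concrete, here is a conceptual sketch in plain Python. It does not use the SageMaker distributed libraries, and every name in it (`shard`, `local_gradient`, `all_reduce_mean`, `data_parallel_step`) is hypothetical. It assumes a one-parameter linear model and equal-sized shards, and shows the basic pattern: each worker computes a gradient on its own slice of the data, and the gradients are averaged before the shared weight is updated.

```python
def shard(dataset, num_workers):
    """Split the dataset into num_workers interleaved shards."""
    return [dataset[rank::num_workers] for rank in range(num_workers)]


def local_gradient(weight, samples):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)


def all_reduce_mean(values):
    """Stand-in for the all-reduce step: average the per-worker gradients."""
    return sum(values) / len(values)


def data_parallel_step(weight, dataset, num_workers, lr):
    """One synchronous update: shard, compute local gradients, average, apply."""
    shards = shard(dataset, num_workers)
    grads = [local_gradient(weight, s) for s in shards]
    return weight - lr * all_reduce_mean(grads)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
    w = 0.0
    for _ in range(200):
        w = data_parallel_step(w, data, num_workers=2, lr=0.01)
    print(w)  # approaches 2.0
```

With equal-sized shards, the averaged gradient matches the gradient computed over the whole dataset, which is why the result is the same as single-worker training while the per-shard work can run in parallel.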
Automatic cluster health check and repair
If any instances become defective during a training workload, SageMaker HyperPod automatically detects and swaps faulty nodes with healthy ones. To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity.
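The detect-and-swap flow can be sketched as a small simulation. This is an illustration only, not the HyperPod implementation: the `Node` class, its `gpu_ok` and `network_ok` flags, and the `repair_cluster` function are hypothetical stand-ins for whatever health checks and replacement logic the service actually runs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpu_ok: bool = True      # hypothetical result of a GPU health check
    network_ok: bool = True  # hypothetical result of a network health check


def is_healthy(node):
    """A node passes only if every check passes (here: GPU and network)."""
    return node.gpu_ok and node.network_ok


def repair_cluster(active, spares):
    """Replace each unhealthy active node with a healthy spare, if one is left.

    Returns the new active list and a list of (faulty, replacement) name pairs.
    A faulty node stays in place when no healthy spare remains.
    """
    healthy_spares = [n for n in spares if is_healthy(n)]
    new_active, swaps = [], []
    for node in active:
        if is_healthy(node) or not healthy_spares:
            new_active.append(node)
        else:
            replacement = healthy_spares.pop(0)
            new_active.append(replacement)
            swaps.append((node.name, replacement.name))
    return new_active, swaps
```

For example, with active nodes `a`, `b` (failed GPU check), and `c` (failed network check), and a single healthy spare `s2`, the sketch swaps `b` for `s2` and leaves `c` in place because no spare remains.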
Debug and improve model performance
You can use purpose-built ML tools in SageMaker HyperPod to improve training performance. Amazon SageMaker with TensorBoard helps you save development time by visualizing the model architecture so that you can identify and remediate convergence issues, such as validation loss not converging or vanishing gradients.