Hi, Julien from Arcee here. In this video, I'd like to talk about Spot Training for Amazon SageMaker. You may be familiar with Spot Instances on Amazon EC2: Spot is basically unused EC2 capacity that you can use at a deep discount. The same feature is now available on Amazon SageMaker and has been for several months. However, I'm always surprised to meet customers who haven't heard about it, and it's such an important feature for saving money on your training jobs that I think it deserves its own video. Let me show you how this works.
First, let's quickly recap how we train on SageMaker. We use an estimator object from the SageMaker SDK. Here, I'm using the TensorFlow estimator, passing the location of a training script and specifying how much infrastructure I want to train on. For example, I want to train on a single c5.2xlarge instance, and I can also pass hyperparameters and more. This is the default way of training: we fire up that c5.2xlarge instance on demand, which means we pay the on-demand price listed on the SageMaker pricing page. The managed instance starts, SageMaker configures it, pulls the TensorFlow container, injects your script, loads your data, and training starts. This specific job lasted 777 seconds, which is about 13 minutes, and we're billed for the exact number of seconds used by that specific job. You know exactly how much you're going to pay because you know the on-demand price for that instance type.
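To make this concrete, here's a minimal sketch of that on-demand setup with the SageMaker Python SDK, using the v1-style parameter names that match the rest of this video. The script name, S3 path, and hyperparameters are placeholders, not the exact ones from my demo.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# On-demand training: the instance is billed at the on-demand price
# for the exact duration of the job.
estimator = TensorFlow(
    entry_point="train.py",               # placeholder training script
    role=role,
    script_mode=True,
    framework_version="1.15",
    py_version="py3",
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",  # SageMaker instance types carry the "ml." prefix
    hyperparameters={"epochs": 10, "batch-size": 256},  # placeholder hyperparameters
)

estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 location
```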
Of course, if there's a less expensive way to do it, let's look at it. It's called managed spot training. Spot has been available on EC2 for years, and people have found many clever ways to leverage spot instances there. It's super easy to use on SageMaker. We start from the same estimator; again, I'm using TensorFlow here, but it works the same way for built-in algorithms and the other frameworks. It works for single-instance training, distributed training, and hyperparameter tuning; all configurations are supported. The only thing you need to do is add three parameters to your estimator. If you want to find out more, you can go to the SageMaker SDK documentation. I'll put the URL in the description. Here are the parameters: `train_use_spot_instances`, `train_max_run`, and `train_max_wait`.
The first parameter is obvious: `train_use_spot_instances=True`. The other two are important and can be a little confusing, so let me explain. `train_max_run` is the maximum training time for that specific job; for example, if you don't want the job to run for more than half an hour, you set it to 1800 seconds. `train_max_wait` is the maximum total duration of the job, meaning the training time plus any additional time you're willing to wait for spot instances to become available. You might wait a few minutes or more for spot capacity, and `train_max_wait` lets you cap how long you're willing to wait overall; note that it must be greater than or equal to `train_max_run`. For example, if spot instances aren't available within that window, you might decide to fall back to on-demand instances instead.
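Here's the same estimator sketch with the three spot parameters added, under the same placeholder names. Note that the SageMaker SDK v2 later renamed these parameters to `use_spot_instances`, `max_run`, and `max_wait`.

```python
# Managed spot training: same estimator, plus three parameters.
estimator = TensorFlow(
    entry_point="train.py",               # placeholder training script
    role=role,
    script_mode=True,
    framework_version="1.15",
    py_version="py3",
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    train_use_spot_instances=True,  # request spot capacity instead of on-demand
    train_max_run=1800,             # cap actual training time at 30 minutes
    train_max_wait=3600,            # cap training time plus time spent waiting for spot capacity
)

estimator.fit("s3://my-bucket/training-data/")  # placeholder S3 location
```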
So, use these three parameters: enable spot instances, set how long the job may run, and set how long you're willing to wait for spot instances. Then call `fit` as usual. If we scroll all the way down to the training log, we see the training time is 795 seconds, which is close to the previous training time, so it seems we didn't wait long for spot instances, maybe just a few seconds. The billable seconds are what matter, because those are what show up on your AWS bill. Here, you're only billed for 183 seconds instead of 795, meaning you saved 77% of your training cost. That's significant savings, especially if you train frequently and run distributed training or hyperparameter tuning.
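If you'd rather read those numbers programmatically than scroll through the log, the DescribeTrainingJob API returns them. Here's a quick sketch; it assumes you still have the estimator object around to look up the job name.

```python
import boto3

sm = boto3.client("sagemaker")
job_name = estimator.latest_training_job.name  # or pass the training job name directly
desc = sm.describe_training_job(TrainingJobName=job_name)

training_seconds = desc["TrainingTimeInSeconds"]   # 795 in the example above
billable_seconds = desc["BillableTimeInSeconds"]   # 183 in the example above
print(f"Managed spot savings: {100 * (1 - billable_seconds / training_seconds):.0f}%")
```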
If you're used to working with EC2 instances, note that you won't see these instances in the EC2 console, because SageMaker instances are fully managed and not visible in EC2. Now, what's the catch? If we give you a 77% discount, there must be a catch. As with spot instances on EC2, we might have to reclaim that capacity at any time for on-demand workloads. On EC2, you get a notification and two minutes to shut down your instance before we take it away, and people who've worked with EC2 for a long time know how to handle this. On SageMaker, it's much simpler: if we have to reclaim capacity, your training job is interrupted, and SageMaker restarts it automatically. If you use checkpointing, which saves the model being trained at periodic intervals, SageMaker restarts the job from the latest checkpoint. Some built-in algorithms support checkpointing; refer to the SageMaker documentation. TensorFlow supports checkpointing by default, so there's nothing to do. For other libraries like MXNet, PyTorch, or Keras, you just need to enable checkpointing, which is often a simple parameter setting or a bit of coding.
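Here's a sketch of what that can look like with a Keras script: you point the estimator at an S3 checkpoint location with `checkpoint_s3_uri`, and SageMaker keeps it in sync with the local checkpoint directory (`/opt/ml/checkpoints` by default). The S3 path and script name are placeholders.

```python
# Spot training with checkpointing: if the job is interrupted and restarted,
# it can resume from the latest checkpoint synced to S3.
estimator = TensorFlow(
    entry_point="train.py",               # placeholder training script
    role=role,
    script_mode=True,
    framework_version="1.15",
    py_version="py3",
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    train_use_spot_instances=True,
    train_max_run=1800,
    train_max_wait=3600,
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # placeholder; synced with /opt/ml/checkpoints
)

# Inside the training script, a Keras callback is enough to write checkpoints
# to the directory SageMaker syncs to S3:
#
#   callback = tf.keras.callbacks.ModelCheckpoint("/opt/ml/checkpoints/model-{epoch:02d}.h5")
#   model.fit(x_train, y_train, epochs=epochs, callbacks=[callback])
```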
When would you not use spot instances? If you have very long-running training jobs that cannot checkpoint for some reason, it might be better to use on-demand instances. If your job gets interrupted, you'll restart from scratch, which can be costly and frustrating. However, for everything else, give spot instances a try because 77% savings or more, depending on the instance type, is a compelling proposition.
If you want to know more, please take a look at the SageMaker SDK documentation and read the blog post I wrote when this feature came out. I'll put all those URLs in the description. That's it for spot training. I'll see you soon for other cool SageMaker features. See you.