Hi, Julien from Arcee here. In this video, I would like to talk about a really cool cost optimization feature called Amazon Elastic Inference. Let me explain. A lot of people train deep learning models and then deploy them on GPU instances like the P2 or P3 families. But as it turns out, many models are not large enough or parallel enough to really make good use of the thousands of cores on those GPUs, so you end up paying for a GPU instance that you only use at maybe 10, 20, or 30% capacity. And of course that's not good, because we'd like you to pay exactly for what you use and nothing more. So we came up with a service called Amazon Elastic Inference, which was actually launched at re:Invent last year, but I still come across a lot of customers who've never heard of it. So I figured it's time to refresh your memory, especially since just a few months ago we launched a new generation of Elastic Inference accelerators, giving you twice as much GPU memory as the previous generation.
So let's look at an example, and then we'll dive a little deeper. First, I'm going to show you an example based on Amazon SageMaker. Here, I trained an image classification model using the built-in algorithm in SageMaker. As usual, you'll find the link to this notebook in the video description. I trained this model using the generic SageMaker estimator: I set its hyperparameters, trained it, and it trained for a while. Since this is a deep learning algorithm, in fact a ResNet architecture implemented with Apache MXNet, you'd think it would make sense to deploy it to, say, a p2.xlarge instance, which is the least expensive GPU instance we have, at $1.36 per hour in eu-west-1, the region I mostly use; pricing will be similar in other regions. And of course, I could deploy the model just like that and predict with it. So let's run those cells: grab an image showing one of the classes present in the model I trained, then invoke my endpoint with the predict API. Okay, fine, the image is properly classified. But if we were doing this at scale and looked at the CloudWatch metrics for this endpoint, we'd probably see that we don't really keep it busy, because I trained the smaller ResNet version, which has only 18 layers. My gut feeling is that this is not large enough to keep the GPU completely busy, so I'm paying for it but I don't think I'm getting the best bang for the buck. The alternative is to use Elastic Inference, combining a CPU instance with an Elastic Inference accelerator. The accelerators come in three sizes, medium, large, and xlarge, each giving you a certain number of teraflops and a certain amount of GPU memory. And again, compared to EIA1, the EIA2 accelerators have twice as much memory. So make sure you use those: the extra memory lets you pack more samples into a single prediction request, which increases throughput and makes your application even more cost-effective and efficient. So use EIA2 from now on. That combination here costs about $0.32 per hour. If you compare it to the cost of a p2.xlarge, it's a huge discount, actually a 77% discount, which means you save around $754 per instance per month. So if you're running multiple endpoints 24/7, this could be a huge cost optimization for you. Please consider using that service.
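To make this concrete, here is a minimal sketch of what the Elastic Inference deployment and the cost math look like with the SageMaker Python SDK. This is not the exact notebook code: the estimator variable name, the ml.c5.large instance, the ml.eia2.medium accelerator, and the prices are illustrative assumptions, so check the notebook and the pricing page for the real values.

```python
# Sketch: deploy the trained image classification estimator on a CPU instance
# with an EIA2 accelerator attached, instead of a full GPU instance.
# 'ic_estimator' is assumed to be the trained sagemaker.estimator.Estimator.

predictor = ic_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",        # CPU instance hosting the endpoint
    accelerator_type="ml.eia2.medium",  # Elastic Inference accelerator
)

# Back-of-the-envelope cost comparison, using the approximate on-demand
# prices quoted above (eu-west-1):
gpu_price = 1.361        # ml.p2.xlarge, $/hour
eia_combo_price = 0.320  # ml.c5.large + ml.eia2.medium, $/hour

discount = 1 - eia_combo_price / gpu_price
monthly_savings = (gpu_price - eia_combo_price) * 24 * 30

print(f"discount: {discount:.0%}, monthly savings per endpoint: ${monthly_savings:.0f}")
# -> roughly a 76-77% discount and about $750 per endpoint per month,
#    in line with the numbers quoted above; exact figures depend on current pricing.
```

The only change compared to a plain GPU deployment is the accelerator_type argument; everything else, including the predict call, stays the same.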
What about performance? Well, your mileage may vary, but generally we find that this combination is quite close to a p2.xlarge. So all things considered, you get similar performance, maybe within 10% or so, and a 77% discount. It's up to you. Some customers want the best possible performance, in which case they're ready to pay a little more for full-fledged GPU instances, even if they're not fully using them. For other customers, cost is a more important factor, and they really want the best cost-performance ratio; for them, Elastic Inference is an easy way to get it. Here we're using SageMaker endpoints, but the service is also available on EC2. You can use Elastic Inference with TensorFlow and MXNet: if you use the Deep Learning AMIs or the Deep Learning Containers, we have added extra APIs to TensorFlow and MXNet that let you use those accelerators very easily. In the blog post I wrote a while ago, there's an example with MXNet; it's extremely well integrated. For MXNet, you just say, "Hey, please let me use the Elastic Inference accelerator context," and that's about it; I think that's the only modification you need to make in your code. And for TensorFlow, we added a `tf.contrib` package that extends the `tf` estimator and lets you use Elastic Inference. So if you're using Elastic Inference outside of SageMaker, you have the flexibility to come up with the exact combination you want. Maybe you're running a compute-intensive app, so a C5 instance would be interesting for the app or web service itself, and you need some acceleration because that app is also running predictions with a machine learning model. You can easily benchmark the different instance types for the app and the different accelerator sizes, medium, large, and xlarge, for the machine learning part. That way you really get the best of both worlds: you don't have to compromise between a C-family instance that's probably too slow for your machine learning predictions and a GPU instance that's fast for machine learning but maybe not the best instance type for the actual app running on it. So it's a very flexible service, and again, the cost savings are really impressive.
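For the EC2 scenario, here is a minimal sketch of the MXNet side, assuming an instance launched from the Deep Learning AMI with an accelerator attached and the EI-enabled build of Apache MXNet, which exposes the extra accelerator context mentioned above. The checkpoint prefix and input shape are placeholders.

```python
# Sketch: running inference on an Elastic Inference accelerator from EC2
# with the EI-enabled build of Apache MXNet (Deep Learning AMI).
import mxnet as mx

# Load a pre-trained model checkpoint (placeholder prefix and epoch).
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-18", 0)

# The only EI-specific change: bind the model on the accelerator context
# instead of mx.cpu() or mx.gpu().
ctx = mx.eia()

mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# Run a prediction on a dummy batch to check everything works.
batch = mx.io.DataBatch([mx.nd.zeros((1, 3, 224, 224))])
mod.forward(batch, is_train=False)
print(mod.get_outputs()[0].shape)
```

Everything else is standard MXNet inference code, which is what makes the integration so simple.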
For training, you can use spot instances; I have another video about that on my channel. If you combine spot training and Elastic Inference, you can save 70-80% across the end-to-end machine learning workflow, from training to deployment. So I would take a look, although I'm not paying my own bills. But if I were, I would absolutely investigate and see if I could save a whole bunch of money here. Okay, well, that's it for Elastic Inference: super simple to use in SageMaker, and also available on EC2 with TensorFlow and MXNet using the extra APIs that we contributed. So give it a try. I'll add links to all of that in the video description, and I guess that's it for this one. See you later. Bye-bye.