Hey, Julien from Arcee here. In this video, I'd like to talk about a new chip and a new instance type that we announced at re:Invent: the AWS Inferentia chip and the Amazon EC2 Inf1 instances. This chip and instance type have been designed to deliver high-throughput, low-latency, cost-effective predictions at scale. Let me show you how to get started. This is based on a blog post that I wrote, and of course I'll put the URL in the video description. There you'll find a whole bunch of resources to learn about Inferentia, including the breakout session from re:Invent. If you're curious about the chip itself, that's where to go. But for now, let's focus on the instances.
The Inferentia chip requires that your deep learning models be compiled. We support TensorFlow, PyTorch, MXNet, and ONNX models. You could compile the model on any instance, but we recommend using a compute-optimized instance. That's why I've created a C5 instance here, which I'll use for model compilation, and I'll run inference on an inf1.xlarge instance. I've created both already, so let me quickly show you how to do this.
Just click on "Launch instance" as usual. You want to use the latest version of the Deep Learning AMI, because it has everything you need. There's a whole bunch of AMIs here, and the one you want is the Deep Learning AMI version 26, with an AMI ID ending in 3710; that's the one with all the tools we're going to use. Select it, then pick a C5 instance. That's the one you would use for compilation, and you would do the same for inference. Of course, you could start an inference instance and just build and compile the model on it, which is fine for testing. For production workloads, where you would be compiling models again and again as part of your model pipeline, it's probably a better idea to split the two tasks across separate instances.
I've got my instances now, so let's switch to this view. On the left, I have my C5 instance. If you don't see conda environments labeled AWS Neuron, you probably used the wrong AMI; you specifically need to see them here, including the MXNet environment and the TensorFlow environment. What's AWS Neuron? AWS Neuron is an SDK that includes all the tools we need to compile models and then use them for prediction with TensorFlow, MXNet, and so on. Make sure you see these environments; otherwise, you're missing the required tools. I'm going to work with a TensorFlow model, so I select the TensorFlow conda environment. This automatically gives me the proper TensorFlow version and all the tools.
The first step is to bring a vanilla TensorFlow model and compile it. We have a bit of code here; it's not a lot. First, we grab a ResNet-50 model, an image classifier pre-trained on ImageNet, a very large multi-million-image dataset, and save it to a specific directory in the SavedModel format, which is the standard format for TensorFlow models. Then we compile it. That's where we use the Neuron SDK, and a single line of code is really all it takes. It comes from the `tensorflow.neuron` package, an extension we provide, which is why it's important to use the Deep Learning AMI: it ensures you have all the necessary tools. We just say, "Hey, compile the model saved in this model directory, and write the compiled model into this other directory." Finally, we zip the compiled model directory, because that makes it easy to ship to other instances.
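For reference, here's a minimal sketch of what that compilation script can look like, assuming the TensorFlow 1.x Neuron conda environment from the Deep Learning AMI; the directory names are illustrative, not necessarily the ones from the blog post:

```python
# Sketch of the compile step on the C5 instance (TF 1.x Neuron environment assumed).
import shutil
import tensorflow as tf
import tensorflow.neuron as tfn
from tensorflow.keras.applications.resnet50 import ResNet50

MODEL_DIR = 'resnet50'                 # vanilla SavedModel (illustrative path)
COMPILED_MODEL_DIR = 'resnet50_neuron' # Neuron-compiled SavedModel (illustrative path)

# Grab the pre-trained ResNet-50 image classifier (ImageNet weights)
tf.keras.backend.set_learning_phase(0)
model = ResNet50(weights='imagenet')

# Save it in the standard SavedModel format
tf.saved_model.simple_save(
    session=tf.keras.backend.get_session(),
    export_dir=MODEL_DIR,
    inputs={'input': model.inputs[0]},
    outputs={'output': model.outputs[0]})

# Compile the SavedModel for Inferentia with the Neuron SDK (the single line that matters)
tfn.saved_model.compile(MODEL_DIR, COMPILED_MODEL_DIR)

# Zip the compiled model directory so it's easy to ship to other instances
shutil.make_archive('resnet50_neuron', 'zip', COMPILED_MODEL_DIR)
```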
Then you just run the script with `python`. It runs for a few minutes, which is why you want a powerful instance: compilation is a pretty heavy process. I've run it before, so there's nothing spectacular to see here. The vanilla model is transformed into a hardware-optimized representation for Inferentia, meaning TensorFlow operators are replaced with hardware-optimized operators. We end up with this zip file, and if we look inside, we still see the SavedModel format: the layout is the vanilla one, but the model has been optimized for Inferentia.
There are many ways to do this, but I'm going to copy it to an S3 bucket. You could ship it to your Inf1 instance any way you like; for me, it's easiest to copy it to S3 and grab it again from there. We're done with the compilation instance, so we can close it. Now I can log in to my Inf1 instance. I see the Neuron-enabled environments, which are what I want. We have a bunch of CLI tools as well. `neuron-cc` is the compiler: if we don't want to compile using framework APIs like we did on the C5 instance, we can compile with the CLI instead. We have a few more tools. `neuron-ls` shows how many NeuronCores are available on this instance. This is an inf1.xlarge instance, so it has a single Inferentia chip, and there are four cores per chip; if we had multiple Inferentia chips, we would see multiples of four. The `neuron-cli` tool lets us list the models that have been loaded (nothing here yet, of course) and list NeuronCore groups: you can partition the cores into different groups and load different models on different groups. If you have four models and want to use a different core for each one, you can do that. Feel free to play around with these tools to manage cores and models.
Let's copy the artifact from S3. I'm going to move to this inf1 directory, where we see our zip file again because I've already copied it. I've extracted it and added an intermediate directory called "1". This is how it looks; let me explain why I've done this. The directory "1" contains the extracted model because I'm going to use TensorFlow Serving for prediction, and TensorFlow Serving requires models to be organized in folders that reflect model versions. Here I'm only working with one model, so it's model 1, version 1; that's why I extracted it into that subdirectory called "1". Now I want to try the model, so first I'm going to use it as is: just load the model and predict.
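As an illustration, here's a hypothetical sketch of laying the compiled model out in that versioned structure; the model name, file names, and zip layout are assumptions matching the compile sketch above, not the exact ones from the blog post:

```python
# Hypothetical sketch: unpack the compiled model into the versioned layout
# TensorFlow Serving expects (model_name/version/...).
import os
import zipfile

MODEL_NAME = 'resnet50'                       # illustrative model name
VERSION_DIR = os.path.join(MODEL_NAME, '1')   # model 1, version 1

os.makedirs(VERSION_DIR, exist_ok=True)
with zipfile.ZipFile('resnet50_neuron.zip') as archive:
    archive.extractall(VERSION_DIR)           # saved_model.pb + variables/
```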
What am I doing here? I'm loading a test image that I downloaded. It's a cat image, as you could probably guess. I'm converting it to the right format for ResNet-50 prediction; this is vanilla TensorFlow. Then I'm pointing at my model, including that "1" folder (it's not strictly needed here because I'm not using TensorFlow Serving yet, but we will need it later), and loading it from the SavedModel format with a standard TensorFlow API, nothing weird here. Then I define the input and use the model to predict. Why am I predicting a thousand times? It may sound a little silly, but a single prediction would look exactly the same; the loop just runs more of them. Then I print the predictions. If we run this, it loads the SavedModel from the directory, and there you go: it prints predictions. None of this code is Neuron-specific. All the imports are vanilla TensorFlow imports, and we use `tf.contrib.predictor`, which is the standard way of predicting with TensorFlow models. You won't have to change a line of your prediction code, which I think is pretty cool.
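Here's a minimal sketch of that local test, assuming the export step sketched earlier; the image file name and paths are illustrative, and unlike my script it only predicts once, but the call is identical:

```python
# Minimal local prediction sketch (TF 1.x Neuron environment assumed).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load the test image and convert it to the format ResNet-50 expects
img = image.load_img('kitten.jpg', target_size=(224, 224))   # illustrative file name
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Load the compiled model with the standard TensorFlow predictor API
predictor = tf.contrib.predictor.from_saved_model('resnet50/1')

# Predict and print the top 5 classes
result = predictor({'input': x})
print(decode_predictions(result['output'], top=5))
```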
For a quick test, that's all right, but in production we probably want to use TensorFlow Serving, so let's start it. The actual command uses `tensorflow_model_server_neuron`, a Neuron-enabled version of the TensorFlow model server included in the Deep Learning AMI. We pass the name we want to give the model, which is user-defined, and the location of the model. This is where we need that "1" directory, because TensorFlow Serving loads version 1 of the model from it. If you're not familiar with TensorFlow Serving, this might look a little odd, but it's completely standard. I can see my model server is listening on port 8500, so let's send it to the background.
Let's take a look at the prediction script. It's really simple. We load the test image and build a prediction request for a model called ResNet50, which is the name we gave the model when we started the server. Then we send the prediction request to TensorFlow Serving, get the results, and print them out. Once again, this is completely standard TensorFlow; there's nothing Neuron-specific here, so you can use your prediction code unmodified. The only reason to modify it is for more advanced scenarios, like multi-threaded prediction using multiple cores across multiple chips. That takes a few extra lines of code, and you'll find examples and references in the blog post. In most cases, though, you can absolutely use your code unmodified.
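For reference, a gRPC client for that setup can look like the sketch below; it assumes the `tensorflow-serving-api` package is available, that the server was started with the model name `resnet50`, and the same illustrative image file as before:

```python
# Sketch of a gRPC client for the Neuron-enabled TensorFlow Serving instance.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Load and preprocess the test image
img = image.load_img('kitten.jpg', target_size=(224, 224))   # illustrative file name
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Open a channel to the model server listening on port 8500
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the prediction request; the name must match what was passed to the server
request = predict_pb2.PredictRequest()
request.model_spec.name = 'resnet50'
request.inputs['input'].CopyFrom(tf.make_tensor_proto(x, shape=x.shape))

# Send the request and print the top 5 classes
result = stub.Predict(request)
print(decode_predictions(tf.make_ndarray(result.outputs['output']), top=5))
```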
What do we do here? We load the image, build a prediction request for the ResNet50 model we loaded, send the request, and print the results. Let's run the script, and we see predictions. Here we're just running one, but of course you would want to run many more. That's it, that's what I wanted to show you. Again, please check the blog post for additional information on the chip. There's also a really cool workshop that was delivered at re:Invent, diving deep into Inferentia and Inf1 instances and showing you how to do multi-threaded predictions, benchmarks, and profiling with TensorBoard. It's really interesting and well worth your time. That's it for this one. See you later, bye.