Hi everybody, this is Julien from Arcee. As you know by now, I am really obsessed with hardware acceleration for transformer models. And yesterday, AWS Inferentia 2 became generally available. Inferentia is a custom chip designed by AWS to speed up inference. Yes, you guessed it. In a previous video, I showed you how to speed up transformers on the first generation, and now obviously we want to take a look at the second generation of Inferentia chips. I'm actually going to run the same example so that we can get a sense of how much faster Inferentia 2 is compared to Inferentia 1, and I'll pinpoint a couple of SDK changes, but really moving the code from Inferentia 1 to Inferentia 2 was very simple. Okay, let's get started.
As usual, I would encourage you to take a look at the product page to learn a little bit more about Inferentia 2 and see what instance sizes are available. We have xlarge with a single accelerator, 8xlarge with a single accelerator and more memory, and 24xlarge and 48xlarge with more accelerators and even more memory. Those are the sizes, and we'll talk about pricing later in the video; I think pricing is very aggressive and very competitive. Another page you absolutely want to look at is, of course, the SDK page. You probably know the AWS SDK for Inferentia and Trainium is called the Neuron SDK, so go and check out the tutorials and the documentation for Inferentia 2 there. There's another super useful page on the architecture, where you can see the differences between the Inferentia 1 architecture with its NeuronCore v1 cores and the Inferentia 2 architecture with NeuronCore v2. The Inferentia 2 architecture is very close to the Trainium architecture, which is a very interesting move. You can go and read all about it and understand what those cores are and how they work. By default, those cores will optimize models for BF16, which is an interesting format. Just go and read all of that, and you'll learn everything you need to know.
Let's just jump to the code now. Here I'm using an inf2.xlarge instance, the smallest one, with just one Inferentia chip and two Neuron cores, and we can see the cores here. I am using a deep learning AMI built by AWS, and this one is really convenient because it provides a built-in environment with everything set up. You may remember in my previous video I complained about having to set up quite a few things; well, good job AWS on simplifying this. Now the only thing I need to do is activate the environment. If I list the packages, I see PyTorch 1.13.1 and the torch-neuronx package, which is the one we need for Inferentia 2. For Inferentia 1, we used torch-neuron, so that's part of the changes we'll need to make in the code: just fix the import. We also have the Transformers library and, generally, everything we need.
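If you want to double-check that environment before running anything, a quick sanity check along these lines should do. This is just my own snippet, not something shipped with the AMI, and the exact versions you see will depend on the image you picked:

```python
# Quick sanity check of the Neuron environment (run inside the
# activated virtual environment shipped with the deep learning AMI).
import torch
import torch_neuronx   # Inferentia 2 package; Inferentia 1 used torch_neuron
import transformers

print("PyTorch      :", torch.__version__)        # 1.13.1 on this AMI
print("Transformers :", transformers.__version__)
print("torch_neuronx imported fine, we're good to go.")
```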
So now we can just go and run my code. Here we have two examples, which actually come from the Neuron documentation; I tweaked them a little bit for readability. There's one for BERT on the MRPC task, where I split model compilation and model testing. Then there's a version of BERT that I actually fine-tuned on the Yelp review dataset, which is a multi-class classification dataset scoring reviews from one to five stars, so a five-class classification model. Here as well, I split it into training, compilation, and testing. It's all super simple. Again, there's a link to that code in the video description.

Let's take a quick look at the first example, the compilation example. It's almost identical to the Inferentia 1 example. I just had to fix the import here: it's not torch.neuronx, it's torch_neuronx. I made that mistake so you can avoid it. The rest is really the same. Here we're going to use a pre-trained BERT model for the MRPC task, which means finding out whether two sentences have the same meaning or not. We have three sample sentences, from which we build and encode two sentence pairs: one pair that is similar and one that is not. Then we run the vanilla CPU model to see what the predictions look like, print the logits so that we can compare them to the Neuron logits later, and compile the model in one line of code, torch_neuronx.trace, with the model and a sample input. Then we simply save the model. As you can see, if you have existing Hugging Face or PyTorch code, it is super straightforward: import the torch_neuronx package, trace the model, save it, and voilà, you have a Neuron-optimized model.

Let's just run this. Here we see the vanilla logits for the CPU model, and then we optimize the model for Neuron, which should take just a few seconds. The model is compiled with a fixed sequence length, which is the sequence length of your sample input. Obviously, that helps a lot with optimization. You can always use padding, etc., but if all of a sudden you want to predict, let's say, 256-token sequences instead of 128, you need to recompile the model for that. As you see here, it really only takes a few seconds.

Now I've saved my model, and we can take a look at the testing part. It's super simple as well. We need to import the torch_neuronx package again. It's not explicitly used in the script, but it registers all the Neuron-specific operators with PyTorch. Don't forget it, because if you just import torch, you're going to get some weird errors. The rest is very similar: we load the model back, put the samples in the tuple format the TorchScript model expects, predict the paraphrase example and the non-paraphrase example, and print the logits so that we can compare them. That shows us the model works.

Let's try this. It is important to check that the compiled model predicts the same way as the vanilla model. Here I'm running just a simple example, but you would need to not only score the compiled model on whatever metric makes sense for your use case, you would also want to check the distribution of logits to confirm you really get very similar results. Looking at the first example, the vanilla logits were -0.3495 and 1.975, and the Neuron model gives us -0.3497 and 1.8996. Super close, but still not identical. Please run your tests and make sure the compiled model behaves the same. We can see this works.
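For reference, here is a minimal sketch of what that compilation script looks like. I'm reconstructing it from the description above, so treat the checkpoint name (bert-base-cased-finetuned-mrpc), the sample sentences, and the 128-token padding as my own assumptions for illustration; the overall flow, encode, predict on CPU, torch_neuronx.trace, save, is the one described above:

```python
import torch
import torch_neuronx  # the Inferentia 2 package (Inferentia 1 used torch_neuron)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint: a BERT model fine-tuned on MRPC (paraphrase detection).
model_name = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torchscript=True)
model.eval()

# Three sample sentences: seq_0/seq_1 are paraphrases, seq_0/seq_2 are not.
seq_0 = "The company Hugging Face is based in New York City."
seq_1 = "Hugging Face's headquarters are situated in Manhattan."
seq_2 = "Apples are especially bad for your health."

def encode(sentence_a, sentence_b, max_length=128):
    # Pad to a fixed length: the model gets compiled for this exact shape.
    enc = tokenizer(sentence_a, sentence_b, max_length=max_length,
                    padding="max_length", truncation=True, return_tensors="pt")
    return enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]

paraphrase = encode(seq_0, seq_1)
not_paraphrase = encode(seq_0, seq_2)

# Vanilla CPU logits, printed so we can compare them with the Neuron model later.
with torch.no_grad():
    print("CPU logits (paraphrase):    ", model(*paraphrase)[0])
    print("CPU logits (not paraphrase):", model(*not_paraphrase)[0])

# Compile for Inferentia 2 in one line, using a sample input to fix the shape,
# then save the optimized model.
neuron_model = torch_neuronx.trace(model, paraphrase)
torch.jit.save(neuron_model, "bert_mrpc_neuron.pt")
```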
The compilation process is extremely simple, and the model behaves the same way.
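Before we move on to the benchmark, here is the testing half of that example, sketched under the same assumptions (the file name and sentences carry over from the block above). The key detail is importing torch_neuronx again even though it isn't referenced directly, so that torch.jit.load can resolve the Neuron-specific operators:

```python
import torch
import torch_neuronx  # not used directly, but required so that torch.jit.load
                      # can resolve the Neuron-specific operators
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

def encode(sentence_a, sentence_b, max_length=128):
    # Must match the shape the model was compiled with (128 tokens here).
    enc = tokenizer(sentence_a, sentence_b, max_length=max_length,
                    padding="max_length", truncation=True, return_tensors="pt")
    return enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]

paraphrase = encode("The company Hugging Face is based in New York City.",
                    "Hugging Face's headquarters are situated in Manhattan.")
not_paraphrase = encode("The company Hugging Face is based in New York City.",
                        "Apples are especially bad for your health.")

# Load the compiled model back and compare its logits to the CPU ones.
neuron_model = torch.jit.load("bert_mrpc_neuron.pt")
print("Neuron logits (paraphrase):    ", neuron_model(*paraphrase)[0])
print("Neuron logits (not paraphrase):", neuron_model(*not_paraphrase)[0])
```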
Now, of course, we want to run a benchmark. In this example, I'm actually loading a model that I trained on Trainium in another video. I didn't push it to the hub, so I'm loading it from the model artifact directly, but it doesn't make any difference; you could pull a model from the hub instead. I build the config for the model, a BERT sequence classification model, instantiate it, and load the state dictionary. It's just an example of how to load a plain PyTorch checkpoint into a Hugging Face model, which could be useful. Then we have some inputs, we tokenize them, put them in the tuple format TorchScript expects, predict with the vanilla model, print the logits, and then compile the model. It doesn't actually make a difference here, but just as a reminder, I am running on inf2.xlarge with two Neuron cores. We trace the model just like before and save it. So it's really pretty much the same thing we did with the previous example, except this time we're loading the PyTorch model from disk instead of pulling it from the hub.

Now we have a Neuron-optimized model, and we can go and test it: starting from the same examples, preparing them in the exact same way, loading the Neuron-optimized model, predicting with it, printing out logits, etc. Again, this helps me see that I'm not getting anything significantly different from the vanilla model.

So that's really similar to what we did with the previous example, but now we're going to run a benchmark. To do this, we need a benchmarking function, which actually comes from the sample notebooks in the Neuron SDK documentation, so you should definitely reuse it. It takes a file name (the compiled model to load), a sample input, how many copies of the model and how many threads to use for prediction (I'll get back to that in a second), and how many batches we want to run on each thread. Here we have two Neuron cores, so we're going to load two copies of the model, and each copy will predict on one of the cores. This is quite different from Inferentia 1, where you loaded the model once because at compilation time you had optimized it for the exact number of cores the instance would have. I think the v2 behavior is more flexible, because you can compile the model once and then deploy the same model on whatever Inferentia 2 instance you want to use: you just load as many copies as you have Neuron cores, and it will work. No need to recompile the model for the actual number of cores. That's a nice improvement. The function loads the models, creates a couple of prediction threads that iterate over our sample input, stores all the prediction times, computes the P50, P95, and P99 percentiles, and prints some results. It's a really cool function; I'm going to steal it again and again.

And how do we actually run it? We call benchmark with the name of the compiled model and a sample input. Here I'm running 10,000 predictions per thread, so 20,000 total, which should be enough to get a sense of how fast this is, and the batch size in this example is 1. First we see the predictions for the model, just to check that everything works, and we can compare them to the vanilla logits and see the same thing. In another window, I'm running neuron-top, and we can see the two cores crunching through my 20,000 samples. It's already over. Wow, that was quite fast.
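Since I can't paste the whole notebook here, here is a stripped-down version of that benchmarking function, reconstructed from the description above rather than copied from the Neuron sample notebooks, so treat the argument names and the exact reporting as my own sketch. The idea is exactly what I described: load one copy of the compiled model per NeuronCore, run one prediction thread per copy, record per-inference latency, and report throughput plus P50/P95/P99:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch
import torch_neuronx  # registers the Neuron runtime for torch.jit.load


def benchmark(filename, example, n_models=2, n_threads=2, batches_per_thread=10000):
    # Load one copy of the compiled model per NeuronCore; the runtime places
    # each copy on its own core automatically.
    models = [torch.jit.load(filename) for _ in range(n_models)]
    latencies = []

    def task(model):
        for _ in range(batches_per_thread):
            start = time.time()
            model(*example)
            latencies.append(time.time() - start)

    # One prediction thread per model copy.
    begin = time.time()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for i in range(n_threads):
            pool.submit(task, models[i % n_models])
    duration = time.time() - begin

    inferences = n_threads * batches_per_thread
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99]) * 1000  # milliseconds
    print(f"Total time : {duration:.1f} s")
    print(f"Throughput : {inferences / duration:.0f} inferences/s")
    print(f"Latency    : P50 {p50:.2f} ms, P95 {p95:.2f} ms, P99 {p99:.2f} ms")


# Example call: two model copies, two threads, 10,000 predictions each
# (20,000 total), batch size 1. "sample_input" would be the same tuple of
# tensors used at compilation time.
# benchmark("bert_yelp_neuron.pt", sample_input)
```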
It took 11 seconds, and I can see the throughput is close to 1,700 inferences per second. P50 latency is 1.15 milliseconds, and P99 latency is 1.36 milliseconds.
Now what about Inferentia 1? I actually pulled up my old video on Inferentia 1. In that example, I used an inf1.6xlarge instance, which came with 16 NeuronCore v1 cores. That's probably difficult to compare, v1 cores versus v2 cores, but I think it's an okay reference point because it costs about the same as the Inferentia 2 instance I'm using. As you can see, I had higher throughput there, but I also had many more cores, so that's hard to compare directly. Latency, though, was over 2x higher: 3.1 milliseconds versus 1.4, let's say. This goes to show Inferentia 2 is much, much faster. I get 1,700 inferences per second with two cores when I only got 4,000 with 16 cores, so it's a massive jump in per-core performance.

What about longer sequences? I just copied some numbers that I measured earlier today. The example we just saw uses a very short sequence length of 12 tokens. So I bumped it to 176 tokens, and we get a throughput of 652 inferences per second with a latency of 3.55 milliseconds. If we double the sequence length again, throughput is roughly divided by two, and latency is 6.19 milliseconds. If I go close to the maximum sequence length of the model, which is 512 (I only went to 508), I still get over 200 predictions per second and stay under 10 milliseconds of latency. These are really good numbers; we can see good scalability on throughput and latency even with very long sequences. Go and run your own numbers, but this is the quick and dirty evaluation I did this morning, and I'm pretty happy with it.
What about cost? We can see the pricing on the EC2 pricing page. My inf2.xlarge instance costs 76 cents an hour on demand, at least in one US region, and we can see the pricing for the larger instances too. I think this is very competitive. If we compare it to, let's say, G5, the smallest G5 instance, which actually has the same number of vCPUs and the same amount of memory, is just a little more than a dollar an hour. So that's pretty cool. Now, of course, I want to benchmark Inf2 against G5 to see which one is fastest; I'm pretty sure of the result, but still, we need to check. And as always, who wants to pay the on-demand price? I actually fired up this instance as a spot instance. Thank you, AWS, for the discount, which means my $0.76 an hour became something like $0.26, so 26 cents an hour, which is honestly dirt cheap for this kind of computing power, especially compared to GPU instances, which are generally a little difficult to get as spot instances. So there you go: you can run short sequences with BERT in about one millisecond for 26 cents an hour, and, let's say, up to 10 milliseconds for longer sequences. I think that's pretty cool. But I will do that G5 benchmark. I know how much you love them, and they always make me new friends, and I can always use more friends. Okay, enough stupidity. Welcome, Inferentia 2. It's a killer chip, and I can see a lot more videos on this and hopefully a lot more usage. That's pretty much what I wanted to show you today. Have fun, run those examples, try to compile your own models, try your own things, and let me know how you're doing. Until next time, keep rocking.