Accelerate PyTorch Transformers with Intel Sapphire Rapids part 2

February 07, 2023
In this video, you will learn how to accelerate PyTorch inference with an Intel Sapphire Rapids server running on AWS. Working with popular Hugging Face transformers implemented with PyTorch, we'll first measure their performance on an Ice Lake server for short and long NLP token sequences. Then, we'll do the same with a Sapphire Rapids server and the latest version of Hugging Face Optimum Intel, an open-source library dedicated to hardware acceleration for Intel platforms.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Blog post: https://huggingface.co/blog/intel-sapphire-rapids-inference
- Code: https://gist.github.com/juliensimon/7ae1c8d12e8a27516e1392a3c73ac1cc
- Intel Sapphire Rapids: https://en.wikipedia.org/wiki/Sapphire_Rapids
- Intel Advanced Matrix Extensions: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
- Amazon EC2 R7iz: https://aws.amazon.com/ec2/instance-types/r7iz

Transcript

Hi everybody, this is Julien from Arcee. In a previous video, I showed you how to run distributed training on the latest generation of Intel Xeon CPUs, based on the Sapphire Rapids architecture, and we got a really good speedup. In this video, we're going to use those CPUs again, but for inference this time. We'll start from a few NLP models, benchmark them on the previous generation of Intel CPUs, based on the Ice Lake architecture, and then, of course, we'll run the same tests on the new CPUs and see what's what. We'll also add our own Optimum Intel library to the mix for extra performance. So get ready for some serious speedup. Let's get to work.

To begin with, let's look at the test servers we're going to use. I'm going to use a c6i EC2 instance, which is based on the Ice Lake architecture, so that will give me my baseline, and an r7iz instance, which is based on the Sapphire Rapids architecture. At the time of recording, these are still in preview. They have the same size, 16xlarge, and I started from the same AMI to set them up. The setup itself is pretty simple; you'll find all the instructions in the companion blog post, link in the video description. I'm installing PyTorch, the Intel Extension for PyTorch, which brings the hardware acceleration for those different chips (we'll talk about the new Sapphire Rapids features when we get there), and of course the Transformers library. On the Sapphire Rapids instance, I'm also adding Optimum Intel, and we'll see why.

Let's look at the code. We'll start, of course, with the Ice Lake instance. The benchmark is pretty simple. I'm starting from three models, DistilBERT, BERT base, and RoBERTa base, and I'm creating a sentiment analysis pipeline for each of them. I'm going to run predictions on a short customer review, as you can see here, around 16 tokens, and on a longer one, which I think is close to 128 tokens. I'll iterate on those and try to measure something meaningful: short sentence, long sentence, then batching the short sentence, then batching the long sentence, and doing this for each of the models. The benchmark itself is very simple: I warm everything up for 100 iterations, then literally predict a thousand times, store all the prediction times, and return the mean and the 99th percentile. That should give me a decent view of what's happening in real life. Okay, so we can just run this. It's going to run for a minute or two, and then I'll be back and we'll look at the times.

Okay, so after a few minutes, we have our results, and they're pretty close to the ones in the blog post, so let's look at those. You want to focus on the gray area here, the c6i column. For DistilBERT, P99 latency is around 5 milliseconds; for BERT, it's a little over 10, and it's about the same for RoBERTa. For longer sequences, we can't get single-digit latency, even for DistilBERT: we see 11 milliseconds, 20, and about 20 again for RoBERTa. The numbers for the batch predictions are comparable. So not ugly, but it's difficult to stay under the 10-millisecond mark on this instance. Now let's run the same test on the other instance. And here we're actually running the test twice.
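For reference, the benchmark just described looks roughly like this. It's a minimal sketch, not the exact script from the linked gist, and the model identifiers and customer reviews here are illustrative placeholders:

```python
import time
import numpy as np
from transformers import pipeline

# Illustrative model list; see the linked gist for the exact identifiers used in the video
models = ["distilbert-base-uncased-finetuned-sst-2-english",
          "bert-base-uncased",
          "roberta-base"]

# Placeholder reviews, roughly matching the token counts mentioned in the video
short_review = "This product works great and shipping was fast."
long_review = " ".join([short_review] * 8)  # roughly 128 tokens

def benchmark(pipe, text, warmup=100, iterations=1000):
    # Warm up so one-off costs (first-call setup, caching) don't skew the numbers
    for _ in range(warmup):
        pipe(text)
    # Time each prediction individually, then report mean and 99th percentile
    latencies = []
    for _ in range(iterations):
        start = time.time()
        pipe(text)
        latencies.append(time.time() - start)
    return np.mean(latencies), np.percentile(latencies, 99)

for model_id in models:
    sentiment_pipe = pipeline("sentiment-analysis", model=model_id)
    for label, text in [("short", short_review), ("long", long_review)]:
        mean_s, p99_s = benchmark(sentiment_pipe, text)
        print(f"{model_id} / {label}: mean {mean_s*1000:.2f} ms, p99 {p99_s*1000:.2f} ms")
```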
We're running it with the vanilla Transformers pipeline, as you see here. Exact same code, just a different instance. And then we run it with an Optimum Intel pipeline. The difference is that the Optimum Intel pipeline lets us leverage the hardware acceleration features present on those Sapphire Rapids instances, like AMX and BFloat16 support. If you want to be sure your instance actually has them, you can just run lscpu and look at the flags: you should see the AMX flags, along with all the AVX-512 flags that have been around for a long time. What's really new with Sapphire Rapids is Advanced Matrix Extensions, which bring new two-dimensional hardware registers, called tile registers, that can be used for matrix multiply-and-accumulate. That multiply-and-accumulate operation is available for int8 and for BF16.

If you're not so sure what BF16 is, here it is. We all know FP32 and FP16, and the problem with FP16 is that its exponent is quite a bit shorter, which can create overflow issues. BF16 has the same exponent size as FP32, 8 bits, so it can represent the same range of values. Obviously, we get less granularity because it has fewer significand bits, but at least the range of values we can cover is exactly the same, which pretty much removes the overflow issue you could see with FP16. So BF16 is great for that, and it's also very fast thanks to the AMX extension. To use all of this, as you can see, the only thing we need to do is create the Optimum Intel pipeline, pass the bfloat16 data type, and enable the JIT, so that we can use TorchScript and the underlying acceleration present in PyTorch as well. The rest, of course, is the same. Okay, very simple. All right, let's run this and see what kind of speedup we get. It'll take a few minutes, so I'll see you there.

After a few minutes, we get our results. Again, they're really close to the ones in the post. We can see that switching from Ice Lake to Sapphire Rapids, by itself, brings some speedup. This is quite noticeable on DistilBERT, where we go from 5.48 to 4.57 milliseconds, and less noticeable on the bigger models, but there is maybe a 20 to 30% improvement just from the new generation. Where we start hitting really sweet numbers is when we run the Optimum Intel pipeline with BF16. Now we drop to very low numbers: we're consistently under two milliseconds for DistilBERT on short sequences, and under five on long sequences. Even with the bigger models, we're extremely close to 10 milliseconds, even with longer sequences. Generally, across the board, we see anywhere from, let's say, 55 to 65% latency reduction, which means about a 3x speedup, and that is very, very noticeable.

So this means that even on CPU, you can get low single-digit latency for DistilBERT and RoBERTa on short sequences, and you can stay within 10 milliseconds for longer sequences. That's a really good number. This used to be GPU territory, one-millisecond or two-millisecond predictions, and now you can do the same with a CPU, which is generally easier to work with and easier to manage in your infrastructure. I'm sure you have plenty of CPU servers lying around; well, now you can use them for inference. No need to go and grab expensive, difficult-to-manage, difficult-to-procure GPU servers. But that's just me.
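For reference, here are two short sketches of what was just described. First, the BF16 range argument can be checked directly with PyTorch: bfloat16 keeps FP32's 8-bit exponent (same maximum value), while FP16's smaller exponent caps it at 65504, and the larger eps shows bfloat16's coarser precision.

```python
import torch

# Same range as FP32 for bfloat16, much smaller range for float16,
# but coarser precision (larger eps) for bfloat16 than for float32.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, eps={info.eps:.3e}")
```

And this is roughly what the Optimum Intel pipeline looks like. It's a minimal sketch based on the companion blog post, assuming the `inference_mode` wrapper that Optimum Intel provided at the time; see the linked gist for the exact code and library versions.

```python
import torch
from transformers import pipeline
from optimum.intel import inference_mode  # assumed API, as described in the companion blog post

# Start from a vanilla Transformers pipeline (illustrative model identifier)...
pipe = pipeline("sentiment-analysis",
                model="distilbert-base-uncased-finetuned-sst-2-english")

# ...then wrap it: cast to bfloat16 and enable TorchScript tracing (jit=True)
# so the AMX/BF16 acceleration on Sapphire Rapids can kick in.
with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
    print(opt_pipe("This product works great and shipping was fast."))
```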
Well, that's pretty much what I wanted to show you. Again, go and check out the blog. You can see some numbers. You can see the setup. And of course, the code is available, and you can replicate all of this yourself with your data and your models. OK? Well, that's it for today. I hope that was fun and informative. And until next time, keep rocking.

Tags

CPU Inference, Sapphire Rapids, Optimum Intel, NLP Models, Performance Benchmark