Accelerate Transformer inference on CPU with Optimum and Intel OpenVINO

November 15, 2022
In this video, I show you how to accelerate Transformer inference with Optimum, an open-source library by Hugging Face, and Intel OpenVINO. I start from a Vision Transformer model fine-tuned for image classification, and quantize it with OpenVINO. Running benchmarks on an AWS c6i instance (Intel Ice Lake architecture), we speed up the original model by more than 20% and divide its size by almost 4, with just a few lines of simple Python code and only a tiny accuracy drop!

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Optimum: https://github.com/huggingface/optimum
- Optimum docs: https://huggingface.co/docs/optimum/onnxruntime/overview
- Intel OpenVINO: https://docs.openvino.ai/latest/index.html
- Original model: https://huggingface.co/juliensimon/autotrain-food101-1471154050
- Code: https://github.com/juliensimon/huggingface-demos/tree/main/optimum/openvino

Transcript

Hi everybody, this is Julien from Arcee. In this video, we're going to continue exploring how we can accelerate transformer inference on Intel CPUs with the Optimum library. In a previous video, I showed you how to do this with the ONNX tools, and in this video, we're going to do the same with the Intel OpenVINO tools, and we're going to work with a vision transformer. OK, let's get to work.

Here, we're going to optimize our model for the Intel Xeon architecture. So if you have one of those CPUs, maybe on your PC, you're good to go. But to maximize the results, I'm going to use a fairly recent one based on the Ice Lake architecture. An easy way to do this is to fire up a c6i instance on Amazon EC2, because those instances are based on this particular architecture, which has the most bells and whistles for deep learning optimization. So this is what I've done: I started a c6i.8xlarge instance on EC2. This particular size has 32 vCPUs, so 16 physical cores, which should be enough to get some good results. And I used an Ubuntu 20 AMI.

Of course, we have a few setup steps to go through, so let's go quickly through them, because setup can be a little frustrating sometimes. Update all the packages and install pip; why it's not on this particular AMI is beyond me, but okay. Make sure you have the latest pip version, because otherwise it seems to have issues locating the latest version of OpenVINO. Don't ask me why. Also make sure that version of pip is in the path. Then install virtualenv, create a virtual environment for OpenVINO, activate it, and then we can install our requirements.

The requirements are these. Here I'm installing the latest version of PyTorch at the time of recording, and I want the CPU version. I need torchvision because we're going to work with a computer vision model, so there are some related objects. Of course, I need Transformers and Datasets. I need the Optimum library with the OpenVINO and NNCF extras; NNCF is the Neural Network Compression Framework, which is part of the Intel collection of tools. I'm also installing the Evaluate library, because I'm going to score the models against a test set, and Evaluate is an easy way to do that. That's a Hugging Face library, by the way. And I need scikit-learn to pull in some extra dependencies. Install all that and you're ready to go.

Once we've installed our dependencies, we can switch to the code. So let's look at this script. We start from this model, which is a model that I fine-tuned with AutoTrain. The architecture here is a Vision Transformer, a popular model for image classification, and this one has been fine-tuned on the Food 101 dataset. That's covered in another video if you want to go and check it out, but it's not so important for now. So: a Vision Transformer already trained for image classification on the Food 101 dataset.

As usual, I can load the model and load the feature extractor. The first thing I want to do is create a pipeline from this original model so that I can get a baseline on its performance, and as mentioned, the Evaluate library is a good way to do this. First, I load 10% of the validation set for the Food 101 dataset, so a few thousand images. Then I load the accuracy metric, which is what I'm going to use for comparison here, and create an evaluator.
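As a rough sketch, the baseline step looks like this. This is a condensed reconstruction based on the steps just described, not the exact script from the repo; it assumes the requirements above are installed and uses the standard Evaluate image-classification evaluator.

    from datasets import load_dataset
    from evaluate import evaluator
    from transformers import AutoFeatureExtractor, AutoModelForImageClassification, pipeline

    model_id = "juliensimon/autotrain-food101-1471154050"

    # Load the fine-tuned Vision Transformer and its feature extractor
    model = AutoModelForImageClassification.from_pretrained(model_id)
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

    # 10% of the Food 101 validation set, about 2,500 images
    eval_dataset = load_dataset("food101", split="validation[:10%]")

    task_evaluator = evaluator("image-classification")

    def evaluate_pipeline(pipe):
        # Returns accuracy plus timing statistics (total time, latency, throughput)
        return task_evaluator.compute(
            model_or_pipeline=pipe,
            data=eval_dataset,
            metric="accuracy",
            label_column="label",
            label_mapping=pipe.model.config.label2id,
        )

    # Baseline: evaluate a pipeline built from the original PyTorch model
    original_pipe = pipeline(
        "image-classification", model=model, feature_extractor=feature_extractor
    )
    print(evaluate_pipeline(original_pipe))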
And then with this evaluator, I can compute the accuracy of this model on this dataset, also passing the name of the column that contains the labels and the mapping between class names and class labels. The reason why I'm putting this in a function is that I'm going to reuse it a couple of times. So there we go: we evaluate the pipeline and we print out the results. That's going to give me the total amount of time required by the model to predict the test set, and the average latency.

So let's run this first benchmark. I want to make sure I'm in my environment here. OK, let's just run this and see how it goes. This is the dataset we're going to use, so yes, about 2,500 images. Let's give it a minute to do what it needs to do, and we'll look at the results. Once the evaluation loop is complete, we see this took about 87 seconds and the average latency was 34 milliseconds per image. So that's the baseline. Let's save those numbers here, and we'll compare later.

So now let's move on to optimizing the model. The first step is to create an OpenVINO quantizer and an OpenVINO configuration. I'll stick with the default configuration here; we'll print it out to see what's actually in that object, and you'll find more details on the OpenVINO website if you want to tweak those parameters. So we create the quantizer object, and then we need to grab a calibration dataset that the quantizer uses during the quantization process. Here I'm going to go with a thousand images from the training set. I think the default is 300, so you don't need a lot of data here.

Then I need to process those images, and this really has nothing to do with the quantization process; it's really torch and torchvision stuff, where I need to apply to the dataset the same preprocessing that was used during training. That means normalizing the images according to the mean and standard deviation values present in the feature extractor, resizing them, and center-cropping them. Again, this has nothing to do with quantization; it's just to make sure the images are similar to the ones used during training. Next, I define a transform function to apply those transformations to the dataset, which is what we do here. Then I define a data collator function, because the model expects the input features in a certain format: we need a variable called pixel_values and a variable called labels. That's what this function does; it just presents the images in the expected format. Again, this is vanilla PyTorch and has nothing to do with quantization.

Once we have the data in the format the model expects, we can call the quantizer, passing the config, the calibration dataset, and the data collator, and deciding where to save the quantized model. We save the feature extractor in the same place as well, so that it's convenient to load everything again from the same local folder. And once we've done that, we evaluate the quantized model. This is very easy, and very similar to the Transformers workflow: instead of using the AutoModelForImageClassification class, we use the OVModelForImageClassification class to load the model. Then, of course, we create a pipeline, we evaluate it, and we're going to see our results.
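Here is a sketch of those quantization steps, continuing from the snippet above. The class and argument names match the optimum-intel releases from around that time (newer releases also expose these classes under optimum.intel), so treat it as illustrative rather than authoritative; ./quantized_model is just an arbitrary local folder.

    import torch
    from optimum.intel.openvino import OVConfig, OVModelForImageClassification, OVQuantizer
    from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor
    from transformers import pipeline

    save_dir = "./quantized_model"

    # Default quantization configuration; print it to see the NNCF parameters
    quantization_config = OVConfig()
    quantizer = OVQuantizer.from_pretrained(model)

    # Calibration data: 1,000 images from the training set
    calibration_dataset = quantizer.get_calibration_dataset(
        "food101", num_samples=1000, dataset_split="train"
    )

    # Reproduce the training-time preprocessing: resize, center-crop, and
    # normalize with the mean/std stored in the feature extractor.
    # Note: feature_extractor.size is an int in the transformers version used
    # here; newer versions return a dict instead.
    transforms = Compose([
        Resize(feature_extractor.size),
        CenterCrop(feature_extractor.size),
        ToTensor(),
        Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
    ])

    def preprocess(examples):
        examples["pixel_values"] = [transforms(img.convert("RGB")) for img in examples["image"]]
        return examples

    calibration_dataset.set_transform(preprocess)

    def collate_fn(examples):
        # Present batches in the format the model expects: pixel_values and labels
        pixel_values = torch.stack([example["pixel_values"] for example in examples])
        labels = torch.tensor([example["label"] for example in examples])
        return {"pixel_values": pixel_values, "labels": labels}

    # Quantize and save the model, keeping the raw image column for the transform
    quantizer.quantize(
        quantization_config=quantization_config,
        calibration_dataset=calibration_dataset,
        data_collator=collate_fn,
        remove_unused_columns=False,
        save_directory=save_dir,
    )
    feature_extractor.save_pretrained(save_dir)

    # Load the quantized model with the OV* class instead of AutoModel*
    ov_model = OVModelForImageClassification.from_pretrained(save_dir)
    ov_pipe = pipeline(
        "image-classification", model=ov_model, feature_extractor=feature_extractor
    )
    print(evaluate_pipeline(ov_pipe))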
So let's run this thing end to end, and we can compare the results. Here we go. It's going to take a couple of minutes, so I'll pause the video.

A few minutes later, the whole process is complete and we can see our results here. Let's grab these numbers and compare them to the previous ones. First of all, we see that accuracy is almost the same; we have a tiny drop, but it's really minimal. We went from 87 seconds to 68 seconds, which is a bit more than 20% faster. That's a very good speedup for almost no loss in accuracy. And our latency is now 27 milliseconds instead of 34.

That's a pretty nice speedup, but it's not the only benefit we get. If we look at the original model and check its size, we see it's about 334 megabytes. And if we go into the directory where we saved the quantized model, we can see the model is now 85 megabytes, which is considerably smaller. 334 divided by 85: that's almost four times smaller (see the quick size-check sketch below). This is very significant, because it means we're going to use far less memory to load the model, and we could potentially load it on much smaller devices as well. That's really what OpenVINO is all about: shrinking those models to speed them up, but also to make it possible to load them on smaller devices, maybe even edge devices.

So that's what I wanted to show you today. I'm sure we'll come back to model optimization, because it's a really fascinating topic and it's so important for production. But that's it for today. I hope this was useful, I hope this was fun, and until I see you next time, keep rocking.
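As a footnote, here is a minimal sketch for reproducing the size comparison above. The ./original_model folder is a hypothetical local copy saved just for this check; the quantized folder is the one created during quantization.

    import os

    def dir_size_mb(path):
        # Total size of all files under a directory, in megabytes
        total = sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, names in os.walk(path)
            for name in names
        )
        return total / (1024 * 1024)

    model.save_pretrained("./original_model")  # hypothetical local copy for comparison
    print(f"Original : {dir_size_mb('./original_model'):.0f} MB")   # ~334 MB
    print(f"Quantized: {dir_size_mb('./quantized_model'):.0f} MB")  # ~85 MB, almost 4x smaller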

Tags

Intel CPUs, OpenVINO, Transformer Optimization