Hi everybody, this is Julien from Arcee. In the previous video, we discussed how to accelerate inference for transformers on CPU platforms. In this video, I'm going to show you how to do the same on GPU platforms thanks to the integration of BetterTransformer, a really cool PyTorch extension, into our Optimum library. We'll try an NLP model and a computer vision model, and you'll see we get pretty interesting speedups in just one line of code, so let's get to work.
BetterTransformer is a PyTorch extension available since PyTorch 1.12, and you can read the blog post here for details on how BetterTransformer works and what to expect from it. Of course, I'll put the link in the video description. This is now supported in Hugging Face Optimum, which you should be familiar with by now: our open-source library dedicated to hardware acceleration. As of version 1.5, which was released just four days ago, you can use BetterTransformer with just one line of code for models from the Hugging Face hub. So this is super simple. There are some other really cool features in that release, by the way: Whisper is supported with ONNX, and so on. So go check it out. There's a lot of stuff happening in the Optimum library these days. But for now, let's focus on BetterTransformer.
Let's take a look at the code. First, of course, we need to do a little bit of setup. As usual, nothing really complicated. Here, I'm using a GPU instance on AWS, a P3 instance with an NVIDIA GPU. This is running on Ubuntu, and I'm just doing a very simple setup, creating a virtual environment, and installing my requirements. Super nice and simple. The requirements are PyTorch 1.12 or newer, Optimum 1.5 or newer, and then some extra libraries needed to evaluate our models. Okay, so very simple setup, nothing complicated. You can replicate this in seconds.
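If you want a quick sanity check before going further, something like the following minimal sketch verifies the environment from Python; it assumes the packages are already installed and only prints version information.

```python
# Quick environment check for the requirements above (a sketch, not the exact setup I ran).
import torch
import optimum
import transformers

print("torch:", torch.__version__)              # needs to be >= 1.12 for BetterTransformer
print("optimum:", optimum.__version__)          # needs to be >= 1.5 for the one-line integration
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # should be True on the GPU instance
```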
Let's look at our first example. This is actually an example I already used with ONNX. I'm starting from a DistilBERT model that I fine-tuned for text classification. This is multi-class classification on Amazon shoe reviews, predicting the star rating from one to five stars for English-language shoe reviews. The model is on the hub, and this is the dataset I used to fine-tune it. First, I'm going to compute a baseline over the test set just to get a sense of the speed of the original model. I'm using the evaluate library to do this, measuring how long it takes for this model and this pipeline to go over the test set, and I'm using the accuracy metric. So I'm doing this for the original model.
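To make this concrete, here is a minimal sketch of what that baseline run can look like. The model and dataset IDs are placeholders rather than the exact hub names, and the label mapping is an assumption you would adjust to the model's actual labels.

```python
# Baseline: vanilla transformers pipeline evaluated with the evaluate library.
# Model and dataset IDs below are placeholders, not the exact hub IDs.
import time

from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

model_id = "juliensimon/distilbert-amazon-shoe-reviews"                   # placeholder model ID
dataset = load_dataset("juliensimon/amazon-shoe-reviews", split="test")   # placeholder dataset ID

pipe = pipeline("text-classification", model=model_id, device=0)  # original model on GPU 0

task_evaluator = evaluator("text-classification")
start = time.time()
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=dataset,
    metric="accuracy",
    label_mapping={f"LABEL_{i}": i for i in range(5)},  # adjust to the model's label names
)
print(f"accuracy: {results['accuracy']:.4f}, elapsed: {time.time() - start:.1f}s")
```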
Then, of course, I'm going to do the same for the BetterTransformer model. How complicated is it to build a BetterTransformer model with Optimum? Honestly, I don't think it could be simpler. This is really all it takes: we use the pipeline object from the Optimum library with the task type and the model ID as usual, and we define an accelerator, which is the important parameter here; it will be BetterTransformer, of course. I also want to make sure this runs on my first GPU, so I'm using `device=0`. Then, of course, we evaluate the pipeline again and print the results.
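Here is a sketch of that one-liner with Optimum's pipeline, using the same placeholder IDs as above; the only real difference from the baseline is the `accelerator` argument, and the evaluation stays exactly the same.

```python
# BetterTransformer pipeline via Optimum; model and dataset IDs are placeholders.
from datasets import load_dataset
from evaluate import evaluator
from optimum.pipelines import pipeline

model_id = "juliensimon/distilbert-amazon-shoe-reviews"                   # placeholder model ID
dataset = load_dataset("juliensimon/amazon-shoe-reviews", split="test")   # placeholder dataset ID

bt_pipe = pipeline(
    "text-classification",
    model=model_id,
    accelerator="bettertransformer",  # this is the important parameter
    device=0,                         # run on the first GPU
)

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=bt_pipe,
    data=dataset,
    metric="accuracy",
    label_mapping={f"LABEL_{i}": i for i in range(5)},  # adjust to the model's label names
)
print(results)
```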
If you don't want to use the pipeline and would rather load the model and the tokenizer yourself for more control, you can do that as well. This is equally simple: you use the BetterTransformer object, pass it your model, and it returns an optimized version of that model. So you don't have to use the pipeline; I'm just using it because it's so simple, but you could work with the model object itself. Why don't we run this and see how it goes?
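Before running it, here is a minimal sketch of that model-level approach; the model ID is a placeholder and the sample sentence is just for illustration.

```python
# Transform the model object directly instead of going through the pipeline.
import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "juliensimon/distilbert-amazon-shoe-reviews"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda:0")

# Returns an optimized copy of the model; keep_original_model=True would keep the original around too.
bt_model = BetterTransformer.transform(model, keep_original_model=False)

inputs = tokenizer("These shoes are comfortable and look great!", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    logits = bt_model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted star-rating class
```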
After a couple of minutes, we've predicted the test set with the original model and the optimized model. The original model did it in roughly 70 seconds, and the optimized model did it in 57 seconds. That's about a 20% speedup from just that one line of code. And as you can see, there is no change in accuracy. So 20% just like that is very good, right? As easy as it gets.
How about we do the same for our Vision Transformer model? This is the example I've already used in the OpenVINO video. This is the Google Vision Transformer model that I fine-tuned on AutoTrain for image classification on the Food 101 dataset. If you want to see the list of architectures supported by BetterTransformer, you can find this in the docs. You'll find all the popular ones for NLP, the Vision Transformer, and some speech models like Wav2Vec2 and Whisper. So a good selection of models for your different use cases.
Back to our Vision Transformer. We're going to download the original model and score it on 25% of the test set. Then we'll optimize the model with a one-liner and run it again. This is exactly the same: creating a pipeline with the accelerator. Nothing complicated. Let's run this. There we go. In a minute, we'll see how fast we're going. The original model did it in about 104 to 105 seconds, and the BetterTransformer model did it in 72 seconds. That's about a 30% speedup, even better. So there you go. You can speed up your Vision Transformer just like that. Zero accuracy drop, 30%. And again, these are just a couple of models I used here. But feel free to try out different architectures, and you may see even better results.
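For reference, here is what that image-classification run could look like. The model ID is a placeholder for my AutoTrain model, the split expression is an assumption for "25% of the test set", and mapping predicted label names back to dataset IDs through the model config assumes the label names match the dataset's.

```python
# BetterTransformer pipeline for the Vision Transformer, evaluated on 25% of the split.
# The model ID is a placeholder, not the exact AutoTrain model name.
from datasets import load_dataset
from evaluate import evaluator
from optimum.pipelines import pipeline

vit_model_id = "juliensimon/autotrain-food101-vit"        # placeholder model ID
food = load_dataset("food101", split="validation[:25%]")  # 25% of the evaluation split

vit_pipe = pipeline(
    "image-classification",
    model=vit_model_id,
    accelerator="bettertransformer",
    device=0,
)

image_evaluator = evaluator("image-classification")
results = image_evaluator.compute(
    model_or_pipeline=vit_pipe,
    data=food,
    metric="accuracy",
    label_mapping=vit_pipe.model.config.label2id,  # map predicted label names to dataset ids
)
print(results)
```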
That's it. That's really what I wanted to show you today. A bit of a shorter video, but I don't think you'll mind. That one line of code is magic. I encourage you to go and try it out. I'll put all the links in the video description as usual. And I'll see you soon with more content, maybe more acceleration. I'm just obsessed with this thing. Until then, keep rocking.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.