Hi everybody, this is Julien from Arcee. In this video, I'm going to show you how to accelerate your transformer training jobs using Optimum Habana. Optimum Habana is an open-source library from Hugging Face that leverages the Habana Gaudi Accelerator from Habana Labs. First, I'll show you how to fire up an EC2 instance on AWS that comes with Habana Gaudi chips. I'll show you the whole process, and then we'll connect to that instance. We'll install Optimum Habana and run some training jobs for natural language processing and computer vision. We'll do this both on a single Habana chip and on multiple chips, and hopefully, we'll see some nice linear scaling. Let's get started.
We'll start from the EC2 console. The first thing to do is click on "Launch Instances." We'll give this instance a name. Next, we need to pick an AMI. To keep things simple, I really encourage you to use the AMI that Habana has built. It comes pre-installed with their SDK, SynapseAI, which means we can pull a pre-built container and run everything without setting up complex SDKs and dependencies. This is really the best way. Let's select this one. If you're watching this later, the version may be different, but at the time of recording, this is the latest one.
Next, we select the instance type. We don't want the default t2.micro. The one we want is DL1, the Habana Gaudi instance family, which comes in a single size (dl1.24xlarge), so no problem there. We will definitely SSH to this, so let me grab my key pair. If you don't have a key pair, you need to create one. For network settings, we can create a new security group allowing SSH; that's the only thing we'll need. For storage, 50 gigs should be enough, but let's take a little more because we're going to download models and datasets, and we don't want to run out of space. We could keep it at that, but let's try to grab a Spot Instance to get a better discount. We won't really need it here, but I'll also assign a role to the instance in case you want to pull data from S3 or access other AWS services. That's about it. Let's launch the instance.
After a minute or two, I can see my instance is running, which means I was able to grab a Spot Instance. The spot request has been fulfilled, and if you look at the savings, we're getting 70% off: instead of paying $13.11 per hour, I'm only paying $3.93. That's really cool. I love Spot Instances, and any chance you get, you should try to use them. Now let's connect to this instance. Here it is. Let's grab its DNS name, and we should be able to SSH in with my key. This is an Ubuntu instance, so we're using the ubuntu username. Here we are. We're connected.
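For reference, the connection step is just a standard SSH command; the key file and DNS name below are placeholders for your own values:

```bash
# Connect to the DL1 instance as the "ubuntu" user
# (replace the key pair file and public DNS name with your own)
ssh -i ~/.ssh/my-key-pair.pem ubuntu@ec2-12-34-56-78.compute-1.amazonaws.com
```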
We can check with hl-smi, which is the Habana equivalent of nvidia-smi. We should see our accelerators, and for now they're idle: 1, 2, 3, 4, 5, 6, 7, 8. We'll keep them busy in a minute. The next step is to run the Docker command that pulls the pre-built container so we don't have to mess with anything. I'll copy-paste this, and I'll put all these commands in the video description. It's a docker run with some technical parameters, pulling the container for SynapseAI 1.6.1 and PyTorch 1.12. Let's run this. It's going to take a minute, so I'll be right back.
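For reference, these two commands look roughly like this. The container image tag is the one matching SynapseAI 1.6.1 and PyTorch 1.12 at the time of recording; double-check it against Habana's documentation for your AMI version:

```bash
# Check that the eight Gaudi accelerators are visible (Habana's equivalent of nvidia-smi)
hl-smi

# Pull and start the pre-built PyTorch container from Habana's registry
# (adjust the tag to match the SynapseAI / PyTorch versions on your AMI)
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.6.1/ubuntu20.04/habanalabs/pytorch-installer-1.12.0:latest
```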
We've downloaded and started the container, and now we're inside it. The next thing to do is install Optimum Habana. You can find it on GitHub, and I encourage you to check out the repository for examples and documentation. Let's clone the repo quickly and install Optimum Habana. I'm installing from source, but you could simply pip install it. I like to be on the bleeding edge, so hopefully nothing blows up. Now we have Optimum Habana installed. The next step is to bring my own code. We'll start with an NLP example: training a text classification model on Amazon reviews. The goal is to classify product reviews according to their star rating, from one to five. I have a whole bunch of content on this use case, including a full workshop; I'll put the links in the video description. For now, we'll keep it short and clone the repo to bring in some code. Go inside our code directory. The last step is to install a few requirements, including the Transformers library. I like to show all the steps to make sure you can reproduce everything. We're good to go.
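The installation steps boil down to something like this (the extra packages at the end are the typical requirements for the text classification example; your own requirements file may differ):

```bash
# Install Optimum Habana from source for the latest code...
git clone https://github.com/huggingface/optimum-habana.git
pip install ./optimum-habana

# ...or install the released package instead
pip install "optimum[habana]"

# Requirements for the NLP example: Transformers, Datasets, and metrics support
pip install transformers datasets evaluate scikit-learn
```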
Let me show you what we're going to do. First, let's look at the vanilla code. This video is about accelerating things, but before we worry about acceleration, we should worry about training. The vanilla code I'm using is very simple. I'm fine-tuning DistilBERT, which is a good starting point, for one epoch with five labels for the product reviews (one to five stars). The dataset is on the Hugging Face Hub: a processed version of the Amazon shoe review dataset. It's a subset, and I've removed the columns I don't need. The labels go from zero to four, giving me five classes. The dataset is loaded, and I have a compute_metrics function for detailed metrics. The model and tokenizer are downloaded, the dataset is tokenized, the training arguments are set, and the trainer is configured with the model, tokenizer, arguments, metrics function, and datasets. Then I call train and evaluate.
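Here is a minimal sketch of what that vanilla script looks like. The dataset ID, column names, and hyperparameters are placeholders rather than the exact values from the video:

```python
# Vanilla fine-tuning of DistilBERT on a 5-class review dataset with the Trainer API.
# Assumptions: a Hub dataset with "text" and "labels" columns and train/test splits.
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "distilbert-base-uncased"
dataset = load_dataset("my-org/amazon-shoe-reviews")   # hypothetical dataset ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=5)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
trainer.evaluate()
```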
If I run this, it will run the standard transformer code. But we want to run this much faster using Habana Gaudi. I encourage you to read the Habana Gaudi website for more information on the chip, SDK, and why it's fast. In the documentation, you'll find information about built-in containers and how to install the SDK yourself if you don't want to use the container. Now, let's go back to the terminal. The main question is, what do I need to do to run this on Habana Gaudi? How much rewriting is needed? As you'll see, it's minimal. We need to change a few lines of code in our script.
Let's compare the vanilla training script and the actual training script for Habana. The diff shows we're using two objects from Optimum Habana: GaudiTrainer and GaudiTrainingArguments. We also need a configuration for the model, provided by Habana on the Hugging Face Hub. We replace the training arguments with GaudiTrainingArguments and the trainer with GaudiTrainer, and that's it (the sketch after this paragraph shows exactly what changes). It took me 10 minutes on the first try, but it should take you two. Once we have that, we can run the script like any Python script. It will download the model and dataset, grab one of the eight Gaudi chips, and launch the training job. We'll see how long it takes. It's starting and will run for about six minutes. I'll pause the video and be back.
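Here's roughly what the Gaudi version of the same script looks like. The Gaudi config ID is assumed (browse the Habana organization on the Hub for the one matching your model family), and the exact place where the config is passed has changed across Optimum Habana versions, so check the repository README for the current form:

```python
# Same script as before; only the Gaudi-specific pieces change.
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# Gaudi configuration published by Habana on the Hugging Face Hub (assumed ID)
gaudi_config = GaudiConfig.from_pretrained("Habana/distilbert-base-uncased")

training_args = GaudiTrainingArguments(     # replaces TrainingArguments
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    use_habana=True,        # run on the Gaudi HPU
    use_lazy_mode=True,     # lazy-mode execution, recommended by Habana
)

trainer = GaudiTrainer(                     # replaces Trainer
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
trainer.evaluate()
```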
The total training runtime was 536 seconds, about 8 minutes and 56 seconds. We see the loss, the memory usage on the Habana chip, and some metrics in the log. We only ran for a single epoch, so the metrics might not be super relevant, but the F1 score, accuracy, and other metrics are shown. The model is saved, and I could push it back to the hub: Optimum Habana is fully integrated with the Transformers library and the hub, so I can push to the hub just like with the Trainer API. Now, let's run the same job on all eight Habana Gaudi accelerators; we only used one so far. We'll use a ready-made script at the root of the Optimum Habana repo called `gaudi_spawn.py`. We run it, give it the number of accelerators (eight), and decide whether to use MPI or DeepSpeed for distributed training. I'll stick with MPI for now. We use the same training script; no code changes are needed. Let's run this again. It will grab the eight chips, use the same dataset and model, and fire up the job in a few seconds.
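The launch command looks roughly like this; `train.py` stands for the training script above, and `gaudi_spawn.py` sits at the root of the Optimum Habana repository:

```bash
# Launch the same training script on all 8 Gaudi accelerators, distributed with MPI
# (pass --use_deepspeed instead of --use_mpi to go through DeepSpeed)
python optimum-habana/gaudi_spawn.py --world_size 8 --use_mpi train.py
```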
After a short warm-up, the job completes in about 1 minute and 30 seconds. This is a tiny dataset with 90,000 product reviews, and there's a startup time that doesn't get accelerated. On a larger dataset, we get near-perfect linear scaling. Here, we're not getting exactly 8x faster because of that startup time, but we're getting at least 6x faster. It's impressive that we only need to change a couple of objects in the vanilla training code to accelerate it with Habana, and we don't need to change anything to distribute the job from one chip to eight chips. For large enough jobs, you get near-perfect linear scaling. I'll run bigger experiments in future videos, but out of the box, it's already very good.
Now, let's move to computer vision. We won't even have to write a single line of code this time; I'll use one of the scripts in the Optimum Habana repo. Let me reset my terminal and show you. In the examples directory, we see a whole bunch of examples. You can run them out of the box or customize them with your own models and datasets. I'm still inside the container and have moved to the image classification example. Let's make sure the requirements are installed. Good. Now I can run the example. The long command line (shown below) includes the script, the model name (the Google Vision Transformer, ViT), and the Vision Transformer config from the Habana organization. We're passing the Food-101 dataset, which has about 101,000 images covering 101 classes of food, and we'll train for five epochs. Let's run this. We start with single-chip training and will then try training on the eight chips. The training job will run for about 30 minutes, so I'll be back in half an hour.
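The command is along these lines; the hyperparameter values are illustrative, so check the example's README for the exact flags:

```bash
cd optimum-habana/examples/image-classification
pip install -r requirements.txt

# Single-chip fine-tuning of ViT on Food-101 (illustrative hyperparameters)
python run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name food101 \
    --do_train --do_eval \
    --num_train_epochs 5 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --remove_unused_columns False \
    --use_habana --use_lazy_mode \
    --gaudi_config_name Habana/vit \
    --output_dir /tmp/vit-food101
```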
The training job completed in about 38 minutes with an accuracy of 86%, which is a good start. Now, let's run the same thing on eight Gaudi chips. We'll use the gaudi_spawn.py script again, set the world size to eight, and still use MPI, with the same model and config (the full command is below). This job trained in 8 minutes and 25 seconds, roughly four and a half times faster than the single-accelerator job. For longer training jobs, you'll get closer to linear scaling, but this is already a very respectable speedup.
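Only the launcher changes; the script and its arguments stay exactly the same:

```bash
# Same example, distributed over the 8 Gaudi chips with MPI via gaudi_spawn.py
# (run from the examples/image-classification directory, hence the relative path)
python ../../gaudi_spawn.py --world_size 8 --use_mpi \
    run_image_classification.py \
    --model_name_or_path google/vit-base-patch16-224-in21k \
    --dataset_name food101 \
    --do_train --do_eval \
    --num_train_epochs 5 \
    --per_device_train_batch_size 64 \
    --remove_unused_columns False \
    --use_habana --use_lazy_mode \
    --gaudi_config_name Habana/vit \
    --output_dir /tmp/vit-food101-8x
```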
The last thing I want to show you is how to switch from this model to another one, like the Swin Transformer, which tends to perform better than the Vision Transformer. We can run the same script, passing the ID of the model on the hub and the matching config from the Habana organization (see the command below). We'll run the eight-chip version right away. This one completed in 12 minutes and 27 seconds with an accuracy of over 90%, which is very good. The Vision Transformer trained in about 8 and a half minutes with a lower accuracy of just under 80%. Training for a few more epochs would probably improve that. Here, we only trained for 5 epochs in about 12 minutes and achieved very good accuracy; training for 30 minutes or an hour would certainly yield even better results. Feel free to compare these training times to what you see on GPUs. This is not about benchmarks, but trust me, it's quite a bit faster than GPU training with the same level of accuracy. Do your own benchmarks and come to your own conclusions.
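Concretely, only the model ID and the Gaudi config change compared to the ViT run; the Swin checkpoint below is one possible choice on the Hub:

```bash
# Same eight-chip launch, swapping in a Swin checkpoint and its matching Gaudi config
python ../../gaudi_spawn.py --world_size 8 --use_mpi \
    run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name food101 \
    --do_train --do_eval \
    --num_train_epochs 5 \
    --per_device_train_batch_size 64 \
    --remove_unused_columns False \
    --use_habana --use_lazy_mode \
    --gaudi_config_name Habana/swin \
    --output_dir /tmp/swin-food101-8x
```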
That's what I wanted to cover today. I showed you Optimum Habana and how it uses Habana Gaudi to speed up transformer training. We set up an AWS DL1 instance, looked at an NLP use case with DistilBERT, and used off-the-shelf scripts for computer vision with the Vision Transformer and Swin models on the Food-101 dataset. Feel free to reuse these and train your own jobs on your own datasets. I'll put all the links in the video description. I hope this was fun, and I hope you learned a few things. Until next time, keep rocking.
Tags
Transformer Training, Habana Gaudi, Optimum Habana, AWS EC2 DL1, Natural Language Processing