Accelerate Transformer training with Optimum Graphcore

October 09, 2022
In this video, I show you how to accelerate Transformer training with Optimum Graphcore, an open-source library by Hugging Face that leverages the Graphcore AI processor. First, I walk you through the setup of a Graphcore-enabled notebook on Paperspace. Then, I run a natural language processing job where I adapt existing Transformer training code for Optimum Graphcore, accelerating a BERT model to classify the star rating of Amazon product reviews. We also take a quick look at additional sample notebooks available on Paperspace.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Paperspace: https://www.paperspace.com
- Graphcore IPU: https://www.graphcore.ai/bow-processors
- Graphcore documentation on parallelism: https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html
- Graphcore organization: https://huggingface.co/Graphcore
- Optimum Graphcore repository on GitHub: https://github.com/huggingface/optimum-graphcore
- Amazon reviews classification workshop: https://github.com/juliensimon/huggingface-demos/tree/main/amazon-shoes (Graphcore notebooks are under 'graphcore')
- Amazon reviews dataset (my version): https://huggingface.co/datasets/juliensimon/amazon-shoe-reviews

Interested in hardware acceleration? Check out my other videos:
- AWS Trainium: https://youtu.be/HweP7OYNiIA
- Habana Gaudi: https://youtu.be/56fpEa1Y1F8

Transcript

Hi everybody, this is Julien from Arcee. In this video, I'm going to show you how to accelerate your transformer training jobs with Optimum Graphcore, an open-source library by Hugging Face that leverages the AI processors from Graphcore. We're going to start from a vanilla piece of transformer training code, and I'll show you how to adapt it to run with Optimum Graphcore. We'll then run this code on a Graphcore box hosted in the cloud on Paperspace. I'll show you how to set things up, run the notebook, and see how it goes. Let's get started.

If you're curious about the Graphcore accelerators, you should definitely visit their website, where you'll learn everything about these chips. They're called IPUs and come in different configurations. You can buy them as actual servers and host them in your data center. The one we're going to use is the Pod 16 configuration, but I don't have one of those boxes on my desk. Fortunately, we can use cloud-based IPUs. There are several providers, although not yet the well-known ones like AWS, Azure, or Google; maybe that will come later. One interesting option is Paperspace, which gives us access to an IPU server for up to six hours through a simple notebook interface. The environment is completely set up for us, so we don't need to install SDKs or anything else.

Using Paperspace is very simple. Just go to the website, sign up for free, and log in. Let's do this. We're presented with the opportunity to create a project, which is what we want to do. Let's use that name. The next step is to create a notebook and select this environment, which is already set up with everything we need, including the Graphcore SDK. This gives us access to a Pod 16 for up to six hours. We also have some Hugging Face examples. Let's launch this. After a minute or so, our environment is ready. We can see we have a Pod 16 machine with four IPU processors. It will automatically shut down after six hours, but we can stop the machine whenever we're done. The environment includes the Graphcore SDK, and we have some sample notebooks. There's a quick intro, and I encourage you to read all of them. There's one on BERT, a text classification example, and a speech-to-text example with Wav2Vec2.

Now, let me show you how to adapt existing code. I found a couple of pitfalls while doing this, so I'll show you what they are and how to fix them. Let me upload a couple of notebooks. Great, now let's look at them. The first one is a vanilla notebook where I fine-tune a model on a subset of the Amazon reviews dataset. If you watched the Optimum Habana video I published a couple of days ago, you'll recognize that this is part of a bigger workshop. I've prepared a subset of the Amazon reviews dataset and am fine-tuning a DistilBERT model to predict the star rating of a product review, from one to five. It's a multi-class classification problem starting from product reviews. This is the vanilla notebook; here's a quick look. There's nothing complicated here. The model here is DistilBERT, but you could just as easily try BERT or RoBERTa; the code is completely generic. We're fine-tuning for one epoch, with five labels (one to five stars) and some training parameters. I'm loading the dataset from the Hugging Face Hub, with a training set and a validation set (90K samples for training, 10K for validation). I have a metrics function to display accuracy, F1, precision, and recall for each epoch.
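To make this concrete, here is a minimal sketch of what the vanilla notebook boils down to, including the trainer setup I'll walk through in a moment. The split names ("train"/"test"), the text and label column names, and the exact training parameters are assumptions on my part, so check the dataset card and the workshop repository for the real values; everything else is standard transformers and scikit-learn.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"   # could just as well be bert-base-uncased, roberta-base, ...
num_labels = 5                           # star ratings 1 to 5

# 90K training reviews, 10K validation reviews
dataset = load_dataset("juliensimon/amazon-shoe-reviews")

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # column name assumed; check the dataset card for the actual text column
    return tokenizer(batch["text"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # accuracy, precision, recall and F1 for each evaluation run
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted")
    return {"accuracy": accuracy_score(labels, predictions),
            "precision": precision, "recall": recall, "f1": f1}

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=32,      # fine for a GPU, as discussed below
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],        # split name assumed
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()   # painfully slow on CPU, which is the whole point of what follows
```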
I grab my model and tokenizer, apply the tokenizer to both datasets, define all my training arguments, and then define my trainer object with the model, training arguments, tokenizer, metrics function, and the two datasets. Finally, I call train. Here, I just call train to show that the code works, but training on CPU would take forever, so I interrupted it. This is my starting point. Now, what do we need to do to accelerate it with Optimum Graphcore? I'll show you every single line you'll need to change and discuss the pitfalls.

First, let's look at Optimum Graphcore on GitHub. We have a blog post with details, showing how to set things up on an actual Graphcore server. We also have information on adapting training code from vanilla Transformers to IPU Transformers. Very importantly, we have a list of supported models, such as BART, BERT, and so on, and the tasks they support. Another important thing to look at is the Graphcore organization on the Hugging Face Hub, where you'll find the configuration files for the IPU.

Now, let's look at the updated notebook and highlight every change. First, we need to install Optimum Graphcore. We need to import three objects, IPUConfig, IPUTrainer, and IPUTrainingArguments, which replace the vanilla trainer and training arguments from the transformers library. They have a few extra parameters but nothing complicated. We're using BERT base uncased, with the same number of epochs, labels, and parameters.

Here's a big change and the first pitfall I encountered. The problem relates to batch size. If the batch size is too large, the training job will run out of memory. The Pod 16 has four IPUs, and the default parameter in the model's IPU config file will replicate the model on those four IPUs, implicitly multiplying the batch size by four. The batch size in my original notebook was fine for GPU training, but the multiplication factor blew things up. The actual batch size is your initial batch size multiplied by the gradient accumulation steps and the replication factor. If your batch size is 128, with 128 gradient accumulation steps and a replication factor of 4, you end up with an effective batch size of 65,536 (128 × 128 × 4), which will not fit on the chip. In my case, I reduced the batch size to 2 and kept 128 gradient accumulation steps and a replication factor of 4, resulting in a real batch size of 1,024. If you start with a larger batch size, things blow up quickly. Be careful, and be aware of the replication factor. The error message I got was a bit cryptic, but the Graphcore documentation has a page on batching that explains the replication factor, gradient accumulation, and the other parameters.

The second thing I had to fix was related to the training set size not being a multiple of the batch size. If the last batch has a different size, training can fail. With a batch size of two and 90K samples, I should be fine, but in general, set the drop-last option to True so the last batch is dropped if it has a different size. This saves you from weird PyTorch errors. Next, we load the dataset, build the compute-metrics function, load the model, and tokenize the data. Here's a new cell where we grab the IPU config from the Hugging Face Hub. We replace the training arguments with IPUTrainingArguments, setting the batch size, pod type, and accumulation steps. We pass the IPU config to the IPUTrainer, and we enable caching to save compiled models. Graphcore on Paperspace provides some pre-compiled models, so if you're lucky, your model will already be in there.
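Here is a hedged sketch of the adapted version, reusing the tokenized dataset and compute_metrics function from the vanilla sketch above, and including the train and evaluate calls discussed next. The IPU config repository name and the IPU-specific arguments (pod_type, executable_cache_dir) reflect the 2022-era optimum-graphcore API as I understand it, so double-check them against the current examples in the repository.

```python
# pip install optimum-graphcore   (the Paperspace IPU runtime already ships the Poplar SDK)
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
num_labels = 5

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# IPU-specific configuration published in the Graphcore org on the Hugging Face Hub
# (repository name assumed; browse https://huggingface.co/Graphcore for the matching one)
ipu_config = IPUConfig.from_pretrained("Graphcore/bert-base-ipu")

training_args = IPUTrainingArguments(
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=2,       # keep this small: it gets multiplied below
    gradient_accumulation_steps=128,
    pod_type="pod16",                    # the IPU config's replication factor (4) applies here
    dataloader_drop_last=True,           # drop a smaller final batch to avoid shape errors
    executable_cache_dir="./exe_cache",  # cache compiled executables for later runs
)
# Effective batch size = 2 (micro-batch) x 128 (accumulation) x 4 (replication) = 1,024

trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()      # the first run compiles the model, then training starts
trainer.evaluate()
```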
We call train, and the first step is compiling the model, which takes about seven minutes but is saved for future runs. Training starts, and we see that the total train batch size is 1,024 (2 * 4 * 128). We have 86 steps because we have 90K samples. Training runs in a minute and 57 seconds. We then run evaluation, which requires another compilation, but again only on the first run. Evaluation completes, and we see the metrics. If we call train or evaluate on the same dataset again, we load the compiled model from the cache, saving time. Finally, we save the model. We can push it to the Hub and use it just like any other model. The Graphcore org on the Hugging Face Hub has some fine-tuned models, like BERT base fine-tuned on SQuAD.

To recap: take a look at Optimum Graphcore and its examples; there are additional notebooks and supported models. Check the Graphcore page for details on the chips and where to use them; Paperspace is a convenient way to do this. The modifications are simple API changes, but be aware of the gotchas to save yourself some time. I hope this was fun and you learned a few things. Keep learning and keep rocking.
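For reference, here is a rough sketch of that final save-and-push step. The local folder and Hub repository names are purely illustrative, pushing requires being logged in with huggingface-cli login, and it assumes the trainer's save_model writes a standard transformers checkpoint, which is what the library's examples do.

```python
# Save the fine-tuned model and tokenizer locally as a plain transformers checkpoint
trainer.save_model("./amazon-shoe-reviews-bert-ipu")
tokenizer.save_pretrained("./amazon-shoe-reviews-bert-ipu")

# Optionally reload it as a regular transformers model and push it to the Hugging Face Hub,
# where it can be used just like any other model
from transformers import AutoModelForSequenceClassification
finetuned = AutoModelForSequenceClassification.from_pretrained("./amazon-shoe-reviews-bert-ipu")
finetuned.push_to_hub("amazon-shoe-reviews-bert-ipu")
tokenizer.push_to_hub("amazon-shoe-reviews-bert-ipu")
```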

Tags

Transformer Training, Optimum Graphcore, Graphcore IPUs, Paperspace Cloud, Hugging Face Models