Hi everybody, this is Julien from Arcee. In this video, I'm going to show you how you can accelerate transformer training jobs with AWS Trainium, a new custom chip by AWS that's been specifically built to accelerate machine learning training jobs. Starting from a vanilla transformer training job, I will update it using the Neuron SDK, which is the AWS SDK for the Trainium chip. First, we will run this job on a single neuron core. Then I'll show you how to do distributed training on multiple cores. OK, let's get started.
Before we dive into the code, there are a few pages we should check out. The first one, of course, should be the Trainium product page. If you're interested in the benefits and some of the specs and features, this is a good place to start. There are two different sizes: trn1.2xlarge with a single accelerator, or trn1.32xlarge with 16. Of course, I'm going with the big one. Just be mindful of the price. I'm not paying the bill, but you probably are.
The next page that's very important is the Neuron SDK. Neuron is the SDK for Inferentia and now Trainium as well. Make sure you're actually reading the Trainium documentation, because the two tend to be a little bit mixed together. You'll find the setup instructions, some tutorials, etc. So that's where I started from. I have to say I did find a few gaps here and there, so I came up with my own version of the setup, which, of course, I will share with you. But generally, this is a good place to start. You may also want to check out the samples. The Trainium samples are actually the NeuronX samples. There's not a lot, but I guess it's enough to get started.
What are we going to do in this demo? I'm not going to start from one of those examples, because you know how I feel about pre-cooked examples. They always work. I much prefer to start with my own code, my vanilla code, and then adapt it for hardware acceleration, just like I did in the Habana and Graphcore videos. I'm going to do the same here. So I'm going to start from one of our tutorials, which you'll find here. This is a fine-tuning job where we fine-tune a classification model on the Yelp dataset, so restaurant reviews for a change. Now, Trainium does not support the Hugging Face Trainer API, and that's a shame, by the way. So AWS folks, if you want to fix that, you know where to find me; I think this would be a great benefit to everybody. Because it's not supported, I've decided to go one level down, and we're going to use PyTorch. So we're going to start from a PyTorch fine-tuning job using a model from the hub and a dataset from the hub, and we're going to adapt it for Trainium. If you're curious about the base example, it's that tutorial, and again, you'll find the link.
Where do we go next? Well, I guess we go next to the EC2 console. In the interest of time, I've already set up the instance; I will provide all the instructions, and we'll take a quick look in a second, but it's not fascinating to watch all those apt-get and pip installs. One important note is to make sure you're using the North Virginia or the Oregon region because, at the time of recording, these are the only places where you can find Trainium instances. So I fired up this Trainium instance, set it up, and I'll quickly fly through those steps because I really want to focus on the code modification part.

Let's take a quick look at the setup. Ideally, we would find an AWS AMI completely set up for Trainium with the Neuron SDK, etc. As it turns out, I didn't find one. There is one called Deep Learning AMI Neuron, but I'm guessing it's still missing bits and pieces. Maybe it's for Inferentia, I don't know, but I got some Python import failures and other problems when I tried to run Trainium stuff. So I moved away from this and instead used a base Deep Learning AMI and started with this configuration guide, so a fresh install. We can go and take a quick look. There are a number of steps where you need to do all this stuff. I'm just curious why, from a fresh install, you need to remove stuff that's not supposed to be there. But OK, maybe that's just me, so let's not be too grumpy. Generally, this is where I started from, but I did find a few gaps and a few pitfalls here and there, so I came up with my own version of this, which is this one.

Again, I'm not explaining all those steps; they're not too difficult. Make sure you're using an Amazon Linux 2 AMI; this is the actual AMI that I used. Then you configure your repositories and install some native stuff and some Python stuff. It's straightforward now that you have that list. It should be fail-proof, or at least idiot-proof; the best proof that it's idiot-proof is that it works for me. Make sure you have the right version of PyTorch, and make sure you have these versions of NumPy and Protobuf: if you have a more recent version, the Neuron model compiler actually dies and complains. So I pinned everything. Make sure you have the Neuron tools in the PATH, and the rest is generally pretty straightforward. So I removed some unwanted steps, added some fail-proof checks, and I tried this two or three times from a fresh instance, and it did work every single time. So you should be fine. Then you can clone my code, my demos, which we'll look at in a minute, install some requirements, and then run some stuff.
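If you want to double-check your environment before going any further, a quick sanity check along these lines should do. This is just a minimal sketch; the exact versions to pin come from my setup instructions, not from this snippet.

```python
# Minimal environment sanity check (illustrative; pin the versions from
# the setup instructions, not any particular values shown here).
import torch
import numpy
import google.protobuf

print("PyTorch:", torch.__version__)     # should match the version pinned in the setup
print("NumPy:", numpy.__version__)       # the Neuron compiler complains if this is too recent
print("protobuf:", google.protobuf.__version__)

# torch_xla is the bridge the Neuron SDK plugs into; if this import fails,
# the Neuron packages are not installed correctly.
import torch_xla.core.xla_model as xm
print("XLA device:", xm.xla_device())    # should return an XLA device on a Trainium instance
```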
So that's where we are right now. Why don't we SSH to that instance and run some code? Okay, so I've logged into the instance. There are a few quick setup steps. First, of course, I want to make sure I'm using my virtual environment where I installed all my dependencies, especially since we need that older version of NumPy. We also need to make sure we have the Neuron tools, like neuron-ls. So we see our 16 devices. Each one has two cores, so we can train on up to 32 cores. Not all topologies are supported; we'll get back to that. In that second shell here, I'm going to do the same thing and just set up the PATH, so that I can keep an eye on what's going on with neuron-top, which is the equivalent of the Unix top command.

Now we can move into my code directory. In this directory, we see three training scripts. The first one is the CPU/GPU script, the starting point. This one is the single-neuron-core training script. And this one is the distributed one, where I can run on up to 32 cores. So let's take a very quick look at the base job, so to speak. Load the dataset. We work with the BERT base model, so load the tokenizer and tokenize the dataset. I'll stick to 10,000 reviews to keep the training time reasonably short, but maybe we'll launch the full thing when we do the 32-core training job. Build the data loaders. I'm only using the training set; I haven't added evaluation here because I wanted to keep the code as short as possible. Select a device, which will be CPU here because we don't have a GPU on this machine. Then the PyTorch training loop: looping over epochs and batches, make sure batches are loaded on the proper device, forward the batch through the model, compute the loss, run backpropagation, apply the updates thanks to the optimizer, and then zero the gradients and move on to the next batch. So this is really as simple as it gets when it comes to PyTorch. Again, I would love to use the Trainer API, but it's not supported. Let's not be grumpy. We can just run this to make sure that it works; it's preferable to start from a decent baseline. Looks like this thing is working. We'll just let training start as a sanity check. Of course, we'll interrupt it because it's going to be painfully slow.
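If you'd rather follow along without opening the repo, here's a condensed sketch of what that baseline script does. The model and dataset match the tutorial, but the hyperparameters (batch size, learning rate, number of epochs) and the variable names are illustrative, not necessarily what's in my repo.

```python
# Condensed sketch of the baseline CPU/GPU training script.
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# Stick to 10,000 reviews to keep the training time reasonable
train_dataset = dataset["train"].select(range(10000)).map(tokenize, batched=True)
train_dataset = train_dataset.remove_columns(["text"]).rename_column("label", "labels")
train_dataset.set_format("torch")

train_loader = DataLoader(train_dataset, shuffle=True, batch_size=8)

# CPU here, since this instance has no GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}  # load the batch on the device
        outputs = model(**batch)                             # forward pass
        outputs.loss.backward()                              # backpropagation
        optimizer.step()                                     # apply the gradient updates
        optimizer.zero_grad()                                # reset gradients for the next batch
```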
Now, what do we need to do to adapt it to run on a Trainium core? Let's look at this second example. And this is surprisingly simple, I have to say. The Neuron SDK is plugged into PyTorch through the XLA module, so that's what we're going to be using. We need to import the xla_model module here and set our device to the XLA device, and then the rest is really the same: loading the dataset, tokenizing, etc. Obviously, when we switch to XLA, that's where the model will be loaded on the training chip. And the rest is the same. In the training loop, we need to add a single line, which is this, basically to notify the neuron core of the end of the training step. And that's it. That's really it. So very cool, I have to say. And if you want to save the model, well, here we're not doing distributed training, so we could probably use the PyTorch save, but just to be on the safe side, I'm going to use the XLA save, which will make sure we only save the model on one device, right? Not on each one of the devices we've been training on. But again, we're using a single core here, so it probably doesn't make a difference.

All right, so that's really it. It's very simple. Let's try and run it and see how that goes. When you run a certain model in a certain configuration with a certain number of cores, the first iteration will be slower because it's going to compile the model. But the next ones will be much faster because, of course, we have this model cache. So if we switch to the other window, there's probably not much happening, right? This compilation step is taking place on the CPU, so the neuron cores at this point are not doing anything. It's going to take maybe seven, eight minutes, so let me pause the video. And now we're training. We can see progress here. Let's take a look at neuron-top. And yes, we see the first core of the first chip is busy. So we'll let it run to completion just to get a sense of the time it takes to train on a single core, and then we'll move on to distributed training and start scaling things up, okay? This first training job completed, and I ran it again, and we can see that this time around we're using the cached model. The cache is at /var/tmp/neuron-compile-cache, so if you want to start from a clean slate, you can just wipe out that directory, and of course, you will recompile the model. So I ran it again just to get a sense of how long it takes to train without any compilation, and it's about 15 minutes here.

Okay, so now let's look at how we scale things, do distributed training, and put all 32 cores to work. First, of course, we need to add some imports for PyTorch distributed training. We initialize distributed training here. We grab the number of cores that will collaborate on this job, and this is the world size; it's going to be passed by the Neuron SDK, of course. Tokenization is identical, and then we need to set up data loading. Here the code will work whether we use a single core or multiple cores. If we use a single core, we use the data loader, and there are not a lot of differences. If we have multiple cores, we build a sampler to distribute the training set across the different cores, based, of course, on the number of cores we have here, and then we build the data loader. We load the model and make sure it's on the XLA device. Then the training loop iterates again on epochs and batches on the device loader, and the only difference is that we replace the mark_step call from the single-core example with optimizer_step. You'll find sketches of both versions below.
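To recap, here is a minimal sketch of the single-core version, reusing the dataset and train_loader from the baseline sketch above. Again, the hyperparameters and the output filename are illustrative.

```python
# Single-core Trainium sketch: compared to the baseline, only the device
# selection, the mark_step call, and the save change.
import torch
import torch_xla.core.xla_model as xm                # the PyTorch/XLA bridge used by the Neuron SDK
from transformers import AutoModelForSequenceClassification

device = xm.xla_device()                             # the XLA device instead of "cpu" or "cuda"
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:                       # train_loader from the baseline sketch above
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        xm.mark_step()                               # the one extra line: signal the end of the training step

# xm.save instead of torch.save: on a multi-device setup it saves from one
# device only; with a single core it makes no practical difference.
xm.save(model.state_dict(), "model.pt")
```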
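And here is a similar sketch of the distributed version, under the same assumptions; the script name in the launch comment is hypothetical.

```python
# Distributed sketch across multiple neuron cores. torchrun sets the
# environment variables that torch.distributed reads. Launch with, e.g.:
#   TOKENIZERS_PARALLELISM=false torchrun --nproc_per_node=8 train_distributed.py
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend             # registers the "xla" backend with torch.distributed
from torch_xla.distributed.parallel_loader import MpDeviceLoader
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import AutoModelForSequenceClassification

dist.init_process_group("xla")                       # initialize distributed training
world_size = xm.xrt_world_size()                     # number of cores collaborating on the job

# ... dataset loading and tokenization are identical to the baseline ...

if world_size > 1:
    # distribute the training set: each core gets its own shard
    sampler = DistributedSampler(train_dataset,
                                 num_replicas=world_size,
                                 rank=xm.get_ordinal(),
                                 shuffle=True)
    train_loader = DataLoader(train_dataset, sampler=sampler, batch_size=8)
else:
    train_loader = DataLoader(train_dataset, shuffle=True, batch_size=8)

device = xm.xla_device()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device_loader = MpDeviceLoader(train_loader, device) # feeds batches to the XLA device for us

model.train()
for epoch in range(3):
    for batch in device_loader:                      # batches are already on the device
        outputs = model(**batch)
        outputs.loss.backward()
        xm.optimizer_step(optimizer)                 # replaces mark_step: all-reduce the gradients, then step
        optimizer.zero_grad()

xm.save(model.state_dict(), "model.pt")              # saves from a single worker only
```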
And basically, this optimizer_step call will gather all the gradient updates from the different cores and apply them. And yeah, that's it. Saving the model is the same, although here it really does matter that you use the XLA save, because you only want to save on one device, not on all 32 devices. So a few more imports, mostly standard PyTorch stuff for data loading and data sampling, and this optimizer_step API. Not too difficult.

Okay, running this is quite simple. We're going to launch it with torchrun, which is a standard PyTorch tool for distributed training. But before we do that, we need to disable parallelism in tokenization, because this code, as you saw, tokenizes the dataset, and of course, we don't want to do this as many times as we have cores. It's a little bit silly, and it could actually deadlock and crash. So this export here, this environment variable, guarantees we're only going to do it once. Now we're ready to train. We can fire up our distributed job with torchrun, and we need to select the number of cores that we want, which we can do with --nproc_per_node. Node here means AWS instance. We cannot select a random number of cores; only certain topologies are supported, so we can use 1, 2, 8, or 32 cores. Okay, so let's do eight, and then we'll try 32 and see what kind of speedup we get.

All right, we're training on those eight cores, as we can see. They're happily blinking in neuron-top, and things are moving along quite nicely. As you can imagine, there have been a few compilation steps, so let's let this one complete and then run it again using the cache to see the actual time. So this is the launch without any compilation steps, and you can see we're going pretty fast. Looks like we're going to be at three minutes, maybe a little bit under. So very, very fast. Okay, so let it complete, and then we'll do 32 cores. So I ended up using the full dataset, which is 250K-something reviews, and this is very, very fast. I guess I need to find an even bigger dataset to show you the linear scaling, because the startup time is not fully amortized. So I'll go and do bigger things. But I think that's enough video and fooling around for today.

So this is really what I wanted to show you: how to do the setup, which is a little bit more complicated than it should be, but we managed it, and how to adapt your code from CPU or GPU to a neuron core and then to multiple neuron cores. Okay? And that's it for today. I hope this was fun, I hope you learned a few things, and as always, keep rocking.