Accelerate Transformers on Amazon SageMaker with AWS Trainium and AWS Inferentia

November 28, 2022
In this video, I show you how to use Amazon SageMaker to train a Transformer model with AWS Trainium and compile it for AWS Inferentia. Starting from a BERT model and the Yelp review dataset, I first train a multi-class classification model on an ml.trn1.2xlarge instance. I also show you how to reuse the Neuron SDK model cache from one training job to the next, in order to save time and money on repeated jobs. Then, I compile the trained model for Inferentia with a SageMaker Processing batch job, making it easy to automate such tasks.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Code: https://github.com/juliensimon/huggingface-demos/tree/main/trainium_inferentia_sagemaker
- Training with Trainium on EC2: https://youtu.be/HweP7OYNiIA
- Predicting with Inferentia on EC2: https://youtu.be/un0WsUtGwVA
- SageMaker SDK feature request: https://github.com/aws/sagemaker-python-sdk/issues/3481

Interested in hardware acceleration for Transformers? Check out my other videos:

- Training on Habana Gaudi: https://youtu.be/56fpEa1Y1F8
- Training on Graphcore: https://youtu.be/DgcJscPu1Vo
- Predicting with ONNX: https://youtu.be/_AKFDOnrZz8
- Predicting with Intel OpenVINO: https://youtu.be/mfj1QrZWkk8

Transcript

Hi everybody, this is Julien from Arcee. In previous videos, I showed you how we could accelerate transformer training and inference using two custom chips from AWS, Trainium and Inferentia. To do this, I just ran vanilla Python code on EC2 instances. In this video, I will reuse the same example, but instead of using EC2, I will use SageMaker. First, I will train a model on SageMaker on a Trainium chip, and then I will compile this model for Inferentia. To spice things up, I'll show you two extra things. First, I'll show you how we can reuse the model cache from one training job to the next, so that we don't have to recompile the model for Trainium again and again. This is a nice little hack. Second, I'll show you how we can compile the model for Inferentia with SageMaker Processing, which is a good way to automate those jobs instead of just running them in a notebook or in some other way. Lots of things to cover, so let's get to work. Let's start with the training notebook. The workflow is very similar to any SageMaker setup. First, we upload our dataset to S3. Then we configure an estimator; I'll use a PyTorch estimator here because we're working with a PyTorch model. Then we train the model on a Trainium instance, and then we look at the artifact. The only twist here is that we're going to manage the compilation cache. When we compile a model for Trainium, the Neuron SDK, which is the AWS SDK for Trainium and Inferentia, saves a compiled version of the model to a local folder, so if we train again, we can skip the compilation step. Unfortunately, on SageMaker, this doesn't work out of the box because we're using containers to train: anything saved locally inside the container is not available for the next training job. We'll fix that, and we'll see how. But that's really the only difference. Okay, so let's look at the notebook. First, we install some dependencies and download the Yelp review dataset from the Hugging Face Hub. It has over 600,000 reviews, so that's a big one; we're not going to use all of it, to keep things fast. Then I upload the dataset to S3, which I can do because there's an S3 integration in the datasets library, so that's very convenient. Now my dataset lives in S3, and I can see the training set and the test set. Then I initialize my model cache. What I did here is grab the cache files from a previous run and copy them to an S3 location. Feel free to do the same: just grab the files from a previous Trainium run and copy them to S3. It should look like this: whatever prefix you're using in S3, then a `neuron-compile-cache` folder at the root of the cache, and then the actual cache files from the Neuron SDK. So that's what I have in S3. Then the usual hyperparameters in a Python dictionary: the number of epochs, the batch size, the model we're going to use, the number of labels for the classification job we're working on here (five labels, one star to five stars), and how many samples we want from the dataset. I selected 10,000 just to keep it quick. Then the PyTorch estimator, with the script (we'll look at it in a second), the hyperparameters, and all the infrastructure stuff. Here I'm going to use an ml.trn1.2xlarge instance, which has two NeuronCores. Nothing really complicated here. You just need to make sure you enable Torch Distributed, because that's what we use to train on both cores: set it to true, and that's about it.
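As a rough illustration of that setup (the script name, framework versions, and hyperparameter values are my assumptions, not copied from the actual notebook), the estimator configuration could look something like this:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Hypothetical hyperparameters passed to the training script
hyperparameters = {
    "epochs": 1,
    "batch-size": 8,
    "model-name": "bert-base-uncased",
    "num-labels": 5,        # 1-star to 5-star reviews
    "num-samples": 10000,   # subset of the Yelp dataset, to keep things fast
}

estimator = PyTorch(
    entry_point="train.py",             # assumed script name
    source_dir="scripts",               # assumed source directory
    role=sagemaker.get_execution_role(),
    instance_type="ml.trn1.2xlarge",    # 2 NeuronCores
    instance_count=1,
    framework_version="1.11",           # example version, check Neuron support
    py_version="py38",
    hyperparameters=hyperparameters,
    # Launch training with torchrun so both NeuronCores are used
    distribution={"torch_distributed": {"enabled": True}},
)
```

The key part is the `distribution` setting, which enables Torch Distributed on the trn1 instance.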
You can try to enable Spot Instances. Trainium instances are in pretty high demand, so your mileage may vary; I disabled that here. So that's the estimator. Let's look at the training code. This is very familiar because it's extremely close to the Trainium code that I used in the Trainium video on EC2. The only modifications are really passing the hyperparameters and the different directories using script mode, which you're certainly familiar with if you use SageMaker or if you've watched my videos; I keep talking about script mode. It's basically about defining the interface between the SageMaker training container and your code. In practice, that means command-line arguments for the hyperparameters coming from the Python dictionary, and a couple, or in this case three, environment variables to grab the different directories we're working with: the training set, the cache directory, and the output directory for the model. And that's really it. The rest of the code is pretty much the same: loading the dataset, tokenizing it, setting up the data loader, setting up the training loop, etc. Nothing fancy; you can reuse the code unmodified from the EC2 video. Of course, we need to take care of the cache, and this is how I do it. Very simple: I just copy whatever is in that input cache channel into a local directory in the container. This is the default cache location for the Neuron SDK, `/var/tmp/neuron-compile-cache`, so don't go and change that. And that's about it. As we'll see in the training log, this is enough to populate the cache so that when we start training, we find the compiled version of the model and save quite a bit of time by skipping compilation. At the end of the script, of course, I save the model as a PyTorch checkpoint, along with the tokenizer and the model config, so that I've got the full package. And I also save the model cache again, so it gets copied to the model artifact, the `model.tar.gz` file. Maybe the cache didn't have a compiled version originally and this job just added one, and of course I don't want to lose that. So that's it, that's how you can manage the cache. It would be nice if we had a cleaner way to do this in the SageMaker SDK; I created a feature request, and I'll put the link if you want to plus-one it. Maybe this will happen. But this is really how we can initialize the cache here. Okay, that's it for the training code. Then we launch the training job: we pass the training set and the cache location, and that cache channel ends up being stored in the `SM_CHANNEL_CACHE` environment variable. People think that you can only use `train` and `validation` as channels, but you can pass anything. You can pass any string, and it becomes an environment variable like this. It's a little-known feature, but it's very useful if you want to pass extra stuff to the script, stuff that's not actually a dataset. So did this work? If we look at the log, we can see the two NeuronCores have been found, and we can see that the code immediately finds the cache with the compiled version, so we don't compile anything. You can see we're loading the base model here and training it immediately. In this case, I think it saves literally 10 minutes on this training job. So that's a nice little tweak, I think. After 1,100-something seconds, the job is complete.
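Before moving on, here is a hedged sketch of the script-mode plumbing and cache handling just described; the argument names, channel name, and helper functions are illustrative assumptions, not the actual training script:

```python
import argparse
import os
import shutil

# Default cache location used by the Neuron SDK inside the container
NEURON_CACHE = "/var/tmp/neuron-compile-cache"

def parse_args():
    parser = argparse.ArgumentParser()
    # Hyperparameters passed by the estimator (names are illustrative)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--model-name", type=str, default="bert-base-uncased")
    parser.add_argument("--num-labels", type=int, default=5)
    parser.add_argument("--num-samples", type=int, default=10000)
    # Directories injected by SageMaker through environment variables
    parser.add_argument("--train-dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--cache-dir", type=str, default=os.environ["SM_CHANNEL_CACHE"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    return parser.parse_args()

def prime_neuron_cache(cache_dir):
    # Copy the cache files shipped in the 'cache' channel to the Neuron SDK's
    # default location, so training can skip recompilation
    shutil.copytree(cache_dir, NEURON_CACHE, dirs_exist_ok=True)

def save_neuron_cache(model_dir):
    # Copy the (possibly updated) cache into the model directory so it ends up
    # in model.tar.gz and can be reused by future jobs
    shutil.copytree(NEURON_CACHE, os.path.join(model_dir, "neuron-compile-cache"),
                    dirs_exist_ok=True)
```

With channels named `train` and `cache` passed to the estimator, for example `estimator.fit({"train": train_s3_uri, "cache": cache_s3_uri})`, SageMaker exposes them inside the container as `SM_CHANNEL_TRAIN` and `SM_CHANNEL_CACHE`.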
The model artifact is saved back to S3; we can see it here. And if we copy it to this machine and extract it, we can see that it does have the tokenizer and the checkpoint and the config, etc. And we also have the compiled cache, right? So now you could go and extract that cache from the tar file and put it back in that central location. Why not? All right, so that's the training bit. In this notebook, we saw how we could train with Trainium chips on SageMaker and manage the cache. Now let's move on to the second notebook, where, starting from this trained checkpoint, we're going to compile it for Inferentia. In the EC2 video, I just ran a Python script to compile the model, and I guess that's fine, you can do that. But automation is important, and reproducibility is important, so I figured I should show you a slightly more robust way to do this. There's a really easy way to run batch jobs and any kind of utility jobs on SageMaker, and it's called SageMaker Processing. Again, I've covered this quite a few times in previous videos, but if it's the first time you hear about it, SageMaker Processing is just a way to run batch jobs on SageMaker with built-in containers. And I guess you could add your own containers if you wanted. It supports Scikit-Learn and Spark, and here, obviously, I'm going to use Scikit-Learn. This is what I want. So yeah, import some objects, grab the location of that artifact, the one with the checkpoint that we just looked at. And now we simply define this `SKLearnProcessor` object, which is just a simple way to say: hey, grab me that built-in scikit-learn container. It comes in different versions. Here I'm using 0.20, which is actually the oldest one; I tried the newer ones and ran into all kinds of dependency issues. neuron-cc, which is the compiler in the Neuron SDK, needs to pull in some TensorFlow dependencies and some additional things, and I could not get that to work with the newer versions. I suspect this is actually linked to the underlying Python version, because to the best of my knowledge, the Neuron SDK works with Python 3.6 and 3.7, and I'm pretty sure that's what this container has; I'm not so sure about the other ones. Anyway, it doesn't matter: we're not actually using anything from scikit-learn. We just need an older Python version, and this one works and the newer ones don't. Simple as that. Here, I don't need any particular instance type; a plain CPU instance is fine. We don't need a training instance, we don't need an Inferentia instance, we're just compiling. And that's also the benefit of separating the training, compilation, and deployment steps: you can pick the instance type that works best for each. To train on Trainium, of course, we need a Trainium instance, but to compile for Inferentia, we just need a vanilla instance. That's what I have here. Then we just run the script, and we'll look at that in a second. The inputs are the actual artifact with the checkpoint, some requirements that I need to install, which is convenient to pass as an extra input, and a deployment script that I want to add to the output as well; more on this later. So three inputs, but the most important one is the model artifact. And one output, which is the new model artifact, except this time, instead of the vanilla checkpoint, it will contain a model compiled for Inferentia. Here I want to compile for an inf1.6xlarge instance, which has four Inferentia chips with four NeuronCores each, so 16 cores in total. You need to get this right; otherwise, you'll end up compiling something that doesn't deploy properly.
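For reference, here is a minimal sketch of such a processing job, assuming hypothetical file names (`compile.py`, `requirements.txt`, `inference.py`), an illustrative CPU instance type, and a placeholder S3 URI for the trained artifact:

```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Placeholder: S3 URI of the model.tar.gz produced by the training job
model_artifact_s3_uri = "s3://my-bucket/my-training-job/output/model.tar.gz"

# Built-in scikit-learn container; an older version is used here because
# neuron-cc needs an older Python than the newer containers ship with
processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=sagemaker.get_execution_role(),
    instance_type="ml.c5.2xlarge",   # a plain CPU instance is enough to compile
    instance_count=1,
)

processor.run(
    code="compile.py",  # assumed compilation script
    inputs=[
        ProcessingInput(source=model_artifact_s3_uri,   # trained checkpoint
                        destination="/opt/ml/processing/model"),
        ProcessingInput(source="requirements.txt",      # Neuron SDK dependencies
                        destination="/opt/ml/processing/requirements"),
        ProcessingInput(source="inference.py",          # script packaged for deployment
                        destination="/opt/ml/processing/deploy"),
    ],
    outputs=[
        # The script writes the compiled model.tar.gz here
        ProcessingOutput(source="/opt/ml/processing/output"),
    ],
)
```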
Fine, let's look at the compile code, which again is extremely similar to the EC2 video. The only difference, of course, is that I have to manage those inputs and outputs. I install some dependencies and then again grab the arguments. I added extra parameters in case you want to change the model, just for extra flexibility, but this is the only one I need here. Then I grab the input model artifact and extract it to get to the checkpoint file, and I load that checkpoint into a Hugging Face model: the tokenizer, the config (making sure I have the right number of labels for my classification problem), then a sequence classification model built from that config, into which I load the checkpoint, and voilà. Then I need to define a sample input, in this case a positive review: tokenize it, and transform the tokenized input into a tuple of tensors, because that's what the TorchScript tracer expects. If I pass the tokenizer output directly, it complains that the format isn't right. And then my favorite line: one line of code to compile the original model into an Inferentia model, passing the model, the sample input, and the number of cores I want to compile for. The `strict=False` argument is discussed in the previous video; it just tells the tracer not to complain that the model output is a dictionary. You can look it up, it's not too important for now. Once it's compiled, I save everything back: the tokenizer, the config, and this time the compiled model, whose name I change to avoid confusion. And then I create a tar file with all of this, adding those three things we just saved plus the inference script. The reason I'm doing this is that when you deploy those models on SageMaker, the way the model is loaded is a little different, so you need an inference script inside the model artifact. So here I'm preparing for the fact that I'm going to deploy for Inferentia on SageMaker. If you want to deploy on another service or build your own containers, etc., you don't need this step. But for SageMaker, we need this inference script, and that's the reason why. So yeah, here we go. Once all of this has run, we can see in the log that the model was loaded, compiled, and saved correctly. Looking at the output of this processing job, I can grab the location of the new model artifact, copy it, extract it, and I can see I've got all the model files, my inference script, and potential requirements for that inference script. Again, those things are only required if you're going to deploy to SageMaker. So that's pretty much what I wanted to show you. We'll look at deployment later; I'm still working on it, to be honest. But here we have a couple of really nice notebooks where you can easily train on Trainium and compile for Inferentia in a pretty automated way. And I guess the next step would be to build a SageMaker pipeline from those two jobs. This is left as an exercise to the reader. I've done this already a million times, it's not difficult. Go and watch the pipelines video, and you can very quickly assemble those two steps, the training step and the processing step, into a pipeline if that's useful to you.
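To recap the compile step in code, here is a hedged sketch; the checkpoint file name, the sample review, the sequence length, and the exact tracing arguments are assumptions and may differ from the actual script:

```python
import torch
import torch_neuron  # part of the Neuron SDK; registers torch.neuron
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"   # assumed base model
num_labels = 5                     # 1-star to 5-star reviews
num_cores = 16                     # inf1.6xlarge: 4 Inferentia chips x 4 NeuronCores

# Rebuild the model and load the trained checkpoint (assumed file name)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_config(config)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Sample input: tokenize a positive review and turn it into a tuple of tensors,
# which is what the TorchScript tracer expects
sample = tokenizer("Great food, great service!", return_tensors="pt",
                   padding="max_length", max_length=128, truncation=True)
example_inputs = (sample["input_ids"], sample["attention_mask"])

# One call to compile for Inferentia; strict=False tolerates dict-style outputs
neuron_model = torch.neuron.trace(
    model,
    example_inputs,
    compiler_args=["--neuroncore-pipeline-cores", str(num_cores)],
    strict=False,
)
neuron_model.save("model_neuron.pt")
```

The compiled `model_neuron.pt`, the tokenizer, the config, and the inference script would then be packaged into a new `model.tar.gz` and written to the processing output location.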
All right, well, that's it for today. Hope this was useful. re:Invent is a few hours away, and I'm sure there's going to be some fun stuff coming up. I'm sure I will be back very, very soon with more videos. Until then, keep rocking.

Tags

AWS, SageMaker, Trainium, Inferentia, Model Compilation, SageMaker Processing

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.