AWS re:Invent 2021: A first look at SageMaker Training Compiler

December 02, 2021
In this new video, I demo the newly launched Amazon SageMaker Training Compiler, a SageMaker feature that can accelerate the training of deep learning (DL) models by up to 50% through more efficient use of GPU instances. Starting from a couple of sample notebooks based on Hugging Face models (BERT and GPT-2), I train both vanilla and compiled jobs, and I compare their performance.

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

Code: https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-training-compiler/huggingface

More Hugging Face on SageMaker notebooks: https://github.com/huggingface/notebooks/tree/master/sagemaker

New to Transformers? Check out the Hugging Face course at https://huggingface.co/course

Transcript

Hi everybody, this is Julien from Arcee. In this video, I would like to show you the newly launched SageMaker Training Compiler. This is a new capability for SageMaker that was launched yesterday at AWS re:Invent. In a nutshell, it makes it pretty easy to speed up training jobs for large deep learning models. It's of particular interest to me and to the community because the initial models supported by the training compiler are Hugging Face models. So, of course, that's what we're going to use. Let's get started.

Before we jump into the code, just a few words about the training compiler. This is all about training, not inference. The model optimization steps that the compiler applies will shorten the training time and won't do anything for the prediction time. If you need to optimize prediction time, you could look at SageMaker Neo, Inferentia, or the Optimum library from Hugging Face, which is open source. So, training only. Which models are supported right now? As you can see in the What's New post, the training compiler currently supports Hugging Face models, meaning the most popular model architectures like BERT, DistilBERT, GPT-2, RoBERTa, etc. That's what we're going to look at now. We're going to run a training job on a single GPU, a vanilla version first, then an optimized version, and we'll compare the training times. Then we'll quickly do the same for a multi-GPU job.

Here I'm using sample notebooks located in the Amazon SageMaker examples repository on GitHub, so you can go and run them as well. The first one is a PyTorch single-GPU example. In this example, we fine-tune a BERT model pulled from the Hugging Face hub on the SST-2 dataset. I won't cover the Hugging Face-specific bits; you can find those in other videos. I'll walk you through the high-level workflow and focus on the compiler configuration and the results.

First, we need to install some packages. There's some setup here to install what I guess are early versions of the libraries. By the time you watch this, the Boto3 and SageMaker SDK versions will probably be up to date, so you could do away with that, but for now, let's just run those cells. We import SageMaker and the Transformers library, and grab the SageMaker bucket and the SageMaker role. Next, we download the SST-2 dataset using the Datasets library; it's actually already hosted in S3, so we grab it from there. We do a little bit of processing: splitting, dropping, and encoding some columns, and so on. Again, you can go through those steps quickly; there's nothing complicated here. Once that's done, I download the tokenizer for the BERT model, tokenize the dataset with it, and upload the processed training and test sets to S3. I'll train the model on this tokenized data. Notice that we're again using the Datasets library here, which can upload data directly to S3, making things pretty simple.

Now we can move on to our training code. Let's take a quick look at the training script. As you will see, it's a vanilla Hugging Face script. We use script mode to receive command-line arguments, define our training arguments, define our trainer object, grab the model and tokenizer from the hub according to the name passed as a command-line argument, and then train, evaluate, and save the model. This is really a vanilla script; we don't need to modify anything here. There's nothing in it about the training compiler, which is fine: it means we can take our existing code as-is and simply configure the compiler.
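For reference, here is a minimal sketch of what such a vanilla Hugging Face training script looks like under SageMaker script mode. The argument names, the channel names ("train" and "test"), and the default checkpoint are illustrative assumptions, not the exact contents of the script in the sample repository:

```python
# train.py -- minimal sketch of a vanilla Hugging Face fine-tuning script for
# SageMaker script mode (argument and channel names are assumptions).
import argparse
import os

from datasets import load_from_disk
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed by the SageMaker estimator as command-line arguments.
    parser.add_argument("--model_name", type=str, default="bert-base-cased")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    # SageMaker copies each S3 input channel to a local path inside the container.
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--output_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    args, _ = parser.parse_known_args()

    # Load the tokenized datasets prepared in the notebook and uploaded to S3.
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    # Grab the model and tokenizer from the hub, based on the command-line argument.
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)

    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )

    # Train, evaluate, and save the model -- nothing compiler-specific here.
    trainer.train()
    trainer.evaluate()
    trainer.save_model(args.output_dir)
```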
So let's go back to the notebook. First, we're going to launch a vanilla job with no compilation to give us a baseline. Just a note: maybe I wasn't lucky, but using the default setting in this notebook, which has a batch size of 14, I did get an out-of-memory CUDA error. So I changed it to 8, and now it's fine. I created a PR for this. Again, you can try 14, but for me, it didn't work. Next, I configure the Hugging Face estimator. We've seen this quite a few times: passing the script, training on a single instance with a single GPU, setting the Transformers version (which, at the time of recording, is the only one supported), the hyperparameters, and then launching training. We set wait to false so that we can keep running cells in the notebook. That fires up the vanilla job, giving us the baseline.

Now let's do exactly the same, but with the training compiler. One thing the training compiler does is optimize the model, and in most cases it will actually shrink its memory footprint. This means we're training with a smaller model on the GPU, so we have more available GPU memory and can use a larger batch size. That's one of the added benefits of the training compiler: it optimizes the model and, in most cases, frees memory so that you can set a larger batch size and accelerate training. The obvious question is: how do you figure out what that value should be? The most scientific solution, at least the one I use, is trial and error. You can check GPU memory usage in your training metrics, in CloudWatch for example, and estimate how much headroom you have to increase the batch size. Here, that process has already been done, and we know 24 should work. Expect to go through a little bit of trial and error to find the value that maximizes GPU memory usage.

Then we define another estimator, and you can see it's exactly the same except for the training compiler config. Looking at the SDK documentation, there aren't too many options for that configuration: you can enable the compiler and enable additional debugging information. The default is false, so we just enable it here, and the rest is the same. We call fit again, and off it goes.

Now we need to wait for those two jobs to complete. Once they have completed, we can grab the logs for both. There's a bit of code here to do that and extract the relevant information; I'll skip it, but you can read it yourself or look at the job information in the SageMaker console or in SageMaker Studio. We'll follow along in the notebook here. The vanilla (native) job trained for 6,276 seconds at 53 samples per second, and the compiled job trained for 3,626 seconds at about 92-93 samples per second. That's a pretty nice speedup. We trained for five epochs because we want long jobs: if you have very short training jobs, the overhead of model compilation likely won't yield a faster training job, so you need jobs that run for a little while. Some of my colleagues ran detailed benchmarks, and they tell me about 30 minutes is the sweet spot: if your jobs are longer than 30 minutes, you will offset the compilation time, but if your jobs are very short, you probably won't save any time with compilation. Here it worked very well: throughput is 73% higher with the training compiler, and the total training time is 40% shorter. That's quite nice. Of course, it doesn't mean anything if the model isn't just as accurate. There's an additional plot where we compare the training loss for the native job and the compiled job. It's only five epochs, but they're pretty close.
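To make the comparison concrete, here is a sketch of the two estimators using the SageMaker Python SDK. The instance type, framework versions, hyperparameter names, and the variables defined earlier in the notebook (role, train_input_path, test_input_path) are assumptions, not the exact notebook values:

```python
# Sketch: baseline vs. compiled training job (instance type, versions, and
# hyperparameter names below are assumptions).
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

hyperparameters = {"model_name": "bert-base-cased", "epochs": 5, "train_batch_size": 8}

# Baseline: vanilla Hugging Face estimator, one instance with a single GPU.
native_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",        # single-GPU instance (assumption)
    instance_count=1,
    role=role,                            # SageMaker execution role from the notebook
    transformers_version="4.11.0",        # the only version supported at recording time
    pytorch_version="1.9.0",
    py_version="py38",
    hyperparameters=hyperparameters,
)
native_estimator.fit({"train": train_input_path, "test": test_input_path}, wait=False)

# Compiled: identical, except for compiler_config and a larger batch size,
# which the memory freed by compilation makes possible.
compiled_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.11.0",
    pytorch_version="1.9.0",
    py_version="py38",
    hyperparameters={**hyperparameters, "train_batch_size": 24},
    # TrainingCompilerConfig only exposes two options:
    # enabled (default True) and debug (default False).
    compiler_config=TrainingCompilerConfig(),
)
compiled_estimator.fit({"train": train_input_path, "test": test_input_path}, wait=False)
```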
The minimal loss of accuracy here could be worth the very nice speedup, and you could probably tweak things to get near-identical results. That example worked pretty well, with a very nice speedup of 50% plus and very good convergence. Not bad.

Let's take a look at the other example. This one is about language modeling: fine-tuning GPT-2 on the SST-2 dataset using a single multi-GPU instance. It starts the same way: installing lots of dependencies, setting up our bucket, and setting up the training job. We run a native job with PyTorch, with a batch size of 8, on an ml.p3.8xlarge instance, which has four GPUs. We do exactly the same for the optimized, compiled version, except here we managed to fit 22 samples in each batch, which is a very nice increase (a rough configuration sketch follows the transcript). We fire up both jobs and wait for them to complete; this took about 20 minutes. The jobs are quite a bit shorter, thanks to the multiple GPUs. Extracting the logs, we can see the throughput is 36% higher, which is quite nice. Imagine running this for 1,000 or 2,000 epochs. At the end of the day, or probably at the end of several days, you would save quite a bit of time if you just went 36% faster on throughput. The longer the job, the larger the absolute time benefit.

On convergence, this one was not as impressive. There is a bit of a difference in loss and accuracy between the two jobs. I'm not sure why that is. Maybe it's bad luck, or maybe I need to run it again. Hopefully, you'll get better results. I'm sure this will improve as AWS keeps tweaking the service, but for now, there's a little bit of a discrepancy here.

In any case, I think it's an interesting capability. I really like the fact that we don't have to change anything in the code. Just like for SageMaker Debugger, all we have to do is add this parameter that doesn't even need any settings: I want to enable this, and it all works under the hood. If you're training large transformer models of the supported types, it's definitely worth trying this to save time, while making sure the models converge in the same way. At least on single-instance training, they should. We'll see about distributed training.

So there you go. Nice launch. We at Hugging Face are very happy to be part of this and to bring you an optimization feature for your training jobs. I'll see you soon with more videos; there are a few more things I want to test. Until then, have fun, keep learning. Bye-bye.
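For reference, here is a rough sketch of how the compiled multi-GPU job could be configured. The hyperparameter values are placeholders, and the pytorchxla distribution option comes from current SageMaker SDK documentation; the notebook at the time of recording may enable distributed compilation differently:

```python
# Sketch: compiled job on a single multi-GPU instance (values are placeholders).
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

compiled_multi_gpu_estimator = HuggingFace(
    entry_point="train.py",                    # causal language modeling script for GPT-2
    instance_type="ml.p3.8xlarge",             # one instance with four GPUs
    instance_count=1,
    role=role,
    transformers_version="4.11.0",
    pytorch_version="1.9.0",
    py_version="py38",
    hyperparameters={"model_name": "gpt2", "train_batch_size": 22},  # vs. 8 for the native job
    compiler_config=TrainingCompilerConfig(),
    # In recent SDK versions, distributed training with the compiler is enabled like this:
    distribution={"pytorchxla": {"enabled": True}},
)
compiled_multi_gpu_estimator.fit({"train": train_input_path}, wait=False)
```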

Tags

SageMaker Training Compiler, Hugging Face Models, AWS re:Invent, Deep Learning Optimization, BERT, GPT-2, RoBERTa

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.