Hi, everybody. This is Julien from Arcee. Yesterday, I posted a video where I showed you how to very easily train a transformer model on Trainium, a custom training accelerator designed by AWS, thanks to our new Hugging Face Amazon Machine Image (AMI). We trained BERT-base on a subset of the Yelp review dataset, and I promised I would scale things up. So that's what we're going to do today. We're going to train BERT-large at maximum sequence length on the full dataset on Trainium. And because everyone loves a benchmark, I'm going to do the same on the largest G5 instance, which comes with eight A10G GPUs. Let's get to work.
First, I launched a trn1.32xlarge instance, which is the largest size. This is the same instance type I used yesterday. Same setup, same Hugging Face Neuron AMI available on the AWS Marketplace. That's in the Oregon region. I also worked in the North Virginia region, because I couldn't get a large G5 instance in Oregon; it looks like we don't have the quota for it. So in us-east-1, I launched a g5.48xlarge instance with the AWS Deep Learning AMI. This one comes with eight A10G GPUs. No particular setup here, just launch the instances with the appropriate AMIs. Then, of course, we can connect to the instances. On the left, we have the Trainium instance, as we can see. And on the right, we have the GPU instance, the G5 instance. We can see those eight GPUs doing nothing right now, just like the Neuron cores aren't doing much either. But we're going to fix that.
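If you'd rather script the launch than click through the console, here is a minimal sketch using boto3. The AMI IDs, key pair name, and function are placeholders, not values from the video: look up the Hugging Face Neuron AMI and the AWS Deep Learning AMI for your own region and substitute your own settings.

```python
# Minimal sketch: launch the two benchmark instances with boto3.
# AMI IDs and key pair are placeholders to replace with your own.
import boto3

def launch(region, ami_id, instance_type):
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        KeyName="my-key-pair",  # placeholder key pair
    )
    return response["Instances"][0]["InstanceId"]

# Trainium instance in Oregon, GPU instance in North Virginia.
launch("us-west-2", "ami-xxxxxxxx", "trn1.32xlarge")  # Hugging Face Neuron AMI
launch("us-east-1", "ami-yyyyyyyy", "g5.48xlarge")    # AWS Deep Learning AMI
```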
On the Trainium side, I'm using the same code as in the previous video, which is basically a vanilla PyTorch training loop for distributed training. In the previous video, I used BERT-base and 10,000 samples from the Yelp review dataset, which is a text classification dataset with five labels, one star to five stars. Now we're bumping this to BERT-large and using the full training set, as you can see here. Although it's still called small in the code, it's the full training set. I'm also bumping the sequence length to 512, which is the maximum that BERT-large can handle. A lot of those reviews are pretty long, so I want to take all that text into account. The rest is unchanged. So, in short, we're fine-tuning BERT-large on the full review dataset at max sequence length. You can't go much bigger with this setup.
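To make this concrete, here is a minimal sketch of that kind of distributed training loop, written with torch-xla / torch-neuronx as installed on the Hugging Face Neuron AMI. It is not the exact script from the video; the model name, hyperparameters, and launch command are illustrative.

```python
# Sketch of a vanilla PyTorch training loop for Trainium (torch-xla backend).
# Launch with something like: torchrun --nproc_per_node=32 train.py
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_backend  # registers the "xla" distributed backend
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

dist.init_process_group("xla")
device = xm.xla_device()

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=5
).to(device)

def tokenize(batch):
    # Pad to a fixed length so XLA sees static shapes and compiles only once.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

dataset = load_dataset("yelp_review_full", split="train").map(tokenize, batched=True)
dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal(), shuffle=True
)
loader = pl.MpDeviceLoader(
    torch.utils.data.DataLoader(dataset, batch_size=4, sampler=sampler), device
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for batch in loader:
    optimizer.zero_grad()
    loss = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"],
    ).loss
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduce gradients across cores, then step
```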
On the GPU side, I'm actually using the text classification example that comes with the Transformers library. So I'll just clone the repo and run this example here: BERT-large uncased, yelp_review_full, max sequence length 512. Batch size is 12, which is the largest I could fit. I tried different values but got out-of-memory errors when trying to go higher, and even using gradient accumulation steps didn't solve the problem. If there's a way to fit more, it's above my pay grade. As you can see, I'm going to train in BF16. I could do FP16 as well, but as this is a more recent GPU, it supports the BF16 data format. On the Trainium side, when the model is compiled, the Neuron compiler enables all the BF16 acceleration and optimization features by default, which optimizes for speed. If you see accuracy degradation, there's a compiler option called fast math; you can look it up in the Neuron SDK documentation, where you can selectively enable or disable some optimizations to find the right balance between speed and accuracy. But here we want to go at full speed, so I'll leave everything on by default and go ahead.
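For reference, here is a rough Python equivalent of that GPU run using the Trainer API instead of the example script from the repo. The hyperparameters mirror the ones mentioned above; the output directory is a placeholder, and this is a sketch rather than the exact command used in the video.

```python
# Rough Trainer-based equivalent of the GPU-side run.
# Launch across the 8 GPUs with: torchrun --nproc_per_node=8 train_gpu.py
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=5)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

train_set = load_dataset("yelp_review_full", split="train").map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-large-yelp",    # placeholder output directory
    per_device_train_batch_size=12,  # largest batch that fit in A10G memory here
    num_train_epochs=1,
    bf16=True,                       # A10G supports the bfloat16 data format
    learning_rate=3e-5,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=train_set, tokenizer=tokenizer).train()
```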
I ran it before, so we shouldn't see any model compilation happening, fingers crossed. Let's run this as well. Just double-checking the command line. Yeah, it looks okay to me. Here we go. I'm only training for one epoch because I'm really just interested in the time it takes to run one epoch. Of course, for maximum accuracy, you would want to run a little longer. The training job is using the cached version, which means the model has already been pre-compiled. It's all stored in /var/tmp/neuron-compile-cache. You may want to save those artifacts and put them somewhere, maybe in S3, and reuse them as you go, because this will save you a ton of time. Every time you change the model or the batch size, compilation will happen again. So if you get tired of that, back up your compile cache and find a way to bring it back for further jobs.
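One simple way to persist that cache between instances is to sync the directory to S3. Here is a minimal sketch assuming the AWS CLI is installed on the instance; the bucket name is a placeholder, and this is just one of several ways to do it.

```python
# Minimal sketch: back up / restore the Neuron compile cache via S3.
import subprocess

CACHE_DIR = "/var/tmp/neuron-compile-cache"
S3_PREFIX = "s3://my-bucket/neuron-compile-cache"  # placeholder bucket

def backup_cache():
    # Push the local compile cache to S3 after a successful compilation.
    subprocess.run(["aws", "s3", "sync", CACHE_DIR, S3_PREFIX], check=True)

def restore_cache():
    # Pull the cache back before launching a new training job on a fresh instance.
    subprocess.run(["aws", "s3", "sync", S3_PREFIX, CACHE_DIR], check=True)

if __name__ == "__main__":
    backup_cache()
```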
I just waited for a couple of minutes, and now the Trainium instance is actually training. We can see all 32 cores running pretty much at the max. Device memory is around 27 gigabytes out of 32. I think the batch size here is 4; I can double-check in the code. I couldn't increase it without getting out-of-memory errors. How are we doing on training speed? We're looking at, let's call it, 55 minutes of training time for a single epoch versus five hours on the right. That's a 5x speedup in favor of Trainium on a really large training job. Performance is one thing, but we need to look at cost as well. The Trainium instance I'm using costs $21.50 per hour on demand, and the G5 instance is a little over $16 per hour. So Trainium is about 30% more expensive, but it is 5x faster, so the cost-performance improvement is huge. I would really encourage everyone to run their own numbers and not just look at training times. Sometimes people ping me and say, "I could train this fast on this instance." Sure, but how much did that cost? As a developer, sometimes you don't care because you're not paying the bills, but your boss and your company are. From an enterprise perspective, cost-performance is what you want to look at, and in this case, it's very impressive.
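Here's that back-of-the-envelope math in a few lines of Python, using the approximate on-demand prices and epoch times discussed above (the g5.48xlarge price is approximated at $16.29 per hour; check current pricing for your region).

```python
# Back-of-the-envelope cost-per-epoch comparison with approximate numbers.
trn1_price_per_hour = 21.50  # trn1.32xlarge, on demand
g5_price_per_hour = 16.29    # g5.48xlarge, on demand (approximate)

trn1_epoch_hours = 55 / 60   # ~55 minutes per epoch on Trainium
g5_epoch_hours = 5.0         # ~5 hours per epoch on the G5 instance

trn1_cost = trn1_price_per_hour * trn1_epoch_hours
g5_cost = g5_price_per_hour * g5_epoch_hours

print(f"Trainium cost per epoch: ${trn1_cost:.2f}")               # ~ $19.70
print(f"G5 cost per epoch:       ${g5_cost:.2f}")                 # ~ $81.45
print(f"Cost ratio (G5 / Trainium): {g5_cost / trn1_cost:.1f}x")  # ~4x cheaper per epoch
```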
Your mileage will vary if you try different models, datasets, or task types. I did see less of a difference for shorter sequence lengths. I actually tried 128 and 256, and the gap was smaller. Maybe Trainium is just a little more efficient at feeding data to the accelerators than the eight-GPU setup, and there might be a bottleneck there that could explain the gap growing at larger sequence lengths, but this would warrant a deeper investigation. Happy to continue the conversation in the comments. If you train on GPUs, and particularly if you train on G5, I highly recommend giving Trainium a shot. I think you could have a very nice surprise. Well, let's wrap it up for today. Go, Trainium! I'll see you soon with more videos. Until then, keep rocking.