Transformer training shootout, part 2: AWS Trainium vs. NVIDIA V100

May 17, 2023
In this video, I compare the cost/performance of AWS Trainium with the NVIDIA V100 GPU. I first launch a trn1.32xlarge instance (16 Trainium chips) and a p3dn.24xlarge instance (8 V100 GPUs). Then, I run three benchmarks: language pretraining with GPT-2, token classification with BERT Large, and image classification with the Vision Transformer. The results? Trainium is 2 to 5x faster and 3 to 8x cheaper!

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️

- Amazon EC2 Trn1: https://aws.amazon.com/ec2/instance-types/trn1/
- Amazon EC2 P3: https://aws.amazon.com/ec2/instance-types/p3/
- Training commands: https://gist.github.com/juliensimon/da64fc6d6a2fe39bd8c5af12389a227e
- Trainium with Optimum Neuron: https://youtu.be/FmjTWags__Q
- Trn1 vs. G5 benchmark: https://youtu.be/2SquGhkld7k

Transcript

Hi everybody, this is Julien from Arcee. In this video, we're going to run a training benchmark on several transformer models using AWS instances. In the left corner, I'm going to use the p3dn.24xlarge instance, which comes with eight NVIDIA V100 GPUs. In many AWS regions, this is still the largest GPU instance you can get. In the right corner, I'm going to use the trn1.32xlarge instance, which, as the name implies, comes with Trainium, a custom AI accelerator designed by AWS. This one comes with 16 Trainium chips and is the largest Trainium instance. I'm going to run three different benchmarks. First, we'll try language pre-training with GPT-2. Then we'll try token classification with BERT-Large. Finally, we'll try image classification with the Vision Transformer. We'll look at training times and costs, and I'll summarize everything at the end of the video. Let's get to work. On the left, we have the GPU instance with the V100s, running PyTorch 2.0. I'll be using torch.compile and FP16 training, and I'll do all of this with the built-in examples from the Transformers repository on GitHub. On the right side, we have the Trainium instance, and here I'll run the same examples adapted for our Optimum Neuron library, which I featured in a previous video. It's a one-line change to adapt your Transformers code to Optimum Neuron, which is based on the AWS Neuron SDK, and I'm using PyTorch 1.13.1. Let's run the first benchmark: GPT-2 training on the WikiText dataset. Let's launch those two jobs on the two instances. As you probably know, on Trainium, we usually need to compile the model, which can take 5-10 minutes. However, in Optimum Neuron, we implemented a model cache. Cached artifacts are saved on the Hugging Face Hub, meaning you compile the model once, save it there, and the next time you run the job, we fetch the compiled model automatically, saving you those 10 minutes.
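That one-line change is the Trainer import: Optimum Neuron ships a drop-in NeuronTrainer, and the rest of the training script stays the same. A minimal sketch (assuming optimum-neuron is installed on the Trainium instance; the fallback branch is only there so the snippet runs on any machine):

```python
# One-line adaptation of a Transformers training script for Trainium:
# swap the Trainer import, keep everything else unchanged.
try:
    # On a trn1 instance with optimum-neuron installed
    from optimum.neuron import NeuronTrainer as Trainer
except ImportError:
    try:
        # On a GPU box: stock Transformers
        from transformers import Trainer
    except ImportError:
        Trainer = None  # neither library installed; illustration only

# From here on, the script builds TrainingArguments, datasets, and the
# model exactly as in the stock Transformers examples, then calls
# Trainer(...).train() as usual.
```

This is why the built-in example scripts port over with so little effort: the trainer API surface is the same on both sides.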
Let's wait a minute for the data to load, and then we'll see the performance of the two instances. Both jobs have started, and the GPUs and Neuron cores are busy. For timing, the GPU instance will probably run those 10 epochs in about 5 hours and 20 minutes. Trainium is going to do it in 2 hours and 20-something minutes. There's a bit of jitter, but generally it's going to be 5 hours and 20 minutes against 2 hours and 20-something minutes: more than 2x faster on Trainium, which is very significant. Now let's try the same thing with token classification. Benchmark number two: fine-tuning BERT-Large on the CoNLL-2003 dataset for token classification. Same settings: FP16 and model compilation for the GPU, and default settings for Trainium. Let's wait a minute for those jobs to start, and then we'll look at performance. Both jobs are running. On the GPU side, we're looking at around 15 minutes. On the Trainium side, we're looking at about 8 minutes and 30 seconds, so just under a 2x speedup here. We'll see the final numbers in the summary, but that's still a very good speedup for Trainium. On to benchmark number three with the Vision Transformer. In this example, we are fine-tuning the Vision Transformer on the Food 101 dataset, which has about 100,000 images, with 70K used for training. Same settings: FP16, torch.compile, and everything. Let's launch this one and the other one. Let's wait a minute for the jobs to start, and we'll see the numbers. We are training. On the GPU side, we're looking at something like 55 minutes for those 10 epochs. On the Trainium side, the chip is crunching away, and it looks like we're going to be under 10 minutes. That's very significant, probably 5x faster, maybe more: a very good win for Trainium. Let's quickly look at the summary and pricing. I ran all those jobs to completion, so these are the final times.
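The built-in example scripts take all hyperparameters on the command line. As an illustration only (the exact commands I ran are in the gist linked in the description; the model checkpoint and batch size below are placeholders, not necessarily the video's settings), a Vision Transformer fine-tuning run on Food 101 looks roughly like this:

```shell
# Illustrative sketch: flags follow the stock Transformers
# run_image_classification.py example; see the linked gist for the
# actual benchmark commands.
python run_image_classification.py \
  --model_name_or_path google/vit-base-patch16-224-in21k \
  --dataset_name food101 \
  --do_train \
  --fp16 \
  --num_train_epochs 10 \
  --per_device_train_batch_size 32 \
  --output_dir ./vit-food101
```

On the Trainium side, the same invocation works once the script's Trainer import has been switched to Optimum Neuron (and without the `--fp16` flag, since the Neuron defaults were used there).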
For the Trn1 instance, the cost is $21.50 an hour (US East 1 on-demand prices). For the P3 instance, it's $31.22 an hour. Trainium wins across the board on all three benchmarks, for both training time and cost. For GPT-2 training, Trainium is more than twice as fast and almost three times more cost-effective. For BERT, we see a 1.76x speedup and a 2.5x cost improvement. For the Vision Transformer, we see a 5.5x speedup and an 8x cost improvement. Again, this is just me testing those three models with those particular datasets, so your results may vary. I'll put all the commands in the video description, so feel free to run your own tests and reproduce the results. I highly encourage you to try out Trainium if it's available in your region and if the models you work with are supported by the Neuron SDK. The next step would be to run the same benchmark on a P4 instance, which is on my to-do list. That's it for today. I hope this was informative and fun, and that you want to give Trainium a try now. Until next time, keep rocking!
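The cost ratios follow directly from the run times and the hourly prices: per-job cost is just the hourly rate times the hours used. A quick back-of-the-envelope check for the GPT-2 benchmark, using the approximate times quoted above (the exact final times differ slightly, so the ratios here are indicative):

```python
# Hourly on-demand prices (US East 1) quoted in the video
P3DN_RATE = 31.22   # p3dn.24xlarge, USD/hour
TRN1_RATE = 21.50   # trn1.32xlarge, USD/hour

def job_cost(rate_per_hour, minutes):
    """Cost of a training job billed by instance time."""
    return rate_per_hour * minutes / 60

# Approximate GPT-2 pretraining times from the video
gpu_minutes = 5 * 60 + 20   # ~5h20m on the V100 instance
trn_minutes = 2 * 60 + 20   # ~2h20m on the Trainium instance

speedup = gpu_minutes / trn_minutes
savings = job_cost(P3DN_RATE, gpu_minutes) / job_cost(TRN1_RATE, trn_minutes)
print(f"speedup: {speedup:.1f}x, cost improvement: {savings:.1f}x")
```

The same two-line calculation reproduces the BERT and Vision Transformer ratios from their respective run times.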

Tags

AWS, Trainium, Transformer Models

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.