Hi everybody, this is Julien from Arcee. In this video, we're going to continue exploring hardware acceleration with AWS Inferentia 2. We'll zoom in on image generation with Stable Diffusion and Stable Diffusion XL, which we released three days ago. Let's get started.
If you're not familiar with AWS Inferentia 2, I recommend that you go and read a little bit about it. It's a custom accelerator designed by AWS to accelerate inference for large models like transformers and diffusers. You may also want to look at the Neuron SDK, which is the SDK for Inferentia and for Trainium, the custom accelerator for training. It has a lot of good information on supported model architectures, the roadmap for future models, which is actually public, performance tips, and more. So that's a good read.
In this demo, I will be using our own open-source library for hardware acceleration, Optimum, and specifically Optimum Neuron, the library for Trainium and Inferentia. In its documentation, you'll find the release information, the models we support, and more. All right, so let's get going.
Here I've launched an Inf2 instance on AWS using the Neuron AMI built by AWS, which comes with the SDK, the tooling, and so on. I've only installed Optimum Neuron on top of that. I'm using a larger instance because I'm running all kinds of benchmarks, but you could absolutely do this with inf2.xlarge, which is the smallest one, and we'll look at the price. The workflow is very simple. First, we need to download a model from the Hub and compile it, meaning convert it, for Inferentia 2. You can do this with code or with the CLI; I'm going to show you both.
Let's look first at Stable Diffusion, where I use the code approach. As you will see, this is really very simple. If you are working with the diffusers library, you would simply create a Stable Diffusion pipeline and work with that. If you want to try out Inferentia, you install Optimum Neuron and replace your Stable Diffusion pipeline with a NeuronStableDiffusionPipeline, just like this. Point it at the model, define the batch size and the shape of the images you want to generate. If you need multiple image sizes or multiple batch sizes, then obviously you can build different pipelines for that. Next, I simply download the model, let Optimum Neuron compile it for Inferentia 2, and save it to local disk. You only need to do the compilation once, which is why I'm saving it. This takes about a minute for Stable Diffusion. I've already done it, and we can see the output here.
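For reference, here is a minimal sketch of that compilation step, assuming the standard Optimum Neuron API; the checkpoint name and output directory are just examples, so adapt them to your setup.

```python
from optimum.neuron import NeuronStableDiffusionPipeline

# Example checkpoint; any Stable Diffusion model from the Hub should work.
model_id = "stabilityai/stable-diffusion-2-1"

# export=True downloads the model and compiles it for Inferentia 2
# with fixed input shapes (batch size and image resolution).
pipeline = NeuronStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    height=512,
    width=512,
)

# Save the compiled artifacts so the roughly one-minute compilation only runs once.
pipeline.save_pretrained("sd_neuron/")
```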
So now we have a model that can run on Inferentia. Again, this is the code version; I'll show you the CLI with Stable Diffusion XL. So how about generation? Generation is simple too. We create a NeuronStableDiffusionPipeline, this time from the compiled model. We select the Neuron devices we want to run this on. As you can see, this instance has 12 Neuron devices, so 24 cores. I could run 12 pipelines if I wanted, for different models, different image sizes, and so on. Here I'm just going to use one. Then I prompt the model, warm up once, run 10 iterations, and measure the time. Let's see how we do.
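Here is a rough sketch of that generation loop, assuming the compiled model was saved to a local `sd_neuron/` directory as above; pinning cores through the `NEURON_RT_VISIBLE_CORES` environment variable is just one way to choose which devices the pipeline runs on.

```python
import os
import time

# Pin this process to the first two NeuronCores (one Inferentia 2 device).
# This must be set before the Neuron runtime is initialized.
os.environ["NEURON_RT_VISIBLE_CORES"] = "0,1"

from optimum.neuron import NeuronStableDiffusionPipeline

# Load the compiled model from local disk: no recompilation needed.
pipeline = NeuronStableDiffusionPipeline.from_pretrained("sd_neuron/")

prompt = "a photo of an astronaut riding a horse on Mars"

# Warm-up: the first call is always slower.
pipeline(prompt=prompt)

# Time 10 generations and report the average.
start = time.time()
for _ in range(10):
    image = pipeline(prompt=prompt).images[0]
print(f"Average generation time: {(time.time() - start) / 10:.2f} s")
```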
We're going to see the model being loaded on the first device here, on the first two cores. And we should see some blinking lights pretty quickly. Yes, that's the warm-up. If you press F in neuron-top, you can switch between utilization and teraflops. That'll keep you busy until the images are generated. That should be two-point-something seconds. Let's see how we do. 2.46. So 2.46 seconds is a good time for Stable Diffusion generation. If we go and look at this nice blog post, and as usual I'll put all the links in the video description, we'll see that it reports 2.42 seconds per image, so it's good that we can actually reproduce the blog post. When it comes to cost, as mentioned, you could run this on the smallest Inf2 instance, which would cost you 76 cents an hour, but I routinely launch those as spot instances, where I pay about 26 cents an hour. So the price per image that you see here can easily be divided by three. If you generate an image every 2.4 to 2.5 seconds, you can generate around 1,400 to 1,500 images per hour at a very, very low price. This is certainly more competitive than what you could do on GPU instances, but I'll let you check and come to your own conclusions.
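If you want to redo the back-of-the-envelope math, here it is with the numbers quoted above; actual prices vary by region and over time.

```python
# Illustrative cost estimate using the figures from the video.
seconds_per_image = 2.46
images_per_hour = 3600 / seconds_per_image   # roughly 1,460 images

on_demand_price = 0.76   # USD per hour, inf2.xlarge on-demand
spot_price = 0.26        # USD per hour, typical spot price I see

print(f"Images per hour:   {images_per_hour:.0f}")
print(f"On-demand $/image: {on_demand_price / images_per_hour:.5f}")
print(f"Spot $/image:      {spot_price / images_per_hour:.5f}")
```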
Okay, this is Stable Diffusion. How about we try Stable Diffusion XL? This time I will show you how to use our CLI to export the model, if that's your preferred way. We just use the Optimum CLI, which is part of the Optimum library: export to Neuron, the model name, the task type, and then the batch size, height, width, and where you want to save the model. We also force BF16, which is natively supported by Inferentia. Exporting is very simple. In the case of Stable Diffusion XL, it takes about an hour, so make sure you save the model, because you do not want to run this again and again for no reason. When it comes to generating, it is almost the same. The only difference is that you need to use the Stable Diffusion XL pipeline, not the Stable Diffusion pipeline. And of course, I made that mistake. Load the model from disk, specify the device IDs. Let's try two cores first, and we'll see what we can do with more. The rest is the same: warm up, predict 10 times. All right, let's run this thing.
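To give you an idea, the export command looks something like this; the model ID, output directory, and exact flags are illustrative and may differ between Optimum Neuron versions, so check `optimum-cli export neuron --help` before running it.

```bash
optimum-cli export neuron \
  --model stabilityai/stable-diffusion-xl-base-1.0 \
  --batch_size 1 \
  --height 1024 \
  --width 1024 \
  --auto_cast matmul \
  --auto_cast_type bf16 \
  sd_xl_neuron/
```

On the generation side, the only change from the earlier sketch is the pipeline class, for example:

```python
from optimum.neuron import NeuronStableDiffusionXLPipeline

# Load the compiled SDXL model; warm-up and timing are the same as before.
pipeline = NeuronStableDiffusionXLPipeline.from_pretrained("sd_xl_neuron/")
```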
This will take a little longer because the model is bigger, so it needs maybe 30 seconds to load, and then it's going to start predicting. OK, I'll pause the video and I'll see you in a minute or so. All right, so it looks like we're predicting in 13, 14 seconds, something like that. We can see the two cores are very, very busy. This is a big model, so maybe adding more cores will help. Let's see. Okay, one more iteration. Two more. I lost count. Okay, 15.6 seconds, which is good, but let's see if we can do better with a few more devices. OK, let's run this again and see. The reference time is 15.6. So it looks like we're a little faster. Yes, 13 seconds. The cores were not super busy, but if I read this message, I think it tells me why: I'm using a batch size of 1, and that's harder to split across different cores. Working with larger batch sizes would probably give me even better performance. But still, I can generate an image in 13 seconds, and this is good. As mentioned before, this is very, very cost-effective.
If you want to dive deeper, I would recommend checking out the data parallel inference page in the Neuron SDK documentation. It goes into using multiple cores, working with batch sizes, dynamic batching, and some other techniques, which you can apply with Optimum Neuron to make the most out of the chip. So there you go: fast, cost-effective, interactive, simple, what's not to like? Give it a try and let us know what you think. I hope this was useful. There's certainly more content coming. Maybe I'll see you on the road; I'm touring EMEA right now, so come and say hi if you're around. And until next time, keep rocking.